Operations on BlastP and ClusterW
- 範例檔案 : 20160708_blastp_clustalw.rar
Dataset
input : swissprot (2013/11/11 captured from ExPASy)
description :
- get one amino acid sequence ABL1_HUMAN from the database
- search the database through BLASTP to get
- identity > 50%
- e-value < 1e-10
- alignment length >= 80
- find the best result on one specie
output :
- find the result from the description and try to point out several nucleotide positions which are highly conversed among species but not in Drosophila.
- the nucleotide position follow the rules: the position in Drosophila would not be a gap, the positions among several other species would be not gaps and must highly conserved
Database preparation
- database constructed by blast suits:
$ makeblastdb -in ./swissprot -dbtype prot -parse_seqids
- get main sequences from database for protein blast:
$ blastdbcmd -entry ABL1_HUMAN -db ./swissprot > ABL1.fa
- blastp in the database and generate the output file(-outfmt 6) blastpRes(file name)
$ blastp -query ./ABL1.fa -db ./swissprot -outfmt 6 -max_target_seqs 434882 > blastpRes
- the output in format 6 : E.g. refer to the following result
gi|85681908|sp|P00519.4|ABL1_HUMAN gi|85681908|sp|P00519.4|ABL1_HUMAN 100.00 1130 0 0 1 1130 1 1130 0.0 2323
gi|85681908|sp|P00519.4|ABL1_HUMAN gi|59802613|sp|P00520.3|ABL1_MOUSE 87.75 1135 122 11 1 1130 1 1123 0.0 1920
column | description | example of 2 |
---|---|---|
1 | Query label | ABL1_HUMAN |
2 | Target (database sequence) label | ABL1_MOUSE |
3 | Percent identity | 87.75 |
4 | Alignment length | 1135 |
5 | Number of mismatches | 122 |
6 | Number of gap opens | 11 |
7 | 1-based position in the beginning of query. | 1 |
8 | 1-based position in the end of query. | 1130 |
9 | 1-based position in the beginning of target | 1 |
10 | 1-based position in the end of target. | 1123 |
11 | E-value calculation through Karlin-Altschul statistics | 0.0 |
12 | Bit score calculation through Karlin-Altschul statistics | 1920 |
Stage I
- extract the subject ID whose identify > 50%, e-value < 1e-10 and alignment length >= 80
- subject IDs with the same species should be combined into one and keep the higher identity
- refer to the Perl script file named procedureI.pl
#!/usr/bin/perl -w
use strict;
# global variable declarations
my $inputFileName = "";
my %specConser; # save the complete search id and identity
my @temp;
my $flag;
my $specieName;
# data input
# there are two way to get the input file name
# one is from the parameter and
# the other is changing the content of variable
if(scalar(@ARGV) == 0 and length($inputFileName) == 0) {
die("No input file assigned.\n");
}
else {
$inputFileName = $ARGV[0];
}
if(open(fin,$inputFileName)) {
foreach my $line (<fin>) {
chomp($line);
@temp = split("\t",$line);
# it is possible to direct use scientific notations to calculate
if($temp[2] > 50 and $temp[3] >= 80 and $temp[10] < 1e-10) {
#print $line."\n";
$temp[1] =~ m/(.*)_(.*)/;
# only specie name
#print $2."\n";
$specieName = $2;
$flag = 0;
foreach my $ele (keys(%specConser)) {
$ele =~ m/(.*)_(.*)/;
# the same specie in the hash
# compare the identity and conserve the hightest one
if($specieName eq $2) {
# hit happen and check which one is larger and than break of the loop
if ($specConser{$ele} < $temp[2]) {
# change value and store the higher one into the hash
delete($specConser{$ele});
$specConser{$temp[1]} = $temp[2];
}
$flag = 1;
last;
}
}
# species is not in the hash
if($flag == 0) { $specConser{$temp[1]} = $temp[2]; }
}
}
}
else {
die("Input file error.\n");
}
foreach my $keyValue (keys(%specConser)) {
print $keyValue,"\n";
}
- get the final subject ID and its identity (saved as a file sid)
$ perl ./procedureI.pl ./ blastpRes > sid
- the following is the result:
Stage II
- when the file, named sid, is finished search the database to get the sequences of all subject IDs in the file (saved as a file aaseq.fa)
$ blastdbcmd -db ./swissprot -entry_batch ./sid > aaseq.fa
- using clustalw to multiple sequence alignment among this amino acid sequence data in fasta format, not in base difference format (data saved as aaseq.fasta)
$ clustalw2 ./aaseq.fa –output=FASTA
- partial result of ./aaseq.fasta
Stage III
modify the file foxp2.pl to make sure that the nucleotides highly conserved among each species but changed in Drosophila
Refer to the perl script file name foxp2_modify.pl
#!/usr/bin/perl
use strict;
# file check
open(IN, $ARGV[0]) or die "Could not open file\n";
# global variable declaration
my %nondrome; # save the id and sequence of other species
my $id = undef; # temp usage
my $seq; # temp usage
my $id_drome; # Drosophila id
my $dromeseq; # Drosophila sequence
# data input
while (<IN>) {
chomp;
if (/^>/) {
if ($id) {
if ( $id !~ /DROME/ ) {
$nondrome{$id} = $seq;
}
else {
$dromeseq = $seq;
$id_drome = $id;
}
}
$id = $_;
$id =~ s/>//;
$seq = '';
}
else {
$seq .= $_;
}
}
$nondrome{$id} = $seq;
close IN;
# start to find
my $totalSpeciesCount = scalar(keys(%nondrome)); # record how many the species are
my %compEachBase; # record each possible base and record them
my $flag = 0; # to survey the reocrding hash table
my $unique = 1;
my $sub_drome;
my $countTotal = 1; # record all the result
for ( my $i = 0 ; $i < length($dromeseq) ; $i++ ) {
$unique = 1;
$sub_drome = substr( $dromeseq, $i, 1 ); # each time to check one base of Drosophila
if ($sub_drome eq '-') { next; } # the base of Drosophila is a gap
%compEachBase = ();
foreach my $key ( keys %nondrome ) { # check each species
my $sub_amino = substr( $nondrome{$key}, $i, 1 ); # get the same base but on different species
if ($sub_amino eq $sub_drome or $sub_amino eq '-') {
$unique = -1; # unique is a flag to reference further
last;
}
# check whether this amino acid is in the hash
$flag = 0;
foreach my $ele (keys(%compEachBase)) {
if($sub_amino eq $ele) {
$flag = 1;
last;
}
}
# add or create
if($flag == 1) { $compEachBase{$sub_amino} += 1; }
else { $compEachBase{$sub_amino} = 1; }
}
if ($unique == -1) { next; }
elsif($unique == 1){
$flag = 0;
foreach my $ele (keys(%compEachBase)) {
if($compEachBase{$ele} == $totalSpeciesCount) {
my $pos = $i + 1;
print "No.".($countTotal++)." -> At position $pos\n";
print "$id_drome\t$sub_drome\n";
foreach my $key ( keys %nondrome ) {
my $aa = substr( $nondrome{$key}, $i, 1 );
print "$key\t$aa\n";
}
last;
}
}
}
}
print "\nThere are ".($countTotal-1)." positions\nwhich are highly conserved among species but not in Drosophila.\n";
There are 16 positions which are highly conserved among species but not in Drosophila.
Generate the result that saves nucleotides highly conserved among species but not in Drosophila.
$ perl ./foxp2_modify.pl ./aaseq.fasta > ./totalPositions.txt
- The following is the partial result (refer to the file, named totalPositions.txt)