Operations on BlastP and ClusterW

範例檔案 : 20160708_blastp_clustalw.rar

Dataset

input : swissprot (2013/11/11 captured from ExPASy)
description :
1. get one amino acid sequence ABL1_HUMAN from the database
2. search the database through BLASTP to get
  - identity > 50%
  - e-value < 1e-10
  - alignment length >= 80
  - find the best result on one specie
output :
1. find the result from the description and try to point out several nucleotide positions which are highly conversed among species but not in Drosophila.
2. the nucleotide position follow the rules: the position in Drosophila would not be a gap, the positions among several other species would be not gaps and must highly conserved

Database preparation

database constructed by blast suits:

$ makeblastdb -in ./swissprot -dbtype prot -parse_seqids

get main sequences from database for protein blast:

$ blastdbcmd -entry ABL1_HUMAN -db ./swissprot > ABL1.fa

blastp in the database and generate the output file(-outfmt 6) blastpRes(file name)

$ blastp -query ./ABL1.fa -db ./swissprot -outfmt 6 -max_target_seqs 434882 > blastpRes

the output in format 6 : E.g. refer to the following result

gi|85681908|sp|P00519.4|ABL1_HUMAN gi|85681908|sp|P00519.4|ABL1_HUMAN  100.00 1130  0  0  1  1130  1  1130  0.0  2323
gi|85681908|sp|P00519.4|ABL1_HUMAN gi|59802613|sp|P00520.3|ABL1_MOUSE  87.75  1135  122  11  1  1130  1  1123  0.0  1920

column	description	example of 2
1	Query label	ABL1_HUMAN
2	Target (database sequence) label	ABL1_MOUSE
3	Percent identity	87.75
4	Alignment length	1135
5	Number of mismatches	122
6	Number of gap opens	11
7	1-based position in the beginning of query.	1
8	1-based position in the end of query.	1130
9	1-based position in the beginning of target	1
10	1-based position in the end of target.	1123
11	E-value calculation through Karlin-Altschul statistics	0.0
12	Bit score calculation through Karlin-Altschul statistics	1920

Stage I

extract the subject ID whose identify > 50%, e-value < 1e-10 and alignment length >= 80
subject IDs with the same species should be combined into one and keep the higher identity
refer to the Perl script file named procedureI.pl

#!/usr/bin/perl -w

use strict;

# global variable declarations
my $inputFileName = "";
my %specConser;          # save the complete search id and identity
my @temp;
my $flag;
my $specieName;

# data input
# there are two way to get the input file name
# one is from the parameter and
# the other is changing the content of variable
if(scalar(@ARGV) == 0 and length($inputFileName) == 0) { 
    die("No input file assigned.\n"); 
}
else { 
    $inputFileName = $ARGV[0];
}

if(open(fin,$inputFileName)) {
    foreach my $line (<fin>) {
        chomp($line);
        @temp = split("\t",$line);
        # it is possible to direct use scientific notations to calculate
        if($temp[2] > 50 and $temp[3] >= 80 and $temp[10] < 1e-10) {
            #print $line."\n";
            $temp[1] =~ m/(.*)_(.*)/;

            # only specie name
            #print $2."\n"; 
            $specieName = $2;

            $flag = 0;
            foreach my $ele (keys(%specConser)) {
                $ele =~ m/(.*)_(.*)/;

                # the same specie in the hash
                # compare the identity and conserve the hightest one
                if($specieName eq $2) { 
                    # hit happen and check which one is larger and than break of the loop
                    if ($specConser{$ele} < $temp[2]) {
                        # change value and store the higher one into the hash
                        delete($specConser{$ele});
                        $specConser{$temp[1]} = $temp[2];
                    }
                    $flag = 1;
                    last;
                }
            }
            # species is not in the hash
            if($flag == 0) { $specConser{$temp[1]} = $temp[2]; }
        }
    }
}
else {
    die("Input file error.\n"); 
}

foreach my $keyValue (keys(%specConser)) {
    print $keyValue,"\n";
}

get the final subject ID and its identity (saved as a file sid)

$ perl ./procedureI.pl ./ blastpRes > sid

the following is the result:

Stage II

when the file, named sid, is finished search the database to get the sequences of all subject IDs in the file (saved as a file aaseq.fa)

$ blastdbcmd -db ./swissprot -entry_batch ./sid > aaseq.fa

using clustalw to multiple sequence alignment among this amino acid sequence data in fasta format, not in base difference format (data saved as aaseq.fasta)

$ clustalw2 ./aaseq.fa –output=FASTA

partial result of ./aaseq.fasta

Stage III

modify the file foxp2.pl to make sure that the nucleotides highly conserved among each species but changed in Drosophila
Refer to the perl script file name foxp2_modify.pl

#!/usr/bin/perl

use strict;

# file check
open(IN, $ARGV[0]) or die "Could not open file\n";

# global variable declaration
my %nondrome;           # save the id and sequence of other species
my $id = undef;         # temp usage
my $seq;                # temp usage
my $id_drome;           # Drosophila id
my $dromeseq;           # Drosophila sequence

# data input
while (<IN>) {
    chomp;
    if (/^>/) {
        if ($id) {
            if ( $id !~ /DROME/ ) {
                $nondrome{$id} = $seq;
            }
            else {
                $dromeseq = $seq;
                $id_drome  = $id;
            }
        }
        $id = $_;
        $id =~ s/>//;
        $seq = '';
    }
    else {
        $seq .= $_;
    }
}
$nondrome{$id} = $seq;
close IN;

# start to find
my $totalSpeciesCount = scalar(keys(%nondrome));                # record how many the species are
my %compEachBase;                                               # record each possible base and record them
my $flag = 0;                                                   # to survey the reocrding hash table
my $unique = 1;
my $sub_drome;
my $countTotal = 1;                                             # record all the result

for ( my $i = 0 ; $i < length($dromeseq) ; $i++ ) {
    $unique = 1;
    $sub_drome = substr( $dromeseq, $i, 1 );                    # each time to check one base of Drosophila
    if ($sub_drome eq '-') { next; }                            # the base of Drosophila is a gap

    %compEachBase = ();
    foreach my $key ( keys %nondrome ) {                        # check each species
        my $sub_amino = substr( $nondrome{$key}, $i, 1 );       # get the same base but on different species

        if ($sub_amino eq $sub_drome or $sub_amino eq '-') {
            $unique = -1;                                       # unique is a flag to reference further
            last;
        }

        # check whether this amino acid is in the hash
        $flag = 0;
        foreach my $ele (keys(%compEachBase)) {
            if($sub_amino eq $ele) { 
                $flag = 1;  
                last;
            }
        }

        # add or create
        if($flag == 1) { $compEachBase{$sub_amino} += 1; }
        else { $compEachBase{$sub_amino} = 1; }     
    }

    if ($unique == -1) { next; }
    elsif($unique == 1){
        $flag = 0;
        foreach my $ele (keys(%compEachBase)) {
            if($compEachBase{$ele} == $totalSpeciesCount) {
                my $pos = $i + 1;
                print "No.".($countTotal++)." -> At position $pos\n";
                print "$id_drome\t$sub_drome\n";
                foreach my $key ( keys %nondrome ) {
                    my $aa = substr( $nondrome{$key}, $i, 1 );
                    print "$key\t$aa\n";
                }                
                last;
            }
        }
    }
}

print "\nThere are ".($countTotal-1)." positions\nwhich are highly conserved among species but not in Drosophila.\n";

There are 16 positions which are highly conserved among species but not in Drosophila.
Generate the result that saves nucleotides highly conserved among species but not in Drosophila.

$ perl ./foxp2_modify.pl ./aaseq.fasta > ./totalPositions.txt

The following is the partial result (refer to the file, named totalPositions.txt)

Operations on BlastP and ClusterW

Operations on BlastP and ClusterW

Dataset

Database preparation

Stage I

Stage II

Stage III

results matching ""

No results matching ""