便捷工具

提取最长蛋白序列(转录本一样适用)

2023-01-04  本文已影响0人  余绕
对于蛋白的fa文件,由于是每个转录本对应一个蛋白序列。由于后期注释需要以基因为单位,因此需要对fa文件进行处理,计算每一个转录本的蛋白序列长度,然后提取最长的蛋白序列,该序列即为该基因的蛋白序列。具体代码(关键实现代码是if 判断语句)如下:
open FA, "$ARGV[0]";
$/=">";
<FA>;
while(<FA>){
        chomp;
            my($id,$seq)=(split /\n/,$_,2)[0,1];
            $seq=~s/\n//g;
                   ($ID1,$ID2)=split /\./,$id;   
           #print"$ID1\n$seq\n"; 
                  $length=length($seq);
         if(exists  $hash{$ID1}){

            if($length > length($hash{$ID1})) {
             $hash{$ID1}=$seq;           
              }  
               else{
                     next;
                  }
          }
          else{
          
               $hash{$ID1}=$seq;
          }
      }

foreach (keys %hash){

#$len=length($hash{$_});

print">$_\n$hash{$_}\n";

}     

运行代码:

perl longgest_protein_ID.pl  SWO.v3.0.protein.fa > longgest_protein.fa

其中SWO.v3.0.protein.fa如下:

head -20  SWO.v3.0.protein.fa  
>Cs_ont_5g000040.3
MTEFMDLEAQDGVRMPWNVIPGTKQEASNCVVPVSAIYTPIKAFPVNNNSMPILPYAPLRCRTCRSILNPFSIVDFAAKI
WICPFCFQRNHFPPHYASITDDNLPAELFPQYTTIEYEPPGPGEKSSVPPVFMFVVDTCIIEEEMSFLKSALSQAIDLLP
DNSLVGLITFGTLVQVHELGGFGQIIPKTYVFKGSKDVSKDQLLEQLNFFIKKPKPSTGVIAGVRDGLSSDTIARFLVPA
FDCEFTLNSVLEELQKDPWPVPPDQRSTRCTGTALSIAASLLGACVPGSGARILAFVGGPSTEGPAAIVSKNLSEPIRSH
KDLDKDSAPHYHKAVKFYDALSKQLVHQGHVLDLFACALDQVGVAELKVAVEKTGGLVVLSDSFGHAVFKDSVRRVFHSG
DYDLGLSSNGIFEINCSKDIKVQGIIGPCASLEKKGPLCSDAVVGQGNTSAWKMCGLDKATSLCLVFEIVKKEIPDATLQ
STNNQFYFQFLTYYQHNCGQMRLRVTTLSRRWVAGPGSVQDLIAGFDQEAAAVVMARLVSFKMEIEAEFDPIRWLDKALI
HMCSRFGDYQKDSPSSFSLSPRFSIFPQFMFHLRRSQFVQVFNNSPDETAYFRMILNRENVTNSVVMIQPSLISYSFHSG
PEPALLDVAAIAADRILLLDSYFTVVIFHGATIAQWRKAGYHNQPEHQAFAQLLQAPHDDAGVIMKERFPVPRLVICDQH
GSQVGSYHNVQQMGGSLWMIVVCYLLSPGKDILGHSPMDVCT*
>Cs_ont_5g044840.1
MESLALPCLNSFTSSRASFHKEQSNKLLAKARYLNSSKPASLRLTPYYIAPSFNLGGIRMQAGGEEYELKQMRDMAAAKK
RWDALIRDGKVKVLTPREAGYAVQLSSKTLLDVRPSTERKKAWIKGSIWIPIFDIDDTFDAGSLPQKVTNFVMGGWWSGV
PTLSYNKQFLSKVEEKLPKDTDLIVACQKGLRSLAACELLYNAGYRNLFWVQGGLEAAEEEDLVREGPQPLKFAGIGGLS
EFLGWTGQQRAQAAREGWGYRLLYSARLVGVFLAADALIIGAQQVGRYIQDIRSH*
>Cs_ont_4g025730.1
MSVSMRDLDSAFQGAGQKAGIEIWRIENFKPVLVPKSSHGKFFTGDSYVILKTTASKSGALRHDIHYWLGKDTSQDEAGT
AAIKTVELDAALGGRAVQYREVQGHETEKFLSYFKPCIIPQEGGIASGFKRAEAEEHKIRLFVCRGKHVIHVKEVPFSRS
SLNHDDIFILDTQSKIFQFNGSNSSIQERAKALEVVQYIKDTYHDGKCEVAVVEDGKLMADAEAGEFWGFFGGFAPLPRK

上一篇下一篇

猜你喜欢

热点阅读