宏基因组基因组学

Kraken2 report完全分类注释

2021-06-17  本文已影响0人  胡童远

kraken2 手册:https://github.com/DerrickWood/kraken2/wiki/Manual

1 kraken2 report是这样子的

想把各分类都注释上去呢

2 python3代码

思路:
(1) 根据第四列判断domain phylum class order family genus species,将分类信息赋值给变量,不考虑含数字的分类。
(2) 根据条件将分类信息加到第六列输出即可。

(U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies

#!/usr/bin/env python3
import os, sys, re
ms, infile_name, outfile_name = sys.argv

with open(infile_name, 'r') as infile:
    with open(outfile_name, 'w') as o:
        Domain = ""
        for line in infile:
            line = line.strip()
            lines = re.split(r'\t', line)
            lines[5] = re.split(r'^[ ]*', lines[5])[1]
            if lines[3] == "D":
                Domain = lines[5]
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "P":
                Phylum = lines[5]
                lines[5] = Domain+";"+Phylum
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "C":
                Class = lines[5]
                lines[5] = Domain+";"+Phylum+";"+Class
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "O":
                Order = lines[5]
                lines[5] = Domain+";"+Phylum+";"+Class+";"+Order
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "F":
                Family = lines[5]
                lines[5] = Domain+";"+Phylum+";"+Class+";"+Order+";"+Family
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "G":
                Genus = lines[5]
                lines[5] = Domain+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "S":
                Species = lines[5]
                lines[5] = Domain+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus+";"+Species
                o.write("\t".join(lines))
                o.write("\n")

上面漏掉了kingdom的信息,尤其是真菌是Kingdom在结果中完全看不到真菌。奇怪的是Kraken注释,只有部分Domain有kingdom,修改代码。后续数据处理不变。

#!/usr/bin/env python3
import os, sys, re
ms, infile_name, outfile_name = sys.argv

with open(infile_name, 'r') as infile:
    with open(outfile_name, 'w') as o:
        Domain = ""
        for line in infile:
            line = line.strip()
            lines = re.split(r'\t', line)
            lines[5] = re.split(r'^[ ]*', lines[5])[1]
            if lines[3] == "D":
                Domain = lines[5]
                o.write("\t".join(lines))
                o.write("\n")
                Kingdom = "NG"  ## Kingdom非必须,Domain必须,在这里赋空值
            elif lines[3] == "K":
                Kingdom = lines[5]
                lines[5] = Domain+";"+Kingdom
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "P":
                Phylum = lines[5]
                lines[5] = Domain+";"+Kingdom+";"+Phylum
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "C":
                Class = lines[5]
                lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "O":
                Order = lines[5]
                lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "F":
                Family = lines[5]
                lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order+";"+Family
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "G":
                Genus = lines[5]
                lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus
                o.write("\t".join(lines))
                o.write("\n")
            elif lines[3] == "S":
                Species = lines[5]
                lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus+";"+Species
                o.write("\t".join(lines))
                o.write("\n")

3 运行脚本,查看结果

任意找一个样本的kraken2结果,包含所有物种物种注释结果,即使reads count是0。行数均为31030。

conda activate python3.7
python3 ../script/kraken_anno.py sample_report.tsv kraken_anno.txt

4 抽提种水平注释

cat kraken_anno.txt | awk -F"\t" '{if($4=="S") print $6}' > kraken_anno2.txt

行数为17091,该数据库中各物种总数。
为保证处理结果与kraken2的合并结果species完全一致,还需做:

' -> 空
[] -> 空

5 统计各门中基因组数量

四大域/界:古菌,细菌,真核,病毒
kraken2数据库的基因组统计,门水平

      1 Archaea Candidatus Geothermarchaeota
      1 Archaea Candidatus Korarchaeota
      1 Archaea Candidatus Lokiarchaeota
      1 Archaea Candidatus Micrarchaeota
     57 Archaea Crenarchaeota
    230 Archaea Euryarchaeota
     24 Archaea Thaumarchaeota
     17 Bacteria        Acidobacteria
   1164 Bacteria        Actinobacteria
     17 Bacteria        Aquificae
      3 Bacteria        Armatimonadetes
    484 Bacteria        Bacteroidetes
      1 Bacteria        Balneolaeota
      1 Bacteria        Caldiserica
      1 Bacteria        Calditrichaeota
      1 Bacteria        Candidatus Bipolaricaulota
      1 Bacteria        Candidatus Cloacimonetes
      1 Bacteria        Candidatus Omnitrophica
      4 Bacteria        Candidatus Saccharibacteria
     22 Bacteria        Chlamydiae
     14 Bacteria        Chlorobi
     19 Bacteria        Chloroflexi
      1 Bacteria        Chrysiogenetes
      8 Bacteria        Coprothermobacterota
    190 Bacteria        Cyanobacteria
      5 Bacteria        Deferribacteres
     36 Bacteria        Deinococcus-Thermus
      2 Bacteria        Dictyoglomi
      3 Bacteria        Elusimicrobia
      1 Bacteria        Fibrobacteres
   1001 Bacteria        Firmicutes
     25 Bacteria        Fusobacteria
      4 Bacteria        Gemmatimonadetes
      2 Bacteria        Ignavibacteriae
      1 Bacteria        Kiritimatiellaeota
      9 Bacteria        Nitrospirae
     35 Bacteria        Planctomycetes
   3102 Bacteria        Proteobacteria
     60 Bacteria        Spirochaetes
      8 Bacteria        Synergistetes
    152 Bacteria        Tenericutes
      7 Bacteria        Thermodesulfobacteria
     32 Bacteria        Thermotogae
     16 Bacteria        Verrucomicrobia
     24 Eukaryota       Apicomplexa
     52 Eukaryota       Ascomycota
      2 Eukaryota       Bacillariophyta
      5 Eukaryota       Basidiomycota
      1 Eukaryota       Cercozoa
      4 Eukaryota       Chlorophyta
      1 Eukaryota       Chordata
      1 Eukaryota       Ciliophora
      7 Eukaryota       Euglenozoa
      2 Eukaryota       Evosea
      4 Eukaryota       Microsporidia
      4 Eukaryota       Rhodophyta
     69 Eukaryota       Streptophyta
    729 Viruses Artverviricota
    410 Viruses Cossaviricota
    903 Viruses Cressdnaviricota
      7 Viruses Dividoviricota
    222 Viruses Duplornaviricota
     46 Viruses Hofneiviricota
    789 Viruses Kitrinoviricota
   1153 Viruses Lenarviricota
    645 Viruses Negarnaviricota
    127 Viruses Nucleocytoviricota
    111 Viruses Peploviricota
     61 Viruses Phixviricota
    894 Viruses Pisuviricota
     94 Viruses Preplasmiviricota
    417 Viruses Saleviricota
   3542 Viruses Uroviricota
上一篇下一篇

猜你喜欢

热点阅读