Identifying Homologous Genes with SwiftOrtho

2021-04-30 DumplingLucky

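SwiftOrtho is a tool for identifying homologous genes, released in 2019. Compared with OrthoMCL, it supports multithreading and runs considerably faster.
Paper: SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier.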

1. Environment setup

Make sure the following are installed:

Python 3.7 (can be installed via Anaconda).
GNU compiler GCC 6.0 or later.
MCL (optional).
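
Before installing, it can help to confirm that the prerequisites are visible on the PATH. The following is a minimal sketch, not part of SwiftOrtho; the function name `check_environment` is made up for illustration, and it only reports what it finds without checking versions.

```python
import shutil
import sys

def check_environment(tools=("gcc", "mcl")):
    """Report the running Python version and where each external tool is found."""
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for tool in tools:
        # shutil.which returns the full path, or None when the tool is not on PATH
        report[tool] = shutil.which(tool)
    return report

for name, value in check_environment().items():
    print(name, "->", value if value else "not found")
```

A missing `mcl` is not fatal, since MCL is listed as optional.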

2. Download and installation

git clone https://github.com/Rinoahu/SwiftOrtho
cd SwiftOrtho
bash ./install.sh
cd example
bash ./run.sh

3. Running the software

All-to-all homology search

python SwiftOrtho/bin/find_hit.py -p blastp -i input.fsa -d input.fsa -o input.fsa.sc -e 1e-5 -s 111111

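-i | -d: protein sequences in FASTA format. The identifier of each protein sequence in input.fsa should look like >xxx|yyyy, where xxx is the taxon code and yyyy is the sequence identifier. For example: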

>A|a1
MENIHDLWERALAEMEKKVSKPSYETWLKSTKANDIQNDVITITAPNEFARDWLEEHYAG
LTSDTIEHLTGARLTPRFVIPQNELEDDFLIEPPKKKKPVSDNGSQNNGTKTMLNDKYTF
...
>A|b2
MQFTIQRDRFVHDVQNVAKAVSSRTTIPILTGIKIVADHEGVTLTGSDSDVSIETFIPKE
...
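
Since the later steps rely on the taxon|identifier convention, a quick check of the headers before running the search can save time. This is a minimal sketch under the assumption that the separator is `|` and that the identifier is the first whitespace-delimited token of the header; `check_headers` is a hypothetical helper, not a SwiftOrtho function.

```python
import io

def check_headers(handle, sep="|"):
    """Return the header lines that do not look like '>taxon|sequence_id'."""
    bad = []
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        tokens = line[1:].split()
        name = tokens[0] if tokens else ""
        taxon, found, seq_id = name.partition(sep)
        if not (found and taxon and seq_id):
            bad.append(line.rstrip("\n"))
    return bad

# Small in-memory example; the third record lacks a taxon code.
example = io.StringIO(">A|a1\nMENIHDLW\n>A|b2\nMQFTIQ\n>c3\nMKK\n")
bad_headers = check_headers(example)
print(bad_headers)
```

For a real file, pass `open("input.fsa")` instead of the in-memory example.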

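-o: output file. A tabular text file with 14 columns. The first 12 columns are the same as the blastp -m8 format, and the last 2 columns are the lengths of the query and target sequences. For example: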

A|a1    A|a1    100.00  450 0   0   1   450 1   450 2.88e-261   897 450 450
A|a1    B|b2    53.52   340 158 0   111 450 240 579 1.60e-105   380 450 583
...

-e: E-value threshold.
-s: seed pattern.
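
To make the 14-column layout concrete, the sketch below parses the two example rows and derives an alignment coverage from the coordinates and the two length columns. The field names are my own labels, and the coverage definition used here (aligned span divided by sequence length, taking the smaller of query and target) is an assumption for illustration; SwiftOrtho may compute coverage differently.

```python
from collections import namedtuple

# Field names are illustrative labels for the 14 columns described above.
Hit = namedtuple("Hit", "query target identity aln_len mismatch gaps "
                        "q_start q_end t_start t_end evalue score q_len t_len")

def parse_hit(line):
    f = line.split()
    return Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]),
               float(f[10]), float(f[11]), int(f[12]), int(f[13]))

def coverage(hit):
    """Assumed definition: aligned span / sequence length, minimum of both sides."""
    q_cov = (hit.q_end - hit.q_start + 1) / hit.q_len
    t_cov = (hit.t_end - hit.t_start + 1) / hit.t_len
    return min(q_cov, t_cov)

rows = [
    "A|a1    A|a1    100.00  450 0   0   1   450 1   450 2.88e-261   897 450 450",
    "A|a1    B|b2    53.52   340 158 0   111 450 240 579 1.60e-105   380 450 583",
]
hits = [parse_hit(r) for r in rows]
for h in hits:
    print(h.query, h.target, h.identity, round(coverage(h), 3))
```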

Orthology inference

python SwiftOrtho/bin/find_orth.py -i input.fsa.sc -c 0.5 -y 0 > input.fsa.sc.orth

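-i: input file. This is the output of the all-to-all search step above, or BLAST -m8 output.
-c: threshold for the alignment coverage of a sequence pair.
-y: threshold for the alignment identity of a sequence pair.
-s: separator between the taxon code and the sequence ID. The default is |.

Output: written to standard output (redirected to input.fsa.sc.orth in the command above). It is a tabular text file with 4 columns, in the following format: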

    OT  A|a1    B|b1    1.33510402833
    IP  A|a1    A|a2    1.23374340949
    CO  A|a2    B|b2    1.41459539212
    ...

Col 1: orthology relationship, one of OT(ortholog), CO(co-ortholog), or IP(in-paralog).
Col 2: identifier of gene in species A.
Col 3: identifier of gene in species B.
Col 4: weight of orthology relationship.
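
As a quick sanity check on the inference output, the relationships can be read back and tallied by type. This is a minimal sketch assuming whitespace-delimited columns as in the example above; `read_relationships` is a hypothetical helper name.

```python
from collections import Counter

def read_relationships(lines):
    """Yield (type, gene_a, gene_b, weight) for each well-formed 4-column line."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank lines and placeholders such as "..."
        kind, gene_a, gene_b, weight = fields
        yield kind, gene_a, gene_b, float(weight)

sample = """OT  A|a1    B|b1    1.33510402833
IP  A|a1    A|a2    1.23374340949
CO  A|a2    B|b2    1.41459539212
...
"""
relations = list(read_relationships(sample.splitlines()))
counts = Counter(kind for kind, *_ in relations)
print(dict(counts))
```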

Clustering orthology relationships

SwiftOrtho provides two clustering algorithms implemented in Python: Markov Clustering (MCL) and Affinity Propagation (APC).

    # use MCL
    python SwiftOrtho/bin/find_cluster.py -i input.fsa.sc.orth -a mcl -I 1.5 > input.fsa.sc.orth.mcl
    # use APC
    python SwiftOrtho/bin/find_cluster.py -i input.fsa.sc.orth -a apc -I 1.5 > input.fsa.sc.orth.apc

-i: input file, for example:

    OT  A|a1    B|b1    1.33510402833
    IP  A|a1    A|a2    1.23374340949
    CO  A|a2    B|b2    1.41459539212
    ...

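-a: clustering algorithm to use, [mcl|apc].
-I: inflation parameter, used only by mcl.

Output: written to standard output. Each line represents one orthology group and lists gene identifiers from the same or different species. For example: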

    A|a1    A|a2    B|b1    C|c1    D|d1
    A|a3    B|b3    C|c3
    A|a4    B|b4
    A|a5    A|a6
    ...
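
Because every line is one group, simple summaries are easy to compute, for instance how many genes and how many distinct species each group contains. The sketch below assumes whitespace-delimited identifiers with `|` as the taxon separator; `summarize_groups` is a hypothetical helper name.

```python
def summarize_groups(lines, sep="|"):
    """Return (gene_count, species_count) for each orthology group line."""
    summary = []
    for line in lines:
        genes = line.split()
        if not genes or genes == ["..."]:
            continue  # skip blank lines and the "..." placeholder
        taxa = {gene.split(sep, 1)[0] for gene in genes}
        summary.append((len(genes), len(taxa)))
    return summary

sample = """A|a1    A|a2    B|b1    C|c1    D|d1
A|a3    B|b3    C|c3
A|a4    B|b4
A|a5    A|a6
"""
group_summary = summarize_groups(sample.splitlines())
print(group_summary)
```

A group whose species count is 1, such as the last one, contains only genes from a single species.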

To cluster with the original MCL program instead, proceed as follows:

    cut -f2-4 input.fsa.sc.orth > input.fsa.sc.orth.xyz
    mcl input.fsa.sc.orth.xyz --abc -I 1.5 -o input.fsa.sc.orth.mcl -te 4
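
The `cut` step drops the relationship type and keeps the two gene identifiers and the weight, which is the label-label-weight layout that `mcl --abc` reads. The same conversion can be sketched in Python, assuming a tab-delimited .orth file; `to_abc` is a hypothetical helper name.

```python
def to_abc(lines):
    """Keep columns 2-4 (gene, gene, weight) of a tab-delimited .orth file."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            # Unlike cut, lines with fewer than 4 columns are skipped here.
            yield "\t".join(fields[1:4])

sample = [
    "OT\tA|a1\tB|b1\t1.33510402833\n",
    "IP\tA|a1\tA|a2\t1.23374340949\n",
]
abc_lines = list(to_abc(sample))
print("\n".join(abc_lines))
```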

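The project also provides integrated scripts for identifying homologous genes.
merge.py merges multiple FASTA files into a single file and adds a label to the gene identifiers of each species.

python scripts/merge.py dir_name > merge.fasta

dir_name is the directory that contains all the FASTA files.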
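
To illustrate the idea of merging and labelling, here is a minimal sketch. It is not the actual merge.py: it assumes the taxon label is taken from each file's name (without extension) and joined to the first token of the header with `|`, which may differ from what the official script does.

```python
import tempfile
from pathlib import Path

def merge_fasta(dir_name, sep="|"):
    """Yield FASTA lines from every file in dir_name, with labelled headers."""
    for path in sorted(Path(dir_name).iterdir()):
        if not path.is_file():
            continue
        taxon = path.stem  # assumption: the file name serves as the taxon label
        with path.open() as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    tokens = line[1:].split()
                    seq_id = tokens[0] if tokens else ""
                    yield f">{taxon}{sep}{seq_id}"
                elif line:
                    yield line

# Demonstration on two tiny files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "A.fasta").write_text(">a1 some description\nMEN\n")
    Path(tmp, "B.fasta").write_text(">b1\nMQF\n")
    merged = list(merge_fasta(tmp))
print("\n".join(merged))
```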
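run_all.py automates the following steps:
(1) All-to-all homology search.
(2) Orthology inference.
(3) Clustering of orthology groups.
(4) Pan-genome analysis and estimation of the main features of the pan-genome.
(5) Construction of a species phylogenetic tree from conserved proteins.
(6) Operon clustering, if operon information is provided. [optional]

Requirements:

FastTree 2
One of the following alignment tools:
1. FAMSA (recommended)
2. MAFFT
3. MUSCLE
trimAl

Usage: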

python run_all.py -i test.fsa -p test.fsa.operon -a 4

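-i: input file. Protein sequences in FASTA format.
-p: operon annotation file. The first column of this file should look like x0-->x1-->x2-->x3 or x0<--x1<--x2<--x3, where x# stands for a gene identifier and <-- or --> indicates the strand of the genes. For example: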

    A|a0-->A|a1 unknown-->COG1607   unknown::unknown-->I::Acyl-CoA hydrolase::Lipid transport and metabolism
    B|b0<--B|b1 COG4644<--COG1961   X::Transposase and inactivated derivatives, TnpA family::Mobilome: prophages, transposons<--L::Site-specific DNA recombinase related to the DNA invertase Pin::Replication, recombination and repair

-a: number of threads.
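
The first column of the operon annotation file described under -p can be split into its genes and arrow direction as follows. This is a minimal sketch that assumes the first column contains no whitespace and uses a single arrow type per operon, as in the two example lines; `parse_operon_line` is a hypothetical helper name.

```python
def parse_operon_line(line):
    """Return (gene_ids, arrow) from the first column of an operon line."""
    first_column = line.split()[0]
    for arrow in ("-->", "<--"):
        if arrow in first_column:
            return first_column.split(arrow), arrow
    return [first_column], None  # single gene, no arrow present

lines = [
    "A|a0-->A|a1 unknown-->COG1607   unknown::unknown-->I::Acyl-CoA hydrolase::Lipid transport and metabolism",
    "B|b0<--B|b1 COG4644<--COG1961   X::Transposase and inactivated derivatives, TnpA family::Mobilome: prophages, transposons<--L::Site-specific DNA recombinase related to the DNA invertase Pin::Replication, recombination and repair",
]
operons = [parse_operon_line(l) for l in lines]
for genes, arrow in operons:
    print(arrow, genes)
```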
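
Result files:
test.fsa.sc: results of the all-to-all homology search.
test.fsa.aln.trim: trimmed, concatenated protein alignment of the conserved genes.
test.fsa.nwk: phylogenetic tree built from the aligned protein sequences of the conserved genes.
test.fsa.opc: orthology relationships.
test.fsa.mcl: orthology groups.
test.fsa.operon.mcl: grouped operons, reflecting the conservation of operons across multiple species. Each line lists operons from the same or different species. For example: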

 A1-->A2-->A3   B1<--B2<--B3    C1<--C2<--C3
 A4<--A5<--A6   B4<--B5<--B6    C4<--C5<--C6
 A7-->A8    B7<--B8<--B9
 ....

test.fsa.pan: main features of the pan-genome. For example:

 # Statistics and profile of pan-genome:
 # The methods can be found in Hu X, et al. Trajectory and genomic determinants of fungal-pathogen speciation and host adaptation.
 #
 # statistic of core, shared and specific genes:
 # Feature       core    shared  specific        taxon
 # Number        27      2117    9766    5
 #
 # ω(core size of pan-genome) and 95% confidence interval:
 # κc    τc      ω
 # 18001.747907101293±97986.86937584748  0.4604747552601067±0.5879003578202601   29.071595667457963±45.51565446328978
 #
 # θ(new gene number for everay new genome sequenced) and 95% confidence interval:
 # κs    τs      tg(θ)
 # 1334.0072284367752±2342.5492209911768 2.2743910535524314±9.701652708550565    1952.605831944348±1311.6323603805986
 #
 # κ(size and openess of pan-genome, open if γ > 0) and 95% confidence interval:
 # κ     γ
 # 2899.5570049130965±179.58438208536737 0.8785342365438822±0.04423040927927408
 #
 # Type and frequency of each gene group in different species:
 ################################################################################
 #family type    GCF_000005825.2_ASM582v2        GCF_000006645.1_ASM664v1        GCF_000006605.1_ASM660v1        GCF_000005845.2_ASM584v2        GCF_000006625.1_ASM662v1
 group_000000000 Share   0       1       0       1       0
 group_000000001 Specific        2       0       0       0       0
 group_000000002 Specific        2       0       0       0       0
 group_000000003 Specific        2       0       0       0       0
 group_000000004 Specific        3       0       0       0       0
 ...
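
The table at the end of the .pan file can be tallied by group type with a few lines of Python. This sketch assumes that comment lines start with `#`, that data rows start with `group_`, and that the type is the second whitespace-delimited field, as in the excerpt above; `count_group_types` is a hypothetical helper name.

```python
from collections import Counter

def count_group_types(lines):
    """Count gene groups per type (e.g. Share, Specific) in a .pan table."""
    counts = Counter()
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # statistics and header lines
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("group_"):
            continue  # blank lines, "..." and anything that is not a group row
        counts[fields[1]] += 1
    return counts

sample = """ # Type and frequency of each gene group in different species:
 group_000000000 Share   0       1       0       1       0
 group_000000001 Specific        2       0       0       0       0
 group_000000002 Specific        2       0       0       0       0
 group_000000003 Specific        2       0       0       0       0
 group_000000004 Specific        3       0       0       0       0
 ...
"""
type_counts = count_group_types(sample.splitlines())
print(dict(type_counts))
```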

Reference: https://github.com/Rinoahu/SwiftOrtho
