【跟着manual学软件】MCScan-transposed

2020-05-29 本文已影响0人 ShawnMagic

软件介绍

首先放官网：MCScanX-transposed: detecting transposed gene duplications based on multiple collinearity scans
文献：MCScanX-transposed: detecting transposed gene duplications based on multiple colinearity scans

他能干什么

首先放文章摘要：

Gene duplication occurs via different modes such as segmental and single-gene duplications. Transposed gene duplication, a specific form of single-gene duplication, ‘copies’ a gene from an ancestral chromosomal location to a novel location. MCScanX is a toolkit for detection and evolutionary analysis of gene colinearity. We have developed MCScanX-transposed, a software package to detect transposed gene duplications that occurred within different epochs, based on execution of MCScanX within and between related genomes. MCScanX-transposed can be also used for integrative analysis of gene duplication modes for a genome and to annotate a gene family of interest with gene duplication modes.

首先理解什么是Transposed gene duplication（我理解为基于转座事件的基因复制），说人话就是一个基因从祖先染色体位置『跳』到一个全新的位置。
MCScanX-transposed有完整的MCScanX的功能，基于MCScanX，MCScanX-transposed可以综合的分析基因复制事件和对基因家族成员复制模型进行检测。

本文想干啥

既然作者都说了MCScanX-transposed的特点，那我的主要关注点就是对于复制事件的分析了，共线性的问题，MCScanX的教程多如牛毛，就不造轮子了。

软件安装

官网：

unzip MCScanX-transposed.zip
cd MCScanX-transposed
make

推荐关注风槿如画_7847

之前再安装MCScanX时，mac遇到了一些问题，需要再编译之前修改几个文件。
详情见：MCScanX安装、报错解决方法及简单使用，当然macos用户还需要安装xcode.
MCscanX-transpose好像解决了这些问题，直接下载好源码，make
具体参考：MCScanX-transposed安装及使用

依赖1：Bio::seqIO

在后续脚本的使用中会调用Bioperl的Bio::seqIO模块，有些操作系统不知道是不是有安装的，但是mac没有，或许你现在不了解怎么回事，也可以不用管，但是当你按照下面的命里一跑就会提示报错，报错后google吧... 可以根据安装Bioperl最基本模块Bio::SeqIO安装
这里我记录下我的解决方案：

## 首先确保你装了cpan
cpan
## 安装提示选择mirror等一系列配置，基本是一路yes下来的，有一个部分会询问你是当前用户安装还是什么的，我选了sudo
Install Bio::SeqIO
## 这个过程巨长，跟网速有关，不能做到无痛，两个原因，首先我选择了sudo，中间需要输入几次密码，还有再编译的时候他会询问你是否编译，也是一路yes

依赖2：clustalw

这里装clustalw的方式五花八门，我为了省事用conda装的，记得之前直接

conda install clustalw -y

就可以了，但是报错提示在我提供的频道里找不到这个软件，于是上google搜索关键词conda clustalw发现：https://anaconda.org/biobuilds/clustalw

conda install -c biobuilds clustalw

这里坑爹的是clustalw好像更新成了clustalw2，在bash下clustalw也没有这个alias，用Tab补齐后发现变成了clustalw2，原本想着不行重新装一个clustalw算了，后来打开看了下也没有多大区别，所以搞了个骚操作

vim add_ka_ks.pl
## 用/clustalw搜索了下脚本中调用clustalw的代码，发现:
system("clustalw -infile='temp7734.pro'");
## 然后按i进入插入模式，把clustalw改成clustalw2 ESC  :wq保存退出。

当你也可以用什么alias之类的或者直接装一个clustalw

依赖3 JDK

在编译之前需要安装JDK，不然下游分析包中的java无法编译成功，下载方法直接百度就好了

软件使用

示范数据

首先可以打开软件包中的data文件夹，官方给的实例数据，这些不仅是实例数据，更是一个脑残版数据格式实例，仿佛在说：别TM老问我数据啥格式，你打开一个看看依样画葫芦就行！

2) The input data include:
a) “at.gff”, chromosomal positions of A. thaliana genes
b) “at.blast”, intraspecies blastp in A. thaliana
c) “at_al.gff”, chromosomal positions of A. thaliana and A. lyrata genes
d) “at_al.blast”, interspecies blastp between A. thaliana and A. lyrata
e) “at_br.gff”, chromosomal positions of A. thaliana and Brassica rapa genes
f) “at_br.blast”, interspecies blastp between A. thaliana and Brassica rapa
g) “at_cp.gff”, chromosomal positions of A. thaliana and Carica papaya genes
h) “at_cp.blast”, interspecies blastp between A. thalianaand Cariaca papaya
i) “at_pt.gff”, chromosomal positions of A. thaliana and Populus trichocarpa genes
j) “at_pt.blast”, interspecies blastp between A. thaliana and Populus trichocarpa
k) “at_vv.gff”, chromosomal positions of A. thaliana and Vitis vinifera genes
l) “at_vv.blast”, interspecies blastp between A. thaliana and Vitis vinifera
m) “at.cds”, A. thaliana coding sequences
n) “mads.genes”, A. thaliana MADS box genes
o) “mads.newick”, bracket tree of A. thaliana MADS box genes

准备文件

其实需要准备xx.blast, xx.gff, xx.cds, xx.genes以及xx.newick这几个文件就可以了，做blast的话，可以conda下面安装blast，然后把准备好的蛋白序列进行blast，我做的是target to outgroup,也就是target作为query，outgroup作为database。
如果不习惯代码行操作，推荐使用TBtools的blast GUI wrapper目录下的Two Sequence File工具，选择参数的时候把输出格式选择为table，把NumofHits选择为5. 软件Q&A部分推荐blast结果no more than 5的原因。
gff文件和MCScanX的一样，具体自己翻翻简书，有写好的命令行一键把原始gff转换成MCScanX的gff格式的。
cds文件就是测序玩组装好的cds文件，格式为fasta
xxx.genes xxx家族的基因ID
xxx.newick，这个不用说，就是这个家族基于蛋白序列的系统进化树

核心代码

MCScanX-transposed
This program carries out detection of transposed gene duplications within different epochs and classificaiton of gene duplication modes.
Usage:

perl MCScanX-transposed.pl -i data_directory -t target_species -c outgroup_species(comma_delimited) -o output_directory

Optional:
-x number_of_different_epoches (if specified, outgroup species must be provided in the order of divergence from the target species(most recent first), default: 1, only consider the transposed duplications that occurred after the divergence between target species and all outgroups )
-a 1 or 0(are segmental duplicates ancestral loci or not? default: 1, yes)
-d number_of_genes(maximum distance to call proximal, default: 10)"
IMPORTANT:Users must prepare the input files by carefully reading the following instructions (1-4).

All input files should be stored under ONE folder(the "data_directory" parameter)

For the target genome in which gene duplicaiton modes will be classified, please prepare two input files:
a) "[target_species].gff", a gene position file for the target species, following a tab-delimited format: "sp&chr_NO gene starting_position ending_position"
b) "[target_species].blast", a blastp output file (m8 format) for the target species (self-genome comparison).

For each outgroup genome, please prepare two input files:
a) "[target_species][outgroup_species].gff", a gene position file for the target_species and outgroup_species, following a tab-delimited format:"sp&chr_NO gene starting_position ending_position"
b) "[target_species][outgroup_species].blast", a blastp output file (m8 format) between the target and outgroup species (cross-genome comparison).

For example, assuming that you are going to classify gene duplication modes in Arabidopsis thaliana (ID: at), using Brassica rapa (ID: br) and Carica papaya (ID: cp) as outgroups, you need to prepare 6 input files: "at.gff","at.blast", "at_br.gff", "at_br.blast","at_br.gff","at_cp.gff" and "at_cp.blast".

主体包括了4个参数

-i你存放数据的目录，注意测试发现必须吧数据放在软件路径下的data路径（文件夹）下，可以在data下面建立子目录，比如我做的棉花，那你的路径

~/xxx/MCScanX-transposed/result/cotton

-t 这个和MCScanX操作一样，如果按照示例数据来，你要比较拟南芥和其他那么，你所有拟南芥的数据均是以at.xxx命名的，那么这里跟的就是at，严格按照文件名不要发挥想象写成什么At,Arabidopsis,ninanjie😂.
-c 比对物种的简称，每个物种简称直接用逗号隔开，注意是英文逗号
-o 输出文件，包括位置，前缀等等。
还有三个可选参数
-x 遗传学没学好，这里理解有些困难，epoches的中文意思翻译过来是世代，纪元（如有错误请指出，谢谢）如果指定-x参数，outgroup中的物种必须按照与目标物种世代关系远近排列，关系越近，排名越靠前。默认值为1，意思是target物种中转座重复只发生在target物种和其他物种分化之后。
-a 片段重复是否在祖先位点...... 默认是1
-d 定义为proximal的最大基因个数，默认是10。(能理解啥意思，翻译过来感觉贼别扭ಥ_ಥ)

Examples:

perl MCScanX-transposed.pl -i data -t at -c al,br,cp,pt,vv -o result/test1 -x 3
perl MCScanX-transposed.pl -i data -t at -c br,cp,pt,vv -o result/test2

Different modes of gene duplication including segmental, tandem, proximal and transposed are output as separate files (".pairs") with each line containing one gene duplication. In transposed duplications,the first duplicated gene is the transposed locus. Unique duplicates belonging to each mode are also output as a separate file (".genes"). Interim collinearity files generated by MCScanX are available under the output directory.

Google了一下at和al的关系：

Arabidopsis lyrata is the closest well-characterized relative of the model plant Arabidopsis thaliana, thought to have separated ~5 mya

这里可以看到at和al的亲缘关系是要比br,cp,pt,vv等要近的多的，所以在有al的比较中修改了-x的值为3，但是在没有al时，认为这些转座重复的产生是在物种分化之后形成的，这里就需要用到默认的-x值1。当认为转座重复事件发生在物种分化之前的时候，输出结果中会计算目标物种和outgroup中每个物种的同源基因复制关系：

The results are under the directory “result/at_result”. Segmental, tandem, proximal and transposed duplications are in the files “at.segmental.pairs”, “at.tandem.pairs”, “at.proximal.pairs” and “at.transposed.pairs” respectively. Segmental, tandem, proximal and transposed duplicates are in the files “at.segmental.genes”, “at.tandem.genes”, “at.proximal.genes” and “at.transposed.genes” respectively.
The transposed duplications that occurred after A. thaliana-A. lyrata divergence, between A. thaliana-A. lyrata and Arabidopsis-Brassica divergence and between Arabidopsis-Brassica and Arabidopsis-Carica divergence are in the files "at.transposed_after_al.pairs", "at.transposed_between_al_br.pairs" and "at.transposed_between_br_cp.pairs" respectively.
Note that interim collinearity files are also available, including “at. Collinearity”, “at_al.collinearity”, “at_br.collinearity”, “at_cp.collinearity”, “at_pt.collinearity” and “at_vv.collinearity”.

计算KaKs

不要问我计算的结果和kakscalculator等其他软件哪个更好，作为一个调包侠这问题超纲了，如果你懂计算原理，自己用sublime，editplus等软件打开add_ka_ks.pl这个脚本看，里面有计算过程。
看脚本这里需要用到Bio::SeqIO和clustalw,如果你之前没有装的话这里会报错，所以如果你跑这个脚本报错的话，自行百度Google，stackoverflow，biostar上找方案吧。我的解决方案不一定适合你，特别是跨平台。

比如要计算at和al分化之后拟南芥中重复基因的kaks，需要执行：

“perl add_ka_ks.pl -d data/at.cds -i result/at_result/at.transposed_after_al.pairs -o result/at.transposed_after_al.pairs.kaks”

我做的陆地棉的结果如下：

Transposed	Location	Parental	Location	E-value	Ka	Ks
Gh_A01G000200	A01:60647	Gh_Contig00288G000400	Contig00288:58357	0.0	0.0861685543371287	0.102513449889525
Gh_A01G015700	A01:1329366	Gh_D11G230300	D11:32648963	7.48e-176	0.107481363628075	0.100119438035867
Gh_A01G016300	A01:1392126	Gh_D11G091700	D11:7573295	0.0	0.150062951494286	0.93426202813272

基因家族重复事件

这个很简单，需要你把家族成员的ID写到一个文件中，然后运行：

perl detect_dup_modes_for_a_family.pl -i data/mads.genes -d result/at_result/at -o result/mads.duplication.modes

结果就是你指定的输出文件，参数和上面的核心参数也一致。最后会列出一个表，表中会注明复制事件的模式，例如是串联重复（tandem），片段重复（segmental），近端重复(proximal)还是转座重复（transposed）。

在进化树上显示基因家族复制事件

很奇怪，用示例数据可以直接跑出结果，如下：

mads.tree.modes.png

但是我尝试了下自己的数据发现会报错：

Exception in thread "main" java.lang.NullPointerException
    at annotate_tree_with_dup_modes.recursion(annotate_tree_with_dup_modes.java:144)
    at annotate_tree_with_dup_modes.processtree(annotate_tree_with_dup_modes.java:207)
    at annotate_tree_with_dup_modes.main(annotate_tree_with_dup_modes.java:379)

经过各种尝试还是失败.... 没有接触过java，真的不太了解这个错误的原因，但是上面这个图实现的原理也很简单，完全可以用其他软件或者其他方式实现，所以不是刚需，也没有再纠结。喜欢折腾的小伙伴可以折腾下！

不同世代转座事件的可视化

同样这个也没有尝试，因为我的结果中不存在transposed-duplication，全部都是segmental或者tandem，放上示例数据结果：

after_al.png

总结

MCScanX-Transposed 个人感觉是一款非常容易上手，门槛不高的软件（调包侠的角度），至于结果，毕竟不懂原理，不敢妄加评论。