step by step基因家族分析

【跟着manual学软件】MCScan-transposed

2020-05-29  本文已影响0人  ShawnMagic

软件介绍

首先放官网:MCScanX-transposed: detecting transposed gene duplications based on multiple collinearity scans
文献:MCScanX-transposed: detecting transposed gene duplications based on multiple colinearity scans

他能干什么

首先放文章摘要:

Gene duplication occurs via different modes such as segmental and single-gene duplications. Transposed gene duplication, a specific form of single-gene duplication, ‘copies’ a gene from an ancestral chromosomal location to a novel location. MCScanX is a toolkit for detection and evolutionary analysis of gene colinearity. We have developed MCScanX-transposed, a software package to detect transposed gene duplications that occurred within different epochs, based on execution of MCScanX within and between related genomes. MCScanX-transposed can be also used for integrative analysis of gene duplication modes for a genome and to annotate a gene family of interest with gene duplication modes.

本文想干啥

既然作者都说了MCScanX-transposed的特点,那我的主要关注点就是对于复制事件的分析了,共线性的问题,MCScanX的教程多如牛毛,就不造轮子了。

软件安装

官网:

unzip MCScanX-transposed.zip
cd MCScanX-transposed
make

推荐关注风槿如画_7847

依赖1:Bio::seqIO

在后续脚本的使用中会调用BioperlBio::seqIO模块,有些操作系统不知道是不是有安装的,但是mac没有,或许你现在不了解怎么回事,也可以不用管,但是当你按照下面的命里一跑就会提示报错,报错后google吧... 可以根据安装Bioperl最基本模块Bio::SeqIO安装
这里我记录下我的解决方案:

## 首先确保你装了cpan
cpan
## 安装提示选择mirror等一系列配置,基本是一路yes下来的,有一个部分会询问你是当前用户安装还是什么的,我选了sudo
Install Bio::SeqIO
## 这个过程巨长,跟网速有关,不能做到无痛,两个原因,首先我选择了sudo,中间需要输入几次密码,还有再编译的时候他会询问你是否编译,也是一路yes

依赖2:clustalw

这里装clustalw的方式五花八门,我为了省事用conda装的,记得之前直接

conda install clustalw -y

就可以了,但是报错提示在我提供的频道里找不到这个软件,于是上google搜索关键词conda clustalw发现:https://anaconda.org/biobuilds/clustalw

conda install -c biobuilds clustalw

这里坑爹的是clustalw好像更新成了clustalw2,在bash下clustalw也没有这个alias,用Tab补齐后发现变成了clustalw2, 原本想着不行重新装一个clustalw算了,后来打开看了下也没有多大区别,所以搞了个骚操作

vim add_ka_ks.pl
## 用/clustalw搜索了下脚本中调用clustalw的代码,发现:
system("clustalw -infile='temp7734.pro'");
## 然后按i进入插入模式,把clustalw改成clustalw2 ESC  :wq保存退出。

当你也可以用什么alias之类的或者直接装一个clustalw

依赖3 JDK

在编译之前需要安装JDK,不然下游分析包中的java无法编译成功,下载方法直接百度就好了

软件使用

示范数据

首先可以打开软件包中的data文件夹,官方给的实例数据,这些不仅是实例数据,更是一个脑残版数据格式实例,仿佛在说:别TM老问我数据啥格式,你打开一个看看依样画葫芦就行!

2) The input data include:
a) “at.gff”, chromosomal positions of A. thaliana genes
b) “at.blast”, intraspecies blastp in A. thaliana
c) “at_al.gff”, chromosomal positions of A. thaliana and A. lyrata genes
d) “at_al.blast”, interspecies blastp between A. thaliana and A. lyrata
e) “at_br.gff”, chromosomal positions of A. thaliana and Brassica rapa genes
f) “at_br.blast”, interspecies blastp between A. thaliana and Brassica rapa
g) “at_cp.gff”, chromosomal positions of A. thaliana and Carica papaya genes
h) “at_cp.blast”, interspecies blastp between A. thalianaand Cariaca papaya
i) “at_pt.gff”, chromosomal positions of A. thaliana and Populus trichocarpa genes
j) “at_pt.blast”, interspecies blastp between A. thaliana and Populus trichocarpa
k) “at_vv.gff”, chromosomal positions of A. thaliana and Vitis vinifera genes
l) “at_vv.blast”, interspecies blastp between A. thaliana and Vitis vinifera
m) “at.cds”, A. thaliana coding sequences
n) “mads.genes”, A. thaliana MADS box genes
o) “mads.newick”, bracket tree of A. thaliana MADS box genes

准备文件

核心代码

MCScanX-transposed
This program carries out detection of transposed gene duplications within different epochs and classificaiton of gene duplication modes.
Usage:

perl MCScanX-transposed.pl -i data_directory -t target_species -c outgroup_species(comma_delimited) -o output_directory

Optional:
-x number_of_different_epoches (if specified, outgroup species must be provided in the order of divergence from the target species(most recent first), default: 1, only consider the transposed duplications that occurred after the divergence between target species and all outgroups )
-a 1 or 0(are segmental duplicates ancestral loci or not? default: 1, yes)
-d number_of_genes(maximum distance to call proximal, default: 10)"
IMPORTANT:Users must prepare the input files by carefully reading the following instructions (1-4).

  1. All input files should be stored under ONE folder(the "data_directory" parameter)
  2. For the target genome in which gene duplicaiton modes will be classified, please prepare two input files:
    a) "[target_species].gff", a gene position file for the target species, following a tab-delimited format: "sp&chr_NO gene starting_position ending_position"
    b) "[target_species].blast", a blastp output file (m8 format) for the target species (self-genome comparison).
  3. For each outgroup genome, please prepare two input files:
    a) "[target_species][outgroup_species].gff", a gene position file for the target_species and outgroup_species, following a tab-delimited format:"sp&chr_NO gene starting_position ending_position"
    b) "[target_species]
    [outgroup_species].blast", a blastp output file (m8 format) between the target and outgroup species (cross-genome comparison).
  4. For example, assuming that you are going to classify gene duplication modes in Arabidopsis thaliana (ID: at), using Brassica rapa (ID: br) and Carica papaya (ID: cp) as outgroups, you need to prepare 6 input files: "at.gff","at.blast", "at_br.gff", "at_br.blast","at_br.gff","at_cp.gff" and "at_cp.blast".

主体包括了4个参数

~/xxx/MCScanX-transposed/result/cotton

Examples:

perl MCScanX-transposed.pl -i data -t at -c al,br,cp,pt,vv -o result/test1 -x 3
perl MCScanX-transposed.pl -i data -t at -c br,cp,pt,vv -o result/test2

Different modes of gene duplication including segmental, tandem, proximal and transposed are output as separate files (".pairs") with each line containing one gene duplication. In transposed duplications,the first duplicated gene is the transposed locus. Unique duplicates belonging to each mode are also output as a separate file (".genes"). Interim collinearity files generated by MCScanX are available under the output directory.

Google了一下atal的关系

Arabidopsis lyrata is the closest well-characterized relative of the model plant Arabidopsis thaliana, thought to have separated ~5 mya

这里可以看到atal的亲缘关系是要比br,cp,pt,vv等要近的多的,所以在有al的比较中修改了-x的值为3,但是在没有al时,认为这些转座重复的产生是在物种分化之后形成的,这里就需要用到默认的-x值1。当认为转座重复事件发生在物种分化之前的时候,输出结果中会计算目标物种和outgroup中每个物种的同源基因复制关系:

The results are under the directory “result/at_result”. Segmental, tandem, proximal and transposed duplications are in the files “at.segmental.pairs”, “at.tandem.pairs”, “at.proximal.pairs” and “at.transposed.pairs” respectively. Segmental, tandem, proximal and transposed duplicates are in the files “at.segmental.genes”, “at.tandem.genes”, “at.proximal.genes” and “at.transposed.genes” respectively.
The transposed duplications that occurred after A. thaliana-A. lyrata divergence, between A. thaliana-A. lyrata and Arabidopsis-Brassica divergence and between Arabidopsis-Brassica and Arabidopsis-Carica divergence are in the files "at.transposed_after_al.pairs", "at.transposed_between_al_br.pairs" and "at.transposed_between_br_cp.pairs" respectively.
Note that interim collinearity files are also available, including “at. Collinearity”, “at_al.collinearity”, “at_br.collinearity”, “at_cp.collinearity”, “at_pt.collinearity” and “at_vv.collinearity”.

计算KaKs

不要问我计算的结果和kakscalculator等其他软件哪个更好,作为一个调包侠这问题超纲了,如果你懂计算原理,自己用sublime,editplus等软件打开add_ka_ks.pl这个脚本看,里面有计算过程。
看脚本这里需要用到Bio::SeqIOclustalw,如果你之前没有装的话这里会报错,所以如果你跑这个脚本报错的话,自行百度Google,stackoverflow,biostar上找方案吧。我的解决方案不一定适合你,特别是跨平台。

比如要计算atal分化之后拟南芥中重复基因的kaks,需要执行:

“perl add_ka_ks.pl -d data/at.cds -i result/at_result/at.transposed_after_al.pairs -o result/at.transposed_after_al.pairs.kaks”

我做的陆地棉的结果如下:

Transposed Location Parental Location E-value Ka Ks
Gh_A01G000200 A01:60647 Gh_Contig00288G000400 Contig00288:58357 0.0 0.0861685543371287 0.102513449889525
Gh_A01G015700 A01:1329366 Gh_D11G230300 D11:32648963 7.48e-176 0.107481363628075 0.100119438035867
Gh_A01G016300 A01:1392126 Gh_D11G091700 D11:7573295 0.0 0.150062951494286 0.93426202813272

基因家族重复事件

这个很简单,需要你把家族成员的ID写到一个文件中,然后运行:

perl detect_dup_modes_for_a_family.pl -i data/mads.genes -d result/at_result/at -o result/mads.duplication.modes

结果就是你指定的输出文件,参数和上面的核心参数也一致。最后会列出一个表,表中会注明复制事件的模式,例如是串联重复(tandem),片段重复(segmental),近端重复(proximal)还是转座重复(transposed)。

在进化树上显示基因家族复制事件

很奇怪,用示例数据可以直接跑出结果,如下:


mads.tree.modes.png

但是我尝试了下自己的数据发现会报错:

Exception in thread "main" java.lang.NullPointerException
    at annotate_tree_with_dup_modes.recursion(annotate_tree_with_dup_modes.java:144)
    at annotate_tree_with_dup_modes.processtree(annotate_tree_with_dup_modes.java:207)
    at annotate_tree_with_dup_modes.main(annotate_tree_with_dup_modes.java:379)

经过各种尝试还是失败.... 没有接触过java,真的不太了解这个错误的原因,但是上面这个图实现的原理也很简单,完全可以用其他软件或者其他方式实现,所以不是刚需,也没有再纠结。 喜欢折腾的小伙伴可以折腾下!

不同世代转座事件的可视化

同样这个也没有尝试,因为我的结果中不存在transposed-duplication,全部都是segmental或者tandem,放上示例数据结果:


after_al.png

总结

MCScanX-Transposed 个人感觉是一款非常容易上手,门槛不高的软件(调包侠的角度),至于结果,毕竟不懂原理,不敢妄加评论。

上一篇 下一篇

猜你喜欢

热点阅读