物种分歧时间的估计以及基因家族的收缩与扩张
0.软件的安装略
1.物种树的构建
参见 https://www.jianshu.com/p/336b65ca1b67
假设我们已经通过上述网页拿到了A-c.phy和相应的物种树文件
2.估计碱基替换速率(对应软件版本4.8, 4.9的参数和本文不同)
软件版本4.8
/data/01/user157/software/paml4.8/bin/baseml baseml.ctl
baseml.input.tree的内容:
6 1
((Bamboo,XD),((sppCa,sppJu)'@.001',(sppGa,sppGo)));
#需要标定时间,@.001表示大约在10万年
#标定的时间可以用timetree获得,输入两个物种,可以得到这两个物种的最晚分化时间
baseml.ctl内容如下, 前两行是上面提到的输入文件:
seqfile = A-c.phy
treefile = baseml.input.tree
outfile = mlb * main result file
noisy = 3 * 0,1,2,3: how much rubbish on the screen
verbose = 0 * 1: detailed output, 0: concise output
runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic
* 3: StepwiseAddition; (4,5):PerturbationNNI
model = 7 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
* 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu
Mgene = 0 * 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff
clock = 1 * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
fix_kappa = 0 * 0: estimate kappa; 1: fix kappa at value below; 2: kappa for branches
kappa = 2 * initial or fixed kappa
fix_alpha = 0 * 0: estimate alpha; 1: fix alpha at value below
alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
Malpha = 0 * 1: different alpha's for genes, 0: one alpha
ncatG = 5 * # of categories in the dG, AdG, or nparK models of rates
nparK = 0 * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK
nhomo = 0 * 0 & 1: homogeneous, 2: kappa for branches, 3: N1, 4: N2
getSE = 1 * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states
Small_Diff = 7e-6
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
* icode = 0 * (with RateAncestor=1. try "GC" in data,model=4,Mgene=4)
* fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional
method = 0 * Optimization method 0: simultaneous; 1: one branch a time
输出文件mlb中会有这么几行,记住这个数字, 这个就是替换率
Substitution rate is per time unit
0.395351 +- 0.000685
3.第一次运行mcmctree
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl
输出文件out.BV, 用作下面步骤的输入文件
mcmctreel.input.tree的内容:
6 1
((Bamboo,XD),((sppCa,sppJu)'>.001',(sppGa,sppGo)));
#需要标定时间,>.001表示大于10万年
seed = -1
seqfile = A-c.phy
treefile = mcmc.input.tree
outfile = out
ndata = 1
seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs
usedata = 3 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
clock = 2 * 1: global clock; 2: independent rates; 3: correlated rates
RootAge = <1.0 * safe constraint on root age, used if no fossil for root.
model = 0 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = 0 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
BDparas = 1 1 0 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha
rgene_gamma = 1 2.529397927411 1 * gamma prior for overall rates for genes ### 1/替换率
sigma2_gamma = 1 10 1 * gamma prior for sigma^2 (for clock=2 or 3)
finetune = 1: .05 0.1 0.12 0.1 .3 * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr
print = 1
burnin = 500000
sampfreq = 5000
nsample = 20000
cat out.BV > in.BV #用作后面的输入文件
4.第二次运行mcmctree,把上面的in.BV拷贝在第二次运行的目录下
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl
这一次的mcmctree.ctl的内容:(区别在于usedata = 2
seed = -1
seqfile = A-c.phy
treefile = mcmc.input.tree
outfile = out
ndata = 1
seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs
usedata = 2 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
clock = 2 * 1: global clock; 2: independent rates; 3: correlated rates
RootAge = <1.0 * safe constraint on root age, used if no fossil for root.
model = 0 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = 0 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
BDparas = 1 1 0 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha
rgene_gamma = 1 2.292111240743 1 * gamma prior for overall rates for genes ### 1/替换率
sigma2_gamma = 1 10 1 * gamma prior for sigma^2 (for clock=2 or 3)
finetune = 1: .05 0.1 0.12 0.1 .3 * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr
print = 1
burnin = 500000
sampfreq = 5000
nsample = 20000
*** Note: Make your window wider (100 columns) before running the program.
然后就会输出FigTree.tre,也就是最后的结果了
5.可以重复第四步,结果稳定即可
mcmc.ctl的最后几个参数可以尝试更换
6.我们就拿到了Cafe的输入文件(raw)
内容如下:
#NEXUS
BEGIN TREES;
UTREE 1 = ((Bamboo: 0.507727, XD: 0.507727) [&95%={0.266, 0.907}]: 0.055557, ((sppCa: 0.010037, sppJu: 0.010037) [&95%={0.004, 0.020}]: 0.010716, (sppGa: 0.012062, sppGo: 0.012062) [&95%={0.005, 0.024}]: 0.008690) [&95%={0.009, 0.041}]: 0.542531) [&95%={0.286, 0.978}];
END;
需要删掉空格和置信区间,空格一定要删干净,这个就是cafe要求的输入文件格式
我命名为FigTree.nwk
#grep "UTREE 1 =" FigTree.tre | sed -E -e "s/\[[^]]*\]//g" -e "s/[ \t]//g" -e "/^$/d" -e "s/UTREE1=//" > FigTree.nwk
((Bamboo:0.507727,XD:0.507727):0.055557,((sppCa:0.010037,sppJu: 0.010037):0.010716,(sppGa:0.012062,sppGo:0.012062):0.008690):0.542531);
7.cafe的安装可以用conda,这里略,然后准备第二个输入文件,也是OrthoFinder的结果
awk -v OFS="\t" '{if($1=="Orthogroup"){print"Descript",$1,$2,$3,$4,$5,$6,$7}else{print"(null)",$1,$2,$3,$4,$5,$6,$7}}' Orthogroups.GeneCount.tsv > GeneCounts.tsv
python /data/01/user158/kuangzhuoran/software/CAFE5/docs/tutorial/clade_and_size_filter.py -i GeneCounts.tsv -o gene_family_filter.txt -s
cafe5 -i gene_family_filter.txt -t FigTree.nwk -o out -c 20
8.输出的文件都仔细看一下, 提取某个物种显著扩张收缩的基因
cat Base_family_results.txt | grep "y"|cut -f1 >p0.05.significant
#提取显著的OG
head -n 1 Base_change.tab > tmp1
grep -f p0.05.significant Base_change.tab > tmp2
cat tmp1 tmp2 > Base_p0.05change.tab
#提取对应OG的expand和expansion数
cat Base_p0.05change.tab | cut -f1,4 | grep "+[1-9]" | cut -f1 > sppJu.significant.expand
#cut -f1,4 的这个4,就是对应的节点/物种,我想要研究的物种
#grep "+[1-9]" 就是提取显著扩张的
cat Base_p0.05change.tab | cut -f1,4 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > sppJu.significant.expansion
#cut -f1,4 的这个4,就是对应的节点/物种,我想要研究的物种
#grep "-" 就是提取显著收缩的
#这个Orthogroups.tsv也是OrthoFinder的结果
grep -f judaei.significant.expansion Orthogroups.tsv | cut -f7 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > sppJu.significant.expansion.gene
#cut -f7 就是就是对应的节点/物种
#这样就拿到了显著收缩的基因
9.如果是提取某个节点显著收缩与扩张的基因
awk 'NR!=1 && $13>0 {print $0}' Base_count.tab | cut -f1 > node11.orthogroups
#比如我要提取Node11节点
grep -f node11.orthogroups Orthogroups.txt |sed "s/ /\n/g" | grep -E "carmeli|galili|golani|judaei" | sort | uniq > node11.genes
#这个就是背景库
cat Base_p0.05change.tab | cut -f1,13 | grep "+[1-9]" | cut -f1 > node11.significant.expand
grep -f node11.significant.expand Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.expand.gene
cat Base_p0.05change.tab | cut -f1,13 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > node11.significant.expansion
grep -f node11.significant.expansion Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.expansion.gene
#Node11显著扩张的基因
#cut -f1,13 的这个13,就是Node11节点对应的列
#node11.genes、node11.significant.expand.gene、node11.significant.expansion.gene就是后续做富集要用到的