生信专题基因家族等基因家族

物种分歧时间的估计以及基因家族的收缩与扩张

2023-06-06  本文已影响0人  深山夕照深秋雨OvO

0.软件的安装略
1.物种树的构建
参见 https://www.jianshu.com/p/336b65ca1b67
假设我们已经通过上述网页拿到了A-c.phy和相应的物种树文件

2.估计碱基替换速率(对应软件版本4.8, 4.9的参数和本文不同)

软件版本4.8

/data/01/user157/software/paml4.8/bin/baseml baseml.ctl

baseml.input.tree的内容:

6 1
((Bamboo,XD),((sppCa,sppJu)'@.001',(sppGa,sppGo)));
#需要标定时间,@.001表示大约在10万年
#标定的时间可以用timetree获得,输入两个物种,可以得到这两个物种的最晚分化时间

baseml.ctl内容如下, 前两行是上面提到的输入文件:

      seqfile = A-c.phy
     treefile = baseml.input.tree

      outfile = mlb       * main result file
        noisy = 3   * 0,1,2,3: how much rubbish on the screen
      verbose = 0   * 1: detailed output, 0: concise output
      runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic
                    * 3: StepwiseAddition; (4,5):PerturbationNNI 

        model = 7   * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
                    * 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu

        Mgene = 0   * 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff

        clock = 1   * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
    fix_kappa = 0   * 0: estimate kappa; 1: fix kappa at value below; 2: kappa for branches
        kappa = 2  * initial or fixed kappa

    fix_alpha = 0    * 0: estimate alpha; 1: fix alpha at value below
        alpha = 0.5   * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 0   * 1: different alpha's for genes, 0: one alpha
        ncatG = 5   * # of categories in the dG, AdG, or nparK models of rates
        nparK = 0   * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK 

        nhomo = 0   * 0 & 1: homogeneous, 2: kappa for branches, 3: N1, 4: N2
        getSE = 1   * 0: don't want them, 1: want S.E.s of estimates
 RateAncestor = 0   * (0,1,2): rates (alpha>0) or ancestral states

   Small_Diff = 7e-6
    cleandata = 0  * remove sites with ambiguity data (1:yes, 0:no)?
*        icode = 0  * (with RateAncestor=1. try "GC" in data,model=4,Mgene=4)
*  fix_blength = 1  * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional
       method = 0  * Optimization method 0: simultaneous; 1: one branch a time

输出文件mlb中会有这么几行,记住这个数字, 这个就是替换率

Substitution rate is per time unit
   0.395351 +- 0.000685

3.第一次运行mcmctree
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl
输出文件out.BV, 用作下面步骤的输入文件

mcmctreel.input.tree的内容:

6 1
((Bamboo,XD),((sppCa,sppJu)'>.001',(sppGa,sppGo)));
#需要标定时间,>.001表示大于10万年
          seed = -1
       seqfile = A-c.phy
      treefile = mcmc.input.tree
       outfile = out

         ndata = 1
       seqtype = 0  * 0: nucleotides; 1:codons; 2:AAs
       usedata = 3    * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
         clock = 2    * 1: global clock; 2: independent rates; 3: correlated rates
       RootAge = <1.0  * safe constraint on root age, used if no fossil for root.

         model = 0    * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
         alpha = 0    * alpha for gamma rates at sites
         ncatG = 5    * No. categories in discrete gamma

     cleandata = 0    * remove sites with ambiguity data (1:yes, 0:no)?

       BDparas = 1 1 0    * birth, death, sampling
   kappa_gamma = 6 2      * gamma prior for kappa
   alpha_gamma = 1 1      * gamma prior for alpha

   rgene_gamma = 1 2.529397927411 1  * gamma prior for overall rates for genes ### 1/替换率
  sigma2_gamma = 1 10 1   * gamma prior for sigma^2     (for clock=2 or 3)

      finetune = 1: .05  0.1  0.12  0.1 .3  * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr

         print = 1
        burnin = 500000
      sampfreq = 5000
       nsample = 20000

cat out.BV > in.BV #用作后面的输入文件

4.第二次运行mcmctree,把上面的in.BV拷贝在第二次运行的目录下
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl

这一次的mcmctree.ctl的内容:(区别在于usedata = 2

         seed = -1
       seqfile = A-c.phy
      treefile = mcmc.input.tree
       outfile = out

         ndata = 1
       seqtype = 0  * 0: nucleotides; 1:codons; 2:AAs
       usedata = 2    * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
         clock = 2    * 1: global clock; 2: independent rates; 3: correlated rates
       RootAge = <1.0  * safe constraint on root age, used if no fossil for root.

         model = 0    * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
         alpha = 0    * alpha for gamma rates at sites
         ncatG = 5    * No. categories in discrete gamma

     cleandata = 0    * remove sites with ambiguity data (1:yes, 0:no)?

       BDparas = 1 1 0    * birth, death, sampling
   kappa_gamma = 6 2      * gamma prior for kappa
   alpha_gamma = 1 1      * gamma prior for alpha

   rgene_gamma = 1 2.292111240743 1  * gamma prior for overall rates for genes  ### 1/替换率
  sigma2_gamma = 1 10 1   * gamma prior for sigma^2     (for clock=2 or 3)

      finetune = 1: .05  0.1  0.12  0.1 .3  * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr

         print = 1
        burnin = 500000
      sampfreq = 5000
       nsample = 20000

*** Note: Make your window wider (100 columns) before running the program.

然后就会输出FigTree.tre,也就是最后的结果了

5.可以重复第四步,结果稳定即可
mcmc.ctl的最后几个参数可以尝试更换

6.我们就拿到了Cafe的输入文件(raw)
内容如下:

#NEXUS
BEGIN TREES;

        UTREE 1 = ((Bamboo: 0.507727, XD: 0.507727) [&95%={0.266, 0.907}]: 0.055557, ((sppCa: 0.010037, sppJu: 0.010037) [&95%={0.004, 0.020}]: 0.010716, (sppGa: 0.012062, sppGo: 0.012062) [&95%={0.005, 0.024}]: 0.008690) [&95%={0.009, 0.041}]: 0.542531) [&95%={0.286, 0.978}];

END;

需要删掉空格和置信区间,空格一定要删干净,这个就是cafe要求的输入文件格式
我命名为FigTree.nwk

#grep  "UTREE 1 =" FigTree.tre | sed -E -e "s/\[[^]]*\]//g" -e "s/[ \t]//g" -e "/^$/d" -e "s/UTREE1=//" > FigTree.nwk

((Bamboo:0.507727,XD:0.507727):0.055557,((sppCa:0.010037,sppJu: 0.010037):0.010716,(sppGa:0.012062,sppGo:0.012062):0.008690):0.542531);

7.cafe的安装可以用conda,这里略,然后准备第二个输入文件,也是OrthoFinder的结果

awk -v  OFS="\t" '{if($1=="Orthogroup"){print"Descript",$1,$2,$3,$4,$5,$6,$7}else{print"(null)",$1,$2,$3,$4,$5,$6,$7}}' Orthogroups.GeneCount.tsv > GeneCounts.tsv

python /data/01/user158/kuangzhuoran/software/CAFE5/docs/tutorial/clade_and_size_filter.py -i GeneCounts.tsv -o gene_family_filter.txt -s

cafe5 -i gene_family_filter.txt -t FigTree.nwk -o out -c 20

8.输出的文件都仔细看一下, 提取某个物种显著扩张收缩的基因

cat Base_family_results.txt | grep "y"|cut -f1 >p0.05.significant
#提取显著的OG

head -n 1 Base_change.tab > tmp1
grep -f p0.05.significant Base_change.tab > tmp2
cat tmp1 tmp2 > Base_p0.05change.tab
#提取对应OG的expand和expansion数

cat Base_p0.05change.tab | cut -f1,4 | grep "+[1-9]" | cut -f1  > sppJu.significant.expand
#cut -f1,4 的这个4,就是对应的节点/物种,我想要研究的物种
#grep "+[1-9]" 就是提取显著扩张的

cat Base_p0.05change.tab | cut -f1,4 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > sppJu.significant.expansion
#cut -f1,4 的这个4,就是对应的节点/物种,我想要研究的物种
#grep "-" 就是提取显著收缩的

#这个Orthogroups.tsv也是OrthoFinder的结果
grep -f judaei.significant.expansion Orthogroups.tsv | cut -f7 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > sppJu.significant.expansion.gene
#cut -f7 就是就是对应的节点/物种
#这样就拿到了显著收缩的基因

9.如果是提取某个节点显著收缩与扩张的基因

awk 'NR!=1 && $13>0 {print $0}' Base_count.tab | cut -f1 > node11.orthogroups
#比如我要提取Node11节点

grep -f node11.orthogroups Orthogroups.txt |sed "s/ /\n/g" | grep -E "carmeli|galili|golani|judaei" | sort | uniq > node11.genes
#这个就是背景库

cat Base_p0.05change.tab | cut -f1,13 | grep "+[1-9]" | cut -f1  > node11.significant.expand
grep -f node11.significant.expand Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.expand.gene

cat Base_p0.05change.tab | cut -f1,13 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > node11.significant.expansion
grep -f node11.significant.expansion Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.expansion.gene
#Node11显著扩张的基因
#cut -f1,13 的这个13,就是Node11节点对应的列

#node11.genes、node11.significant.expand.gene、node11.significant.expansion.gene就是后续做富集要用到的
上一篇下一篇

猜你喜欢

热点阅读