小麦GWAS绘制LDdacy图

2020-07-25 本文已影响0人夏大希

需要的软件vcftools
tassel
plink
popLDdecay
原始数据类型：hapmap格式数值型
hapmap格式字符型
处理路线：
hapmap字符型变成vcf文件，popLDdecay可以直接识别VCF文件

1.文件格式的区别

1.1 .hmp.txt

555396932427471316.jpg

1.1.1需要注意的问题

1）在rs和assembly后面添加“#”，自己生成hapmap格式的时候，注意检查#问题

2）前十一列是必须的

3）小麦中需要注意的问题是chrom这个地方是数值型不能写成1A 1B 1D 等类型的字符型

1.2 vcf文件

##fileformat=VCFv4.0 
##fileDate=20160306 
##source="Stacks v1.37"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> 
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency"> 
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> 
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> 
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth"> 
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood"> 
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    sample1    sample2
1    14418789    10006974    T    C    .    PASS    .    GT    1/0    1/1

更详细的内容参考
https://www.jianshu.com/p/957efb50108f（VCF格式,简书刘小泽）
https://www.internationalgenome.org/wiki/Analysis/vcf4.0/
https://www.omicsclass.com/article/6（看懂VCF文件）
https://blog.csdn.net/genome_denovo/article/details/78697679
https://www.jianshu.com/p/badd24cbc538（vcftools）
https://blog.csdn.net/White_Cat_wisdom/article/details/103523068（VCFtools的使用）

1.3 plink文件

plink文件包含两个文件 .map文件 .ped文件

### .ped文件类型前面是6列，后面跟着基因型
The PED file is a white-space (space or tab) delimited file: the first six columns are mandatory: 
     Family ID
     Individual ID
     Paternal ID
     Maternal ID
     Sex (1=male; 2=female; other=unknown)
     Phenotype
The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a person. A PED file must have 1 and only 1 phenotype in the sixth column. The phenotype can be either a quantitative trait or an affection status column: PLINK will automatically detect which type (i.e. based on whether a value other than 0, 1, 2 or the missing genotype code is observed).

####.map 文件类型
By default, each line of the MAP file describes a single marker and must contain exactly 4 columns: 
     chromosome (1-22, X, Y or 0 if unplaced)
     rs# or snp identifier
     Genetic distance (morgans)
     Base-pair position (bp units)

http://zzz.bwh.harvard.edu/plink/data.shtml（值得再去看看）

2. 必要软件安装

2.1 安装Linux tassel 或者window版本的tassel

安装tassel之前要安装java

2.1.1 java安装

##删除自带的openjdk

##首先看自己的java的环境
java -version
命令看下是否有java环境.有的话会出现jdk的版本信息.
一般情况下,我们都要将Linux自带的openjdk卸载掉,然后装SUN的jdk
如果正常情况下完全删除java环境后，运行java -version 出现的是没有这个文件的信息，此处由于前期安装了python、R等软件导致我的java环境特别多，删除掉自带的OpenJDK，java -version还是表示没有删除干净，索性把所有的带有java的都给删掉了，这个不知道后期对我数据处理利用的时候有没有影响，不过对于今天调用R来说没有影响。删除用的语句是
命令:rpm -qa|grep java  
查询centos是否有自带的OpenJDK
删除：rpm -e

[root@175 usr] #rpm -qa | grep java
java-1.8.0-openjdk-headless-1.8.0.212.b04-0.el7_6.x86_64
R-java-devel-3.6.0-1.el7.x86_64
javapackages-tools-3.4.1-11.el7.noarch
java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64
tzdata-java-2018e-3.el7.noarch
java-1.8.0-openjdk-devel-1.8.0.212.b04-0.el7_6.x86_64
R-java-3.6.0-1.el7.x86_64
python-javapackages-3.4.1-11.el7.noarch
[root@175 usr]# rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.212.b04-0.el7_6.x86_64
[root@175 usr]# rpm -e --nodeps java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64
[root@175 usr]# rpm -e --nodeps java-1.8.0-openjdk-devel-1.8.0.212.b04-0.el7_6.x86_64

#下载
wget http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-x64.rpm
#安装
rpm -ivhc jdk-8u112-linux-x64.rpm
#测试安装
java -vsrsion
#查看java版本
##更改环境变量
vim   /etc/profile
把下面这些添加到最后保存退出
export JAVA_HOME=/usr/local/jdk-11.0.8
export PATH=.:$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
#编辑完后执行脚本
source /etc/profile

参考：
https://java.com/zh_CN/download/help/linux_uninstall.xml
https://www.cnblogs.com/ljxt/p/11612636.html
https://www.jianshu.com/p/4580fff0ac1c(linux java环境配置）

2.1.2 Tassel 安装

#github clone
git clone https://bitbucket.org/tasseladmin/tassel-5-standalone.git
cd tassel-5-standalone
更改环境变量后没什么反应，就直接调用的时候前面加上了环境位置，这个地方还需要修改一下

参考
https://blog.csdn.net/Pele_Lee/article/details/103327433（Linux系统下TASSEL5.0的安装及使用）

2.1.3 Plink安装

#下载
wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190617.zip
#解压
unzip plink_linux_x86_64_20190617.zip
#环境配置 和tassel一样

参考网址：
https://www.jianshu.com/p/07c23dba05ea（生信小工具：Plink之下载安装与其基本格式（1））

2.1.4 vcftools安装

去官网下载了vcftools 压缩包vcftools-vcftools-v0.1.16-18-g581c231.tar.zip
通过FileZilla上传到Linux 新建的vcftools文件夹下
##解压
tar -xvf  vcftools-vcftools-v0.1.16-18-g581c231.tar.zip
##添加环境变量
vim /etc/profile
source /etc/profile
##
cd vcftools/ 
./configure
make
make install

参考网址：
https://www.cnblogs.com/triple-y/p/9662553.html
https://www.cnblogs.com/triple-y/p/10331449.html

2.1.5 popLDdecay安装

git clone https://github.com/BGI-shenzhen/PopLDdecay.git 
        cd PopLDdecay; chmod 755 configure; ./configure;
        make;
        mv PopLDdecay  bin/;    #     [rm *.o]

以上安装的软件环境配置都是相同的方法

3. hapmap格式转换成vcf格式

第一种方法是load 到window 版本的TASSEL中，然后可以输出VCF格式、plink格式
第二种是利用linux系统的tassel
使用 tassel 将测试文件（test.hmp.txt）转为 vcf4.0 格式文件

#先将test.hmp.txt排序生成 tassel.test.sort.hmp.txt 文件
/u2/software/tassel/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile tassel.test.hmp.txt -outputFile test.sort.hmp.txt -fileType HapMap
将排好序的 .hmp.txt 文件生成 vcf4.0 格式的文件
#将 tassel.test.sort.hmp.txt 文件转为vcf4.0 格式的文件 tassel.test.vcf：
run_pipeline.pl -fork1 -h tassel.test.sort.hmp.txt -export -exportType VCF
#或者使用下面的代码
run_pipeline.pl -Xmx5g -fork1 -h test.hmp.txt -export -exportType VCF -runfork1
得到test.vcf文件

参考网址：
https://zhuanlan.zhihu.com/p/38512217

4.利用PopLDdecay 绘制LDdecay图

PopLDdecay 的运行主要分为两步：
第一步，计算 LD decay，如果输入文件是 VCF 格式，直接运行即可；如果输入格式为 PLINK 的格式，先转换一下格式；另外可以用 -SubPop 参数设置 subgroup。
 # 1)  For gatk VCF file deal , run PopLDdecay  direct
            ./bin/PopLDdecay    -InVCF  SNP.vcf.gz  -OutStat LDdecay   
      # 2)  For plink [.ped .map], chang plink 2 genotype first  2) run PopLDdecay  
            perl bin/mis/plink2genotype.pl    -inPED in.ped -inMAP in.map  -outGenotype out.genotype ;      ./bin/PopLDdecay        -InGenotype out.genotype -OutStat LDdecay 
      # 3)  To Calculate the subgroup GroupA LDdecay in VCF Files   # put GroupA sample name into GroupA_sample.list
            ./bin/PopLDdecay   -InVCF    -OutStat    -SubPop    GroupA_sample.list

第二步，绘制 LD decay 图，可以使用 PopLDdecay 提供的程序绘制，如下所示
  #    2.1  For one Population
        perl  bin/Plot_OnePop.pl  -inFile   LDdecay.stat.gz  -output  Fig
        #    2.2  For one Population  muti chr          # List Format [chrResultPathWay]
        perl  bin/Plot_OnePop.pl  -inList   Chr.ReslutPath.List  -output Fig
        #    2.3  For muti Population                   #  List Format :[Pop.ResultPath  PopID ]
        perl  bin/Plot_MutiPop.pl  -inList  Pop.ReslutPath.list  -output Fig

参考网址：
https://github.com/BGI-shenzhen/PopLDdecay
http://wap.sciencenet.cn/blog-656335-1168505.html

5. 利用plink格式的数据绘制LDdacy图

同hapmap格式转换成vcf一样
使用 vcftools 将 tassel.test.vcf 转为 plink（.ped，.map）格式
#将 tassel.test.vcf 转为 tassel.test.vcf2plink
#1)vcftools --vcf tassel.test.vcf --plink --out  tassel.test.vcf2plink
# 2)  For plink [.ped .map], chang plink 2 genotype first  2) run PopLDdecay  
            perl bin/mis/plink2genotype.pl    -inPED in.ped -inMAP in.map  -outGenotype out.genotype ;      ./bin/PopLDdecay        -InGenotype out.genotype -OutStat LDdecay 

#3)perl  bin/Plot_OnePop.pl  -inFile   LDdecay.stat.gz  -output  Fig

万万没想到，画出来的图死丑，也不知道是哪出现了问题，难道我没有用窗口美化，这个后续在进行研究，昨天熬夜熬到凌晨三点搞这个，搞出来的图丑的要死，又重新搞了一种用R的方法，多找，用英文找，总有一款适合自己。

书接上文，早晨爬起继续搞，给自己点曙光

thDG9M9U13.jpg

6.利用vcf文件，Linux R 进行绘制LDdacy

在此先膜拜一下
 https://www.biostars.org/p/300381/
Question: LD-decay in a r2 vs distance(cm) plot
I would like to create a LD vs distance(cm) plot in R using an output from PLINK. I have been trying with different --ld-window-kb numbers but I can not plot them in R. Furthermore, I have a question about average r2. In some papers ld_decay is plotted average r2 vs distance and some papers used r2 vs distance. Can everybody explain more? 
thanks for your attention 
#对vcf文件进行操作
vcftools --gzvcf output.bcf.flts.vcf.gz --plink --out output_in_plink1
plink --noweb --file output_in_plink1 --make-bed --out plinkedout
plink --bfile plinkedout --extract plinkedout_thin.bim --recode --make-bed --out plinkedout_thin
plink --bfile "plinkedout_thin" --r2 --ld-window-r2 0 --ld-window 999999 --ld-window-kb 8000 --out "snp-thin"
cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > MYLD.ld.summary
#调用R
library(dplyr)
library(stringr)
library(ggplot2)

dfr <- read.delim("snp-thin.ld.summary",sep="",header=F,check.names=F,stringsAsFactors=F)
colnames(dfr) <- c("dist","rsq")
dfr$distc <- cut(dfr$dist,breaks=seq(from=min(dfr$dist)-1,to=max(dfr$dist)+1,by=100000))
dfr1 <- dfr %>% group_by(distc) %>% summarise(mean=mean(rsq),median=median(rsq))
dfr1 <- dfr1 %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
                        end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
                        mid=start+((end-start)/2))
ggplot()+
  geom_point(data=dfr1,aes(x=start,y=mean),size=0.4,colour="grey20")+
  geom_line(data=dfr1,aes(x=start,y=mean),size=0.3,alpha=0.5,colour="grey40")+
  labs(x="Distance (Megabases)",y=expression(LD~(r^{2})))+
  scale_x_continuous(breaks=c(0,2*10^6,4*10^6,6*10^6,8*10^6),labels=c("0","2","4","6","8"))+
  theme_bw()

出来的图达到了预期，代码在细研究，这个画图先粗糙的整理一下，或者随着我认知的增加，会有更多的简便方法。
老子困了，要补觉去了。

thUFAVBI8Y.jpg

参考网址：
http://wap.sciencenet.cn/blog-656335-1168505.html（PopLDdecay: 基于 VCF 文件的快速、高效计算连锁不平衡的工具）
https://github.com/BGI-shenzhen/PopLDdecay
https://www.jianshu.com/p/a36bd4145ef7（LD衰减图的理解与应用）

参考网址：

https://github.com/BGI-shenzhen/PopLDdecay
https://www.biostars.org/p/300381/(plink 绘制LDdecay）