玉米全基因关联分析 ( GWAS Analysis Pipeli
全基因组关联分析是基于数量性状连锁不平衡理论,通过表型与基因型关联分析获得影响表型性状的显著关联标记。
基本流程1. 软件
Tassel / Gaipt / FarmCPU / mrMLM / PLINK / Structure / SPAGeDi
以上软件都是在 GWAS 分析中可能用到的软件。
2. 数据
Genotype ( hmp || vcf format file ) + Penotype ( Traits )
关于表型的处理请见前一篇文章。
3. 分析流程
从原始数据的获得到 Call SNP 的流程请参考之前文章。
本文的分析从 hmp/vcf 格式的基因型文件开始。
3.1 基因型过滤
过滤标准如下:
1. Filter SNP Call Rate less than 90 % (max missing rate is 0.1)
2. Filter mult-allelic sites (include only bi-allelic sites)
3. Filter SNP heterozygosity rate more than 0.2
4. Hardy-Weinberg Equilibrium using an exact test pvalue less than 0.01
5. Minor Allele Frequency (MAF) <0.05
6. Filter Sample Call Rate less than 90 % (max sample call missing rate is 0.1)
7. Missing Rate < 0.05 (For Structure) **
8. Genetic Diversity (GD > 0.45, For Structure) **
具体操作流程如下:
## 1. 使用 Tassel 5 进行转换,可以用 GUI 操作,命令行操作如下:
# Step 1: sort hmp foramt file
/opt/bio/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile my_genotype.hmp.txt -outputFile my_genotype.sort.hmp.txt -fileType Hapmap
# Step 2 : convert hmp to vcf
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -h my_genotype.sort.hmp.txt -export -exportType VCF
## 切记,需要把基因型中 -- 转换为 NN
## 切记,需要把基因型中 -- 转换为 NN
## 切记,需要把基因型中 -- 转换为 NN
## 2. 使用 Plink 软件对进行基因型过滤
# Step3: filter genotype
/opt/bio/plink-1.9/plink --vcf LiZL_Genotype_Sort.vcf --maf 0.05 --geno 0.05 --mind 0.7 --recode 'vcf' --biallelic-only strict --out LiZL_Genotype_Filter
## Step 3: Convert to normal format
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -vcf ZXX_Genotype_Filter.vcf -export ZXX_Genotype_Filter.hmp.txt -exportType HapmapDiploid
3.2 群体结构
SNP 标记过滤完成后,使用程序在各条染色体上随机均匀提取一定数目的标记用来计算 Structure 群体结构,之所以不使用全部的标记是因为,全部标记间连锁作用明显并且计算量庞大。
一般使用 5000 个标记进行群体分层估算。同时群体所需要的标记更严格一些,过滤标准如下:
1. missing rate < 0.05,
2. Genetic diversity ( GD > 0.45 )
使用 plink 软件进行 Structure 数据格式转化:
## Step1 sort hapmap
/opt/bio/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile ForStructure.txt -outputFile For_Q.sort.hmp.txt -fileType Hapmap
## Step2 Convert hapmap
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -h For_Q.sort.hmp.txt -export For_Q -exportType VCF
## Step3 Make for Q matrix
/opt/bio/plink-1.9/plink --vcf For_Q.vcf --recode structure --out Structure
数据准备好之后,使用经典的 Structure 软件进行群体 K 值估算。
Structure 软件运行参数设置(Command-line version):
群体结构主要有两个参数文件, mainparams 和 extraparams 文件,具体设置如下:
每个参数均以 #define 开头:
KEY PARAMETERS FOR THE PROGRAM structure. YOU WILL NEED TO SET THESE IN ORDER TO RUN THE PROGRAM. VARIOUS OPTIONS CAN BE ADJUSTED IN THE FILE extraparams.
基本参数配置 ( mainparams ):
Basic Program Parameters
#define MAXPOPS 2 // (int) 预设的亚群数
#define BURNIN 10000 // (int) length of burnin period
#define NUMREPS 20000 // (int) number of MCMC reps after burnin
Input/Output files
#define INFILE infile // (str) 输入文件
#define OUTFILE outfile //(str) 输出的结果文件
Data file format
#define NUMINDS 100 // (int) 材料数目
#define NUMLOCI 100 // (int) 标记数目
#define PLOIDY 2 // (int) 几倍体
#define MISSING -9 // (int) 基因型缺失值
#define ONEROWPERIND 0 // (B) 数据格式(单行/双行)
#define LABEL 1 // (B) 是否包含材料名称
#define POPDATA 1 // (B) Input file contains a population identifier
#define POPFLAG 0 // (B) Input file contains a flag which says whether to use popinfo when USEPOPINFO==1
#define LOCDATA 0 // (B) Input file contains a location identifier
#define PHENOTYPE 0 // (B) Input file contains phenotype information
#define EXTRACOLS 0 // (int) Number of additional columns of data before the genotype data start.
#define MARKERNAMES 1 // (B) 是否包含标记名称(首行)
#define RECESSIVEALLELES 0 // (B) data file contains dominant markers (eg AFLPs)
// and a row to indicate which alleles are recessive
#define MAPDISTANCES 1 // (B) data file contains row of map distances
// between loci (plink 软件出来格式有该行 )
Advanced data file options
#define PHASED 0 // (B) Data are in correct phase (relevant for linkage model only)
#define PHASEINFO 0 // (B) the data for each individual contains a line indicating phase (linkage model)
#define MARKOVPHASE 0 // (B) the phase info follows a Markov model.
#define NOTAMBIGUOUS -999 // (int) for use in some analyses of polyploid data
控制参数配置 ( extraparams ):
EXTRA PARAMS FOR THE PROGRAM structure. THESE PARAMETERS CONTROL HOW THE PROGRAM RUNS. ATTRIBUTES OF THE DATAFILE AS WELL AS K AND RUNLENGTH ARE SPECIFIED IN mainparams.
PROGRAM OPTIONS
#define NOADMIX 0 // (B) Use no admixture model (0=admixture model, 1=no-admix)
#define LINKAGE 0 // (B) Use the linkage model model
#define USEPOPINFO 0 // (B) Use prior population information to pre-assign individuals to clusters
#define LOCPRIOR 0 //(B) Use location information to improve weak data
#define FREQSCORR 1 // (B) allele frequencies are correlated among pops
#define ONEFST 0 // (B) assume same value of Fst for all subpopulations.
#define INFERALPHA 1 // (B) Infer ALPHA (the admixture parameter)
#define POPALPHAS 0 // (B) Individual alpha for each population
#define ALPHA 1.0 // (d) Dirichlet parameter for degree of admixture (this is the initial value if INFERALPHA==1).
#define INFERLAMBDA 0 // (B) Infer LAMBDA (the allele frequencies parameter)
#define POPSPECIFICLAMBDA 0 //(B) infer a separate lambda for each pop (only if INFERLAMBDA=1).
#define LAMBDA 1.0 // (d) Dirichlet parameter for allele frequencies
PRIORS
#define FPRIORMEAN 0.01 // (d) Prior mean and SD of Fst for pops.
#define FPRIORSD 0.05 // (d) The prior is a Gamma distribution with these parameters
#define UNIFPRIORALPHA 1 // (B) use a uniform prior for alpha; otherwise gamma prior
#define ALPHAMAX 10.0 // (d) max value of alpha if uniform prior
#define ALPHAPRIORA 1.0 // (only if UNIFPRIORALPHA==0): alpha has a gamma prior with mean A*B, and
#define ALPHAPRIORB 2.0 // variance A*B^2.
#define LOG10RMIN -4.0 //(d) Log10 of minimum allowed value of r under linkage model
#define LOG10RMAX 1.0 //(d) Log10 of maximum allowed value of r
#define LOG10RPROPSD 0.1 //(d) standard deviation of log r in update
#define LOG10RSTART -2.0 //(d) initial value of log10 r
USING PRIOR POPULATION INFO (USEPOPINFO)
#define GENSBACK 2 //(int) For use when inferring whether an individual is an immigrant, or has an immigrant ancestor in the past GENSBACK generations. eg, if GENSBACK==2, it tests for immigrant ancestry back to grandparents.
#define MIGRPRIOR 0.01 //(d) prior prob that an individual is a migrant (used only when USEPOPINFO==1). This should be small, eg 0.01 or 0.1.
#define PFROMPOPFLAGONLY 0 // (B) only use individuals with POPFLAG=1 to update P.
LOCPRIOR MODEL FOR USING LOCATION INFORMATION
#define LOCISPOP 1 //(B) use POPDATA for location information
#define LOCPRIORINIT 1.0 //(d) initial value for r, the location prior
#define MAXLOCPRIOR 20.0 //(d) max allowed value for r
OUTPUT OPTIONS
#define PRINTNET 1 // (B) Print the "net nucleotide distance" to screen during the run
#define PRINTLAMBDA 1 // (B) Print current value(s) of lambda to screen
#define PRINTQSUM 1 // (B) Print summary of current population membership to screen
#define SITEBYSITE 0 // (B) whether or not to print site by site results. (Linkage model only) This is a large file!
#define PRINTQHAT 0 // (B) Q-hat printed to a separate file. Turn this on before using STRAT.
#define UPDATEFREQ 100 // (int) frequency of printing update on the screen. Set automatically if this is 0.
#define PRINTLIKES 0 // (B) print current likelihood to screen every rep
#define INTERMEDSAVE 0 // (int) number of saves to file during run
#define ECHODATA 1 // (B) Print some of data file to screen to check that the data entry is correct.
(NEXT 3 ARE FOR COLLECTING DISTRIBUTION OF Q:)
#define ANCESTDIST 0 // (B) collect data about the distribution of ancestry coefficients (Q) for each individual
#define NUMBOXES 1000 // (int) the distribution of Q values is stored as a histogram with this number of boxes.
#define ANCESTPINT 0.90 // (d) the size of the displayed probability interval on Q (values between 0.0--1.0)
MISCELLANEOUS
#define COMPUTEPROB 1 // (B) Estimate the probability of the Data under the model. This is used when choosing the best number of subpopulations.
#define ADMBURNIN 500 // (int) [only relevant for linkage model]: Initial period of burnin with admixture model (see Readme)
#define ALPHAPROPSD 0.025 // (d) SD of proposal for updating alpha
#define STARTATPOPINFO 0 // Use given populations as the initial condition for population origins. (Need POPDATA==1). It is assumed that the PopData in the input file are between 1 and k where k<=MAXPOPS.
#define RANDOMIZE 1 // (B) use new random seed for each run
#define SEED 2245 // (int) seed value for random number generator (must set RANDOMIZE=0)
#define METROFREQ 10 // (int) Frequency of using Metropolis step to update Q under admixture model (ie use the metr. move every i steps). If this is set to 0, it is never used.(Proposal for each q^(i) sampled from prior. The goal is to improve mixing for small alpha.)
#define REPORTHITRATE 0 // (B) report hit rate if using METROFREQ
参数配置中:
"(int)" means that this takes an integer value.
"(B)" means that this variable is Boolean ( 1 for True, and 0 for False)
"(str)" means that this is a string
Structure 命令行使用:
structure -m mainparams -e extraparams -K MAXPOPS -L NUMLOCI -N NUMINDS -i input file -o output file -D SEED
K 值计算
程序运行结束后可以将输出结果打包上传到下面这个网站,进行K值估算。
点击进入: StructureHarvester
3.3 亲缘关系
GWAS 混合线性模型中,亲缘关系作为协变量对结果有一定的影响,使用 Tassel / GAPIT / SPAGeDi 等软件可以轻松计算亲缘关系矩阵。
这里我们演示 SPAGeDi 1.4 软件运行过程。
软件下载地址:SPAGeDi-1.4b
为了计算快速,我们随机抽取在染色体上均匀分布的5000个标记进行 Kinship 计算,保证了标记的随机及代表性操作方法如下:
perl Step3.Random_fetch.pl Genotype_Final.hmp.txt > For_K.hmp.txt
perl Step5.Format_kinship.pl For_K.hmp.txt > K.in
## 得到的输入文件需要将 NN 转换为 0
SPAGeDi 软件得到 K 矩阵的后续处理:
1. 对角线加1
2. 小于0的系数用0替换
3. 所有系数加倍
在获得群体结构 Q 矩阵和亲缘关系 K 矩阵后,我们可以使用 Tassel / Gapit / FarmCPU / mrMLM 等软件进行分析,具体操作见下一篇。
3.4 关联分析
在GWAS研究中,当涉及质量性状时一般采用 Logistic 回归模型进行分析,对于数量性状的研究,主要采用线性回归模型进行关联分析。
在 Logistic 回归模型中,基因型是应变量,群体结构和表型是自变量;
而在线性回归模型中,表型是应变量,其他品种、性别、群体结构和基因型数据则是自变量。
线性模型包括两种:一般线性模型(general linear model,GLM)和混合线性模型(mixed linear model,MLM)。复杂数量性状通常受到多种因素的共同影响,而混合模型中可以加入固定效应和随机效应,因此,以研究数量性状的全基因组关联分析方法常采用混合线性模型进行分析。
在GWAS中,群体分层(population stratification)和多重假设检验(multiple testing adjusting)是引起研究结果分析误差的重要原因。处理这两种误差的一种可能的策略是采用基于家系的关联研究,该方法可以避免群体分成对关联分析结果的影响。所谓群体分层,是指群体内存在等位基因频率不同的亚群体。由于自然选择、遗传漂变、群体分层等诸多因素都会影响到群体中的连锁不平衡,因此,在进行关联分析时, 一些非原因等位基因也可以同真实 QTL 形成连锁不平衡表现为与研究性状关联,从而导致伪关联或假阳性的出现。
## Step3.Random_fetch.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
open IN, $ARGV[0] || die $!;
my $head = <IN>;
print $head;
my $total;
my %genotype;
while(<IN>){
chomp;
my @a = split /\t/, $_;
$genotype{$a[2]}{$a[0]} = $_;
if ($a[2] ne "0"){
$total++;
}
}
close IN;
for my $chr (1..10){
my @snp = keys %{$genotype{$chr}};
my $max = scalar (@snp);
my $num = ($max / $total) * 5000;
$num = sprintf ("%d",$num);
my %random = &random($max, $num+1);
foreach my $k (keys %random){
my $id = $snp[$k];
my $out = $genotype{$chr}{$id};
print "$out\n";
}
}
sub random{
my ($max, $num) = @_;
my %hash;
while ((keys %hash) < $num) {
$hash{int(rand($max))} = 1;
}
return %hash;
}
__END__
## Step5.Format_kinship.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
open IN, $ARGV[0] || die $!;
chomp(my $head = <IN>);
my @samples = split /\t/, $head;
my (%hash, @snps);
while(<IN>){
chomp;
my $n = 1;
my @a = split /\t/, $_;
for (my $i=11;$i<@a;$i++){
$a[$i] =~ tr/ATCG/1234/;
my @tmp = split (//, $a[$i]);
my $allen = join ",",@tmp;
$hash{$samples[$i]}{$a[0]} = $allen;
}
push @snps, $a[0];
}
close IN;
print "307\t0\t0\t5005\t1\t2\n0\n";
print join ("\t", "Name", sort @snps)."\n";
foreach my $sample (sort keys %hash){
my @out;
foreach my $snp (sort keys %{$hash{$sample}}){
push @out,$hash{$sample}{$snp};
}
print join ("\t",$sample,@out)."\n";
}
__END__
#!/usr/bin/perl -w
use strict;
open IN, $ARGV[0] || die $!;
my $head = <IN>;
print $head;
while(<IN>){
chomp;
my @a = split /\t/, $_;
for (my $i=1; $i<@a; $i++){
if ($a[$i] le "0.0"){
$a[$i] = "0.0";
}else{
$a[$i] = 2 * $a[$i];
}
}
print join("\t", @a)."\n";
}
close IN;
END
缩略词:
BSSS, Iowa Stiff Stalk Synthetic;
NSS, Non-Stiff Stalk;
PA, Group A germplasm derived from modern U.S. hybrids;
PB, Group B germplasm derived from modern U.S. hybrids;
SPT, Sipingtou;
SS, Stiff Stalk;
TS, Tropical or semitropical;
SSR, Simple sequence repeat (markers);
PCA, Principal component analysis;
PIC, Polymorphism information content;
RFLP, Restriction fragment length polymorphism;
SNPs, Single nucleotide polymorphisms;
CIMMYT, International Maize and Wheat Improvement Center;
DNA,Deoxyribonucleicacid;
GAPIT, Genomic Association and Prediction Integrated Tool;
GBS, Genotyping by sequencing;
GD, Genetic diversity;
LD, Linkage disequilibrium;
LnP(D), Log likelihood of data;
LRC, Luda Red Cob;
Mb, Megabase;
NJ, Neighborjoining;