基因型数据与1000G merge后做PCA看人群分布
写在前面,对于生信新手来说,不熟悉代码的基础,总是空中楼阁一般,想搜一条代码、做一些适合自己数据的运算更是难上加难,踩过的坑总要填,为了新手们不要像我一样浪费时间在这苦恼的代码上,我尽量用通俗的话让大家能检索到。
我目前要做的是想看下我的基因型数据人种(population)分布情况,是欧洲人?亚洲人?还是其他的?因为我要把非欧洲人去掉以进行后续研究。
因为1kg有population信息,所以我首先要把自己的基因型数据和1kg的merge到一起,再做PCA(MDS也可以),然后通过R包绘图,看我的基因型文件的样本落在了1kg人种的那一堆儿里。
总体思路是这样,我会引用一些其他推文的代码,因为毕竟不是自己写得,都是查啊查。感谢其他作者哦,我会写明出处。
下载并整理1000G基因型文件
这是地址http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
自己下载吧。下载后的格式是vcf,要转成plink格式。
for num in {1..22}
do plink --vcf ALL.chr${num}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --recode --out chr${num}
done
for num in {1..22}
do plink --file chr${num} --make-bed --out chr${num}
done
for num in {1..22}
do plink --bfile chr${num} --exclude ./1000g_phase3-merge.missnp --make-bed --out ./chr${num}
done
#新建chr_merge_list文件,chr1-chr22,每行一个;
plink --bfile chr1 --merge-list ./chr_merge_list --make-bed --keep-allele-order --snps-only --out 1000g_phase3
1KG与自己的数据merge
这一步有点多,新手们千万别烦,照着代码贴上去做一遍就明白了。
#提取自己数据集的snp名称;
awk '{print$2}' hcp_qc.bim > hcp_SNPs.txt
#从1KG中提取与hcp数据集相同的snp;
plink --bfile 1000g_phase3 --extract hcp_SNPs.txt --make-bed --out 1kG_sameSNP_hcp
#同理,反配一遍;
awk '{print$2}' 1kG_sameSNP_hcp.bim > hcp_1KG_SNPs.txt
plink --bfile hcp_qc --extract hcp_1KG_SNPs.txt --make-bed --out hcp_sameSNP_1KG
#same build
awk '{print$2,$4}' hcp_sameSNP_1KG.bim > buildhcp.txt
plink --bfile 1kG_sameSNP_hcp --update-map buildhcp.txt --make-bed --out 1kG_sameSNP_hcp_same_build
#有相同allele1
awk '{print$2,$5}' 1kG_sameSNP_hcp_same_build.bim > 1kg_ref-list.txt
plink --bfile hcp_sameSNP_1KG --reference-allele 1kg_ref-list.txt --make-bed --out hcp_sameSNP_1KG-adj
#解决flip问题
awk '{print$2,$5,$6}' 1kG_sameSNP_hcp_same_build.bim > 1kG_sameSNP_hcp_same_build_tmp
awk '{print$2,$5,$6}' hcp_sameSNP_1KG-adj.bim > hcp_sameSNP_1KG-adj_tmp
sort 1kG_sameSNP_hcp_same_build_tmp hcp_sameSNP_1KG-adj_tmp |uniq -u > all_differences.txt
awk '{print$1}' all_differences.txt | sort -u > flip_list.txt
plink --bfile hcp_sameSNP_1KG-adj --flip flip_list.txt --reference-allele 1kg_ref-list.txt --make-bed --out corrected_hcp
#查看翻转后哪个还有问题
awk '{print$2,$5,$6}' corrected_hcp.bim > corrected_hcp_tmp
sort 1kG_sameSNP_hcp_same_build_tmp corrected_hcp_tmp |uniq -u > uncorresponding_SNPs.txt
#去掉还是配不上的snp
awk '{print$1}' uncorresponding_SNPs.txt | sort -u > SNPs_for_exlusion.txt
plink --bfile corrected_hcp --exclude SNPs_for_exlusion.txt --make-bed --out HCP
plink --bfile 1kG_sameSNP_hcp_same_build --exclude SNPs_for_exlusion.txt --make-bed --out 1kG
#merge hcp and 1KG
plink --bfile HCP --bmerge 1kG.bed 1kG.bim 1kG.fam --allow-no-sex --make-bed --out hcp_merge_1KG
下载1KG panel文件
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
这里有人种信息,我们画图时就要以这个信息分组。
# Convert population codes into superpopulation codes (i.e., AFR,AMR,ASN, and EUR).
awk '{print$1,$1,$2}' integrated_call_samples_v3.20130502.ALL.panel > race_1kG.txt
sed 's/JPT/ASN/g' race_1kG.txt>race_1kG2.txt
sed 's/ASW/AFR/g' race_1kG2.txt>race_1kG3.txt
sed 's/CEU/EUR/g' race_1kG3.txt>race_1kG4.txt
sed 's/CHB/ASN/g' race_1kG4.txt>race_1kG5.txt
sed 's/CHD/ASN/g' race_1kG5.txt>race_1kG6.txt
sed 's/YRI/AFR/g' race_1kG6.txt>race_1kG7.txt
sed 's/LWK/AFR/g' race_1kG7.txt>race_1kG8.txt
sed 's/TSI/EUR/g' race_1kG8.txt>race_1kG9.txt
sed 's/MXL/AMR/g' race_1kG9.txt>race_1kG10.txt
sed 's/GBR/EUR/g' race_1kG10.txt>race_1kG11.txt
sed 's/FIN/EUR/g' race_1kG11.txt>race_1kG12.txt
sed 's/CHS/ASN/g' race_1kG12.txt>race_1kG13.txt
sed 's/PUR/AMR/g' race_1kG13.txt>race_1kG14.txt
# Create a racefile of your own data.
awk '{print$1,$2,"OWN"}' hcp_qc.fam>racefile_own.txt
cat race_1kG14.txt racefile_own.txt | sed -e '1i\
FID IID race' > racefile.txt
在R中画PCA图,以下都是在R中完成
setwd('D:/my_subject/PCA')
data<- read.table(file="hcp_merge_1KG_pca.eigenvec",header=FALSE)
names(data)=c("FID","IID","PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13","PC14","PC15","PC16","PC17","PC18","PC19","PC20")
race<- read.table(file="racefile.txt",header=TRUE)
datafile<- merge(data,race,by=c("IID","FID"))
head(datafile)
pdf("PCA.pdf",width=10,height=10)
for (i in 1:nrow(datafile))
{
if (datafile[i,23]=="OWN") {plot(datafile[i,4],datafile[i,3],type="p",xlim=c(-0.04,0.03),ylim=c(-0.04,0.03),xlab="PC2",ylab="PC1",pch=1,cex=0.5,col="yellow")}
par(new=T)
if (datafile[i,23]=="EUR") {plot(datafile[i,4],datafile[i,3],type="p",xlim=c(-0.04,0.03),ylim=c(-0.04,0.03),xlab="PC2",ylab="PC1",pch=1,cex=0.5,col="green")}
par(new=T)
if (datafile[i,23]=="ASN") {plot(datafile[i,4],datafile[i,3],type="p",xlim=c(-0.04,0.03),ylim=c(-0.04,0.03),xlab="PC2",ylab="PC1",pch=1,cex=0.5,col="black")}
par(new=T)
if (datafile[i,23]=="AMR") {plot(datafile[i,4],datafile[i,3],type="p",xlim=c(-0.04,0.03),ylim=c(-0.04,0.03),xlab="PC2",ylab="PC1",pch=1,cex=0.5,col="blue")}
par(new=T)
if (datafile[i,23]=="AFR") {plot(datafile[i,4],datafile[i,3],type="p",xlim=c(-0.04,0.03),ylim=c(-0.04,0.03),xlab="PC2",ylab="PC1",pch=1,cex=0.5,col="red")}
par(new=T)
}
legend("topright", pch=c(1,1,1,1,1),c("EUR","ASN","AMR","AFR","OWN"),col=c("green","black","blue","red","yellow"),bty="o",cex=1)
dev.off()
注意的几点:
datafile[i,23]=="EUR"
这句的23要改一下,你的人种信息放到哪列的就要改成相应的列数;
xlim=c(-0.04,0.03)
这句要根据你自己的PCA最大值最小值改一下范围;
datafile[i,4],datafile[i,3]
这个要改一下,我的PC2是第4列,所以前面写4,看你画哪两列的图;
xlab="PC2",ylab="PC1"
这句标签也改一下,对应你前面选好的列;
大功告成!图就做好了,我的数据人群分布太散了,就不贴上来了,以免误导大家。
参考推文:https://www.jianshu.com/p/286050959dbd
感谢作者哦,我只是要做他内容中的一小部分,所以当时找这块内容的时候太不容易了。还是生信网友推给我的。感谢各位在生信路上一起的小伙伴们。