Building a Protein-Protein Interaction (PPI) Network in R
2019-04-17
PriscillaBai
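This can be done with the web version of STRING, but it is limited to 2000 genes. So today let's work out how to do it in R.
1. Download the protein interaction network from the STRING database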
![](https://img.haomeiwen.com/i9640232/cc3bfd4b8e3fc205.png)
2. Download the UniProt ID mapping file
![](https://img.haomeiwen.com/i9640232/c405c68966aa854b.png)
![](https://img.haomeiwen.com/i9640232/68fff01b2f6f8da8.png)
Open a terminal
wget -c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/
What comes back is the directory listing as a web page, which looks garbled in the terminal
![](https://img.haomeiwen.com/i9640232/9bdacad23088b259.png)
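Continue from the previous step and copy the link of the subdirectory
wget -c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/
Repeat the steps above: copy the link of the file you need and fetch it with wget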
![](https://img.haomeiwen.com/i9640232/9875f550b23db71e.png)
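The same download can also be scripted from R. The sketch below only assembles the URL. The file name pattern `<ORGANISM>_<TAXID>_idmapping_selected.tab.gz` is an assumption inferred from the file that is read later in this post, and the helper name `idmapping_url` is made up for illustration.

```r
# Base directory copied from the wget commands above.
base_url <- "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/"

# Hypothetical helper: build the URL of one organism's "selected" mapping file.
# The file name pattern is an assumption; check the directory listing first.
idmapping_url <- function(organism = "HUMAN", taxid = 9606) {
  paste0(base_url, organism, "_", taxid, "_idmapping_selected.tab.gz")
}

full_url <- idmapping_url()
print(full_url)

# Uncomment to actually download (large file, needs network access):
# download.file(full_url, destfile = basename(full_url), mode = "wb")
```

The downloaded file is gzip-compressed, so it has to be decompressed before it matches the `.tab` file name used in the reading step.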
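3. Everything is in place. We now have three files; here is how they relate to each other
1) input gene: the genes we care about, in SYMBOL format; they can be converted to UniProt IDs with Y叔's package (clusterProfiler)
2) id file: the mapping between UniProt IDs and ENSP (Ensembl protein) IDs
3) PPI interaction file: in ENSP format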
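To make the relationship between the three files concrete, here is a minimal sketch on toy data. All gene symbols, accessions and ENSP IDs below are made up, and the column names of the toy mapping tables are assumptions chosen for readability.

```r
# Toy stand-ins for the three inputs (all IDs are made up).
genes <- data.frame(SYMBOL = c("GENE1", "GENE2", "GENE3"))
sym2uni <- data.frame(SYMBOL  = c("GENE1", "GENE2", "GENE3"),
                      UNIPROT = c("P00001", "P00002", "P00003"))
uni2ensp <- data.frame(UNIPROT = c("P00001", "P00002", "P00003"),
                       ENSP    = c("ENSP01", "ENSP02", "ENSP03"))
ppi_toy <- data.frame(item_id_a = c("ENSP01", "ENSP01", "ENSP09"),
                      item_id_b = c("ENSP02", "ENSP08", "ENSP03"))

# Chain the mappings: SYMBOL -> UNIPROT -> ENSP.
ids <- merge(merge(genes, sym2uni, by = "SYMBOL"), uni2ensp, by = "UNIPROT")

# Keep only interactions where both partners belong to the input genes.
keep <- ppi_toy$item_id_a %in% ids$ENSP & ppi_toy$item_id_b %in% ids$ENSP
edges <- ppi_toy[keep, ]
print(edges)
```

Only the ENSP01/ENSP02 row survives, because the other two rows each involve a protein outside the input set.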
Now for the fun part: data cleaning.
- Preparation:
library(dynamicTreeCut)
library(openxlsx)
library(stringr)
library(Matrix)
library(WGCNA)
Sys.setenv(LANGUAGE = "en") # show error messages in English
options(stringsAsFactors = FALSE) # do not convert character columns to factors
setwd("/Users/baiyunfan/desktop")
- Read the three files
idmapping<-read.table("HUMAN_9606_idmapping_selected.tab",header = F,as.is=T,sep="\t")
ppi <- read.table("9606.protein.actions.v11.0.txt",header=T,sep = "\t")
gene<-read.table("turquoise.txt",sep=",")
![](https://img.haomeiwen.com/i9640232/d519d801289c5fbc.png)
![](https://img.haomeiwen.com/i9640232/17a4387c10e302cf.png)
![](https://img.haomeiwen.com/i9640232/561ea2aac435e079.png)
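The idmapping file is large, while only two of its columns (1 and 21) are used further down. As an optional alternative, the sketch below reads just the requested columns of a header-less tab-separated file. The helper `read_tsv_cols` is hypothetical, and the demo file is a made-up four-column stand-in rather than the real UniProt file.

```r
# Hypothetical helper: read only selected columns of a header-less TSV.
read_tsv_cols <- function(path, cols) {
  # Count tab characters on the first line to get the number of columns.
  first <- readLines(path, n = 1)
  n_fields <- lengths(regmatches(first, gregexpr("\t", first, fixed = TRUE))) + 1
  classes <- rep("NULL", n_fields)   # "NULL" makes read.table skip a column
  classes[cols] <- "character"
  read.table(path, header = FALSE, sep = "\t", quote = "",
             comment.char = "", colClasses = classes, fill = TRUE)
}

# Tiny demo file with 4 columns (made-up content).
tmp <- tempfile(fileext = ".tab")
writeLines(c("P00001\tx\ty\tENSP01; ENSP02",
             "P00002\tx\ty\tENSP03"), tmp)
small <- read_tsv_cols(tmp, c(1, 4))
print(small)
```

If the file is read this way, the kept columns end up in positions 1 and 2, so any later indexing by original column number has to be adjusted.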
- Convert our input gene SYMBOLs to UNIPROT IDs
library(clusterProfiler)
m<-bitr(gene[,2],fromType = "SYMBOL",toType = "UNIPROT",OrgDb = "org.Hs.eg.db")
colnames(idmapping)[1]<-"UNIPROT"
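- Use the idmapping file to link the three ID types (UNIPROT, SYMBOL, ENSP) together
n<-merge(m,idmapping[,c(1,21)],by="UNIPROT",all.x=T)
# drop rows without an ENSP ID (NA or empty); logical indexing is safe even when nothing matches
n<-n[!is.na(n[,3]) & n[,3]!="",]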
![](https://img.haomeiwen.com/i9640232/6407c7a67721ae62.png)
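A note on filtering rows with `-which(...)`: when no row matches the condition, the negative index is empty and R returns zero rows instead of all of them. The toy example below (made-up IDs) shows the pitfall next to a logical-index version that also handles `NA`.

```r
# Toy table: no row has an empty third column, one row has NA (made-up IDs).
toy <- data.frame(UNIPROT = c("P00001", "P00002"),
                  SYMBOL  = c("GENE1", "GENE2"),
                  ENSP    = c("ENSP01", NA))

# which() finds nothing, so the negative index is empty and every row is dropped.
dropped <- toy[-which(toy[, 3] == ""), ]
print(nrow(dropped))

# Logical indexing keeps rows that have a non-missing, non-empty ENSP.
kept <- toy[!is.na(toy[, 3]) & toy[, 3] != "", ]
print(kept)
```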
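- As the screenshot of n above shows, the third column can hold several ENSP IDs packed into one cell, so split them on the semicolon
# split on ";" and swallow any whitespace after it, so no ID keeps a leading space
prots<-str_split(n[,3],";\\s*")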
names(prots)<-n[,1]
prots_tmp<-unlist(lapply(1:length(prots), function(x){paste(names(prots)[x], prots[[x]],sep=";")}))
prots_mat <- str_split(prots_tmp,"[;]",2,simplify = T)
colnames(prots_mat) <- c("uniprot","ensemblprot")
![](https://img.haomeiwen.com/i9640232/ca4a7a4525f2a62e.png)
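The same one-to-many expansion can be sketched in base R, which may be easier to follow than the paste-and-split round trip. The IDs below are made up, and the separator (a semicolon optionally followed by whitespace) is an assumption about how the mapping file packs multiple IDs.

```r
# Toy input: one accession per row, ENSP IDs packed into one string (made-up IDs).
toy <- data.frame(UNIPROT = c("P00001", "P00002"),
                  ENSP    = c("ENSP01; ENSP02", "ENSP03"),
                  stringsAsFactors = FALSE)

# Split on ";" plus any whitespace that follows it.
parts <- strsplit(toy$ENSP, ";[[:space:]]*")

# Repeat each accession once per ENSP ID to get a long two-column table.
long <- data.frame(uniprot     = rep(toy$UNIPROT, lengths(parts)),
                   ensemblprot = unlist(parts),
                   stringsAsFactors = FALSE)
print(long)
```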
- The IDs in the ppi file carry a leading 9606. prefix that has to be stripped
ppi$item_id_a <- str_replace(ppi$item_id_a,"^9606\\.","")
ppi$item_id_b <- str_replace(ppi$item_id_b,"^9606\\.","")
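When stripping the taxon prefix, keep in mind that the pattern is a regular expression, where an unescaped `.` matches any character. A small base-R sketch of the difference, using illustrative IDs:

```r
ids <- c("9606.ENSP00000000233", "9606.ENSP00000960612")

# Anchored, escaped pattern: removes only the literal leading "9606.".
clean <- sub("^9606\\.", "", ids)
print(clean)

# Unescaped "." matches any character, so "9606X" is stripped as well.
loose <- sub("9606.", "", "9606XENSP1")
print(loose)
```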
- Keep only the interactions between our target genes, drop the rest, and remove duplicates
ppi<-ppi[which(ppi[,1] %in% prots_mat[,2] & ppi[,2] %in% prots_mat[,2]),]
# build an order-independent key so that A-B and B-A count as the same pair
ppi$identical<-paste0(pmin(ppi[,1],ppi[,2]),pmax(ppi[,1],ppi[,2]))
ppi<-ppi[!duplicated(ppi$identical),]
![](https://img.haomeiwen.com/i9640232/8c5749270fc23af2.png)
The rows left at the end are our target protein interactions.
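One thing worth checking when deduplicating an edge list is whether A-B and B-A should count as the same interaction. The sketch below uses made-up IDs and assumes that direction does not matter for the analysis; if the direction or interaction mode is needed, those columns would have to go into the key as well.

```r
# Toy edge list: rows 1 and 2 are the same pair in opposite order (made-up IDs).
edges <- data.frame(item_id_a = c("ENSP01", "ENSP02", "ENSP01"),
                    item_id_b = c("ENSP02", "ENSP01", "ENSP03"),
                    stringsAsFactors = FALSE)

# Pasting the columns in a fixed order treats A-B and B-A as different pairs.
n_ordered <- sum(!duplicated(paste(edges$item_id_a, edges$item_id_b)))
print(n_ordered)

# Order-independent key: alphabetically smaller ID first.
key <- paste(pmin(edges$item_id_a, edges$item_id_b),
             pmax(edges$item_id_a, edges$item_id_b))
unique_edges <- edges[!duplicated(key), ]
print(unique_edges)
```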