test5: merging two DataFrames with pandas
2020-06-09
夕颜00
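Goal: download files from the STRING database for PPI (protein-protein interaction) analysis.
Reason: the protein IDs in files downloaded from the STRING website have the form 9606.ENSP*, so they need to be converted to gene names (gene_name).
1. File 1: 9606.protein.links.v11.0.txt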
protein1 protein2 combined_score
9606.ENSP00000000233 9606.ENSP00000272298 490
9606.ENSP00000000233 9606.ENSP00000253401 198
9606.ENSP00000000233 9606.ENSP00000401445 159
2. Annotation file: 9606.protein.info.v11.0.txt
protein_external_id preferred_name protein_size annotation
9606.ENSP00000000233 ARF5 180 ADP-ribosylation...
9606.ENSP00000000412 M6PR 277 Cation-dependent...
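The two files use different delimiters, which is easy to miss. Below is a minimal sketch that reads the sample rows shown above from in-memory strings instead of from disk. It assumes that the links file is space-separated and the info file is tab-separated; the variable names are illustrative only, and the truncated annotation text is kept exactly as shown.

```python
import io

import pandas as pd

# Sample rows copied from the excerpts above (space-separated links file).
links_text = """protein1 protein2 combined_score
9606.ENSP00000000233 9606.ENSP00000272298 490
9606.ENSP00000000233 9606.ENSP00000253401 198
9606.ENSP00000000233 9606.ENSP00000401445 159
"""

# Assumption: the info file is tab-separated; annotation text is truncated as in the excerpt.
info_text = (
    "protein_external_id\tpreferred_name\tprotein_size\tannotation\n"
    "9606.ENSP00000000233\tARF5\t180\tADP-ribosylation...\n"
    "9606.ENSP00000000412\tM6PR\t277\tCation-dependent...\n"
)

links = pd.read_csv(io.StringIO(links_text), sep=" ")
info = pd.read_csv(io.StringIO(info_text), sep="\t")

print(links.dtypes)
print(info[["protein_external_id", "preferred_name"]])
```

Printing the dtypes is a quick way to confirm that the delimiter was right: with a wrong separator, everything collapses into a single column.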
3. Target file after conversion:
gene1 gene2
ARF5 CALM2
ARF5 ARHGEF9
ARF5 ERN1
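The conversion is essentially a lookup from protein ID to gene name, applied to both ID columns. The sketch below illustrates the idea on toy data with `Series.map`, as an alternative to a merge. It assumes that the three target rows above correspond, in order, to the three link rows shown in file 1; the small `info` table is built by hand under that assumption.

```python
import pandas as pd

# Toy copy of the three link rows shown in file 1.
links = pd.DataFrame({
    "protein1": ["9606.ENSP00000000233"] * 3,
    "protein2": [
        "9606.ENSP00000272298",
        "9606.ENSP00000253401",
        "9606.ENSP00000401445",
    ],
    "combined_score": [490, 198, 159],
})

# Assumption: ID-to-name pairs inferred from the target rows above.
info = pd.DataFrame({
    "protein_external_id": [
        "9606.ENSP00000000233",
        "9606.ENSP00000272298",
        "9606.ENSP00000253401",
        "9606.ENSP00000401445",
    ],
    "preferred_name": ["ARF5", "CALM2", "ARHGEF9", "ERN1"],
})

# Build a lookup Series indexed by protein ID, then map both ID columns.
id_to_name = info.set_index("protein_external_id")["preferred_name"]
genes = pd.DataFrame({
    "gene1": links["protein1"].map(id_to_name),
    "gene2": links["protein2"].map(id_to_name),
})
print(genes)
```

IDs that are missing from the lookup come back as NaN, which matches the behaviour of a left merge.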
4. Script:
import pandas as pd
infofile = "E:/Script/python/test5/9606.protein.info.v11.0.txt"
linkfile = "E:/Script/python/test5/9606.protein.links.v11.0.txt"
out = "E:/Script/python/test5/merge.txt"
info = pd.read_table(infofile)
links = pd.read_table(linkfile, sep=" ")  # this file is space-separated, not the usual tab-separated
# print(links.head())
# print(links.columns)
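# First merge: attach the annotation for protein1; keep the 3 link columns plus
# protein_external_id and preferred_name (columns 0-4).
result1 = pd.merge(links, info, left_on="protein1", right_on="protein_external_id", how='left').iloc[:,0:5]
# Second merge: attach the annotation for protein2. The duplicated column names get
# _x/_y suffixes, so columns 4 and 6 are preferred_name_x (gene1) and preferred_name_y (gene2).
result2 = pd.merge(result1, info, left_on="protein2", right_on="protein_external_id", how='left').iloc[:,[4,6]]
result2.columns = ["gene1", "gene2"]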
# print(result2.head())
result2.to_csv(out, sep="\t", index=False)
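Selecting columns by position works for this exact file layout but breaks silently if the column order changes. A possible variant, sketched below on toy data, selects columns by name and also reports rows whose IDs have no annotation (a left merge leaves those as NaN). The ID `9606.ENSP_HYPOTHETICAL` and the toy tables are made up for illustration; they are not from the STRING files.

```python
import pandas as pd

links = pd.DataFrame({
    "protein1": ["9606.ENSP00000000233", "9606.ENSP00000000233"],
    # Hypothetical ID with no annotation row, to show the NaN case.
    "protein2": ["9606.ENSP00000000412", "9606.ENSP_HYPOTHETICAL"],
    "combined_score": [490, 198],
})
info = pd.DataFrame({
    "protein_external_id": ["9606.ENSP00000000233", "9606.ENSP00000000412"],
    "preferred_name": ["ARF5", "M6PR"],
})

# Keep only the two columns needed for the lookup.
names = info[["protein_external_id", "preferred_name"]]

# Rename the lookup columns so each merge joins on a shared key and
# produces an already well-named gene column.
step1 = links.merge(
    names.rename(columns={"protein_external_id": "protein1", "preferred_name": "gene1"}),
    on="protein1",
    how="left",
)
step2 = step1.merge(
    names.rename(columns={"protein_external_id": "protein2", "preferred_name": "gene2"}),
    on="protein2",
    how="left",
)
result = step2[["gene1", "gene2"]]

# Rows where at least one ID could not be converted.
unmapped = step2[result.isna().any(axis=1)]
print(result)
print(unmapped[["protein1", "protein2"]])
```

Whether unmapped rows should be dropped or kept depends on the downstream PPI tool, so the sketch only reports them.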