python: Getting the core genes of a genome cluster

2021-07-30  胡童远

The method is simple: walk through the gene sets of all genomes and intersect them with set & set.
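As a minimal sketch of the idea, the toy example below intersects three in-memory gene sets. The genome names and their COG assignments are made up for illustration and are not data from this analysis.

```python
# Toy data: hypothetical genomes, COG assignments invented for illustration
genomes = {
    "genome_a": {"COG0006", "COG0008", "COG0009"},
    "genome_b": {"COG0008", "COG0009", "COG0012"},
    "genome_c": {"COG0008", "COG0009", "COG0015"},
}

gene_sets = list(genomes.values())
core = gene_sets[0]
for other in gene_sets[1:]:
    core = core & other  # keep only the genes present in both sets

print(sorted(core))  # ['COG0008', 'COG0009']
```

Each pass can only shrink the running set, so whatever remains at the end is present in every genome.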

1 Genome cluster list

head total_lacto.list
CNGBCC1950658
CNGBCC1950669
CNGBCC1950686
CNGBCC1950698
CNGBCC1950902

2 Gene list of each genome

head ../cog_uniq/CNGBCC1950658.tsv
COG0006
COG0008
COG0009
COG0012
COG0015

3 Computing the core genes in python

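Approach:
Read the genome list line by line and strip the newline characters.
Load the first genome's gene list as head.
Then open each remaining gene list in turn as a temporary tail and intersect it with head using set & set.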

#!/usr/bin/env python
g_list = "total_lacto.list"
with open(g_list, 'r') as g_list_file:
    # Build the path of each genome's gene list file
    tmp_list = []
    for tmp in g_list_file:
        tmp = tmp.strip()
        if tmp:  # skip blank lines in the genome list
            tmp_list.append("../cog_uniq/{}.tsv".format(tmp))

# Intersect the gene sets one genome at a time
num = 1
with open(tmp_list[0]) as head_file:
    # Strip newlines so that a last line without "\n" still matches
    head = set(line.strip() for line in head_file if line.strip())
print("\t head done...")
for tail_path in tmp_list[1:]:
    with open(tail_path) as tail_file:
        tail = set(line.strip() for line in tail_file if line.strip())
    # Core step: keep only the genes shared with this genome
    head = head & tail
    num = num + 1
    print("\t intersect {} done...".format(num))

# Output, sorted so that the result is reproducible
out_name = "./lacto_core_cog.tsv"
with open(out_name, 'w') as o:
    o.write(''.join(gene + "\n" for gene in sorted(head)))
    print("\t write done...")

Finally, spot-check a few samples by hand to verify that the result is correct.
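The spot check could also be scripted. The sketch below is one possible way to do it, not the procedure used in this post: the helper names `read_ids` and `spot_check` and their parameters are hypothetical, and it assumes the same file layout as above (one ID per line, gene lists named `<genome>.tsv`). It only checks that every core gene appears in each sampled genome; it does not prove that no shared gene was missed.

```python
import os
import random


def read_ids(path):
    """Read one ID per line, ignoring blank lines and surrounding whitespace."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def spot_check(core_path, list_path, gene_dir, n_samples=3, seed=0):
    """Return {genome: missing core genes} for a random sample of genomes.

    An empty dict means every sampled genome contains every core gene.
    """
    core = read_ids(core_path)
    genomes = sorted(read_ids(list_path))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(genomes, min(n_samples, len(genomes)))
    problems = {}
    for genome in sample:
        genes = read_ids(os.path.join(gene_dir, genome + ".tsv"))
        missing = core - genes  # core genes absent from this genome
        if missing:
            problems[genome] = sorted(missing)
    return problems
```

With the files from this post, calling `spot_check("./lacto_core_cog.tsv", "total_lacto.list", "../cog_uniq")` should return an empty dict if the core gene list is consistent with the sampled genomes.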
