过滤序列比对长度较短的文件

2022-03-16 本文已影响0人徐诗芬

一个很有意思的小脚本，现在写脚本的思路越来越偏向高效化了。
两个小技巧：1. 这里用python写文件的话运行速度会慢很多，因此用subprocess调用shell脚本会更快。2. 使用了yield，它是一个迭代器，相当于读取一个然后处理一个。不用return，因为return相当于需要全部遍历完文件再进入循环，会慢很多。

#!/usr/bin/python
# -*- coding: utf-8 -*-
# conda activate python3
"""
    作者：徐诗芬
    内容：过滤一系列.mafft.pep.fa文件中序列比对长度低于50的文件，
    先读取判断文件序列长度，再用subprocess进行拷贝文件，因此只适用于Linux系统而不适用于window！
    日期：2022.3.12
"""
import os
import sys
import subprocess

def file_extract(path):
    # 提取基因蛋白序列文件，生成一个文件路径列表
    # file_list = []
    for root, dirs, files in os.walk(path):
        for f in files:
            file = os.path.join(root, f)
            yield file

def fasta_len(inf):
    # 按行读取序列
    # 输入fasta文件，返回名称，序列
    with open(inf) as f:
        global name
        dict = {}
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                name = line
                dict[name] = ''
                i = 0
            else:
                dict[name] += line
                i += len(dict[name])
        return i

def main():
    inpath = sys.argv[1]
    outpath1 = sys.argv[2]   # 文件没有用的，<50个氨基酸的序列
    outpath2 = sys.argv[3]   # 文件有用的，≥50个氨基酸的序列
    subprocess.call(['mkdir', outpath1])
    subprocess.call(['mkdir', outpath2])
    for file in file_extract(inpath):
        if fasta_len(file) < 50:
            subprocess.call(['cp', file, outpath1])
        else:
            subprocess.call(['cp', file, outpath2])

try:
    main()
except IndexError:
    usage()

过滤序列比对长度较短的文件

猜你喜欢

热点阅读