
Python Bioinformatics Exercises (2)

2018-02-06  杨亮_SAAS
  1. Write a program, transferMultipleColumToMatrix.py, that converts the expression data for genes across multiple tissues in the file multipleColExpr.txt into matrix form.

Answer:

file = r'E:\Bioinformatics\Python\practice\chentong\notebook-master\data\multipleColExpr.txt'
aDict = {}                    # nested dict: gene -> {sample: expression value}
sample = []                   # sample names, in the order they first appear
with open(file) as f:
    next(f, None)             # skip the header line
    for line in f:
        if not line.strip():
            continue          # ignore blank lines
        # take the first three columns of each line: gene, sample, value
        key1, key2, value = line.rstrip('\n').split('\t')[:3]
        aDict.setdefault(key1, {})[key2] = value   # create the inner dict on first sight of a gene
        if key2 not in sample:                     # record each sample only once
            sample.append(key2)
print('Name' + '\t' + '\t'.join(sample))
for key, subD in aDict.items():          # print the nested dict, one gene per row
    # look values up by sample name so every column lines up with the header;
    # 'NA' for a gene/sample pair missing from the file is an assumption
    print(key + '\t' + '\t'.join(subD.get(s, 'NA') for s in sample))
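The same long-to-wide conversion can be sketched as a function that accepts any iterable of lines, which makes it easy to try without the data file. The function name `long_to_matrix`, the `'NA'` placeholder, and the demo rows are assumptions made for illustration; they are not part of the original exercise or its data.

```python
def long_to_matrix(lines, missing='NA'):
    """Convert tab-separated gene/sample/value rows (with a header) to matrix rows."""
    expr = {}       # gene -> {sample: value}
    samples = []    # column order: samples as first seen
    rows = iter(lines)
    next(rows, None)                      # skip the header line
    for line in rows:
        if not line.strip():
            continue                      # ignore blank lines
        gene, sample, value = line.rstrip('\n').split('\t')[:3]
        expr.setdefault(gene, {})[sample] = value
        if sample not in samples:
            samples.append(sample)
    matrix = [['Name'] + samples]
    for gene, by_sample in expr.items():
        # fill gene/sample pairs that never occurred with the placeholder
        matrix.append([gene] + [by_sample.get(s, missing) for s in samples])
    return matrix


# Made-up demo data (not taken from multipleColExpr.txt)
demo = [
    'Gene\tSample\tValue\n',
    'geneA\tliver\t1.5\n',
    'geneA\tbrain\t2.0\n',
    'geneB\tbrain\t0.3\n',
]
for row in long_to_matrix(demo):
    print('\t'.join(row))
```

Looking each value up by sample name, rather than printing the inner dictionary in its own order, keeps the columns aligned with the header even when a gene lacks a value for some sample.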
  2. Write a program, reverseComplementary.py, that computes the reverse complement of the sequence ACGTACGTACGTCACGTCAGCTAGAC.
seq = 'ACGTACGTACGTCACGTCAGCTAGAC'
complementary = []
for i in list(seq):
    if i == 'A':
        i = 'T'
    elif i == 'T':
        i = 'A'
    elif i == 'G':
        i = 'C'
    elif i == 'C':
        i = 'G'
    complementary.append(i)
complementary.reverse()
print(''.join(complementary))

#results:
GTCTAGCTGACGTGACGTACGTACGT
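
A more compact variant of the same idea uses a translation table instead of the if/elif chain. This is only a sketch: the names `COMPLEMENT` and `reverse_complement` are chosen here for illustration, and the table assumes an uppercase sequence containing only A, C, G, and T (any other character passes through unchanged).

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    # complement every base, then reverse the string with a slice
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement('ACGTACGTACGTCACGTCAGCTAGAC'))
# prints GTCTAGCTGACGTGACGTACGTACGT
```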
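  3. Write a program, collapsemiRNAreads.py, that converts smRNA-Seq sequencing data.
    Input file format (mir.collapse: a tab-separated, two-column file; the first column is the sequence and the second is the number of times that sequence was observed)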
    ID_REF VALUE
    ACTGCCCTAAGTGCTCCTTCTGGC 2
    ATAAGGTGCATCTAGTGCAGATA 25
    TGAGGTAGTAGTTTGTGCTGTTT 100
    TCCTACGAGTTGCATGGATTC 4
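    Output file format (mir.collapse.fa): each FASTA sequence name has three parts joined by underscores. The first three letters are the sample-specific tag, the number in the middle is the ordinal position of the sequence and uniquely identifies it, and the third part is x followed by the number of times the read was observed.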
    >ESB_1_x2
    ACTGCCCTAAGTGCTCCTTCTGGC
    >ESB_2_x25
    ATAAGGTGCATCTAGTGCAGATA
    >ESB_3_x100
    TGAGGTAGTAGTTTGTGCTGTTT
    >ESB_4_x4
    TCCTACGAGTTGCATGGATTC
    Answer:
# Method 1 (verbose):
file = r'E:\Bioinformatics\Python\practice\chentong\notebook-master\data\mir.collapse'
with open(file) as f:
    lines = f.readlines()[1:]       # drop the header line
miRNA = {}                          # sequence -> read count
for line in lines:
    info = line.split()
    seq = info[0]
    miRNA[seq] = info[1]            # a repeated sequence keeps only its last count

num = 1
for key, value in miRNA.items():    # dicts preserve insertion order in Python 3.7+
    print('>' + 'ESB' + '_' + str(num) + '_' + 'x' + value)
    print(key)
    num += 1
# Method 2 (concise):
file = r'E:\Bioinformatics\Python\practice\chentong\notebook-master\data\mir.collapse'

head = 1
sample = 'ESB'
lineno = 0
for line in open(file):
    if head:
        head -= 1
        continue          # continue skips the rest of the loop body and moves on to the next iteration
    #----- skip header line ------
    seq, value = line.split()
    lineno += 1
    print('>{}_{}_x{}\n{}'.format(sample, lineno, value, seq))
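
The concise method can also be wrapped in a generator so that the naming scheme can be checked on a few in-memory lines before it is run on the real file. This is a sketch under assumptions: the function name `collapse_to_fasta` and its `sample` parameter are hypothetical, and the demo rows are the first two example rows from the input format shown above.

```python
def collapse_to_fasta(lines, sample='ESB'):
    """Yield one FASTA record per (sequence, count) row; the first line is a header."""
    rows = iter(lines)
    next(rows, None)                       # skip the header line
    # enumerate supplies the running sequence number, starting at 1
    for number, line in enumerate(rows, start=1):
        seq, count = line.split()
        yield '>{}_{}_x{}\n{}'.format(sample, number, count, seq)


# First two example rows from the exercise description
demo = [
    'ID_REF\tVALUE\n',
    'ACTGCCCTAAGTGCTCCTTCTGGC\t2\n',
    'ATAAGGTGCATCTAGTGCAGATA\t25\n',
]
print('\n'.join(collapse_to_fasta(demo)))
```

Because the function only needs an iterable of lines, an open file object can be passed in place of `demo`.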
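  4. A simplified short-read mapping program (map.py): align the sequences in short.fa to ref.fa and report which sequences in ref.fa each short sequence matches, and at which positions.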
f1 = r'E:\Bioinformatics\Python\practice\chentong\notebook-master\data\short.fa'
f2 = r'E:\Bioinformatics\Python\practice\chentong\notebook-master\data\ref.fa'
# Build two dictionaries and search one against the other.
# In both short and ref, the key is the header line with '>' and the trailing '\n' removed.
short = {}
ref = {}
for line in open(f1):
    if line.startswith('>'):
        key = line.strip('>\n')
        short[key] = ''
    else:
        short[key] += line.strip()     # append, so a sequence wrapped over several lines stays whole
#----end reading f1-------------------
for line in open(f2):
    if line.startswith('>'):
        key = line.strip('>\n')
        ref[key] = []
    else:
        ref[key].append(line.strip())
#----end reading f2(ref)--------------

# For each reference sequence, scan every short sequence
for key2, value2 in ref.items():
    # join the lines of the reference record into one long sequence
    seqRef = ''.join(value2)
    for key1, value1 in short.items():
        start = seqRef.find(value1)
        while start != -1:         # -1 means there is no (further) match
            # reference name, 1-based start, end, matched sequence
            print('{}\t{}\t{}\t{}'.format(key2, start + 1, start + len(value1), value1))
            # resume one base past the last hit so that every later match,
            # including overlapping ones, is reported
            start = seqRef.find(value1, start + 1)
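
The repeated-search step can be isolated into a small helper and tried on a toy sequence. This is a sketch, not the author's code: the name `find_all` and the test strings are made up for illustration. It relies on the optional start argument of `str.find` rather than slicing the reference and adding the offset back.

```python
def find_all(reference, query):
    """Return 1-based (start, end) positions of every occurrence, overlaps included."""
    hits = []
    start = reference.find(query)
    while start != -1:                        # -1 means no further match
        hits.append((start + 1, start + len(query)))
        # search again from one position past the previous hit
        start = reference.find(query, start + 1)
    return hits


print(find_all('ACACACGT', 'ACAC'))
# prints [(1, 4), (3, 6)]
```

Advancing by one position instead of by the query length is what allows overlapping matches, such as the two hits above, to be reported.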