使用python处理生物信息数据(三)
2020-03-01 本文已影响0人
你猜我菜不菜
Python学习的第三天。
1. 整合质谱数据到代谢通路中
假设你想确定癌症细胞样品中的一些蛋白属于哪些代谢或调控通路,首先你有一个文件数据包含细胞循环相关通路的蛋白列表,然后你有一个通过检测癌细胞质谱实验获得的蛋白列表。这些文件可能来自于Reactome 这样的公共数据库,也可以来自于你的实验结果。实际上你需要做的是对比两个文件列表,看你关注的蛋白属于哪些通路。
#细胞循环通路相关蛋白数据导入
list_a = []
for line in open("cell_cycle_proteins.txt"):
list_a.append(line.strip())
print(list_a)
['P62258', 'P61981', 'P62191', 'P17980', 'P43686', 'P35998', 'P62333', 'Q99460', 'O75832']
#癌细胞中蛋白数据导入
list_b = []
for line in open("cancer_cell_proteins.txt"):
list_b.append(line.strip())
print(list_b)
['P43686', 'P62333']
#比较list_b中的蛋白是否在list_a中
for protein in list_a:
if protein in list_b:
print(protein,'detected in the cancer cell')
else:
print(protein,'not detected')
P62258 not detected
P61981 not detected
P62191 not detected
P17980 not detected
P43686 detected in the cancer cell
P35998 not detected
P62333 detected in the cancer cell
Q99460 not detected
O75832 not detected
这个例子包含了几个非常有用的操作,首先是for循环和open()读取文件,然后if/elif/else语句判断list_b中的蛋白是否在list_a中,并对结果进行了解释打印出来了。
#if/elif/else语句的结构
if <condition 1>:
<statements 1>
[elif <condition 2>]:
<statements 2>]
[elif <condition 3>]:
pass]
…
[else:
<statements N>]
当if/elif/else语句中有多个条件的时候,就需要Boolean operator进行布尔运算将多个条件组织到一起了,百度了一下布尔运算是数字符号化的逻辑推演法。
2. if/elif/else语句中布尔运算的使用
判断序列中是否存在一些特定的序列结构
seq = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
if 'GGG' in seq and 'RRR' in seq:
print('GGG is at position: ', seq.find('GGG'))
print('RRR is at position: ', seq.find('RRR'))
GGG is at position: 27
RRR is at position: 13
if 'WWW' in seq or 'AAA' in seq:
print('Either WWW or AAA occur in the sequence')
Either WWW or AAA occur in the sequence
if 'AAA' in seq and not 'PPP' in seq:
print('AAA occurs in the sequence but not PPP')
AAA occurs in the sequence but not PPP
布尔运算符
简单的布尔运算符使用
for i in range(30):
if i < 4:
print("prime number:", i)
elif i % 2 == 0:
print("multiple of two:", i)
elif i % 3 == 0:
print("multiple of three:", i)
elif i % 5 == 0:
print("multiple of five:", i)
else:
print("prime number:", i)
prime number: 0
prime number: 1
prime number: 2
prime number: 3
multiple of two: 4
multiple of five: 5
multiple of two: 6
prime number: 7
multiple of two: 8
multiple of three: 9
multiple of two: 10
prime number: 11
multiple of two: 12
prime number: 13
multiple of two: 14
multiple of three: 15
multiple of two: 16
prime number: 17
multiple of two: 18
prime number: 19
multiple of two: 20
multiple of three: 21
multiple of two: 22
prime number: 23
multiple of two: 24
multiple of five: 25
multiple of two: 26
multiple of three: 27
multiple of two: 28
prime number: 29
3. 列表解析操作
a = [x for x in range(5)]
print(a)
Out[13]: [0, 1, 2, 3, 4]
squares = [x**2 for x in range(5)]
print(squares)
[0, 1, 4, 9, 16]
bases = ['A', 'C', 'T', 'G']
bases
Out[15]: ['A', 'C', 'T', 'G']
seq = 'GGACXCAGXXGATT'
print(seq)
GGACXCAGXXGATT
seqlist = [base for base in seq if base in bases]
print(seqlist)
['G', 'G', 'A', 'C', 'C', 'A', 'G', 'G', 'A', 'T', 'T']
4. 处理fasta格式的序列文件
4.1 读取fasta格式文件
my_file = open('SwissProt.fasta','r')
out_file = open('SwissProt.header','w')
for line in my_file:
if line[0:1] == '>':
out_file.write(line)
out_file.close()
4.2从一个多序列的fasta文件的标题中提取特定信息
我们常常要提取序列标题中的信息,比如特定数据库的获得号P31946,或者提取序列的物种信息,基因注释信息。
fasta文件
#读取文件
input_file = open("SwissProt.fasta","r")
Accession_Codes_list = []
for line in input_file:
if line[0] == '>': #以>开头的行
fields = line.split('|') #以"|"作为分隔符
Accession_Codes_list.append(fields[1]) #以"|"作为分隔符的,提取第2个字符段
print(Accession_Codes_list)
['P31946', 'P62258', 'Q04917', 'P61981', 'P31947', 'P27348', 'P63104', 'P30443']
4.3 处理GenBank格式文件
从GenBank格式文件提取获得号,比如图中的AY810830这个号。
GenBank格式文件
genbank_file = open("AY810830.gb")
output_file = open("AY810830.fasta","w")
flag = False
for line in genbank_file:
if line[0:9] == 'ACCESSION':
accession = line.split()[1].strip()
output_file.write('>' + accession + '\n')
if line[0:6] == 'ORIGIN':
flag = True
elif flag:
fields = line.split()
if fields != []:
seq = ''.join(fields[1:])
output_file.write(seq.upper() + '\n')
genbank_file.close()
output_file.close()
4.4 处理GenBank格式文件
fasta_file = open('SwissProt.fasta','r')
out_file = open('SwissProtHuman.fasta','w')
seq = ''
for line in fasta_file:
if line[0] == '>' and seq == '':
# process the first line of the input file
header = line
elif line[0] != '>':
# join the lines with sequence
seq = seq + line
elif line[0] == '>' and seq != '':
# in subsequent lines starting with '>',
# write the previous header and sequence
# to the output file. Then re-initialize
# the header and seq variables for the next record
if "Homo sapiens" in header:
out_file.write(header + seq)
seq = ''
header = line
if "Homo sapiens" in header:
out_file.write(header + seq)
out_file.close()