最新版pysam操作vcf文件
2023-07-10 本文已影响0人
bioshimmer
一、安装
pysam目前只能在Linux比较容易安装使用,Windows不太行。
mamba install pysam
二、使用例子
读取vcf文件和索引文件
import pysam
in_vcf = "draw.4dtv.sites.vcf.gz"
index_file_path = "draw.4dtv.sites.vcf.gz.tbi"
f1 = pysam.VariantFile(in_vcf,index_filename = index_file_path)
获得header标头信息:样本、contigs、过滤条件、info、vcf版本等
print(list(f1.header.samples))
print(list(f1.header.contigs))
print(list(f1.header.filters))
print(list(f1.header.info))
print(f1.header.version)
⚠直接打印会报错,需要转换一下
print(f1.header.samples)
<pysam.libcbcf.VariantHeaderSamples object at 0x7fade61efc70>
获取位点信息以及输出vcf文件
outvcf = "out.vcf.gz"
#输出文vcf.gz文件,需要wb参数,默认文本模式
#另外最后记得close()关闭文件
f2 = pysam.VariantFile(outvcf,"wb",header = f1.header,threads = 2)
#fetch不指定位置参数则获取所有位点
for rec in f1.fetch("HiC_scaffold_1",1,100000):
out_vcf.write(rec)
print(rec.pos,rec.info["DP"],rec.alleles)
f1.close()
f2.close()
run.sh
保留双等位基因的脚本示例
import pysam
in_vcf = "draw.4dtv.sites.vcf.gz"
index_file_path = "draw.4dtv.sites.vcf.gz.tbi"
outvcf = "out.vcf.gz"
f1 = pysam.VariantFile(in_vcf,index_filename = index_file_path)
f2 = pysam.VariantFile(outvcf,"wb",header = f1.header,threads = 2)
for rec in f1.fetch():
if (len(rec.alleles[0]) == 1) and (len(rec.alleles[1]) == 1):
f2.write(rec)
else:
pass
f1.close()
f2.close()
三、参数说明
pysam操作vcf文件有主要的四个类:
- pysam.VariantFile:读取vcf或bcf文件
- pysam.VariantRecord(*args, **kwargs):每一个变异位点
- pysam.VariantHeader:读取vcf或bcf的标头
- pysam.VariantHeaderRecord(*args, **kwargs):操作标头的类
1. class pysam.VariantFile()
Parameters:
- mode:文本文件模式有r和w,如果是压缩文件要加上b,rb或者wb,pysam也能自动检测文件类型
f1 = pysam.VariantFile('ex1.vcf','r')
f1 = pysam.VariantFile('ex1.vcf.gz,'rb')
f1 = pysam.VariantFile('ex1.vindex_filename(string)
以上写法都正确 - index_filename(string):index索引文件的路径
- drop_samples():读取时忽略该样本
- duplicate_filehandle():默认true
- ignore_truncation():Default=False
- threads:压缩解压VCF/VCF文件的线程数,Default=1
- close()关闭pysam.VariantFile
- copy()
- fetch(self, contig=None, start=None, stop=None, region=None, reopen=False, end=None, reference=None)这个函数可以得到每个变异位点
- get_reference_name(self, tid)
- get_tid(self, reference)
- is_valid_tid(self, tid)
- new_record(self, *args, **kwargs)
- open(self, filename, mode=u'r', index_filename=None, VariantHeader header=None, drop_samples=False, duplicate_filehandle=True, ignore_truncation=False, threads=1)
- reset(self)
- subset_samples(self, include_samples)
- write(self, VariantRecord record) → int
2. class pysam.VariantRecord()
Variant record突变记录
- allels:参考等位基因和替代等位基因的python元组
- alts:替代等位基因的元组
- chrom:染色体或者contig的名字
- contig:染色体或者contig的名字
- copy(self):返回VariantRecord
- filter:过滤信息,详见(VariantRecordFilter)
- format:sample format metadata (see VariantRecordFormat)
- id:vcf的id列的名字,没有就是None
- info:info data (see VariantRecordInfo)
- pos:record start position on chrom/contig (1-based inclusive)
- qual:vcf的qual列,没有就是None
- ref:参考等位基因
- rid:internal reference id number
- rlen:record length on chrom/contig (aka rec.stop - rec.start)
- samples:sample data (see VariantRecordSamples)
- start:record start position on chrom/contig (0-based inclusive)
- stop:record stop position on chrom/contig (0-based exclusive)
- translate(self, VariantHeader dst_header)
3. class pysam.VariantHeader
- add_line(self, line)
- add_meta(self, key, value=None, items=None)
- add_record(self, VariantHeaderRecord record)
- add_sample(self, name)
- add_samples(self, *args)
- alts
- contigs
- copy(self)
- filters
- formats
- info
- merge(self, VariantHeader header)
- new_record(self, contig=None, start=0, stop=0, alleles=None, id=None, qual=None, filter=None, info=None, samples=None, **kwargs)
- records
- samples
- version:vcf版本
4. class pysam.VariantHeaderRecord()
header record from a VariantHeader object
- attrs:sequence of additional header attributes
- get(self, key, default=None):
D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None. - items(self):