Pysam BAM操作

2020-09-07 本文已影响0人茄子_0937

SAM/BAM/CRAM files

class pysam.AlignmentFile

AlignmentFile(filepath_or_object, mode=None, template=None, \\
reference_names=None, reference_lengths=None, text=NULL, header=None, \\
add_sq_text=False, check_header=True, check_sq=True, reference_filename=None, \\
filename=None, index_filename=None, filepath_index=None, require_index=False, \\
duplicate_filehandle=True, ignore_truncation=False, threads=1)

filepath_or_object 既可以是文件路径，也可以是文件对象；

如果输入的文件没有 index 将无法使用 fetch() 和 pileup()；

如果是写文件:

给了 template ,header 拷贝来源的 AlignmentFile (template 必须是 AlignmentFile).
若给了 header , header 就有这个多层字典构建

默认'r' mode 下，会检查头文件(check_header = True)和染色体名称的定义(check_sq = True).

参数：

mode (string)-

'r' 表示 reading、'w' 表示 writing。对于二进制的BAM文件要添加'b',合法的值一般为'r','w','rb','wb'。例如：
```
f = pysam.AlignmentFile('ex1.bam','rb')
```
template (AlignmentFile) - writing 时从 template 拷贝 header信息
threads (integer) - 压缩或解压BAM文件的线程数，默认为1

check_index(self)

存在 index时会返回 True

Raise:

AttributeError - 是SAM文件，没有index
ValueError - 文件无法打开或index 无法打开

close(self)

关闭 pysam.AlignmentFile .

count(self, contig=None, start=None, stop=None, region=None, until_eof=False, read_callback='nofilter', reference=None, end=None)

计算给定区间内的reads 数量

这个区间通过 contig 、start、stop 进行描述，也支持 samtools regin 格式的描述（‘chr1:10000:20000’）

不支持SAM格式的文件

Parameters:

contig (string) – 染色体号
start (int) – 起始位点 (0-based inclusive)
stop (int) – 结束位点 (0-based exclusive)
region (string) – samtools regin 格式的描述

Raises: ValueError — 基因组的坐标超过范围或无效的

count_coverage(self, contig, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None)

计算reads在这个基因区间的覆盖度，结果是 four array.arrays of the same length in order A C G T

参数：

contig (string) – 染色体号
start (int) – 起始位点 (0-based inclusive)
stop (int) – 结束位点 (0-based exclusive)
quality_threshold (int) – 最低的质量值
read_callback (string or function) – 选择一个 call-back 用于过滤reads

all

跳过flag:BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP 的reads

nofilter

不进行过滤

此外，可以自行创建一个check_read(read) 的模块，对想要计算的read 返回 True 即可。

Raises: ValueError – if the genomic coordinates are out of range or invalid.

Returns: four array.arrays of the same length in order A C G T

Return type: tuple

例子：

pysam01.png

fetch(self, contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)

获取指定区域内的reads，返回一个 AlignedSegment 对象的 iteration

不指定 contig 或 region 就会获取文件中所有的reads，reads的顺序按参考序列排序，不一定和文件中的顺序相同。如果没有index，则 until_eof = True。

如果只设定了 contig ，则整个contig中的reads都会获取

不支持 SAM文件

参数：

until_eof (bool) – 设定为True 时，reads返回顺序和文件中的一致，同时也会返回unmapped的reads
multiple_iterators (bool) – 设定为True时，同时可以使用多个iterator，但是会增加开销

Return type: An iterator over a collection of reads.

例子：

pysam05.png

find_introns(self, read_iterator)

返回一个字典 {(start, stop): count} ，列出reads中的 intronic sites( 通过ciger值中的'N'来辨认 )，支持度是reads的数量。

read_iterator 可以是 fetch(…) 的结果，也可以是对这些reads再进行过滤的生成器。

Ex. samfile.find_introns((read for read in samfile.fetch(…) if read.is_reverse))

get_index_statistics(self)

返回每个染色体 mapped/unmapped reads 的统计。

Returns: list – ‘mapped’, ‘unmapped’ and ‘total’.

Return type: a list of records for each chromosome. Each record has the attributes ‘contig’,

例子：

pysam03.png

head(self, n, multiple_iterators=True)

返回包含文件前几行的 iterator

参数：

multiple_iterators (bool) – is set to True by default in order to avoid changing the current file position.

Return type: an iterator over a collection of reads

例子：

pysam04.png

mate(self, AlignedSegment read)

返回 AlignedSegment 的配对

会改变文件指针位置，会干扰不是re-opened 文件的 iterators

大量处理时会很慢，推荐使用readname sorted BAM

Raises: ValueError – if the read is unpaired or the mate is unmapped

例子：

pysam06.png

pileup(*self, contig=None, start=None, stop=None, region=None, reference=None, end=None, *kwargs)

pileup()方法可以获得特定区域的每个碱基；每次迭代会生成一个PileupColumn对象。该对象包含了覆盖到该位点的所有reads，这些reads 在 PileupColumn.pileups中被记录为 PileupRead对象

若没有指定 contig 和 region 则用所有reads 进行pileup。

不支持SAM文件

覆盖指定区域的所有reads都会返回；返回的第一个碱基，是第一条read的第一个碱基所以不一定是 query region的第一个碱基

参数：

truncate (bool) – 默认情况下会返回覆盖区间内所有的reads ，但是truncate 设定为True 则，仅返回query 区间内的碱基
max_depth(int) - 允许的最大 read 深度，默认是 8000
stepper(string) - 控制iterator 如何前进,可选的参数为：

all : 跳过 flag为BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP 的reads

**nofilter** : 不过滤

samtools: 与csamtoos pileup 相同的过滤过程，为了完全兼容需要 fastafile ，下面的选项后这个过滤有关。
fastafile (FastaFile object.) - 对于某些steppers 必须
ignore_overlaps (bool) – 设定为True 时，若 read pairs 有 overlap ,仅取高质量的碱基，这是默认的
flag_filter (int) – 过滤 BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP （应该是根据二进制flag值进行识别）
flag_require (int) – 只保留给定flag值得reads ,默认 0
ignore_orphans (bool) – 默认开启，过滤paired reads that are not in a proper pair
min_base_quality (int) – Minimum base quality. Bases below the minimum quality will not be output.
adjust_capq_threshold (int) – 调整mapping quality，默认0 不调整，推荐的值是 50
min_mapping_quality (int) – only use reads above a minimum mapping quality. The default is 0.
compute_baq (bool) – re-alignment 计算 per-Base Alignment Qualities (BAQ). 默认会做re-alignment。需要reference sequence，如果没有，就不做了
redo_baq (bool) – 重新计算 BAQ ，默认不做

Return type: an iterator over genomic positions.

pysam07.png

class pysam.AlignedSegment(AlignmentHeader header=None)**

表示 SAM 或 BAM文件中的一个 aligned segment

cigarstring

得到这条read 的cigar值，string 格式

cigartuples

得到tuples 形式的cigar值

pysam09.png

get_blocks(self)

没有间隔的区间，比对结果的起始、终止位点的列表，若两个块之间有插入，则会相邻。

pysam08.png

flag

得到 flag 值

get_forward_sequence(self)

获得reads的序列。如果是mapping到反链，得到是反向互补的序列,

get_forward_qualities(self)

同上，得到的是质量值，结果是 arrary

pysam10.png

get_overlap(self, uint32_t start, uint32_t end)

得到read 在指定区间内的碱基覆盖数量

get_reference_positions(self, full_length=False)

得到read 每个碱基比对到参考基因组的位置，当full_length=True 时，会显示soft-clip的位置None

get_reference_sequence(self)

得到read 比对到参考基因位置的，参考基因碱基序列

pysam11.png

get_tags(self, with_value_type=False)

has_tag(self, tag)

获取optional aligment section

pysam12.png

infer_query_length(self, always=False)

infer_read_length(self)

根据CIGAR 值推导 query length 和 read length，但是不包含 hard-clipped bases

pysam13.png

一些reads 属性 (只记录了部分)

pysam14.png

class pysam.PileupColumn

A pileup of reads at a particular reference sequence position (column). A pileup column contains all the reads that map to a certain target base.

覆盖到指定区域的reads 集合。所以返回的第一个值不一定是指定的区间最左侧，而是能覆盖到指定区间最左侧的reads，最靠左的碱基。

pysam15.png

class pysam.PileupRead

alignment

返回 pysam.AlignedSegment object

indel

indel length for the position following the current pileup site.

接下来是ins 则为正数，del 则为负数，不是indel 是 0

其它

is_del
is_head
is _refskip
is_tail
query_position : position of the read base at the pileup site, 0-based. None if is_del or is_refskip is set.
query_position_or_next: position of the read base at the pileup site, 0-based. If the current position is a deletion, returns the next aligned base.

pysam16.png

Pysam BAM操作

SAM/BAM/CRAM files

猜你喜欢

热点阅读