R菜🐣日记——走R包数据科学与R语言

GenomicAlignments: 用R处理+操作BAM文件

2019-06-18  本文已影响46人  美式永不加糖

愿天堂没有头痛


短序列比对的呈现和操纵

Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan M and Carey V (2013). “Software for Computing and Annotating Genomic Ranges.”PLoS Computational Biology, 9. http://dx.doi.org/10.1371/journal.pcbi.1003118,

1. 关于 GenomicAlignments

GenomicAlignments 可以高效储存和操作短序列比对 (short genomic alignments) ,包括 read counting, computing the coverage, junction detection, 以及比对中核苷酸含量的操作。包内有 GAlignments , GAlignmentPairs , GAlignmentsList 三种对象。

2. 访问序列信息

2.1 载入 BAM 文件

从 BAM 文件得到已比对的 reads 和其序列。

这时用到了一个数据包:RNAseqData.HNRNPC.bam.chr14 .

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES 
names(bamfiles)
# [1] "ERR127306" "ERR127307" "ERR127308" "ERR127309" "ERR127302"
# [6] "ERR127303" "ERR127304" "ERR127305"

调用 Rsamtools 包内的函数 quickBamFlagSummary() 查看 BAM 文件中的序列是单端或双端比对。

library(Rsamtools)
quickBamFlagSummary(bamfiles[1], main.groups.only=TRUE) 
#                                 group |    nb of |    nb of | mean / max
#                                    of |  records |   unique | records per
#                               records | in group |   QNAMEs | unique QNAME
# All records........................ A |   800484 |   393300 | 2.04 / 10
#   o template has single segment.... S |        0 |        0 |   NA / NA
#   o template has multiple segments. M |   800484 |   393300 | 2.04 / 10
#       - first segment.............. F |   400242 |   393300 | 1.02 / 5
#       - last segment............... L |   400242 |   393300 | 1.02 / 5
#       - other segment.............. O |        0 |        0 |   NA / NA
# 
# Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
# Indentation reflects this.

在利用 readGAlignments() 读取基因组比对前,需要用函数 ScanBamParam() 构建一些参数。

flag1 <- scanBamFlag(isFirstMateRead=TRUE, 
                     isSecondMateRead=FALSE,
                     isDuplicate=FALSE, 
                     isNotPassingQualityControls=FALSE) 
## 作者在这里为了简化后面的流程,只选择了对应于第1个segment的比对序列
param1 <- ScanBamParam(flag=flag1, what="seq") 
gal1 <- readGAlignments(bamfiles[1], use.names=TRUE, param=param1) 
class(gal1)  ## 获得了一个GAlignments对象
# [1] "GAlignments"
# attr(,"package")
# [1] "GenomicAlignments"
gal1
# GAlignments object with 400242 alignments and 1 metadata column:
#                      seqnames strand       cigar    qwidth     start
#                         <Rle>  <Rle> <character> <integer> <integer>
#   ERR127306.26333541    chr14      +         72M        72  19069583
#   ERR127306.12926170    chr14      -         72M        72  19363755
#   ERR127306.26974899    chr14      +         72M        72  19369799
#    ERR127306.4055622    chr14      +         72M        72  19411097
#   ERR127306.13938407    chr14      +         72M        72  19411459
#                  ...      ...    ...         ...       ...       ...
#   ERR127306.10965452    chr14      -         72M        72 106981361
#   ERR127306.11919172    chr14      -         72M        72 106986528
#    ERR127306.3567919    chr14      +         72M        72 106989680
#   ERR127306.21510817    chr14      +         72M        72 106994763
#     ERR127306.661203    chr14      -         72M        72 107003171
#                            end     width     njunc |
#                      <integer> <integer> <integer> |
#   ERR127306.26333541  19069654        72         0 |
#   ERR127306.12926170  19363826        72         0 |
#   ERR127306.26974899  19369870        72         0 |
#    ERR127306.4055622  19411168        72         0 |
#   ERR127306.13938407  19411530        72         0 |
#                  ...       ...       ...       ... .
#   ERR127306.10965452 106981432        72         0 |
#   ERR127306.11919172 106986599        72         0 |
#    ERR127306.3567919 106989751        72         0 |
#   ERR127306.21510817 106994834        72         0 |
#     ERR127306.661203 107003242        72         0 |
#                                          seq
#                               <DNAStringSet>
#   ERR127306.26333541 TGAGAATGAT...ATGGCTGCAT
#   ERR127306.12926170 CCCCAGGTAT...TGGACACCAG
#   ERR127306.26974899 CATAGATGCA...TAACTTACCA
#    ERR127306.4055622 TGCTGGTGCA...CAGGAATCAG
#   ERR127306.13938407 CAGGAGGTAG...CTGTCTGGTC
#                  ...                     ...
#   ERR127306.10965452 GGGAGGCCCT...TGGAGAGAAC
#   ERR127306.11919172 CCTGAGAGCC...TCACTCTCTG
#    ERR127306.3567919 CAACTTTTAT...AGCAGGGAGG
#   ERR127306.21510817 CAAAGCTGGA...TTTTGCTGCC
#     ERR127306.661203 CATGACTTGA...TACAGTATTT
#   -------
#   seqinfo: 93 sequences from an unspecified genome

2.2 各类基础访问函数

seqnames(gal1)
# factor-Rle of length 400242 with 1 run
#   Lengths: 400242
#   Values :  chr14
# Levels(93): chr1 chr10 chr11 ... chrUn_gl000249 chrX chrY
head(seqlevels(gal1))
# [1] "chr1"                  "chr10"                
# [3] "chr11"                 "chr11_gl000202_random"
# [5] "chr12"                 "chr13"        
strand(gal1)
# factor-Rle of length 400242 with 111564 runs
#   Lengths:  1  1 15 20  2  2  2  3  1  2 ...  1  3  3  1  2  2  2  2  1
#   Values :  +  -  +  -  +  -  +  -  +  - ...  -  +  -  +  -  +  -  +  -
# Levels(3): + - *
head(cigar(gal1))
# [1] "72M" "72M" "72M" "72M" "72M" "72M"
head(njunc(gal1))
# [1] 0 0 0 0 0 0

3. 处理已比对核苷酸序列

当由高通量测序实验产生的 reads 被比对到参考基因组后,一般来说人们提出的问题分为两大类: positional onlynucleotide-related.

Positional only questions are about the position of the alignments with respect to the reference genome.

Nucleotide-related questions are about the nucleotide content of the alignments.

针对比对的 "nucleotide-related" 问题, GenomicAlignments 提供了不同层次的工具。

3.1 获得原始序列

BAM 格式中的 read 序列是“反向互补”的,当它们与反义链比对时,我们需要将它们重新“反向互补”,得到原始询问序列 (original query sequences).

确定需要被“反向互补”的 reads:

mcols(gal1)$seq 
#   A DNAStringSet instance of length 400242
#          width seq
#      [1]    72 TGAGAATGATGATTTCCAATTTCATCC...TGAACTCATCATTTTTTATGGCTGCAT
#      [2]    72 CCCCAGGTATACACTGGACTCCAGGTG...GATACACACACTCAAGGTGGACACCAG
#      [3]    72 CATAGATGCAAGAATCCTCAATCAAAT...ACAGCACATTAAAAAGATAACTTACCA
#      [4]    72 TGCTGGTGCAGGATTTATTCTACTAAG...AATCCACTTTCTTATCTCAGGAATCAG
#      [5]    72 CAGGAGGTAGGCTGTGCGTTCAGCAGT...CTGGTTGATCACCTTGACTGTCTGGTC
#      ...   ... ...
# [400238]    72 GGGAGGCCCTTTATATAACCATCAGGT...CTAATAGGATAAAAGCATGGAGAGAAC
# [400239]    72 CCTGAGAGCCCCTTGCTGTCCTGAGCA...GCCCTCTGGTGTTCTGATCACTCTCTG
# [400240]    72 CAACTTTTATTTCTTAAACACAAGACA...TCTCAGGTGAGCTGTCGAGCAGGGAGG
# [400241]    72 CAAAGCTGGATGTGTCTAGTGTTTTTA...AATAAGAGCATGTGTGGTTTTGCTGCC
# [400242]    72 CATGACTTGATGGCTGGAACAAATACA...AATACTAGCCTTTGCCATACAGTATTT
oqseq1 <- mcols(gal1)$seq 
is_on_minus <- as.logical(strand(gal1) == "-") 
oqseq1[is_on_minus] <- reverseComplement(oqseq1[is_on_minus]) 

每个 read 都会被比对不止一次,所以 gal1 中肯定有重复。

is_dup <- duplicated(names(gal1))
table(is_dup)
# is_dup
#  FALSE   TRUE 
# 393300   6942 

去重:

oqseq1 <- oqseq1[!is_dup] 
length(oqseq1)  ## 检查一下
# [1] 393300
length(gal1)
# [1] 400242

3.2 错配、插入缺失标记、缺口 (Mismatches, indels, and gaps )

由于比对过程中是容许一些错配、插入缺失标记、缺口的,所以 mcols(gal1)$seq 中的序列并不是完全与参考基因组匹配的。

"CIGAR" 包含着这些“小错误”出现在比对中的位置信息。

The ‘CIGAR’ (Compact Idiosyncratic Gapped Alignment Report) string is how the SAM/BAM format represents spliced alignments.

For CIGAR =’6M’, this means there are 6 exact matches to the reference.

head(sort(table(cigar(gal1)), decreasing=TRUE)) 
# 
#        72M 35M123N37M  64M316N8M 38M670N34M 18M123N54M  2M131N70M 
#     301920        134        134        133        119         96 
colSums(cigarOpTable(cigar(gal1)))   ## 总览
#         M         I         D         N         S         H         P 
#  28815928      1496       818 330945002         0         0         0 
#         =         X 
#         0         0 
table(njunc(gal1))   ## gap的数量
# 
#      0      1      2      3 
# 303622  94557   2052     11 

References


最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
上一篇下一篇

猜你喜欢

热点阅读