单细胞测序分析R-数据处理

单细胞三大R包之seurat

2020-07-08  本文已影响0人  小洁忘了怎么分身

其实我看单细胞的各种资料,已经断断续续看了有一段时间了。只是没有整理出来,就从今天这篇上路吧~且搜且查,加上问豆豆,开始发布笔记。

1.数据、代码和R包准备

代码:https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html

数据:https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

rm(list = ls())
options(stringsAsFactors = F)
library(Seurat)
library(dplyr)

2.读取数据

pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
dir("filtered_gene_bc_matrices/hg19/")
## [1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"
pbmc <- CreateSeuratObject(counts = pbmc.data, 
                           project = "pbmc3k", 
                           min.cells = 3, 
                           min.features = 200)
pbmc
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features)

#查看表达矩阵
exp = pbmc[["RNA"]]@counts;dim(exp)
## [1] 13714  2700
exp[30:34,1:4]
## 5 x 4 sparse Matrix of class "dgCMatrix"
##        AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG
## MRPL20              1              .              1              .
## ATAD3C              .              .              .              .
## ATAD3B              .              .              .              .
## ATAD3A              .              .              .              .
## SSU72               .              1              .              3

很多是.,也就是0,换了一种更省空间的表示方式。

3.质控

质控指标:

线粒体基因含量不能过高;

nFeature_RNA 不能过高或过低

为什么? nFeature_RNA是每个细胞中检测到的基因数量。nCount_RNA是细胞内检测到的分子总数。nFeature_RNA过低,表示该细胞可能已死/将死或是空液滴。高nCount_RNA和/或nFeature_RNA表明“细胞”实际上可以是两个或多个细胞。结合线粒体基因count数除去异常值,即可除去大多数双峰/死细胞/空液滴,因此它们过滤是常见的预处理步骤。 参考自:https://www.biostars.org/p/407036/

3.1 质控指标的可视化

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data, 3)
##                orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC     pbmc3k       3147         1129  0.8897363
VlnPlot(pbmc, 
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
        ncol = 3)
image.png

根据这个图,确定了过滤标准是:

nFeature_RNA在200~2500之间;线粒体基因占比在5%以下。

3.2 三者之间的相关性

plot1 <- FeatureScatter(pbmc, 
                        feature1 = "nCount_RNA", 
                        feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, 
                        feature1 = "nCount_RNA", 
                        feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))
image.png

可见,绝大多数细胞的nCount_RNA与percent.mt在正常范围内,percent.mt异常高时,nCount_RNA则异常低,反之亦然;nCount_RNA和nFeature_RNA基本成正比。

过滤走起

pbmc <- subset(pbmc, 
               subset = nFeature_RNA > 200 & 
                 nFeature_RNA < 2500 & 
                 percent.mt < 5)
dim(pbmc)
## [1] 13714  2638

可以看到过滤掉了几十个细胞。

4.数据归一化

pbmc <- NormalizeData(pbmc)
pbmc@assays$RNA[30:34,1:3]
## 5 x 3 sparse Matrix of class "dgCMatrix"
##        AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
## MRPL20       1.635873       .              1.429744
## ATAD3C       .              .              .       
## ATAD3B       .              .              .       
## ATAD3A       .              .              .       
## SSU72        .              1.111715       .

5.找差异基因HVGs

# 有三种算法:vst、mean.var.plot、dispersion;默认选择2000个HVG
pbmc <- FindVariableFeatures(pbmc, 
                             selection.method = "vst", 
                             nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10);top10
##  [1] "PPBP"   "LYZ"    "S100A9" "IGLL5"  "GNLY"   "FTL"    "PF4"   
##  [8] "FTH1"   "GNG11"  "S100A8"

可视化展示HVGs,可以给top差异基因加上标签

plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))
image.png

6.数据标准化

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc[["RNA"]]@scale.data[30:34,1:3]
##        AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
## MRPL20     1.50566156    -0.56259931     1.24504892
## ATAD3C    -0.04970561    -0.04970561    -0.04970561
## ATAD3B    -0.10150202    -0.10150202    -0.10150202
## ATAD3A    -0.13088200    -0.13088200    -0.13088200
## SSU72     -0.68454728     0.58087748    -0.68454728

7.降维

7.1 线性降维-PCA

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# 查看前三个主成分由哪些feature组成
print(pbmc[["pca"]], dims = 1:3, nfeatures = 5)
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11
#可视化
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
image.png
#每个主成分对应基因的热图
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
image.png
# 应该选多少个主成分进行后续分析
ElbowPlot(pbmc)
image.png

引用一下豆豆:它是根据每个主成分对总体变异水平的贡献百分比排序得到的图,我们主要关注”肘部“的PC,它是一个转折点(也即是这里的PC9-10),说明取前10个主成分可以比较好地代表总体变化

#PC1和2
DimPlot(pbmc, reduction = "pca")+ NoLegend()
image.png
细胞聚类
# 因为我们前面挑选了10个PCs,所以这里dims定义为10个
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5) #分辨率
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
# 结果聚成几类,用Idents查看
length(levels(Idents(pbmc)))
## [1] 9

resolution的大小决定着分群的数量,3000细胞时,0.4~1.2的分辨率表现较好。

8.非线性降维–umap + 聚类

pbmc <- RunUMAP(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, verbose = FALSE)
length(levels(Idents(pbmc)))
## [1] 11
DimPlot(pbmc, label = TRUE) + NoLegend()
image.png

可以看到,聚类后的细胞分成了9个群(cluster)

9.找biomarker基因

这个就类似于找每个cluster的标记,有一点差异基因的意思。

找cluster1中的marker基因

cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
##             p_val avg_logFC pct.1 pct.2    p_val_adj
## IL32 1.662549e-95 0.8407273 0.949 0.461 2.280020e-91
## LTB  3.900931e-89 0.8859959 0.980 0.640 5.349736e-85
## CD3D 4.422655e-73 0.6559217 0.919 0.428 6.065228e-69
## IL7R 1.359897e-68 0.8167820 0.743 0.324 1.864963e-64
## LDHB 8.461106e-66 0.6143392 0.941 0.614 1.160356e-61

比较cluster5和0,3的差别

cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
##                p_val avg_logFC pct.1 pct.2     p_val_adj
## S100A8 8.339302e-188  4.398822 0.987 0.075 1.143652e-183
## S100A9 5.303672e-165  4.573074 0.991 0.150 7.273455e-161
## LGALS2 5.247158e-163  2.733291 0.836 0.034 7.195953e-159
## TYROBP 5.566866e-160  3.240671 0.987 0.164 7.634399e-156
## CST3   9.451167e-155  3.300451 0.982 0.182 1.296133e-150
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
# 这一步过滤(进行了分类、排序、挑前2个)
tops = pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC);tops
## # A tibble: 22 x 7
## # Groups:   cluster [11]
##        p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene  
##        <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
##  1 6.63e-112     0.756 0.956 0.597 9.09e-108 0       LDHB  
##  2 2.44e- 86     0.900 0.481 0.118 3.35e- 82 0       CCR7  
##  3 3.90e- 89     0.886 0.98  0.64  5.35e- 85 1       LTB   
##  4 2.36e- 61     0.857 0.646 0.243 3.23e- 57 1       CD2   
##  5 0\.            2.99  0.936 0.041 0\.        2       CD79A 
##  6 9.48e-271     2.49  0.622 0.022 1.30e-266 2       TCL1A 
##  7 1.25e-174     2.05  0.956 0.243 1.71e-170 3       CCL5  
##  8 6.09e-164     2.00  0.593 0.058 8.36e-160 3       GZMK  
##  9 5.60e-225     1.93  0.973 0.133 7.68e-221 4       LGALS2
## 10 3.26e-137     2.18  1     0.562 4.47e-133 4       LYZ   
## # … with 12 more rows

多种可视化:

# 一个基因在所有cluster中的表达量
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# 基因表达量投射在降维聚类结果中
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
image.png
#对每个cluster的top10 gene 绘制热图
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
image.png
上一篇下一篇

猜你喜欢

热点阅读