Seurat-Tutorial part1学习

2021-05-01 本文已影响0人 7f0a92cda77c

安装（基于Rstudio）

参考https://satijalab.org/seurat/articles/install.html

options("repos"=c(CRAN="http://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
install.packages('Seurat')
library(Seurat)

Getting Started with Seurat

参考https://satijalab.org/seurat/articles/get_started.html 选择了第一个学习下 Guided tutorial — 2,700 PBMCs https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

下载数据

https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz，选择直接打开后再进一步解压缩hg19文件夹，出现3个文件

image.png

barcodes文件

基因文件两列，Ensemble ID和基因名

Matrix::readMM("D:/单细胞测序/data/2021-5-1/hg19/matrix.mtx")#读取这个数据
#32738 x 2700 sparse Matrix of class "dgTMatrix"

这三列分别是统计了基因、barcode、matrix.mtx矩阵的行数；第三行是汇总下测的基因是32738个，barcodes是2700个，也就是捕获了2700个细胞，对应的matrix.mtx矩阵行数是2286884行

matrix.mtx文件随意取的数据集

读取数据-函数 Read10X

https://satijalab.org/seurat/reference/read10x

Read10X(
  data.dir,#3个文件：matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv files 10×所在目录
  gene.column = 2,#genes.tsv这个文件中基因名字所在的列，默认是2
  cell.column = 1,#barcodes.tsv这个文件中细胞名字所在的列，默认是1
  unique.features = TRUE,
  strip.suffix = FALSE
)

下载的数据加载到环境中

pbmc.data <- Read10X(data.dir = "D:/单细胞测序/data/2021-5-1/hg19/")#加载dataset
da <- as.matrix(pbmc.data)
View(da)#下面的结果看看下，这个函数整合了3个文件

pbmc.data-列是barcode,行是基因名

可以查看某些基因在这个矩阵中前30列的情况

pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
#3 x 30 sparse Matrix of class "dgCMatrix"
   #[[ suppressing 30 column names ‘AAACATACAACCAC-1’, ‘AAACATTGAGCTAC-1’, ‘AAACATTGATCAGC-1’ ... ]]
                                                                   
#CD3D  4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2  3 . . . . . 3 4 1 5
#TCL1A . .  . . . . . . 1 . . . . . . . . . . .  . 1 . . . . . . . .
#MS4A1 . 6  . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .

The .values in the matrix represent 0s (no molecules detected)

使用CreateSeuratObject筛选数据，创建可用的数据

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
#counts是上述读取的10×数据；project是自己所取的名字；min.cells-检测到的特征至少存在的细胞数目；
#min.features-捕获的单个细胞必须满足检测到的基因数阈值，如果只检测到一个基因，则会把这个单细胞去除掉

补充下：features
https://www.embopress.org/doi/full/10.15252/msb.20209539
The features were grouped into seven classes : Genetics, Gene type, Gene body, TSS, 3′UTR, Distal regulators and Gene context
The features can be broadly divided into seven categories: transcription start site (TSS, e.g. core promoter motifs, chromatin accessibility, TF binding), gene body features (e.g. gene length, number of exons), 3′untranslated regions (UTR, e.g. length, miRNA motifs), distal regulatory elements (e.g. TSS‐distal chromatin accessibility, TF occupancy), gene type (e.g. housekeeping genes, TFs), gene context (e.g. gene density, distance to the borders of topologically associated domains (TADs)) and genetics (e.g. the presence of eQTL and a cis genetic component）

Standard pre-processing workflow

分为 These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.

QC and selecting cells for further analysis

Seurat 包可以根据自定义参数，如筛选细胞数目，基因数这些对metric进行质控

每个捕获细胞中检测到的基因数量

低质量的细胞或者是空液滴能检测到的基因很少
细胞双峰或多重峰(2个或多个细胞被捕获到一个droplet)可能显示异常高的基因计数

类似的，一个细胞内检测到的分子总数 (correlates strongly with unique genes)
映射到线粒体基因组的Reads数目的百分比

低质量/坏死细胞会出现染色体污染
使用PercentageFeatureSet()函数计算metrics线粒体中所占counts值的百分比
线粒体基因的标识是以MT- 开始的（人的），mt- 开始的（鼠源）

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

Seurat的QC metrics探索

计算unique基因和总分子数 - CreateSeuratObject()

# 显示前5个细胞QC metrics
head(pbmc@meta.data, 5)
                  orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898

对QC metrics数据进行可视化，Seurat包已经自动化这个

#  violin plot对 QC metrics可视化 VLNPLOT https://satijalab.org/seurat/reference/vlnplot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

QC metrics可视化后的violin图

# FeatureScatter 可以展示 Feature 两两之间的关系
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot3 <- FeatureScatter(pbmc, feature1 = "nFeature_RNA", feature2 = "percent.mt")
plot1 + plot2 + plot3

对任意两个组合画图

根据Violin图对metrics提取合适的子集

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
#挑选的细胞（1）Feature_RNA数目控制在200~2500个之间；（2）选择线粒体基因的转录本数目低于10%

重新画图

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

对其筛选细胞后的特征的图