Selecting highly variable genes with scanpy

2022-01-16  生信阿拉丁

Author: 童蒙
Editor: angelica

The scanpy code walkthrough series is back, so don't miss it. Today's topic: selecting highly variable genes.

Function

scanpy.pp.highly_variable_genes

What it does

Identifies highly variable genes. By default the function expects logarithmized data; with flavor='seurat_v3' it expects raw count data instead.

The flavor parameter selects which algorithm is used: Seurat ('seurat'), Cell Ranger ('cell_ranger'), or Seurat v3 ('seurat_v3').

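The Seurat and Cell Ranger flavors use a dispersion-based method that yields a normalized dispersion for each gene. Genes are first binned by mean expression; the mean and standard deviation (SD) of the dispersions of the genes in each bin are then computed and used to scale each gene's dispersion. Highly variable genes are therefore selected relative to the other genes in the same mean-expression bin.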
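The binning and scaling described above can be sketched with NumPy alone. This is a simplified illustration, not scanpy's implementation: the helper name `normalized_dispersion`, the equal-width bins, the toy Poisson data, and the choice of 5 bins (so that each bin of the small toy dataset holds enough genes) are all assumptions made for the example, and the log transforms that the Seurat flavor applies are left out.

```python
import numpy as np

def normalized_dispersion(X, n_bins=20):
    """Z-score each gene's dispersion within its mean-expression bin.

    X is a cells x genes array. Hypothetical helper, for illustration only.
    """
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)
    mean = np.where(mean == 0, 1e-12, mean)  # avoid division by zero
    dispersion = var / mean

    # Assign every gene to one of n_bins equal-width bins of mean expression.
    edges = np.linspace(mean.min(), mean.max(), n_bins + 1)
    bin_idx = np.digitize(mean, edges[1:-1])  # values in 0 .. n_bins - 1

    disp_norm = np.full(mean.shape, np.nan)
    for b in np.unique(bin_idx):
        in_bin = bin_idx == b
        if in_bin.sum() < 2:
            continue  # a single-gene bin has no standard deviation
        d = dispersion[in_bin]
        sd = d.std(ddof=1)
        if sd > 0:
            disp_norm[in_bin] = (d - d.mean()) / sd
    return disp_norm

rng = np.random.default_rng(0)
# Toy counts: 300 cells x 200 genes with gene-specific Poisson rates.
rates = rng.uniform(0.5, 5.0, size=200)
X = rng.poisson(rates, size=(300, 200)).astype(float)
# Make gene 0 overdispersed: it is switched off in roughly half of the cells.
X[:, 0] = 2 * rng.poisson(2.0, size=300) * rng.integers(0, 2, size=300)

disp_norm = normalized_dispersion(X, n_bins=5)
# Rank genes by normalized dispersion, treating undefined values as lowest.
order = np.argsort(-np.nan_to_num(disp_norm, nan=-np.inf))
top10 = order[:10]
print(top10)
```

Because the score is relative to the bin, a gene only stands out if it is more dispersed than genes of similar mean expression, which is the point of binning in the first place.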

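The Seurat v3 flavor instead computes a normalized variance for each gene. The data are first standardized (z-score normalization per gene) using a regularized standard deviation; the normalized variance is then the variance of each gene after this transformation. Genes are ranked by normalized variance to obtain the highly variable genes.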
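The Seurat v3 idea can be sketched in the same spirit. Several things here are assumptions rather than scanpy's code: the helper name `normalized_variance` is hypothetical, a quadratic polynomial fit stands in for the loess fit so that the example needs only NumPy, and the clipping of standardized values at sqrt(number of cells) reflects my understanding of the Seurat approach.

```python
import numpy as np

def normalized_variance(counts):
    """Variance of each gene after z-scoring with a regularized std.

    counts is a cells x genes array of raw counts. Hypothetical helper.
    """
    n_cells = counts.shape[0]
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    not_const = var > 0

    # Fit the mean-variance trend in log10 space. A quadratic polynomial is
    # swapped in for the loess fit to keep this sketch NumPy-only.
    x = np.log10(mean[not_const])
    y = np.log10(var[not_const])
    coeffs = np.polyfit(x, y, deg=2)
    expected_log_var = np.zeros(counts.shape[1])
    expected_log_var[not_const] = np.polyval(coeffs, x)
    reg_std = np.sqrt(10 ** expected_log_var)  # regularized standard deviation

    # Standardize, clip large values (assumed cap: sqrt(n_cells)),
    # then take the per-gene variance of the transformed data.
    z = (counts - mean) / reg_std
    z = np.clip(z, None, np.sqrt(n_cells))
    return (z ** 2).sum(axis=0) / (n_cells - 1)

rng = np.random.default_rng(1)
rates = rng.uniform(0.5, 5.0, size=200)
counts = rng.poisson(rates, size=(300, 200)).astype(float)
# Gene 0 is overdispersed relative to the Poisson trend of the other genes.
counts[:, 0] = 2 * rng.poisson(2.0, size=300) * rng.integers(0, 2, size=300)

norm_var = normalized_variance(counts)
ranking = np.argsort(-norm_var)  # most variable gene first
print(ranking[:5], norm_var[ranking[:5]])
```

Genes that follow the fitted trend end up with a normalized variance near 1, so values well above 1 flag genes that vary more than their mean expression alone would predict.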

Important parameters

batch_key: If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by the number of batches in which they are an HVG. For dispersion-based flavors, ties are broken by normalized dispersion. If flavor = 'seurat_v3', ties are broken by the median (across batches) rank based on within-batch normalized variance.
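The merge rule for the dispersion-based flavors can be illustrated with a small NumPy sketch. The helper name `merge_batch_hvgs` is hypothetical, and breaking ties by the mean score across batches is my assumption about what "normalized dispersion" means once several batches are involved.

```python
import numpy as np

def merge_batch_hvgs(scores_per_batch, n_top_genes):
    """Combine per-batch HVG calls into one ranked selection.

    scores_per_batch is a batches x genes array in which a higher score means
    a more variable gene (for example, a normalized dispersion).
    """
    scores = np.asarray(scores_per_batch, dtype=float)
    n_batches, n_genes = scores.shape

    # Call the top n_top_genes genes as highly variable within each batch.
    is_hvg = np.zeros((n_batches, n_genes), dtype=bool)
    for b in range(n_batches):
        top = np.argsort(-scores[b])[:n_top_genes]
        is_hvg[b, top] = True

    n_batches_hvg = is_hvg.sum(axis=0)
    mean_score = scores.mean(axis=0)  # assumed tie-breaker

    # Sort by number of batches first, then by mean score to break ties.
    # np.lexsort treats its LAST key as the primary key.
    order = np.lexsort((-mean_score, -n_batches_hvg))
    return order[:n_top_genes], n_batches_hvg

# Three batches, four genes; gene 0 scores highly in every batch.
scores = [
    [5, 4, 1, 0],
    [4, 0, 5, 1],
    [5, 1, 0, 4],
]
selected, n_batches_hvg = merge_batch_hvgs(scores, n_top_genes=2)
print(selected.tolist())        # [0, 2]
print(n_batches_hvg.tolist())   # [3, 1, 1, 1]
```

Gene 0 is selected because it is an HVG in all three batches. Genes 1, 2 and 3 are each an HVG in only one batch, so the tie is broken by score and gene 2 takes the remaining slot.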

Code

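## _highly_variable_genes.py
# per-gene mean and variance across cells
mean, var = materialize_as_ndarray(_get_mean_var(X))
# now actually compute the dispersion
mean[mean == 0] = 1e-12  # set entries equal to zero to small value
dispersion = var / mean

# (the excerpt skips ahead: by this point genes have been binned by mean expression,
# and disp_mean_bin / disp_std_bin hold the per-bin mean and SD of the dispersions)
df['dispersions_norm'] = (
    df['dispersions'].values  # use values here as index differs
    - disp_mean_bin[df['mean_bin'].values].values
) / disp_std_bin[df['mean_bin'].values].values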

This gives each gene a normalized dispersion value, by which the genes are then ranked.

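# flavor='seurat_v3': regularized standard deviation from the mean-variance trend
mean, var = _get_mean_var(X_batch)
not_const = var > 0  # genes with zero variance are left out of the fit
estimat_var = np.zeros(X.shape[1], dtype=np.float64)

y = np.log10(var[not_const])
x = np.log10(mean[not_const])
model = loess(x, y, span=span, degree=2)   # loess regression of log10(var) on log10(mean)
model.fit()
estimat_var[not_const] = model.outputs.fitted_values
reg_std = np.sqrt(10 ** estimat_var)  # back-transform the fitted variance to a standard deviation

batch_counts = X_batch.astype(np.float64).copy()  # float copy of the counts (the excerpt ends here)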

References

https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html
