[Legacy Spatial Transcriptomics] (2) Notes on Getting the Pipeline to Run

2020-03-25 Geekero

My old account was banned for no apparent reason, so I am reposting this from a secondary account.

I. Running st_pipeline

Figure: workflow overview

Figure: detailed workflow

1.1 Required input files

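FASTQ files (read 1 contains the spatial information and the UMI; read 2 contains the genomic sequence)
A genome index generated with STAR
An annotation file in GTF or GFF3 format (optional when using a transcriptome)
A file containing the barcodes and array coordinates (look in the "ids" folder and pick the right one). Essentially, this file has 3 columns (BARCODE, X and Y). It is also optional when the data is not barcoded (for example, RNA-Seq data).
A name for the dataset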
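
To make the 3-column barcode file concrete, here is a minimal Python sketch that parses and sanity-checks such a file. It assumes whitespace-separated columns, no header line and integer coordinates; the function name `read_ids` and the barcodes in the example are made up for illustration and are not part of st_pipeline.

```python
from typing import Dict, Iterable, Tuple


def read_ids(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Parse BARCODE / X / Y rows into {barcode: (x, y)}.

    Assumptions (not confirmed by the pipeline docs): whitespace-separated
    columns, no header line, integer array coordinates.
    """
    spots: Dict[str, Tuple[int, int]] = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 columns, got {len(fields)}")
        barcode, x, y = fields
        if barcode in spots:
            raise ValueError(f"line {number}: duplicate barcode {barcode}")
        spots[barcode] = (int(x), int(y))
    return spots


# Made-up barcodes for illustration only; real files live in the "ids" folder.
example = [
    "ACGTACGTACGTACGTAC 1 1",
    "TTGCATTGCATTGCATTG 1 2",
    "GGATCCGGATCCGGATCC 2 1",
]
spots = read_ids(example)
barcode_lengths = {len(barcode) for barcode in spots}
print(len(spots), "spots; barcode lengths:", barcode_lengths)
```

For a real file, `read_ids(open(path))` would do the same check; a mix of barcode lengths or a duplicate barcode is a hint that the wrong ids file was picked.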

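The ST pipeline has many parameters, mostly related to trimming, mapping and annotation, but the default values are usually good enough. Once the ST pipeline is installed, type "st_pipeline_run.py --help" to see a full description of the parameters.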

(base) [Robin@SC-201910280935 pipl_test]$ st_pipeline_run.py --help
usage: st_pipeline_run.py [-h] [--ids [FILE]] --ref-map [FOLDER]
                          [--ref-annotation [FILE]] --expName [STRING]
                          [--allowed-missed [INT]] [--allowed-kmer [INT]]
                          [--overhang [INT]]
                          [--min-length-qual-trimming [INT]]
                          [--mapping-rv-trimming [INT]]
                          [--contaminant-index [FOLDER]] [--qual-64]
                          [--htseq-mode [STRING]] [--htseq-no-ambiguous]
                          [--start-id [INT]] [--no-clean-up] [--verbose]
                          [--mapping-threads [INT]]
                          [--min-quality-trimming [INT]] [--bin-path [FOLDER]]
                          [--log-file [STR]] [--output-folder [FOLDER]]
                          [--temp-folder [FOLDER]]
                          [--umi-allowed-mismatches [INT]]
                          [--umi-start-position [INT]]
                          [--umi-end-position [INT]] [--keep-discarded-files]
                          [--remove-polyA [INT]] [--remove-polyT [INT]]
                          [--remove-polyG [INT]] [--remove-polyC [INT]]
                          [--remove-polyN [INT]] [--filter-AT-content [INT%]]
                          [--filter-GC-content [INT%]] [--disable-multimap]
                          [--disable-clipping]
                          [--umi-cluster-algorithm [STRING]]
                          [--min-intron-size [INT]] [--max-intron-size [INT]]
                          [--umi-filter] [--umi-filter-template [STRING]]
                          [--compute-saturation]
                          [--saturation-points SATURATION_POINTS [SATURATION_POINTS ...]]
                          [--include-non-annotated]
                          [--inverse-mapping-rv-trimming [INT]]
                          [--two-pass-mode] [--strandness [STRING]]
                          [--umi-quality-bases [INT]]
                          [--umi-counting-offset [INT]]
                          [--demultiplexing-metric [STRING]]
                          [--demultiplexing-multiple-hits-keep-one]
                          [--demultiplexing-trim-sequences DEMULTIPLEXING_TRIM_SEQUENCES [DEMULTIPLEXING_TRIM_SEQUENCES ...]]
                          [--homopolymer-mismatches [INT]]
                          [--star-genome-loading [STRING]]
                          [--star-sort-mem-limit STAR_SORT_MEM_LIMIT]
                          [--disable-barcode] [--disable-umi]
                          [--transcriptome] [--version]
                          fastq_files fastq_files
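
Reading the usage line above: the arguments not wrapped in square brackets (`--ref-map`, `--expName` and the two FASTQ files) are the required ones, and everything else is optional. The Python sketch below assembles such a command as an argument list. The helper `build_command` and the file paths are hypothetical placeholders; only the flag names come from the help output.

```python
import shlex
from typing import List, Optional


def build_command(
    ref_map: str,
    exp_name: str,
    read1: str,
    read2: str,
    ids: Optional[str] = None,
    ref_annotation: Optional[str] = None,
) -> List[str]:
    """Assemble an st_pipeline_run.py call using flags from the --help usage line."""
    cmd = ["st_pipeline_run.py", "--ref-map", ref_map, "--expName", exp_name]
    if ids is not None:
        cmd += ["--ids", ids]  # barcode / coordinate file
    if ref_annotation is not None:
        cmd += ["--ref-annotation", ref_annotation]  # GTF or GFF3 file
    cmd += [read1, read2]  # the two positional FASTQ files go last
    return cmd


# Placeholder paths for illustration only.
cmd = build_command(
    ref_map="./index",
    exp_name="demo",
    read1="R1.fastq",
    read2="R2.fastq",
    ids="ids.txt",
    ref_annotation="annotation.gtf",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where the pipeline is installed; building it as a list avoids shell quoting and line-continuation mistakes.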

1.2 Basic syntax

1.3 Running the test data to check that the pipeline works

# Work on a copy of the bundled test folder so the originals stay untouched
$ cd /opt/st_pipeline
$ cp -r tests test2
$ cd test2
$ mkdir index
$ gzip -d config/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
# Build the STAR genome index
$ STAR --runThreadN 10 --runMode genomeGenerate --genomeDir ./index \
     --genomeFastaFiles ./config/Homo_sapiens.GRCh38.dna.chromosome.19.fa \
     --sjdbGTFfile ./config/annotations/Homo_sapiens.GRCh38.79_chr19.gtf
# Run st_pipeline_run.py
$ mkdir results
$ st_pipeline_run.py --expName test2 \
     --ids ./config/idfiles/150204_arrayjet_1000L2_probes.txt \
     --ref-map ./index --log-file log.txt --output-folder ./results \
     --ref-annotation ./config/annotations/Homo_sapiens.GRCh38.79_chr19.gtf \
     ./input/arrayjet_1002/testdata_R1.fastq \
     ./input/arrayjet_1002/testdata_R2.fastq

This produces:

$ cd results/
$ ls
test2_reads.bed  test2_stdata.tsv
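
A quick way to sanity-check the counts matrix is to summarize it per spot. The Python sketch below assumes the layout that the downstream tool expects (genes as columns, one row per spot) and additionally assumes that each row label has the form "XxY" and that the header row starts with an empty cell; both are assumptions about the file, and the toy table is made up. The function name `summarize_counts` is hypothetical.

```python
import csv
import io


def summarize_counts(handle):
    """Return (gene names, per-spot summary) for a spots-by-genes TSV.

    Assumed layout: header row = empty cell followed by gene names,
    each data row = "XxY" spot label followed by one count per gene.
    """
    reader = csv.reader(handle, delimiter="\t")
    genes = next(reader)[1:]  # drop the empty cell above the row labels
    per_spot = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        spot = row[0]
        values = [float(v) for v in row[1:]]
        x, y = (float(c) for c in spot.split("x"))
        per_spot[spot] = {
            "x": x,
            "y": y,
            "total": sum(values),  # total counts in the spot
            "genes_detected": sum(v > 0 for v in values),
        }
    return genes, per_spot


# Toy matrix for illustration only (3 spots, 3 genes).
toy = "\tGeneA\tGeneB\tGeneC\n1x1\t3\t0\t5\n1x2\t0\t0\t2\n2x1\t1\t4\t0\n"
genes, per_spot = summarize_counts(io.StringIO(toy))
for spot, stats in per_spot.items():
    print(spot, stats["total"], stats["genes_detected"])
```

On the real output, `summarize_counts(open("test2_stdata.tsv"))` would give the number of genes and the per-spot totals, which is enough to see whether the run produced a non-empty matrix.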

II. Running Spatial Transcriptomics Analysis

(base) [Robin@SC-201910280935 data]$ unsupervised.py --help
usage: unsupervised.py [-h] --counts-table-files COUNTS_TABLE_FILES
                       [COUNTS_TABLE_FILES ...] [--normalization [STR]]
                       [--num-clusters [INT]] [--num-exp-genes [FLOAT]]
                       [--num-exp-spots [FLOAT]] [--min-gene-expression [INT]]
                       [--num-genes-keep [INT]] [--clustering [STR]]
                       [--dimensionality [STR]] [--use-log-scale]
                       [--alignment-files ALIGNMENT_FILES [ALIGNMENT_FILES ...]]
                       [--image-files IMAGE_FILES [IMAGE_FILES ...]]
                       [--num-dimensions [INT]] [--spot-size [INT]]
                       [--top-genes-criteria [STR]] [--use-adjusted-log]
                       [--tsne-perplexity [INT]] [--tsne-theta [FLOAT]]
                       [--outdir OUTDIR] [--color-space-plots]


optional arguments:
  -h, --help            show this help message and exit
  --counts-table-files COUNTS_TABLE_FILES [COUNTS_TABLE_FILES ...]
                        One or more matrices with gene counts per feature/spot (genes as columns)
  --normalization [STR]
                        Normalize the counts using:
                        RAW = absolute counts
                        DESeq2 = DESeq2::estimateSizeFactors(counts)
                        DESeq2PseudoCount = DESeq2::estimateSizeFactors(counts + 1)
                        DESeq2Linear = DESeq2::estimateSizeFactors(counts, linear=TRUE)
                        DESeq2SizeAdjusted = DESeq2::estimateSizeFactors(counts + lib_size_factors)
                        RLE = EdgeR RLE * lib_size
                        TMM = EdgeR TMM * lib_size
                        Scran = Deconvolution Sum Factors (Marioni et al)
                        REL = Each gene count divided by the total count of its spot
                        (default: DESeq2)
  --num-clusters [INT]  The number of clusters/regions expected to be found.
                        If not given the number of clusters will be computed.
                        Note that this parameter has no effect with DBSCAN clustering.
  --num-exp-genes [FLOAT]
                        The percentage of number of expressed genes (>= --min-gene-expression) a spot
                        must have to be kept from the distribution of all expressed genes (default: 1)
  --num-exp-spots [FLOAT]
                        The percentage of number of expressed spots a gene
                        must have to be kept from the total number of spots (default: 1)
  --clustering [STR]    What clustering algorithm to use after the dimensionality reduction:
                        Hierarchical = Hierarchical Clustering (Ward)
                        KMeans = Suitable for small number of clusters
                        DBSCAN = Number of clusters will be automatically inferred
                        Gaussian = Gaussian Mixtures Model
                        (default: KMeans)
  --dimensionality [STR]
                        What dimensionality reduction algorithm to use:
                        tSNE = t-distributed stochastic neighbor embedding
                        PCA = Principal Component Analysis
                        ICA = Independent Component Analysis
                        SPCA = Sparse Principal Component Analysis
                        (default: tSNE)
  --use-log-scale       Use log2(counts + 1) values in the dimensionality reduction step
  --alignment-files ALIGNMENT_FILES [ALIGNMENT_FILES ...]
                        One or more tab delimited files containing and alignment matrix for the images as
                                 a11 a12 a13 a21 a22 a23 a31 a32 a33
                        Only useful is the image has extra borders, for instance not cropped to the array corners
                        or if you want the keep the original image size in the plots.
  --image-files IMAGE_FILES [IMAGE_FILES ...]
                        When provided the data will plotted on top of the image
                        It can be one ore more, ideally one for each input dataset
                         It is desirable that the image is cropped to the array
                        corners otherwise an alignment file is needed
  --num-dimensions [INT]
                        The number of dimensions to use in the dimensionality reduction (2 or 3). (default: 2)
  --spot-size [INT]     The size of the spots when generating the plots. (default: 20)
  --top-genes-criteria [STR]
                        What criteria to use to keep top genes before doing
                        the dimensionality reduction (Variance or TopRanked) (default: Variance)
  --use-adjusted-log    Use adjusted log normalized counts (R Scater::normalized())
                        in the dimensionality reduction step (recommended with SCRAN normalization)
  --tsne-perplexity [INT]
                        The value of the perplexity for the t-sne method. (default: 30)
  --tsne-theta [FLOAT]  The value of theta for the t-sne method. (default: 0.5)
  --outdir OUTDIR       Path to output dir

$ unsupervised.py --counts-table-files test2_stdata.tsv --normalization DESeq2 --num-clusters 5 \
     --clustering KMeans --dimensionality tSNE --image-files HE_Rep6_MOB.jpg --use-log-scale
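
To illustrate two of the transforms described in the help text, the Python sketch below applies them to a toy spots-by-genes matrix: REL ("each gene count divided by the total count of its spot") and the log2(counts + 1) scaling behind --use-log-scale. This only mirrors the formulas as worded in the help output; it is not the tool's own code, and the function names are made up. The DESeq2 normalization used in the command above relies on R and is not reproduced here.

```python
import math


def rel_normalize(counts):
    """Divide every count by the total count of its spot (row)."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Empty spots stay at zero instead of dividing by zero.
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def log2_plus_one(counts):
    """Apply log2(counts + 1) element-wise."""
    return [[math.log2(value + 1) for value in row] for row in counts]


# Toy matrix for illustration only: rows are spots, columns are genes.
counts = [
    [3, 0, 5],
    [0, 0, 2],
    [1, 4, 0],
]
rel = rel_normalize(counts)
logged = log2_plus_one(counts)
print(rel[0])
print(logged[0])
```

After REL every non-empty row sums to 1, so spots with very different sequencing depth become comparable; the +1 inside the logarithm keeps zero counts at zero.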
