单细胞免疫组学分析练习-1:cellranger multi
BCR重排
B细胞从干细胞发育成成熟的B细胞之前,是要经历重链的VDJ基因重排和轻链的VJ基因重排的;成熟的B细胞,迁移到外周淋巴器官之后,在外界抗原的刺激下,则可能发生体细胞高频突变(SHM)。基因重排和体细胞高频突变,一个发生在B细胞成熟前,一个发生在B细胞成熟后。参考
TCR alpha链和BCR轻链由V-J-C构成
TCR beta链和BCR重链由V-D-J-C构成(D-diversity)
TCR和BCR的多样性
CDR3是V(D)J基因编码的核心区域,通常会包含V基因的一部分,然后D基因,还有J基因的一部分,因此是BCR或者TCR上最具有代表性的,最具有辨识度的一段区域,相当于一个人的脸。在绝大多数免疫研究中,会把CDR3序列作为定义和识别某一个特定BCR或者TCR的唯一依据。
1. 下载示例数据集
首先从10x genomics 官网上下载示例数据集
(1)10x genomics数据集链接
(2)我选择的是BALB/c小鼠的PBMC数据集
①表达数据
curl -O https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_5gex/vdj_v1_mm_balbc_pbmc_5gex_fastqs.tar
②BCR数据
curl -O https://cf.10xgenomics.com/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_b/vdj_v1_mm_balbc_pbmc_b_fastqs.tar
③TCR数据
curl -O https://cf.10xgenomics.com/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_t/vdj_v1_mm_balbc_pbmc_t_fastqs.tar
数据比较大,建议在linux终端用screen,如果curl
不行,就用wget
下载结束后:
解压.tar文件
tar xvf /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_b_fastqs.tar
解压大全
解压后:
是R1 R2 I1文件
复习fastq命名原则
[Sample Name]S1_L00[Lane Number][Read Type]_001.fastq.gz
Where Read Type is one of:
I1: Sample index read (optional)
R1: Read 1
R2: Read 2
2. Fastqc & MultiQC
[myh@bogon fastqc]$ cd /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs
[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ find /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs -name "*R1*.gz" > 5gex_id_1.txt
[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ find /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs -name "*R2*.gz" > 5gex_id_2.txt
[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ cat 5gex_id_1.txt 5gex_id_2.txt > 5gex_id_all.txt
[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ cat 5gex_id_all.txt | xargs fastqc -o /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs/fastqc_reports
使用multiQC整合
multiqc /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs/fastqc_reports -o /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs/fastqc_reports/multiqc_outs
QC结果:以TCR为例
Sequence Quality Histograms Per Sequence Quality Scores Adapter Content
说明下载下来的数据是clean data,已经去除了接头序列。
3.cellranger
wget https://cf.10xgenomics.com/supp/cell-vdj/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0.tar.gz
tar -xf /home/user/myh/ref_data/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0.tar.gz
Create a multi config CSV
利用nano
创建csv
nano multi_config.csv
Copy and paste this text into the newly created file, and customize the code
[gene-expression]
reference,/home/user/myh/ref_data/refdata-gex-mm10-2020-A
expect-cells,1000
[vdj]
reference,/home/user/myh/ref_data/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0
[libraries]
fastq_id,fastqs,lanes,feature_types,subsample_rate
vdj_v1_mm_balbc_pbmc_5gex,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs,1|2,gene expression,
vdj_v1_mm_balbc_pbmc_b,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_b_fastqs,1|2,VDJ-B,
vdj_v1_mm_balbc_pbmc_t,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs,1|2,VDJ-T,
Use your text editor's save command to save the file. In nano, save by typing CTRL+X → y → ENTER.
cell ranger multi
From within the working-directory/runs/ directory, run cellranger multi
cellranger multi --id=BALBC_multi --csv=/home/user/myh/raw_data/BALBC/multi_analysis/multi_config.csv
使用cellranger multi
而非cellranger vdj
可以同时分析转录组和TCR和BCR,更方便
完成:
解读outs文件:
官网解读
outs的结构
关于outs的解释
per_sample_outs/
: folder containing filtered data, i.e., only cell-associated barcodes in this sample.这里面的
sample_feature_bc_matrix
与cell ranger count得到的filtered_feature_bc_matrix
类似
4.将count文件中的bam文件转变为loom(为了方便scVelo计算RNA速率)
因为cellranger multi
的结果文件格式和cellranger count
的结果文件格式不全相同,所以更推荐使用velocyto run
而非velocyto run10x
velocyto run -b /home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/sample_feature_bc_matrix/barcodes.tsv.gz \
-o /home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/loom \
-m /home/user/myh/ref_data/mm10_allTracks.gtf \
/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/sample_alignments.bam \
/home/user/myh/ref_data/refdata-gex-mm10-2020-A/genes/genes.gtf
5.VDJ结果解读
(1)clonotypes.csv
- clonotype_id:The ID of the clonotype to which this consensus sequence was assigned. 是这个样本内部依次排下去的序号,不同样本之间不通用!相同clonotype_id在不同样本之间不代表同一克隆型!
- frequency: The observed number of cell barcodes with this clonotype.侧面部分反映了这个BCR/TCR的丰度。为什么说是部分反映呢,因为受到测序深度等方面的限制,可能会miss一些高频克隆型,因此这个参数有意义,同时也有局限性。
- proportion: The observed fraction of cell barcodes with this clonotype.
-
cdr3s_aa: A semicolon-delimited list of chain:sequence pairs, where chain is for example TRA, TRB, IGK, IGL, or IGH and sequence is the CDR3 amino acid sequence for that chain. 反映了CDR3区的氨基酸序列
-
cdr3s_nt:CDR3区域的核苷酸组合
(2) filtered_contig_annotations.csv
contig的定义:Contiguous sequence of bases produced by assembly.
- barcode: Cell-barcode for this contig.
- is_cell: True or False value indicating whether the barcode was called as a cell.
- contig_id:Unique identifier for this contig.
- high_confidence: True or False value indicating whether the contig was called as high-confidence (unlikely to be a chimeric sequence or some other artifact).
- length: The contig sequence length in nucleotides.
-
chain:The chain associated with this contig; for example, TRA, TRB, IGK, IGL, or IGH. A value of "Multi" indicates that segments from multiple chains were present.
- xx_gene:The highest-scoring xx segment。这部分提供了V基因和J基因的具体片段,D区通常较短又突变较多,受技术限制,常常捕捉不到。
- full-length: A contig is full-length if it matches the initial part of a V gene, continues on, and ultimately matches the terminal part of a J gene.
- productive: 定义
其余:nt:碱基序列,fwr:framework region
- raw_clonotype_id:The ID of the clonotype to which this cell barcode was assigned.
- raw_consensus_id: The ID of the consensus sequence to which this contig was assigned.
- exact_subclonotype_id: The ID of the exact subclontype to which this cell barcode was assigned.
结果文件夹中的fasta序列文件,存储每个clonotype的contig序列或者consensus序列。consensus序列可以理解成这个样本里,这个克隆型的所有细胞的序列的统一。就是假如这个克隆型在这个样本里表达了十个细胞,和reference比起来,有九个细胞的某个位点都由A突变成了T,而剩下的一个是由A变成G,那么consensus序列的这个位点就是T。它不是reference序列,而是样本内部的一种统一,这样做能有效排除个别细胞中低频SNP的干扰。