[Single-Cell Omics] CITE-seq-Count: Analyzing CITE-seq Sequencing

2022-07-26  SciNote

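Introduction

Yesterday we covered the sequencing principle of CITE-seq and some simple applications. Many companies in China and abroad now offer CITE-seq sequencing; 10x Genomics, at least, supports it. Once you have the sequencing data (the reads files), the next step is to obtain the protein expression data.

The website created by the developers of CITE-seq (https://cite-seq.com/computational-tools/) lists the recommended software.

For read-level data, CITE-seq-Count can be used to obtain the count matrix.

Installing and using CITE-seq-Count

CITE-seq-Count is a tool written in Python, so it can be installed with pip: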

pip install CITE-seq-Count==1.4.3

How to use it

CITE-seq-Count -R1 TAGS_R1.fastq.gz -R2 TAGS_R2.fastq.gz -t TAG_LIST.csv -cbf X1 -cbl X2 -umif Y1 -umil Y2 -cells EXPECTED_CELLS -o OUTFOLDER

This command counts the reads and UMIs that map to the antibody DNA tag sequences.
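For scripted runs, the same command can be assembled from Python. The following is a minimal sketch, assuming CITE-seq-Count is installed and on the PATH. The helper name `build_cite_seq_count_cmd` and the example values are hypothetical, and only flags that appear in the command above are used.

```python
import shlex


def build_cite_seq_count_cmd(r1, r2, tags, cbf, cbl, umif, umil,
                             expected_cells, outfolder):
    """Return the CITE-seq-Count call as an argument list (hypothetical helper)."""
    return [
        "CITE-seq-Count",
        "-R1", r1,
        "-R2", r2,
        "-t", tags,
        "-cbf", str(cbf),
        "-cbl", str(cbl),
        "-umif", str(umif),
        "-umil", str(umil),
        "-cells", str(expected_cells),
        "-o", outfolder,
    ]


cmd = build_cite_seq_count_cmd(
    "TAGS_R1.fastq.gz", "TAGS_R2.fastq.gz", "TAG_LIST.csv",
    cbf=1, cbl=16, umif=17, umil=26,
    expected_cells=100, outfolder="OUTFOLDER",
)
# Print the command for inspection; to actually run it, pass `cmd` to
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell quoting problems when file paths contain spaces.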

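The figure below illustrates the expected structure of the input files TAGS_R1.fastq.gz and TAGS_R2.fastq.gz:

[Figure: expected structure of TAGS_R1.fastq.gz and TAGS_R2.fastq.gz]

The official documentation is already well written and clearly structured, and the English makes some of the terminology easier to understand, so I leave most of the content below untranslated. The tool is simple to use, and its output plugs directly into Seurat.

Software parameters: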

INPUT

-R1 READ1_PATH.fastq.gz

--read1 READ1_PATH.fastq.gz

-R2 READ2_PATH.fastq.gz

--read2 READ2_PATH.fastq.gz

-t tags.csv, --tags tags.csv

Antibody barcodes structure:

ATGCGA,First_tag_name
GTCATG,Second_tag_name
GCTAGTCGTACGA,Third_tag_name
GCTAGGTGTCGTA,Forth_tag_name

IMPORTANT: You need to provide only the variable region of the TAG in the tags.csv. Please refer to the following examples.

GCTAGTCGTACGA T AAAAAAAAAA
GCTAGTCGTACGA C AAAAAAAAAA
GCTAGTCGTACGA G AAAAAAAAAA
GCTGTCAGCATAC T AAAAAAAAAA
GCTGTCAGCATAC C AAAAAAAAAA
GCTGTCAGCATAC G AAAAAAAAAA

The tags.csv should only contain the part before the T

GCTAGTCGTACGA,tag1
GCTGTCAGCATAC,tag2
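The first example can be mimicked in a few lines of Python. This is a minimal sketch, assuming every full barcode has the layout variable region, then a single T, C or G, then a poly-A tail. `variable_region` is a hypothetical helper for illustration and is not part of CITE-seq-Count.

```python
import re

# Assumed layout: variable region, then one T/C/G, then the poly-A tail.
_TAIL = re.compile(r"^([ACGT]+)[TCG]A+$")


def variable_region(full_barcode):
    """Strip the trailing T/C/G and poly-A tail from a full barcode."""
    match = _TAIL.match(full_barcode.replace(" ", ""))
    if match is None:
        raise ValueError(f"unexpected barcode layout: {full_barcode!r}")
    return match.group(1)


full_barcodes = [
    "GCTAGTCGTACGA T AAAAAAAAAA",
    "GCTAGTCGTACGA C AAAAAAAAAA",
    "GCTGTCAGCATAC G AAAAAAAAAA",
]
# Deduplicate the variable regions while keeping first-seen order.
regions = list(dict.fromkeys(variable_region(b) for b in full_barcodes))
tags_csv = "\n".join(f"{seq},tag{i}" for i, seq in enumerate(regions, start=1))
print(tags_csv)
```

The printed text has the same shape as the tags.csv shown above: one variable region and one tag name per line.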

CGTAGTCGTAGCTA GCTAGTCGTACGA GCTAGCTGACT
CGTAGTCGTAGCTA AACGTAGCTATGT GCTAGCTGACT
CGTAGTCGTAGCTA GCTAGCATATCAG GCTAGCTGACT

The tags.csv should only contain the variable parts and use -trim 14 to trim the first 14 bases.

GCTAGTCGTACGA,tag1
AACGTAGCTATGT,tag2
GCTAGCATATCAG,tag3
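The effect of -trim 14 in this second example can be illustrated as follows. It is only a sketch of the idea (skip a fixed-length constant prefix, then read the tag), assuming a 14-base prefix and 13-base variable regions as in the reads above. It does not reproduce the tool's internal matching, and the "unmapped" label is an illustrative choice.

```python
TRIM = 14      # constant bases to skip at the start of R2 (as in -trim 14)
TAG_LEN = 13   # length of the variable region in this example

tags = {
    "GCTAGTCGTACGA": "tag1",
    "AACGTAGCTATGT": "tag2",
    "GCTAGCATATCAG": "tag3",
}

# The three example reads, written without the spaces used for display.
r2_reads = [
    "CGTAGTCGTAGCTA" + "GCTAGTCGTACGA" + "GCTAGCTGACT",
    "CGTAGTCGTAGCTA" + "AACGTAGCTATGT" + "GCTAGCTGACT",
    "CGTAGTCGTAGCTA" + "GCTAGCATATCAG" + "GCTAGCTGACT",
]

# Slice out the variable region that follows the trimmed prefix and look it up.
assigned = [tags.get(read[TRIM:TRIM + TAG_LEN], "unmapped") for read in r2_reads]
print(assigned)
```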

BARCODING

Positions of the cellular and UMI barcodes.

-cbf CB_FIRST, --cell_barcode_first_base CB_FIRST

-cbl CB_LAST, --cell_barcode_last_base CB_LAST

-umif UMI_FIRST, --umi_first_base UMI_FIRST

-umil UMI_LAST, --umi_last_base UMI_LAST

Example: Barcodes from 1 to 16 and UMI from 17 to 26, then this is the input you need:

-cbf 1 -cbl 16 -umif 17 -umil 26
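To make the coordinate convention concrete, here is a minimal sketch that slices an R1 read with these four values. It assumes the positions are 1-based and inclusive, as the wording "from 1 to 16" and "from 17 to 26" suggests. The function name and the read sequence are made up for illustration.

```python
def slice_read1(read1, cbf, cbl, umif, umil):
    """Return (cell_barcode, umi) using 1-based, inclusive positions."""
    cell_barcode = read1[cbf - 1:cbl]
    umi = read1[umif - 1:umil]
    return cell_barcode, umi


# Made-up 26-base R1 read: 16-base cell barcode followed by a 10-base UMI.
read1 = "AAACCCGGGTTTACGT" + "TTGCATGCAA"
cell_barcode, umi = slice_read1(read1, cbf=1, cbl=16, umif=17, umil=26)
print(cell_barcode, umi)
```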

--bc_collapsing_dist N_ERRORS, default 1

--umi_collapsing_dist N_ERRORS, default 2

--no_umi_correction

Cells

You have to choose either the number of cells you expect or give it a list of cell barcodes to retrieve.

-cells EXPECTED_CELLS, --expected_cells EXPECTED_CELLS

-wl WHITELIST, --whitelist WHITELIST

Example:

ATGCTAGTGCTA
GCTAGTCAGGAT
CGACTGCTAACG

FILTERING

Filtering for structure of the antibody barcode as well as maximum errors.

--max-error MAX_ERROR, default 3

Example:

If we have this kind of antibody barcode:

ATGCCAG

The script will be looking for ATGCCAG in R2

A MAX_ERROR of 1 will allow barcodes such as ATGTCAG, having one mismatch to be counted as valid.

There is a sanity check on the chosen MAX_ERROR value to make sure you are not allowing so many mismatches that your antibody barcodes could be confused with one another. Mismatches on cell or UMI barcodes are discarded.
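The mismatch counting and the idea behind such a sanity check can be sketched with a Hamming distance. This is an illustration under assumptions, not the tool's actual implementation: `max_error_is_safe` is a hypothetical function, the second tag is made up, and the criterion (twice MAX_ERROR must be smaller than the smallest distance between any two tags) is one standard way to guarantee that a read cannot be equally close to two tags.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def max_error_is_safe(tags, max_error):
    """Hypothetical check: True if max_error cannot make two tags ambiguous."""
    min_dist = min(hamming(a, b) for a, b in combinations(tags, 2))
    return 2 * max_error < min_dist


# One mismatch, as in the example above.
print(hamming("ATGCCAG", "ATGTCAG"))

# ATGCCAG is from the example; TACGGTC is a made-up second tag.
tags = ["ATGCCAG", "TACGGTC"]
print(max_error_is_safe(tags, max_error=1))
print(max_error_is_safe(tags, max_error=4))
```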

-trim N_BASES, --start-trim N_BASES, default 0

--sliding-window, default False

Example:

The TAG: ATGCTAGCT with a variable prefix: TTCAATTTCA

R2 reads:

TTCA ATGCTAGCTAAAAAAAAAAAAAAAAA
TTCAA ATGCTAGCTAAAAAAAAAAAAAAAA
TTCAAT ATGCTAGCTAAAAAAAAAAAAAAA
TTCAATT ATGCTAGCTAAAAAAAAAAAAAA
TTCAATTT ATGCTAGCTAAAAAAAAAAAAA
TTCAATTTC ATGCTAGCTAAAAAAAAAAAA
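The reads above show why a fixed -trim value cannot handle a variable-length prefix: the tag starts at a different offset in each read. A minimal sketch of the sliding-window idea, assuming exact matching only (how the tool combines the window with --max-error is not covered here), with the reads rebuilt from the prefix and TAG given in the example:

```python
TAG = "ATGCTAGCT"
PREFIX = "TTCAATTTCA"

# Rebuild the six example reads: 4 to 9 prefix bases, the tag, then poly-A
# padding so that every read has the same total length.
reads = [PREFIX[:n] + TAG + "A" * (21 - n) for n in range(4, 10)]

# Scan each read for the tag instead of assuming a fixed start position.
offsets = [read.find(TAG) for read in reads]
print(offsets)
```

Each offset equals the length of the prefix in that read, so no single trim length would place the tag at the start of all six reads.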

OUTPUT

-o OUTFOLDER, --output OUTFOLDER, default Results

--dense

OPTIONAL

-n FIRST_N, --first_n FIRST_N

-T N_THREADS, --threads N_THREADS, default Number of available cores

-u OUTFILE, --unmapped-tags OUTFILE, default unmapped.csv

-ut N_UNMAPPED, --unknown-top-tags N_UNMAPPED, default 50

--debug

Software output

Mtx format

The mtx (Matrix Market) format is a sparse format for matrices. It stores only non-zero values and is becoming popular in single-cell software.

The main advantage is that it requires less space than a dense matrix and that you can easily add different feature names within the same object.
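To see what "only stores non-zero values" means in practice, here is a small pure-Python sketch that serializes a dense matrix as Matrix Market coordinate text. The function name and the toy counts are made up, and real files should be written and read with a library such as Matrix (R) or scanpy, as described later in this post.

```python
def dense_to_mtx(dense):
    """Serialize a dense integer matrix (list of rows) as Matrix Market text."""
    entries = [
        (i, j, value)
        for i, row in enumerate(dense, start=1)   # Matrix Market is 1-based
        for j, value in enumerate(row, start=1)
        if value != 0                             # zeros are not stored
    ]
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{n_rows} {n_cols} {len(entries)}",      # rows, columns, stored entries
    ]
    lines += [f"{i} {j} {v}" for i, j, v in entries]
    return "\n".join(lines)


# Toy counts: 2 tags (rows) x 3 cells (columns); the values are made up.
counts = [
    [0, 5, 0],
    [2, 0, 0],
]
print(dense_to_mtx(counts))
```

Only two of the six cells are written out, which is where the space saving over a dense matrix comes from.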

For CITE-seq-Count, the output looks like this:

OUTFOLDER/
-- umi_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- read_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- unmapped.csv
-- run_report.yaml

File descriptions

Date: 2019-10-01
Running time: 13.86 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 1000000
Percentage mapped: 33
Percentage unmapped: 67
Uncorrected cells: 0
Correction:
    Cell barcodes collapsing threshold: 1
    Cell barcodes corrected: 57
    UMI collapsing threshold: 2
    UMIs corrected: 329
Run parameters:
    Read1_filename: fastq/test_R1.fastq.gz,fastq/test2_R1.fastq.gz
    Read2_filename: fastq/test_R2.fastq.gz,fastq/test2_R2.fastq.gz
    Cell barcode:
        First position: 1
        Last position: 16
    UMI barcode:
        First position: 17
        Last position: 26
    Expected cells: 100
    Tags max errors: 1
    Start trim: 0

Packages to read MTX

R:

I recommend using Seurat and their Read10X function to read the results.

With Seurat V3:

Read10X('OUTFOLDER/umi_count/', gene.column=1)

With Matrix:

library(Matrix)
matrix_dir = "/path_to_your_directory/out_cite_seq_count/umi_count/"
barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")
mat <- readMM(file = matrix.path)
feature.names = read.delim(features.path, header = FALSE, stringsAsFactors = FALSE)
barcode.names = read.delim(barcode.path, header = FALSE, stringsAsFactors = FALSE)
colnames(mat) = barcode.names$V1
rownames(mat) = feature.names$V1

Python:

I recommend using scanpy and their read_mtx function to read the results.

Example:

import scanpy
import pandas as pd
import os
path = 'umi_cell_corrected'
data = scanpy.read_mtx(os.path.join(path,'umi_count/matrix.mtx.gz'))
data = data.T
features = pd.read_csv(os.path.join(path, 'umi_count/features.tsv.gz'), header=None)
barcodes = pd.read_csv(os.path.join(path, 'umi_count/barcodes.tsv.gz'), header=None)
data.var_names = features[0]
data.obs_names = barcodes[0]

Reference

[1]https://hoohm.github.io/CITE-seq-Count/
