ATAC-seq上游分析

2019-09-27 本文已影响0人大吉岭猹

参考教程：生信技能树ATAC-seq教程

1. 环境配置及软件安装

1.1. 安装conda并配置镜像

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes
cat ~/.condarc

1.2. 创建小环境并安装软件

conda create -n atac python=2
conda install -y sra-tools trim-galore samtools bedtools deeptools homer meme macs2 bowtie bowtie2 multiqc sambamba

1.3. 软件环境可以移植

conda activate target_env
conda env export > environment.yml
#换用户/服务器
conda env create -f exvironment.yml

2. （若使用公共数据）下载数据

在GEO公共数据库的SRA Run Selector中下载SRR_Acc_List.txt文件，并通过WinSCP软件上传至服务器
工作目录在/trainee/ZJU/iPS/

mkdir {ssr_public,fq_ATAC,clean_ATAC,qc_ATAC,align_ATAC,peaks_ATAC,motif_ATAC}
cd ssr_public
conda activate atac
cat SRR_Acc_List.txt | while read id; do (nohup prefetch $id -O ./ &); done #先跑单个数据，再跑循环

3. 转fastq

ls SRR* | while read id; do (nohup fastq-dump --gzip --split-3 -O ../fq_ATAC $id &); done

得到一系列.fq.gz文件

4. 测序数据的过滤和质控

需要自行制作config.raw文件，是3列，第一列占位用，没有意义，第二列是fq1的地址，第3列是fq2的地址。

cd ../fq_ATAC
#vim config.raw
cat config.raw | while read id;
    do echo $id
    arr=($id)
    fq2=${arr[2]}
    fq1=${arr[1]}
    nohup trim_galore -q 25 --phred33 --length 35 -e 0.1 --stringency 4 --paired -o clean_ATAC/ $fq1 $fq2 &
    done

得到的文件名字里有val
随后对测序质量进行评估

cd ../fq_ATAC
fastqc -t 5 *gz -o ./qc
multiqc ./qc #整合结果
cd ../clean_ATAC
fastqc -t 5 *gz -o ./qc
multiqc ./qc #整合结果

生成的.html格式报告可以拷到自己电脑上用浏览器查看

5. 比对

使用bowtie2，需要索引，这里使用的是服务器里已有的，如需下载索引，可以使用如下代码wget -4 -q ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/mm10.zip
解压后，索引共含六个文件，调用索引是用六个文件的共同前缀
将以下代码保存成脚本，然后运行nohup bash bowtie2.sh &

bowtie2_index=/trainee/huangjing/project/reference/index/bowtie2/mm10
#vim config.clean
#自行构建config.clean，和前面的一样
cat config.clean|while read id;
do
    echo $id
    arr=($id)
    fq2=${arr[2]}
    fq1=${arr[1]}
    tmp=${fq1##*/}
    tmpp=${tmp%%.*}
    sample=${tmpp%_*} #去头去尾
    bowtie2 -p 5 --very-sensitive -X 2000 -x $bowtie2_index -1 $fq1 -2 $fq2 | samtools sort -O bam -@ 5 -o - > ${sample}.raw.bam
    samtools index ${sample}.raw.bam #构建索引
    samtools flagstat ${sample}.raw.bam > ${sample}.raw.stat #评估比对率
    sambamba markdup --overflow-list-size 600000 --tmpdir='./' -r ${sample}.raw.bam  ${sample}.sambamba.rmdup.bam #去除PCR重复
    samtools index ${sample}.sambamba.rmdup.bam #构建索引
    samtools flagstat ${sample}.sambamba.rmdup.bam > ${sample}.sambamba.rmdup.stat #评估比对率
    samtools view -h -f 2 -q 30 ${sample}.sambamba.rmdup.bam | grep -v chrM | samtools sort -O bam -@ 5 -o - > ${sample}.last.bam #接下来只保留两条reads要比对到同一条染色体(Proper paired) ，还有高质量的比对结果(Mapping quality>=30)，顺便去除线粒体DNA
    samtools index ${sample}.last.bam #构建索引
    samtools flagstat ${sample}.last.bam > ${sample}.last.stat #评估比对率
    bedtools bamtobed -i ${sample}.last.bam  > ${sample}.bed #转bed文件
done

6. 自定义分箱count

# 首先拿到基因组坐标染色体坐标
pip install pyfaidx
faidx mm10.fa -i chromsizes > sizes_mm10.genome
# 然后使用 bedtools 工具基因组染色体坐标进行窗口划分
bedtools makewindows -g sizes_mm10.genome -w 700 > 700.bed
# 对bam文件进行计算
ls ../align_ATAC/*.last.bam | while read id; do(bedtools multicov -bams $id -bed 700.bed > $(basename $id .last.bam)_700_counts.txt);done

7. peaks calling

ls *.last.bam | while read id; do (macs2 callpeak -t $id  -g mm --nomodel --shift -100 --extsize 200 -n ${id%%.*} --outdir ../peaks_ATAC/); done

友情宣传

生信爆款入门-全球听（买一得五）（第5期）（可能是最后一期），你的生物信息学入门课
数据挖掘第3期（两天变三周，实力加量），医学生/临床医师首选技能提高课
生信技能树的2019年终总结，你的生物信息学成长宝藏
2020学习主旋律，B站74小时免费教学视频为你领路

ATAC-seq上游分析

1. 环境配置及软件安装

1.1. 安装conda并配置镜像

1.2. 创建小环境并安装软件

1.3. 软件环境可以移植

2. （若使用公共数据）下载数据

3. 转fastq

4. 测序数据的过滤和质控

5. 比对

6. 自定义分箱count

7. peaks calling

友情宣传

猜你喜欢

热点阅读