生信星球转录组培训第一期Day5--郝志刚

2019-06-10 本文已影响4人马连洼小法师

数据过滤

简介：

目的去掉低质量的碱基或接头，一般使用两款软件：trim_galore和trimmomatic,他们的去除是从不合格的位置往后全部切掉。
trim_galore：它去接头序列依赖的是Cutadapt，接头一般出现在3’末端。
它参数如下：

-q 表示设置的碱基质量阈值
--phred33表示质量体系。phred33和phred64，目前测序平台出的数据都是phred33.
--length表示测序片段长度的阈值，比如设置阈值50，若去除接头和低质量的碱基后，长度低于50bp，就直接把整条去掉。
--stringency设置数字表示：有几个碱基和接头序列是有重叠，默认值为1，意思是从3`的位置检测，出现一个碱基与常用接头有重叠就把这个碱基以后的序列都去掉。
--paird表示双末端
-o 输出路径

实际操作

合并文件并加路径

raw=~/RNAseq/raw
ls `pwd`*_1* >new_1
ls `pwd`*_2* >new_2
paste new_1 new_2>conf

遍历文件并进行过滤

# 首先进入脚本目录
cd $rna/script
# 然后新建一个脚本文件，并向其中添加内容
cat >filter.sh #回忆一下前面怎么用cat新建一个脚本
# 脚本的内容是：
rna=/My_PATH/RNAseq #这里注意修改成自己的
cat $rna/raw/conf | while read i
do
    fqs=($i)
    fq1=${fqs[0]}
    fq2=${fqs[1]}
    nohup trim_galore -q 25 --phred33 --length 50 -e 0.1 --stringency 3 --paired -o $rna/clean $fq1 $fq2 &
done

bash filter.sh

sh、bash都是linux中的解释器，但有的sh是没有数组array解释器的，因此当你使用sh去运行整个命令时，有可能会产生报错：
Syntax error: "(" unexpected (expecting "done")
但这并不是说代码写错了，而是选择错了解释器。当然如果你的sh可以成功解释代码，也是可以用的

过滤结果

比对

需要数据

过滤并质控合格的数据
参考基因组、注释文件

生成小数据

# 先配置clean data路径
cd $rna/clean
ls `pwd`/*_1*gz >1.clean
ls `pwd`/*_2*gz >2.clean
paste 1.clean 2.clean >clean.conf

新建test文件夹，进行测试

mkdir $rna/test && cd "$_"
cat >sample_fq.sh
# 输入以下内容
rna=/MY_PATH/rnaseq 
cat $rna/clean/clean.conf | while read i;do
    fqs=($i)
    fq1=${fqs[0]}
    fq2=${fqs[1]}
    file=`basename $fq1`
    #echo $file
    surname=${file%%_*}
    #echo $fq1 $fq2 $surname
    # 随机选了10000条reads
    seqtk sample $fq1 10000 >test.${surname}_1.fastq
    seqtk sample $fq2 10000 >test.${surname}_2.fastq
done

basename的语法是：basename[选项][参数]
其中：
选项：为有路径信息的文件名，如/home/test/test.txt

参数：指文件扩展名

basename用法

seqtk 是一款针对fasta/fastq 文件进行处理的小程序，有很多的功能，速度很快，很方便；
安装conda install seqtk

子程序

seqtk seq : 用途：
　　案例1：seq 序列常规转换

将fastq转换成fasta：

seqtk seq -a Sample_R1.fq.gz > Sample_R1.fa

将fastq序列做反向互补分析：

seqtk seq -r Sample_R1.fq.gz > Sample_Revc_R1.fq

案例2：sample 随机抽样

seqtk sample -s100 Sample_R1.fq.gz 10000

可直接对压缩文件进行序列随机提取，在提取R1和R2两个文件的时候，需要-s值一致，才能使提取的序列id号对应。

案例3：subseq 提取序列

根据输入的bed文件信息，将固定区域的序列提取出来：

seqtk subseq in.fa reg.bed > out.fa

根据输入的name list，提取相应名称序列：

seqtk subseq in.fq name.lst > out.fq

案例4：截取序列

切除reads的前5bp，以及后10bp：

seqtk trimfq -b 5 -e 10 in.fq > out.fq

hisat2比对

mapping的目的：找到reads在基因组上的位置。

构建索引

mkdir $rna/align/hisat2 && cd "$_"
cat >index.sh #输入以下内容
rna=/MY_PATH/rnaseq
genome=$rna/ref/hg19.fa
hisat2-build -p 10 $genome hg19

bash index.sh

比对

# 小数据的路径
ls $rna/test/*_1* >1.test
ls $rna/test/*_2* >2.test
paste 1.test 2.test >test.conf
#合并文件

cat >hisat2_aln.sh
#创建索引路径
index= $rna/align/hisat2/hg19
#读取目标文件
cat test.conf| while read i
do
    fqs=($i)
    fq1=${fqs[0]}
    fq2=${fqs[1]}
    sample=`basename $fq1 | cut -d '_' -f1`
    hisat2 -p 10 -x $index -1 $fq1 -2 $fq2  | samtools sort -O bam \
        -@ 10  -o ${sample}_hisat.sorted.bam
    samtools index -@ 10 -b ${sample}_hisat.sorted.bam 
done

hisat2
-x 指定基因组索引
-1 指定第一个fastq文件
-2 指定第二个fastq文件
samtools
sort对bam文件进行排序。
-o 输出文件名
index：建立索引
-b 默认下输出是 SAM 格式文件，该参数设置输出 BAM 格式

结果

作业

（1）
过滤后

image.png

过滤前

image.png
运用trim_glore可以看到去除接头的具体情况
（2）

hisat2
-x 指定基因组索引
-1 指定第一个fastq文件
-2 指定第二个fastq文件
samtools
sort对bam文件进行排序。
-p 线程数
-o 输出文件名
index：建立索引
-b 默认下输出是 SAM 格式文件，该参数设置输出 BAM 格式

# 小数据的路径
ls $rna/test/*_1* >1.test
ls $rna/test/*_2* >2.test
paste 1.test 2.test >test.conf
#合并文件

cat >hisat2_aln.sh
#创建索引路径
index= $rna/align/hisat2/hg19
#读取目标文件
cat test.conf| while read i
do
    fqs=($i)
    fq1=${fqs[0]}
    fq2=${fqs[1]}
#去除文件路径，只保留第一个文件名
    sample=`basename $fq1 | cut -d '_' -f1`
#hisat2  10线程 将1，2两个文件比对到$index建立好索引的基因组上，运用samtools排序，输出bam文件
    hisat2 -p 10 -x $index -1 $fq1 -2 $fq2  | samtools sort -O bam \
        -@ 10  -o ${sample}_hisat.sorted.bam
#对生成bam文件排序
    samtools index -@ 10 -b ${sample}_hisat.sorted.bam 
done

（3）

#建立文件夹
mkdir $rna/clean/trimmomatic && cd "$_" 
#运行脚本
cat >filter-2.sh
rna=/MY_PATH/rnaseq 
cat $rna/raw/conf| while read i
do
    fqs=($i)
    fq1=${fqs[0]}
    fq2=${fqs[1]}
#basename去掉目标文件绝对路径
    tri1=`basename $fq1`
    tri2=`basename $fq2`

    nohup trimmomatic PE -phred33 \
    -trimlog trim.logfile\
    $fq1 $fq2 \
    clean.$tri1 unpaired.$tri1 \
    clean.$tri2 unpaired.$tri2 \
    SLIDINGWINDOW:5:20 LEADING:5 TRAILING:5 MINLEN:20 &
done

bash filter-2.sh

image.png

生信星球转录组培训第一期Day5--郝志刚

数据过滤

简介：

实际操作

比对

hisat2比对

作业

猜你喜欢

热点阅读