Comparing RNA-seq data before and after processing

2020-05-29  javen_spring

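Inputs: the raw fastq data before processing, and the fq.gz (fastq) data after trim_galore trimming. (Processing the data would have to be done inside the rna conda environment; this session only inspects files, so it runs in the plain conda (base) environment.)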

Commands
cd ${workdir}/04.clean
zcat SRR1039510_1.fastq.gz | paste - - - - > raw.txt
zcat SRR1039510_1_val_1.fq.gz |paste - - - - > trim.txt
awk '(length($4)<63){print$1}' trim.txt > ID
head -n 100 ID > ID100
grep -w -f ID100 trim.txt | awk '{print$1,$4}' > trim.sm
grep -w -f ID100 raw.txt | awk '{print$1,$4}' > raw.sm
paste raw.sm trim.sm | awk '{print$2,$4}' | tr ' ' '\n' |less -S
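
The awk calls above use the default whitespace splitting, so `$4` is the sequence only when the FASTQ header has exactly three space-separated fields, which appears to be the case for these files. A minimal sketch of a tab-aware variant follows, run on a made-up two-read FASTQ. The file names `toy.fastq`, `toy.txt`, and `toy.ID`, the read names, and the length cutoff of 10 are illustrative assumptions, not part of the original data.

```shell
# Toy FASTQ with two reads (made-up data); headers have three space-separated fields.
printf '%s\n' \
  '@toy.1 1 length=12' 'ACGTACGTACGT' '+' 'IIIIIIIIIIII' \
  '@toy.2 2 length=8'  'ACGTACGT'     '+' 'IIIIIIII' > toy.fastq

# paste - - - - joins every 4 lines with tabs: header, sequence, '+', quality.
paste - - - - < toy.fastq > toy.txt

# Split on tabs so the sequence is always field 2, whatever the header looks like.
# Print the read name (first word of the header) for reads shorter than 10 bases.
awk -F'\t' 'length($2) < 10 { split($1, h, " "); print h[1] }' toy.txt > toy.ID

cat toy.ID
```

With tab splitting the sequence is field 2 and the quality string is field 4, so the same command should keep working if the header format changes.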

Example run:

(base) May5 10:51:27 ~
$ workdir=$HOME/project/airway
(base) May5 10:58:59 ~
$ cd /trainee2/May5/project/airway/04.clean/
(base) May5 11:00:34 ~/project/airway/04.clean
$ ls
clean_qc                                   SRR1039510.trim.log                        SRR1039512_1.fastq.gz
filter.sh                                  SRR1039511_1.fastq.gz                      SRR1039512_1.fastq.gz_trimming_report.txt
SRR1039510_1.fastq.gz                      SRR1039511_1.fastq.gz_trimming_report.txt  SRR1039512_1_val_1.fq.gz
SRR1039510_1.fastq.gz_trimming_report.txt  SRR1039511_1_val_1.fq.gz                   SRR1039512_2.fastq.gz
SRR1039510_1_val_1.fq.gz                   SRR1039511_2.fastq.gz                      SRR1039512_2.fastq.gz_trimming_report.txt
SRR1039510_2.fastq.gz                      SRR1039511_2.fastq.gz_trimming_report.txt  SRR1039512_2_val_2.fq.gz
SRR1039510_2.fastq.gz_trimming_report.txt  SRR1039511_2_val_2.fq.gz                   SRR1039512.trim.log
SRR1039510_2_val_2.fq.gz                   SRR1039511.trim.log
(base) May5 11:01:01 ~/project/airway/04.clean
$ zcat SRR1039510_1.fastq.gz |paste - - - - >raw.txt  #join every 4 lines of the raw fastq data into one line
(base) May5 11:05:58 ~/project/airway/04.clean
$ wc -l raw.txt
25000 raw.txt   #number of raw reads
(base) May5 11:06:11 ~/project/airway/04.clean
$ zcat SRR1039510_1_val_1.fq.gz |paste - - - - >trim.txt   #join every 4 lines of the trim_galore-trimmed fq (fastq) data into one line
(base) May5 11:11:30 ~/project/airway/04.clean
$ wc -l trim.txt 
24448 trim.txt  #number of reads remaining after trim_galore (552 fewer than the raw file)
(base) May5 11:11:41 ~/project/airway/04.clean
$ less -S trim.txt 
(base) May5 11:19:34 ~/project/airway/04.clean
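$ awk '(length($4)<63){print $1}' trim.txt >ID   #for every row of trim.txt whose column-4 sequence is shorter than 63 bases, print column 1 (the read ID, i.e. the field starting with '@') and redirect it to the file ID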
(base) May5 11:52:08 ~/project/airway/04.clean
$ wc -l ID   #line count of ID, i.e. the number of reads that trim_galore trimmed to fewer than 63 bases
1282 ID
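
Beyond counting the shortened reads, the lengths they were trimmed to can be tabulated from the same flattened layout. The following is only a sketch on a made-up three-read file: `toy_trim.txt` and `toy_len.txt` are hypothetical names, and the file is split on tabs rather than whitespace.

```shell
# Toy flattened file in the same layout as trim.txt (tab-separated, made-up reads).
printf '@r1 1 length=6\tACGTAC\t+\tIIIIII\n@r2 2 length=6\tACGT\t+\tIIII\n@r3 3 length=6\tACGT\t+\tIIII\n' > toy_trim.txt

# Count how many reads have each sequence length, then sort numerically by length.
awk -F'\t' '{ n[length($2)]++ } END { for (l in n) print l, n[l] }' toy_trim.txt | sort -n > toy_len.txt

cat toy_len.txt
```

Each output line is a read length followed by the number of reads with that length.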
(base) May5 11:28:38 ~/project/airway/04.clean
$ less -N ID
(base) May5 11:33:11 ~/project/airway/04.clean
$ head -n 100 ID >ID100   #take the first 100 read IDs from ID and redirect them to ID100
(base) May5 11:34:41 ~/project/airway/04.clean
$ wc -l ID100
100 ID100
(base) May5 11:34:49 ~/project/airway/04.clean
$ grep -w -f ID100 trim.txt |  awk '{print $1,$4}' >trim.sm  #match the IDs in ID100 against trim.txt, print columns 1 and 4 of the matching rows, and redirect to trim.sm
(base) May5 11:37:23 ~/project/airway/04.clean
$ grep -w -f ID100 raw.txt |  awk '{print $1,$4,$8}' >raw.sm  #match the IDs in ID100 against raw.txt, print columns 1, 4, and 8 (read ID, sequence, quality string) of the matching rows, and redirect to raw.sm
(base) May5 11:39:13 ~/project/airway/04.clean
$ head -n 5 *.sm   #print the first 5 lines of raw.sm and trim.sm
==> raw.sm <==
@SRR1039510.8 CTCATTTTCATCTTCACCATCAACAGAGAGAGCAGCATACTTGCTTGCAGAACTGAACTTAGA HIIIJJJJIIIIIJIJIGIIJJJJIJHIIIIIIGIGJJIIIIJJJJJIJJJJJJIGGIGJJIJ
@SRR1039510.60 AACCTTGGATTTAGCGGCTGAGTACTTCCTCTTGTACATGGCCTTTCTGGAATACATGGCAGA HJJJJJJJHJJJIJIJJJJJIJBFHIIIJIJJJGFGIJIIJJHHJJIJJJJGIJHHHHHHFFF
@SRR1039510.108 GAATTAGCAACTGTGAAACGTCCTCAGGAGAGAAGCTACATGCTGCAGAGGTGGCAAGAAGAT HJJJJJJJJJJJJHIIIJIJHIJJJJJJIJJJJJJJJJJJJJJJJJJJJJJCHGIJJHHHHFF
@SRR1039510.154 TGGTCAGATAGCCCTTGTCTCCCGCCGCCAATCTCTGGCCCCTAGCAGCACGGAGCAGACGGC HHIABBHGIIJEIIIGGHIHGIGCGHG@DFBGGCCEC;CHHH2?EHFFB@BADBB########
@SRR1039510.159 TGAAGTCACTTTTATAGAAGCTGTGTTAAATTATGGAAAGTACCTTGGGAGATAAGCTCAAGA HJJJJIIJJJJJJJJIJJJJJJJJIIJJJJJJJJJJJJJJGGIJJJJJJJJJJIJJJIJIIJJ

==> trim.sm <==
@SRR1039510.8 CTCATTTTCATCTTCACCATCAACAGAGAGAGCAGCATACTTGCTTGCAGAACTGAACTT
@SRR1039510.60 AACCTTGGATTTAGCGGCTGAGTACTTCCTCTTGTACATGGCCTTTCTGGAATACATGGC
@SRR1039510.108 GAATTAGCAACTGTGAAACGTCCTCAGGAGAGAAGCTACATGCTGCAGAGGTGGCAAGA
@SRR1039510.154 TGGTCAGATAGCCCTTGTCTCCCGCCGCCAATCTCTGGCCCCTAGCAGCACGGAG
@SRR1039510.159 TGAAGTCACTTTTATAGAAGCTGTGTTAAATTATGGAAAGTACCTTGGGAGATAAGCTCA

(base) May5 11:45:32 ~/project/airway/04.clean
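$ paste raw.sm trim.sm | awk '{print$2,$3,$5}'  |tr ' ' '\n'  |less -N #paste raw.sm and trim.sm side by side (row by row), take columns 2, 3, and 5 (raw sequence, quality values, trimmed sequence), replace spaces with newlines (\n), and view with less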
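
The paste step works because raw.sm and trim.sm list the same reads in the same order. A minimal sketch of an alternative matches the rows by read ID and also reports how many bases were removed. It uses made-up sequences, and the file names `toy_raw.sm`, `toy_trim.sm`, and `toy_cmp.txt` are hypothetical.

```shell
# Made-up "ID sequence" tables standing in for raw.sm and trim.sm.
printf '@r1 ACGTACGTAC\n@r2 TTTTGGGGCC\n' > toy_raw.sm
printf '@r2 TTTTGGG\n@r1 ACGTACGT\n' > toy_trim.sm   # deliberately in a different order

# First pass (NR == FNR) stores raw sequences by ID; second pass looks up each trimmed read.
# Output columns: ID, raw sequence, trimmed sequence, number of bases removed.
awk 'NR == FNR { raw[$1] = $2; next }
     ($1 in raw) { print $1, raw[$1], $2, length(raw[$1]) - length($2) }' \
    toy_raw.sm toy_trim.sm > toy_cmp.txt

cat toy_cmp.txt
```

Because the lookup is keyed on the ID, the result does not depend on the two files being sorted identically.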
