R语言 -- 统计 Fastq 文件前十万条 reads 的长度

2020-08-28 本文已影响0人生信摆渡

由于fastqc的结果 reads 长度分布有点奇怪，所以想看看是原始 Fastq 文件的问题还是 Fastqc 处理的问题。取了前十万条 reads 来进行统计作图。

Fastq files

R 脚本在 QC 文件夹里面

R script

setwd("/sibcb2/bioinformatics2/wangjiahao/LungCGI/Fastq/QC")

Ns = list.files("..", ".gz", full.names = TRUE)

pdf(file = "Reads_length_counts.pdf")

    for(i in 1:length(Ns)){

        fastq = Ns[i]
        cat("Processing", fastq, "\n")
        system(paste("zcat", fastq, "| head -100000 > temp"))
        seqs = readLines("temp")[seq(2, 400000, 4)]
        counts = table(nchar(seqs))

        plot(counts, col = "red", type = "l", main = strsplit(fastq, "[./]")[[1]][2], xlab = "Length", ylab = "Counts")
        file.remove("temp")
    }
dev.off()

Run

Result

刚开始没想到联合 Linux 的 zcat 和 head 命令，而是直接 readLines 把Fastq 文件读完，很显然这样太慢了，经过改进快了几十倍我giao

果然是 Fastq 文件的 reads 本身就有问题

R语言 -- 统计 Fastq 文件前十万条 reads 的长度

Fastq files

R 脚本在 QC 文件夹里面

R script

Run

Result

猜你喜欢

热点阅读