snakemake杂记:多个转录组比对到多个基因组得到多个bam

2023-05-17  本文已影响0人  小明的数据分析笔记本

我的需求是:

我有10个基因组,然后又12个转录组数据,然后将这个12个基因组数据分别比对到这个10个基因组,每个基因组得到12个bam文件,然后将每个基因组的12个bam文件合并 ,最终得到10个合并的bam文件

前面比对的步骤没有遇到问题

SAMPLES, = glob_wildcards("../../rnaseq/01.clean.reads/{sample}_clean_1.fq")
GENOMES, = glob_wildcards("{genome}.fa.masked")

GENOMES = [genome.split("/")[0] for genome in GENOMES if len(genome.split("/")) == 3]
GENOMES.remove("D1")

print(GENOMES)
print(SAMPLES)

print("Total genome size: ",len(GENOMES))
print("Total sample size: ",len(SAMPLES))




rule all:
    input:
        expand("{genome}/{genome}/ref_index.1.ht2",genome=GENOMES),
        expand("{genome}/{genome}/{sample}.sam",genome=GENOMES,sample=SAMPLES),
        expand("{genome}/{genome}/{sample}.sorted.bam.bai",genome=GENOMES,sample=SAMPLES),
        expand("{genome}/{genome}/merged.bam",genome=GENOMES),
        expand("{genome}/{genome}/merged.gtf",genome=GENOMES)


rule hisat2_build:
    input:
        ref = "{genome}/{genome}/{genome}.fa.masked"
    output:
        "{genome}/{genome}/ref_index.1.ht2"
    threads:
        8
    resources:
        mem_mb = 48000
    params:
        "{genome}/{genome}/ref_index"
    shell:
        """
        hisat2-build {input.ref} {params} -p {threads}
        """

rule hisat2_align:
    input:
        index = rules.hisat2_build.output,
        r1 = "../../rnaseq/01.clean.reads/{sample}_clean_1.fq",
        r2 = "../../rnaseq/01.clean.reads/{sample}_clean_2.fq"
    output:
        sam = "{genome}/{genome}/{sample}.sam"
    threads:
        8
    resources:
        mem_mb = 64000
    params:
        ref_index = rules.hisat2_build.params
    shell:
        """
        hisat2 -p {threads} --dta -x {params.ref_index} \
        -1 {input.r1} -2 {input.r2} -S {output.sam}
        """

rule samtools_sort:
    input:
        sam = rules.hisat2_align.output.sam
    output:
        bam = "{genome}/{genome}/{sample}.sorted.bam"
    threads:
        8
    resources:
        mem_mb = 64000
    shell:
        """
        samtools sort -@ {threads} -O bam -o {output.bam} {input.sam}
        """

rule samtools_index:
    input:
        rules.samtools_sort.output.bam
    output:
        "{genome}/{genome}/{sample}.sorted.bam.bai"
    threads:
        8
    resources:
        mem_mb = 24000
    shell:
        """
        samtools index {input}
        """

到合并的步骤最开始的写法是

rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=GENOMES,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=GENOMES,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这样写的问题是合并的时候每个基因组对应的是120个bam文件

然后换成下面的写法

rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个报错提示

name 'wildcards' is not defined

然后又换成下面的写法

def getGenomes(wildcards):
  return wildcards.genome

rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=getGenome,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=getGenome,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个还是报错,报错内容忘记截图了,而且报错很诡异

然后以关键词 搜索,找到了一个链接

https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command

这里面提到

'wildcards' is not "directly" defined in the input of a rule. You need to use a [function of 'wildcards'](http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#functions-as-input-files) instead. I'm not sure I understand exactly what you want to do, but you may try something like that.

这里提供了一个写法

def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,    
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...

安装这个思路我写我自己的

def getGenome01(wildcards):
    return expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES)

def getGenome02(wildcards):
    return expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)

rule samtools_merge:
    input:
        bams = getGenome01,
        bai = getGenome02
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个是可以正常运行的

推文记录的是自己的学习笔记,内容可能会存在错误,请大家批判着看,欢迎大家指出其中的错误

欢迎大家关注我的公众号

小明的数据分析笔记本

小明的数据分析笔记本 公众号 主要分享:1、R语言和python做数据分析和数据可视化的简单小例子;2、园艺植物相关转录组学、基因组学、群体遗传学文献阅读笔记;3、生物信息学入门学习资料及自己的学习笔记!

上一篇下一篇

猜你喜欢

热点阅读