scaffolds 或者是挂载的基因组拆分成Contig

2022-09-04  本文已影响0人  球果假水晶蓝

How can i get contig file from the scaffold file (scaffold was generated from CLC). Is there any converter or programme? Ex : This is my scaffold : ACTGTGCATNNNNNNACGCTGCA and I want the contig file from scaffold like : Contig1 - ACTGTGCA and Contig2-ACGCTGCA

方法1:
perl 脚本

方法2:
try with FASTA/Q toolkit SeqKit and shell commands.

cat seqs.fa                \
    | seqkit fx2tab        \
    | cut -f 2             \
    | sed -r 's/n+/\n/gi'  \
    | cat -n               \
    | seqkit tab2fx        \
    | seqkit replace -p "(.+)" -r "Contig{nr}"
cat seqs.fa                \ # read file
    | seqkit fx2tab        \ # convert FASTA to tabular format: scaffold1   ACTGTGCATNNNNNNACGCTGCA
    | cut -f 2             \ # select the 2nd column          : ACTGTGCATNNNNNNACGCTGCA
    | sed -r 's/n+/\n/gi'  \ # replace the Ns with '\n'       : ACTGTGCAT
                             #                                : ACGCTGCA
    | cat -n               \ # output row number              :     1  ACTGTGCAT
                             # I just want a 2-column format  :     2  ACGCTGCA
    | seqkit tab2fx        \ # convert tabular to FASTA       : >1
                             #                                : ACTGTGCAT
    | seqkit replace -p "(.+)" -r "Contig{nr}" # renname sequence header, {nr} means row number

方法3 python2.7

#!/usr/bin/python2.7

from Bio import SeqIO
import getopt,sys,re


def usage():
    print "Usage: python contig_from_scaffold.py -i <input_scaffold_fasta> -o <output_contig_fasta>"

try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out=open(output_file, 'w')

sequence = ''.join([str(record.seq).strip() for record in SeqIO.parse(input_file, "fasta")])

m=re.sub('[nN]+','\n',sequence).split('\n')

for i in range(1,len(m)):
    out.write('>contig_'+str(i)+'\n')
    out.write(m[i]+'\n')

资料来源于 https://www.biostars.org/p/211400/
https://www.jianshu.com/p/02d0ed5b16bd

上一篇 下一篇

猜你喜欢

热点阅读