秘籍 | 计算基因组的有效大小(effective genome
2021-06-21 本文已影响0人
生信卷王
写在前面
- 最近在ATAC-Seq中call peaks的时候发现需要用到基因组的有效大小(effective genome size)
方法一
自己编写脚本
#coding=utf-8
import sys
aList=[]
fa_file = sys.argv[1]
with open(fa_file,'r') as f:
for line in f:
line = line.strip()
line = line.upper()
if not line.startswith(">"):
baseA = line.count("A")
baseT = line.count("T")
baseC = line.count("C")
baseG = line.count("G")
aList.extend([baseA, baseT, baseC, baseG])
# print(aList)
print("effective_genome_size =", sum(aList))
运行脚本,脚本后接基因组名
python genomeSize.py m38
运行结果
effective_genome_size = 2652783500
- 需要统计总染色体大小,在循环中加上
baseN = line.count("N")
,修改aList.extend([baseA, baseT, baseC, baseG, baseN])
方法二(推荐)
使用统计软件gnx
#下载与编译
git clone https://github.com/mh11/gnx-tools.git
cd gnx-tools
mkdir bin
javac -d bin/ src/uk/ac/ebi/gnx/*
# 没装ant,请安装,链接:https://downloads.apache.org/ant/binaries/
# wget https://downloads.apache.org/ant/binaries/apache-ant-1.10.10-bin.tar.gz
# tar -zvxf apache-ant-1.10.10-bin.tar.gz
# ant程序 在/apache-ant-1.10.10/bin里面
ant -f package.xml
#使用方法
java -jar gnx.jar 基因组名
软件使用
java -jar gnx.jar m38
运行结果
Total number of sequences: 66
Total length of sequences: 2730871774 bp
Shortest sequence length : 1976 bp
Longest sequence length : 195471971 bp
Total number of Ns in sequences: 78088274
N50: 130694993 (9 sequences) (1442871972 bp combined)
结果验证,使用 - 查看有效基因组大小
echo "2730871774-78088274"|bc
2652783500