AB区室计算【juicer/homer】
2020-09-14 本文已影响0人
caokai001
Tips:
目前主流的计算AB区室工具: juicer/homer/cworld
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02095-z#Sec11-
juicer
: 第一篇HiC文章实验室提供的工具,计算多个分辨率每个bin的PC1 值,但是输出只有一列,不是bedgraph 文件,不太人性化;同时不能判断AB区室,需要自己手动校正. -
homer
: NGS分析套件,也有HiC分析流程
实例1:
HiC-Pro+homer
使用hicpro
进行预处理,用homer
进行AB区室分析,自己可以通过TSS
进行判断AB区室
(Tips:有时候也需要表观信号进行人工校正)
- Step1:安装配置
homer
## 安装homer.pl
wget http://homer.ucsd.edu/homer/configureHomer.pl
perl configureHomer.pl -install
echo PATH=/home/kcao/biosoft/homer/bin:$PATH
source ~/.bashrc
- Step2:下载基因组文件信息
下载内置
的hg19基因组文件
perl configureHomer.pl -install hg19
homer 内置一些基因组文件:比如hg19 genome
image.png
或者手动构建
基因组信息(非常见的物种:Pig)
大多数情况,直接指定基因组和注释文件位置就可以,比如annotation.pl 注释peak 文件.
## 1.gff 转换成gtf
(base) [00:16:03] kcao@login:~/Work/Pig_NCBI/Pig_source/other_ref
$ gffread GCF_000003025.5_Sscrofa10.2_genomic.gff -T -o GCF_000003025.5_Sscrofa10.2_genomic.gtf
$ less /public/home/kcao/Work/Pig_NCBI/Pig_source/other_ref/GCF_000003025.5_Sscrofa10.2_genomic.gtf
## 2.基因组信息建立
(base) [00:17:13] kcao@login:~/Work/Pig_NCBI/Pig_source/other_ref
$ loadGenome.pl -name pig_10.2 -org null \
-fasta /public/home/kcao/Work/Pig_NCBI/Pig_source/GCF_000003025.5_Sscrofa10.2_genomic.fa \
-gtf /public/home/kcao/Work/Pig_NCBI/Pig_source/other_ref/GCF_000003025.5_Sscrofa10.2_genomic.gtf
- Step3: 修改输入文件格式,将hicpro 输出文件,转换成homer格式
homer
格式如下:
image.png
hicpro validpair 格式: https://nservant.github.io/HiC-Pro/RESULTS.html#list-of-valid-interaction-products
#read name / chr_reads1 / pos_reads1 / strand_reads1 / chr_reads2 / pos_reads2 / strand_reads2 / fragment_size [/ allele_specific_tag]
格式转换
$ ls *.allValidPairs| while read sample ;do
echo $sample;
( nohup cat $sample | awk 'BEGIN{FS=OFS="\t"}{print NR,$2,$3,$4,$5,$6,$7}' > ${sample}.homer & );
done
- Step4:运行
homer
,计算AB
区室
运行三个分辨率
for sample in *.homer; do
echo "makeTagDirectory tag_${sample%%.*} -format HiCsummary ${sample};
analyzeHiC tag_${sample%%.*} -cpu 16 -res 100000 -bgonly;
analyzeHiC tag_${sample%%.*} -cpu 8 -res 100000 -norm -override >${sample%%.*}_whole_genome_matrix.txt;
runHiCpca.pl ${sample%%.*}_100kb_pca tag_${sample%%.*} -cpu 16 -res 100000 -superRes 100000 -genome pig_10.2 -corrDepth 1;
runHiCpca.pl ${sample%%.*}_100kb_pca tag_${sample%%.*} -cpu 16 -res 300000 -superRes 300000 -genome pig_10.2 -corrDepth 1;
runHiCpca.pl ${sample%%.*}_500kb_pca tag_${sample%%.*} -cpu 16 -res 500000 -superRes 500000 -genome pig_10.2 -corrDepth 1;"|qsub -d ./ -N ${sample%%.*};
done
实例2:
juicer + juicer
用juicer
上游分析,下游用juicer 计算AB区室
下面以Pig为例,染色体编号以NC命名的。
- Step1: 产生需要计算AB区室的染色体列表
(base) [14:06:58] kcao@login:~/Work/Pig_NCBI/Pig_HiC_PEFs/6_AB_juicer/PEFs
$ cat /public/home/kcao/Work/Pig_NCBI/Pig_source/GCF_000003025.5_Sscrofa10.2_genomic.fa | grep ">" >chrom.txt
$ cat chrom.txt | grep "NC" | awk '{print $1}' | sed 's/>//g' | grep -v "NC_000845.1" > chrom_2.txt
- Step2:批量运行
juicer eigenvector
分析
$ for sample in *.hic; do echo $sample; mkdir -p ./${sample%%.*};( cat chrom_2.txt | while read i; do echo $i ; java -jar /public/home/kcao/tools/Juicer/juicer/CPU/juicer_tools.jar eigenvector -p KR ${sample} ${i} BP 500000 ./${sample%%.*}/${sample/.hic/}.${i}_500kb.KR.txt ; done & ) ;done
- Step3:将计算的PC1结果加上前面三列及其header变成
bedgraph
格式
## 产生染色体长度
(base) [15:50:34] kcao@login:~/Work/Pig_NCBI/Pig_source
$ cat GCF_000003025.5_Sscrofa10.2_genomic.fa.fai | grep "NC" | grep -v "NC_000845.1" | awk '{print $1"\t"$2}' > GCF_000003025.5_Sscrofa10.2_genomic.fa.lensize
## 产生binsize为500k,bin区间信息
$ bedtools makewindows -g GCF_000003025.5_Sscrofa10.2_genomic.fa.lensize -w 500000 > chrom.bin.500k.size
## binsize 按照染色体分文件
$ mkdir sizeChrome && cd sizeChrome
$ mv ../chrom.bin.500k.size ./
$ cat chrom.bin.500k.size | awk '{print $0 >>$1}'
将bin区间添加到向量文件中
## Step1.创建最后的bedgraph文件[添加nohup与do冲突]
$ ls | grep -e "inter_30$" | while read sample ; do
( cat chrom_2.txt | while read i ; do
echo $i;
cat ./sizeChrome/${i} | paste - ./${sample%%.*}/${sample/.hic/}.${i}_500kb.KR.txt >>${sample}.inter.clean.bedgraph;
done & )
done
## Step2:添加头文件(注意换行符位置)
$ ls | grep -e "inter_30$" | while read id;
do
sample=$(basename $id /);
echo -e "track name=${sample} yLineMark="0.0" alwaysZero=on maxHeightPixels=100:75:11 visibility=full viewLimits=-1:1 autoScale=on type=bedGraph" >${sample}.inter_bedgraph.header;
cat ${sample}.inter_bedgraph.header ${sample}.inter.clean.bedgraph | grep -v "NaN" > ${sample}.inter.clean.checked.bedgraph ;
done
- Step4: 进行AB区室调整
由于PC1为正数的不一定为A区室,需要进行判断;这里使用的TSS数目进行判断,
tips: TSS 判断不一定准确,有时候需要加上表观信息进行判断
## TSS.bed
$ cat GCF_000003025.5_Sscrofa10.2_genomic.gene.bed | awk 'BEGIN{FS=OFS="\t"}{if($6=="+"){print $1,$2,$2+1,$4} else {print $1,$3-1,$3,$4}}' > GCF_000003025.5_Sscrofa10.2_genomic.TSS.bed
## 计算区室TSS数目
$
sample=PEFs_rep2_inter_30.inter.clean.bedgraph
refdir=/public/home/kcao/Work/Pig_NCBI/Pig_source/other_ref
echo -e "chr\tpos_TSS_num\tneg_TSS_num\tstatu" > rep2_AB_status.txt
cat chrom_2.txt | while read id ; do
echo $id;
grep $id $sample > ${id}.bedgraph
# pos_TSS
cat ${id}.bedgraph | awk '$4>0{print}' > ${id}.pos.bedgraph
pos_tss_num=`intersectBed -a ${id}.pos.bedgraph -b ${refdir}/GCF_000003025.5_Sscrofa10.2_genomic.TSS.bed -wa -wb| wc -l`
# neg_TSS
cat ${id}.bedgraph | awk '$4<0{print}' > ${id}.neg.bedgraph
neg_tss_num=`intersectBed -a ${id}.neg.bedgraph -b ${refdir}/GCF_000003025.5_Sscrofa10.2_genomic.TSS.bed -wa -wb| wc -l`
if [ $pos_tss_num -gt $neg_tss_num ]
then echo -e "$id\t$pos_tss_num\t$neg_tss_num\tTRUE" >> rep2_AB_status.txt
else echo -e "$id\t$pos_tss_num\t$neg_tss_num\tFALSE" >> rep2_AB_status.txt
fi
rm -rf ${id}.pos.bedgraph ${id}.neg.bedgraph ${id}.bedgraph
done
查看需要转换成染色体信息,最后一列为FALSE表示需要PC1取反
$ cat rep2_AB_status.txt
chr pos_TSS_num neg_TSS_num statu
NC_010443.4 2159 1413 TRUE
NC_010444.3 1506 1624 FALSE
NC_010445.3 1288 992 TRUE
NC_010446.4 1283 782 TRUE
NC_010447.4 936 884 TRUE
NC_010448.3 1977 913 TRUE
NC_010449.4 1344 1243 TRUE
NC_010450.3 744 693 TRUE
NC_010451.3 1215 984 TRUE
NC_010452.3 420 604 FALSE
NC_010453.4 257 536 FALSE
NC_010454.3 550 1025 FALSE
NC_010455.4 1483 898 TRUE
NC_010456.4 1266 823 TRUE
NC_010457.4 519 1117 FALSE
NC_010458.3 632 165 TRUE
NC_010459.4 285 748 FALSE
NC_010460.3 464 398 TRUE
NC_010461.4 704 591 TRUE
NC_010462.2 14 7 TRUE
(base) [11:39:08] kcao@login:~/Work/Pig_NCBI/Pig_HiC_PEFs/6_AB_juicer/PEFs
$ cat rep2_AB_status.txt | grep "FALSE" | cut -f 1 | xargs| sed "s/ /|/g"
NC_010443.4|NC_010446.4|NC_010447.4|NC_010450.3|NC_010451.3|NC_010456.4|NC_010457.4|NC_010460.3|NC_010461.4
#
(base) [11:30:51] kcao@login:~/Work/Pig_NCBI/Pig_HiC_PEFs/6_AB_juicer/PEFs
$ cat PEFs_rep2_inter_30.inter.clean.checked.bedgraph | awk 'BEGIN{FS=OFS="\t"}$1~/(NC_010443.4|NC_010446.4|NC_010447.4|NC_010450.3|NC_010451.3|NC_010456.4|NC_010457.4|NC_010460.3|NC_010461.4)$/{print $1,$2,$3,-$4} ;$1!~/(NC_010443.4|NC_010446.4|NC_010447.4|NC_010450.3|NC_010451.3|NC_010456.4|NC_010457.4|NC_010460.3|NC_010461.4)$/ {print $0}' > PEFs_rep2_inter_30.inter.clean.checked.ok.bedgraph
image.png
思考:
- 大部分情况,使用不同分辨率,
homer
可能计算结果一致性更好 - 从人性化来看,homer 输出为
bedgraph
及其进行了AB校正.
欢迎评论留言~😀