
Notes on the human reference genome (work in progress)

2021-02-28  小贝学生信

1. How big is the human genome?

     chr    size_bp size_Mb
1   chr1  248956422  249M
2   chr2  242193529  242M
3   chr3  198295559  198M
4   chr4  190214555  190M
5   chr5  181538259  182M
6   chr6  170805979  171M
7   chr7  159345973  159M
8   chrX  156040895  156M
9   chr8  145138636  145M
10  chr9  138394717  138M
11 chr11  135086622  135M
12 chr10  133797422  134M
13 chr12  133275309  133M
14 chr13  114364328  114M
15 chr14  107043718  107M
16 chr15  101991189  102M
17 chr16   90338345   90M
18 chr17   83257441   83M
19 chr18   80373285   80M
20 chr20   64444167   64M
21 chr19   58617616   59M
22  chrY   57227415   57M
23 chr22   50818468   51M
24 chr21   46709983   47M
25   SUM 3088269832 3088M
# The mitochondrial genome (chrM) is not included in the sum; it is short, only 16,569 bp (about 16.6 kbp).
Source: NCBI
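
The per-chromosome sizes in the table can also be recomputed from a FASTA file. The following is a minimal sketch, assuming a decompressed FASTA such as hg38.fa is on disk; `fasta_lengths` is an illustrative helper name, not something from the original notes. Lengths counted this way include N (gap) bases.

```shell
# fasta_lengths FILE: print "<name><TAB><length>" for every record in a FASTA file.
# (fasta_lengths is a hypothetical helper written for this example.)
fasta_lengths() {
    awk '
        /^>/ {                          # a header line starts a new record
            if (name != "") print name "\t" len
            name = substr($1, 2)        # drop ">" and keep the first word only
            len = 0
            next
        }
        { len += length($0) }           # sequence line: add its length
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example (assumes hg38.fa has been downloaded and decompressed):
# fasta_lengths hg38.fa | sort -k2,2nr | head
```

UCSC also publishes a precomputed hg38.chrom.sizes file in the same bigZips directory, which avoids scanning the whole FASTA.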

2. Unusual sequence names (chrUn, random, alt)

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
# Extract the sequence IDs (FASTA header lines)
grep "^>" hg38.fa > chr.id
wc -l chr.id
#455 chr.id
head chr.id
####
>chr1
>chr10
>chr11
>chr11_KI270721v1_random
>chr12
>chr13
>chr14
>chr14_GL000009v2_random
>chr14_GL000225v1_random
>chr14_KI270722v1_random
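
The decompressed hg38.fa takes several gigabytes. If only the sequence names are needed, the headers can be read straight from the compressed file. This is a small sketch under that assumption; `fasta_ids` is a hypothetical helper name.

```shell
# fasta_ids FILE.gz: list record names from a gzip-compressed FASTA
# without writing the decompressed file to disk.
# (fasta_ids is a hypothetical helper written for this example.)
fasta_ids() {
    gzip -cd "$1" | grep '^>' | sed 's/^>//'
}

# Example (assumes hg38.fa.gz is in the current directory):
# fasta_ids hg38.fa.gz > chr.id
# wc -l chr.id
```

Unlike the `grep` above, this version strips the leading `>`, so later patterns can anchor on the start of the name. With the same hg38.fa.gz release, the line count should match the 455 reported above.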

2.1 Unlocalized scaffolds (*_random)

grep "random" chr.id > chr.random
wc -l chr.random
#42 chr.random
head chr.random
###
>chr11_KI270721v1_random
>chr14_GL000009v2_random
>chr14_GL000225v1_random
>chr14_KI270722v1_random
>chr14_GL000194v1_random
>chr14_KI270723v1_random
>chr14_KI270724v1_random
>chr14_KI270725v1_random
>chr14_KI270726v1_random

2.2 Unplaced scaffolds (chrUn_*)

grep "chrUn" chr.id > chr.chrUn
wc -l chr.chrUn
#127 chr.chrUn
head chr.chrUn
###
>chrUn_KI270302v1
>chrUn_KI270304v1
>chrUn_KI270303v1
>chrUn_KI270305v1
>chrUn_KI270322v1
>chrUn_KI270320v1
>chrUn_KI270310v1
>chrUn_KI270316v1
>chrUn_KI270315v1
>chrUn_KI270312v1

2.3 Alternate loci scaffolds (*_alt)

grep "alt" chr.id > chr.alt
wc -l chr.alt
#261 chr.alt
head chr.alt
###
>chr1_KI270762v1_alt
>chr1_KI270766v1_alt
>chr1_KI270760v1_alt
>chr1_KI270765v1_alt
>chr1_GL383518v1_alt
>chr1_GL383519v1_alt
>chr1_GL383520v2_alt
>chr1_KI270764v1_alt
>chr1_KI270763v1_alt
>chr1_KI270759v1_alt

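Note: the sequence names above follow UCSC's hg38 naming. The GRCh38 files from NCBI and Ensembl label the same sequences somewhat differently (for example, Ensembl writes 1 rather than chr1), but the sequences fall into the same few categories.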
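
The three `grep` passes can be folded into one classification step. In GRC terminology, unlocalized scaffolds are assigned to a chromosome but not to a position on it, unplaced scaffolds have no chromosome assignment, and alternate loci are alternative representations of a region. The sketch below assumes UCSC-style names and a header file such as chr.id; `classify_ids` and the category labels are illustrative choices, and the `_fix` category is included only because fix patches appear in UCSC annotation files.

```shell
# classify_ids FILE: count FASTA header names per category (UCSC naming assumed).
# (classify_ids and the category labels are hypothetical, chosen for this example.)
classify_ids() {
    sed 's/^>//' "$1" | awk '
        NF == 0    { next }                        # skip blank lines
        /_alt$/    { n["alt"]++;         next }    # alternate loci
        /_random$/ { n["unlocalized"]++; next }    # chromosome known, position unknown
        /^chrUn_/  { n["unplaced"]++;    next }    # chromosome unknown
        /_fix$/    { n["fix_patch"]++;   next }    # fix patches (later releases)
        { n["primary"]++ }                         # everything else: chr1..22, X, Y, M
        END {
            split("primary unlocalized unplaced alt fix_patch", order, " ")
            for (i = 1; i <= 5; i++)
                print order[i] "\t" (n[order[i]] + 0)
        }
    '
}

# Example: classify_ids chr.id
```

With the counts reported above (42 random, 127 chrUn, 261 alt out of 455), 25 names remain for the primary category: chr1 to chr22, chrX, chrY and chrM.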

3. How many annotated genes are on each chromosome?

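wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
gunzip hg38.refGene.gtf.gz
# Keep unique (sequence, gene_id) pairs and drop alt, random and fix sequences.
# refGene holds both coding (NM_) and non-coding (NR_) RefSeq transcripts, so the counts cover all annotated genes.
awk '{print $1, $10}' hg38.refGene.gtf | sort -k 2 | uniq | grep -v alt | grep -v random | grep -v fix | sort -k 1 > chr.gene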
cut -d" " -f 1 chr.gene | uniq -c
###
   1113 chr10
   1676 chr11
   1392 chr12
    632 chr13
    946 chr14
   1010 chr15
   1146 chr16
   1574 chr17
    434 chr18
   1791 chr19
   2832 chr1
    780 chr20
    414 chr21
    644 chr22
   1817 chr2
   1563 chr3
   1088 chr4
   1313 chr5
   1453 chr6
   1341 chr7
   1029 chr8
   1114 chr9
      1 chrM
   1157 chrX
    143 chrY
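
The counts above mix coding and non-coding genes. One way to restrict them to protein-coding genes is to keep only records whose transcript accession starts with NM_. This is a sketch, not the method used above: it assumes the GTF is tab-separated with `gene_id "..."` and `transcript_id "..."` in column 9, and `coding_genes_per_chr` is a hypothetical helper name.

```shell
# coding_genes_per_chr FILE.gtf: count distinct gene_ids per sequence,
# keeping only records whose transcript_id starts with "NM_".
# (coding_genes_per_chr is a hypothetical helper; the attribute layout is an assumption.)
coding_genes_per_chr() {
    awk -F'\t' '
        # skip alt, random, fix and unplaced sequences
        $1 ~ /_(alt|random|fix)$/ || $1 ~ /^chrUn_/ { next }
        $9 ~ /transcript_id "NM_/ {
            if (match($9, /gene_id "[^"]+"/)) {
                # strip the leading gene_id plus quote and the closing quote
                gene = substr($9, RSTART + 9, RLENGTH - 10)
                key = $1 SUBSEP gene
                if (!(key in seen)) { seen[key] = 1; count[$1]++ }
            }
        }
        END { for (chr in count) print chr "\t" count[chr] }
    ' "$1" | LC_ALL=C sort -k1,1
}

# Example: coding_genes_per_chr hg38.refGene.gtf
```

Dividing each result by the corresponding count in chr.gene gives a rough per-chromosome coding fraction.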

4. Downloading the reference genome

4.1 NCBI

wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz

4.2 Ensembl

wget -c http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

4.3 UCSC

wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
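
These files are large, and an interrupted transfer can leave a truncated archive. A quick sanity check after downloading is to test the gzip stream and count the FASTA records. This is a minimal sketch; `check_fasta_gz` is a hypothetical helper name.

```shell
# check_fasta_gz FILE.gz: verify gzip integrity, then print the number of FASTA records.
# (check_fasta_gz is a hypothetical helper written for this example.)
check_fasta_gz() {
    if ! gzip -t "$1" 2>/dev/null; then
        echo "corrupt or truncated: $1" >&2
        return 1
    fi
    gzip -cd "$1" | grep -c '^>'
}

# Example (assumes the UCSC file was downloaded as above):
# check_fasta_gz hg38.fa.gz
```

The record count will differ between the three sources, since each one packages a different set of scaffolds and patches.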

5. To be continued
