bioRxiv bioinformatics highlights: November 2021
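Probably the biggest COVID-19 news last month was the Omicron variant that broke out in South Africa and other countries. So what is omicron? It is simply the Greek letter for the "small o". Many people confuse it with another Greek letter, omega, whose capital form Ω is familiar from school physics as the symbol for the ohm, the unit of electrical resistance. Some assume the name must have been a mathematician's idea, but mathematicians are just as puzzled. Michael Baym, an assistant professor of biostatistics at Harvard University, recently tweeted that this was the first time he had learned the difference between the two letters. That is not surprising: although Greek letters appear throughout mathematical formulas, Wikipedia's entries for omicron cover only the coronavirus variant, a wasp genus, and an alias for a bacterium. Perhaps that is exactly what the WHO had in mind with this name: humanity is as unfamiliar with this variant as it is with the letter omicron (clearly, this editor's reading differs from the WHO's official explanation).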
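In the fight against the COVID-19 pandemic, preprints have played a very important role thanks to how quickly and flexibly they can be posted. As of this writing, the medical preprint server medRxiv and the biology preprint server bioRxiv together host 23 articles on the topic, a fine example of speed being of the essence. One of them, a preprint from South African scientists and clinicians, sparked heated discussion online and was covered in more than 150 news stories, including Nature news [1]. For this issue we have picked three articles on Omicron, each analysing this mysterious variant from a different angle. While reading them, keep in mind that they are still only preprints. Probably nothing puts it better than the prominent notice beneath the title of every medRxiv article: This article is a preprint and has not been peer-reviewed. It reports new medical research that has yet to be evaluated and so should not be used to guide clinical practice.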
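As everyone knows, the Human Genome Project was an epoch-making feat on a par with the Apollo moon-landing programme. In 2008, the seventh year after the draft human genome was completed, scientists launched the 1000 Genomes Project (1KGP), which aimed to build a more comprehensive map of human genetic variation. That same year, plant genomicists were planning a feat of their own, the Arabidopsis 1001 Genomes Project, whose goal was to sequence the genomes of 1,001 ecotypes of the model plant Arabidopsis thaliana [2]. The name 1001 was reportedly chosen to outdo the human 1000 Genomes Project. But noting that the Latin name Arabidopsis is rendered literally in Chinese as "Arabian cress" (阿拉伯芥), this editor offers another explanation: perhaps the founders hoped the project would one day become a legend of plant genomics, much like the Arabian folk-tale collection One Thousand and One Nights! In 2016 the project reached 1,135 sequenced genomes from different Arabidopsis ecotypes. Although that goal has been met, the interpretation of these genomes has continued. In particular, with the development of long-read sequencing in recent years, the field seems to be taking a fresh look at some earlier results. Last month, Magnus Nordborg's team at the Austrian Academy of Sciences posted a preprint on bioRxiv claiming that many of the sequences called as SNPs in the 1001 Genomes Project should instead be interpreted as the result of gene duplication. Notably, Nordborg is one of the main initiators of the Arabidopsis 1001 Genomes Project. So what exactly is going on? Let's take a look at last month's bioRxiv bioinformatics highlights.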
1. [Hot topic] South African researchers: is the Omicron variant more likely to infect people who have recovered from COVID-19?
Increased risk of SARS-CoV-2 reinfection associated with emergence of the Omicron variant in South Africa
Results: 35,670 suspected reinfections were identified among 2,796,982 individuals with laboratory-confirmed SARS-CoV-2 who had a positive test result at least 90 days prior to 27 November 2021. The number of reinfections observed through the end of the third wave was consistent with the null model of no change in reinfection risk (approach 1). Although increases in the hazard of primary infection were observed following the introduction of both the Beta and Delta variants, no corresponding increase was observed in the reinfection hazard (approach 2). Contrary to expectation, the estimated hazard ratio for reinfection versus primary infection was lower during waves driven by the Beta and Delta variants than for the first wave (relative hazard ratio for wave 2 versus wave 1: 0.75 (CI95: 0.59–0.97); for wave 3 versus wave 1: 0.71 (CI95: 0.56–0.92)). In contrast, the recent spread of the Omicron variant has been associated with a decrease in the hazard coefficient for primary infection and an increase in reinfection hazard coefficient. The estimated hazard ratio for reinfection versus primary infection for the period from 1 November 2021 to 27 November 2021 versus wave 1 was 2.39 (CI95: 1.88–3.11). Conclusion: Population-level evidence suggests that the Omicron variant is associated with substantial ability to evade immunity from prior infection. In contrast, there is no population-wide epidemiological evidence of immune escape associated with the Beta or Delta variants. This finding has important implications for public health planning, particularly in countries like South Africa with high rates of immunity from prior infection. Urgent questions remain regarding whether Omicron is also able to evade vaccine-induced immunity and the potential implications of reduced immunity to infection on protection against severe disease and death.
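To make the numbers in this abstract easier to read, here is a minimal Python sketch that only re-uses the figures quoted above. It computes the crude share of suspected reinfections among eligible individuals and checks on which side of 1 each reported 95% interval falls. The helper names (`crude_fraction`, `interpret`) are this digest's own, not the authors', and the crude share is not a risk estimate, since it ignores how long each person was followed.

```python
# Figures quoted from the abstract; nothing here is re-estimated.
SUSPECTED_REINFECTIONS = 35_670
ELIGIBLE_INDIVIDUALS = 2_796_982

# Relative hazard ratio (reinfection vs primary infection), each vs wave 1:
# label -> (point estimate, lower 95% bound, upper 95% bound)
RELATIVE_HAZARD_RATIOS = {
    "wave 2 (Beta)": (0.75, 0.59, 0.97),
    "wave 3 (Delta)": (0.71, 0.56, 0.92),
    "1-27 Nov 2021 (Omicron)": (2.39, 1.88, 3.11),
}


def crude_fraction(events, population):
    """Share of the population with the event (hypothetical helper)."""
    return events / population


def interpret(lower, upper):
    """Say which side of 1 the whole interval lies on, if any."""
    if lower > 1.0:
        return "higher than wave 1"
    if upper < 1.0:
        return "lower than wave 1"
    return "compatible with no change"


crude = crude_fraction(SUSPECTED_REINFECTIONS, ELIGIBLE_INDIVIDUALS)
summary = {
    label: interpret(lower, upper)
    for label, (_, lower, upper) in RELATIVE_HAZARD_RATIOS.items()
}

print(f"crude share of suspected reinfections: {crude:.2%}")
for label, verdict in summary.items():
    print(f"{label}: {verdict}")
```

Only the Omicron-period interval lies entirely above 1, which matches the abstract's conclusion that the Beta and Delta waves showed no population-level sign of immune escape.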
2. [Outlook] Omicron: how many more variants are still to come?
How many relevant SARS-CoV-2 variants might we expect in the future?
Objectives: The emergence of new SARS-CoV-2 variants is a major challenge in the management of the COVID-19 pandemic. A crucial issue is to quantify the number of variants which may represent a potential risk for public health in the future. Methods: We fitted the data on the most relevant SARS-CoV-2 variants recorded by the World Health Organization (WHO). The function exploited for the fit is related to the total number of infected subjects in the world since the start of the epidemic. Results: We found that the number of relevant SARS-CoV-2 variants up to November 2021 was about 44. Moreover, the number of new relevant variants per ten million cases turned out to be 1.64 in November 2021, slightly decreased in comparison to the value of 2.29 in March 2020. Conclusions: Our simple mathematical model can evaluate the number of relevant SARS-CoV-2 variants as the cumulative number of cases increases worldwide and may represent a useful tool in planning strategies to effectively counter the pandemic.
3. [Comparison] University of Oxford: a bioinformatic comparison of the spike proteins of the SARS-CoV-2 Delta and Omicron variants
Omicron and Delta Variant of SARS-CoV-2: A Comparative Computational Study of Spike protein
Emerging SARS-CoV-2 variants, especially those of concern, may have an impact on the virus's transmissibility and pathogenicity, as well as diagnostic equipment performance and vaccine effectiveness. Even though the SARS-CoV-2 Delta variant (B.1.617.2) emerged during India's second wave of infections, Delta variants have grown dominant internationally and are still evolving. On November 26, 2021, WHO identified the variant B.1.1.529 as a variant of concern, naming it Omicron, based on evidence that Omicron contains numerous mutations that may influence its behaviour. However, the mode of transmission and severity of the Omicron variant remains unknown. We used computational studies to examine the Delta and Omicron variants in this work and found that the Omicron variant had a higher affinity for human ACE2 than the Delta variant due to a significant number of mutations in the SARS-CoV-2 receptor binding domain, indicating a higher potential for transmission. Based on docking studies, the Q493R, N501Y, S371L, S373P, S375F, Q498R, and T478K mutations contribute significantly to high binding affinity with human ACE2. In comparison to the Delta variant, both the entire spike protein and the RBD in Omicron include a high proportion of hydrophobic amino acids such as leucine and phenylalanine. These amino acids are located within the protein's core and are required for structural stability. Omicron has a higher percentage of alpha-helix structure than the Delta variant in both whole spike protein and RBD, indicating that it has a more stable structure. We observed a disorder-order transition in the Omicron variant between spike protein RBD regions 468-473, and it may be significant in the influence of disordered residues/regions on spike protein stability and binding to ACE2. A future study might investigate the epidemiological and biological consequences of the Omicron variant.
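For readers unfamiliar with the mutation notation in this abstract (for example S371L means serine at position 371 replaced by leucine), the following Python sketch applies such substitutions to a sequence and measures the share of hydrophobic residues before and after. It is only an illustration: the seven-residue sequence is made up, the mutation names are scaled down to fit it, and the hydrophobic set `AVILMFW` is one common convention assumed here, not necessarily the scale the authors used.

```python
import re

# Assumed hydrophobic residue set; other hydrophobicity scales exist.
HYDROPHOBIC = set("AVILMFW")


def hydrophobic_fraction(seq):
    """Fraction of residues in seq that belong to HYDROPHOBIC."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(residue in HYDROPHOBIC for residue in seq) / len(seq)


def apply_substitution(seq, mutation):
    """Apply a substitution written like 'S371L' (1-based position)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"not a simple substitution: {mutation}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if pos < 1 or pos > len(seq) or seq[pos - 1] != ref:
        raise ValueError(f"{mutation} does not match the sequence")
    return seq[: pos - 1] + alt + seq[pos:]


# Made-up toy peptide, NOT a real spike fragment.
toy = "MKSASTS"
mutant = toy
# Toy analogues of S371L, S373P and S375F on the short sequence.
for mutation in ("S3L", "S5P", "S7F"):
    mutant = apply_substitution(mutant, mutation)

print(toy, round(hydrophobic_fraction(toy), 3))
print(mutant, round(hydrophobic_fraction(mutant), 3))
```

On the toy peptide, the serine-to-leucine and serine-to-phenylalanine changes raise the hydrophobic share, which is the kind of compositional shift the abstract describes for Omicron.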
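4. [Leafhopper] Fujian Agriculture and Forestry University: what secrets of adaptation to diverse environments are hidden in the leafhopper genome?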
Genomic variation in the tea leafhopper reveals the basis of adaptive evolution
The tea green leafhopper (TGL), Empoasca onukii, is of biological and economic interest. Despite numerous studies, the mechanisms underlying its adaptation and evolution remain enigmatic. Here, we used previously untapped genome and population genetics approaches to examine how this pest so rapidly has adapted to different environmental variables and thus has expanded geographically. We complete a chromosome-level assembly and annotation of the E. onukii genome, showing notable expansions of gene families associated with adaptation to chemoreception and detoxification. Genomic signals indicating balancing selection highlight metabolic pathways involved in adaptation to a wide range of tea varieties grown across ecologically diverse regions. Patterns of genetic variation among 54 E. onukii samples unveil the population structure and evolutionary history across different tea-growing regions in China. Our results demonstrate that the genomic change in key pathways, including those linked to metabolism, circadian rhythms and immune system function, may underlie the successful spread and adaptation of E. onukii. This work highlights the genetic and molecular bases underlying the evolutionary success of a species with broad economic impact, and provides insight into insect adaptation to host plants, which will ultimately facilitate more sustainable pest management.
5. [Prehistory] Reading prehistory (the period before written records) from genomes: the Indigenous people of Uruguay, South America
The Genomic Prehistory of the Indigenous People of Uruguay
The prehistory of the people of Uruguay is greatly complicated by the dramatic and severe effects of European contact, as with most of the Americas. After the series of military campaigns that exterminated the last remnants of nomadic peoples, Uruguayan official history masked and diluted the former indigenous ethnic diversity into the narrative of a singular people that all but died out. Here we present the first whole genome sequences of the Indigenous people of the region before the arrival of Europeans, from an archaeological site in eastern Uruguay that dates from 2,000 years before present. We find a surprising connection to ancient individuals from Panama and eastern Brazil, but not to modern Amazonians. This result may be indicative of a distinct migration route into South America that may have occurred along the Atlantic coast. We also find a distinct ancestry previously undetected in South America. Though this work begins to piece together some of the demographic nuance of the region, the sequencing of ancient individuals from across Uruguay is needed to better understand the ancient prehistory and genetic diversity that existed before European contact, thereby helping to rebuild the history of the indigenous population of what is now Uruguay.
6. [Classification] Wageningen University (Netherlands): how can prokaryotic and eukaryotic sequences in a metagenome be told apart?
Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure
Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic. However, because of marked differences in gene structure, prokaryotic gene prediction tools fail to accurately predict eukaryotic genes. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on foundational differences between these taxa in gene structure. We first developed a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated accuracy of 97%, this classifier with principled features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By re-training our classifier with Tiara predictions as additional feature, weaknesses of both types of classifiers are compensated; the result is an enhanced classifier that outperforms all individual classifiers, with an F1-score of 1.00 on precision, recall and accuracy for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endosphere microbial community, we show how using Whokaryote to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Our enhanced classifier, which we call ‘Whokaryote’, is wrapped in an easily installable package and is freely available from https://git.wageningenur.nl/lotte.pronk/whokaryote.
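The three gene-structure features named in the abstract (intergenic distance, gene density and gene length) are simple to compute once genes have been predicted on a contig. The Python sketch below shows one plausible way to derive them from gene coordinates. It is not Whokaryote's code: the function name, the 1-based inclusive coordinates and the genes-per-kilobase unit are all assumptions made for illustration.

```python
from statistics import mean


def gene_structure_features(genes, contig_length):
    """Summarise gene structure on one contig (hypothetical helper).

    genes: list of (start, end) tuples, 1-based inclusive coordinates.
    contig_length: contig length in base pairs.
    """
    if not genes or contig_length <= 0:
        raise ValueError("need at least one gene and a positive length")
    ordered = sorted(genes)
    lengths = [end - start + 1 for start, end in ordered]
    # Gap between the end of one gene and the start of the next;
    # overlapping genes count as a gap of zero.
    gaps = [
        max(0, next_start - prev_end - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return {
        "mean_gene_length": mean(lengths),
        "mean_intergenic_distance": mean(gaps) if gaps else 0.0,
        "genes_per_kb": len(ordered) * 1000 / contig_length,
    }


# Toy contig with three made-up genes.
features = gene_structure_features(
    [(1, 900), (1001, 1900), (2101, 3000)], contig_length=3000
)
print(features)
```

Per-contig values like these could then be passed as a feature vector to any classifier, such as the random forest the authors describe.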
7. [Duplication] Nordborg, Austrian Academy of Sciences: widespread, overlooked gene duplication in the Arabidopsis genome
Extensive gene duplication in Arabidopsis revealed by pseudo-heterozygosity
Results: While genuine heterozygosity should occur in tracts within individuals, heterozygosity at a particular locus is instead shared across individuals in a manner that strongly suggests it reflects segregating duplications rather than actual heterozygosity. Focusing on pseudo-heterozygosity in annotated genes, we used GWAS to map the position of the duplicates, identifying 2500 putatively duplicated genes. The results were validated using de novo genome assemblies from six lines. Specific examples included an annotated gene and nearby transposon that, in fact, transpose together. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts, and suggests that great caution is needed when analysing SNP data from short-read sequencing. The finding that 10% of annotated genes are copy-number variable, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.
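The core observation here, that suspicious heterozygous calls recur at the same site in many individuals, can be illustrated with a short Python sketch. A. thaliana is predominantly self-fertilising, so real heterozygosity is rare, and a site where a large share of lines look heterozygous is a candidate for reads from two gene copies piling up on one reference location. The function name, the VCF-style genotype strings, the toy calls and the 0.5 threshold below are all assumptions for illustration, not the authors' pipeline.

```python
def is_heterozygous(call):
    """True for a diploid call whose two alleles differ, e.g. '0/1'."""
    alleles = call.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def shared_heterozygosity(calls_by_site, min_fraction=0.5):
    """Flag sites where many individuals carry a heterozygous call.

    calls_by_site: dict mapping a site label to a list of genotype
    strings, one per individual ('./.' marks a missing call).
    min_fraction: hypothetical cut-off on the heterozygous share.
    """
    flagged = {}
    for site, calls in calls_by_site.items():
        called = [call for call in calls if "." not in call]
        if not called:
            continue
        fraction = sum(is_heterozygous(c) for c in called) / len(called)
        if fraction >= min_fraction:
            flagged[site] = fraction
    return flagged


# Made-up genotype calls for four individuals at three sites.
toy_calls = {
    "chr1:100": ["0/0", "1/1", "0/0", "1/1"],  # ordinary SNP, no hets
    "chr1:200": ["0/1", "0/1", "0/0", "0/1"],  # hets shared by most lines
    "chr1:300": ["0/1", "0/0", "0/0", "./."],  # a single het call
}
flagged = shared_heterozygosity(toy_calls)
print(flagged)
```

In this toy data only the second site is flagged, because three of its four individuals share the heterozygous call.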
8. [Spatial] HisToGene: deep learning predicts gene expression profiles from pathology images
Leveraging information in spatial transcriptomics to predict super-resolution gene expression from histology images in tumors
Recent developments in spatial transcriptomics (ST) technologies have enabled the profiling of transcriptome-wide gene expression while retaining the location information of measured genes within tissues. Moreover, the corresponding high-resolution hematoxylin and eosin-stained histology images are readily available for the ST tissue sections. Since histology images are easy to obtain, it is desirable to leverage information learned from ST to predict gene expression for tissue sections where only histology images are available. Here we present HisToGene, a deep learning model for gene expression prediction from histology images. To account for the spatial dependency of measured spots, HisToGene adopts Vision Transformer, a state-of-the-art method for image recognition. The well-trained HisToGene model can also predict super-resolution gene expression. Through evaluations on 32 HER2+ breast cancer samples with 9,612 spots and 785 genes, we show that HisToGene accurately predicts gene expression and outperforms ST-Net both in gene expression prediction and clustering tissue regions using the predicted expression. We further show that the predicted super-resolution gene expression also leads to higher clustering accuracy than observed gene expression. Gene expression predicted from HisToGene enables researchers to generate virtual transcriptomics data at scale and can help elucidate the molecular signatures of tissues.
9. [Coverage] BamToCov, a lighter-weight tool for sequence coverage calculation, from the Quadram Institute (UK)
BamToCov: an efficient toolkit for sequence coverage calculations
Many genomics applications require the calculation of nucleotide coverage of a reference or counting how many reads map to a reference region. Here we present BamToCov, a suite of tools for rapid and flexible coverage calculations relying on a memory-efficient algorithm and designed for flexible integration in bespoke pipelines. The tools of the suite process sorted BAM or CRAM files, allowing coverage information to be extracted using different filtering approaches. BamToCov tools, unlike existing tools, have been developed to require a minimum amount of memory, to be easily integrated in workflows, and to allow for strand-specific coverage analyses. The unique coverage calculation algorithm makes it the ideal choice for the analysis of long-read alignments. The programs and their documentation are freely available at https://github.com/telatin/bamtocov.
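To give a feel for what a coverage calculation involves, here is a small Python sketch that turns alignment intervals into runs of constant depth, similar in spirit to a bedGraph track. It uses a generic start/end event sweep and is not BamToCov's own algorithm, which the abstract does not describe. The function name and the 0-based half-open interval convention are assumptions, and real BAM parsing is left out.

```python
def coverage_runs(alignments, ref_length):
    """Return (start, end, depth) runs covering [0, ref_length).

    alignments: iterable of (start, end) intervals, 0-based, half-open.
    Memory grows with the number of distinct interval boundaries,
    not with the length of the reference.
    """
    events = {}
    for start, end in alignments:
        start, end = max(0, start), min(ref_length, end)
        if start >= end:
            continue  # alignment lies outside the reference
        events[start] = events.get(start, 0) + 1  # depth rises here
        events[end] = events.get(end, 0) - 1  # depth falls here

    runs = []

    def add_run(start, end, depth):
        # Merge with the previous run when the depth is unchanged.
        if runs and runs[-1][2] == depth:
            runs[-1] = (runs[-1][0], end, depth)
        else:
            runs.append((start, end, depth))

    depth, previous = 0, 0
    for position in sorted(events):
        if position > previous:
            add_run(previous, position, depth)
        depth += events[position]
        previous = position
    if previous < ref_length:
        add_run(previous, ref_length, depth)
    return runs


# Two made-up overlapping reads on a 10 bp reference.
runs = coverage_runs([(0, 5), (3, 8)], ref_length=10)
for start, end, depth in runs:
    print(start, end, depth)
```

Because only interval boundaries are stored, a few long reads cost little more than a few short ones, which is one reason event-based approaches suit long-read data.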
10. [Methylation] University of New South Wales (Australia): active DNA demethylation during embryonic development is not exclusive to vertebrates
Active DNA demethylation of developmental cis-regulatory regions predates vertebrate origins
DNA methylation (5-methylcytosine; 5mC) is a repressive gene-regulatory mark required for vertebrate embryogenesis. Genomic 5mC is tightly regulated through the coordinated action of DNA methyltransferases, which deposit 5mC, and TET enzymes, which participate in its active removal through the formation of 5-hydroxymethylcytosine (5hmC). TET enzymes are essential for mammalian gastrulation and activation of vertebrate developmental enhancers, however, to date, a clear picture of 5hmC function, abundance, and genomic distribution in non-vertebrate lineages is lacking. By employing base-resolution 5mC and 5hmC quantification during sea urchin and lancelet embryogenesis, we shed light on the roles of non-vertebrate 5hmC and TET enzymes. We find that these invertebrate deuterostomes employ TET enzymes for targeted demethylation of regulatory regions associated with developmental genes and show that the complement of identified 5hmC-regulated genes is conserved to vertebrates. This work thus demonstrates that active 5mC removal from regulatory regions is a common feature of deuterostome embryogenesis suggestive of unexpected deep conservation of a major gene-regulatory module.
References
1. Elsabe Brits & Paul Adepoju, 2021. Omicron potential under close scrutiny. https://www.nature.com/articles/d44148-021-00119-9
2. https://1001genomes.org/about.html