Weekly bioRxiv Bioinformatics Paper Digest (23/12/2019 - 29/
1. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements
Morteza Hosseini, Diogo Pratas, Burkhard Morgenstern, Armando J. Pinho
Background: The development of high-throughput sequencing technologies and the resulting production of huge volumes of genomic data have accelerated biological and medical research and discovery. The study of genomic rearrangements is crucial because of their role in chromosomal evolution, genetic disorders and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between two DNA sequences. This computational solution extracts the information content of the two sequences, exploiting a data compression technique, in order to find rearrangements. We also present the Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions agreed with previous studies that took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ~1 GB, which makes Smash++ feasible to run on present-day standard computers.
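The idea of using compression to estimate how much information two sequences share can be illustrated with a short sketch. This is not the Smash++ algorithm: it swaps in a general-purpose compressor (`zlib`) and the normalized compression distance, and the function names `compressed_size` and `ncd` are made up for illustration.

```python
import zlib


def compressed_size(seq: str) -> int:
    """Number of bytes zlib needs for the sequence (a crude information estimate)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 suggest that y adds little information once x is known;
    values near 1 suggest the two sequences share almost nothing.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because the measure looks at shared content rather than position, a sequence compared with a copy of itself whose two halves have been swapped typically scores low, while two unrelated random sequences score close to 1. Note that zlib only looks back 32 KB, so this toy version is meaningful for short sequences only.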
2. A single-cell atlas of the human healthy airways
Marie Deprez, Laure-Emmanuelle Zaragosi, Marin Truchi, Sandra Ruiz Garcia, Marie-Jeanne Arguel, Kevin Lebrigand, Agnès Paquet, Dana Pe’er, Charles-Hugo Marquette, Sylvie Leroy, Pascal Barbry
Rationale: The respiratory tract constitutes an elaborate line of defense based on a unique cellular ecosystem. Single-cell profiling methods enable the investigation of cell population distributions and transcriptional changes along the airways. Methods: We have explored cellular heterogeneity of the human airway epithelium in 10 healthy living volunteers by single-cell RNA profiling. 77,969 cells were collected by bronchoscopy at 35 distinct locations, from the nose to the 12th division of the airway tree. Results: The resulting atlas is composed of a high percentage of epithelial cells (89.1%), but also immune (6.2%) and stromal (4.7%) cells, with distinct cellular proportions at different sites of the airways. It reveals differential gene expression between identical cell types (suprabasal, secretory, and multiciliated cells) from the nose (MUC4, PI3, SIX3) and tracheobronchial (SCGB1A1, TFF3) airways. By contrast, cell-type specific gene expression was stable across all tracheobronchial samples. Our atlas improves the description of ionocytes, pulmonary neuro-endocrine (PNEC) and brush cells, which are likely derived from a common population of precursor cells. We also report a population of KRT13-positive cells with a high percentage of dividing cells, which are reminiscent of “hillock” cells previously described in mouse. Conclusions: Robust characterization of this unprecedentedly large single-cell cohort establishes an important resource for future investigations. The precise description of the continuum existing from nasal epithelium to successive divisions of lung airways and the stable gene expression profile of these regions better define the conditions under which relevant tracheobronchial proxies of human respiratory diseases can be developed.
3. Deep exploration networks for rapid engineering of functional DNA sequences
Johannes Linder, Nicholas Bogard, Alexander B. Rosenberg, Georg Seelig
Engineering gene sequences with defined functional properties is a major goal of synthetic biology. Deep neural network models, together with gradient ascent-style optimization, show promise for sequence generation. The generated sequences can, however, get stuck in local minima and have low diversity, and their fitness depends heavily on initialization. Here, we develop deep exploration networks (DENs), a type of generative model tailor-made for searching a sequence space to minimize the cost of a neural network fitness predictor. By making the network compete with itself to control sequence diversity during training, we obtain generators capable of sampling hundreds of thousands of high-fitness sequences. We demonstrate the power of DENs in the context of engineering RNA isoforms, including polyadenylation and cell type-specific differential splicing. Using DENs, we engineered polyadenylation signals with more than 10-fold higher selection odds than the best gradient ascent-generated patterns and identified splice regulatory elements predicted to result in highly differential splicing between cell lines.
4. Maximum Likelihood Reconstruction of Ancestral Networks by Integer Linear Programming
Vaibhav Rajan, Carl Kingsford, Xiuwei Zhang
Motivation: The study of the evolutionary history of biological networks enables deep functional understanding of various bio-molecular processes. Network growth models, such as the Duplication-Mutation with Complementarity (DMC) model, provide a principled approach to characterizing the evolution of protein-protein interactions (PPI) based on duplication and divergence. Current methods for model-based ancestral network reconstruction primarily use greedy heuristics and yield sub-optimal solutions. Results: We present a new Integer Linear Programming (ILP) solution for maximum likelihood reconstruction of ancestral PPI networks using the DMC model. We prove the correctness of our solution, which is designed to find the optimal solution. It can also use efficient heuristics from general-purpose ILP solvers to obtain multiple optimal and near-optimal solutions that may be useful in many applications. Experiments on synthetic data show that our ILP obtains solutions with higher likelihood than those from previous methods, and is robust to noise and model mismatch. We evaluate our algorithm on two real PPI networks, with proteins from the families of bZIP transcription factors and the Commander complex. On both networks, solutions from our ILP have higher likelihood and are in better agreement with independent biological evidence from other studies. Availability: A Python implementation is available at https://bitbucket.org/cdal/. Contact: vaibhav.rajan{at}nus.edu.sg
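To make "duplication and divergence" concrete, here is a minimal sketch of one forward growth step of the DMC model as it is commonly described in the network-growth literature. The abstract does not spell the model out, so the details below (the parameter names `q_mod` and `q_con`, the function name `dmc_step`, and the adjacency-set representation) are assumptions for illustration. Ancestral reconstruction, the subject of the paper, runs this process in reverse and is not shown.

```python
import random
from typing import Dict, Set


def dmc_step(graph: Dict[int, Set[int]], q_mod: float, q_con: float,
             rng: random.Random) -> int:
    """Grow an undirected graph by one node, in place, and return the new node id."""
    anchor = rng.choice(sorted(graph))      # node to duplicate
    new = max(graph) + 1
    graph[new] = set()

    # Duplication: the copy inherits every neighbor of the anchor.
    for nb in sorted(graph[anchor]):
        graph[new].add(nb)
        graph[nb].add(new)

    # Mutation with complementarity: for each shared neighbor, with
    # probability q_mod drop the edge from either the anchor or the copy,
    # so at least one of the two edges always survives.
    for nb in sorted(graph[new]):
        if rng.random() < q_mod:
            loser = anchor if rng.random() < 0.5 else new
            graph[loser].discard(nb)
            graph[nb].discard(loser)

    # Optionally connect the anchor and its copy.
    if rng.random() < q_con:
        graph[anchor].add(new)
        graph[new].add(anchor)
    return new
```

Starting from a single edge and calling `dmc_step` repeatedly yields a synthetic network. A likelihood-based reconstruction would score candidate histories by the probabilities of the choices made in steps like this one.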
5. Distance Indexing and Seed Clustering in Sequence Graphs
Xian Chang, Jordan Eizenga, Adam M. Novak, Jouni Sirén, Benedict Paten
Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but much more complicated in a graph context. In read mapping algorithms, distance calculations are commonly used in a clustering step to determine if seed alignments could belong to the same mapping. Clustering algorithms are a bottleneck for some mapping algorithms due to the cost of repeated distance calculations. We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for mapping algorithms.
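The contrast between distance on a linear genome (a subtraction) and distance on a graph (a shortest-path search), and the clustering step built on it, can be sketched as follows. This is a brute-force illustration, not the authors' minimum distance index: it assumes a simplified directed graph in which each node carries a sequence length and a position is a `(node, offset)` pair, it runs Dijkstra's algorithm for every query, and the names `min_distance` and `cluster_seeds` are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Position = Tuple[str, int]  # (node id, 0-based offset within the node)


def min_distance(node_len: Dict[str, int], edges: Dict[str, List[str]],
                 start: Position, end: Position) -> Optional[int]:
    """Fewest bases walked from start to end, or None if end is unreachable."""
    (s_node, s_off), (e_node, e_off) = start, end
    best = e_off - s_off if s_node == e_node and e_off >= s_off else None

    # dist[n] = bases walked from start to the first base of node n.
    dist: Dict[str, int] = {}
    heap = [(node_len[s_node] - s_off, nxt) for nxt in edges.get(s_node, [])]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if node in dist:
            continue
        dist[node] = d
        for nxt in edges.get(node, []):
            if nxt not in dist:
                heapq.heappush(heap, (d + node_len[node], nxt))

    if e_node in dist:
        cand = dist[e_node] + e_off
        best = cand if best is None else min(best, cand)
    return best


def cluster_seeds(node_len: Dict[str, int], edges: Dict[str, List[str]],
                  seeds: List[Position], limit: int) -> List[List[Position]]:
    """Group seeds that are linked by chains of pairwise distances <= limit."""
    parent = list(range(len(seeds)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seeds)):
        for j in range(i + 1, len(seeds)):
            both = (min_distance(node_len, edges, seeds[i], seeds[j]),
                    min_distance(node_len, edges, seeds[j], seeds[i]))
            reachable = [d for d in both if d is not None]
            if reachable and min(reachable) <= limit:
                parent[find(i)] = find(j)

    clusters: Dict[int, List[Position]] = {}
    for i, seed in enumerate(seeds):
        clusters.setdefault(find(i), []).append(seed)
    return sorted(clusters.values())


# A small bubble: A -> B -> D and A -> C -> D, where C is the shorter branch.
node_len = {"A": 5, "B": 3, "C": 1, "D": 4}
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
```

Every pairwise comparison here triggers a fresh graph search, which is the repeated cost the abstract identifies as a bottleneck. A precomputed distance index is meant to replace that search with a fast lookup.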