Remapping our recent evolutionary history

Upton KR

School of Chemistry and Molecular Biosciences, University of Queensland.

Transposons are mobile DNA sequences that replicate within a host genome and make up around half of the human genome. Although they were originally described as ‘controlling elements’ able to regulate gene expression, they are often derided as ‘junk DNA’ with minimal benefit to the host. This belief has been reinforced by the technical difficulty of uniquely identifying individual transposons in short-read sequencing data: most reads derived from these elements align to multiple locations throughout the genome (multimapping reads). Traditional bioinformatic pipelines take a conservative approach and retain only reads that map to a unique location, which leaves transposons underrepresented in functional models of genome regulation and reinforces the junk DNA hypothesis. Mappability analysis indicates that ~40% of the human genome is adversely affected by the removal of multimapping reads, effectively masking the last 10 million years of our evolution from functional analysis. To address this gap in knowledge, my lab has developed and validated ReMapQ, a GPU-based machine learning algorithm that uses a deep neural network to resolve the placement of multimapping reads. ReMapQ can incorporate all mappable reads, even those with 500 possible alignment locations. In silico validation with known-truth data sets has shown excellent performance, with ~90% precision in read placement. In a pilot analysis of differentially methylated regions (DMRs) in triple-negative breast cancer, we have shown that biological signals identified by traditional analyses (which include only high-confidence single-mapping reads) are highly conserved within ReMapQ-processed data, with excellent correlation in log-fold-change values. Further, ReMapQ identifies around 50% more DMRs than single-mapping read analysis.
We are working to refine this algorithm and apply it to integrated functional data sets to elucidate the functional impact of transposons and their role in disease and development in the human genome.
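
As a rough illustration of the conventional filtering step described above, the following Python sketch shows how keeping only uniquely mapped reads removes all signal from repeated elements. It is a toy example under stated assumptions: the read names, locus labels, and helper functions are hypothetical, and it does not represent ReMapQ or any specific aligner.

```python
from collections import defaultdict


def group_alignments(alignments):
    """Group (read_id, locus) pairs into a read_id -> candidate loci mapping."""
    candidates = defaultdict(list)
    for read_id, locus in alignments:
        candidates[read_id].append(locus)
    return dict(candidates)


def unique_only_coverage(candidates):
    """Count reads per locus, keeping only reads with exactly one candidate locus."""
    coverage = defaultdict(int)
    for loci in candidates.values():
        if len(loci) == 1:  # multimapping reads are silently dropped here
            coverage[loci[0]] += 1
    return dict(coverage)


def discarded_fraction(candidates):
    """Fraction of reads removed by the unique-only filter."""
    if not candidates:
        return 0.0
    multi = sum(1 for loci in candidates.values() if len(loci) > 1)
    return multi / len(candidates)


# Hypothetical toy data: two reads map uniquely to a gene, two reads map to
# several near-identical copies of a young transposon.
toy_alignments = [
    ("read1", "gene_A"),
    ("read2", "gene_A"),
    ("read3", "L1_copy_1"),
    ("read3", "L1_copy_2"),
    ("read4", "L1_copy_1"),
    ("read4", "L1_copy_2"),
    ("read4", "L1_copy_3"),
]

candidates = group_alignments(toy_alignments)
coverage = unique_only_coverage(candidates)
lost = discarded_fraction(candidates)

print(coverage)  # only gene_A retains any coverage
print(lost)      # half of the toy reads are discarded
```

In this toy case every transposon copy ends up with zero coverage, although half of the reads originated from one of them. Resolving which candidate locus each of those reads belongs to is the problem a placement method has to solve.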