Lucia Lorenzi (1), Francisco Avila Cobos (1), Hua-Sheng Chiu (2), Robrecht Cannoodt (1), Tine Goovaerts (1), Thomas Birkballe Hansen (3), Pieter-Jan Volders (1), Steve Gross (4), Tom Taghon (1), Karim Vermaelen (1), Ken Bracke (1), Jeroen Galle (1), Jorgen Kjems (3), Tim De Meyer (1), Gary Schroth (4), Pavel Sumazin (2), Jo Vandesompele (1), Pieter Mestdagh (1)
1) Ghent University, Belgium;
2) Baylor College of Medicine, US;
3) Aarhus University, Denmark;
4) Illumina, US
Technological advances in RNA expression profiling have revealed that the human genome is pervasively transcribed, generating an unexpectedly complex transcriptome that comprises various classes of RNA molecules and extensive isoform diversity. Many of these RNAs show high tissue specificity, with some expressed in only one or a few cell types. Although numerous large-scale RNA-sequencing studies have been performed, the samples involved are often complex tissues, which masks transcripts expressed in low-frequency cell populations, and the sequencing methods typically focus on a single class of RNA transcripts.
We assembled the most comprehensive human transcriptome across an extensive cohort of human samples, consisting of 160 different normal cell types, 45 tissues and 93 cancer cell lines. For each sample, total RNA, poly-A RNA and small RNA libraries were generated and sequenced using Illumina technology, yielding a total of 65 billion reads. Transcriptome assemblies for mRNAs, lncRNAs, miRNAs and circRNAs were matched with chromatin state maps from the Roadmap Epigenomics Consortium to define stringent gene models for each RNA biotype. Count data from the poly-A and total RNA sequencing libraries were combined to reveal the polyadenylation status of each transcript in each sample. We identified a total of 50,235 gene loci, of which 19,668 were novel. From these loci, 37,140 circRNAs were expressed. While a small fraction of the novel genes was predicted to have coding potential, the majority were non-coding, single-exonic, and highly enriched for non-polyadenylated transcripts. Interestingly, a subset of genes showed variable polyadenylation status across samples, mainly driven by alternative isoform usage. The biological information content of each RNA biotype was assessed by evaluating associations between RNA expression and sample ontology, and by complex tissue deconvolution. Furthermore, we exploited the availability of intron reads in the total RNA sequencing data to assess the regulatory potential of miRNAs, lncRNAs and circRNAs at the transcriptional and post-transcriptional levels. Taken together, the RNA atlas serves as a unique resource for further studies on the function, organization and regulation of the different layers of the human transcriptome.
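
The abstract does not state how poly-A and total RNA counts were combined, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes per-gene raw counts from the two libraries of a single sample, scales each library to counts per million, and labels a gene as non-polyadenylated when its poly-A signal falls well below its total RNA signal. The function names, gene names, pseudocount and threshold are all hypothetical.

```python
import math


def cpm(counts):
    """Scale raw counts to counts per million within one library."""
    library_size = sum(counts.values())
    return {gene: 1e6 * n / library_size for gene, n in counts.items()}


def polya_status(polya_counts, total_counts, pseudocount=1.0, threshold=1.0):
    """Label each gene by the log2 ratio of poly-A to total RNA abundance.

    Assumption: a log2 ratio at or below -threshold means the transcript is
    depleted in the poly-A library and is called non-polyadenylated.
    The threshold and pseudocount are illustrative, not published values.
    """
    polya_cpm = cpm(polya_counts)
    total_cpm = cpm(total_counts)
    status = {}
    for gene, total_value in total_cpm.items():
        polya_value = polya_cpm.get(gene, 0.0)
        log_ratio = math.log2((polya_value + pseudocount) / (total_value + pseudocount))
        if log_ratio <= -threshold:
            status[gene] = "non-polyadenylated"
        else:
            status[gene] = "polyadenylated"
    return status


# Hypothetical counts for one sample (not data from the study).
polya = {"geneA": 500, "geneB": 0, "geneC": 500}
total = {"geneA": 400, "geneB": 400, "geneC": 200}
status = polya_status(polya, total)
print(status)
```

Applying such a call per sample would also expose genes whose label changes between samples, which is the kind of variable polyadenylation status described above. A real analysis would likely need a normalization that is robust to compositional differences between the two library types, which this sketch ignores.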