Interrogating the transcriptome in all its diversity


Interrogating the transcriptome in all its diversity
Joel H Graber
Thanks. I actually wanted to include the word “alternative” in this title, but I also wanted to keep the title to two lines. But as I hope to convince you, it’s really the alternatives that make the processing interesting and important. So let’s put it into perspective.

Empirical transcript measurements to characterize mRNA processing
Figure labels: stop codon, polyA sites, mRNA/cDNA, ESTs, microarray probes, mRNA-Seq, mRNA-RACE

The most important thing I can tell you about (especially large-scale) transcriptome measurement
EVERY procedural step can leave its mark on the data
True of both bench and computational steps
What looks like interesting processing may not be
Know your assumptions
Be suspicious
Test and torture your data to be confident

Small numbers of genes to test: qPCR

IGF1 mRNA data indicate at least 15 transcript isoforms

qPCR Primer Pair Set-up should catch most isoform differences

Igf1 transcript variants are differentially expressed as a function of strain, tissue, and nutrition

Many genes simultaneously I: microarrays
The fundamental hypothesis of transcriptome measurement: the state of the cell can be ascertained by which transcripts are expressed at which levels

Using gene expression microarrays to assess variation in mRNA processing
Our modified hypothesis of transcript measurement: the activity of a cell is a function of which isoforms of which genes are expressed
Expression arrays, though designed for abundance measurement, can also reveal isoform variation
Summarization to one “expression level” is problematic
So the work I’ll show you is based primarily on the efforts of two graduate students, Jesse Salisbury and Priyam Singh. As it says here at the top, we want to modify these measurements and not just get which genes are active, but also in what form. We are using expression microarrays, explicitly the Affymetrix type of array with multiple probes targeting a transcript. The standard analysis is designed for measurement of abundance, but we can reprocess this data to reveal changes in isoform as well as abundance. Tied up in this is the fact that for many genes, the summarization to just one number representing the activity of that gene is problematic, and in fact for some genes will miss interesting changes in gene activity.

Identifying processing changes with expression arrays: a simple example for illustration
One gene with two isoforms
Differing only in polyA site
Regulatory sites for post-transcriptional control only in extended isoform
Microarray probes hybridize to common and differential regions
Explain the model and specifically point to the probes. Once again, we use the representation of the thickened region representing the coding sequence, and in this example the coding sequence of these two transcripts is identical, so that only the 3’-UTRs are different, and explicitly they differ only by their polyA site. Just to make things interesting, we’ll include a couple of regulatory elements that make the transcript behavior different in the presence of the right trans-acting factors. So now we have this gene, with two isoforms shown, and we’re going to do a microarray experiment in two different samples…

Standard microarray analysis can mask changes in mRNA processing
[Figure: probe-level signals for Sample 1 and Sample 2]
So we start with two samples to be measured, and we’re showing the balance of the two isoforms as changing between them; explicitly, in this example, I’ve chosen to keep the total abundance the same. The samples are collected and put onto the microarray. I’m showing here three replicates of each experiment in the raw data, with black representing sample 1 and orange sample 2. The signals for the 3 upstream, common probes are approximately equal, whereas the signals in the extended region are significantly higher for sample 2. In standard analysis, the signals for the probes are summarized to one value for each sample. But as this example shows, this is not an adequate representation. So instead of summarizing within each sample, we compare each probe individually, taking a ratio for each probe. Then we look for changes in this ratio along the length of the transcript, looking for patterns like we see here, where there is a clear difference between the different parts of the gene. With this analysis in place, we applied it to a number of problems of interest. The one I’m going to tell you about is pro-B-cell lymphomas, which we studied in collaboration with Kevin Mills.
Salisbury J et al, PLoS ONE 2009
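
The contrast between per-gene summarization and per-probe ratios can be sketched in a few lines. The intensities below are made-up values for illustration only (three probes in the common region, three in the extended 3’-UTR), and the function names are hypothetical, not taken from the published analysis.

```python
from math import log2
from statistics import mean

# Hypothetical probe intensities along one gene, ordered 5' to 3'.
# Probes 0-2 hit the region common to both isoforms;
# probes 3-5 hit only the extended 3'-UTR.
sample1 = [800, 800, 800, 200, 200, 200]
sample2 = [800, 800, 800, 800, 800, 800]


def probe_log_ratios(a, b):
    """log2(b/a) for each probe individually."""
    return [log2(y / x) for x, y in zip(a, b)]


def summarized_log_ratio(a, b):
    """A single number per gene: the mean of the per-probe log ratios."""
    return mean(probe_log_ratios(a, b))


ratios = probe_log_ratios(sample1, sample2)
summary = summarized_log_ratio(sample1, sample2)
print(ratios)   # flat near 0 for the common probes, shifted for the extension
print(summary)  # one value that hides where along the gene the change occurs
```

The single summary value reports a moderate overall change, while the per-probe ratios show that the change is confined to the extended region.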

Statistical Test I: Modified t-test
Compute expression ratio for three or more probes on each side of putative break
Significance is assessed by randomizing probes
Spuriously low variance can cause problems
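
A minimal sketch of such a test, under stated assumptions: the slide does not give the exact statistic, so this uses a Welch-style t with a variance floor (`var_floor`, a hypothetical parameter added here to guard against spuriously low variance) and assesses significance by shuffling probe order. It is an illustration of the idea, not the published implementation.

```python
import random
from math import sqrt
from statistics import mean, variance


def modified_t(left, right, var_floor=0.01):
    """Welch-style t between probe log ratios on each side of a break.

    var_floor is an assumed safeguard: it stops a spuriously small
    variance from inflating the statistic.
    """
    v_left = max(variance(left), var_floor)
    v_right = max(variance(right), var_floor)
    se = sqrt(v_left / len(left) + v_right / len(right))
    return (mean(right) - mean(left)) / se


def best_break(ratios, min_side=3, var_floor=0.01):
    """Scan candidate breaks, requiring min_side probes on each side."""
    best_t, best_b = 0.0, None
    for b in range(min_side, len(ratios) - min_side + 1):
        t = abs(modified_t(ratios[:b], ratios[b:], var_floor))
        if t > best_t:
            best_t, best_b = t, b
    return best_t, best_b


def permutation_p(ratios, n_perm=1000, seed=0):
    """Empirical p-value obtained by randomizing the order of the probes."""
    rng = random.Random(seed)
    observed, brk = best_break(ratios)
    shuffled = list(ratios)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_break(shuffled)[0] >= observed:
            hits += 1
    return brk, observed, (hits + 1) / (n_perm + 1)


# Hypothetical per-probe log2 ratios with a shift after the fourth probe.
example = [0.0, 0.1, -0.1, 0.05, 2.0, 2.1, 1.9, 2.05]
brk, t_obs, p = permutation_p(example)
print(brk, round(t_obs, 2), p)
```

With only a handful of probes per gene, the number of distinct permutations is small, so the attainable p-values are coarse.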

Systematic alternative processing of 3’-UTR can be correlated with functional changes
Science 2008; Cell 2009; Cancer Research 2009

The Ube2a long transcript isoform is lost in all three tumor types
Ubiquitin conjugating enzyme 2a is a signaling gene involved in response to DNA damage. The layout of the 3’-end of the transcripts along with the microarray probes is shown at the top. Down below, we see a combined plot of the comparison of the three tumor types, as well as the mature B-cells, against the wild-type pro-B cells. As with the illustration I showed before, 0 here means no change from the wild-type pro-B cells. As can be seen here, all three tumors have essentially no change, or maybe a slight decrease, in signal for the common probes, but the extended isoform has a decrease of between 3- and 8-fold. In contrast, the mature B-cells have a uniform decrease of approximately 2-fold across the entire transcript. We also see the complementary pattern, with preferential preservation of the long isoform:

The Pik3ap1 long transcript isoform is preserved in APC and APN, but not LPC tumors
Here we see phosphoinositide kinase 3 activating protein 1. The layout of the information is the same as before, with the arrangement of the 3’-end of the gene and probes shown at the top, and the probe-level analysis shown below. Pik3ap1 stands out from Ube2a in two notable ways. First, the pattern is for elongation, rather than truncation, with respect to pro-B cells. Second, the pattern for Pik3ap1 is not constant across all tumors, but rather is only significant for the two Artemis KO tumors. The Lig4 KO tumors show a uniform decrease in signal across all probes, whereas mature B-cells show essentially no change. The fact that not all genes behave in a common manner led us to test the idea that the patterns of which genes were changed, and by how much, could be used to distinguish the tumors:

Summary: tumors have systematic and characteristic changes in RNA processing
Our data support alternate 3’-processing as the source of changes, rather than isoform-specific changes in stability
While truncation dominates, genes with elongation are also observed
APC and LPC tumors share an amplified oncogene (Myc), but differ in signature and prognosis
So to summarize what we have found: it’s important to understand that the signals we observe could arise due to either alternative polyadenylation or, equivalently, isoform-specific changes in stability. Our work, as well as that of others, seems to indicate that changes in polyadenylation are playing a significant role; we found changes in the abundance of one critical trans-acting protein, as well as evidence of changes in abundance or processing of the transcripts encoding several other polyA factors. The role of genes with elongation is an open question that we will continue to evaluate. In the growing model that truncation of the 3’-UTR leads to a general increase of protein production, elongation would presumably reduce protein accordingly. Finally, we also performed our analysis on two sets of cell cultures, both drawn from human ovarian cancer, but differing in their response to cisplatin, a common chemotherapeutic. With our analysis, we identified about 150 genes with significant changes, offering the possibility of predicting responsiveness to treatment based on RNA signatures. One of the more intriguing findings is the difference in both phenotype and RNA processing between the APC and LPC tumors, since these share a common amplified oncogene. This suggests the possibility that the differences in DNA damage could be responsible, given the different roles of Art and Lig4. We are following this up.
Singh P et al, Cancer Research 2009

Differential splicing using whole-transcript microarrays
FIRMA: a method for detection of alternative splicing from exon array data. E. Purdom, K. M. Simpson, M. D. Robinson, J. G. Conboy, A. V. Lapuk and T. P. Speed. Bioinformatics 2008, 24(15):1707-1714
Differential splicing using whole-transcript microarrays. M. D. Robinson and T. P. Speed. BMC Bioinformatics 2009, 10:156

FIRMA/FIRMAGene details
Full model: (i: array index; j: gene index; k: probe index)
Residual:
Exon arrays, FIRMA score:
Gene arrays, FIRMAGene score:
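
The equations from this slide are not reproduced in the transcript. As a hedged sketch only, the RMA-style model and FIRMA score described in Purdom et al. (2008) can be written as follows, using the slide's indices (i: array, j: gene, k: probe). The symbols $y$, $c$, $p$, $r$, $s$ and the exon label $e$ are notation assumed here and may differ from the slide. The FIRMAGene score is left out because its exact form is not recoverable from the text.

```latex
\begin{align*}
% Assumed RMA-style additive model: chip effect plus probe effect
\log_2 y_{ijk} &= c_{ij} + p_{jk} + \varepsilon_{ijk} \\
% Residual after fitting the model
r_{ijk} &= \log_2 y_{ijk} - \hat{c}_{ij} - \hat{p}_{jk} \\
% FIRMA score for exon e of gene j on array i,
% with s_j a robust scale (MAD) of the residuals for gene j
F_{ije} &= \operatorname*{median}_{k \in e} \left( \frac{r_{ijk}}{s_j} \right)
\end{align*}
```

Under this reading, an exon whose probes all show residuals of the same sign on one array stands out as a candidate for differential inclusion on that array.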

RMA decomposition
[Figure: panels show total normalized expression, estimated probe effect, estimated chip effect, and residual; samples are heart, brain, and a 75:25 brain:heart mixture]
RMA decomposition of probe-level Affymetrix data. Panel A shows the background-adjusted and normalized probe-level data for PRRX1, from the Affymetrix mixture dataset (see Methods). The probes are displayed in the order in which they map to the human genome (not to scale), and lines join all probe intensities of the same sample. PRRX1 is expressed significantly higher in heart tissue compared to brain. Three replicates of pure heart tissue are shown as red lines; green lines represent pure brain tissue replicates and the blue lines represent a mixture of 75% brain tissue and 25% heart tissue. Panel B shows the estimated relative probe effects. Panel C shows the chip effects (i.e. summarized expression levels) and Panel D shows residuals, using the same colour scheme.

Validation with known muscle-specific exons
FIRMA scores represented by color scale for all 11 of the validated probesets (left of dividing line) as well as 15 additional top scoring probesets (right of dividing line). Two validated genes (UNR and ITGB1) also ranked high enough to be included in the top 15 candidate genes and their labels are colored in green to note this. Note that the color scale is not evenly spaced but rather based on percentiles of all FIRMA scores in all non-filtered probesets and samples.
Purdom, E. et al. Bioinformatics 2008 24:1707-1714; doi:10.1093/bioinformatics/btn284

Open questions, unresolved issues of FIRMA approaches
Genetic variability: probes that change hybridization due to sequence variation
Low expression probes
Overlapping probes
Unresponsive/hyperresponsive probes
Interpretation: IRLS is unsupervised; majority can define “normal”

CDFs (Chip definition files) MATTER!!!!!!
The CDF maps probes on the array to putative genes/transcripts
CDFs are explicitly dependent on the quality of the annotations used for association
Which CDF is best can literally depend on the specific gene of interest

IGF1 annotations (and genomic extent) depend greatly on the data source

Even after you get the CDF right, it’s still good to understand the limitations of your array

Exon-gene array differences
Exon: 4 probes per PSR (probe selection region); a mix of probable and improbable transcribed regions
Gene: mostly “probable” regions; ~25 probes per targeted transcript

Comparison of Gene (FIRMAGene) and Exon (FIRMA) arrays for MBP
Normalized probe-level data and RMA residuals for MBP. Panels A and B show the residuals for Gene and Exon for RMA fits, respectively. There are 36 probes for Gene and 72 probes for Exon. Both panels show 33 lines, one for each hybridization (11 tissues with 3 biological replicates each). The brain and muscle replicates are shown as blue and red lines, respectively.

Measuring isoform variation with mRNAseq

Analysis of large sets of short sequence reads is a rapidly developing field
Alignment first, assembly later: MAQ, Eland, SHRiMP, Bowtie/TopHat, SOAP
Assembly first, alignment later: Trinity, Trans-ABySS, Oases

High throughput sequence data is aligned in conceptually the same way as BLAST
Better heuristics are necessary
The problem is bigger
Program tweaks are different: tuned to small read size, large genome
(Rapid) indexing is still the key: initial attempts were based on standard hashing, later ones on the Burrows-Wheeler Transform
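
To make the indexing idea concrete, here is a toy sketch of exact-match lookup with a Burrows-Wheeler index (backward search). It builds the suffix array by naive sorting and counts occurrences by scanning, which is fine for a short string but not how production aligners are implemented; all names are hypothetical.

```python
def build_bwt_index(genome):
    """Build a naive suffix array, the BWT string, and the C table."""
    text = genome + "$"  # sentinel sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0)
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[ch]: number of characters in text strictly smaller than ch
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return suffix_array, bwt, c_table


def find_read(read, suffix_array, bwt, c_table):
    """Return sorted genome positions where read matches exactly."""
    lo, hi = 0, len(bwt)
    for ch in reversed(read):  # backward search, last base first
        if ch not in c_table:
            return []
        lo = c_table[ch] + bwt[:lo].count(ch)
        hi = c_table[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_bwt_index("ACGTACGTGA")
print(find_read("ACG", *index))
print(find_read("TTT", *index))
```

Real aligners replace the `count` scans with precomputed occurrence tables and add tolerance for mismatches, which this sketch omits.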

Alignment first processing strategy
Enhanced Read Analysis of Gene Expression (ERANGE) and the allocation of multireads. (a) The main steps in the computational pipeline are outlined at left, with different aspects of read assignment and weighting diagrammed at right and the corresponding number of gene model reads treated in muscle shown in parentheses. In each step, the sequence read or reads being assigned by the algorithm are shown as a black rectangle, and their assignment to one or more gene models is indicated in color. Sequence reads falling outside known or predicted regions are shown in gray. RNAFAR regions (clusters of reads that do not belong to any gene model in our reference set) are shown as dotted lines. They can either be assigned to neighboring gene models, if they are within a specified threshold radius (purple), or assigned their own predicted transcript model (green). Multireads (shown as parallelograms) are assigned fractionally to their different possible locations based on the expression levels of their respective gene models as described in the text. (b) Comparison of mouse liver expanded RPKM values to publicly available Affymetrix microarray intensities from GEO (GSE6850) for genes called as present by Rosetta Resolver. Expanded RPKMs include unique reads, spliced reads and RNAFAR candidate exon aggregation, but not multireads. Genes with >30% contribution of multireads to their final RPKM (Supplementary Fig. 4) are marked in red. (c) Comparison of Affymetrix intensity values with final RPKMs, which includes multireads. Note that the multiread-affected genes that are below the regression line in b straddle the regression line in c.
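
The fractional assignment of multireads described in the caption can be sketched as below. This is a simplified illustration, not ERANGE itself: it assumes expression is estimated as unique reads per kilobase, and the function and variable names are hypothetical.

```python
def allocate_multireads(unique_counts, lengths_bp, multireads):
    """Split each multiread across its candidate genes.

    unique_counts: gene -> number of uniquely mapped reads
    lengths_bp:    gene -> gene model length in bases
    multireads:    list of tuples of candidate genes, one tuple per read
    Weights are proportional to unique-read density (reads per kb),
    an assumed stand-in for the expression level of each gene model.
    """
    density = {g: unique_counts[g] / (lengths_bp[g] / 1000) for g in unique_counts}
    final = {g: float(n) for g, n in unique_counts.items()}
    for candidates in multireads:
        total = sum(density[g] for g in candidates)
        for g in candidates:
            # Fall back to an even split if no candidate has unique reads.
            share = density[g] / total if total > 0 else 1 / len(candidates)
            final[g] += share
    return final


unique = {"geneA": 30, "geneB": 10}
lengths = {"geneA": 1000, "geneB": 1000}
ambiguous = [("geneA", "geneB")] * 4  # four reads mapping to both genes
print(allocate_multireads(unique, lengths, ambiguous))
```

In this toy case the more highly expressed gene receives three quarters of each ambiguous read.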

Standard reporting of short reads: RPKM
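
RPKM stands for reads per kilobase of transcript per million mapped reads. A minimal sketch of the calculation, with hypothetical argument names and made-up numbers:

```python
def rpkm(reads_in_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads."""
    return reads_in_gene * 10**9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2 kb transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Dividing by length puts long and short transcripts on a common scale for reporting, but it does not by itself remove length-related effects on the power to detect differences.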

A principal benefit of mRNAseq: novel exon/isoform discovery
Mortazavi et al

Better mapping to splice junctions
Align first to genome
Remove perfect matches from further consideration (Good idea?)
Remainder are aligned to a broadened set of possible splice junctions
Extract ~25 bases from each exon, join and use as target
Standard, annotated splices
Additional possible splices
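
Building the junction targets can be sketched as follows. The genome string, exon coordinates (0-based, half-open) and the short flank are toy values chosen for readability; the slide's figure of roughly 25 bases per side would be passed as `flank=25`. Function names are hypothetical.

```python
from itertools import combinations


def junction_targets(genome, exons, flank=25):
    """Join the end of an upstream exon to the start of a downstream exon.

    exons: list of (start, end) in 0-based half-open coordinates, 5' to 3'.
    Adjacent pairs correspond to standard annotated splices; non-adjacent
    pairs cover additional possible splices such as exon skipping.
    """
    targets = {}
    for i, j in combinations(range(len(exons)), 2):
        up_start, up_end = exons[i]
        down_start, down_end = exons[j]
        donor_side = genome[max(up_start, up_end - flank):up_end]
        acceptor_side = genome[down_start:min(down_end, down_start + flank)]
        targets[(i, j)] = donor_side + acceptor_side
    return targets


# Toy gene: three 4-base exons separated by 2-base introns (lowercase).
toy_genome = "ACGT" + "nn" + "TTGG" + "nn" + "CCAA"
toy_exons = [(0, 4), (6, 10), (12, 16)]
for pair, seq in junction_targets(toy_genome, toy_exons, flank=2).items():
    print(pair, seq)
```

Reads that failed to align to the genome would then be aligned against these joined sequences.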

RNA-seq analysis I: alignment

RNA-seq analysis II: identifying isoforms

RNA-seq analysis III: visualization and interpretation

Assessing the ability to identify alternative isoforms

Trinity assembles reads to transcripts first
Grabherr et al, Nature Biotech 2011

Sequences in a de Bruijn graph
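
A minimal sketch of how reads become a de Bruijn graph: nodes are (k-1)-mers and each k-mer in a read contributes an edge from its prefix to its suffix. The reads and the simple path walk below are illustrative assumptions; assemblers such as Trinity do considerably more (error pruning, coverage weighting, isoform enumeration).

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads drawn from one short transcript.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
graph = de_bruijn(reads, k=4)
print(walk_unbranched(graph, "ATG"))
```

Alternative isoforms show up as nodes with more than one successor, which is where this simple walk stops and where an assembler has to decide which paths correspond to real transcripts.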

Open questions/problems
Dealing with the length bias: RPKM does not correctly normalize
Optimal alignment: still in development; paralogs and common motifs are a problem
Depth of coverage for isoform characterization: capture or focused chemistry helps

Summarization to total counts leads to false positives

Recent work I: Dealing with systematic bias in RNAseq data
Sources of bias: fragmentation, “random priming”
Papers of note:
Biases in Illumina transcriptome sequencing caused by random hexamer priming. Hansen et al, Nucleic Acids Research, Volume 38, Issue 12, p. e131
Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Wu et al, Bioinformatics, 10.1093/bioinformatics/btq696

Recent work II: Focused sequencing of polyA sites
Standard mRNAseq does not adequately sample polyA sites
Recent directed studies:
Formation, regulation and evolution of Caenorhabditis elegans 3′UTRs. Jan et al, Nature, Volume 469, Pages 97–101
Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation. Ozsolak et al, Cell 143(6):1018-29 (2010)