Interrogating the transcriptome in all its diversity

Joel H Graber
Thanks. I actually wanted to include the word “alternative” in this title, but I also wanted to keep the title to two lines. As I hope to convince you, it’s really the alternatives that make the processing interesting and important. So let’s put it into perspective.

Empirical transcript measurements to characterize mRNA processing
Figure labels: stop codon, polyA sites, mRNA/cDNA, ESTs, microarray probes, mRNA-Seq, mRNA-RACE

The most important thing I can tell you about (especially large-scale) transcriptome measurement
EVERY procedural step can leave its mark on the data
True of both bench and computational steps
What looks like interesting processing may not be
Know your assumptions
Be suspicious
Test and torture your data to be confident

Small numbers of genes to test: qPCR

IGF1 mRNA data indicates at least 15 or more transcript isoforms

qPCR Primer Pair Set-up should catch most isoform differences

Igf1 transcript variants are differentially expressed as a function of strain, tissue, and nutrition

Many genes simultaneously I: microarrays
The fundamental hypothesis of transcriptome measurement: the state of the cell can be ascertained by which transcripts are expressed at which levels

Using gene expression microarrays to assess variation in mRNA processing
Our modified hypothesis of transcript measurement
The activity of a cell is a function of which isoforms of which genes are expressed
Expression arrays, though designed for abundance measurement, can also reveal isoform variation
Summarization to one “expression level” is problematic
So the work I’ll show you is based primarily on the efforts of two graduate students, Jesse Salisbury and Priyam Singh. As it says here at the top, we want to modify these measurements to learn not just which genes are active, but also in what form. We are using expression microarrays, specifically the Affymetrix type of array with multiple probes targeting each transcript. The standard analysis is designed to measure abundance, but we can reprocess these data to reveal changes in isoform as well as abundance. Tied up in this is the fact that, for many genes, summarizing to a single number representing the activity of that gene is problematic, and for some genes it will miss interesting changes in gene activity.

Identifying processing changes with expression arrays: a simple example for illustration
One gene with two isoforms
Differing only in polyA site
Regulatory sites for post-transcriptional control only in extended isoform
Microarray probes hybridize to common and differential regions
Explain the model and specifically point to the probes. Once again, the thickened region represents the coding sequence. In this example the coding sequence of the two transcripts is identical, so only the 3’-UTRs differ, and they differ only in their polyA site. Just to make things interesting, we’ll include a couple of regulatory elements that make the transcripts behave differently in the presence of the right trans-acting factors. So now we have this gene, with two isoforms shown, and we’re going to do a microarray experiment in two different samples…

Standard microarray analysis can mask changes in mRNA processing
Figure labels: Sample 1, Sample 2
So we start with two samples to be measured, and we’re showing the balance of the two isoforms changing between them. In this example, I’ve chosen to keep the total abundance the same. The samples are collected and put onto the microarray. I’m showing here three replicates of each experiment in the raw data, with black representing sample 1 and orange sample 2. The signals for the three upstream, common probes are approximately equal, whereas the signals in the extended region are significantly higher for sample 2. In standard analysis, the signals for the probes are summarized to one value for each sample. But as this example shows, that is not an adequate representation. So instead of summarizing within each sample, we compare each probe individually, taking a ratio for each probe. Then we look for changes in this ratio along the length of the transcript, looking for patterns like the one we see here, where there is a clear difference between the different parts of the gene. With this analysis in place, we applied it to a number of problems of interest. The one I’m going to tell you about is pro-B-cell lymphomas, which we studied in collaboration with Kevin Mills.
Salisbury J et al, PLoS ONE 2009
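The per-probe comparison described in the notes can be sketched roughly as follows. This is only an illustration, not the published pipeline: the function name, the intensities, and the choice to average replicates in log2 space are all assumptions made for the example.

```python
import numpy as np

def probe_log_ratios(sample1, sample2):
    """Per-probe log2 ratio of sample 2 over sample 1.

    sample1, sample2: arrays of shape (n_replicates, n_probes) holding
    positive intensities, with probes ordered 5' to 3' along the transcript.
    Replicates are averaged in log space (an assumption of this sketch).
    """
    s1 = np.log2(np.asarray(sample1, dtype=float)).mean(axis=0)
    s2 = np.log2(np.asarray(sample2, dtype=float)).mean(axis=0)
    return s2 - s1

# Hypothetical intensities: 3 replicates x 6 probes.
# Probes 0-2 sit in the common region, probes 3-5 in the extended 3'-UTR.
sample1 = np.array([[100, 110,  95, 20, 22, 18],
                    [105, 100,  98, 21, 19, 20],
                    [ 98, 104, 101, 19, 20, 22]])
sample2 = np.array([[102, 108,  97, 80, 85, 78],
                    [ 99, 103, 100, 82, 79, 81],
                    [101, 106,  96, 77, 83, 84]])

ratios = probe_log_ratios(sample1, sample2)
print(np.round(ratios, 2))  # near 0 for common probes, clearly positive downstream
```

A single summarized value per sample would average over the step in this ratio profile, which is the point the slide makes.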

Statistical Test I: Modified t-test
Compute expression ratio for three or more probes on each side of putative break
Significance is assessed by randomizing probes
Spuriously low variance can cause problems
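One way to read these three bullets is sketched below. It is not the authors' implementation: the Welch-style statistic, the variance floor `s0` (added here as one possible guard against spuriously low variance), and the permutation count are assumptions chosen for illustration.

```python
import numpy as np

def break_statistic(log_ratios, brk, s0=0.1):
    """t-like statistic comparing probes upstream and downstream of index brk.

    s0 is a hypothetical constant added to the standard error so that a
    side with near-zero variance cannot produce an enormous statistic.
    """
    up, down = log_ratios[:brk], log_ratios[brk:]
    if len(up) < 3 or len(down) < 3:
        raise ValueError("need at least three probes on each side of the break")
    se = np.sqrt(up.var(ddof=1) / len(up) + down.var(ddof=1) / len(down))
    return (down.mean() - up.mean()) / (se + s0)

def permutation_p_value(log_ratios, brk, n_perm=2000, s0=0.1, seed=0):
    """Assess significance by shuffling probe positions and recomputing the statistic."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(break_statistic(log_ratios, brk, s0))
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(log_ratios)
        if abs(break_statistic(shuffled, brk, s0)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-probe log2 ratios with a step after the fourth probe.
example = [0.0, 0.1, -0.1, 0.05, 2.0, 2.1, 1.9, 2.05]
print(permutation_p_value(example, brk=4))
```

With only a handful of probes per gene, the number of distinct shuffles is small, so the smallest attainable p-value is limited. That is worth keeping in mind when setting thresholds.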

Systematic alternative processing of 3’-UTR can be correlated with functional changes
Science 2008 Cell 2009 Cancer Research 2009

The Ube2a long transcript isoform is lost in all three tumor-types
Ubiquitin-conjugating enzyme 2a is a signaling gene involved in the response to DNA damage. The layout of the 3’-end of the transcripts, along with the microarray probes, is shown at the top. Below is a combined plot comparing the three tumor types, as well as the mature B cells, against the wild-type pro-B cells. As in the illustration I showed before, 0 here means no change from the wild-type pro-B cells. All three tumors show essentially no change, or perhaps a slight decrease, in signal for the probes common to both isoforms, whereas the probes in the extended isoform show a decrease of between 3- and 8-fold. In contrast, the mature B cells show a uniform decrease of approximately 2-fold across the entire transcript. We also see the complementary pattern, with preferential preservation of the long isoform:

The Pik3ap1 long transcript isoform is preserved in APC and APN, but not LPC tumors
Here we see phosphoinositide-3-kinase adaptor protein 1 (Pik3ap1). The layout is the same as before, with the arrangement of the 3’-end of the gene and the probes shown at the top, and the probe-level analysis shown below. Pik3ap1 differs from Ube2a in two notable ways. First, the pattern is one of elongation, rather than truncation, with respect to pro-B cells. Second, the pattern for Pik3ap1 is not constant across all tumors, but is significant only for the two Artemis KO tumors. The Lig4 KO tumors show a uniform decrease in signal across all probes, whereas mature B cells show essentially no change. The fact that not all genes behave in a common manner led us to test the idea that the patterns of which genes were changed, and by how much, could be used to distinguish the tumors:

Summary: tumors have systematic and characteristic changes in RNA processing
Our data support alternate 3’-processing as the source of changes, rather than isoform-specific changes in stability
While truncation dominates, genes with elongation are also observed
APC and LPC tumors share an amplified oncogene (Myc), but differ in signature and prognosis
So to summarize what we have found: it’s important to understand that the signals we observe could arise from either alternative polyadenylation or, equivalently, isoform-specific changes in stability. Our work, as well as that of others, seems to indicate that changes in polyadenylation are playing a significant role; we found changes in the abundance of one critical trans-acting protein, as well as evidence of changes in the abundance or processing of the transcripts encoding several other polyA factors. The role of genes with elongation is an open question that we will continue to evaluate. In the growing model that truncation of the 3’-UTR leads to a general increase in protein production, elongation would presumably reduce protein production accordingly. We also performed our analysis on two sets of cell cultures, both drawn from human ovarian cancer but differing in their response to cisplatin, a common chemotherapeutic. With our analysis, we identified about 150 genes with significant changes, offering the possibility of predicting responsiveness to treatment based on RNA signatures. Finally, one of the more intriguing findings is the difference in both phenotype and RNA processing between the APC and LPC tumors, since these share a common amplified oncogene. This suggests that the differences in DNA damage could be responsible, given the different roles of Art and Lig4. We are following this up.
Singh P et al, Cancer Research 2009

Differential splicing using whole-transcript microarrays
FIRMA: a method for detection of alternative splicing from exon array data. E. Purdom, K. M. Simpson, M. D. Robinson, J. G. Conboy, A. V. Lapuk and T. P. Speed. Bioinformatics (15):
Differential splicing using whole-transcript microarrays. M. D. Robinson and T. P. Speed. BMC Bioinformatics 2009, 10:156

FIRMA/FIRMAGene details
Full model:
i: array index
j: gene index
k: probe index
Residual:
Exon arrays FIRMA score:
Gene arrays FIRMAGene score:

RMA decomposition
Figure legend: Heart, Brain, 75:25 B:H
Panel titles: Total normalized expression, Estimated probe effect, Estimated chip effect, Residual
RMA decomposition of probe-level Affymetrix data. Panel A shows the background-adjusted and normalized probe-level data for PRRX1, from the Affymetrix mixture dataset (see Methods). The probes are displayed in the order in which they map to the human genome (not to scale), and lines join all probe intensities of the same sample. PRRX1 is expressed significantly higher in heart tissue than in brain. Three replicates of pure heart tissue are shown as red lines; green lines represent pure brain tissue replicates, and the blue lines represent a mixture of 75% brain tissue and 25% heart tissue. Panel B shows the estimated relative probe effects. Panel C shows the chip effects (i.e. summarized expression levels) and Panel D shows residuals, using the same colour scheme.
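The decomposition in this caption (log intensity split into a chip effect, a probe effect, and a residual) can be illustrated with a small sketch. Two caveats, both assumptions of this example: the additive fit uses Tukey's median polish, as in classic RMA, rather than the IRLS fit used by FIRMA, and the exon-level score (median residual of an exon's probes on each array, scaled by a MAD estimate) is one reading of the FIRMA idea rather than a formula taken from these slides. All names and data are hypothetical.

```python
import numpy as np

def median_polish(y, n_iter=10):
    """Fit y[i, k] ~ chip[i] + probe[k] by Tukey's median polish.

    Returns (chip_effects, probe_effects, residuals), with the overall
    level folded into the chip effects.
    """
    resid = np.array(y, dtype=float)
    overall = 0.0
    chip = np.zeros(resid.shape[0])
    probe = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)      # sweep out array (chip) medians
        chip += row_med
        resid -= row_med[:, None]
        delta = np.median(probe)
        overall += delta
        probe -= delta
        col_med = np.median(resid, axis=0)      # sweep out probe medians
        probe += col_med
        resid -= col_med[None, :]
        delta = np.median(chip)
        overall += delta
        chip -= delta
    return overall + chip, probe, resid

def firma_like_scores(resid, exon_of_probe):
    """Median residual per (array, exon), scaled by the MAD of all residuals."""
    exon_of_probe = np.asarray(exon_of_probe)
    mad = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    exons = np.unique(exon_of_probe)
    per_exon = [np.median(resid[:, exon_of_probe == e], axis=1) for e in exons]
    return np.stack(per_exon, axis=1) / mad     # shape: (n_arrays, n_exons)

# Hypothetical gene: 4 arrays x 9 probes, 3 probes per exon; exon 2 is
# skipped on the last array, so its probes drop by 2 log2 units there.
rng = np.random.default_rng(1)
y = (np.array([8.0, 9.0, 8.5, 9.5])[:, None]
     + np.linspace(-1, 1, 9)[None, :]
     + rng.normal(0, 0.05, (4, 9)))
y[3, 6:9] -= 2.0
chip, probe, resid = median_polish(y)
scores = firma_like_scores(resid, np.repeat([0, 1, 2], 3))
print(np.round(scores, 1))
```

The skipped exon stands out as a large negative score on the affected array, while the chip effect for that array still tracks overall expression of the gene.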

Validation with known muscle-specific exons
FIRMA scores represented by color scale for all 11 of the validated probesets (left of dividing line) as well as 15 additional top-scoring probesets (right of dividing line). Two validated genes (UNR and ITGB1) also ranked high enough to be included in the top 15 candidate genes, and their labels are colored green to note this. Note that the color scale is not evenly spaced but rather based on percentiles of all FIRMA scores in all non-filtered probesets and samples.
Purdom, E. et al. Bioinformatics : ; doi: /bioinformatics/btn284

Open questions, unresolved issues of FIRMA approaches
Genetic variability
  Probes that change hybridization due to sequence variation
Low expression probes
Overlapping probes
Unresponsive/hyperresponsive probes
Interpretation
  IRLS is unsupervised; majority can define “normal”

CDFs (Chip definition files) MATTER!!!!!!
The CDF maps probes on the array to putative genes/transcripts
CDFs are explicitly dependent on the quality of the annotations used for association
Which CDF is best can literally depend on the specific gene of interest

IGF1 annotations (and genomic extent) depend greatly on the data source

Even after you get the CDF right, it’s still good to understand the limitations of your array

Exon-gene array differences
Exon
  4 probes per PSR (probe selection region)
  A mix of probable and improbable transcribed regions
Gene
  Mostly “probable” regions
  ~25 probes per targeted transcript

Comparison of gene (FIRMAGene) and exon (FIRMA) arrays for MBP
Normalized probe-level data and RMA residuals for MBP. Panels A and B show the residuals from RMA fits for the Gene and Exon arrays, respectively. There are 36 probes for Gene and 72 probes for Exon. Both panels show 33 lines, one for each hybridization (11 tissues with 3 biological replicates each). The brain and muscle replicates are shown as blue and red lines, respectively.

Measuring isoform variation with mRNAseq

Analysis of large sets of short sequence reads is a rapidly developing field
Alignment first, assembly later: MAQ, Eland, SHRiMP, Bowtie/TopHat, SOAP
Assembly first, alignment later: Trinity, Trans-ABySS, Oases

High throughput sequence data is aligned in conceptually the same way as BLAST
Better heuristics are necessary
  The problem is bigger
Program tweaks are different
  Tuned to small read size, large genome
(Rapid) Indexing is still the key
  Initial attempts were based on standard hashing, later ones on the Burrows-Wheeler Transform
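To make the indexing idea concrete, here is a toy sketch of exact matching by backward search over a Burrows-Wheeler transform. It is illustrative only: the suffix array is built by naive sorting, nothing is compressed, and mismatches are not handled, so it does not reflect how any of the aligners named above are actually implemented. The function names are made up for the example.

```python
def bwt_index(text):
    """Build a tiny FM-index-like structure (naive construction, demo only)."""
    text += "$"  # sentinel, assumed absent from text; sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # suffix array
    bwt = "".join(text[i - 1] for i in sa)                   # char preceding each suffix
    # first[c]: number of characters in text that sort strictly before c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ

def find_exact(pattern, index):
    """Return sorted start positions of exact matches of pattern in the text."""
    sa, first, occ = index
    lo, hi = 0, len(sa)
    for c in reversed(pattern):          # extend the match one base to the left
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

index = bwt_index("ACGTACGTTACG")
print(find_exact("ACG", index))
```

The work per read depends on the read length rather than the genome length once the index is built, which is why this style of index suits short reads against a large genome.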

Alignment first processing strategy
Enhanced Read Analysis of Gene Expression (ERANGE) and the allocation of multireads. (a) The main steps in the computational pipeline are outlined at left, with different aspects of read assignment and weighting diagrammed at right and the corresponding number of gene model reads treated in muscle shown in parentheses. In each step, the sequence read or reads being assigned by the algorithm are shown as a black rectangle, and their assignment to one or more gene models is indicated in color. Sequence reads falling outside known or predicted regions are shown in gray. RNAFAR regions (clusters of reads that do not belong to any gene model in our reference set) are shown as dotted lines. They can either be assigned to neighboring gene models, if they are within a specified threshold radius (purple), or assigned their own predicted transcript model (green). Multireads (shown as parallelograms) are assigned fractionally to their different possible locations based on the expression levels of their respective gene models as described in the text. (b) Comparison of mouse liver expanded RPKM values to publicly available Affymetrix microarray intensities from GEO (GSE6850) for genes called as present by Rosetta Resolver. Expanded RPKMs include unique reads, spliced reads and RNAFAR candidate exon aggregation, but not multireads. Genes with >30% contribution of multireads to their final RPKM (Supplementary Fig. 4) are marked in red. (c) Comparison of Affymetrix intensity values with final RPKMs, which includes multireads. Note that the multiread-affected genes that are below the regression line in b straddle the regression line in c.

Standard reporting of short reads: RPKM
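For reference, RPKM is reads per kilobase of transcript (exon model) per million mapped reads. A minimal sketch of the calculation follows; the function name and the numbers in the example are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads on a 2 kb transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

The length term is applied per gene model, so the value depends on which isoform length is assumed.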

A principal benefit of mRNAseq: novel exon/isoform discovery
Mortazavi et al

Better mapping to splice junctions
Align first to genome
Remove perfect matches from further consideration (Good idea?)
Remainder are aligned to a broadened set of possible splice junctions:
  Extract ~25 bases from each exon, join and use as target
  Standard, annotated splices
  Additional possible splices
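The junction-target construction in the last bullets might look something like the sketch below. It is a simplification under stated assumptions: exons are given as plain sequences in 5' to 3' order for one gene, "annotated" splices are approximated by adjacent exon pairs, and the function and parameter names are invented for the example.

```python
from itertools import combinations

def junction_targets(exons, flank=25, adjacent_only=False):
    """Build synthetic splice-junction sequences for one gene.

    exons: list of exon sequences in 5'->3' order on the same strand.
    flank: bases taken from the end of the upstream exon and from the
           start of the downstream exon (shorter exons contribute all
           of their sequence).
    adjacent_only: if True, keep only neighbouring exon pairs, used here
           as a stand-in for annotated splices; otherwise include every
           upstream/downstream pair as an additional possible splice.
    """
    targets = {}
    for i, j in combinations(range(len(exons)), 2):
        if adjacent_only and j != i + 1:
            continue
        targets[(i, j)] = exons[i][-flank:] + exons[j][:flank]
    return targets

# Hypothetical three-exon gene; exon 1 is shorter than the flank.
exons = ["A" * 30, "C" * 10, "G" * 40]
for pair, seq in junction_targets(exons).items():
    print(pair, len(seq))
```

Reads that failed to align to the genome can then be aligned against these targets. The number of candidate junctions grows quadratically with exon count when all pairs are included.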

RNA-seq analysis I: alignment

RNA-seq analysis II: identifying isoforms

RNA-seq analysis III: visualization and interpretation

Assessing the ability to identify alternative isoforms

Trinity assembles reads to transcripts first
Grabherr et al, Nature Biotech 2011

Sequences in a de Bruijn graph
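As a rough illustration of how reads become a de Bruijn graph, the sketch below links each (k-1)-mer to the (k-1)-mer that follows it in some read. This shows the general data structure only and is not Trinity's implementation; the reads and the value of k are made up.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Two hypothetical reads that share a prefix and then diverge,
# as reads from two isoforms of the same gene might.
graph = de_bruijn(["ACGTA", "ACGGA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one outgoing edge marks a point where the reads diverge, which is where alternative isoforms (or sequencing errors and paralogs) show up in the graph.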

Open questions/problems
Dealing with the length bias
  RPKM does not correctly normalize
Optimal alignment
  Still in development
  Paralogs and common motifs are a problem
Depth of coverage for isoform characterization
  Capture or focused chemistry helps

Summarization to total counts leads to false positives

Recent work I: Dealing with systematic bias in RNAseq data
Sources of bias
  Fragmentation
  “Random priming”
Papers of note:
  Biases in Illumina transcriptome sequencing caused by random hexamer priming. Hansen et al, Nucleic Acids Research, Volume 38, Issue 12, p. e131
  Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Wu et al, Bioinformatics /bioinformatics/btq696

Recent work II: Focused sequencing of polyA sites
Standard mRNAseq does not adequately sample polyA sites
Recent directed studies:
  Formation, regulation and evolution of Caenorhabditis elegans 3′UTRs. Jan et al, Nature, Volume 469, Pages 97–101
  Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation. Ozsolak et al, Cell 143(6): (2010)
