Interrogating the transcriptome in all its diversity

Joel H Graber
Thanks. I actually wanted to include the word “alternative” in this title, but I also wanted to keep the title to two lines. As I hope to convince you, it’s really the alternatives that make the processing interesting and important. So let’s put it into perspective.

Why were so many predictions of the number of genes in a mammalian genome wrong?
Nature Genetics, June 2000, v25, n2.

Mammalian genomes contain far more transcript variants than protein variants
Genome Biology (2009) 10:201.
Average protein products per locus = 1.7
Average distinct transcripts per locus = 5.7
From a review of gene finding in genomes earlier this year, we get these estimates: 1.7 proteins per locus, but 5.7 transcripts per locus. There’s even good reason to believe that the transcript number is an underestimate. So how do we get so many more transcripts than proteins?

A processed, protein coding mRNA molecule includes distinct functional regions
Genomic sequence
5’-untranslated region (5’-UTR)
Protein coding sequence
3’-untranslated region (3’-UTR)

Pieces of a (Eukaryotic) Protein-Coding Gene (on the genome)
[Figure: schematic of a gene on the genome, drawn 5’ to 3’ at the ~Mbp scale and, zoomed in, at the ~kbp scale. Labeled parts: promoter (~103 bp), enhancers, other regulatory sequences, polyadenylation site, exons (CDS and UTR) / introns.]

Alternate mRNA processing can lead to multiple transcript and/or protein products
[Figure: four exons on genomic DNA and the variant transcripts made from them; one group is labeled "3 transcripts, 1 protein product".]
Here’s an illustration that should help make it clear. We start with four exons on genomic DNA, and look at the variant transcripts that can be made from them. In each of these diagrams, the CDS is represented by the thicker part of the line, and untranslated regions by the thin part. The critical thing to notice is that there are only two protein products here, with the last three transcripts differing only in their 3’-UTRs. So the question arises: what is the role of this sort of variability? Why are there genes with multiple transcripts that generate the same protein product? The standard answer lies in regulation, inasmuch as sequence in the 3’-UTR has been shown to mediate post-transcriptional regulation in terms of stability, localization, and translation. We can see the consequences with an example from a recent paper, specifically the IGF2 mRNA binding protein 1.

Carolyn demonstrates gene regulation
Transcription control, mRNA degradation, localization, protein translation
Protein = water in pool
mRNA = water in hose
DNA = water in pipes
But even this is too simple, because it shows the expression of one gene in isolation, as if it were independent. What we know is that the expression of genes in pretty much any organism, from single-celled bacteria to multicellular organisms like mammals, is actually driven by a complex, interwoven network of interactions. So let’s look at this in a more network-like representation.

A somewhat more formal view of regulation in the various stages of gene expression

Systematic changes to mRNA processing can significantly change the regulatory program of a cell
Changes can be in a single gene or systemic
Regulatory control during transcript generation
  Transcription initiation site
  Splicing pattern
  3’-processing (polyadenylation and cleavage) site
  RNA editing
Subsequent isoform-specific regulatory control
  Stability
  Translational efficiency
  Localization
The changes that we’re talking about can be either to a single, critical gene, or to a large number of genes through systematic changes in trans-acting factors. It’s critical to remember that this regulation can happen at multiple levels, which provides mechanisms for very specific control of a given gene’s activity. The simplistic statement that I typically make is that the more important a gene is, the more specific its regulation needs to be. The basis of our studies is improving our understanding of this particular form of regulation. So for the rest of the talk,

A brief history of transcript measurement

Implications of transcript variation for gene expression measurement
Most large-scale expression studies report one level per gene per sample
  Microarrays: one reported value of expression per probeset; duplicate probesets are either averaged or discarded
  mRNAseq: RPKM (reads per kilobase of transcript per million reads)
For many genes, summarization to one expression level in a given cell type is inadequate
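To make the RPKM definition concrete, here is a minimal sketch in Python. The function name `rpkm`, the two isoform names, and all read counts, lengths, and the library size are made-up values for illustration only; they do not come from any dataset discussed in this talk.

```python
def rpkm(read_count, transcript_length_bp, total_reads):
    """Reads per kilobase of transcript per million reads."""
    kilobases = transcript_length_bp / 1_000
    millions_of_reads = total_reads / 1_000_000
    return read_count / kilobases / millions_of_reads


# Hypothetical gene with two isoforms that differ only in 3'-UTR length.
total_reads = 10_000_000
isoforms = {
    "short_utr": {"reads": 500, "length_bp": 2_000},
    "long_utr": {"reads": 300, "length_bp": 6_000},
}

# One value per isoform, rather than one value per gene.
per_isoform = {
    name: rpkm(iso["reads"], iso["length_bp"], total_reads)
    for name, iso in isoforms.items()
}
for name, value in per_isoform.items():
    print(name, value)
```

Reporting a single number for this gene would hide the fact that the two isoforms are present at quite different levels, which is the point of the last bullet on the slide.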

Every time we find a new way to measure RNA, we find previously unknown types
Mattick et al, Trends Genet 2009

Classes of alternative transcripts
Alternative splicing
Alternative transcript initiation sites
Alternative cleavage and polyadenylation (3’-processing)
Combinations of one or more of these

The cascade of alternative mRNA processing in gene regulation
mRNA processing selections during mRNA generation can have a profound effect on downstream regulation of the resulting transcript

Processing, and specifically alternative processing, are controlled by cis-elements and trans-factors
mRNA processing signals are typically constrained in both sequence content and positioning
Activity of specific sites is a function of the strength of the local signals and the cell- and environment-specific concentrations/activities of trans-factors

Alternative splicing

Alternative splicing can occur in several ways

Splicing signals and interacting factors

Cis elements required for splicing
Yeast: 5’ss GUAUGU; BP UACUAAC; 3’ss YAG
Vertebrates: 5’ss AG|GUAAGU; BP CURAY; polypyrimidine tract YYYY10-15; 3’ss NCAG|GU
Plants: 5’ss AG|GUAAGU; BP CURAY; 3’ss UGYAG|GU; UA-rich intron sequence
Flanking exons are annotated with ESE (and ESE? where uncertain).
Per-position conservation values (%) shown on the figure: 62 100 70 49 64 95 100 44 79 99 58 53 42 100 57
5’ss: 5’ splice site (donor site)
3’ss: 3’ splice site (acceptor site)
BP: branch point (A is the branch point base)
YYYY10-15: polypyrimidine tract
Y: pyrimidine; R: purine; N: any base
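The degenerate codes in these consensus sequences (Y, R, N) can be expanded into a regular expression to scan a sequence for candidate sites. The following is only a sketch: the helper names `consensus_to_regex` and `find_sites` and the toy intron sequence are invented for illustration, and a plain consensus match ignores the positional conservation that a weight matrix would capture.

```python
import re

# Subset of IUPAC RNA codes used in the consensus sequences on this slide.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "Y": "[CU]",    # pyrimidine
    "R": "[AG]",    # purine
    "N": "[ACGU]",  # any base
}


def consensus_to_regex(consensus):
    """Expand a degenerate consensus such as CURAY into a regex string."""
    return "".join(IUPAC_RNA[base] for base in consensus)


def find_sites(consensus, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [match.start() for match in pattern.finditer(sequence)]


# Made-up intron: donor GUAAGU, a CURAY-like branch point, a
# pyrimidine-rich stretch, and a terminal YAG.
toy_intron = "GUAAGUCCGGCUGACAAUUUUCCUUUCCAG"
print(find_sites("GUAAGU", toy_intron))
print(find_sites("CURAY", toy_intron))
print(find_sites("YAG", toy_intron))
```

The lookahead wrapper is a design choice so that overlapping candidate sites are all reported instead of only the leftmost non-overlapping ones.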

PWM representations of splice site signals (mice)

Frequency of bases in each position of the splice sites
Donor sequences (5’ splice site), exon | intron: A G | G U A A G U
Acceptor sequences (3’ splice site), intron | exon: Y Y Y Y Y Y Y Y Y Y Y N Y A G | G
Polypyrimidine tract (Y = U or C; N = any nucleotide)
[Figure: tables of %A, %U, %C, and %G at each position of the two sites.]
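A per-position base frequency table like the one on this slide can be computed from a set of aligned sites. This is a sketch under stated assumptions: the four donor sequences are made up, and the function names `base_percentages` and `consensus` are hypothetical helpers, not part of any tool used for the slides.

```python
from collections import Counter

BASES = "AUCG"


def base_percentages(aligned_sites):
    """Percent of each base at each position of equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    table = {base: [] for base in BASES}
    for position in range(length):
        counts = Counter(site[position] for site in aligned_sites)
        for base in BASES:
            table[base].append(100.0 * counts[base] / len(aligned_sites))
    return table


def consensus(table):
    """Most frequent base at each position (ties go to the first in BASES)."""
    length = len(next(iter(table.values())))
    return "".join(
        max(BASES, key=lambda base: table[base][i]) for i in range(length)
    )


# Made-up donor sites: last 2 exon bases followed by first 6 intron bases.
toy_donors = ["AGGUAAGU", "AGGUGAGU", "CGGUAAGU", "AGGUAAGA"]
percentages = base_percentages(toy_donors)
for base in BASES:
    print("%" + base, percentages[base])
print(consensus(percentages))
```

A position weight matrix is typically derived from such a table by comparing each frequency to a background frequency; that step is left out here to keep the sketch short.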

Example 1: Insulin-like growth factor 1 (Igf1)
AKA somatomedin C or mechano growth factor
Produced primarily by the liver as an endocrine hormone
Primary action is mediated by binding to IGF1R
Natural activator of the AKT pathway
A primary mediator of the effects of growth hormone
Expression has been
  Negatively correlated with lifespan
  Positively correlated with body size
Its regulatory control remains poorly understood after 30 years

IGF1 is subject to extensive alternative mRNA processing
~83,000 nt

IGF1 mRNA data indicate at least 15 transcript isoforms

Salient features of IGF1 expression
Mature, circulating IGF1 protein is a cleavage product, coded entirely in exons 3 and 4
Exon 5 contains an additional peptide cleavage product, with demonstrated independent functionality
Exons 1 and 2 are mutually exclusive, and likely not the only upstream, transcript-initiating exons
Exon 5 can be skipped, included, or 3’-terminal
Exon 6’s reading frame changes depending on whether it is spliced from exon 4 or 5
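The last point, that the reading frame of exon 6 depends on which exon precedes it, follows from simple arithmetic: including an upstream exon whose coding length is not a multiple of 3 shifts the frame of everything downstream. The sketch below uses invented exon lengths (they are not the real IGF1 values) and assumes, for illustration, that counting starts in frame at the beginning of exon 3.

```python
def frame_entering_exon(upstream_coding_lengths):
    """Codon position (0, 1, or 2) at which the next exon begins,
    given the coding lengths (nt) of the exons spliced in before it."""
    return sum(upstream_coding_lengths) % 3


# Hypothetical coding lengths in nt, NOT the real IGF1 values.
coding_nt = {"exon3": 150, "exon4": 180, "exon5": 50}

# Exon 6 spliced directly from exon 4 (exon 5 skipped).
frame_4_to_6 = frame_entering_exon([coding_nt["exon3"], coding_nt["exon4"]])

# Exon 6 spliced from exon 5 (exon 5 included).
frame_5_to_6 = frame_entering_exon(
    [coding_nt["exon3"], coding_nt["exon4"], coding_nt["exon5"]]
)

print(frame_4_to_6, frame_5_to_6)
```

With these toy numbers the two splicing choices enter exon 6 in different frames, so the same exon 6 nucleotides would be translated into different amino acids.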

IGF1 has two possible terminal exons (5 and 6)
~22,000 nt

IGF1 exon 6, if included, can vary between ~200 and ~6400 nt

Alternative polyadenylation

Alternative 3’-processing can arise in several ways with varying consequences
Alternative 3’-end formation comes in a number of different cases. We have been gaining more and more information over the last few years about the nature of alternative 3’-end formation; most of the data has come from sequencing, and for many years especially from ESTs. Modern tools are now driving a new revolution, and we’re going to rapidly gain new data with ever-increasing depth of information. But what I want to tell you about is our work using an existing body of data, specifically expression microarrays, to investigate changes in 3’-end formation. To explain how we do this, let’s start with a toy problem that is simplified but essentially accurate in the picture that it portrays.
Adapted from Yan J, et al., Genome Research. 2005; 15(3):

PolyA site selection depends on sequence elements and abundance/stoichiometry of trans-factors
[Figure: pre-mRNA drawn 5’ to 3’ with an arrow at the polyA site. Sequence labels: UGUA, AAUAAA, PAS, U-rich, UG-rich, DSE, G-rich. Protein labels: CPSF (subunit labels 68, 73, 160, 25, 30, and 100 kD shown alongside), CstF (50, 64, and 77 kD, drawn twice), PAPOL, Symplekin, hnRNP H. Up to >80 proteins in complex.]
So we can put these signals roughly in place on a pre-mRNA. The arrow marks the polyA site. The Cleavage Polyadenylation Specificity Factor (CPSF) has been shown to interact with the hexamer element through its 160 kD subunit. The 25 and 68 kD subunits interact with the upstream UGUA element. There’s probably additional contact with U-rich elements. The Cleavage Stimulation Factor (CstF) interacts downstream with the U-rich and probably also the UG-rich elements, and the current model indicates that CstF acts as a dimer. A number of other factors are also at play, including hnRNP H and H’, which interact with downstream G-rich elements. A recent structural study of the polyA complex indicated as many as 85 total proteins involved in the reaction. What we can see from this is the complexity, and the accompanying potential for controlling and changing the efficiency of polyadenylation based on the specific sequence and on the amount and balance of all of these protein factors.

NMF defines patterns of signals that control 3’-processing (cleavage and polyadenylation)
So what I’m showing here is a representation of the results: the plot down below shows the H matrix, the positioning probabilities. Up on top is a sequence logo representation of the W matrix. The details of going from a vector to a sequence logo are a bit hairy, but I can go through them if anyone wants. Note that the NMF approach will find any sequence variation with position, so beyond the specific elements, we also pick up any changes in the background sequence content, for instance in the transition from 3’-UTR to intergenic sequence. As a result we typically work with 2 to 3 more elements in the solution than we expect to have for real motifs. In the case of the mouse polyadenylation signal here, we believe there are five or maybe six critical determinants, which I’m highlighting here. We can put these findings into the context
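For readers unfamiliar with non-negative matrix factorization, here is a minimal sketch of the idea: a non-negative data matrix X (rows as sequence features, columns as positions) is approximated as W times H, so that W carries sequence content and H carries positioning. This uses the generic multiplicative-update scheme of Lee and Seung in plain Python; it is not necessarily the algorithm or the data layout used for the results described here, and the toy matrix and all function names are invented.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(X, W, H):
    """Sum of squared differences between X and W*H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2
               for x_row, y_row in zip(X, WH)
               for x, y in zip(x_row, y_row))


def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative X (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Toy data: 4 hypothetical sequence features x 6 positions, built from
# two hidden patterns (one upstream, one downstream).
W_true = [[1, 0], [0, 1], [1, 1], [0, 2]]
H_true = [[5, 3, 1, 0, 0, 0], [0, 0, 1, 3, 5, 2]]
X = matmul(W_true, H_true)

W, H = nmf(X, rank=2)
print(squared_error(X, W, H))
```

Asking for more components than the expected number of motifs, as described above, corresponds to raising `rank` so that background composition changes get their own components.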

Example 2: Insulin-like growth factor 2 mRNA binding protein 1 (Igf2bp1)
Contains four K homology domains and two RNA recognition motifs
Binds to the 5’-UTR of IGF2 mRNA, regulating translation
Can act as an oncogene if misregulated
Evolutionarily conserved, with a critical role in mRNA localization and translational control

Consequences: Igf2bp1 has transforming potential only when expressed in its truncated isoform
[Figure: genome browser view of Igf2bp1, drawn 5’ to 3’, with polyA tails (AAA…) marking transcript ends.]
As you can see here from the genome browser view, the Igf2bp1 sequence spans about 50,000 bases. If we zoom into the 3’-terminal exon, we can see that there are two polyadenylation sites, separated by about 6000 bases. You can see the evidence of various transcripts, and the bars at the bottom represent the evolutionary conservation of the sequence. The difference between the transcripts with and without the extended UTR is stark: when expressed in cell lines, the truncated version is transforming, whereas the extended version is not. These highly conserved elements in the 3’-UTR are likely critical for the proper regulation of Igf2bp1. So with that background, let’s put the whole program in perspective.
Mayr and Bartel, Cell 2009

Inclusion (or exclusion) of regulatory sequences in the 3’-UTR fine-tunes expression and response
Spicher et al, Mol Cell Biol 1998

Example 3: Regulated control of polyA site selection for antibodies during B-cell maturation

Alternative transcription initiation

Alternative transcription initiation can arise in several ways with varying consequences
CAGE tags showed an unexpectedly high frequency in the 3’-UTR

3’-UTR CAGE tags occur in evolutionarily conserved contexts with a common local sequence

The definition of a gene becomes much more fluid: Ins2-IGF2
Two genes with a spurious connection?
One large gene with distinct, disjoint transcripts?

Cleaved 3’-UTR RNA products (uaRNAs) are often tissue-specific and can localize differentially

Next time: Details of measuring transcript differences in large-scale
