1
Alternative Splicing from ESTs
Eduardo Eyras Bioinformatics UPF – February 2004 In this presentation I would like to give an overview of how Ensembl produces comparative genomics data. In particular I will present results of the comparison of the Mouse and Human genomes according to the Ensembl analyses.
2
Intro ESTs Prediction of Alternative Splicing from ESTs
3
Transcription Splicing Translation 5’ 3’ 3’ 5’ pre-mRNA exons introns
Mature mRNA Splicing 5’ CAP AAAAAAA Translation Peptide
4
Transcription Different Splicing Translation 5’ 3’ 3’ 5’ exons introns
pre-mRNA Mature mRNA Different Splicing 5’ CAP AAAAAAA Translation Different Peptide
5
Alt splicing as a mechanism of gene regulation
Functional domains can be added or subtracted, which increases protein diversity
It can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
It can modify the activity of transcription factors, affecting the expression of genes
It is observed in nearly all metazoans
Estimated to occur in 30%-60% of human genes
6
Forms of alternative splicing
Exon skipping / inclusion
Alternative 3’ splice site
Alternative 5’ splice site
Mutually exclusive exons
Intron retention
(Figure legend: constitutive exon; alternatively spliced exons)
There are 5 types of alternative splicing.
Exon skipping: one exon is not included in one of the variants.
Alternative 3’ splice site: the variants use different acceptor sites at the 3’ end of an intron, so one variant contains an extra piece of sequence at the start of the downstream exon.
Alternative 5’ splice site: similarly, but with different donor sites at the 5’ end of an intron, so the extra sequence lies at the end of the upstream exon.
Mutually exclusive exons: each exon of a pair is included in a different variant only, so the two never appear together.
Intron retention: one intron in a variant is part of an exon in another variant.
7
How to study alternative splicing?
8
ESTs (Expressed Sequence Tags)
Single-pass sequencing of a small (end) piece of cDNA
Typically nucleotides long
It may contain coding and/or non-coding regions
ESTs represent snapshots of the genome being expressed under a certain set of conditions.
They are single-pass sequence reads from cDNAs cloned from a cell.
They are usually short, and the 5’ and 3’ ends of the clones are usually over-represented.
Sequence quality usually diminishes towards the end of the EST.
Some contain pieces of vector sequence.
ESTs may contain coding and non-coding regions of the cDNA.
The information they provide can be biased by overly restrictive sampling.
Note: mRNA is very unstable outside of a cell; therefore, scientists use special enzymes to convert it to complementary DNA (cDNA). cDNA is a much more stable compound and, importantly, because it is generated from an mRNA from which the introns have been removed, cDNA represents only expressed DNA sequence.
9
ESTs Cells from a specific organ, tissue or developmental stage
(Diagram: mRNA extraction; an oligo-dT primer annealed to the poly-A tail; first-strand synthesis by reverse transcriptase, giving an RNA-DNA hybrid; ribonuclease H treatment; second-strand synthesis by DNA polymerase; double-stranded cDNA.)
Expressed genes are converted to mature RNAs after transcription and splicing. mRNA molecules are unstable but can be converted into cDNAs (complementary DNA). cDNA molecules are DNA molecules, hence they have the usual base pairing and are more stable.
Most eukaryotic mRNAs have a poly-A tail at their 3’ end. This is used as the priming site for cDNA synthesis. The primer is a short synthetic DNA oligonucleotide, typically 20 nucleotides in length, made up entirely of T’s.
After the first strand is synthesized, the preparation is treated with ribonuclease H, which specifically degrades the RNA component of an RNA-DNA hybrid. Short segments of the RNA are left behind, and these prime the second-strand synthesis, which is catalyzed by DNA polymerase I.
10
Single-pass sequence reads
(Diagram: double-stranded cDNA cloned into a vector; multiple cDNA clones; single-pass sequence reads giving a 5’ EST and a 3’ EST.)
Double-stranded cDNAs are cloned into vectors, which generates what is called a clone library. Clones are picked at random for sequencing. Only short segments are sequenced, from the 5’ and 3’ ends. The ESTs therefore represent the ends of expressed mRNAs.
11
Sampling the Transcriptome with ESTs
(Diagram: genomic sequence; primary transcript; splicing into splice variants; oligo-dT primer and reverse transcriptase; double-stranded cDNA clones; EST sequences as single-pass sequence reads from the 5’ and 3’ ends.)
The reverse transcriptase used to manufacture each cDNA in the library will eventually fall off the template, and this terminates the production of the cDNA. Thus a series of 3'-delimited cDNA fragments of different lengths may be produced for each mRNA that is a viable template in the library. The length of the cDNA will vary, and this is an important factor for the coverage obtained for each mRNA template of a gene.
Usually, several hundred to several thousand clones are isolated at random from a given cDNA library. Clones are sequenced a single time, from one or both ends of the DNA insert, using universal primers which are complementary to the vector at the multiple cloning site. In almost all cases, the process produces 'oriented' clones, where the positions of the 5' and 3' ends of the cDNA relative to the vector are known in principle (although subject to some experimental error). Thus, two defined vector-based primers can be used to obtain a 3' and a 5' sequence from the same clone; depending on the length of the insert and the quality of the trace data, the sequences determined from the two ends may or may not overlap. A single read is taken from each primer.
ESTs potentially retain information about the differential splicing of the primary RNA.
12
Large scale EST-sequencing coupled to Genome sequencing
13
EST sequencing Is fast and cheap
Gives direct information about the gene sequence
Partial information
Resulting ESTs: known gene (DB searches), similar to known gene, contaminant, novel gene
EST sequencing turned out to be a very fast and relatively cheap way of obtaining direct information about genes. Each sequence contains only partial information but is long enough to identify the gene it originates from. ESTs can be analyzed using database searches with programs like BLAST. They usually fall into four main categories:
Those that are identical to a portion of a known gene
Those with sequence similarity to a known gene
Those that can be deemed useless because they are either devoid of meaningful sequence or match sequence from contaminating organisms
Those that do not match anything in the database
14
dbEST release 20 February 2004
Number of public entries: ,039,613
Summary by organism
Homo sapiens (human) ,472,005
Mus musculus + domesticus (mouse) ,056,481
Rattus sp. (rat) ,841
Triticum aestivum (wheat) ,926
Ciona intestinalis ,511
Gallus gallus (chicken) ,385
Danio rerio (zebrafish) ,652
Zea mays (maize) ,417
Xenopus laevis (African clawed frog) ,901
…
15
Human EST length distribution
EST lengths ~ 450 bp Human EST length distribution (dbEST Sep ) ESTs are usually between 400 and 600 bp in length. For human, this distribution is peaked at around 450 bp. There is, however, a second peak at nearly 1000 bp which is perhaps related to the fact that there are also many cDNA sequences (almost full length) in dbEST.
16
ESTs provide expression data
eVOC Ontologies
Anatomical System: the tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
Cell Type: the precise cell type from which a sample was prepared. Examples are B-lymphocyte, fibroblast and oocyte.
Pathology: the pathological state of the sample. Examples are normal, lymphoma and congenital.
Developmental Stage: the stage during the organism's development at which the sample was prepared. Examples are embryo, fetus and adult.
Pooling: indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.
ESTs also allow the identification of genes specifically expressed in a chosen library or tissue, since they are obtained under a given set of known conditions. Once we locate the gene an EST belongs to, we obtain expression information about that gene. Currently there are several projects to organize the expression information into a set of orthogonal vocabularies that describe expression in a specific manner: an ontology. One of these projects is the eVOC Ontologies from SANBI, which provides a very high quality classification of the expression information for ESTs and is becoming a standard. eVOC provides a link between the vocabularies and the EST sequences. A link between genes and eVOC expression data can also be found at
J Kelso et al. Genome Research 2002
17
ESTs provide expression data
eVOC Ontologies Developmental Stage Cell Type Pathology Anatomical System Pooling … nervous brain cerebellum … Library 1 Library 2 … ESTs ESTs
18
Linking the expression vocabulary to gene annotations
(Diagram: ESTs mapped to genes.)
From the set of ESTs aligned to the genome, we can derive a mapping between the ESTs and the Ensembl genes according to compatible splicing structure. The comparison is coordinate based and not sequence based. One could also imagine a sequence-based comparison, although it would be less specific than the comparison based on genomic position. In the SANBI expression database each EST is linked to a library_name, which is linked to five ontologies, or trees of expression vocabulary: Anatomy, Cell Type, Pathology, Developmental Stage, Preparation. In this way we can link each Ensembl gene with an expression vocabulary.
V Curwen et al. Genome Research (2004)
19
Gene expression vocabulary
This vocabulary can be used for querying in EnsMart.
20
Normalized vs. non-normalized libraries
In order to obtain information from the lowly expressed genes and not be overwhelmed by the highly expressed genes, the results are usually ‘normalized’, that is, we equalize the density of ESTs across genes regardless of how highly each gene is expressed. The result is usually called a ‘normalized library’, and it is the standard information to work with. Here we can see the case of the human genome with non-normalized EST libraries mapped to Ensembl genes. In blue we can see the number of ESTs per gene. This gives us an idea of the transcription activity in the genome.
21
The down side of the ESTs
Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory)
Random sampling: the more ESTs we sequence, the fewer new useful sequences we get
Despite the usefulness of ESTs, they have some problems that limit this approach. With ESTs it is very hard to detect genes that are expressed at a very low level, or genes that are expressed only under a very rare and specific set of conditions. We would have to reproduce every single possible set of conditions to be able to find those. Moreover, this method does not detect non-expressed sequences, like regulatory regions.
An added problem is the nature of random sampling. In the way ESTs are obtained, each time we get one EST we have not reduced the number of possible sequences that we may obtain next; it is always the same pool of sequences. As a result, sequencing further does not bring us proportionally closer to an exhaustive collection of all the genes.
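The diminishing return from random sampling can be made concrete with a small sketch. It assumes, purely for illustration, that every transcript in the pool is equally likely to be picked at each draw; real libraries are far from uniform, which makes the effect stronger for rare transcripts. The function names are hypothetical.

```python
def expected_distinct(num_transcripts: int, num_ests: int) -> float:
    """Expected number of distinct transcripts seen after num_ests draws.

    Assumption (not from the slides): draws are independent and every
    transcript in the pool is equally likely at each draw.
    """
    if num_transcripts <= 0:
        raise ValueError("num_transcripts must be positive")
    # Probability that one given transcript is missed by every draw.
    p_missed = (1.0 - 1.0 / num_transcripts) ** num_ests
    return num_transcripts * (1.0 - p_missed)


def prob_next_is_new(num_transcripts: int, num_ests: int) -> float:
    """Probability that the next EST comes from a transcript not yet seen."""
    return (1.0 - 1.0 / num_transcripts) ** num_ests


if __name__ == "__main__":
    pool = 10_000  # hypothetical number of transcripts in the library
    for n in (1_000, 10_000, 50_000):
        print(n, round(expected_distinct(pool, n)), round(prob_next_is_new(pool, n), 4))
```

Under this simplified model the chance that the next read is new decays geometrically with the number of reads already taken, which is the behaviour the slide describes.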
22
Using ESTs to study Alternative Splicing
Since the first EST sequencing project, many institutions around the world have carried out intensive EST sequencing of different organisms and under different conditions (as we can see from the dbEST release). In the meantime, the human genome has been sequenced. All the sequence corresponding to the euchromatin and part of the heterochromatin is known. Now is a good moment to combine both sources of information to explore the genome. The study of the genome with ESTs is now known as transcriptomics.
23
ESTs aligned to the genome
(Diagram: ESTs aligned to a gene with GT/AG splice sites, a stop codon (*) and a polyA site; the true match is the best match in the genome, as opposed to matches to a paralog or to a processed pseudogene.)
It defines the location of exons and introns
We can verify the splice sites of introns and check the correct strand of spliced ESTs
It helps to prevent chimeras
It can avoid putting together ESTs from paralogous genes
We can avoid including pseudogenes in our analysis
Poly-A tails must be clipped before aligning
ESTs can provide the gene sequence, but only for expressed sequences. The genomic sequence is necessary to obtain information about regulatory signals.
Approach: we look for alternative splicing using ESTs aligned to the genome. Besides the advantages listed above, the genomic sequence lets us find paralogs and, with appropriate filtering, avoid sequencing errors.
In this situation we define the problem of finding alternative splicing information as follows.
Problem: find the minimal set of transcripts which is compatible with the splicing of a given EST cluster, such that the transcripts are not redundant with each other, that is, their splicing is non-equivalent.
24
Alternative Exons/ 3´ PolyA sites from ESTs
ESTs can provide information about possible alternative splicing when they are aligned against the genome or against mRNA data. On the left we can see an example of several ESTs aligned to the genome. Their alignment structure suggests several possible forms of splicing, i.e. several possible combinations of exons. We will see a method to derive these in more detail later on. On the right we see a set of ESTs aligned to an mRNA sequence. In this case, the ESTs suggest multiple polyadenylation sites on the mRNA. Likewise, EST alignments can suggest exon skipping and other similar alternative splicing phenomena.
25
Aligning ESTs to the Genome
Many ESTs
Fast programs, fast computers
Nearly exact matches
Coverage >= 97%
Percent_id >= 97%
Splice sites: GT-AG, AT-AC, GC-AG
We use exonerate to align the ESTs, with an est2genome model. We clip a number of bases from either end of the ESTs and then remove any remaining polyA/polyT tails. This increases the number of ESTs that are mapped over their full length. The thresholds are much stricter than usual, to make sure we obtain a good set of predictions. The criterion for merging is an exact match at internal splice sites, allowing any mismatch at external sites.
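The thresholds above can be expressed as a simple filter. This is a minimal sketch that assumes each alignment has already been parsed into a record; the `EstAlignment` class and its fields are hypothetical and do not correspond to exonerate's output format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Intron boundary dinucleotides accepted on the slide: (donor, acceptor).
ALLOWED_SPLICE_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}


@dataclass
class EstAlignment:
    """Hypothetical, already-parsed EST-to-genome alignment."""
    est_id: str
    coverage: float        # percentage of the EST that is aligned
    percent_id: float      # percentage identity of the aligned part
    intron_ends: List[Tuple[str, str]]  # (donor, acceptor) per intron, read on the EST strand


def passes_filters(aln: EstAlignment,
                   min_coverage: float = 97.0,
                   min_percent_id: float = 97.0) -> bool:
    """Keep only nearly exact matches whose introns all have accepted splice sites."""
    if aln.coverage < min_coverage or aln.percent_id < min_percent_id:
        return False
    # An unspliced alignment has no introns and passes this check trivially.
    return all((donor.upper(), acceptor.upper()) in ALLOWED_SPLICE_SITES
               for donor, acceptor in aln.intron_ends)


if __name__ == "__main__":
    good = EstAlignment("est1", 98.5, 99.0, [("GT", "AG"), ("GC", "AG")])
    bad = EstAlignment("est2", 99.0, 99.0, [("CT", "AC")])
    print(passes_filters(good), passes_filters(bad))
```

An intron read as CT-AC usually means the EST is aligned on the wrong strand, so a check like this also covers the strand verification mentioned earlier.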
26
Genomics as a Technology
Development of special software: fast versus accurate alignment Development of special technology: efficient use of computer farms (~2000 CPUs)
27
Recovering full transcripts from ESTs
28
Recover the mRNA from the ESTs
EST sequences are partial information. We want to recover the full mature RNA sequence from the ESTs. This has led to strategies that ‘cluster’ ESTs according to sequence similarity in order to try to recover the complete sequence. Moreover, since for each EST we know the clone library and whether it is a 5’ or 3’ EST, we can use that information to put together clusters from both ends of a gene. Nevertheless, this information is not always available, so it will not always be possible to recover the full gene sequence.
Clustering methods try to strike a balance between the gene coverage of the clusters and the specificity of the clusters. This depends on how stringent or loose the clustering is. Stringent one-pass assembly methods tend to result in fewer, shorter consensus sequences. Looser clustering results in larger, more 'sloppy' clusters, with various expressed forms represented within each cluster. Each approach has its advantages and disadvantages. Stringent clustering provides greater initial fidelity, at the cost of lower coverage of expressed gene data and a lower inclusion rate of expressed gene forms. Loose clustering provides greater coverage and greater inclusion of alternative expressed forms, at the cost of lower fidelity and the possible inclusion of paralogous expressed genes.
29
What are the transcripts represented in this set of mapped ESTs?
The Problem ESTs Genome What are the transcripts represented in this set of mapped ESTs?
30
Predict Transcripts from ESTs
Transcript predictions
Merge ESTs according to splicing structure compatibility
In this situation we define the problem of finding alternative splicing information as follows.
Problem: find the minimal set of transcripts which is compatible with the splicing of a given EST cluster, such that the transcripts are not redundant with each other, that is, their splicing is non-equivalent.
We must consider the global relation between the splices in a given EST, to avoid a combinatorial explosion of the different splices. Thus we consider each EST as a set of splices, such that every two ESTs must be either compatible or incompatible with regard to the splicing structure. We consider ESTs as whole structures, and only combine the splices from two ESTs if ALL their overlapping splice sites are equivalent.
31
Redundant ESTs
Consider 2 ESTs, x and z, in a genomic cluster with more ESTs. Every 2 overlapping ESTs in the cluster may or may not be splicing-compatible.
z gives redundant splicing information, so we could keep only x (x + z).
However, the relation with other ESTs in the cluster is important: a third EST, w, is compatible with z but not with x (z + w). --> keep all relations
32
Extension of the exon structure
Consider 2 ESTs, x and y, in a genomic cluster with more ESTs. Every 2 overlapping ESTs in the cluster may or may not be splicing-compatible.
y extends x, so we can assume that they are from the same mRNA (x + y).
Our success will depend on the coverage of the exons. However, ESTs are 3’ and 5’ biased (ESTs like z, which cover the middle of the transcript, are not so frequent), hence we will have fragmentation.
33
E Eyras et al. Genome Research (2004)
Representation
For every 2 ESTs in a genomic cluster, we decide whether they represent equivalent splicing structures. The compatibility relation is a graph.
(Diagram: extension, where y extends x; inclusion, where z is included in x.)
Consider a set of ESTs that we have mapped onto the genome. We can cluster those mapped ESTs according to their position in the chromosome, which gives rise to a number of potential gene loci defined by these EST clusters. Every 2 overlapping ESTs in the cluster may or may not be splicing-compatible. If they are compatible, we can always represent that relation as one of two possibilities: extension (one EST extends the 3’ end of the other) or inclusion (one EST is totally included in the other). We represent these relations as a graph where the nodes are the ESTs and there are two types of edges: single arrows for extension and double arrows for inclusion. Furthermore, we sort the ESTs in the cluster by two keys: by the 5’ end coordinate in descending order and, if the 5’ coordinate is the same, by the 3’ coordinate in descending order.
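The first step mentioned above, grouping mapped ESTs by chromosomal position, can be sketched as a single sweep over sorted intervals. This sketch assumes that all ESTs lie on the same chromosome and strand and that overlap of the genomic spans (introns included) is enough to join a cluster; both are simplifying assumptions, and the function name is hypothetical.

```python
from typing import List, Tuple

# (est_id, genomic start, genomic end), coordinates inclusive.
MappedEst = Tuple[str, int, int]


def cluster_by_position(ests: List[MappedEst]) -> List[List[str]]:
    """Group ESTs whose genomic spans overlap, directly or through other ESTs."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = 0
    for est_id, start, end in sorted(ests, key=lambda e: (e[1], e[2])):
        if current and start <= current_end:
            # Overlaps the running cluster: join it and extend its right edge.
            current.append(est_id)
            current_end = max(current_end, end)
        else:
            # Gap found: close the previous cluster and open a new one.
            if current:
                clusters.append(current)
            current = [est_id]
            current_end = end
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    print(cluster_by_position([("a", 1, 100), ("b", 50, 200), ("c", 300, 400)]))
```

Each resulting cluster is a candidate gene locus; the pairwise compatibility comparison is then only needed within a cluster.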
34
Criteria of “merging”
Allow edge-exon mismatches
Allow internal mismatches
Allow intron mismatches (is this intron real?)
The comparison between two transcripts has one of four possible results:
1.- Inclusion
2.- Extension
3.- Clash (overlap but structure non-compatible)
4.- No overlap (none of the exons overlap)
At the level of the comparison we have to establish the criteria for defining two ESTs as mergeable (or redundant). The algorithm is implemented so that we can choose between different types of merging criteria, according to the type of data we are dealing with:
We can merge in a strict way, which is not very realistic, so we never use it.
We can allow mismatches of exons or parts of exons at the edges of the transcripts. This could be used when we have good cDNA or EST data.
We can allow internal mismatches. This can be used with data we know may contain a lot of noise, or when we have annotated the transcripts using two different alignment methods that may have produced different splice sites (after all, we’re doing this automatically).
Finally, we could allow for intron mismatches if we cover an intron which is too small to be real, maybe due to an incomplete alignment or to a disagreement between the cDNA and the genomic sequence. This is the typical case used for human ESTs.
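The four possible outcomes of a comparison can be illustrated with a small sketch. It is not the published implementation: it assumes each EST is given as a sorted list of exon coordinates on the same strand, and it uses one simplified rule, namely that two ESTs are compatible when every intron of one that intersects the genomic span of the other is also an intron of the other. None of the configurable tolerances described above is modelled.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive genomic coordinates


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Introns as (last base of upstream exon, first base of downstream exon)."""
    return [(a[1], b[0]) for a, b in zip(exons, exons[1:])]


def span(exons: List[Exon]) -> Tuple[int, int]:
    return exons[0][0], exons[-1][1]


def exons_overlap(x: List[Exon], y: List[Exon]) -> bool:
    return any(xs <= ye and ys <= xe for xs, xe in x for ys, ye in y)


def introns_consistent(x: List[Exon], y: List[Exon]) -> bool:
    """True if every intron of x that intersects y's span is also an intron of y."""
    y_start, y_end = span(y)
    y_introns = set(introns(y))
    for donor, acceptor in introns(x):
        # Intronic bases run from donor + 1 to acceptor - 1.
        if donor + 1 <= y_end and y_start <= acceptor - 1:
            if (donor, acceptor) not in y_introns:
                return False
    return True


def compare(x: List[Exon], y: List[Exon]) -> str:
    """Return 'inclusion', 'extension', 'clash' or 'no overlap'."""
    if not exons_overlap(x, y):
        return "no overlap"
    if not (introns_consistent(x, y) and introns_consistent(y, x)):
        return "clash"
    (xs, xe), (ys, ye) = span(x), span(y)
    if (xs <= ys and ye <= xe) or (ys <= xs and xe <= ye):
        return "inclusion"
    return "extension"


if __name__ == "__main__":
    x = [(1, 100), (200, 300)]
    print(compare(x, [(250, 300), (400, 500)]))  # y continues past x
    print(compare(x, [(50, 100), (200, 250)]))   # z lies within x
    print(compare(x, [(50, 100), (220, 300)]))   # different acceptor site
```

Because only introns are compared, differences at the outer ends of the first and last exons never cause a clash here, which loosely mirrors allowing mismatches at external sites. An EST that begins or ends inside an intron of the other is still reported as a clash by this simplified rule.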
35
Transitivity
(Diagram: extension and inclusion relations among the ESTs x, y, z and w, with the redundant edges removed.)
The ordering naturally induces transitivity in the representation: extension and inclusion are both transitive, so we do not need to show redundant relations. This ordering and, in turn, the transitivity also minimize the number of comparisons that we have to make against the ESTs already in the graph when adding a new EST.
36
ClusterMerge graph
Each node defines an inclusion sub-tree
Extensions form acyclic graphs
(Diagram: examples with the ESTs x, y, z and w.)
More complicated situations arise from the interaction of inclusions and extensions. We choose to put the inclusions as high as possible in the extension tree. Considering only the inclusions, each node defines an (inclusion) tree. In fact, for every given node, we can define a sub-tree given by all the nodes ‘included’ in this node, which is the root of the inclusion tree. The extensions, however, do not necessarily form a tree. On the other hand, the directed graph they form is always acyclic. This property is exploited in the algorithm. A generic graph of this type can then be seen as an intertwined forest of inclusion trees and acyclic directed graphs of extensions. We call this structure a ClusterMerge graph.
E Eyras et al. Genome Research (2004)
37
Mergeable sets Example 1 2 3 4 5 6 7
Consider a set of ESTs that we have mapped onto the genome. We can cluster those mapped ESTs according to their position in the chromosome, which gives rise to a number of potential gene loci defined by these EST clusters. Consider as an example a set of 7 ESTs, already put in the order specified.
38
Mergeable sets Example
(Diagram: the ESTs 1 to 7 and the graph of their relations.)
39
Mergeable sets Example
(Diagram: the ESTs 1 to 7 and the graph of their relations, with the root and the leaves marked.)
40
Mergeable sets Example
(Diagram: the ESTs 1 to 7 and the graph of their relations, with the root and the leaves marked.)
Lists produced: (1,2,3,5,6,7) and (1,2,3,4,5,7)
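One way to picture how lists come out of an acyclic graph is to enumerate every path from a root to a leaf. The sketch below does only that: it follows extension edges in a small made-up graph, ignores inclusion sub-trees, and does not reproduce the graph of the slide or claim to be the published procedure. The function name and the example graph are hypothetical.

```python
from typing import Dict, List


def root_to_leaf_lists(extends: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate all root-to-leaf paths of a directed acyclic graph.

    `extends` maps each node to the nodes reached by an extension edge.
    The graph is assumed to be acyclic, as stated for extension graphs.
    """
    children = {c for targets in extends.values() for c in targets}
    nodes = set(extends) | children
    roots = sorted(nodes - children)  # nodes that nothing points to
    lists: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        targets = extends.get(node, [])
        if not targets:          # leaf: the path is complete
            lists.append(path)
            return
        for target in targets:
            walk(target, path + [target])

    for root in roots:
        walk(root, [root])
    return lists


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
    print(root_to_leaf_lists(graph))
```

In this toy graph the branch at `a` yields two lists that share their first and last members, in the same way that the two lists on the slide share most of their ESTs.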
41
Deriving the transcripts from the lists
Internal splice sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute
Once we have the lists we must produce the putative transcripts from those lists. To merge the linked ESTs into a transcript we cluster the exons. Each exon cluster contributes to a given exon in the final transcript. If this exon is going to be internal, we do not allow external exons in the ESTs to contribute. If the external coordinate of one EST extends beyond the most common internal coordinate, this is a potential alternative UTR termination. On the other hand, it is very hard to conclude whether these are real or not when working with ESTs.
42
Deriving the transcripts from the lists
Splice sites: set to the most common coordinate
5’ and 3’ coordinates: set to the exon coordinate that extends the potential UTR the most
We can parameterize how much mismatch in the 3’ and 5’ splice sites we allow when comparing ESTs, to try to distinguish true alternative 5’ and 3’ sites from sequencing errors in the ESTs. It is very difficult to determine what this threshold should be. In the example, the human EST giving evidence for an alternative 3’ splice site is not treated as a separate variant because we have set a higher threshold, so this EST is merged. For the internal splice sites we take the most common coordinates in the exon cluster. For the external splices (potential 5’ or 3’ ends of the transcript) we take the coordinate that extends the final exon the most, to maximise the chance of covering UTRs.
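The two rules for setting exon boundaries can be sketched for a single exon cluster. The sketch assumes forward-strand coordinates (a smaller coordinate is closer to the 5’ end) and that each contributing exon records whether each of its boundaries is an EST end rather than a splice site; the tuple layout and function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# (start, end, start_is_external, end_is_external) for one exon of one EST.
ExonEvidence = Tuple[int, int, bool, bool]


def most_common_coordinate(coords: List[int]) -> int:
    """Most frequent coordinate; ties are broken by taking the smallest."""
    if not coords:
        raise ValueError("no splice site supports this boundary")
    counts = Counter(coords)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


def merge_exon_cluster(cluster: List[ExonEvidence],
                       first_exon: bool,
                       last_exon: bool) -> Tuple[int, int]:
    """Derive one transcript exon from a cluster of overlapping EST exons."""
    if first_exon:
        # Transcript start: take the boundary that extends the exon the most.
        start = min(s for s, _, _, _ in cluster)
    else:
        # Internal splice site: EST ends are not allowed to contribute.
        start = most_common_coordinate([s for s, _, s_ext, _ in cluster if not s_ext])
    if last_exon:
        end = max(e for _, e, _, _ in cluster)
    else:
        end = most_common_coordinate([e for _, e, _, e_ext in cluster if not e_ext])
    return start, end


if __name__ == "__main__":
    cluster = [(200, 300, False, False), (200, 300, False, False),
               (205, 300, False, False), (150, 300, True, False)]
    print(merge_exon_cluster(cluster, first_exon=False, last_exon=False))
    print(merge_exon_cluster(cluster, first_exon=True, last_exon=False))
```

In the example the EST end at position 150 is ignored when the exon is internal but defines the boundary when the exon is the first of the transcript.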
43
Single exon transcripts
Reject resulting single exon transcripts when using ESTs
From the resulting set of putative transcripts we reject the unspliced ones. They could be produced from spurious hits, perhaps ESTs containing genomic sequence. They could also be related to pseudogenes. Another possibility is that the EST cluster the transcript was derived from represents a UTR region of a gene which did not have any overlap with a spliced EST (see the figure for a possible case like this).
44
Alternative splicing and comparative genomics
45
Conservation of Alternative Splicing
Degree of conservation: 30-60% Methods: 1.- compare single events 2.- Cross-alignment of full transcripts
46
Exon Skipping Events
Introns flanking alternatively spliced (skipped) exons have high sequence conservation, higher on average than introns flanking constitutive exons.
R Sorek & G Ast. Genome Research 13: , 2003
47
Overrepresented hexamer (downstream)
Sequences regulating the (alternative) splicing
(Diagram: conserved alternative exon with its flanking introns and an overrepresented hexamer downstream.)
Overrepresented sequences in conserved introns (between human and mouse) may be involved in the regulation of alternative splicing. Overrepresented: found in these introns more often than expected at random AND not found in intronic sequences flanking constitutive exons (and upstream of skipped ones).
R Sorek & G Ast. Genome Research (2003) 13:
48
Overrepresented hexamer
Sequences regulating the (alternative) splicing
(Diagram: conserved alternative exon with its flanking introns and an overrepresented hexamer.)
Not all types of events are equally conserved. Introns flanking alternative 5’ and 3’ exons, and retained introns, have higher sequence conservation.
Sugnet CW, Kent WJ, Ares M Jr, Haussler D. Pac Symp Biocomput. 2004;:66-77
49
A Resch et al. Nucleic Acids Research 2004, 32 (4) 1261-1269
Frame preservation (percentage of frame-preserving exons)
                   Constitutive exons              Alternative exons
All exons          39.7% (Human), 39.5% (Mouse)    41.6% (Human), 44.7% (Mouse)
Conserved exons    40.9% (Human), 38% (Mouse)      51.8% (Human), 51.9% (Mouse)
50
Predicting alternative exons
51
R Sorek et al. Genome Research (2004) 14:1617-1623
Features differentiating between alternatively spliced and constitutively spliced exons
Feature                                                  Alternative exons    Constitutive exons
Average size                                             87                   128
Length = multiple of 3                                   73%                  37%
Average human-mouse exon conservation                    94%                  89%
(A) Exons with upstream intron conserved in mouse        92%                  45%
(B) Exons with downstream intron conserved in mouse      82%                  35%
(A) + (B)                                                77%                  17%
(A), (B): an intron is considered conserved if there are at least 12 consecutive matches over 100 bp of the intron
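The conservation criterion in the footnote of the table (at least 12 consecutive matches over 100 bp of intron) can be sketched as a scan for the longest run of matching columns. The sketch assumes that the human and mouse intronic flanks are already aligned into two equal-length strings with `-` for gaps, and that the caller passes the 100 bp nearest to the exon first; how the original study built and oriented these alignments is not stated on the slide, and the function names are hypothetical.

```python
def longest_match_run(seq_a: str, seq_b: str) -> int:
    """Length of the longest run of identical nucleotide columns in an alignment."""
    longest = run = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b and a in "ACGT":
            run += 1
            longest = max(longest, run)
        else:
            # A mismatch or a gap column breaks the run.
            run = 0
    return longest


def intron_is_conserved(human_flank: str, mouse_flank: str,
                        min_run: int = 12, window: int = 100) -> bool:
    """Apply the 12-consecutive-matches criterion to the first `window` columns."""
    return longest_match_run(human_flank[:window], mouse_flank[:window]) >= min_run


if __name__ == "__main__":
    human = "ACGTACGTACGTAA"
    mouse = "ACGTACGTACGTCC"
    print(longest_match_run(human, mouse), intron_is_conserved(human, mouse))
```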
52
Build a classifier to make predictions
Rule: a set of conditions over the parameters, e.g. “at least 99% conservation with mouse AND divisible by 3, etc…”
Try all the possible combinations of parameters
Select the rule that correctly identifies a maximum number of true alternative exons while minimizing the number of false positives
This rule achieved 31% sensitivity and no false positives in a set of known exons:
At least 95% identity with the mouse orthologous exon
Exon size is a multiple of 3
An upstream intronic alignment of at least 15 bp with at least 85% identity
A downstream intronic exact alignment of at least 12 bp
R Sorek et al. Genome Research (2004) 14:
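The selected rule is a conjunction of four conditions, so it can be written as a single predicate. The sketch assumes the four measurements have already been computed for each exon; the `ExonFeatures` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExonFeatures:
    """Hypothetical pre-computed features of one human exon."""
    length: int                         # exon size in bp
    mouse_identity: float               # % identity with the orthologous mouse exon
    upstream_aln_length: int            # bp of aligned upstream intronic sequence
    upstream_aln_identity: float        # % identity of that upstream alignment
    downstream_exact_match_length: int  # bp of exact downstream intronic match


def predicted_alternative(f: ExonFeatures) -> bool:
    """Apply the four conditions of the rule quoted on the slide."""
    return (f.mouse_identity >= 95.0
            and f.length % 3 == 0
            and f.upstream_aln_length >= 15
            and f.upstream_aln_identity >= 85.0
            and f.downstream_exact_match_length >= 12)


if __name__ == "__main__":
    candidate = ExonFeatures(length=87, mouse_identity=96.0,
                             upstream_aln_length=20, upstream_aln_identity=90.0,
                             downstream_exact_match_length=12)
    print(predicted_alternative(candidate))
```

Since every condition must hold, the rule favours specificity over sensitivity, which fits the reported 31% sensitivity with no false positives.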
53
Summary
Alternative splicing is a mechanism to generate functional diversity
We can study alternative splicing using ESTs (Expressed Sequence Tags)
EST data is fragmented and full of noise: it needs to be processed
Some alternative splicing is conserved across species (Human-Mouse)
Prediction of alternative (conserved) exons is possible (with a classifier) but not ab initio
Evolution of alternative splicing?
54
THE END