1
Alternative Splicing from ESTs Eduardo Eyras Bioinformatics UPF – February 2004
2
Intro
ESTs
Prediction of Alternative Splicing from ESTs
3
From gene to peptide [figure]: transcription produces the pre-mRNA (5’ to 3’, exons and introns); splicing gives the mature mRNA (5’ cap, poly-A tail); translation gives the peptide.
4
Different splicing [figure]: the same pre-mRNA (exons and introns), spliced differently, gives a different mature mRNA (5’ cap, poly-A tail), which is translated into a different peptide.
5
Alternative splicing as a mechanism of gene regulation
Functional domains can be added or subtracted, giving protein diversity
It can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
It can modify the activity of transcription factors, affecting the expression of genes
It is observed in nearly all metazoans
Estimated to occur in 30%-40% of human genes
6
Forms of alternative splicing
Exon skipping / inclusion
Alternative 3’ splice site
Alternative 5’ splice site
Mutually exclusive exons
Intron retention
Figure legend: constitutive exon; alternatively spliced exons
7
How to study alternative splicing?
8
ESTs (Expressed Sequence Tags)
Single-pass sequencing of a small (end) piece of cDNA
Typically 200-500 nucleotides long
May contain coding and/or non-coding regions
9
ESTs: from cells to double-stranded cDNA [figure]
Cells from a specific organ, tissue or developmental stage
Steps shown: mRNA extraction; add oligo-dT primer; reverse transcriptase; ribonuclease H; DNA polymerase; ribonuclease H
Labels: RNA, DNA, double-stranded cDNA
10
ESTs [figure]
Clone cDNA into a vector
Multiple cDNA clones
Single-pass sequence reads: 5’ EST and 3’ EST
11
Alternative Splicing from ESTs [figure]: genomic sequence, primary transcript, splicing, splice variants, cDNA clones, EST sequences (5’ and 3’)
12
ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)
13
EST sequencing
Is fast and cheap
Gives direct information about the gene sequence
Gives only partial information
Resulting ESTs: known gene (DB searches), similar to known gene, contaminant, novel gene
14
ESTs provide expression data
eVOC Ontologies http://www.sanbi.ac.za/evoc/
Anatomical System: the tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
Cell Type: the precise cell type from which a sample was prepared. Examples are B-lymphocyte, fibroblast and oocyte.
Pathology: the pathological state of the sample from which the sample was prepared. Examples are normal, lymphoma and congenital.
Developmental Stage: the stage during the organism's development at which the sample was prepared. Examples are embryo, fetus and adult.
Pooling: indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.
15
Linking the expression vocabulary to gene annotations ESTs Genes
16
Normalized vs. non-normalized libraries
17
The downside of ESTs
Cannot detect lowly or rarely expressed genes, or non-expressed sequences (regulatory)
Random sampling: the more ESTs we sequence, the fewer new useful sequences we get
18
Gene Hunting Sequencing of the Human Genome (HGP) EST Sequencing
19
Origin of the ESTs
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991 Jun 21;252(5013):1651-6.
Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD.
Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in the discovery of new human genes, mapping of the human genome, and identification of coding regions in genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin, Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological research fields.
20
EST-sequencing explosion
Merck and WashU (1994): public ESTs
GenBank dbEST: non-exclusivity (1992)
21
dbEST release 20 February 2004
Number of public entries: 20,039,613
Summary by organism
Homo sapiens (human)                    5,472,005
Mus musculus + domesticus (mouse)       4,056,481
Rattus sp. (rat)                          583,841
Triticum aestivum (wheat)                 549,926
Ciona intestinalis                        492,511
Gallus gallus (chicken)                   460,385
Danio rerio (zebrafish)                   450,652
Zea mays (maize)                          391,417
Xenopus laevis (African clawed frog)      359,901
…
22
EST lengths
Human EST length distribution (dbEST Sep. 2003) [figure]: ~ 450 bp
23
Recover the mRNA from the ESTs
24
What is an EST cluster?
A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity.
Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene.
(Burke, Davison, Hide, Genome Research 1999)
25
EST pre-processing Vector Repeats Mitochondrial Xenocontaminants
26
EST Clustering
UniGene (NCBI): www.ncbi.nlm.nih.gov/UniGene
TIGR Human Gene Index (The Institute for Genomic Research): www.tigr.org
StackDB (South African Bioinformatics Institute): www.sanbi.ac.za
27
UniGene
Species               UniGene Entries
Homo sapiens          118,517
Mus musculus           82,482
Rattus norvegicus      43,942
Sus scrofa             20,426
Gallus gallus          11,970
Xenopus laevis         21,734
Xenopus tropicalis     17,102
…
28
ESTs and the Genome
29
ESTs aligned to the genome
Some advantages:
It defines the location of exons and introns
We can verify the splice sites of introns (e.g. GT-AG), and hence also check the correct strand of spliced ESTs
It helps prevent chimeras
It can avoid putting together ESTs from paralogous genes
We can avoid including pseudogenes in our analysis
30
Aligning ESTs to the Genome
Many ESTs: fast programs, fast computers
Nearly exact matches: Coverage >= 97%, Percent_id >= 97%
Splice sites: GT-AG, AT-AC, GC-AG
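The thresholds and splice-site pairs on this slide can be sketched as a simple filter. This is a minimal illustration and not the pipeline used for the talk: the record layout (a dict with `coverage`, `percent_id` and `introns`) and the function names are assumptions. Intron boundaries are taken to be the first and last two bases of each intron as read on the forward genomic strand, so an intron on the reverse strand appears as the reverse complement (for example CT-AC for GT-AG).

```python
# Donor/acceptor dinucleotide pairs listed on the slide.
ALLOWED_SPLICE_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def infer_strand(first2, last2):
    """Return '+', '-' or None from the intron's boundary dinucleotides,
    both read on the forward genomic strand."""
    first2, last2 = first2.upper(), last2.upper()
    if (first2, last2) in ALLOWED_SPLICE_SITES:
        return "+"
    if (revcomp(last2), revcomp(first2)) in ALLOWED_SPLICE_SITES:
        return "-"
    return None

def passes_filters(aln, min_coverage=97.0, min_percent_id=97.0):
    """aln is a hypothetical record: coverage and percent_id in percent,
    introns as a list of (first2, last2) pairs."""
    if aln["coverage"] < min_coverage or aln["percent_id"] < min_percent_id:
        return False
    strands = {infer_strand(a, b) for a, b in aln["introns"]}
    # Every intron must look like an allowed site, all on the same strand.
    # An unspliced alignment (no introns) passes this check trivially.
    return None not in strands and len(strands) <= 1
```

An EST whose introns imply both strands is rejected here; whether the original analysis did the same is not stated on the slide.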
31
Aligning ESTs to the Genome
Extra pre-processing of ESTs:
Clip poly A tails / clip 20bp from either end
Best in genome
Remove potential processed pseudogenes
Give preference to ESTs that are spliced
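The clipping step could look roughly like the sketch below. The slide lists poly-A clipping and the 20 bp end clip together; applying one after the other, the minimum tail length of 10, and also trimming a leading poly-T run (as seen on reads of the opposite strand) are assumptions made for this example. The function name `clip_est` is hypothetical.

```python
import re

def clip_est(seq, end_clip=20, min_tail=10):
    """Trim a trailing poly-A (or leading poly-T) run, then clip a fixed
    number of bases from both ends. Returns '' if nothing would remain."""
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)   # poly-A tail
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)   # poly-T at the start
    if len(seq) <= 2 * end_clip:
        return ""
    return seq[end_clip:len(seq) - end_clip]
```

Clipping the ends removes the bases that are usually of lowest quality in a single-pass read, at the cost of a slightly shorter alignment.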
32
Human ESTGenes
Genomic length distribution of aligned human ESTs [figure]; annotations on the plot: ~ 400bp; tail up to ~ 800kb
33
The Problem
What are the transcripts represented in this set of mapped ESTs?
[Figure: ESTs aligned to the genome]
34
Transcript predictions
Predict transcripts from ESTs
Merge ESTs according to splicing structure compatibility
35
Representation
Every 2 ESTs in a Genomic Cluster may represent the same splicing (redundant) or not
The redundancy relation is a graph, with two kinds of relation: extension and inclusion [figure: ESTs x, y, z]
Sort by the smallest coordinate ascending and by the largest coordinate descending
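A minimal sketch of the sort order and of the two relations follows. The slide does not define the relations precisely, so the definitions used here are assumptions: each EST is a list of (start, end) exons on the genome, two ESTs are compatible when they have the same introns inside the region where they overlap, the later EST is an inclusion if it ends within the earlier one, and an extension if it reaches further. All names are hypothetical.

```python
def span(est):
    """Genomic start and end of an EST given as a sorted list of exons."""
    return est[0][0], est[-1][1]

def introns(est):
    """Introns as (end of exon, start of next exon) pairs."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(est, est[1:])]

def sort_ests(ests):
    """Smallest coordinate ascending, largest coordinate descending."""
    return sorted(ests, key=lambda e: (span(e)[0], -span(e)[1]))

def relation(x, y):
    """Relation of y to x, where x precedes y in the sorted order.
    Returns 'inclusion', 'extension', or None if they cannot be merged."""
    (xs, xe), (ys, ye) = span(x), span(y)
    lo, hi = max(xs, ys), min(xe, ye)
    if lo > hi:
        return None  # no genomic overlap

    def in_overlap(est):
        # Introns that touch the overlapping region.
        return {i for i in introns(est) if i[0] < hi and i[1] > lo}

    if in_overlap(x) != in_overlap(y):
        return None  # different splicing in the shared region
    return "inclusion" if ye <= xe else "extension"
```

With this ordering an EST is only ever compared against ESTs that start at or before it, so each pair needs one comparison rather than two.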
36
Criteria of merging
Allow internal mismatches
Allow intron mismatches
Allow edge-exon mismatches
37
Transitivity
Extension and inclusion [figure: ESTs x, y, z, w]
This reduces the number of comparisons needed
38
ClusterMerge graph [figure: nodes x, y, z, w]
Each node defines an inclusion sub-tree
Extensions form acyclic graphs
39
Recovering the Solution [figure: graph with nodes 1 to 9]
Mergeable sets of ESTs can be recovered as special paths in the graph
40
Recovering the Solution
Root: does not extend any node
Leaf: not-extended and root of an inclusion tree
41
Recovering the Solution
Any set of ESTs in a path from a root to a leaf is mergeable
42
Recovering the Solution
Add the inclusion tree attached to each node in the path
43
Recovering the Solution
Lists produced: (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,9)
This representation minimizes the necessary comparisons between ESTs
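The recovery of mergeable lists can be sketched as a walk from each root to each leaf, collecting the inclusion tree of every node on the way. The edges of the graph in the figure are not recoverable from the transcript, so the graph in the usage note is a hypothetical one chosen only because it yields the two lists shown on the slide. Treating a root as a node that neither extends nor is included in another node is also an assumption.

```python
def inclusion_subtree(node, includes):
    """All nodes below `node` in its inclusion tree."""
    found, stack = [], list(includes.get(node, []))
    while stack:
        n = stack.pop()
        found.append(n)
        stack.extend(includes.get(n, []))
    return found

def mergeable_lists(nodes, extended_by, includes):
    """extended_by[n] lists the nodes that extend n;
    includes[n] lists the nodes directly included in n."""
    extending = {c for cs in extended_by.values() for c in cs}
    included = {c for cs in includes.values() for c in cs}
    roots = [n for n in nodes if n not in extending and n not in included]
    lists = []

    def walk(node, path):
        path = path + [node]
        successors = extended_by.get(node, [])
        if not successors:  # leaf: nothing extends this node
            members = set(path)
            for p in path:
                members.update(inclusion_subtree(p, includes))
            lists.append(tuple(sorted(members)))
            return
        for nxt in successors:
            walk(nxt, path)

    for root in roots:
        walk(root, [])
    return lists

# Hypothetical graph: 2 extends 1; 8 and 9 each extend 2;
# 3 and 4 sit in the inclusion tree of 1; 5, 6 and 7 in that of 2.
example = mergeable_lists(
    nodes=range(1, 10),
    extended_by={1: [2], 2: [8, 9]},
    includes={1: [3], 3: [4], 2: [5, 6], 6: [7]},
)
```

Here `example` holds one tuple per root-to-leaf path, which for this hypothetical graph gives the two lists from the slide.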
44
How to build the graph
Mutual recursion
Search graph (leaves)
Recursion search along extension branch
Search sub-graph
Inclusion => go up in the tree
45
How to build the graph: example [figure: ESTs 1 to 6]
46
How to build the graph: example [figure: ESTs 1 to 6 and their graph]
47
How to build the graph: example [figure: graph of ESTs 1 to 6, new EST 7; step: Leaves]
48
How to build the graph: example [step: Inclusion]
49
How to build the graph: example [step: Inclusion]
50
How to build the graph: example [step: Extension]
51
How to build the graph: example [step: Inclusion]
52
How to build the graph: example [step: Place 7]
53
How to build the graph: example [step: Inclusion]
54
How to build the graph: example [step: tagged as visited, skip]
55
How to build the graph: example
Possible sub-trees beyond 1 or 3 remain unseen!
The representation minimizes the necessary comparisons
56
Deriving the transcripts from the lists
Internal splice sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute
57
Deriving the transcripts from the lists
Splice sites: are set to the most common coordinate
5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most
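These two rules can be sketched for a single exon of the merged transcript. The input layout is an assumption made for illustration: each observation is one EST exon that has already been assigned to this transcript exon, with flags saying whether its start and end are real splice sites or just the ends of the EST. The function name `consensus_exon` is hypothetical.

```python
from collections import Counter

def consensus_exon(obs, first=False, last=False):
    """obs: list of (start, end, start_is_splice_site, end_is_splice_site).
    first/last: whether this is the terminal 5' or 3' exon of the transcript.
    Assumes at least one observation carries a splice site on each internal
    side; otherwise most_common() has nothing to return."""
    # EST ends that are not splice sites do not vote on internal boundaries.
    starts = [s for s, _, s_ok, _ in obs if s_ok]
    ends = [e for _, e, _, e_ok in obs if e_ok]
    if first:
        start = min(o[0] for o in obs)   # extend the potential UTR the most
    else:
        start = Counter(starts).most_common(1)[0][0]
    if last:
        end = max(o[1] for o in obs)
    else:
        end = Counter(ends).most_common(1)[0][0]
    return start, end
```

For an internal exon observed as (300, 400), (300, 401), (300, 400) plus an EST that merely starts at 350 inside it, the sketch returns (300, 400): the EST start at 350 is ignored and the majority end wins.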
58
Single exon transcripts
Reject resulting single exon transcripts when using ESTs
59
Annotation with ESTs
ESTs aligned to the genome can provide information about UTRs and alternative splicing
60
Annotation with ESTs EST-Transcripts at www.ensembl.org
61
Annotation with ESTs
62
Results for Human and Mouse
Human EST-genes (assembly ncbi33): 38,581 genes, 122,247 transcripts (42% with full CDS)
Mouse EST-genes (assembly ncbi30): 32,848 genes, 103,664 transcripts (36% with full CDS)
63
How many transcripts are conserved? Is Alternative Splicing conserved?
64
EST-transcript pairs
42,625 transcript pairs (in 18,242 gene pairs)
Of the gene pairs:
78% have one transcript pair conserved
22% have more than one transcript pair conserved
For 22% of the gene pairs some form of alt. splicing is conserved
65
Conservation of Alt. Splicing
Take gene-pairs with more than one transcript-pair
19% of alt. variants in human are conserved in mouse
32% of alt. variants in mouse are conserved in human
\[ \%\,\text{conservation} = \frac{\sum (\text{number of paired transcripts} - 1)}{\sum (\text{number of transcripts} - 1)} \]
∑ = sum over genes in a gene pair with more than one variant (subtract the ‘main’ transcript form)
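As a worked illustration of the formula, the sketch below computes the ratio for one species. The input layout and the numbers in the usage note are made up for the example; the input is assumed to be already restricted to the genes the sum runs over, and the result is multiplied by 100 to express it as a percentage.

```python
def percent_conservation(genes):
    """genes: list of (number_of_transcripts, number_of_paired_transcripts),
    one entry per gene of one species. Subtracting 1 from each count
    removes the 'main' transcript form, leaving only alternative variants."""
    paired = sum(n_paired - 1 for _, n_paired in genes)
    total = sum(n_transcripts - 1 for n_transcripts, _ in genes)
    if total == 0:
        raise ValueError("no alternative variants in the input")
    return 100.0 * paired / total

# Made-up counts: three genes with 4, 3 and 2 transcripts,
# of which 2, 1 and 2 have a partner in the other species.
value = percent_conservation([(4, 2), (3, 1), (2, 2)])
```

With these made-up counts the numerator is 1 + 0 + 1 = 2 and the denominator is 3 + 2 + 1 = 6, so about a third of the alternative variants count as conserved.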
66
How many predicted ‘novel’ genes are validated by Human-Mouse comparison?
67
Novel genes [figure: overlap of three sets of human ESTGenes]
ESTGenes not in Ensembl: 13,174
Human ESTGenes validated by comparison to mouse: 18,242
ESTGenes with at least one complete ORF: 24,201
68
Novel genes
984 ESTGenes not in Ensembl, validated by comparison to mouse, and with a complete ORF
69
THE END