Alternative Splicing from ESTs Eduardo Eyras Bioinformatics UPF – February 2004.


1 Alternative Splicing from ESTs Eduardo Eyras Bioinformatics UPF – February 2004

2 Intro; ESTs; Prediction of Alternative Splicing from ESTs

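3 [Figure: transcription produces the pre-mRNA (exons and introns, 5’ to 3’); splicing produces the mature mRNA (5’ CAP, poly-A tail); translation produces the peptide.]

4 [Figure: the same pre-mRNA (exons and introns, 5’ to 3’) undergoes a different splicing, producing a different mature mRNA (5’ CAP, poly-A tail) and, after translation, a different peptide.]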

5 Alt splicing as a mechanism of gene regulation. Functional domains can be added/subtracted → protein diversity. It can introduce early stop codons, resulting in truncated proteins or unstable mRNAs. It can modify the activity of transcription factors, affecting the expression of genes. It is observed in nearly all metazoans. Estimated to occur in 30%-40% of human genes.

6 Forms of alternative splicing: exon skipping / inclusion; alternative 3’ splice site; alternative 5’ splice site; mutually exclusive exons; intron retention. [Figure legend: constitutive exon; alternatively spliced exons]

7 How to study alternative splicing?

8 ESTs (Expressed Sequence Tags): single-pass sequencing of a small (end) piece of cDNA. Typically 200-500 nucleotides long. An EST may contain coding and/or non-coding regions.

9 ESTs. Cells from a specific organ, tissue or developmental stage. [Figure: cDNA synthesis from poly-A mRNA. Labels: mRNA extraction; add oligo-dT primer; reverse transcriptase; Ribonuclease H; DNA polymerase; double-stranded cDNA]

10 ESTs. Clone cDNA into a vector to obtain multiple cDNA clones. Single-pass sequence reads from the two ends of a clone give the 5’ EST and the 3’ EST.

11 Alternative Splicing from ESTs. [Figure labels: genomic; primary transcript; splicing; splice variants (5’ to 3’); cDNA clones; EST sequences]

12 ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)

13 EST sequencing: is fast and cheap; gives direct information about the gene sequence; gives partial information. Resulting ESTs (DB searches): known gene; similar to known gene; contaminant; novel gene.

14 ESTs provide expression data. eVOC Ontologies http://www.sanbi.ac.za/evoc/
Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
Cell Type: The precise cell type from which a sample was prepared. Examples are: B-lymphocyte, fibroblast and oocyte.
Pathology: The pathological state of the sample from which the sample was prepared. Examples are: normal, lymphoma, and congenital.
Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are: embryo, fetus, and adult.
Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.

15 Linking the expression vocabulary to gene annotations. [Figure labels: ESTs; Genes]

16 Normalized vs. non-normalized libraries

17 The downside of ESTs. They cannot detect lowly or rarely expressed genes, or non-expressed (regulatory) sequences. Random sampling: the more ESTs we sequence, the fewer new useful sequences we get.
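The diminishing return from random sampling can be illustrated with a small calculation. This is only a sketch under stated assumptions: each EST is drawn independently with probability proportional to transcript abundance, and the abundance values below are hypothetical, not taken from the slides.

```python
def expected_genes_seen(abundances, n_ests):
    """Expected number of distinct genes hit at least once by n_ests random ESTs.

    Assumption: every EST is an independent draw, and gene i is drawn with
    probability abundance_i / total abundance.
    """
    total = sum(abundances)
    return sum(1.0 - (1.0 - a / total) ** n_ests for a in abundances)


# Hypothetical library: a few highly expressed genes and many rare ones.
toy_abundances = [1000] * 10 + [10] * 100 + [1] * 1000

first_batch = expected_genes_seen(toy_abundances, 1000)
second_batch = expected_genes_seen(toy_abundances, 2000) - first_batch
# Under these assumptions the second 1000 ESTs reveal fewer new genes
# than the first 1000 did.
```

Under this model the rare genes dominate what is still unseen, which is also why lowly expressed genes are the hardest to detect.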

18 Gene Hunting: sequencing of the Human Genome (HGP); EST sequencing

19 Origin of the ESTs Science. 1991 Jun 21;252(5013):1651-6 Complementary DNA sequencing: expressed sequence tags and human genome project. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al. Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD. Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in the discovery of new human genes, mapping of the human genome, and identification of coding regions in genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin, Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological research fields.

20 EST-sequencing explosion: Merck and WashU (1994) → public ESTs → GenBank → dbEST → non-exclusivity (1992)

21 dbEST release 20 February 2004
Number of public entries: 20,039,613
Summary by organism
Homo sapiens (human)                   5,472,005
Mus musculus + domesticus (mouse)      4,056,481
Rattus sp. (rat)                         583,841
Triticum aestivum (wheat)                549,926
Ciona intestinalis                       492,511
Gallus gallus (chicken)                  460,385
Danio rerio (zebrafish)                  450,652
Zea mays (maize)                         391,417
Xenopus laevis (African clawed frog)     359,901
…

22 EST lengths. Human EST length distribution (dbEST Sep. 2003). [Figure: length distribution, annotated ~450 bp]

23 Recover the mRNA from the ESTs

24 What is an EST cluster? A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene. (Burke, Davison, Hide, Genome Research 1999).

25 EST pre-processing. Screen for: vector; repeats; mitochondrial sequences; xenocontaminants.

26 EST Clustering
UniGene (NCBI): www.ncbi.nlm.nih.gov/UniGene
TIGR Human Gene Index (The Institute for Genomic Research): www.tigr.org
StackDB (South African Bioinformatics Institute): www.sanbi.ac.za

27 UniGene
Species               UniGene Entries
Homo sapiens          118,517
Mus musculus           82,482
Rattus norvegicus      43,942
Sus scrofa             20,426
Gallus gallus          11,970
Xenopus laevis         21,734
Xenopus tropicalis     17,102
…

28 ESTs and the Genome

29 ESTs aligned to the genome. Some advantages: It defines the location of exons and introns. We can verify the splice sites of introns (e.g. GT-AG) → hence also check the correct strand of spliced ESTs. It helps prevent chimeras. It can avoid putting together ESTs from paralogous genes. We can avoid including pseudogenes in our analysis.
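As an illustration of the splice-site check, the sketch below infers the strand of a spliced alignment from the first and last two bases of an intron. It assumes the genome is available as a plain uppercase/lowercase string with 0-based, end-exclusive coordinates; the function names are hypothetical. The accepted pairs are GT-AG plus the two other pairs this deck lists for alignment (AT-AC, GC-AG).

```python
# Intron boundaries (first two bases, last two bases) read on the transcribed strand.
SPLICE_PAIRS = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def intron_strand(genome, start, end):
    """Return '+', '-' or None for the intron genome[start:end].

    '+' means the boundaries match on the forward strand, '-' means they
    match after reverse-complementing, None means no accepted pair is found.
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        return None
    if (intron[:2], intron[-2:]) in SPLICE_PAIRS:
        return "+"
    rc = reverse_complement(intron)
    if (rc[:2], rc[-2:]) in SPLICE_PAIRS:
        return "-"
    return None
```

An EST whose introns all report the same strand can be oriented accordingly; an EST with no recognizable boundaries is a candidate for closer inspection.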

30 Aligning ESTs to the Genome. Many ESTs → fast programs, fast computers. Nearly exact matches: Coverage >= 97%, Percent_id >= 97%. Splice sites: GT-AG, AT-AC, GC-AG
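The two thresholds can be applied as a simple filter. The sketch below is hypothetical: the `Alignment` record and the exact definitions of coverage (aligned EST bases over EST length) and percent identity (matching bases over aligned bases) are assumptions, since the slides do not define them.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    est_id: str
    est_length: int      # full length of the EST, in bases
    aligned_bases: int   # EST bases covered by the alignment
    matches: int         # aligned bases identical to the genome


def coverage(aln):
    """Percentage of the EST that is aligned (assumed definition)."""
    return 100.0 * aln.aligned_bases / aln.est_length if aln.est_length else 0.0


def percent_id(aln):
    """Percentage of aligned bases that match the genome (assumed definition)."""
    return 100.0 * aln.matches / aln.aligned_bases if aln.aligned_bases else 0.0


def is_near_exact(aln, min_coverage=97.0, min_percent_id=97.0):
    """Keep only alignments meeting both thresholds from the slide."""
    return coverage(aln) >= min_coverage and percent_id(aln) >= min_percent_id


alignments = [
    Alignment("est_a", 400, 396, 392),  # passes both thresholds
    Alignment("est_b", 400, 380, 380),  # coverage too low
    Alignment("est_c", 400, 400, 380),  # identity too low
]
kept = [a.est_id for a in alignments if is_near_exact(a)]
```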

31 Aligning ESTs to the Genome. Extra pre-processing of ESTs: clip poly A tails / clip 20bp from either end; best in genome; remove potential processed pseudogenes; give preference to ESTs that are spliced.

32 Human ESTGenes. Genomic length distribution of aligned human ESTs. [Figure: distribution annotated ~400 bp, with a tail up to ~800 kb]

33 The Problem: What are the transcripts represented in this set of mapped ESTs? [Figure: ESTs aligned to the genome]
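The clipping steps can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the minimum run length of 10 is a hypothetical parameter, and treating a leading poly-T run as the reverse-strand image of a poly-A tail is an assumption.

```python
import re


def clip_poly_a(seq, min_run=10):
    """Remove a trailing poly-A run, or a leading poly-T run, of at least min_run bases.

    Assumes an uppercase sequence; min_run is a hypothetical threshold.
    """
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq


def clip_ends(seq, n=20):
    """Drop n bases from each end; reads of 2*n bases or fewer become empty."""
    return seq[n:len(seq) - n] if len(seq) > 2 * n else ""
```

The slide presents the two clipping options as alternatives separated by a slash, so a pipeline would presumably apply one or the other before alignment.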

34 Transcript predictions: predict transcripts from ESTs. Merge ESTs according to splicing structure compatibility.

35 Representation. Every 2 ESTs in a Genomic Cluster may represent the same splicing (redundant) or not. The redundancy relation is a graph. Sort by the smallest coordinate ascending and by the largest coordinate descending. [Figure: the two redundancy relations, Extension and Inclusion, drawn for ESTs x, y and z, with the corresponding graph edges]

36 Criteria of merging: allow internal mismatches; allow intron mismatches; allow edge-exon mismatches.

37 Transitivity. This reduces the number of comparisons needed. [Figure: Extension and Inclusion relations among ESTs x, y, z and w, with the corresponding graph edges]

38 ClusterMerge graph. Each node defines an inclusion sub-tree. Extensions form acyclic graphs. [Figure: ESTs x, y, z and w and the corresponding graph]
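The sort order and the two redundancy relations can be sketched in code. This is a simplified illustration under stated assumptions: an EST is a list of `(start, end)` exon tuples in genomic coordinates, two ESTs are treated as compatible only when their introns agree exactly inside the overlapping region (no mismatch tolerance), and all names are hypothetical.

```python
def sort_key(est):
    """Smallest coordinate ascending, largest coordinate descending."""
    return (est[0][0], -est[-1][1])


def introns(est):
    """Introns as (end of upstream exon, start of downstream exon) pairs."""
    return [(est[i][1], est[i + 1][0]) for i in range(len(est) - 1)]


def relation(x, y):
    """Classify y against x, assuming x sorts before y.

    Returns 'inclusion' (y lies inside x), 'extension' (y continues past the
    end of x), or None (no overlap, or different splicing in the overlap).
    """
    lo = max(x[0][0], y[0][0])
    hi = min(x[-1][1], y[-1][1])
    if lo > hi:
        return None

    def in_overlap(est):
        # Introns that intersect the region shared by the two ESTs.
        return [(a, b) for a, b in introns(est) if a < hi and b > lo]

    if in_overlap(x) != in_overlap(y):
        return None
    return "inclusion" if y[-1][1] <= x[-1][1] else "extension"


# Hypothetical ESTs on one genomic cluster.
x = [(100, 200), (300, 400)]
y = [(150, 200), (300, 350)]   # same intron as x, lies inside x
z = [(350, 400), (500, 600)]   # overlaps the last exon of x and continues
w = [(150, 250)]               # runs into the intron of x
ordered = sorted([z, w, y, x], key=sort_key)
```

Sorting first means each EST only needs to be compared against ESTs that start at or before it, which is what makes a one-directional relation like this sufficient.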

39 Recovering the Solution. Mergeable sets of ESTs can be recovered as special paths in the graph. [Figure: graph with nodes 1-9]

40 Recovering the Solution. Root: does not extend any node. Leaf: not-extended and root of an inclusion tree. [Figure: graph with nodes 1-9, root and leaves marked]

41 Recovering the Solution. Any set of ESTs in a path from a root to a leaf is mergeable. [Figure: graph with nodes 1-9, root and leaves marked]

42 Recovering the Solution. Add the inclusion tree attached to each node in the path. [Figure: graph with nodes 1-9, root and leaves marked]

43 Recovering the Solution. Lists produced: (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,9). This representation minimizes the necessary comparisons between ESTs. [Figure: graph with nodes 1-9]
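The recovery step (walk every root-to-leaf extension path, then add the inclusion tree of each node on the path) can be sketched as below. The graph topology here is hypothetical: the transcript does not preserve the edges of the slide's figure, so the edges were chosen only so that the output matches the two lists shown. The dictionary layout and function names are also assumptions.

```python
def inclusion_tree(node, included):
    """Node plus every node in the inclusion sub-tree hanging from it."""
    found = [node]
    for child in included.get(node, []):
        found.extend(inclusion_tree(child, included))
    return found


def extension_paths(node, extended_by):
    """All paths from node down to a leaf (a node that nothing extends)."""
    followers = extended_by.get(node, [])
    if not followers:
        return [[node]]
    return [[node] + rest
            for nxt in followers
            for rest in extension_paths(nxt, extended_by)]


def mergeable_lists(nodes, extended_by, included):
    """One sorted list of ESTs per root-to-leaf path, inclusion trees added."""
    extenders = {n for targets in extended_by.values() for n in targets}
    inside = {n for kids in included.values() for n in kids}
    # Simplification: a root neither extends a node nor sits in an inclusion tree.
    roots = [n for n in nodes if n not in extenders and n not in inside]
    lists = []
    for root in roots:
        for path in extension_paths(root, extended_by):
            members = set()
            for node in path:
                members.update(inclusion_tree(node, included))
            lists.append(sorted(members))
    return lists


# Hypothetical topology: extension chain 1 -> 4 -> 6, then 6 -> 8 and 6 -> 9.
extended_by = {1: [4], 4: [6], 6: [8, 9]}
included = {1: [2, 3], 4: [5], 6: [7]}
result = mergeable_lists(range(1, 10), extended_by, included)
```

With this toy graph, `result` holds the two lists from the slide, one per leaf.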

44 How to build the graph. Mutual recursion: search graph (leaves); recursion search along extension branch; search sub-graph; inclusion => go up in the tree.

45 How to build the graph. Example. [Figure: ESTs 1-6]

46 How to build the graph. Example. [Figure: ESTs 1-6 and the graph with nodes 1-6]

47 How to build the graph. Example. Step: Leaves. [Figure: ESTs 1-6, graph with nodes 1-6, new EST 7]

48 How to build the graph. Example. Step: Inclusion. [Figure: ESTs 1-6, graph with nodes 1-6, new EST 7]

49 How to build the graph. Example. Step: Inclusion. [Figure: ESTs 1-6, graph with nodes 1-6, new EST 7]

50 How to build the graph. Example. Step: Extension. [Figure: ESTs 1-6, graph with nodes 1-6, new EST 7]

51 How to build the graph. Example. Step: Inclusion. [Figure: ESTs 1-6, graph with nodes 1-6, new EST 7]

52 How to build the graph. Example. Step: Place 7. [Figure: ESTs 1-6, graph with nodes 1-6, new EST 7]

53 How to build the graph. Example. Step: Inclusion. [Figure: ESTs 1-7, graph with node 7 placed]

54 How to build the graph. Example. Step: tagged as visited - skip. [Figure: ESTs 1-7, graph with node 7 placed]

55 How to build the graph. Example. Possible sub-trees beyond 1 or 3 remain unseen! The representation minimizes the necessary comparisons. [Figure: ESTs 1-7, graph with node 7 placed]

56 Deriving the transcripts from the lists. Internal splice sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute.

57 Deriving the transcripts from the lists. Splice sites: are set to the most common coordinate. 5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most.

58 Single exon transcripts. Reject resulting single exon transcripts when using ESTs.
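The two rules can be sketched as small helpers. This is an illustration only: the inputs are assumed to be forward-strand genomic coordinates already collected from a mergeable list (with the external ends of terminal exons excluded from the splice-site votes), and breaking ties by the smallest coordinate is an arbitrary assumption.

```python
from collections import Counter


def consensus_splice_site(coords):
    """Most common coordinate among the ESTs supporting one splice site.

    Ties are broken by taking the smallest coordinate (an assumption).
    """
    counts = Counter(coords)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)


def transcript_ends(est_spans):
    """Outermost start and end over all merged ESTs, i.e. the coordinates
    that extend the potential UTRs the most."""
    return min(s for s, _ in est_spans), max(e for _, e in est_spans)
```

For example, votes of `[200, 200, 201, 200, 198]` for one splice site would settle on 200, while ESTs spanning `(100, 400)`, `(90, 350)` and `(120, 450)` would give a transcript running from 90 to 450.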

59 Annotation with ESTs ESTs aligned to the genome can provide information about UTRs and alternative splicing

60 Annotation with ESTs EST-Transcripts at www.ensembl.org

61 Annotation with ESTs

62 Results for Human and Mouse
Human EST-genes (assembly ncbi33): 38,581 genes; 122,247 transcripts (42% with full CDS)
Mouse EST-genes (assembly ncbi30): 32,848 genes; 103,664 transcripts (36% with full CDS)

63 How many transcripts are conserved? Is Alternative Splicing conserved?

64 EST-transcript pairs: 42,625 transcript pairs (in 18,242 gene pairs). Of the gene pairs, 78% have one transcript pair conserved and 22% have more than one transcript pair conserved. For 22% of the gene pairs some form of alt. splicing is conserved.

65 Conservation of Alt. Splicing. Take gene-pairs with more than one transcript-pair. 19% of alt. variants in human are conserved in mouse. 32% of alt. variants in mouse are conserved in human.

\[
\%\,\text{conservation} = \frac{\sum \left(\text{number of paired transcripts} - 1\right)}{\sum \left(\text{number of transcripts} - 1\right)}
\]

where \(\sum\) is the sum over genes in a gene pair with more than one variant (the 1 subtracts the ‘main’ transcript form).
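A worked example of the conservation ratio, as a sketch. The gene counts are hypothetical, and the function multiplies the ratio by 100 to report a percentage, which is an interpretation of the "%conservation" label on the slide.

```python
def percent_conservation(genes):
    """Percentage of alternative variants that are conserved.

    genes: list of (number_of_transcripts, number_of_paired_transcripts),
    one entry per gene in a gene pair. Genes with a single variant are
    skipped, and one 'main' transcript form is subtracted from each count.
    """
    selected = [(total, paired) for total, paired in genes if total > 1]
    numerator = sum(paired - 1 for _, paired in selected)
    denominator = sum(total - 1 for total, _ in selected)
    return 100.0 * numerator / denominator if denominator else 0.0


# Hypothetical genes: (transcripts, transcripts paired with the other species).
toy_genes = [(3, 2), (4, 1), (2, 2), (1, 1)]
toy_value = percent_conservation(toy_genes)
# Numerator: (2-1) + (1-1) + (2-1) = 2; denominator: (3-1) + (4-1) + (2-1) = 6.
```

Subtracting one per gene means a gene whose only conserved form is the main transcript contributes nothing to the numerator, so the ratio measures conservation of the alternative forms only.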

66 How many predicted ‘novel’ genes are validated by Human-Mouse comparison?

67 Novel genes
ESTGenes not in Ensembl: 13,174
Human ESTGenes validated by comparison to mouse: 18,242
ESTGenes with at least one complete ORF: 24,201

68 Novel genes: 984 ESTGenes not in Ensembl, validated by comparison to mouse, with a complete ORF

69 THE END

