Published by Ashley Hopkins. Modified over 9 years ago.
1
Gene discovery using combined signals from genome sequence and natural selection
Michael Brent
Washington University
The mouse genome analysis group
2
GENSIPS 10/7/2002
Genes are read out via mRNA & processing
3
RNA Processing
4
A typical human gene structure
5
In a mammalian genome
Finding all the genes is hard
Mammalian genomes are large
  – 5,051 miles of 10 pt type
  – Raleigh to Tripoli, Libya
Only about 1.5% protein coding
  – Raleigh to Winston-Salem
6
Genes are fairly unconstrained
Intron length is highly variable
  ~5% are 40-100 nt long
  ~3% are longer than 30,000 nt
Distance between genes is highly variable
  From 10^3 to 10^6 nt or more (probably)
7
Exons per gene (RefSeq)
8
Background is not random
Segmental duplications
  Entire regions duplicate, then diverge slowly
Processed pseudogenes
  Spliced transcripts integrate back into the genome
  – Sequence is similar to source genes
  – Generally not functional
9
Gene prediction: two approaches
1. Transcript-based (e.g., GeneWise)
  A. Map experimentally determined sequences of spliced transcripts to their genomic source
  B. Map transcript sequences to genomic regions that could produce similar transcripts
2. De novo (genome only)
  Model DNA patterns characteristic of gene components
  – Splice donor and acceptor
  – Protein coding sequence
  – Translation start and stop
10
Advantages and disadvantages
Transcript-based
  Advantage: conservative
  – Evidence of transcription for every exon
  Disadvantage: conservative
  – Can’t find “truly novel” genes
  Still subject to error
11
Advantages and disadvantages
De novo
  Advantage 1: Less biased toward
  – Known transcripts
  – Transcripts that can be sequenced easily
  Advantage 2: Genome sequencing is easy
  Disadvantages
  – No direct evidence of transcription
  – Presumably, more false positives
12
Single-genome de novo: Genscan
Strengths
  For mammalian sequence, one of the best single-genome, de novo gene predictors
  Widely used to great practical advantage
  De facto standard for mammalian sequence
Limitations
  Predicts >45K genes (best est.: 25-30K)
  Predicts >315K exons (best est.: 200K-250K)
  Gets only 9% of known genes exactly right*
13
Dual genome de novo
We developed algorithms that use two genomes to
  Reduce the number of false positives
  Refine the details of the structures
14
Single-genome de novo method
Probability model
  Assigns probability to annotated DNA sequences:
  5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
  [Figure: annotation labels Exon, 5’ UTR, Intron]
Optimization algorithm
  Given a DNA sequence, find the most probable annotation, according to the model
15
Genscan’s generative model
  CCATGGCGTCTTCAGGCAGTGACTC
  [Figure: sequence annotated as Intron, Exon, Intron]
16
Genscan’s generative model
Generalized HMM
  States correspond to gene features
  Model generates DNA sequence by passing through states
  The probability of annotated DNA sequence is the probability of
  – generating the DNA sequence
  – by passing through states corresponding to the annotation.
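As a rough illustration of the probability described on this slide, the sketch below scores one annotated sequence under a plain two-state HMM. It is a simplification, not Genscan: a generalized HMM also models feature lengths explicitly, which is omitted here, and the state names and every probability in the tables are made-up values chosen only for the example.

```python
import math

# Hypothetical toy parameters (not Genscan's): two states, one base per step.
INIT = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def annotation_log_prob(dna, labels):
    """Log-probability of generating `dna` while passing through `labels`.

    `labels` holds one state name per base, i.e. the annotation.
    """
    if len(dna) != len(labels) or not dna:
        raise ValueError("dna and labels must be non-empty and equally long")
    # First base: initial state probability times emission probability.
    logp = math.log(INIT[labels[0]]) + math.log(EMIT[labels[0]][dna[0]])
    # Remaining bases: transition into the state, then emit the base.
    for prev, cur, base in zip(labels, labels[1:], dna[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][base])
    return logp


if __name__ == "__main__":
    print(annotation_log_prob("GC", ["exon", "exon"]))
    print(annotation_log_prob("GC", ["intron", "intron"]))
```

With these toy numbers, labelling a GC-rich stretch as exon scores higher than labelling it as intron, which is the kind of comparison the optimization step performs over all possible annotations.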
17
Dual genome prediction
Input
  Target and informant genomes
Idea
  Patterns of evolution since the last common ancestor may reveal gene structure
18
Two conservation signals
1. Local alignment signal
  Selective pressures differ by feature
  This leaves a characteristic signature
2. Structural signal
  Locations of introns tend to be conserved
19
Characteristic local alignments
Coding exon
human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
       |||||||||||||||||||| || ||||| || || |||
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||     ||  | |||||||||  || ||   ||
mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
20
Conservation of intron location
21
Align → predict → filter → test
[Figure: pipeline with stages labelled WU-BLAST, TWINSCAN, Aligned Intron Filter, and Validation (RT-PCR), illustrated with example target and informant sequences and a local alignment between them]
22
TWINSCAN
Representation change
  Local alignment:
  TCTGCCACC
  || || ||
  TCAGCTACT
  Conservation sequence:
  TCTGCCACC
  ||:||:||
gHMM decoding
23
BLAST Alignments
[Figure: Target and Informant]
24
Projecting BLAST Alignments
[Figure: Target and Informant]
28
Conservation sequence
Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||         | |||||||||  || ||   ||
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
  “|” if it is aligned and identical
29
Conservation sequence
Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||         |:|||||||||::||:||   ||:
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap
30
Conservation sequence
Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||.........|:|||||||||::||:||   ||:
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap
  “.” if it is unaligned
31
Conservation sequence
human                  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
Conservation sequence  ||||||.........|:|||||||||::||:||   ||:
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap
  “.” if it is unaligned
32
Conservation sequence
human                  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
Conservation sequence  ||||||.........|:|||||||||::||:||||:
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap
  “.” if it is unaligned
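The three pairing rules on these slides can be sketched in a few lines of Python. This is a minimal illustration rather than the TWINSCAN implementation: the function name, the tuple layout for an alignment, and the rule that earlier alignments in the list take precedence where alignments overlap are all assumptions made for the example.

```python
def conservation_sequence(target_len, alignments):
    """Build a conservation sequence for a target of `target_len` bases.

    `alignments` is a list of (target_start, target_row, informant_row):
    a 0-based start in the target plus two equally long alignment rows
    that use '-' for gaps. Assumed here: alignments are listed best first,
    so a base already covered by an earlier alignment is left unchanged.
    """
    cons = ["."] * target_len  # "." = unaligned
    for start, target_row, informant_row in alignments:
        pos = start
        for t, i in zip(target_row, informant_row):
            if t == "-":
                # Gap in the target: there is no target base to annotate.
                continue
            if cons[pos] == ".":
                # "|" = aligned and identical, ":" = mismatch or informant gap.
                cons[pos] = "|" if t == i else ":"
            pos += 1
    return "".join(cons)


if __name__ == "__main__":
    hits = [(0, "ACG", "ACG"), (5, "CG-TA", "CTATA")]
    print(conservation_sequence(10, hits))
```

Because columns with a gap in the target are skipped, the result always has exactly one symbol per target base, matching the gap-free form on the last slide of this sequence.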
33
Twinscan: Extending the model
Probability model
  Assigns probability to annotated DNA:
  5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
    |||........|:||||:|||||||||:||::||
  [Figure: annotation labels Exon, 5’ UTR, Intron]
Optimization
  Given DNA and conservation sequence, find the most probable annotation, according to the model
34
Twinscan
Each state “generates” DNA and conservation sequence independently
Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
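One way to read "independently" is that, within a state, the probability of a (base, conservation symbol) pair is the product of two separate emission probabilities. The sketch below uses that reading with standard Viterbi decoding on a plain two-state HMM as a stand-in for the gHMM decoder. The states and all probability values are hypothetical and exist only to make the example run.

```python
import math

# Hypothetical toy parameters; none of these values come from Twinscan.
STATES = ("exon", "intron")
INIT = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
DNA_EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
CONS_EMIT = {
    "exon": {"|": 0.8, ":": 0.15, ".": 0.05},
    "intron": {"|": 0.3, ":": 0.3, ".": 0.4},
}


def emit_logp(state, base, cons):
    """Independent emissions: log P(base | state) + log P(cons | state)."""
    return math.log(DNA_EMIT[state][base]) + math.log(CONS_EMIT[state][cons])


def viterbi(dna, cons):
    """Most probable state path for paired DNA and conservation sequences."""
    best = {s: math.log(INIT[s]) + emit_logp(s, dna[0], cons[0]) for s in STATES}
    backpointers = []
    for base, c in zip(dna[1:], cons[1:]):
        pointers, nxt = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            nxt[s] = best[prev] + math.log(TRANS[prev][s]) + emit_logp(s, base, c)
        backpointers.append(pointers)
        best = nxt
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("GCGCATAT", "||||...."))
```

In this toy setup the conserved, GC-rich first half is labelled exon and the unaligned second half intron, showing how the conservation symbols add evidence on top of the DNA signal.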
35
Performance Evaluation
RefSeq
  A set of ~13,000 “known” mRNAs
  Represents ~40-50% of human genes
  – Usually, only one of several splices
  Mapping to genome is imperfect
  Best available gold standard
40
Short term goal
All multi-exon human genes
  Predict accurately
  – Integrate information from more genomes
  Verify at least one intron experimentally
  Follow up with full-length verification
41
Acknowledgments
Funding agencies
  National Institutes of Health (NHGRI)
  National Science Foundation (DBI)
Sequencing centers
  Sanger, Whitehead, Wash. U.
My group
  Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
Collaborators
  Roderic Guigo, Josep Abril, Genis Parra
  – Pankaj Agarwal
  Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis
42
Other clades
Plants
  Arabidopsis thaliana, cabbage, rice
Nematodes
  C. elegans, C. briggsae
Fungi
  Cryptococcus neoformans (JEC21, H99)
43
Pair HMM algorithms (SLAM,…)
Input is orthologous sequences.
Aligns and predicts simultaneously, using a joint probability model
Predicts orthologous genes in 2 sequences
All predicted CDS is aligned
Some aligned regions are not predicted CDS
  – Labeled conserved non-coding sequence
44
The algorithms (SLAM,…)
sgp2
Alignment before prediction (tblastx)
Predicts genes in target sequence only
Don’t need orthologous input sequences
  – Paralogs & low-coverage shotgun can help
Modifies scores of all potential exons, by
  – At each base, add tblastx score of best overlapping local alignment (roughly)
  – To geneid scores of that potential exon
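The exon rescoring step can be sketched roughly as follows. The slide itself says "roughly", and this sketch is looser still: the per-base HSP score, the half-open coordinate convention, the `weight` parameter, and the function name are all assumptions for illustration, not details of sgp2 or geneid.

```python
def rescore_exon(exon_start, exon_end, base_score, hsps, weight=1.0):
    """Add alignment support to a candidate exon's score.

    `hsps` is a list of (start, end, score_per_base) local alignments in
    target coordinates, half-open like the exon interval. For every base
    of the exon, the score of the best overlapping HSP is added; bases
    with no overlapping HSP contribute nothing.
    """
    bonus = 0.0
    for pos in range(exon_start, exon_end):
        overlapping = [score for start, end, score in hsps if start <= pos < end]
        if overlapping:
            bonus += max(overlapping)
    return base_score + weight * bonus


if __name__ == "__main__":
    hsps = [(8, 12, 1.0), (11, 20, 2.0)]
    print(rescore_exon(10, 14, 5.0, hsps))
```

Exons without any overlapping alignment keep their original score, so alignment evidence can only raise a candidate relative to unsupported ones in this simplified version.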
45
The algorithms
TWINSCAN
Alignment before prediction (blastn)
Predicts in target sequence only
Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by
  – At each base, applying a feature-specific scoring model (estimated for this purpose)
  – to the best overlapping local alignment, and adding the result
  – to Genscan scores for that feature
46
% Aligned, CDS vs. other
47
Syntenic Gene Prediction (sgp2)
[Figure: tracks labelled Query Sequence, tblastx HSPs, geneid Exons, HSPs Projections, SGP Exons]
48
Why work on gene finding?
Genes are
  Components responsible for biological function
  Variations cause human disease / susceptibility
  Controls for modifying biological function
  – Human gene therapy
  – Agriculture
  – Nanotechnology, etc.