1
Gene Prediction: Past, Present, and Future
Sam Gross
2
Genes
[Figure: gene → RNA → protein; the gene begins at ATG and ends at TAG, TAA, or TGA]
Proteins are about 500 AA long
Genes are about 1500 bp long
3
ORF Scanning
In “lower” organisms, genes are contiguous
We expect about 1 stop codon per 64 bp
If we see a long ORF, it’s probably a gene!
–And conversely, all genes are long ORFs
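The ORF-scanning idea can be sketched in a few lines. This is a minimal illustration, not a production gene finder: the function name `find_orfs`, the `min_codons` threshold, and the restriction to the forward strand are all assumptions made for the example.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end) half-open index pairs of ATG...stop ORFs.

    Only the forward strand is scanned (an assumption of this sketch).
    The codon count includes both the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Toy sequence: one 7-codon ORF starting at index 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

In random sequence a stop codon turns up about once every 64 bp in a given frame, so a high `min_codons` threshold filters out most chance ORFs.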
4
Introns
[Figure: gene from ATG to a stop codon (TGA, TAA, or TAG), interrupted by introns that begin with GT and end with AG]
Drosophila: 3.4 introns per gene on average
–mean intron length 475 bp, mean exon length 397 bp
Human: 8.8 introns per gene on average
–mean intron length 4400 bp, mean exon length 165 bp
ORF scanning is defeated
5
Splicing
[Figure: introns (GT … AG) are spliced out of the transcript between the start codon ATG and the stop codon (TGA, TAA, or TAG)]
6
Needles in a Haystack
Human genome is about 3.2 Gbp
20,000 – 25,000 genes
78% intergenic, 20% introns, 2% coding
7
Gotta Find ‘Em All
60-85% of all human genes have been found, mostly by random EST sequencing
–This probably won’t work for the rest
For most genes, only one splice variant is known
If we can computationally predict a gene, we have a cheap experiment (RT-PCR) to verify it
8
Looking For Clues
Signals used by the cell
–99% of introns begin with GT, end with AG
–0.8% of introns begin with GC, end with AG
–Gene begins with ATG
–Gene ends with TAG, TAA, or TGA
Other properties of genes
–Exons have characteristic lengths
–Base composition of exons is characteristic due to genetic code
–Exons tend to be conserved between species
–Pattern of conservation is three-periodic
9
Three-Periodicity
Most amino acids can be coded for by more than one DNA triplet (codon)
Usually, the degeneracy is in the last position
Human    CCT GTT  (Proline, Valine)
Mouse    CCA GTC  (Proline, Valine)
Rat      CCA GTC  (Proline, Valine)
Dog      CCG GTA  (Proline, Valine)
Chicken  CCC GTG  (Proline, Valine)
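The three-periodic pattern in the five-species example can be checked directly by counting which codon positions vary across species. This is a small sketch; the function name is made up for illustration, and it assumes the alignment starts at the first base of a codon.

```python
# Aligned sequences from the slide (two codons: proline, valine).
ALIGNMENT = {
    "Human": "CCTGTT",
    "Mouse": "CCAGTC",
    "Rat": "CCAGTC",
    "Dog": "CCGGTA",
    "Chicken": "CCCGTG",
}

def variable_columns_by_codon_position(alignment):
    """Count alignment columns with more than one distinct base,
    grouped by codon position (1, 2, or 3)."""
    seqs = list(alignment.values())
    counts = {1: 0, 2: 0, 3: 0}
    for col in range(len(seqs[0])):
        if len({s[col] for s in seqs}) > 1:
            counts[col % 3 + 1] += 1
    return counts

print(variable_columns_by_codon_position(ALIGNMENT))
```

All variation falls on the third codon position, which is the signature a comparative gene predictor looks for in coding sequence.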
10
Hidden Markov Models
The de facto standard for gene prediction
Probabilistic finite state machine
Transition to a state, emit a character, transition to a new state
–Many independence assumptions
[Figure: two-state HMM with states NC and CDS emitting the sequence ACG]
11
HMMs For Gene Prediction
Generative model
–Define P(X, Y) as a product of many independent terms
P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G)
Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i)
–Trained by collecting counts
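The product on this slide can be evaluated mechanically. The sketch below uses a two-state HMM (NC and CDS) whose probabilities are invented purely for illustration; only the structure of the product comes from the slide.

```python
import math

# Hypothetical parameters, chosen only for illustration.
START = {"NC": 0.9, "CDS": 0.1}
TRANS = {"NC": {"NC": 0.8, "CDS": 0.2}, "CDS": {"NC": 0.1, "CDS": 0.9}}
EMIT = {
    "NC": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def joint_probability(x, y):
    """P(X, Y): product of start, transition, and emission terms."""
    p = START[y[0]] * EMIT[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= TRANS[y[i - 1]][y[i]] * EMIT[y[i]][x[i]]
    return p

# Sequence ACG labelled noncoding, noncoding, coding.
print(joint_probability("ACG", ["NC", "NC", "CDS"]))
```

With these toy numbers the product is 0.9 * 0.3 * 0.8 * 0.2 * 0.2 * 0.3. Training by counting would amount to replacing each table entry with an observed relative frequency from labelled sequences.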
12
HMMs For Gene Prediction
To predict genes given a sequence X, calculate
argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y)
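The argmax over labelings is usually computed with the Viterbi algorithm. The following is a minimal log-space sketch for a two-state model; the parameter values are hypothetical and the function name `viterbi` is just a conventional label.

```python
import math

# Hypothetical parameters, chosen only for illustration.
STATES = ("NC", "CDS")
START = {"NC": 0.9, "CDS": 0.1}
TRANS = {"NC": {"NC": 0.8, "CDS": 0.2}, "CDS": {"NC": 0.1, "CDS": 0.9}}
EMIT = {
    "NC": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def viterbi(x):
    """Return a state path Y maximizing P(X, Y)."""
    # best[s]: log-probability of the best path for the prefix ending in s
    best = {s: math.log(START[s]) + math.log(EMIT[s][x[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for ch in x[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            r = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            pointers[s] = r
            best[s] = prev[r] + math.log(TRANS[r][s]) + math.log(EMIT[s][ch])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("ACGGCTA"))
```

Working in log space avoids underflow on long sequences. The running time is quadratic in the number of states and linear in the sequence length.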
13
Generalized Hidden Markov Models
Like an HMM, but state durations are explicit
Transition to a state, pick a duration d, emit d characters, transition to a new state
Dynamic programming algorithm complexity goes from O(N^2 L) to O(N^2 L K)
–N is the number of states, L is the sequence length
–K is the maximum state duration
–Not so bad in practice
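A segment-level Viterbi recursion makes the extra factor of K visible: for every end position and state, the algorithm also loops over candidate durations. The sketch below is illustrative only. The function signature, the ban on self-transitions, and the callback-style duration and emission models are assumptions of this example, not details of any particular gene predictor.

```python
import math

def ghmm_viterbi(x, states, log_start, log_trans, log_duration, log_emit,
                 max_duration):
    """Best parse of x as a list of (state, start, end) half-open segments.

    log_duration(s, d) and log_emit(s, segment) are caller-supplied scoring
    functions. Self-transitions are disallowed (an assumption of this sketch).
    """
    L = len(x)
    neg_inf = float("-inf")
    # best[j][s]: best log-score of a parse of x[:j] whose last segment is in s
    best = [{s: neg_inf for s in states} for _ in range(L + 1)]
    back = [{s: None for s in states} for _ in range(L + 1)]
    for j in range(1, L + 1):                              # L end positions
        for s in states:                                   # N states
            for d in range(1, min(max_duration, j) + 1):   # up to K durations
                i = j - d
                seg = log_duration(s, d) + log_emit(s, x[i:j])
                if i == 0:
                    candidates = [(log_start[s] + seg, None)]
                else:                                      # N predecessors
                    candidates = [(best[i][r] + log_trans[r][s] + seg, r)
                                  for r in states if r != s]
                for score, r in candidates:
                    if score > best[j][s]:
                        best[j][s] = score
                        back[j][s] = (i, r)
    # Trace back from the best final state.
    s = max(states, key=lambda q: best[L][q])
    segments, j = [], L
    while j > 0:
        i, r = back[j][s]
        segments.append((s, i, j))
        j, s = i, r
    return segments[::-1]

# Toy model: NC prefers A, CDS prefers C, durations uniform on 1..6.
emit_p = {"NC": {"A": 0.9, "C": 0.1}, "CDS": {"A": 0.1, "C": 0.9}}
print(ghmm_viterbi(
    "AAAACCCCAAAA", ("NC", "CDS"),
    log_start={"NC": math.log(0.5), "CDS": math.log(0.5)},
    log_trans={"NC": {"CDS": 0.0}, "CDS": {"NC": 0.0}},
    log_duration=lambda s, d: math.log(1 / 6),
    log_emit=lambda s, seg: sum(math.log(emit_p[s][c]) for c in seg),
    max_duration=6,
))
```

Scoring each segment from scratch, as the toy `log_emit` does, adds a further factor proportional to the segment length. Practical implementations typically avoid this with precomputed prefix sums.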
14
Predicting Genes With HMMs
Given a sequence, we can calculate the most likely annotation
[Figure: state diagram with states Intergenic, Single Exon, Initial Exon, Internal Exon, Final Exon, and Intron, annotating the sequence GGTGAGGTGACCAAGAACGTGTTGACAGTA]
15
The Past: GENSCAN
Chris Burge, Stanford, 1997
Before the human genome was sequenced
–No alignments available
–People still thought there were 100,000 human genes
16
The GENSCAN Model
17
Output probabilities for NC and CDS depend on previous 5 bases (5th-order)
–P(X_i | X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5})
Each CDS frame has its own model
Special 2nd-order positional models for start codon, stop codon, and acceptor site
Even fancier model for donor sites
–Maximal dependence decomposition (MDD)
–Long-range dependencies
Separate model for different isochores
18
GENSCAN Performance
First program to do well on realistic sequences
–Multiple genes in both orientations
Pretty good sensitivity, poor specificity
–70% exon Sn, 40% exon Sp
Not enough exons per gene
Was the best gene predictor for about 4 years
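Exon-level Sn and Sp figures like these are simple set computations. The sketch below assumes exact-match scoring (both exon boundaries must agree) and the usual gene-prediction convention in which "specificity" means TP / (TP + FP), which other fields call precision. The coordinates in the example are made up.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact exon matches.

    Exons are (start, end) tuples. Sensitivity is the fraction of annotated
    exons predicted exactly; specificity is the fraction of predicted exons
    that are exactly right.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sn = true_positives / len(annotated) if annotated else 0.0
    sp = true_positives / len(predicted) if predicted else 0.0
    return sn, sp

# Hypothetical coordinates: two of three predictions are exactly right.
annotated = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted = [(100, 200), (300, 400), (500, 640)]
print(exon_accuracy(predicted, annotated))
```

Under this definition a predictor that calls many spurious exons keeps its sensitivity but loses specificity, which is the pattern the slide describes.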
19
Comparative Gene Prediction
[Figure: two alignments, A and B, of an exon/intron boundary at positions -3 to +3]
A
         -3 -2 -1 +1 +2 +3
Human     A  A  G  G  T  G
Mouse     A  A  G  G  T  G
Chicken   A  A  G  G  T  G
B
         -3 -2 -1 +1 +2 +3
Human     A  A  G  G  T  G
Mouse     A  A  T  G  T  G
Chicken   A  A  _  A  C  G
20
The Recent Past: TWINSCAN
Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
Uses an informant sequence to help predict genes
–For human, informant is normally mouse
Informant sequence consists of three characters
–Match: |
–Mismatch: :
–Unaligned: .
Informant sequence assumed independent of target sequence
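The three-symbol encoding can be produced from a pairwise alignment in one pass. This is an illustrative sketch rather than TWINSCAN's actual preprocessing: the function name is made up, and treating informant gaps (written `_`) as mismatches is an assumption of the example.

```python
def conservation_sequence(target, informant):
    """Encode an aligned informant relative to the target.

    '|' = match, ':' = mismatch, '.' = no informant sequence aligned here.
    Informant gaps ('_') are scored as mismatches (assumption of this sketch).
    """
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if q == ".":
            symbols.append(".")
        elif q == t:
            symbols.append("|")
        else:
            symbols.append(":")
    return "".join(symbols)

print(conservation_sequence("GGTGAGGTGACC", "GGTCAGC___.."))
```

The result has one symbol per target base, so it can be fed to the HMM as a second observation track alongside the DNA.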
21
The TWINSCAN Model
Just like GENSCAN, except adds models for conservation sequence
5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites
–One CDS model for all frames
Many informants tried, but mouse seems to be at the “sweet spot”
22
TWINSCAN Performance
Slightly more sensitive than GENSCAN, much more specific
–Exon sensitivity/specificity about 75%
Much better at the gene level
–Most genes are mostly right, about 25% exactly right
Was the best gene predictor for about 4 years
23
The Present: N-SCAN
Gross and Brent, Washington University in St. Louis, 2005
If one informant sequence is good, let’s try more!
Also several other improvements on TWINSCAN
24
N-SCAN Improvements
Multiple informants
Richer models of sequence evolution
Frame-specific CDS conservation model
Conserved noncoding sequence model
5’ UTR structure model
25
HMM Outputs
GENSCAN
Target                 GGTGAGGTGACCAAGAACGTGTTGACAGTA
TWINSCAN
Target                 GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation sequence  |||:||:||:|||||:||||||||......
N-SCAN
Target                 GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1             GGTCAGC___CCAAGAACGTGTAG......
Informant2             GATCAGC___CCAAGAACGTGTAG......
Informant3             GGTGAGCTGACCAAGATCGTGTTGACACAA
...
26
N-SCAN State Diagram
27
Two-Component Output Distributions
Target sequence model
Phylogenetic model for informants
Product gives the probability of a multiple alignment column
28
Phylogenetic Bayesian Network Models
29
Graph Transformation
30
Inference
Slightly-modified version of Felsenstein’s algorithm
At each of the O(N) nodes, we calculate 6^(o+1) summations over 6^(o+1) values
Total time complexity is O(N * 6^(2(o+1)))
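For intuition, here is a sketch of the standard Felsenstein pruning recursion on a six-symbol alphabet (four bases, gap, unaligned) for order o = 0. It is not N-SCAN's modified algorithm. The tree layout, the node dictionaries, and the toy conditional probability table are all assumptions made for the example.

```python
import math

ALPHABET = "ACGT_."  # four bases, gap, unaligned

def simple_cpd(stay=0.75):
    """Toy CPD: P(child = b | parent = a), favouring no change."""
    other = (1 - stay) / (len(ALPHABET) - 1)
    return {a: {b: stay if a == b else other for b in ALPHABET}
            for a in ALPHABET}

def partial(node, observed):
    """Return {a: P(observed symbols at or below node | node = a)}.

    Each node costs 6 summations over 6 values per child, which is the
    6^(o+1) by 6^(o+1) work per node for order o = 0.
    """
    if node["name"] in observed:
        result = {a: float(a == observed[node["name"]]) for a in ALPHABET}
    else:
        result = {a: 1.0 for a in ALPHABET}
    for child in node["children"]:
        below = partial(child, observed)
        for a in ALPHABET:
            result[a] *= sum(child["cpd"][a][b] * below[b] for b in ALPHABET)
    return result

# Hypothetical tree: observed target at the root, one unobserved ancestor.
cpd = simple_cpd()
tree = {"name": "human", "cpd": None, "children": [
    {"name": "ancestor", "cpd": cpd, "children": [
        {"name": "mouse", "cpd": cpd, "children": []},
        {"name": "rat", "cpd": cpd, "children": []},
    ]},
]}
column = {"human": "A", "mouse": "A", "rat": "A"}
# P(mouse = A, rat = A | human = A), summing out the unobserved ancestor.
print(partial(tree, column)["A"])
```

For higher orders each "symbol" becomes a tuple of o + 1 characters, so the tables grow to 6^(o+1) entries per axis and the per-node work grows accordingly.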
31
Training
Simple with labeled multiple alignment of all sequences
Can use known genes as a labeling
Don’t know ancestral genome sequences
–Treat them as missing data and use EM
32
CPD Parameterizations
Each Bayesian network of order o has (2N-1)(6^(o+1))(6^(o+1) - 1) free parameters
We can reduce this number by restricting the form of the CPDs
Partially reversible models
–Relative frequency of DNA k-mers remains constant as sequence evolves
–Gaps and unaligned regions introduced over time
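Plugging numbers into the parameter-count formula shows why restricted CPDs are attractive. The helper below is only a worked calculation; the function name and the reading of N as the number of aligned species are assumptions.

```python
def free_parameters(n, order, alphabet_size=6):
    """Evaluate (2N - 1) * 6^(o+1) * (6^(o+1) - 1) from the slide."""
    k = alphabet_size ** (order + 1)
    return (2 * n - 1) * k * (k - 1)

for order in range(3):
    print(order, free_parameters(4, order))
```

With N = 4 the count grows from 210 at order 0 to 8,820 at order 1 and 325,080 at order 2, so unrestricted higher-order models quickly outrun the available training data.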
33
N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models
Root (target) node is observed
–Can use existing single-sequence models
–Can use higher-order models
–Can estimate target sequence model optimally
34
N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models
No assumption of homogeneous substitution process
–Gaps and unaligned regions can be treated naturally
–Robust against function-changing mutation, alignment error, and sequencing error
–The price is many more parameters
35
Conservation Score Coefficient
N-SCAN uses log-likelihood scores internally. The score of a position i under state S is
[Equation not preserved in this transcript; it involves a coefficient k]
Values of k between 0.3 and 0.6 result in the best performance
–Performance is roughly constant in this range
36
Whole-Genome Human Gene Prediction
Annotations used were cleaned RefSeqs
–16,259 genes
–20,837 transcripts
N-SCAN used human, mouse, rat, chicken alignment
37
Exact Exon Accuracy
38
Exact Gene Accuracy
39
Intron Sensitivity By Length
40
Human Informant Effectiveness
41
Drosophila Informant Effectiveness
42
The Future(?): CONTRAST
New gene predictor currently in the works
Based not on a generalized HMM, but a semi-Markov conditional random field (SCRF)
43
HMMs For Gene Prediction
Generative model
–Define P(X, Y) as a product of many independent terms
P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G)
Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i)
–Trained by collecting counts
44
HMMs For Gene Prediction
To predict genes given a sequence X, calculate
argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y)
Advantage: simplicity
–Extremely fast training, efficient inference
Disadvantage: simplicity
–Makes many unwarranted independence assumptions
–Inaccurate model will get us into trouble
45
When HMMs Go Wrong
Normal HMM training optimizes the wrong function
–We use P(Y | X) for prediction, but we’re optimizing P(X, Y) = P(Y | X) P(X)
–This means we may prefer parameters that lead to worse predictions if they assign a higher probability to the sequence
46
When HMMs Go Wrong
A = Conserved triplet, B = Synonymous substitution, C = Nonsynonymous substitution
[Tables: output distributions per state, in the three groups shown on the slide]
State   A     B     C
NC      3%    2%    95%
CDS     49%   49%   2%

State   A     B     C
NC      3%    2%    95%
CDS     3%    95%   2%

State   A     B     C
NNC     2%    2%    96%
CNS     96%   2%    2%
CDS     49%   49%   2%
Example sequences:
…CCCCCCCCCCCCCAAAAAAAAAACCCC…
…CCCCCCCBBABAAABBABBABCC…
47
Can We Fix It?
Directly optimize P(Y | X), the quantity used for prediction
No closed form solution
–But function and gradient can be calculated efficiently using DP
If we’re going to numerically optimize anyway, might as well switch to a more expressive model
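The conditional likelihood P(Y | X) = P(X, Y) / P(X) can be evaluated with dynamic programming, because the forward algorithm computes P(X) without enumerating paths. The sketch below shows the function value only (no gradient) on a two-state model with made-up parameters. It works with raw probabilities, which is fine for short toy sequences but would underflow on real ones.

```python
import math

# Hypothetical parameters, chosen only for illustration.
STATES = ("NC", "CDS")
START = {"NC": 0.9, "CDS": 0.1}
TRANS = {"NC": {"NC": 0.8, "CDS": 0.2}, "CDS": {"NC": 0.1, "CDS": 0.9}}
EMIT = {
    "NC": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def joint(x, y):
    """P(X, Y) for one labelling."""
    p = START[y[0]] * EMIT[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= TRANS[y[i - 1]][y[i]] * EMIT[y[i]][x[i]]
    return p

def marginal(x):
    """P(X) by the forward algorithm, summing over all labellings."""
    f = {s: START[s] * EMIT[s][x[0]] for s in STATES}
    for ch in x[1:]:
        f = {s: EMIT[s][ch] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())

def conditional_log_likelihood(x, y):
    """log P(Y | X) = log P(X, Y) - log P(X)."""
    return math.log(joint(x, y)) - math.log(marginal(x))

print(conditional_log_likelihood("ACGG", ["NC", "NC", "CDS", "CDS"]))
```

A discriminative trainer would sum this quantity over the training set and maximize it numerically, whereas count-based training maximizes the joint term alone.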
48
CRFs For Gene Prediction
Discriminative model
–Define P(Y | X) as a product of many terms
Individual terms are not probabilities!
Terms are of the form f_j(y_{i-1}, y_i, X, i) w_j
The Good
–Independence assumptions much weaker than in HMMs
–Inference complexity is the same as for HMM
The Bad
–Training requires numerical optimization of (convex) likelihood function
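A linear-chain CRF can be sketched with the standard log-linear form, in which P(Y | X) is proportional to the exponential of a weighted sum of feature values and is normalized by a partition function Z(X). The slide's exact formula is not fully preserved in this transcript, so that form is an assumption here, as are the feature functions, weights, and state names.

```python
import math
from itertools import product

STATES = ("exon", "intron")

# Hypothetical feature functions f_j(y_prev, y_cur, X, i).
def f_exon_to_intron(y_prev, y_cur, x, i):
    return float(y_prev == "exon" and y_cur == "intron")

def f_exon_emits_c(y_prev, y_cur, x, i):
    return float(y_cur == "exon" and x[i] == "C")

def f_exon_gc_rich_window(y_prev, y_cur, x, i, half_width=2):
    # Looks at X beyond position i, which an HMM emission term cannot do.
    window = x[max(0, i - half_width): i + half_width + 1]
    gc = sum(ch in "GC" for ch in window) / len(window)
    return float(y_cur == "exon" and gc >= 0.5)

FEATURES = (f_exon_to_intron, f_exon_emits_c, f_exon_gc_rich_window)
WEIGHTS = (-0.5, 0.3, 1.2)  # made-up values

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def local_score(y_prev, y_cur, x, i):
    return sum(w * f(y_prev, y_cur, x, i) for w, f in zip(WEIGHTS, FEATURES))

def log_partition(x):
    """log Z(X) by a forward recursion, same cost as for an HMM."""
    alpha = {s: local_score(None, s, x, 0) for s in STATES}
    for i in range(1, len(x)):
        alpha = {s: logsumexp([alpha[r] + local_score(r, s, x, i)
                               for r in STATES]) for s in STATES}
    return logsumexp(list(alpha.values()))

def log_conditional(x, y):
    """log P(Y | X) = total score of Y minus log Z(X)."""
    score = sum(local_score(y[i - 1] if i else None, y[i], x, i)
                for i in range(len(x)))
    return score - log_partition(x)

x = "ACGCGTTA"
total = sum(math.exp(log_conditional(x, y))
            for y in product(STATES, repeat=len(x)))
print(total)  # the conditional probabilities sum to 1 up to rounding
```

The individual exponentiated terms are not probabilities; only the globally normalized product is, which is why the weights can be trained freely without per-state normalization constraints.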
49
The Math CRFs HMMs
50
HMMs vs. CRFs
[Figure: graphical models of an HMM and a CRF over labels y1 … y6 and observations x1 … x6]
51
HMMs vs. CRFs
HMM-style “features”
–Last state is exon, current state is intron
–Current state is exon, current sequence character is “C”
CRF-style features
–Current state is exon, CG percent in 100 Kbp window is between 40% and 50%, at least one CpG island predicted within 10 Kbp
–Current state is exon, 3 unspliced ESTs with at least 95% identity aligned near current position
–Current state is exon, 1 spliced EST with at least 95% identity aligned near current position
52
Semi-Markov CRFs
Semi-Markov CRFs are to CRFs as generalized HMMs (or semi-HMMs) are to HMMs
Instead of assigning labels to each position, assign labels to segments
Features are f(y_{i-1}, y_i, X, i, j)
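Segment-level features can look at a whole segment at once, for example its length or both of its ends. The sketch below scores one segmentation as a weighted sum of such features. The feature definitions, weights, and the toy length range are invented for illustration, and the partition function needed to turn scores into probabilities is omitted.

```python
# Hypothetical segment features f(y_prev, y_cur, X, i, j) on the segment x[i:j].
def f_exon_typical_length(y_prev, y_cur, x, i, j):
    return float(y_cur == "exon" and 3 <= j - i <= 6)  # toy length range

def f_intron_gt_ag(y_prev, y_cur, x, i, j):
    return float(y_cur == "intron" and x[i:i + 2] == "GT" and x[j - 2:j] == "AG")

def f_exon_then_intron(y_prev, y_cur, x, i, j):
    return float(y_prev == "exon" and y_cur == "intron")

FEATURES = (f_exon_typical_length, f_intron_gt_ag, f_exon_then_intron)
WEIGHTS = (1.0, 2.0, 0.5)  # made-up values

def segmentation_score(x, segments):
    """Sum of weighted features over (label, start, end) half-open segments."""
    total, prev = 0.0, None
    for label, i, j in segments:
        total += sum(w * f(prev, label, x, i, j)
                     for w, f in zip(WEIGHTS, FEATURES))
        prev = label
    return total

x = "ATGCCGGTAAAGCCC"
print(segmentation_score(x, [("exon", 0, 6), ("intron", 6, 12), ("exon", 12, 15)]))
print(segmentation_score(x, [("exon", 0, 15)]))
```

A position-level CRF has no natural way to express the length feature or the paired GT/AG check, since both depend on where a segment starts and ends.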
53
Future Directions
SVM-based splice site models that use alignment information
–Splice site models in current gene predictors are pretty primitive
Alternative splicing!
–Not yet handled well
–Very poor experimental coverage of transcriptome