1
CS262 Lecture 9, Win07, Batzoglou Gene Recognition
2
Gene structure
exon1, intron1, exon2, intron2, exon3
Processes: transcription, splicing, translation
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid
3
Needles in a Haystack
4
Classes of Gene Predictors
Ab initio: only look at the genomic DNA of the target genome
De novo: target genome + aligned informant genome(s)
EST/cDNA-based & combined approaches: use aligned ESTs or cDNAs + any other kind of evidence

Gene Finding: EXON (uppercase in the alignment)
Human      tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque    tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse      ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat        ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit     t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog        t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow        t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo  gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant   gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec     tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum    ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken    ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
5
Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)
6
[Figure labels: Next Exon: Frame 0; Next Exon: Frame 1]
7
Exon and Intron Lengths
8
Nucleotide Composition
Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
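The codon table can be turned into a small translation routine, which makes the notion of reading frame concrete. The following is a minimal sketch, not part of the original slides: the names `CODONS_BY_AMINO_ACID`, `STOP_CODONS`, and `translate` are illustrative, and the three stop codons are an addition from the standard genetic code, since the slide's table lists only amino acids.

```python
# Codon table transcribed from the slide (single-letter code -> DNA codons).
CODONS_BY_AMINO_ACID = {
    "I": ["ATT", "ATC", "ATA"],
    "L": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "F": ["TTT", "TTC"],
    "M": ["ATG"],
    "C": ["TGT", "TGC"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "Y": ["TAT", "TAC"],
    "W": ["TGG"],
    "Q": ["CAA", "CAG"],
    "N": ["AAT", "AAC"],
    "H": ["CAT", "CAC"],
    "E": ["GAA", "GAG"],
    "D": ["GAT", "GAC"],
    "K": ["AAA", "AAG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
}

# Assumption: stop codons are not in the slide's table; these are the
# three stops of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Invert the table: codon -> single-letter amino acid code.
CODON_TO_AA = {
    codon: aa
    for aa, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons
}


def translate(dna, frame=0):
    """Translate DNA in the given frame (0, 1, or 2) until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break
        # Unknown triplets (e.g. containing N) are reported as "X".
        protein.append(CODON_TO_AA.get(codon, "X"))
    return "".join(protein)
```

Calling `translate("ACTTTAAAG")` reads the triplets ACT, TTA, AAG, while `translate("ACTTTAAAG", frame=1)` reads CTT and then hits the stop codon TAA. The same bases give a different result in a different frame, which is why gene finders track frame across exons.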
9
[Figure: example signal sequences: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc, ggtgag]
10
Splice Sites (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
11
HMMs for Gene Recognition
Sequence: GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Annotation labels: exon, intron, intergene
States: Intergene State, First Exon State, Intron State
13
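Duration HMMs for Gene Recognition
Example sequence: TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC
States: Exon1, Exon2, Exon3, each with a duration d (position labels in the figure: i, j+2)
Model components:
  Intron emission: $P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
  Exon duration: $P_{\mathrm{EXON\_DUR}}(d)$
  Exon emission (frame-dependent): $P_{\mathrm{EXON}((i - j + 2) \% 3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
  5' splice site: $P_{5'\mathrm{SS}}(x_{i-3} \ldots x_{i+4})$
  Stop codon: $P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$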
14
Genscan
Burge, 1997
First competitive HMM-based gene finder, huge accuracy jump
Only gene finder at the time to predict partial genes and genes on both strands
Features
– Duration HMM
– Four different parameter sets: very low, low, med, high GC-content
15
Using Comparative Information
16
Using Comparative Information
Hox cluster is an example where everything is conserved
17
Patterns of Conservation

              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
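The "separation" figures on this slide appear to be the ratio of the intergenic rate to the genic rate for each event type. The sketch below checks that arithmetic. It assumes that the first three percentages on the slide belong to genes and the next three to intergenic regions, which is the reading under which the ratios match the quoted folds. The names `RATES_PERCENT` and `fold_separations` are illustrative.

```python
# Rates in percent, read from the slide as (genes, intergenic).
# Assumption: the column assignment is inferred from the flattened slide text.
RATES_PERCENT = {
    "Mutations": (30.0, 58.0),
    "Gaps": (1.3, 14.0),
    "Frameshifts": (0.14, 10.2),
}


def fold_separations(rates):
    """Return intergenic rate divided by genic rate for each event type."""
    return {
        event: intergenic / genic
        for event, (genic, intergenic) in rates.items()
    }


if __name__ == "__main__":
    for event, fold in fold_separations(RATES_PERCENT).items():
        print(f"{event}: {fold:.1f}-fold")
```

The computed ratios are roughly 1.9, 10.8, and 72.9, which the slide rounds to 2-fold, 10-fold, and 75-fold. Frameshifting gaps therefore separate coding from non-coding sequence far better than plain substitutions do.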
18
Comparison-based Gene Finders
Rosetta, 2000; CEM, 2000
– First methods to apply comparative genomics (human-mouse) to improve gene prediction
Twinscan, 2001
– First HMM for comparative gene prediction in two genomes
SLAM, 2002
– Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
NSCAN, 2006
– Best method to date, based on a phylo-HMM for multiple genome gene prediction
19
Twinscan
1. Align the two sequences (e.g., from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
   New “alphabet”: 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions $e_k(b)$, where $b \in$ { A-, A:, A|, …, U| }
   Emission distributions $e_k(b)$ are estimated from real human/mouse genes
   $e_I(x|) < e_E(x|)$: matches favored in exons
   $e_I(x-) > e_E(x-)$: gaps (and mismatches) favored in introns
Example
Human     : ACGGCGACGUGCACGU
Mouse     : ACUGUGACGUGCACUU
Alignment : ||:|:|||||||||:|
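Step 2 of the Twinscan recipe can be sketched in a few lines. This is an illustration of the marking scheme described on the slide, not Twinscan's actual code: the function names are hypothetical, and the input is assumed to be an alignment already projected onto the human sequence, so that both strings have equal length and a `-` in the mouse row means the human base is unaligned.

```python
def conservation_symbols(target, informant):
    """Mark each target base as gap '-', mismatch ':', or match '|'."""
    if len(target) != len(informant):
        raise ValueError("aligned rows must have equal length")
    symbols = []
    for t, m in zip(target, informant):
        if m == "-":
            symbols.append("-")   # no informant base aligned here
        elif t == m:
            symbols.append("|")   # identical bases
        else:
            symbols.append(":")   # aligned but different
    return "".join(symbols)


def joint_alphabet(target, informant):
    """Pair each base with its symbol, giving letters such as 'A|' or 'G:'."""
    marks = conservation_symbols(target, informant)
    return [base + mark for base, mark in zip(target, marks)]


if __name__ == "__main__":
    human = "ACGGCGACGUGCACGU"
    mouse = "ACUGUGACGUGCACUU"
    print(conservation_symbols(human, mouse))
    print(joint_alphabet(human, mouse)[:5])
```

On the slide's example, `conservation_symbols` reproduces the alignment row `||:|:|||||||||:|`. The list returned by `joint_alphabet` is the observation sequence over the 12-letter alphabet that the HMM would then decode with Viterbi.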
20
SLAM – Generalized Pair HMM
Exon GPHMM (figure labels: d, e)
1. Choose exon lengths (d, e).
2. Generate alignment of length d + e.
21
NSCAN: Multiple Species Gene Prediction
Methods compared: GENSCAN, TWINSCAN, N-SCAN

Target plus conservation sequence:
  Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Conservation  |||:||:||:|||||:||||||||......

Target plus informant sequences:
  Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Informant1    GGTCAGC___CCAAGAACGTGTAG......
  Informant2    GATCAGC___CCAAGAACGTGTAG......
  Informant3    GGTGAGCTGACCAAGATCGTGTTGACACAA

Figure labels: Target sequence; Informant sequences (vector); Joint prediction (use phylo-HMM)
22
NSCAN: Multiple Species Gene Prediction
[Figure: diagram with node labels X, C, Y, Z, H, M, R]
23
Performance Comparison
GENSCAN: generalized HMM; models human sequence
TWINSCAN: generalized HMM; models human/mouse alignments
N-SCAN: phylo-HMM; models multiple sequence evolution
NSCAN human/mouse > Human/multiple informants
24
CONTRAST
2-level architecture
No phylo-HMM that models alignments
[Figure: the 12-species alignment (Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken) shown with an SVM; diagram labels: X, Y, a, b]
25
CONTRAST
26
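CONTRAST - Features
$\log P(y \mid x) \sim w^T F(x, y)$
$F(x, y) = \sum_i f(y_{i-1}, y_i, i, x)$
Examples of $f(y_{i-1}, y_i, i, x)$:
  $1\{y_{i-1} = \mathrm{INTRON},\ y_i = \mathrm{EXON\_FRAME\_1}\}$
  $1\{y_{i-1} = \mathrm{EXON\_FRAME\_1},\ x_{\mathrm{human},i-2}, \ldots, x_{\mathrm{human},i+3} = \mathrm{ACCGGT}\}$
  $1\{y_{i-1} = \mathrm{EXON\_FRAME\_1},\ x_{\mathrm{human},i-1}, \ldots, x_{\mathrm{dog},i+1} = \mathrm{ACC}, \mathrm{AGC}\}$
  $(1-c)\,1\{a < \mathrm{SVM\_DONOR}(i) < b\}$
  (optional) $1\{\mathrm{EXON\_FRAME\_1},\ \mathrm{EST\_EVIDENCE}\}$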
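The scoring rule on this slide, a weight vector applied to a sum of local indicator features, can be sketched as follows. This is a toy illustration under stated assumptions and not CONTRAST's implementation: only two of the slide's feature types are mimicked (a label-transition indicator and a human hexamer indicator), the feature-name strings and the weights are invented, and the result is the unnormalized score $w^T F(x, y)$ rather than a probability.

```python
from collections import Counter


def local_features(prev_label, label, i, x):
    """Indicator features f(y_{i-1}, y_i, i, x) active at position i."""
    feats = Counter()
    # Transition indicator, e.g. 1{y_{i-1} = INTRON, y_i = EXON_FRAME_1}.
    feats[f"trans:{prev_label}->{label}"] += 1
    # Hexamer indicator on the human row: bases i-2 .. i+3 with y_{i-1}.
    human = x["human"]
    if i - 2 >= 0 and i + 4 <= len(human):
        feats[f"hexamer:{prev_label}:{human[i - 2:i + 4]}"] += 1
    return feats


def global_features(y, x):
    """F(x, y): sum of the local feature vectors over all positions."""
    total = Counter()
    for i in range(1, len(y)):
        total.update(local_features(y[i - 1], y[i], i, x))
    return total


def score(w, y, x):
    """Unnormalized log-probability w^T F(x, y)."""
    return sum(w.get(name, 0.0) * value
               for name, value in global_features(y, x).items())


if __name__ == "__main__":
    x = {"human": "AACCGGTA"}
    y = ["INTRON", "INTRON"] + ["EXON_FRAME_1"] * 6
    # Hypothetical weights; in practice these would be learned.
    w = {"trans:INTRON->EXON_FRAME_1": 2.0,
         "hexamer:EXON_FRAME_1:ACCGGT": 1.5}
    print(score(w, y, x))
```

Because every feature looks only at adjacent labels and the observed sequences, the same dynamic programming used for HMM decoding applies, while the features themselves are free to overlap and to read from any informant row.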
27
CONTRAST – SVM accuracies
Accuracy increases as we add informants
Diminishing returns after ~5 informants
[Plot labels: SN, SP]
28
CONTRAST - Decoding
Viterbi Decoding: maximize $P(y \mid x)$
Maximum Expected Boundary Accuracy Decoding: maximize
  $\sum_{i,B} 1\{(y_{i-1}, y_i) \text{ is exon boundary } B\}\ \mathrm{Accuracy}(y_{i-1}, y_i, B \mid x)$
where
  $\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = P((y_{i-1}, y_i) \text{ is } B \mid x) - (1 - P((y_{i-1}, y_i) \text{ is } B \mid x))$
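The boundary accuracy term simplifies in a way that makes the decoding objective easier to read. As a short worked step, writing $p$ as shorthand (not used on the slide) for the posterior probability that the pair $(y_{i-1}, y_i)$ is boundary $B$:

```latex
\begin{align*}
p &= P\big((y_{i-1}, y_i) \text{ is } B \mid x\big) \\
\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) &= p - (1 - p) = 2p - 1 \\
2p - 1 > 0 &\iff p > \tfrac{1}{2}
\end{align*}
```

Under this reading, predicting a boundary adds to the objective only when its posterior probability exceeds one half, so the decoder rewards confident boundaries and penalizes doubtful ones. Viterbi decoding, by contrast, scores only the single most probable full labeling.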
29
CONTRAST - Training
Maximum Conditional Likelihood Training: maximize $L(w) = P_w(y \mid x)$
Maximum Expected Boundary Accuracy Training:
  $\mathrm{ExpectedBoundaryAccuracy}(w) = \sum_i \mathrm{Accuracy}_i$
  $\mathrm{Accuracy}_i = \sum_B 1\{(y_{i-1}, y_i) \text{ is exon boundary } B\}\ P_w((y_{i-1}, y_i) \text{ is } B \mid x) - \sum_{B' \neq B} P((y_{i-1}, y_i) \text{ is exon boundary } B' \mid x)$
30
Performance Comparison
Panels: De Novo; EST-assisted
Species: Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken
31
Performance Comparison