CS262 Lecture 9, Win07, Batzoglou Gene Recognition.

1 CS262 Lecture 9, Win07, Batzoglou Gene Recognition

2 CS262 Lecture 9, Win07, Batzoglou Gene structure
[Figure: gene with exon1, intron1, exon2, intron2, exon3; arrows labeled transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid

3 CS262 Lecture 9, Win07, Batzoglou Needles in a Haystack

4 CS262 Lecture 9, Win07, Batzoglou Classes of Gene predictors
- Ab initio: only look at the genomic DNA of the target genome
- De novo: target genome + aligned informant genome(s)
- EST/cDNA-based & combined approaches: use aligned ESTs or cDNAs + any other kind of evidence

Gene Finding (EXON in upper case, flanking sequence in lower case)
Human      tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque    tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse      ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat        ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit     t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog        t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow        t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo  gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant   gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec     tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum    ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken    ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

5 CS262 Lecture 9, Win07, Batzoglou Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)

6 CS262 Lecture 9, Win07, Batzoglou [Figure: exon reading frames; labels "Next Exon: Frame 0" and "Next Exon: Frame 1"]

7 CS262 Lecture 9, Win07, Batzoglou Exon and Intron Lengths

8 CS262 Lecture 9, Win07, Batzoglou Nucleotide Composition
Base composition in exons is characteristic due to the genetic code

Amino Acid     SLC  DNA Codons
Isoleucine     I    ATT, ATC, ATA
Leucine        L    CTT, CTC, CTA, CTG, TTA, TTG
Valine         V    GTT, GTC, GTA, GTG
Phenylalanine  F    TTT, TTC
Methionine     M    ATG
Cysteine       C    TGT, TGC
Alanine        A    GCT, GCC, GCA, GCG
Glycine        G    GGT, GGC, GGA, GGG
Proline        P    CCT, CCC, CCA, CCG
Threonine      T    ACT, ACC, ACA, ACG
Serine         S    TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine       Y    TAT, TAC
Tryptophan     W    TGG
Glutamine      Q    CAA, CAG
Asparagine     N    AAT, AAC
Histidine      H    CAT, CAC
Glutamic acid  E    GAA, GAG
Aspartic acid  D    GAT, GAC
Lysine         K    AAA, AAG
Arginine       R    CGT, CGC, CGA, CGG, AGA, AGG
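One way to see the codon-composition signal is to count triplets in a fixed reading frame. The sketch below is only an illustration: the function name `codon_counts` and the toy sequence are assumptions, not part of the lecture.

```python
from collections import Counter

def codon_counts(seq, frame=0):
    """Count non-overlapping nucleotide triplets in seq, starting at offset frame."""
    seq = seq.upper()
    # Stop at len(seq) - 2 so that only complete triplets are counted.
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return Counter(codons)

# Toy sequence (hypothetical): ATG GCC GCC TGA
counts = codon_counts("ATGGCCGCCTGA")
```

Comparing such counts between known exons and non-coding sequence would give frame-specific statistics of the kind a gene finder can use.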

9 CS262 Lecture 9, Win07, Batzoglou [Figure: gene schematic annotated with boundary motifs: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc, ggtgag]

10 CS262 Lecture 9, Win07, Batzoglou Splice Sites (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

11 CS262 Lecture 9, Win07, Batzoglou HMMs for Gene Recognition
Sequence: GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: sequence regions labeled exon, intron, intergene; HMM states labeled Intergene State, First Exon State, Intron State]
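The idea of labeling each base with a hidden state can be sketched with a small Viterbi decoder. Everything numeric here is an assumption made for illustration: the model has only two states (intergene, exon) rather than the states drawn on the slide, and the start, transition, and emission probabilities are invented.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for a non-empty obs (log space)."""
    # best[t][s]: log-probability of the best path ending in state s at position t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for ch in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][ch]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy parameters: AT-rich intergene, GC-rich exon.
log = math.log
states = ["intergene", "exon"]
log_start = {s: log(0.5) for s in states}
log_trans = {
    "intergene": {"intergene": log(0.9), "exon": log(0.1)},
    "exon": {"intergene": log(0.1), "exon": log(0.9)},
}
log_emit = {
    "intergene": {"A": log(0.4), "T": log(0.4), "C": log(0.1), "G": log(0.1)},
    "exon": {"A": log(0.1), "T": log(0.1), "C": log(0.4), "G": log(0.4)},
}
path = viterbi("ATATGCGCGC", states, log_start, log_trans, log_emit)
```

With these toy numbers the decoder labels the AT-rich prefix as intergene and the GC-rich suffix as exon; a real gene finder would add intron and frame-specific exon states.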

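13 CS262 Lecture 9, Win07, Batzoglou Duration HMMs for Gene Recognition
[Figure: sequence TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC with segments Exon1, Exon2, Exon3; labels "Duration d" and "j+2"]
Intron: $\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
Exon: $P_{\mathrm{EXON\_DUR}}(d) \prod_i P_{\mathrm{EXON}((i - j + 2) \% 3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
5' splice site: $P_{5'SS}(x_{i-3} \ldots x_{i+4})$
Stop codon: $P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$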
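A duration HMM scores a whole exon segment at once: one duration term plus frame-specific emission terms. The sketch below illustrates that structure under strong simplifications that are assumptions, not the lecture's model: emissions are order 0 (w = 0), `j` is a caller-supplied offset used exactly as in the slide's frame formula, and the tables are uniform placeholders.

```python
import math

def exon_log_score(x, start, d, j, log_dur, log_emit):
    """log P_EXON_DUR(d) plus frame-specific log emissions over d exon positions."""
    score = log_dur(d)
    for i in range(start, start + d):
        frame = (i - j + 2) % 3  # frame index as written on the slide
        score += log_emit[frame][x[i]]
    return score

# Hypothetical placeholders: the same uniform table in every frame, flat duration score.
uniform = {base: math.log(0.25) for base in "ACGT"}
log_emit = [uniform, uniform, uniform]
log_dur = lambda d: math.log(0.5)
toy = exon_log_score("ACGTAC", start=0, d=6, j=0, log_dur=log_dur, log_emit=log_emit)
```

In a trained model the three emission tables would differ by codon position and `log_dur` would come from an empirical exon length distribution.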

14 CS262 Lecture 9, Win07, Batzoglou Genscan (Burge, 1997)
- First competitive HMM-based gene finder, huge accuracy jump
- Only gene finder at the time to predict partial genes and genes on both strands
Features
– Duration HMM
– Four different parameter sets: very low, low, medium, and high GC content

15 CS262 Lecture 9, Win07, Batzoglou Using Comparative Information

16 CS262 Lecture 9, Win07, Batzoglou Using Comparative Information Hox cluster is an example where everything is conserved

17 CS262 Lecture 9, Win07, Batzoglou Patterns of Conservation

             Genes   Intergenic  Separation
Mutations    30%     58%         2-fold
Gaps         1.3%    14%         10-fold
Frameshifts  0.14%   10.2%       75-fold

18 CS262 Lecture 9, Win07, Batzoglou Comparison-based Gene Finders
Rosetta, 2000; CEM, 2000
– First methods to apply comparative genomics (human-mouse) to improve gene prediction
Twinscan, 2001
– First HMM for comparative gene prediction in two genomes
SLAM, 2002
– Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
NSCAN, 2006
– Best method to date, based on a phylo-HMM for multiple-genome gene prediction

19 CS262 Lecture 9, Win07, Batzoglou Twinscan
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
   New "alphabet": 4 x 3 = 12 letters
   Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions e_k(b), where b ∈ { A-, A:, A|, …, U| }
   Emission distributions e_k(b) are estimated from real human/mouse genes
   e_I(x|) < e_E(x|): matches favored in exons
   e_I(x-) > e_E(x-): gaps (and mismatches) favored in introns

Example
Human     : ACGGCGACGUGCACGU
Mouse     : ACUGUGACGUGCACUU
Alignment : ||:|:|||||||||:|
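Step 2 above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes the two rows are already aligned, that every column has a target base, and that a gap in the informant row is written as "-".

```python
def conservation_sequence(target, informant):
    """Label each aligned target base as match '|', mismatch ':' or gap '-'."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for t, q in zip(target, informant):
        if q == "-":
            marks.append("-")
        elif t == q:
            marks.append("|")
        else:
            marks.append(":")
    return "".join(marks)

def twinscan_symbols(target, informant):
    """Pair each target base with its mark, giving symbols from the 12-letter alphabet."""
    marks = conservation_sequence(target, informant)
    return [t + m for t, m in zip(target, marks)]

human = "ACGGCGACGUGCACGU"
mouse = "ACUGUGACGUGCACUU"
symbols = twinscan_symbols(human, mouse)
```

On the slide's example, `conservation_sequence(human, mouse)` reproduces the alignment row `||:|:|||||||||:|`, and `symbols` is the sequence that the Viterbi step would consume.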

20 CS262 Lecture 9, Win07, Batzoglou SLAM – Generalized Pair HMM
[Figure: Exon GPHMM state with exon lengths d and e]
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.

21 CS262 Lecture 9, Win07, Batzoglou NSCAN: Multiple Species Gene Prediction
Models shown: GENSCAN, TWINSCAN, N-SCAN

Target plus conservation sequence:
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation  |||:||:||:|||||:||||||||......
sequence

Target plus multiple informants:
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1    GGTCAGC___CCAAGAACGTGTAG......
Informant2    GATCAGC___CCAAGAACGTGTAG......
Informant3    GGTGAGCTGACCAAGATCGTGTTGACACAA
...

Labels on the slide: Target sequence; Informant sequences (vector); Joint prediction (use phylo-HMM)

22 CS262 Lecture 9, Win07, Batzoglou NSCAN: Multiple Species Gene Prediction
[Figure: graphical model with nodes labeled X, C, Y, Z, H, M, R]

23 CS262 Lecture 9, Win07, Batzoglou Performance Comparison
GENSCAN: Generalized HMM; models human sequence
TWINSCAN: Generalized HMM; models human/mouse alignments
N-SCAN: Phylo-HMM; models multiple sequence evolution
NSCAN human/mouse > Human/multiple informants

24 CS262 Lecture 9, Win07, Batzoglou CONTRAST
- 2-level architecture
- No Phylo-HMM that models alignments
[Figure: the 12-species alignment from slide 4 (Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken) with labels SVM, X, Y, a, b]

25 CS262 Lecture 9, Win07, Batzoglou CONTRAST

26 CS262 Lecture 9, Win07, Batzoglou CONTRAST - Features
$\log P(y \mid x) \sim w^T F(x, y)$
$F(x, y) = \sum_i f(y_{i-1}, y_i, i, x)$
f(y_{i-1}, y_i, i, x):
- 1{y_{i-1} = INTRON, y_i = EXON_FRAME_1}
- 1{y_{i-1} = EXON_FRAME_1, x_{human,i-2}, …, x_{human,i+3} = ACCGGT}
- 1{y_{i-1} = EXON_FRAME_1, x_{human,i-1}, …, x_{dog,i+1} = ACC, AGC}
- (1-c) 1{a < SVM_DONOR(i) < b}
- (optional) 1{EXON_FRAME_1, EST_EVIDENCE}
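The score w^T F(x, y) is a weighted sum of indicator features over adjacent label pairs. The sketch below shows only that arithmetic; the feature names, weights, labels, and toy sequence are hypothetical, and just two single-species features in the spirit of the first two bullets are included.

```python
def score(labels, x, weights, features):
    """Compute w^T F(x, y), where F(x, y) = sum_i f(y[i-1], y[i], i, x)."""
    total = 0.0
    for i in range(1, len(labels)):
        for name, f in features.items():
            total += weights[name] * f(labels[i - 1], labels[i], i, x)
    return total

# Hypothetical indicator features, each returning 0.0 or 1.0.
features = {
    "intron_to_exon": lambda prev, cur, i, x: float(
        prev == "INTRON" and cur == "EXON_FRAME_1"),
    "exon_hexamer": lambda prev, cur, i, x: float(
        prev == "EXON_FRAME_1" and i >= 2 and x[i - 2:i + 4] == "ACCGGT"),
}
weights = {"intron_to_exon": 2.0, "exon_hexamer": 0.5}  # invented values

x = "ACCGGTA"
labels = ["INTRON", "EXON_FRAME_1", "EXON_FRAME_1"] + ["INTRON"] * 4
total = score(labels, x, weights, features)
```

Here the transition feature fires once (at i = 1) and the hexamer feature fires once (at i = 2), so `total` is 2.0 + 0.5.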

27 CS262 Lecture 9, Win07, Batzoglou CONTRAST – SVM accuracies
- Accuracy increases as we add informants
- Diminishing returns after ~5 informants
[Figure: plots labeled SN and SP]

28 CS262 Lecture 9, Win07, Batzoglou CONTRAST - Decoding
Viterbi Decoding: maximize $P(y \mid x)$
Maximum Expected Boundary Accuracy Decoding: maximize $\sum_{i,B} 1\{y_{i-1}, y_i \text{ is exon boundary } B\} \, \mathrm{Accuracy}(y_{i-1}, y_i, B \mid x)$
$\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = P(y_{i-1}, y_i \text{ is } B \mid x) - (1 - P(y_{i-1}, y_i \text{ is } B \mid x))$
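The accuracy term simplifies to 2P - 1, so asserting a boundary raises the objective only when its posterior exceeds 0.5. A minimal sketch of that arithmetic follows; the function names and the posterior values are hypothetical, and computing the posteriors themselves is not shown.

```python
def boundary_accuracy(p):
    """Accuracy term P - (1 - P) for a boundary with posterior probability p."""
    return p - (1.0 - p)

def expected_boundary_accuracy(boundary_posteriors):
    """Sum the accuracy terms over the boundaries asserted by a candidate parse."""
    return sum(boundary_accuracy(p) for p in boundary_posteriors)

# Invented posteriors for two boundaries in a candidate parse.
gain = expected_boundary_accuracy([0.9, 0.2])
```

In this toy case the confident boundary contributes about +0.8 and the doubtful one about -0.6, so a parse that drops the doubtful boundary would score higher.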

29 CS262 Lecture 9, Win07, Batzoglou CONTRAST - Training
Maximum Conditional Likelihood Training: maximize $L(w) = P_w(y \mid x)$
Maximum Expected Boundary Accuracy Training:
$\mathrm{ExpectedBoundaryAccuracy}(w) = \sum_i \mathrm{Accuracy}_i$
$\mathrm{Accuracy}_i = \sum_B 1\{y_{i-1}, y_i \text{ is exon boundary } B\} \, P_w(y_{i-1}, y_i \text{ is } B \mid x) - \sum_{B' \neq B} P_w(y_{i-1}, y_i \text{ is exon boundary } B' \mid x)$

30 CS262 Lecture 9, Win07, Batzoglou Performance Comparison
[Figure: results for De Novo and EST-assisted prediction; species listed: Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken]

31 CS262 Lecture 9, Win07, Batzoglou Performance Comparison

