Presentation is loading. Please wait.

Presentation is loading. Please wait.

Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov.

Similar presentations


Presentation on theme: "Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov."— Presentation transcript:

1 Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov

2 Lecture 13, Thursday May 15, 2003 Reading GENSCAN SLAM Twinscan Optional: Chris Burge’s Thesis

3 Lecture 13, Thursday May 15, 2003 Gene expression Protein RNA DNA transcription translation CCTGAGCCAACTATTGATGAA PEPTIDEPEPTIDE CCUGAGCCAACUAUUGAUGAA

4 Lecture 13, Thursday May 15, 2003 Gene structure exon1 exon2exon3 intron1intron2 transcription translation splicing exon = coding intron = non-coding

5 Lecture 13, Thursday May 15, 2003 Finding genes Start codon ATG 5’ 3’ Exon 1 Exon 2 Exon 3 Intron 1Intron 2 Stop codon TAG/TGA/TAA Splice sites

6

7

8 Lecture 13, Thursday May 15, 2003 Approaches to gene finding Homology –BLAST, Procrustes. Ab initio –Genscan, Genie, GeneID. Hybrids –GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

9 HMMs for single species gene finding: Generalized HMMs

10 Lecture 13, Thursday May 15, 2003 HMMs for gene finding GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC exon intron intergene

11 Lecture 13, Thursday May 15, 2003 GHMM for gene finding TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC Exon1Exon2Exon3

12 Lecture 13, Thursday May 15, 2003 Observed duration times

13 Lecture 13, Thursday May 15, 2003 Biology of Splicing (http://genes.mit.edu/chris/)

14 Lecture 13, Thursday May 15, 2003 Consensus splice sites (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html) Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996)

15 Lecture 13, Thursday May 15, 2003 Splice Site Models WMM: weight matrix model = PSSM (Staden 1984) WAM: weight array model = 1 st order Markov (Zhang & Marr 1993) MDD: maximal dependence decomposition (Burge & Karlin 1997) decision-tree like algorithm to take significant pairwise dependencies into account

16 Lecture 13, Thursday May 15, 2003 Splice site detection 5’ 3’ Donor site Position 

17 atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag

18 Lecture 13, Thursday May 15, 2003 Coding potential Amino AcidSLCDNA codons IsoleucineIATT, ATC, ATA LeucineLCTT, CTC, CTA, CTG, TTA, TTG ValineVGTT, GTC, GTA, GTG PhenylalanineFTTT, TTC MethionineMATG CysteineCTGT, TGC AlanineAGCT, GCC, GCA, GCG GlycineGGGT, GGC, GGA, GGG ProlinePCCT, CCC, CCA, CCG ThreonineTACT, ACC, ACA, ACG SerineSTCT, TCC, TCA, TCG, AGT, AGC TyrosineYTAT, TAC TryptophanWTGG GlutamineQCAA, CAG AsparagineNAAT, AAC HistidineHCAT, CAC Glutamic acidEGAA, GAG Aspartic acidDGAT, GAC LysineKAAA, AAG ArginineRCGT, CGC, CGA, CGG, AGA, AGG Stop codons StopTAA, TAG, TGA

19 Lecture 13, Thursday May 15, 2003 Comparison of 1196 orthologous genes (Makalowski et al., 1996) Sequence identity: –exons: 84.6% –protein: 85.4% –introns: 35% –5’ UTRs: 67% –3’ UTRs: 69% 27 proteins were 100% identical.

20 Lecture 13, Thursday May 15, 2003 HumanMouse Human-mouse homology

21 Lecture 13, Thursday May 15, 2003

22 Not always: HoxA human-mouse

23 Lecture 13, Thursday May 15, 2003 50. :. :. :. :. : 247 GGTGAGGTCGAGGACCCTGCA CGGAGCTGTATGGAGGGCA AGAGC |: || ||||: |||| --:|| ||| |::| |||---|||| 368 GAGTCGGGGGAGGGGGCTGCTGTTGGCTCTGGACAGCTTGCATTGAGAGG 100. :. :. :. :. : 292 TTC CTACAGAAAAGTCCCAGCAAGGAGCCACACTTCACTG |||----------|| | |::| |: ||||::|:||:-|| ||:| | 418 TTCTGGCTACGCTCTCCCTTAGGGACTGAGCAGAGGGCT CAGGTCGCGG 150. :. :. :. :. : 332 ATGTCGAGGGGAAGACATCATTCGGGATGTCAGTG ---------------||||||||||||||||||||||:|||||||||||| 467 TGGGAGATGAGGCCAATGTCGAGGGGAAGACATCATTTGGGATGTCAGTG 200. :. :. :. :. : 367 TTCAACCTCAGCAATGCCATCATGGGCAGCGGCATCCTGGGACTCGCCTA |||||:||||||||:||||||||||||||:|| ||:|||||:|||||||| 517 TTCAATCTCAGCAACGCCATCATGGGCAGTGGAATTCTGGGGCTCGCCTA Alignment

24 Lecture 13, Thursday May 15, 2003 Twinscan Twinscan is an augmented version of the Gencscan HMM. E I transitions duration emissions ACUAUACAGACAUAUAUCAU

25 Lecture 13, Thursday May 15, 2003 Twinscan Algorithm 1.Align the two sequences (eg. from human and mouse) 2.Mark each human base as gap ( - ), mismatch ( : ), match ( | ) New “alphabet”: 4 x 3 = 12 letters  = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

26 Lecture 13, Thursday May 15, 2003 Twinscan Algorithm (cont’d) 3.Run Viterbi using emissions e k (b) where b  { A-, A:, A|, …, T| } Note: Emission distributions e k (b) estimated from real genes from human/mouse e I (x|) < e E (x|): matches favored in exons e I (x-) > e E (x-): gaps (and mismatches) favored in introns

27 Lecture 13, Thursday May 15, 2003 Example Human : ACGGCGACUGUGCACGU Mouse : ACUGUGAC GUGCACUU Align : ||:|:|||-||||||:| Input to Twinscan HMM: A| C| G: G| C: G| A| C| U- G| U| G| C| A| C| G: U| Recall, e E (A|) > e I (A|) e E (A-) < e I (A-) Likely exon

28 HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

29 Lecture 13, Thursday May 15, 2003 A Pair HMM for alignments M P(x i, y j ) I P(x i ) J P(y j ) 1 - 2  1-  - 2        BEGIN END M J I

30 Lecture 13, Thursday May 15, 2003 Cross-species gene finding 5’ 3’ Exon1 Exon2 Exon3 Intron1Intron2 CNS [human] [mouse] Exon = coding Intron = non-coding CNS = conserved non-coding

31 Lecture 13, Thursday May 15, 2003 Generalized Pair HMMs

32 Lecture 13, Thursday May 15, 2003 Ingredients in exon scores Splice site detection (VLMM) Length distribution (generalized) Coding potential (codon freq. tables) GC-stratification

33 Lecture 13, Thursday May 15, 2003 Exon GPHMM d e 1.Choose exon lengths (d,e). 2.Generate alignment of length d+e.

34 Lecture 13, Thursday May 15, 2003 The SLAM hidden Markov model

35 Lecture 13, Thursday May 15, 2003 Example: HoxA2 and HoxA3 SLAM SGP-2 Twinscan Genscan TBLASTX SLAM CNS VISTA RefSeq

36 Lecture 13, Thursday May 15, 2003 no. states max duration length seq1 length seq2 Computational complexity

37 Lecture 13, Thursday May 15, 2003 Approximate alignment Reduces TU -factor to hT

38 Lecture 13, Thursday May 15, 2003 Measuring Performance Definition: –Sensitivity (SN): (# correctly predicted)/(# true) –Specificity (SP): (# correctly predicted)/(# total predicted)

39 Lecture 13, Thursday May 15, 2003 Measuring Performance


Download ppt "Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov."

Similar presentations


Ads by Google