1 A knowledge-based approach to integrated genome annotation
Michael Brent, Washington University

2 EST-, mRNA-, and protein-based methods

3 Outline of our process
ENCODE Workshop, May 6, 2005
– MGC validated clones + RefSeq NMs
– Remove all with frame shifts
– Fill with spliced Hs mRNA & EST
– Threaded de novo predictions
– Paragon aligner
– BLAT
– N-SCAN + EST

4 Paragon aligner
Manimozhiyan Arumugam with Chaochun Wei

5 Better EST/cDNA-to-genome alignment
Idea
– Go beyond minimizing mismatches and gaps
– Accurate probabilities in correct alignments
– Estimate parameters for each sequence set

6 Better EST/cDNA alignment
Two sources of mismatches & gaps
– Error (sequencing, reverse transcription)
  – Quality values give local probabilities. Not used here.
– Polymorphism (RNA vs. genome strains)
Gap vs. indel rates are different
Parameters must vary with sequence quality & source strains/polymorphism rates
– E.g. prefer non-matches in low-quality bases

7 Better EST/cDNA alignment
Introns
– Accurate probabilities in correct alignments
  – GT/AG vs. GC/AG vs. AT/AC
– Absolutely no junk splice sites
  – Not clear what to do with polymorphic sites
– Long introns are rarer than short introns
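The intron constraints on this slide can be illustrated with a minimal sketch. It is not the Paragon implementation: the splice-pair probabilities, the geometric length model, the mean intron length, and the function name `intron_log_prob` are all assumptions chosen for illustration. It only shows how a probabilistic aligner could rank the three splice-site pairs, forbid any other pair outright, and make long introns less probable than short ones.

```python
import math

# Hypothetical splice-site pair probabilities (illustrative values only,
# not taken from the talk). Any pair not listed here is disallowed.
SPLICE_PAIR_PROB = {
    ("GT", "AG"): 0.98,
    ("GC", "AG"): 0.015,
    ("AT", "AC"): 0.005,
}

# Assumed mean of a geometric intron-length distribution.
MEAN_INTRON_LENGTH = 5000.0


def intron_log_prob(intron_seq: str) -> float:
    """Log-probability of an intron under a toy model.

    Returns -inf when the donor/acceptor dinucleotides are not one of the
    listed pairs, so an alignment can never use a "junk" splice site.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        return -math.inf
    pair = (seq[:2], seq[-2:])
    pair_prob = SPLICE_PAIR_PROB.get(pair)
    if pair_prob is None:
        return -math.inf
    # Geometric length model: P(L) = p * (1 - p)^(L - 1). It decreases
    # monotonically in L, so longer introns always score lower.
    p = 1.0 / MEAN_INTRON_LENGTH
    length_log_prob = math.log(p) + (len(seq) - 1) * math.log1p(-p)
    return math.log(pair_prob) + length_log_prob


short_intron = "GT" + "A" * 100 + "AG"
long_intron = "GT" + "A" * 1000 + "AG"
print(intron_log_prob(short_intron) > intron_log_prob(long_intron))
```

A real aligner would estimate these parameters from each sequence set, as the earlier slides suggest, rather than hard-coding them.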

8 Small exon in finished cDNA

STANDARD TOOL (EST_GENOME)

GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

GENOME     401 CCGGGACTACCTCATGAGGTGACG-Agcgcc.......tgtagCACTTCT 16339
               ||||||||||||||||| || ||| |>>>>> 15907 >>>>>  |||||
BC000810   101 CCGGGACTACCTCATGA-GT-ACGCA.................--CTTCT 129

GENOME   16340 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 16389
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810   130 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 179

OUR PAIR HMM

GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

GENOME     401 CCGGGACTACCTCATGAGGTGAC.......AATAGTACGGTAAG...... 13006
               ||||||||||||||||||>>>>> 12584 >>>>>||||>>>>> 3326 
BC000810   101 CCGGGACTACCTCATGAG.................TACG........... 122

GENOME   13007 TGTAGCACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 13046
               >>>>>|||||||||||||||||||||||||||||||||||||||||||||
BC000810   123 .....CACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 167

10 Blind test
Test set
– 100 alignment pairs of MGC clones to genome
– Paragon & EST_genome differ on all of them
– Output format identical
Evaluation
– Curator attempting to explain discrepancies
Result
– 37 cases where biological evidence favors one alignment
– In 31 of 37, the Paragon alignment is supported
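One way to read the 31-of-37 result is as a sign test. The sketch below is an added illustration, not an analysis from the talk. It assumes the 37 decided cases are independent and that, if the two aligners were equally good, each case would favor Paragon with probability 0.5. The helper name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# One-sided probability of 31 or more Paragon-favoring cases out of 37
# if both tools were equally likely to be supported (assumed null model).
p_one_sided = binomial_tail(31, 37)
print(f"one-sided p-value: {p_one_sided:.2e}")
```

The 63 cases where the curator found no decisive evidence are left out of this calculation, which is the usual treatment of ties in a sign test.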

11 Future directions
UTR vs. ORF
– Polymorphism is more common in UTR
– And 3rd position in ORF
Conservation
– Use alignments to distinguish true from false
  – Splice sites, introns
  – Codons
  – Polymorphisms (analogous to quality values)

12 Conceptual shift
Traditional view
– cDNA data “speaks for itself”. Theory neutral.
– Alignment = counting matches, mismatches, gaps
– cDNA = genome annotation

13 Conceptual shift
Our view
– More knowledge = better alignments & annotations
– cDNA is very useful evidence re: gene structure
  – Need to align it correctly
  – Need to determine its completeness
  – If not complete, predict the remainder
– Gene prediction & cDNA alignment are the same problem
  – cDNA/EST just adds another information source

14 N-SCAN_EST
Chaochun Wei

15 TWINSCAN/N-SCAN_EST
Goal: Integrate EST information with TWINSCAN to
– improve accuracy where EST evidence exists
– without losing the ability to predict novel genes.

16 Twinscan_est

17 Generating EST-alignment sequence

18 Modeling EST alignment sequence
Probability models
– In each HMM state
  – Separate models for EST alignment sequence
  – Probabilities of DNA, conservation sequence, and EST sequence are multiplied.
– Very similar to models of genomic alignments
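The multiplication of the three per-state probabilities can be sketched as a sum of log-probabilities. Everything concrete here is an assumption for illustration: the two states, the emission values, the symbol sets for the conservation and EST alignment sequences, and the function name `log_emission`. The sketch also uses position-independent (order-0) emissions, whereas a real gene predictor would likely condition on neighboring positions.

```python
import math

# Hypothetical emission tables for two HMM states (illustrative numbers only).
# Assumed symbol sets: conservation uses "|" (match), ":" (mismatch) and
# "." (unaligned); EST alignment uses "E" (exon), "I" (intron), "N" (no EST).
STATES = {
    "exon": {
        "dna": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "conservation": {"|": 0.80, ":": 0.15, ".": 0.05},
        "est": {"E": 0.60, "I": 0.05, "N": 0.35},
    },
    "intron": {
        "dna": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
        "conservation": {"|": 0.50, ":": 0.30, ".": 0.20},
        "est": {"E": 0.05, "I": 0.60, "N": 0.35},
    },
}


def log_emission(state_name: str, base: str, cons_symbol: str, est_symbol: str) -> float:
    """Log of the product of the DNA, conservation, and EST emission probabilities."""
    state = STATES[state_name]
    return (
        math.log(state["dna"][base])
        + math.log(state["conservation"][cons_symbol])
        + math.log(state["est"][est_symbol])
    )


# A conserved base covered by an EST exon block is far more probable
# under the toy exon state than under the toy intron state.
print(log_emission("exon", "A", "|", "E") > log_emission("intron", "A", "|", "E"))
```

Because positions with no EST coverage still receive a probability (the "N" symbol here), a model of this shape can keep predicting genes where there is no EST evidence.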

19 Multi-genome methods: N-SCAN
Samuel Gross with Randall Brown

20 N-SCAN: Using multi-genome alignments
Motivation
– Many genomes should give stronger signal of negative selection than two
– Lots of genomes are being sequenced
Methods
1. Extend Twinscan to a phylogenetic tree model
2. At each site, mutation rate & pattern of tolerated substitutions depend on function

21 Example
A multiple alignment that (A) is and (B) is not typical of the splice boundary shown

22 Using mutation patterns for improving gene prediction
Tree hidden Markov model
Each state
– generates columns of a multiple alignment
– by a substitution process
– along the branches of a phylogenetic tree

23 Challenges
Columns are not always correct or orthologous
1. Sequencing error
2. Alignment error
3. Change of function (I am not a mouse!)

24 Differences from EXONIPHY
Approach
– Estimate models of actual alignments, not evolutionary processes
Model
1. Independent substitution probabilities on each branch of the tree
2. 6 characters: A, C, G, T, gap, unaligned
3. Condition backwards from target genome

25 Using mutation patterns for improving gene prediction
Traditional factorization
Pr(a_2) Pr(a_1 | a_2) Pr(h | a_1) Pr(m | a_1) Pr(c | a_2)
N-SCAN factorization
Pr(h) Pr(a_1 | h) Pr(a_2 | a_1) Pr(m | a_1) Pr(c | a_2)
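The N-SCAN factorization can be evaluated for a single alignment column by summing over the unobserved ancestral characters a1 and a2. The sketch below is a toy under stated assumptions, not the N-SCAN code. It assumes a tree in which h and m descend from a1, and a1 and c descend from a2. The conditional tables are made-up "stay with probability 0.9" matrices over the six-character alphabet, and `toy_conditional` and `column_prob` are hypothetical names.

```python
from itertools import product

# Six-character alphabet: four bases, gap ("-"), unaligned (".").
ALPHABET = ["A", "C", "G", "T", "-", "."]


def toy_conditional(stay: float = 0.9) -> dict:
    """Made-up table P(child | parent): keep the character with prob `stay`."""
    off = (1.0 - stay) / (len(ALPHABET) - 1)
    return {
        parent: {child: (stay if parent == child else off) for child in ALPHABET}
        for parent in ALPHABET
    }


def column_prob(h, m, c, p_h, p_a1_h, p_a2_a1, p_m_a1, p_c_a2) -> float:
    """Pr(h) * sum over a1, a2 of Pr(a1|h) Pr(a2|a1) Pr(m|a1) Pr(c|a2).

    The chain starts at the target h and conditions "backwards" up the tree
    to a1 and a2, then down to the other leaves m and c.
    """
    total = 0.0
    for a1, a2 in product(ALPHABET, repeat=2):
        total += p_a1_h[h][a1] * p_a2_a1[a1][a2] * p_m_a1[a1][m] * p_c_a2[a2][c]
    return p_h[h] * total


tables = toy_conditional()
target_prior = {base: 0.25 for base in "ACGT"}  # assumed uniform Pr(h)

conserved = column_prob("A", "A", "A", target_prior, tables, tables, tables, tables)
diverged = column_prob("A", "C", "G", target_prior, tables, tables, tables, tables)
print(conserved > diverged)
```

In a gene predictor, each HMM state would carry its own set of tables, so that the same column is scored differently under, say, a coding state and an intron state.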

26 Preliminary study in human

27 Preliminary study in human

28 Fin

