Comparative gene hunting Irmtraud Meyer The Sanger Institute now University of Oxford

Presentation transcript:

Comparative gene hunting Irmtraud Meyer The Sanger Institute now University of Oxford

Making sense of the genome: What are the proteins and where are they encoded?
[Diagram labels: Experiments in Lab; Sequence Database (proteins, ESTs); protein; DNA: cctgctgggtgcgagagccggcgtaccggtgaggcc]

Aim in ab initio gene prediction:
[Diagram labels: protein; Sequence Database (proteins); Experiments in Lab; DNA: cctgctgggtgcgagagccggcgtaccggtgaggcc]

3059 million bases: GCTGCCAACGC… We will very soon have: 3286 million bases: ACTGCGGGCGC…

Rough comparative map: reference:

Typical Situation:... gagccgcctcctccccttccccacgctctaggagggggccgcgggggcctggct gcgtcggccaatcggagtgcacttccgcagctgacaaattcagtataaaagcttggggct ggggccgagcactggggactttgagggtggccaggccagcgtaggaggccagcgtaggat cctgctgggagcggggaactgagggaagcgacgccgagaaagcaggcgtaccacggaggg agagaaaagctccggaagcccagcagcgcctttacgcacagctgccaactggccgctgcc gaccgtctccagctcccgaggacgcgcgaccggacaccgggtcctgccacagccgaggac agctcgccgctcgccgcagcgagcccggggcggcccttcagggggacctttcccagatcg Cccaggccgcccggatgtgcacgaaaatggaacag ggcgacgggggctcgggaagcctgacagggcttttgcgcacagctgccggctgg tgctacccgcccgcgccagcccccgagaacgcgcgaccaggcacccagtccggtcaccgc agcggagagctcgccgctcgctgcagcgaggcccggagcggccccgcagggaccctcccc agaccgcctgggccgcccggatgtgcactaaaatggaacagcccttctaccacgacgact catacacagctacgggatacggccgggcccctggtggcctctctctacacgactacaaac tcctgaaaccgagcctggcggtcaacctggccgacccctaccggagtctcaaagcgcctg Gggctcgcggacccggcccagagggcggcggtggcggcagctacttttc... ? Human DNA Mouse DNA

Similar problem: parallel texts in demotic, Greek and hieroglyphs

Aim in comparative ab initio gene prediction: annotate simultaneously
Input:
DNA x: cctgctgggtgcgagagccggcgtaccggtgaggcc
DNA y: cctgctgggagcgaaagcaggcgtaccacggaggg
Output: ?

Why is this a good idea?
IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY
KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ
Advantages:
can detect new genes, as there is no need to search in databases for proteins
fewer assumptions needed than in one-strand ab initio gene-prediction methods, i.e. can detect unusual genes

3059 million bases Mouse – human comparison: 3286 million bases about (?) genes

Analysing mouse and human DNA:
Training: adjust the parameters of Doublescan with a set of known pairs of orthologous mouse and human genes
Testing: test set of 80 pairs of known mouse and human genes
55 %: same number of exons, different coding length
42 %: same number of exons, same coding length
3 %: different number of exons, different coding length

Results - Performance (mouse – human):
[Figure: prediction compared with annotation; categories: correct, overlapping, missing, wrong]
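The four categories on the results slide can be illustrated with a small sketch. The slide does not define them, so the definitions below are assumptions in the spirit of common exon-level evaluation: a predicted exon is "correct" if it matches an annotated exon exactly, "overlapping" if it overlaps one without matching, and "wrong" if it overlaps none; an annotated exon is "missing" if no prediction overlaps it. The function names and the coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_exons(annotated, predicted):
    """Sort exons into the four categories named on the slide (assumed definitions)."""
    annotated_set = set(annotated)
    result = {"correct": [], "overlapping": [], "wrong": [], "missing": []}
    for p in predicted:
        if p in annotated_set:
            result["correct"].append(p)        # exact match of both boundaries
        elif any(overlaps(p, a) for a in annotated):
            result["overlapping"].append(p)    # right region, wrong boundary
        else:
            result["wrong"].append(p)          # no annotated exon here
    for a in annotated:
        if not any(overlaps(a, p) for p in predicted):
            result["missing"].append(a)        # annotated exon with no prediction
    return result


# Made-up coordinates, for illustration only.
annotated = [(10, 50), (100, 150), (200, 260)]
predicted = [(10, 50), (95, 150), (300, 330)]
summary = classify_exons(annotated, predicted)
```

With these made-up intervals each category receives exactly one exon, which makes the distinction between "overlapping" and "wrong" easy to see.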

C. elegans – C. briggsae C. elegans sequenced in million bases 5 autosomes, one X about genes C. briggsae around 100 million bases 5 autosomes, one X

Results - Performance (C. elegans – C. briggsae):
[Figure: prediction compared with annotation; categories: correct, overlapping, missing, wrong]

Summary:
Doublescan:
predicts the gene structures of both sequences at the same time as aligning the sequences
capable of predicting partial, complete and multiple genes or no genes at all, as well as more diverged pairs of genes which are related by events of exon-fusion or exon-splitting
can be used to analyse long sequences using the Stepping Stone algorithm (same performance as Hirschberg algorithm)
general concept: can be trained to analyse other pairs of related genomes
performance on mouse – human DNA and C. elegans – C. briggsae DNA very promising

To do list:
large scale mouse – human comparison
large scale C. elegans – C. briggsae comparison
search for regulatory regions

References: I. M. Meyer and R. Durbin, Bioinformatics, 2002, 18(10), pp

Acknowledgements: Richard Durbin; Sequencing centres; Trinity College, Cambridge; Wellcome Trust; The Sanger Centre

The method: What are pair hidden Markov models ? How can they be used to find genes ?

Pair HMMs:
idea: annotate the two sequences (DNA x and DNA y) by parsing them through connected states
each state reads a fixed number of letters from one or two of the sequences

Pair HMMs:
idea: annotate the two sequences (DNA x and DNA y) by parsing them through connected states
each state reads a fixed number of letters from one or two of the sequences
states shown: start state, match intergenic, match exon, match intron
match intergenic reads 1 letter from each sequence at a time

Pair HMMs:
idea: annotate the two sequences (DNA x and DNA y) by parsing them through connected states
each state reads a fixed number of letters from one or two of the sequences
states shown: match intergenic, match exon, match intron
[Diagram legend: a state; a transition]
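The parsing idea can be made concrete with a minimal sketch. It assumes a small, hypothetical state table in which each state is described only by how many letters it reads from x and from y (1:1 for match intergenic and match intron, 3:3 for match exon, as on the Doublescan slides); the gap states `emit_x` and `emit_y`, the function name `parse` and the toy sequences are illustrative, and no probabilities are involved yet.

```python
# Hypothetical state table: name -> (letters read from x, letters read from y).
STATES = {
    "match_intergenic": (1, 1),
    "match_intron": (1, 1),
    "match_exon": (3, 3),
    "emit_x": (1, 0),   # one letter of x opposite a gap in y
    "emit_y": (0, 1),   # one letter of y opposite a gap in x
}


def parse(x, y, path):
    """Walk a state path and return (state, letters of x, letters of y) triples."""
    i = j = 0
    annotation = []
    for state in path:
        dx, dy = STATES[state]
        annotation.append((state, x[i:i + dx], y[j:j + dy]))
        i += dx
        j += dy
    if (i, j) != (len(x), len(y)):
        raise ValueError("state path does not read both sequences completely")
    return annotation


# Toy input: one intergenic column, one extra letter in x, then one matched codon.
annotation = parse("AATGC", "ATGC", ["match_intergenic", "emit_x", "match_exon"])
```

A state path therefore fixes both the alignment (which letters face each other) and the annotation (which state, and hence which kind of region, each letter belongs to).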

Doublescan:
states: match intergenic x:y, emit x:-, emit -:y
Input:
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCGGGCCTCTCCGCTAAGCT
Aligned:
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCG----GGCCTCTCCGCTAAGCT
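How the three intergenic states could produce an alignment like the one above can be sketched with a standard Viterbi recursion over a toy pair HMM. This is an illustration only, not Doublescan's implementation: the function name `viterbi_align`, the probabilities `p_match`, `p_gap_open` and `p_gap_extend`, and the uniform emission model are all assumptions chosen for readability.

```python
import math


def viterbi_align(x, y, p_match=0.9, p_gap_open=0.05, p_gap_extend=0.4):
    """Most probable alignment under a toy three-state pair HMM (assumed parameters)."""
    neg_inf = float("-inf")
    n, m = len(x), len(y)
    log_same = math.log(p_match / 4)           # 4 identical letter pairs
    log_diff = math.log((1 - p_match) / 12)    # 12 mismatching letter pairs
    log_gap = math.log(0.25)                   # one letter opposite a gap
    # M = match x:y, X = emit x:-, Y = emit -:y
    trans = {
        ("M", "M"): math.log(1 - 2 * p_gap_open),
        ("M", "X"): math.log(p_gap_open),
        ("M", "Y"): math.log(p_gap_open),
        ("X", "X"): math.log(p_gap_extend),
        ("X", "M"): math.log(1 - p_gap_extend),
        ("Y", "Y"): math.log(p_gap_extend),
        ("Y", "M"): math.log(1 - p_gap_extend),
    }
    step = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}  # letters read from x and y
    score = {s: [[neg_inf] * (m + 1) for _ in range(n + 1)] for s in step}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in step}
    score["M"][0][0] = 0.0  # start in the match state before reading anything
    for i in range(n + 1):
        for j in range(m + 1):
            for s, (dx, dy) in step.items():
                pi, pj = i - dx, j - dy
                if pi < 0 or pj < 0:
                    continue
                if s == "M":
                    emit = log_same if x[pi] == y[pj] else log_diff
                else:
                    emit = log_gap
                for prev in step:
                    if (prev, s) not in trans:
                        continue
                    cand = score[prev][pi][pj] + trans[(prev, s)] + emit
                    if cand > score[s][i][j]:
                        score[s][i][j] = cand
                        back[s][i][j] = prev
    # Trace back from the best-scoring state at the end of both sequences.
    state = max(step, key=lambda s: score[s][n][m])
    i, j = n, m
    ax, ay = [], []
    while (i, j) != (0, 0):
        dx, dy = step[state]
        ax.append(x[i - 1] if dx else "-")
        ay.append(y[j - 1] if dy else "-")
        prev = back[state][i][j]
        i, j = i - dx, j - dy
        state = prev
    return "".join(reversed(ax)), "".join(reversed(ay))


ax, ay = viterbi_align("ACGTCGACATGGCCTATCCGCTGAGCT", "ACGTCGGGCCTCTCCGCTAAGCT")
```

The same recursion extends to states that read three letters at a time (codons) by adding entries to `step`, which is how the exon states on the following slides would fit into this sketch.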

Doublescan:
intergenic states: match intergenic x:y, emit x:-, emit -:y
exon states: match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3
Input:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGCGGATACAGCGAACTC
Aligned:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGC------GGATACAGCGAACTC
Annotations: same amino-acid (Alanine); insertion of two codons; similar amino-acids (Aspartic, Glutamic acid)
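The codon-level annotations on this slide can be checked with a short sketch that translates the aligned sequences codon by codon using the standard genetic code. The function name `compare_codons` and the label strings are hypothetical, and the sketch assumes that gaps always span whole codons, as they do for the emit x1x2x3:- and emit -:y1y2y3 states.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def compare_codons(aligned_x, aligned_y):
    """Label each aligned codon column; gaps must cover whole codons."""
    if len(aligned_x) != len(aligned_y) or len(aligned_x) % 3:
        raise ValueError("aligned sequences must have equal length, a multiple of 3")
    rows = []
    for k in range(0, len(aligned_x), 3):
        cx, cy = aligned_x[k:k + 3], aligned_y[k:k + 3]
        if cy == "---":
            label = "codon only in x"
        elif cx == "---":
            label = "codon only in y"
        elif cx == cy:
            label = "identical codon"
        elif CODON_TABLE[cx] == CODON_TABLE[cy]:
            label = "same amino acid"
        else:
            label = "different amino acid"
        rows.append((cx, cy, label))
    return rows


X_ALIGNED = "CAAGCATGCGACAAAGGATACAGCGACCTC"
Y_ALIGNED = "CAAGCCTGC------GGATACAGCGAACTC"
rows = compare_codons(X_ALIGNED, Y_ALIGNED)
```

For the slide's example, GCA and GCC both translate to alanine, GAC and AAA are present only in x, and GAC (aspartic acid) faces GAA (glutamic acid). Deciding that two different amino acids are "similar" would need a substitution matrix, which this sketch leaves out.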

Doublescan:
intergenic states: match intergenic x:y, emit x:-, emit -:y
exon states: start:start (start codon), match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3, stop:stop (stop codon)

Doublescan:
intergenic states: match intergenic x:y, emit x:-, emit -:y
exon states: start:start, match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3, stop:stop
intron states: GT:GT (5' splice site), match intron x:y, emit x:-, emit -:y, AG:AG (3' splice site)
Example (exon, intron, exon):
GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA
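The GT:GT and AG:AG states can only be entered where both sequences show the corresponding dinucleotide, but GT and AG also occur by chance. A minimal sketch, assuming nothing beyond the dinucleotides themselves, lists every candidate site in a sequence; the function name `splice_site_candidates` is hypothetical.

```python
def splice_site_candidates(seq):
    """Return 0-based positions of every GT (5' candidate) and AG (3' candidate)."""
    seq = seq.upper()
    donors = [k for k in range(len(seq) - 1) if seq[k:k + 2] == "GT"]
    acceptors = [k for k in range(len(seq) - 1) if seq[k:k + 2] == "AG"]
    return donors, acceptors


# First 15 letters of the upper example sequence on the slide.
donors, acceptors = splice_site_candidates("GCATGCGTACAGTTG")
```

Even this 15-letter fragment contains a second GT besides the one at position 6 that starts the intron, which is why candidate sites need to be scored rather than accepted on the dinucleotide alone.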

Doublescan:
intergenic states: match intergenic x:y, emit x:-, emit -:y
exon states: start:start, match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3, stop:stop
intron states: match intron x:y, emit x:-, emit -:y
splice site states: GT:GT (…) AG:AG; x1GT:y1GT (…) AGx2x3:AGy2y3
Example for GT:GT (…) AG:AG:
GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA
Example for x1GT:y1GT (…) AGx2x3:AGy2y3:
GCATGCAGTACAGTTG…GTCAGGAGGCGAACTCGCA
GCCTGCAGTACAGTTA…AGTACGAGGCGAACTCGCA

Doublescan:
intergenic states: match intergenic x:y, emit x:-, emit -:y
exon states: start:start, match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3, stop:stop
intron states: match intron x:y, emit x:-, emit -:y
splice site states: GT:GT (…) AG:AG; x1GT:y1GT (…) AGx2x3:AGy2y3; x1x2GT:y1y2GT (…) AGx3:AGy3
Example for x1x2GT:y1y2GT (…) AGx3:AGy3:
GCATGCAGGTACAGTTG…GTCAGGAGCGAACTCGCA
GCCTGCAGGTACAGTTA…AGTACGAGCGAACTCGCA
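The three pairs of splice site states correspond to the three positions at which an intron can interrupt the reading frame: between two codons, after the first letter of a codon, or after the second. A small sketch, assuming exons are given as plain strings in coding order, computes this position for each intron and maps it to the state names used on the slides. The function name and the exon boundaries in the example are my reading of sequence x from the slides, not data from the talk.

```python
# Number of codon letters read before the intron -> splice site state pair (x side).
STATE_PAIR = {
    0: ("GT", "AG"),
    1: ("x1GT", "AGx2x3"),
    2: ("x1x2GT", "AGx3"),
}


def intron_phases(exons):
    """For each intron, return how many letters of a codon precede it (0, 1 or 2)."""
    phases = []
    coding_length = 0
    for exon in exons[:-1]:          # one intron follows every exon except the last
        coding_length += len(exon)
        phases.append(coding_length % 3)
    return phases


# Upstream exons of sequence x on the three example slides (assumed boundaries).
between_codons = intron_phases(["GCATGC", "GAGAGCGAACTCGCA"])
after_first_letter = intron_phases(["GCATGCA", "GAGGCGAACTCGCA"])
after_second_letter = intron_phases(["GCATGCAG", "GAGCGAACTCGCA"])
```

Because the split codon is read partly by the 5' splice site state and partly by the 3' splice site state, the match exon state can keep reading whole codons on either side of the intron.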

Doublescan:
intergenic states: match intergenic x:y, emit x:-, emit -:y
exon states: start:start, match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3, stop:stop
intron states: match intron x:y, emit x:-, emit -:y
splice site states: GT:GT (…) AG:AG; x1GT:y1GT (…) AGx2x3:AGy2y3; x1x2GT:y1y2GT (…) AGx3:AGy3
[Diagram: gene structures of x and y related by exon fusion]

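Doublescan:
intergenic states: match intergenic x:y, emit x:-, emit -:y
exon states: start:start, match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3, stop:stop
matched intron states: match intron x:y, emit x:-, emit -:y, with splice site states GT:GT (…) AG:AG; x1GT:y1GT (…) AGx2x3:AGy2y3; x1x2GT:y1y2GT (…) AGx3:AGy3
intron in x only: emit x intron x:-, with splice site states GT:- (…) AG:-; x1GT:- (…) AGx2x3:-; x1x2GT:- (…) AGx3:-
intron in y only: emit y intron -:y, with splice site states -:GT (…) -:AG; -:y1GT (…) -:AGy2y3; -:y1y2GT (…) -:AGy3
Start and End are connected to all other states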

Refinements:
Score all potential splice sites => distinguish between true and false splice sites by rescaling the nominal transition probabilities to the splice site states
states involved: match exon x1x2x3:y1y2y3, GT:GT, x1GT:y1GT, x1x2GT:y1y2GT
[Figure: score plotted along sequences x and y]
Sequences shown (x, y):
cctgctgggtgcgagagccggcgtaccggtgaggcccctgctgggtg cgagagccggcgtaccggtg
cctgctggaggcggtagcgtgcttagtggtgaggcccctgttgggcg cgagagccggtaaaccgctg

Refinements to Doublescan:
Score all potential translation start sites => distinguish between true and false translation start sites by rescaling the nominal transition probabilities to the start:start state
states involved: match intergenic x:y, start:start, stop:stop
[Figure: score plotted along sequences y and x]
Sequences shown (labelled y and x on the slide):
cctgctggatgcggtagcgtgcttatgggtgaggcccctgttgggca tgagagccggtaaaccgctg
cgtgctggacgcatgagcgtgcttacgggtgatgcccctgtatggca ggagagccggtatggcgctg