HMM Sampling and Applications to Gene Finding and Alignment European Conference on Computational Biology 2003 Simon Cawley * and Lior Pachter + and thanks.

Presentation transcript:

HMM Sampling and Applications to Gene Finding and Alignment
European Conference on Computational Biology 2003
Simon Cawley* and Lior Pachter+, with thanks to Eli Rusman
* Affymetrix
+ UC Berkeley Mathematics Dept

Conservation of alternative splicing between human and mouse
– Modrek and Lee (Nature Genetics): 40-60% of human genes have alternative splice forms.
– Nurtdinov et al. (Human Molecular Genetics): 75% of human alternative splice forms are conserved in mouse.
Can we develop ab initio methods for detecting conserved alternative splice sites?

Sequence Alignment
[Figure: the sequences AACATTAGA and AGATTACCACA]

Finding the optimal alignment
[Figure: the sequences AACATTAGA and AGATTACCACA, labeled "max"]

Sampling to find alternative alignments
$$a_{i,j} = w\,a_{i-1,j} + w\,a_{i,j-1} + s_{i,j}\,a_{i-1,j-1}$$
– $a_{i,j}$: alignment forward variables for positions [1,i] and [1,j] in each sequence
– $s_{i,j}$: match/mismatch probabilities for positions i,j in each sequence
– $w$: gap probabilities
[Figure: the sequences AACATTAGA and AGATTACCACA]
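The forward recursion on this slide can be turned into a sampler by filling the table and then tracing back stochastically, choosing each step with probability proportional to its term in the recursion. The following is a minimal sketch under assumptions that are not in the talk: the function names, the boundary condition a[0][0] = 1, and the numeric values of `w`, `match`, and `mismatch` are all illustrative. It stores the full table, so it uses O(TU) memory rather than linear space, and it works with raw probabilities, which would underflow on long sequences.

```python
import random

def emission(x, y, i, j, match, mismatch):
    # s_{i,j}: match/mismatch probability for position i of x and j of y (1-based).
    return match if x[i - 1] == y[j - 1] else mismatch

def forward_table(x, y, w=0.1, match=0.6, mismatch=0.05):
    # a[i][j] = w*a[i-1][j] + w*a[i][j-1] + s_{i,j}*a[i-1][j-1]
    # Assumed boundary condition: a[0][0] = 1 (empty prefixes).
    T, U = len(x), len(y)
    a = [[0.0] * (U + 1) for _ in range(T + 1)]
    a[0][0] = 1.0
    for i in range(T + 1):
        for j in range(U + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:
                total += w * a[i - 1][j]
            if j > 0:
                total += w * a[i][j - 1]
            if i > 0 and j > 0:
                total += emission(x, y, i, j, match, mismatch) * a[i - 1][j - 1]
            a[i][j] = total
    return a

def sample_alignment(x, y, a, rng, w=0.1, match=0.6, mismatch=0.05):
    # Stochastic traceback from (T, U) to (0, 0): each move is drawn with
    # probability proportional to its contribution to a[i][j].
    i, j = len(x), len(y)
    top, bottom = [], []
    while i > 0 or j > 0:
        moves, weights = [], []
        if i > 0:
            moves.append((1, 0))
            weights.append(w * a[i - 1][j])
        if j > 0:
            moves.append((0, 1))
            weights.append(w * a[i][j - 1])
        if i > 0 and j > 0:
            moves.append((1, 1))
            weights.append(emission(x, y, i, j, match, mismatch) * a[i - 1][j - 1])
        di, dj = rng.choices(moves, weights=weights)[0]
        top.append(x[i - 1] if di else "-")
        bottom.append(y[j - 1] if dj else "-")
        i -= di
        j -= dj
    return "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    rng = random.Random(0)
    x, y = "AACATTAGA", "AGATTACCACA"
    table = forward_table(x, y)
    for _ in range(3):
        top, bottom = sample_alignment(x, y, table, rng)
        print(top)
        print(bottom)
        print()
```

Each call to `sample_alignment` reuses the same table, so drawing additional samples costs only one traceback each. Replacing the sum in the recursion with a max and always taking the largest term in the traceback would give the single optimal alignment instead.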

Linear Space Sampling
Sequences of length T, U; to obtain k samples:
– Time complexity: O(TU + k(T+U))
– Memory requirements: O(T+U)
Hirschberg's divide and conquer algorithm:
– Time complexity: O(TU)
– Memory requirements: O(T+U)
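The linear memory bound depends on never holding the full T x U table. The sketch below shows only the simplest ingredient of that idea: computing the total forward probability while keeping two rows at a time. It is not the linear-space sampling algorithm from the talk, and the function name, boundary values, and parameter values are assumptions made for illustration.

```python
def forward_total(x, y, w=0.1, match=0.6, mismatch=0.05):
    # The recursion is symmetric in the two sequences, so put the shorter
    # one along the row to keep the row buffers as small as possible.
    if len(y) > len(x):
        x, y = y, x
    # Row 0: only gaps are possible, so a[0][j] = w**j (assumed boundary).
    prev = [w ** j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [w * prev[0]] + [0.0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Same recursion as the full table, reading only two rows.
            cur[j] = w * prev[j] + w * cur[j - 1] + s * prev[j - 1]
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    print(forward_total("AACATTAGA", "AGATTACCACA"))
```

Two rows are enough for the total, but a traceback needs earlier rows as well, which is where a divide and conquer scheme in the style of Hirschberg's algorithm comes in.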

Alternative Splicing in Mammalian Genomes
[Figure: pre-mRNA, splicing, translation, Protein I; pre-mRNA, alternative splicing, translation, Protein II]

Cross-species simultaneous gene finding and alignment
M. Alexandersson, S. Cawley, L. Pachter, SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model, Genome Research, 13 (2003).

Modeling gene features
[Figure: human and mouse sequences, 5' to 3', showing Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, and CNS]

The SLAM hidden Markov model

SLAM components
Splice site detector
– VLMM
Intron and intergenic regions
– 2nd order Markov chain
– independent geometric lengths
Coding sequence
– PHMM on protein level
– generalized length distribution
Conserved non-coding sequence
– PHMM on DNA level
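To make the intron and intergenic component concrete, here is a minimal sketch of scoring one region under a 2nd order Markov chain with a geometric length distribution. It is not SLAM's code: the function names, the pseudocount, the choice to leave the first two bases unscored, and the `p_end` value are assumptions for illustration. It also scores a single sequence, whereas SLAM models a pair of sequences.

```python
import math
from collections import defaultdict

def train_second_order(seqs, alphabet="ACGT", pseudocount=1.0):
    # Estimate P(base | previous two bases) from training sequences,
    # smoothing every count with a pseudocount (an assumption of this sketch).
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for k in range(2, len(s)):
            counts[s[k - 2:k]][s[k]] += 1
    probs = {}
    for a in alphabet:
        for b in alphabet:
            ctx = a + b
            total = sum(counts[ctx][c] for c in alphabet) + pseudocount * len(alphabet)
            probs[ctx] = {c: (counts[ctx][c] + pseudocount) / total for c in alphabet}
    return probs

def log_prob_region(seq, probs, p_end):
    # Geometric length: P(L) = (1 - p_end)**(L - 1) * p_end for L >= 1.
    L = len(seq)
    if L < 1:
        raise ValueError("region must contain at least one base")
    lp = (L - 1) * math.log(1 - p_end) + math.log(p_end)
    # Bases from the third onward are scored given their two predecessors;
    # the first two bases are left unscored in this sketch.
    for k in range(2, L):
        lp += math.log(probs[seq[k - 2:k]][seq[k]])
    return lp

if __name__ == "__main__":
    model = train_second_order(["ACGTACGTACGT"])
    print(log_prob_region("ACGTAC", model, p_end=0.01))
```

A geometric length distribution is what an ordinary HMM state with a self-transition produces, which is why the coding states need a separate generalized length distribution.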

SLAM input and output
Input:
– Pair of homologous sequences.
Output:
– CDS and CNS predictions in both sequences.
– Protein predictions.
– Protein and CNS alignment.

Input:

Output:

Methodology for identifying alternative splice sites
– Compiled SLAM gene predictions for the human, mouse and rat genomes.
– Identified a set of 3400 human/mouse/rat gene triples with consistent predictions from hs/mm and hs/rn analyses.
– For each triple, sampled sub-optimal parses from hs/mm and hs/rn runs.
– Collected alternative exons (non-Viterbi exons) that appeared in both the hs/mm and hs/rn runs.
– Examined overlap with RefSeq genes, mRNAs and ESTs.
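The collection step can be sketched as set operations, assuming each exon is represented as a (start, end) pair in human coordinates and each sampled parse is a list of such exons. The representation and function names are hypothetical and do not come from the talk.

```python
def alternative_exons(viterbi_exons, sampled_parses):
    # Exons that occur in at least one sampled parse but not in the Viterbi parse.
    viterbi = set(viterbi_exons)
    alt = set()
    for parse in sampled_parses:
        alt.update(exon for exon in parse if exon not in viterbi)
    return alt

def conserved_alternative_exons(hs_mm, hs_rn):
    # Each argument is a (viterbi_exons, sampled_parses) pair for one run.
    # Keep only alternative exons seen in both the hs/mm and hs/rn runs.
    return alternative_exons(*hs_mm) & alternative_exons(*hs_rn)

if __name__ == "__main__":
    # Toy coordinates, invented purely to exercise the functions.
    hs_mm = ([(100, 200), (400, 500)],
             [[(100, 200), (400, 500)], [(100, 200), (300, 350), (400, 500)]])
    hs_rn = ([(100, 200), (400, 500)],
             [[(100, 200), (300, 350), (400, 500)], [(100, 210), (400, 500)]])
    print(conserved_alternative_exons(hs_mm, hs_rn))
```

Exact coordinate matching is the strictest possible criterion; a real pipeline might need to tolerate small boundary differences between runs.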

SLAM whole genome predictions
– Built a whole genome homology map (Colin Dewey).
– Pre-aligned the homologous blocks to reduce the SLAM search space (Nicolas Bray using AVID).
– Ran SLAM on the resulting blocks.

[human] [mouse] [rat]

Comparing predicted alternative exons to ESTs and mRNAs
[Table: columns are human/mouse/rat alternative exons and human/mouse alternative exons, each split into EST/mRNA and No EST/mRNA; rows are Gene count, Alt. Exon count, Shifties, and Newbies]

Conclusions
– Sampling is memory efficient, fast, and should be used routinely for alignment applications.
– Conserved alternative splice forms can be detected ab initio.
– The extent of alternative splicing conservation is currently unclear. Sampling provides an alternative approach for investigating this problem, one that is not sensitive to biases in EST data.
– Problem: design effective and scalable validation strategies for alternative splice sites.