Genome Annotation (protein coding genes)

Slides:



Advertisements
Similar presentations
Hidden Markov Model in Biological Sequence Analysis – Part 2
Advertisements

BIOINFORMATICS GENE DISCOVERY BIOINFORMATICS AND GENE DISCOVERY Iosif Vaisman 1998 UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL Bioinformatics Tutorials.
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Hidden Markov Models in Bioinformatics
Ch9 Reasoning in Uncertain Situations Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2011.
Hidden Markov Models.
Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008.
Hidden Markov Models: Applications in Bioinformatics Gleb Haynatzki, Ph.D. Creighton University March 31, 2003.
Hidden Markov Models CBB 231 / COMPSCI 261. An HMM is a following: An HMM is a stochastic machine M=(Q, , P t, P e ) consisting of the following: a finite.
Ka-Lok Ng Dept. of Bioinformatics Asia University
Hidden Markov Models in Bioinformatics
Profiles for Sequences
JM - 1 Introduction to Bioinformatics: Lecture XIII Profile and Other Hidden Markov Models Jarek Meller Jarek Meller Division.
Hidden Markov Models (HMMs) Steven Salzberg CMSC 828H, Univ. of Maryland Fall 2010.
درس بیوانفورماتیک December 2013 مدل ‌ مخفی مارکوف و تعمیم ‌ های آن به نام خدا.
Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
Gene Finding. Finding Genes Prokaryotes –Genome under 10Mb –>85% of sequence codes for proteins Eukaryotes –Large Genomes (up to 10Gb) –1-3% coding for.
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT. 2 HMM Architecture Markov Chains What is a Hidden Markov Model(HMM)? Components of HMM Problems of HMMs.
. Sequence Alignment via HMM Background Readings: chapters 3.4, 3.5, 4, in the Durbin et al.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT
Comparative ab initio prediction of gene structures using pair HMMs
Deepak Verghese CS 6890 Gene Finding With A Hidden Markov model Of Genomic Structure and Evolution. Jakob Skou Pedersen and Jotun Hein.
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Hidden Markov Models In BioInformatics
Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.
CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
BINF6201/8201 Hidden Markov Models for Sequence Analysis
BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)
Doug Raiford Lesson 3.  Have a fully sequenced genome  How identify the genes?  What do we know so far? 10/13/20152Gene Prediction.
BME 110L / BIOL 181L Computational Biology Tools February 19: In-class exercise: a phylogenetic tree for that.
CSCI 6900/4900 Special Topics in Computer Science Automata and Formal Grammars for Bioinformatics Bioinformatics problems sequence comparison pattern/structure.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
From Genomes to Genes Rui Alves.
Eukaryotic Gene Prediction Rui Alves. How are eukaryotic genes different? DNA RNA Pol mRNA Ryb Protein.
Hidden Markov Models in Bioinformatics
CZ5226: Advanced Bioinformatics Lecture 6: HHM Method for generating motifs Prof. Chen Yu Zong Tel:
Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
Motif Search and RNA Structure Prediction Lesson 9.
(H)MMs in gene prediction and similarity searches.
1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Hidden Markov Models Wassnaa AL-mawee Western Michigan University Department of Computer Science CS6800 Adv. Theory of Computation Prof. Elise De Doncker.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Introduction to Profile HMMs
Hidden Markov Models BMI/CS 576
ORF Calling.
bacteria and eukaryotes
The Transcriptional Landscape of the Mammalian Genome
What is a Hidden Markov Model?
Gil McVean Department of Statistics, Oxford
Pfam: multiple sequence alignments and HMM-profiles of protein domains
Eukaryotic Gene Finding
Gene architecture and sequence annotation
Predicting Active Site Residue Annotations in the Pfam Database
Predicting Genes in Actinobacteriophages
Introduction to Bioinformatics II
Hidden Markov Models in Bioinformatics min
Hidden Markov Models (HMMs)
What do you with a whole genome sequence?
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
4. HMMs for gene finding HMM Ability to model grammar
Presentation transcript:

Genome Annotation (protein coding genes) Genome analysis; Nov 14 2007

Motivation Problem: Once we know the sequence, we wish to understand it. First step is to understand find the genes. Finding the genes is the first step in Determining the proteome Study gene regulation Map genotypes to phenotypes ...

Protein coding genes Protein coding genes are the best understood. Several approaches exist for searching for them. We will cover only a small fraction of the approaches here.

Protein coding genes Search for functional sites: - e.g. promoters, cap-sites, poly-A sites, splice sites, start and stop codons... Analysis of sequence composition: - e.g. nucleotide usage and triple distribution... Homology search: - e.g. search against databases of known proteins or known genes...

Hidden Markov models Hidden Markov model: Moves from state to state in the same way as a Markov model, but emits a symbol (from a finite alphabet of the model) from each state visited (except silent states) according to a probability distribution of the state (emission probabilities) ... Model M: A run follows a sequence of states: H H L L H And emits a sequence of symbols:

Hidden Markov models Hidden Markov model: Moves from state to state in the same way as a Markov model, but emits a symbol (from a finite alphabet of the model) from each state visited (except silent states) according to a probability distribution of the state (emission probabilities) ... Start Model M: A run follows a sequence of states: H H L L H And emits a sequence of symbols: End Special Start- and End-states can be added to generate finite output

Hidden Markov models Hidden Markov model: Moves from state to state in the same way as a Markov model, but emits a symbol (from a finite alphabet of the model) from each state visited (except silent states) according to a probability distribution of the state (emission probabilities) ... Start State with emission probabilities The emission prob's of state Q sums to 1 Model M: A run follows a sequence of states: H H L L H And emits a sequence of symbols: Transition probability The transition prob's of transitions from state Q sums to 1 End Special Start- and End-states can be added to generate finite output

Applications of HMMs in bioinformatics History of HMMs History of HMMs Hidden Markov Models were introduced in statistical papers by Leonard E. Baum and others in the late1960s. One of the first applications of HMMs was speech recognition in the mid-1970s. In the late 1980s, HMMs were applied to the analysis of biological sequences. Since then, many applications in bioinformatic... Applications of HMMs in bioinformatics prediction of protein-coding regions in genome sequences modeling families of related DNA or protein sequences prediction of secondary structure elements in proteins ... and many others ...

HMM based gene finding Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag A: >0 C: >0 G: >0 T: >0 N: non-coding C: coding

HMM based gene finding Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag A: >0 C: >0 G: >0 T: >0 N: non-coding C: coding A: 1 C: 0 G: 0 T: 0 A: 0 T: 1 G: 1

The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 HMM based gene finding A: >0 C: >0 G: >0 T: >0 N: non-coding C: coding A: 1 C: 0 G: 0 T: 0 A: 0 T: 1 G: 1

The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 HMM based gene finding A: >0 C: >0 G: >0 T: >0 N: non-coding C: coding A: 1 C: 0 G: 0 T: 0 A: 0 T: 1 G: 1

HMM based gene finding

Comparative gene finding

Comparative HMM gene finding

Comparative HMM gene finding

Quality of gene finders Knapp and Chen, 2006

Using existing knowledge of genes Just finding sequences need not be the whole story. Filtering candidates through databases of known genes, for example, can improve the quality of gene finders (but only for finding “known” genes).

“Profiles” An aligned sequence family or “region of interest” ACA --- ATG TCA ACT ATC ACA C-- AGC AGA --- ATC ACC G-- ATC ... A “classic” profile summarizing the sequence family A C A - - - A T C 0.8 . 0.8 0.2 . . 1.0 . . . 0.8 0.2 0.2 0.2 . . . 0.8 . 0.2 . 0.2 . . . 0.2 0.2 0.2 . . . . 0.2 . 0.8 . . . . 0.4 0.8 0.8 . . . A: C: G: T: -: 1 2 3 4 5 6 7 8 9

Profile HMMs ACA --- ATG TCA ACT ATC ACA C-- AGC AGA --- ATC ACC G-- ATC

Profile HMMs Begin End

Profile HMMs Consists of Match-, Insert-, and Delete-states. A run generates a sequence (DNA or protein). The hidden path of states explains how the generated sequence relates to the sequence family ...

Beyond protein coding genes There seem to be more conserved sequence than coding sequence in genomes.

Beyond protein coding genes Siepel et al. 2005

Beyond protein coding genes There seem to be more conserved sequence than coding sequence in genomes. If we believe conservation ← purifying selection ← function then there is much non-coding but functionally important sequence to be found and figured out.

Beyond protein coding genes There seem to be more conserved sequence than coding sequence in genomes. If we believe conservation ← purifying selection ← function then there is much non-coding but functionally important sequence to be found and figured out. ...regulatory sequences, splicing machinery, RNA genes...

RNA genes Stochastic Context-Free Grammars (SCFGs) – a formalism similar to HMMs – can be used for recognizing RNA structure.

RNA genes Stochastic Context-Free Grammars (SCFGs) – a formalism similar to HMMs – can be used for recognizing RNA structure. Not quite as computationally efficient, but can analyse genome sequences in sliding windows.

Simple example of CFG S  a S u  ac S gu  acg S cgu  acgg S ccgu S  aSu | uSa | gSc | cSc | ... | SS | aS | uS | cS | gS | “” S  a S u  ac S gu  acg S cgu  acgg S ccgu  acgga S ccgu  acggag S ccgu  acggagu S ccgu  acggagug S ccgu  acggagugc S ccgu  acggagugc “” ccgu AG U CG ACGG UGCC

Simple example of CFG S  a S u  ac S gu  acg S cgu  acgg S ccgu S  aSu | uSa | gSc | cSc | ... | SS | aS | uS | cS | gS | “” Adding probabilities to the rules gives us stochastic CFGs. These can be used for annotations (similar to HMMs). S  a S u  ac S gu  acg S cgu  acgg S ccgu  acgga S ccgu  acggag S ccgu  acggagu S ccgu  acggagug S ccgu  acggagugc S ccgu  acggagugc “” ccgu AG U CG ACGG UGCC

Simple example of CFG S  a S u  ac S gu  acg S cgu  acgg S ccgu S  aSu | uSa | gSc | cSc | ... | SS | aS | uS | cS | gS | “” Adding probabilities to the rules gives us stochastic CFGs. These can be used for annotations (similar to HMMs). S  a S u  ac S gu  acg S cgu  acgg S ccgu  acgga S ccgu  acggag S ccgu  acggagu S ccgu  acggagug S ccgu  acggagugc S ccgu  acggagugc “” ccgu AG U CG ACGG UGCC Signals that are typically important – and captured by the stochastic grammar – are stability (free energy) of the predicted structure.

Comparative RNA gene finding Pedersen et al. 2006

Summary Using stochastic methods (HMMs or SCFGs) we can model our biological knowledge to extract sequence and evolutionary signal to automatically annotate genomes. We have seen hidden Markov models for annotating protein coding genes and stochastic context free grammars to annotate for RNA genes