Parsing A Bacterial Genome
Mark Craven
Department of Biostatistics & Medical Informatics, University of Wisconsin, U.S.A.

The Task
Given: a bacterial genome
Do: use computational methods to predict a “parts list” of regulatory elements

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context-free grammar

The Central Dogma of Molecular Biology

Transcription in Bacteria

Operons in Bacteria
operon: a sequence of one or more genes transcribed as a unit under some conditions
promoter: a “signal” in DNA indicating where to start transcription
terminator: a “signal” indicating where to stop transcription
[diagram: promoter, gene(s), terminator along the DNA, with the resulting mRNA]

The Task Revisited
Given:
– DNA sequence of the E. coli genome
– coordinates of known/predicted genes
– known instances of operons, promoters, and terminators
Do:
– learn models from the known instances
– predict a complete catalog of operons, promoters, and terminators for the genome

Our Approach: Probabilistic Language Models
1. write down a “grammar” for the elements of interest (operons, promoters, terminators, etc.) and the relations among them
2. learn probability parameters from known instances of these elements
3. predict new elements by “parsing” uncharacterized DNA sequence

Transformational Grammars
a transformational grammar characterizes a set of legal strings
the grammar consists of
– a set of abstract nonterminal symbols
– a set of terminal symbols (those that actually appear in strings)
– a set of productions

A Grammar for Stop Codons
this grammar can generate the three stop codons: taa, tag, tga
with a grammar we can ask questions like
– what strings are derivable from the grammar?
– can a particular string be derived from the grammar?
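
The grammar's productions appeared as a figure in the original slides. A minimal sketch in Python of one grammar with this property (my reconstruction; the slide's exact productions may differ):

```python
# A context-free grammar generating exactly the stop codons taa, tag, tga
# (a reconstruction; the original slide's figure showed the actual productions).
grammar = {
    "S": [["t", "A"]],              # every stop codon starts with t
    "A": [["a", "B"], ["g", "a"]],  # t a {a,g}  or  t g a
    "B": [["a"], ["g"]],
}

def derive(symbol):
    """Enumerate all terminal strings derivable from a symbol."""
    if symbol not in grammar:       # terminal symbol
        return [symbol]
    strings = []
    for production in grammar[symbol]:
        partials = [""]
        for rhs_symbol in production:
            partials = [p + s for p in partials for s in derive(rhs_symbol)]
        strings.extend(partials)
    return strings

print(sorted(derive("S")))  # ['taa', 'tag', 'tga']
```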

The Parse Tree for tag

A Probabilistic Version of the Grammar
each production has an associated probability
the probabilities for productions with the same left-hand side sum to 1
this grammar has a corresponding Markov chain model
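
Continuing the sketch above, a hypothetical probabilistic version: each production carries a probability, and productions sharing a left-hand side sum to 1. The numbers here are illustrative, not the slide's:

```python
import random

# The stop-codon grammar with a probability on each production; probabilities
# for productions with the same left-hand side sum to 1 (illustrative numbers).
pcfg = {
    "S": [(1.0, ["t", "A"])],
    "A": [(0.7, ["a", "B"]), (0.3, ["g", "a"])],
    "B": [(0.5, ["a"]), (0.5, ["g"])],
}

def sample(symbol):
    """Sample a string from the PCFG. Because this grammar is regular
    (terminals are emitted left to right), it corresponds to a Markov
    chain over its nonterminal states, as the slide notes."""
    if symbol not in pcfg:
        return symbol
    probs, productions = zip(*pcfg[symbol])
    rhs = random.choices(productions, weights=probs)[0]
    return "".join(sample(s) for s in rhs)

print(sample("S"))  # e.g. 'tag', with probability 1.0 * 0.7 * 0.5 = 0.35
```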

A Probabilistic Context-Free Grammar for Terminators
START → PREFIX STEM_BOT1 SUFFIX
PREFIX → B B B B B B B B B
STEM_BOT1 → t_l STEM_BOT2 t_r
STEM_BOT2 → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*
STEM_MID → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*
STEM_TOP2 → t_l* STEM_TOP1 t_r*
STEM_TOP1 → t_l LOOP t_r
LOOP → B B LOOP_MID B B
LOOP_MID → B LOOP_MID | B
SUFFIX → B B B B B B B B B
B → a | c | g | u
where t = {a, c, g, u} and t* = {a, c, g, u, ε}
[figure: an example terminator sequence parsed into its prefix, stem, loop, and suffix regions]

Inference with Probabilistic Grammars
for a given string there may be many parses, but some are more probable than others
we can make predictions by finding relatively high-probability parses
there are dynamic programming algorithms for finding the most probable parse efficiently
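
A textbook sketch of such a dynamic program: the Viterbi variant of CYK computes the log probability of the most probable parse of a PCFG in Chomsky normal form (a generic illustration, not the system's actual parser):

```python
import math
from collections import defaultdict

def viterbi_cyk(tokens, unary_rules, binary_rules, start="S"):
    """Log probability of the most probable parse for a PCFG in Chomsky
    normal form. unary_rules: (lhs, terminal, log_prob);
    binary_rules: (lhs, (B, C), log_prob)."""
    n = len(tokens)
    best = defaultdict(lambda: -math.inf)   # (i, j, nonterminal) -> log prob
    for i, tok in enumerate(tokens):        # width-1 spans from unary rules
        for lhs, term, logp in unary_rules:
            if term == tok:
                best[(i, i + 1, lhs)] = max(best[(i, i + 1, lhs)], logp)
    for width in range(2, n + 1):           # wider spans from binary rules
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):       # split point
                for lhs, (b, c), logp in binary_rules:
                    score = logp + best[(i, k, b)] + best[(k, j, c)]
                    if score > best[(i, j, lhs)]:
                        best[(i, j, lhs)] = score
    return best[(0, n, start)]

# The stop-codon PCFG from the earlier sketch, converted to CNF:
unary = [("T", "t", 0.0), ("Ta", "a", 0.0), ("Tg", "g", 0.0),
         ("B", "a", math.log(0.5)), ("B", "g", math.log(0.5))]
binary = [("S", ("T", "A"), 0.0),
          ("A", ("Ta", "B"), math.log(0.7)),
          ("A", ("Tg", "Ta"), math.log(0.3))]
print(math.exp(viterbi_cyk(list("tag"), unary, binary)))  # ~0.35
```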

Learning with Probabilistic Grammars
in this work, we write down the productions by hand but learn the probability parameters
to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar
when there is hidden state (i.e. the correct parse is not known), we use Expectation-Maximization (EM) algorithms
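
When parses are fully observed, the maximum-likelihood estimate of each production probability is simply its relative frequency among productions with the same left-hand side; EM generalizes this by replacing observed counts with expected counts. A minimal sketch of the fully observed case:

```python
from collections import Counter, defaultdict

def estimate_probs(observed_productions):
    """MLE of production probabilities from fully observed parses: the
    relative frequency of each (lhs, rhs) among all productions with the
    same lhs. With hidden state, EM (Inside-Outside) substitutes expected
    counts for these observed counts."""
    counts = Counter(observed_productions)   # (lhs, rhs) -> count
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

rules = [("A", ("a", "B")), ("A", ("a", "B")), ("A", ("g", "a"))]
print(estimate_probs(rules))  # {('A', ('a', 'B')): 0.667, ('A', ('g', 'a')): 0.333}
```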

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics ’03]
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context-free grammar

A Model for Transcription Units
[model diagram: a chain of states spanning untranscribed and transcribed regions, including a promoter, the transcription start site (TSS), ORFs with intervening spacers, untranslated regions, and rho-independent (RIT) and rho-dependent (RDT) terminators with their prefix, stem-loop, and suffix components; the submodels are SCFGs, position-specific Markov models, and semi-Markov models]

The Components of the Model
stochastic context-free grammars (SCFGs): represent variable-length sequences with long-range dependencies
semi-Markov models: represent variable-length sequences
position-specific Markov models: represent fixed-length sequence motifs

Gene Expression Data
in addition to DNA sequence data, we also use expression data to make our parses
microarrays enable the simultaneous measurement of the transcription levels of thousands of genes
[figure: an expression matrix with genes/sequence positions as rows and experimental conditions as columns]

Incorporating Expression Data
our models parse two sequences simultaneously
– the DNA sequence of the genome
– a sequence of expression measurements associated with particular sequence positions
the expression data is useful because it provides information about which subsequences look like they are transcribed together
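
One simple way to realize this idea (a sketch of the general approach, not the paper's exact model) is to have each model state emit a (base, expression) pair, with the two emissions treated as conditionally independent given the state; positions with similar expression levels then prefer to be parsed into the same transcribed unit. All parameters below are hypothetical:

```python
import math

def emission_logprob(state, base, expr):
    """Joint log probability of emitting a DNA base and an expression
    measurement from a state, assuming conditional independence given
    the state (illustrative parameters, not the paper's)."""
    base_probs = {"transcribed":   {"a": .25, "c": .25, "g": .25, "t": .25},
                  "untranscribed": {"a": .30, "c": .20, "g": .20, "t": .30}}
    expr_mean = {"transcribed": 2.0, "untranscribed": 0.0}  # hypothetical log-ratio scale
    mu, sigma = expr_mean[state], 1.0
    gauss = (-0.5 * ((expr - mu) / sigma) ** 2
             - math.log(sigma * math.sqrt(2 * math.pi)))
    return math.log(base_probs[state][base]) + gauss

print(emission_logprob("transcribed", "a", 1.8))
```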

Predictive Accuracy for Operons

Predictive Accuracy for Promoters

Predictive Accuracy for Terminators

Accuracy of Promoter & Terminator Localization

Terminator Predictive Accuracy

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training data with “weakly” labeled examples [Bockhorst & Craven, ICML ’02]
5. refining the structure of a stochastic context-free grammar

Key Idea: Weakly Labeled Examples
regulatory elements are inter-related
– promoters precede operons
– terminators follow operons
– etc.
relationships such as these can be exploited to augment training sets with “weakly labeled” examples

Inferring “Weakly” Labeled Examples
[diagram: a genome segment with genes g1 through g5 on both strands]
if we know that an operon ends at g4, then there must be a terminator shortly downstream
if we know that an operon begins at g2, then there must be a promoter shortly upstream
we can exploit relations such as these to augment our training sets (see the sketch below)
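
A sketch of how such weakly labeled examples might be generated from known operon coordinates (the window size is illustrative; the element is known only to lie somewhere in the window):

```python
def weak_examples(genome, operons, window=100):
    """For each known operon (start, end), emit the upstream window as a
    weakly labeled promoter region and the downstream window as a weakly
    labeled terminator region. The exact extent of the element within
    each window is unknown, hence 'weakly' labeled."""
    examples = []
    for start, end in operons:
        examples.append(("promoter", genome[max(0, start - window):start]))
        examples.append(("terminator", genome[end:end + window]))
    return examples
```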

Strongly vs. Weakly Labeled Terminator Examples
[figure: a genomic sequence around a rho-independent terminator; the strongly labeled example marks the exact extent of the terminator and the end of its stem-loop, while the weakly labeled example indicates only the region that must contain it]

Training the Terminator Models: Strongly Labeled Examples
[diagram: rho-independent examples train the rho-independent terminator model, rho-dependent examples train the rho-dependent terminator model, and negative examples train the negative model]

Training the Terminator Models: Weakly Labeled Examples
[diagram: weakly labeled examples, whose sub-class is unknown, train a combined terminator model built from the rho-independent and rho-dependent terminator models, while negative examples train the negative model]

Do Weakly Labeled Terminator Examples Help?
task: classification of terminators (both sub-classes) in E. coli K-12
train the SCFG terminator model using:
– S strongly labeled examples and
– W weakly labeled examples
evaluate using area under ROC curves
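
The evaluation metric is area under the ROC curve; a minimal sketch of computing it from model scores, assuming scikit-learn is available (the labels and scores here are made up):

```python
from sklearn.metrics import roc_auc_score

# Area under the ROC curve from model scores; 'scores' would be, e.g.,
# the log-odds of the terminator model vs. the negative model.
labels = [1, 1, 0, 1, 0, 0]                  # 1 = true terminator, 0 = negative
scores = [2.3, 1.1, 0.4, -0.2, -1.5, -2.0]   # hypothetical model scores
print(roc_auc_score(labels, scores))         # 0.889
```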

Learning Curves using Weakly Labeled Terminators
[plot: area under ROC curve vs. number of strong positive examples, with curves for 0, 25, and 250 weak examples]

Are Weakly Labeled Examples Better than Unlabeled Examples?
train the SCFG terminator model using:
– S strongly labeled examples and
– U unlabeled examples
vary S and U to obtain learning curves

Training the Terminator Models: Unlabeled Examples
[diagram: unlabeled examples train a combined model built from the rho-independent terminator, rho-dependent terminator, and negative models]

Learning Curves: Weak vs. Unlabeled
[plots: area under ROC curve vs. number of strong positive examples, comparing weakly labeled examples (0, 25, and more weak examples) with unlabeled examples (0, 25, and more unlabeled examples)]

Are Weakly Labeled Terminators from Predicted Operons Useful?
train the operon model with S labeled operons
predict operons
generate W weakly labeled terminators from the W most confident predictions
vary S and W

Learning Curves using Weakly Labeled Terminators
[plot: area under ROC curve vs. number of strong positive examples, with curves for 0, 25, 100, and 200 weak examples from predicted operons]

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context-free grammar [Bockhorst & Craven, IJCAI ’01]

Learning SCFGs
given the productions of a grammar, we can learn the probabilities using the Inside-Outside algorithm
we have developed an algorithm that can add new nonterminals and productions to a grammar during learning
basic idea:
– identify nonterminals that seem to be “overloaded”
– split these nonterminals into two, allowing each copy to specialize

Refining the Grammar in an SCFG
in an SCFG there are various “contexts” in which each grammar nonterminal may be used
consider two contexts for a given nonterminal
if the production probabilities for that nonterminal look very different depending on its context, we add a new nonterminal and specialize

Refining the Grammar in an SCFG
we can compare two probability distributions P and Q using the Kullback-Leibler divergence:
D_KL(P || Q) = Σ_i P(i) log ( P(i) / Q(i) )
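
A sketch of the resulting splitting heuristic: estimate the nonterminal's production distribution separately in two contexts, compute the KL divergence between the two estimates, and split if it exceeds a threshold (the distributions and threshold here are illustrative):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) between two discrete
    distributions over the same set of productions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

context_a = [0.8, 0.1, 0.1]   # production probs estimated in context A
context_b = [0.2, 0.4, 0.4]   # same productions estimated in context B
if kl_divergence(context_a, context_b) > 0.5:   # hypothetical threshold
    print("distributions differ: split the nonterminal into two copies")
```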

Learning Terminator SCFGs
extracted a grammar from the literature (~120 productions)
data set consists of 142 known E. coli terminators and 125 sequences that do not contain terminators
learn parameters using the Inside-Outside algorithm (an EM algorithm)
consider adding nonterminals guided by three heuristics:
– KL divergence
– chi-squared
– random

SCFG Accuracy After Adding 25 New Nonterminals

SCFG Accuracy vs. Nonterminals Added

Conclusions
summary
– we have developed an approach to predicting transcription units in bacterial genomes
– we have predicted a complete set of transcription units for the E. coli genome
advantages of the probabilistic grammar approach
– can readily incorporate background knowledge
– can simultaneously produce a coherent set of predictions for a set of related elements
– can be easily extended to incorporate other genomic elements
current directions
– expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)
– handling overlapping elements
– making predictions for multiple related genomes

Acknowledgements
Craven Lab: Joe Bockhorst, Keith Noto
David Page, Jude Shavlik
Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu
funding from the National Science Foundation and the National Institutes of Health