Two bioinformatics applications of dynamic Bayesian networks


William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington

Outline
Segmenting genomic data
  Background: DNA, chromatin and DNase I
  Simple solution
  Wavelets
  Hierarchical model
Matching peptides to mass spectra
  Background: tandem mass spectrometry
  Modeling peptide fragmentation

The human genome in vivo
Figure labels: Nucleus; Chromatin Fiber; Gene 'domains'; Genes; Trans-factor complex; DNase I Hypersensitive Site; Genomic DNA Packaged into Chromatin

Measuring chromatin accessibility

A very simple hidden Markov model
States: open chromatin, closed chromatin.
Each state contains a single Gaussian.
The model has six parameters (two transitions, two means, two standard deviations).
The parameters are initialized randomly and trained in an unsupervised fashion via expectation-maximization (EM).
EM is restarted 100 times, and we select the parameters that yield the highest likelihood.
The original data set is then segmented using either Viterbi or posterior decoding.
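The procedure on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the talk: the function names (`fit_hmm`, `fit_with_restarts`, `viterbi`), the fixed uniform initial-state distribution (chosen so that only the six parameters named above are learned), the number of EM iterations, and the variance floor are all assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of every observation under every state's Gaussian, shape (T, K)."""
    z = (x[:, None] - mean[None, :]) / std[None, :]
    return np.exp(-0.5 * z ** 2) / (std[None, :] * np.sqrt(2.0 * np.pi))

def forward_backward(B, A, pi):
    """Scaled forward-backward pass: state posteriors, pair posteriors, log-likelihood."""
    T, K = B.shape
    alpha, beta, c = np.zeros((T, K)), np.ones((T, K)), np.zeros(T)
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
          / c[1:, None, None])
    return gamma, xi, np.log(c).sum()

def fit_hmm(x, n_iter=50, rng=None):
    """One EM run from a random start; returns (A, mean, std, log-likelihood)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(rng)
    pi = np.array([0.5, 0.5])              # assumption: fixed, not learned
    stay = rng.uniform(0.5, 0.99, size=2)  # random self-transition probabilities
    A = np.array([[stay[0], 1 - stay[0]], [1 - stay[1], stay[1]]])
    mean = rng.choice(x, size=2, replace=False)
    std = np.full(2, x.std())
    for _ in range(n_iter):
        B = np.maximum(gaussian_pdf(x, mean, std), 1e-300)
        gamma, xi, _ = forward_backward(B, A, pi)       # E-step
        A = xi.sum(axis=0)                              # M-step: transitions
        A /= A.sum(axis=1, keepdims=True)
        w = gamma.sum(axis=0)
        mean = (gamma * x[:, None]).sum(axis=0) / w     # M-step: means
        var = (gamma * (x[:, None] - mean) ** 2).sum(axis=0) / w
        std = np.sqrt(np.maximum(var, 1e-6))            # assumption: variance floor
    B = np.maximum(gaussian_pdf(x, mean, std), 1e-300)
    return A, mean, std, forward_backward(B, A, pi)[2]

def fit_with_restarts(x, n_restarts=100, seed=0):
    """Restart EM from different random starts and keep the most likely fit."""
    runs = [fit_hmm(x, rng=seed + r) for r in range(n_restarts)]
    return max(runs, key=lambda run: run[3])

def viterbi(x, A, mean, std):
    """Most probable state path under the fitted parameters."""
    x = np.asarray(x, dtype=float)
    logB = np.log(np.maximum(gaussian_pdf(x, mean, std), 1e-300))
    logA = np.log(np.maximum(A, 1e-300))
    T, K = logB.shape
    delta = np.log(1.0 / K) + logB[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

After fitting, the state with the larger mean would be read as open chromatin. Posterior decoding would instead take the per-position argmax of `gamma` from `forward_backward`.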

1.5 megabases

A problem, and two solutions
Problem: We are interested in phenomena occurring at multiple scales.
Solution #1: Perform a wavelet smooth prior to HMM analysis.
Solution #2: Build a more complex probability model.
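The slides do not say which wavelet or scale was used for Solution #1. As a rough sketch, assuming a Haar wavelet, smoothing at a given level amounts to replacing each block of 2^level points by its mean, which is what remains after the finer detail coefficients are set to zero. The function name `haar_smooth` is hypothetical.

```python
import numpy as np

def haar_smooth(signal, level):
    """Haar approximation at the given level: block means over 2**level points."""
    x = np.asarray(signal, dtype=float)
    block = 2 ** level
    if x.size % block != 0:
        raise ValueError("signal length must be a multiple of 2**level")
    means = x.reshape(-1, block).mean(axis=1)
    return np.repeat(means, block)
```

Running the HMM on the output of `haar_smooth` at several levels would give one segmentation per scale.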

Change point model
Four-state model: major DNase hypersensitive site (DHS), minor DHS, intermediate sensitivity region, and insensitive region.
Continuous mixture of Gaussians at each state.
Gamma distribution of lengths within each region.
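To make the generative picture concrete, the following sketch simulates data from a simplified version of such a model. It is an illustration under stated assumptions only: each state emits from a single Gaussian rather than a continuous mixture, the next state is drawn uniformly from the other states, and every parameter value passed in is hypothetical.

```python
import numpy as np

STATES = ["insensitive", "intermediate", "minor DHS", "major DHS"]

def simulate_segments(n_points, means, stds, shapes, scales, seed=0):
    """Draw gamma-distributed segment lengths, then Gaussian values per state."""
    rng = np.random.default_rng(seed)
    labels = np.empty(n_points, dtype=int)
    pos = 0
    state = int(rng.integers(len(means)))
    while pos < n_points:
        # Segment length from the current state's gamma distribution, at least 1.
        length = max(1, int(round(rng.gamma(shapes[state], scales[state]))))
        end = min(pos + length, n_points)
        labels[pos:end] = state
        pos = end
        # Assumption: switch to a different state chosen uniformly at random.
        others = [s for s in range(len(means)) if s != state]
        state = int(rng.choice(others))
    signal = rng.normal(np.asarray(means)[labels], np.asarray(stds)[labels])
    return labels, signal
```

For example, `simulate_segments(1000, [0, 1, 3, 6], [1, 1, 1, 1], [2, 2, 2, 2], [50, 20, 5, 3])` produces long insensitive stretches and short hypersensitive ones, indexed in the order of `STATES`.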

Spanning the gaps Beginning in State 1 (Insensitive)

Spanning the gaps Beginning in State 4 (Major DHS)

Selecting the number of states

Improved fit to the data
Panels: Insensitive; Intermediate sensitivity; Minor DHS; Major DHS
Each panel is a QQ plot comparing the observed residuals with the theoretical Gaussian.

Capturing different scales

Enrichment of biologically relevant features

Future directions
Many types of genomic data: phylogenetic conservation scores, various histone modifications, replication timing, etc.
Perform segmentations in multiple dimensions simultaneously.
Assign statistical significance to observed segments.

Shotgun proteomics
Figure labels: Training PSMs; Test PSMs; Probability Model; Trained Model; Evaluation
PSM = peptide-spectrum match

Peptide sequence influences peak height

Bayesian network
We model peptide fragmentation using a Bayesian network.
Nodes represent random variables, and edges represent conditional dependencies.
Each node stores a conditional probability table (CPT) giving Pr(node|parents).
Nodes shown: Is b-ion observed?; b-ion intensity
CPT for b-ion intensity:
  no b-ion observed: 1.00, 0.00
  b-ion observed: 0.75, 0.25
Column labels: intensity > 50%, intensity < 50%
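As a small worked example of how a CPT is used, the sketch below combines the table on this slide with a prior on "Is b-ion observed?" to get marginal and posterior probabilities. Two things are assumptions rather than facts from the slide: the prior of 0.6 is hypothetical (the slide gives only the conditional table), and the 1.00 entry for "no b-ion observed" is taken to belong to the low-intensity column, since the flattened slide text leaves the column order ambiguous.

```python
# Hypothetical prior Pr(b-ion observed); not given on the slide.
p_observed = {True: 0.6, False: 0.4}

# Pr(intensity | observed). Column assignment is an assumption (see lead-in).
cpt_intensity = {
    False: {"low": 1.00, "high": 0.00},
    True: {"low": 0.75, "high": 0.25},
}

def joint(observed, intensity):
    """Pr(observed, intensity) = Pr(observed) * Pr(intensity | observed)."""
    return p_observed[observed] * cpt_intensity[observed][intensity]

def marginal_intensity(intensity):
    """Pr(intensity), summing the joint over the parent node."""
    return sum(joint(o, intensity) for o in (True, False))

def posterior_observed(intensity):
    """Pr(observed = True | intensity) by Bayes' rule."""
    return joint(True, intensity) / marginal_intensity(intensity)
```

Under these assumed numbers, a high-intensity peak implies the b-ion was observed with probability 1, while a low-intensity reading leaves the question open.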

Ion series modeled in a Markov chain
Figure labels: "Is b-ion observed?" and "b-ion intensity", each repeated five times along the chain.
Similar to PepHMM (Han et al., 2005).

A more realistic model
Nodes: Is b-ion observed?; b-ion intensity; N-term AA; C-term AA; Is ion detectable?; Fractional m/z; Is proton mobile?

Ion series modeled in a Markov chain

Vectors of log-odds ratios
Figure labels: Correct peptide-spectrum matches; Incorrect peptide-spectrum matches

Binary classifier

Model Evaluation: Accuracy
Figure labels: Training PSMs; Test PSMs; Probability Model; Trained Model; Evaluation

Model       Redundant TP/FP    Unique TP/FP
Bayes Net   285/300, 95%       137/144, 95.1%
SEQUEST     288/300, 96%       136/144, 94.4%
InsPecT     274/300, 91.3%     131/144, 90.9%

An incorrect identification
Bayes net: HQDETQDALNALDLLTNEK
SEQUEST: LRPGAELLEGAHVGNFVEMK
This peptide does not appear in E. coli, the organism from which this protein sample was derived.
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2

Co-eluting peptides
SEQUEST: AFPEAVLFIHPLDAK
Bayes net: DVFVHFSALQGNQFK
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2

Future directions
Build a single Bayesian network that includes all ion types.
Produce more descriptive outputs from the Bayesian network for input to the classifier.
Add more biophysical details to the model: chromatography retention time, a better mass-to-charge estimate, etc.
Generate a better (larger, more accurate) gold standard data set.

Acknowledgments
DNase I hypersensitivity: John Stamatoyannopoulos, Pete Sabo, Scott Kuehn, many others in the Stam lab
Wavelet analysis: Bob Thurman
Change point model: Charles Lawrence, Heng Lian, William Thompson
Mass spectrometry: Aaron Klammer, Jeff Bilmes, Sheila Reynolds, Michael MacCoss