
+ Protein and gene model inference based on statistical modeling in k-partite graphs Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann

+ Problem Description Given peptides and scores/probabilities, infer the set of proteins present in the sample. [Figure: bipartite graph connecting peptides PERFGKLMQK, MLLTDFSSAWCR, FFRDESQINNR, and TGYIPPPLJMGKR to Proteins A, B, and C]

+ Previous Approaches
- N-peptides rule
- ProteinProphet (Nesvizhskii et al., Anal Chem): assumes peptide scores are correct.
- Nested mixture model (Li et al., Ann Appl Statist): rescores peptides while doing the protein inference; does not allow shared peptides; assumes peptide scores are independent.
- Hierarchical statistical model (Shen et al., Bioinformatics): allows for shared peptides; assumes PSM scores for the same peptide are independent; impractical on datasets of normal size.
- MSBayesPro (Li et al., J Comput Biol): uses peptide detectabilities to determine peptide priors.

+ Markovian Inference of Proteins and Gene Models (MIPGEM)
- Includes shared/degenerate peptides in the model.
- Treats peptide scores/probabilities as random variables.
- Allows dependence among peptide scores.
- Infers gene models.

+ Why scores as random values? [Figure: the same peptide-protein bipartite graph as on the Problem Description slide]

+ Building the bipartite graph
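A minimal sketch of this step in Python, assuming a peptide-to-protein map like the toy example from the Problem Description slide (the mapping and identifiers are illustrative, not from the paper): build the bipartite graph and extract its connected components, which the model later treats as independent units.

```python
from collections import defaultdict

# Hypothetical peptide -> protein mapping taken from the toy figure;
# in practice this comes from matching peptides against the protein database.
peptide_to_proteins = {
    "PERFGKLMQK": ["ProteinA"],
    "MLLTDFSSAWCR": ["ProteinA", "ProteinB"],  # shared/degenerate peptide
    "FFRDESQINNR": ["ProteinB"],
    "TGYIPPPLJMGKR": ["ProteinC"],
}

def connected_components(pep2prot):
    """Group peptide and protein nodes into connected components
    of the bipartite graph."""
    adj = defaultdict(set)  # undirected adjacency over both node types
    for pep, prots in pep2prot.items():
        for prot in prots:
            adj[("pep", pep)].add(("prot", prot))
            adj[("prot", prot)].add(("pep", pep))
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # depth-first search from an unvisited node
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        components.append(comp)
    return components

for comp in connected_components(peptide_to_proteins):
    print(sorted(comp))
# One component holds Proteins A and B (linked by the shared peptide),
# the other holds Protein C alone.
```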

+ Shared peptides

+ Definitions Let p_i be the score/probability of peptide i; I is the set of all peptides. Let Z_j be the indicator variable for protein j (1 if present, 0 if absent); J is the set of all proteins.

+ Simple Probability Rules

+ Bayes Rule

P(Z_j = 1 | {p_i}) = P({p_i} | Z_j = 1) · P(Z_j = 1) / P({p_i})

Here P(Z_j = 1) is the prior probability that the protein is present, P({p_i} | Z_j = 1) is the probability of observing these peptide scores given that the protein is present, and the denominator P({p_i}) is the joint probability of seeing these peptide scores.
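As a sanity check of the rule, a toy single-peptide, single-protein calculation in Python; every number here is invented for illustration:

```python
# Toy single-protein, single-peptide version of the Bayes rule above.
prior = 0.5          # P(Z_j = 1): prior that the protein is present
lik_present = 2.0    # density of the observed peptide score given Z_j = 1
lik_absent = 0.4     # density of the observed peptide score given Z_j = 0

evidence = lik_present * prior + lik_absent * (1 - prior)  # P(p_i)
posterior = lik_present * prior / evidence                 # P(Z_j = 1 | p_i)
print(posterior)  # 0.833...
```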

+ Assumptions The prior probabilities of the proteins are independent: P(Z_1, ..., Z_|J|) = ∏_{j in J} P(Z_j). Dependencies could be included with a little more effort. Note: prior independence does not mean the proteins are independent a posteriori; dependence enters through shared peptides.

+ Assumptions Connected components of the bipartite peptide-protein graph are independent.

+ Assumptions Peptide scores are independent given their neighboring proteins:

P({p_i}_{i in I_r} | {Z_j}_{j in R(I_r)}) = ∏_{i in I_r} P(p_i | {Z_j}_{j in Ne(i)})

Ne(i) is the set of proteins connected to peptide i in the graph; I_r is the set of peptides belonging to the r-th connected component; R(I_r) is the set of proteins connected to peptides in I_r.
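Under this assumption the conditional likelihood factorizes into one term per peptide. A sketch, where `cond_density` stands in for the mixture model introduced on the next slide:

```python
import math

def loglik_given_proteins(scores, neighbors, z, cond_density):
    """log P({p_i}_{i in I_r} | {z_j}_{j in R(I_r)}) under the
    conditional-independence assumption: one factor per peptide,
    each depending only on its neighboring proteins Ne(i)."""
    total = 0.0
    for pep, p in scores.items():
        z_ne = {j: z[j] for j in neighbors[pep]}  # indicators for Ne(i)
        total += math.log(cond_density(p, z_ne))
    return total
```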

+ Assumptions Conditional peptide probabilities are modeled by a mixture model; the specific mixture used is tied to the peptide scores taken as input (PeptideProphet probabilities). A generic stand-in is sketched below.
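The exact mixture in MIPGEM is tied to the PeptideProphet scores; the stand-in below is only a generic two-component mixture in which the weight on the "peptide is correct" density f1 grows with the fraction of neighboring proteins that are present. The b1/b2 parametrization and the toy densities are assumptions for illustration, not the paper's form.

```python
def cond_density(p, z_ne, b1=0.2, b2=0.6):
    """Stand-in for P(p_i | {z_j}, j in Ne(i)): a two-component mixture
    whose weight on f1 increases with the fraction of present neighbors.
    All parameter values and densities are illustrative assumptions."""
    frac = sum(z_ne.values()) / len(z_ne) if z_ne else 0.0
    w = min(1.0, b1 + b2 * frac)   # mixture weight on f1
    f1 = 2.0 * p                   # toy density on [0, 1], favors high scores
    f0 = 2.0 * (1.0 - p)           # toy density on [0, 1], favors low scores
    return w * f1 + (1.0 - w) * f0
```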

+ Bayes Rule (revisited)

P(Z_j = 1 | {p_i}) = P({p_i} | Z_j = 1) · P(Z_j = 1) / P({p_i})

The prior P(Z_j = 1), the probability of the scores given the protein, and the joint score probability are now each in hand; the next slides work out the remaining terms.

+ Joint peptide score distribution Assumption: peptides in different components are independent, so

P({p_i}_{i in I}) = ∏_r P({p_i}_{i in I_r})

and each factor is obtained by summing the conditional likelihood over the protein indicators:

P({p_i}_{i in I_r}) = Σ_{z} P({p_i}_{i in I_r} | {Z_j = z_j}_{j in R(I_r)}) · ∏_{j in R(I_r)} P(Z_j = z_j)

I_r is the set of peptides in component r; R(I_r) is the set of proteins connected to peptides in I_r.
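Combining component independence with the factorized conditional likelihood, the evidence for one component can be computed by brute force, summing over all 2^|R(I_r)| protein configurations. A sketch, feasible only because connected components tend to be small:

```python
import itertools

def score_evidence(scores, neighbors, proteins, prior, cond_density):
    """P({p_i}_{i in I_r}) for one connected component: sum the conditional
    likelihood over every protein presence/absence configuration,
    weighted by the independent priors."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(proteins)):
        z = dict(zip(proteins, bits))
        lik = 1.0
        for pep, p in scores.items():
            lik *= cond_density(p, {j: z[j] for j in neighbors[pep]})
        w = 1.0
        for j in proteins:
            w *= prior if z[j] else 1.0 - prior
        total += lik * w
    return total
```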

+ Conditional Probability: mixture model

+ Conditional Probability: mixture model (continued)

+ f_1(x): the pdf of p_i given the protein indicators {z_j} [density plot shown on slide, with the median of the peptide scores marked]

+ Choosing b_1 and b_2 Seek the values of b_1 and b_2 that maximize the log-likelihood of observing the peptide scores.
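A numerical version of this fit could look like the sketch below, using a generic optimizer; `marginal_density` is a hypothetical stand-in for the marginal score density after summing out the protein indicators (the next slide notes that the authors arrive at a more direct answer):

```python
import math
from scipy.optimize import minimize

def fit_b(scores, marginal_density):
    """Pick (b1, b2) maximizing the log-likelihood of the observed scores.
    `marginal_density(p, b1, b2)` is a hypothetical helper, not the paper's."""
    def neg_loglik(b):
        return -sum(math.log(marginal_density(p, b[0], b[1])) for p in scores)
    res = minimize(neg_loglik, x0=[0.3, 0.3],
                   bounds=[(1e-6, 1.0), (1e-6, 1.0)], method="L-BFGS-B")
    return res.x  # fitted (b1, b2)
```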

+ Choosing b_1 and b_2 It turns out: [resulting expressions shown on slide]

+ Conditional Protein Probabilities

+ Conditional Protein Probabilities (NEC correction)

+ Conditional Protein Probabilities
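Given the evidence computation sketched earlier, a protein's posterior is the same configuration sum restricted to z_j = 1, divided by the full sum; again a brute-force illustration rather than the paper's algorithm:

```python
import itertools

def protein_posterior(target, scores, neighbors, proteins, prior, cond_density):
    """P(Z_target = 1 | {p_i}) by brute-force marginalization over the
    protein configurations of one connected component."""
    num = den = 0.0
    for bits in itertools.product([0, 1], repeat=len(proteins)):
        z = dict(zip(proteins, bits))
        lik = 1.0
        for pep, p in scores.items():
            lik *= cond_density(p, {j: z[j] for j in neighbors[pep]})
        w = 1.0
        for j in proteins:
            w *= prior if z[j] else 1.0 - prior
        term = lik * w
        den += term
        if z[target]:   # restrict numerator to configurations with Z_target = 1
            num += term
    return num / den
```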

+ Shared Peptides

+ Shared Peptides If the shared peptide has p_i ≥ median

+ Shared Peptides If the shared peptide has p_i < median

+ Gene Model Inference

+ Assume a gene model, X, has only protein sequences which belong to the same connected component. [Figure: Peptides 1-4 connected to Proteins A and B, which both map to Gene X]

+ Gene Model Inference Assume a gene model, X, has only protein sequences which belong to the same connected component. R(X) is the set of proteins with edges to X; I_{r(X)} is the set of peptides with edges to proteins with edges to X.

+ Gene Model Inference Gene model X has proteins from different connected components of the peptide-protein graph. [Figure: Peptides 1-4 and Proteins A and B lying in separate components, both mapping to Gene X]

+ Gene Model Inference Gene model X has proteins from different connected components of the peptide-protein graph. R_l(X) is the set of proteins with edges to X in component l; I_{l(X)} is the set of peptides with edges to proteins with edges to X in component l.
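If one reads "gene model present" as "at least one of its protein isoforms is present" (my assumption, not stated on the slide), the independence of components makes the combination a product of absence probabilities:

```python
def gene_posterior(component_absence_probs):
    """P(gene model X present | data) when X's proteins span several
    independent components. Each entry of `component_absence_probs` is
    P(all of X's proteins in component l are absent | that component's
    scores), computable with the enumeration sketched earlier."""
    p_absent = 1.0
    for q in component_absence_probs:
        p_absent *= q   # components are independent, so absences multiply
    return 1.0 - p_absent

# Example: two components where X's proteins are absent with prob. 0.3 and 0.5
print(gene_posterior([0.3, 0.5]))  # 0.85
```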

+ Datasets
- Mixture of 18 purified proteins
- Mixture of 49 proteins (Sigma49)
- Drosophila melanogaster
- Saccharomyces cerevisiae (~4200 proteins)
- Arabidopsis thaliana (~4580 gene models)

+ Comparisons with other tools Small datasets with a known answer: the mixture of 18 proteins and Sigma49.

+ Comparisons with other tools One-hit wonders: results on Sigma49 with and without proteins identified by a single peptide.

+ Comparisons with other tools The Arabidopsis thaliana dataset has many proteins with high sequence similarity.

+ Splice isoforms

+ Conclusion + Criticism
Developed a model for protein and gene model inference. The comparisons with other tools, however, do not justify the model's complexity:
- A small false-positive rate bought at the expense of many false negatives is not the right trade-off for every application.
- The model discards some useful information, such as the number of spectra per peptide.
- The parsimony assumptions made when pruning the graph may be too aggressive.