
1 Consistent probabilistic outputs for protein function prediction William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

2 Outline Motivation and background Methods –Shared base method –Reconciliation methods Results

3 The problem Given: protein sequence, knockout phenotype, gene expression profile, protein-protein interactions, and phylogenetic profile. Predict: a probability for every term in the Gene Ontology. –Heterogeneous data –Missing data –Multiple labels per gene –Structured output

4 Consistent predictions Cytoplasmic membrane-bound vesicle (GO:0016023) is a cytoplasmic vesicle (GO:0031410). The probability that protein X is annotated with "cytoplasmic membrane-bound vesicle" must therefore be less than or equal to the probability that protein X is annotated with "cytoplasmic vesicle."
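
A minimal sketch of the consistency requirement, assuming predictions are stored in a dictionary keyed by GO identifier and the ontology is given as a list of (child, parent) is-a edges. The function name and the probability values are hypothetical and chosen only for illustration.

```python
def find_violations(prob, is_a_edges):
    """Return the (child, parent) edges where P(child) > P(parent)."""
    return [(child, parent)
            for child, parent in is_a_edges
            if prob[child] > prob[parent]]


# The single is-a edge from the slide: membrane-bound vesicle -> vesicle.
is_a_edges = [("GO:0016023", "GO:0031410")]

# Hypothetical predictions for one protein.
consistent = {"GO:0016023": 0.30, "GO:0031410": 0.45}
inconsistent = {"GO:0016023": 0.60, "GO:0031410": 0.45}

print(find_violations(consistent, is_a_edges))    # no violated edges
print(find_violations(inconsistent, is_a_edges))  # the one edge is violated
```

A set of predictions is consistent in the sense of this slide when the returned list is empty.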

5 Data sets

6

7 Kernels

8 SVM → Naïve Bayes [Diagram: Data 1 to Data 33; SVM/AL 1 to SVM/AL 33; Probability 1 to Probability 33; product, plus Bayes' rule; Probability. Plot labels: Gaussian, asymmetric Laplace.]

9 SVM → logistic regression [Diagram: Data 1 to Data 33; SVM 1 to SVM 33; Logistic regressor 1 to Logistic regressor 11; Predict 1 to Predict 33; Probability.]

10 Reconciliation Methods 3 heuristic methods 3 Bayesian networks 1 cascaded logistic regression 3 projection methods

11 Heuristic methods Max: Report the maximum probability of self and all descendants. And: Report the product of probabilities of all ancestors and self. Or: Compute the probability that at least one descendant of the GO term is “on,” assuming independence. All three methods use probabilities estimated by logistic regression.
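
The three heuristics can be sketched on a toy ontology as follows. The ontology, the term names, and the input probabilities are hypothetical. Including the term itself in the "Or" product is an assumption, made so that a leaf term keeps its own probability.

```python
from math import prod

# Hypothetical toy ontology: term -> list of parents (is-a edges).
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}


def ancestors(term):
    """All strict ancestors of a term, following parent edges."""
    seen, stack = set(), list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen


def descendants(term):
    """All strict descendants of a term."""
    return {t for t in parents if term in ancestors(t)}


def reconcile_max(p):
    # Maximum probability of self and all descendants.
    return {t: max([p[t]] + [p[d] for d in descendants(t)]) for t in p}


def reconcile_and(p):
    # Product of the probabilities of self and all ancestors.
    return {t: p[t] * prod(p[a] for a in ancestors(t)) for t in p}


def reconcile_or(p):
    # Probability that self or at least one descendant is "on",
    # treating the terms as independent.
    return {t: 1.0 - (1.0 - p[t]) * prod(1.0 - p[d] for d in descendants(t))
            for t in p}


# Hypothetical unreconciled probabilities; "c" exceeds its parents.
p = {"root": 0.9, "a": 0.5, "b": 0.4, "c": 0.6}
for name, method in [("max", reconcile_max), ("and", reconcile_and),
                     ("or", reconcile_or)]:
    print(name, method(p))
```

On this input all three outputs satisfy the child-at-most-parent constraint, but they move the probabilities in different directions: "Max" and "Or" raise ancestors, while "And" lowers descendants.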

12 Bayesian network Belief propagation on a graphical model with the topology of the GO. Given Y_i, the distribution of each SVM output X_i is modeled as an independent asymmetric Laplace distribution. Solved using a variational inference algorithm. “Flipped” variant: reverse the directionality of edges in the graph.

13 Cascaded logistic regression Fit a logistic regression to the SVM output only for those proteins that belong to all parent terms. Models the conditional distribution of the term, given all parents. The final probability is the product of these conditionals:
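
The equation from this slide did not survive in the transcript. One form consistent with the description is sketched here. The notation is an assumption, not the slide's own: anc(i) is the set of ancestors of term i, pa(j) the set of parents of term j, and x_j the SVM output for term j.

```latex
\[
P(Y_i = 1 \mid x)
  \;=\; \prod_{j \,\in\, \{i\} \cup \mathrm{anc}(i)}
        P\bigl(Y_j = 1 \,\bigm|\, Y_{\mathrm{pa}(j)} = 1,\; x_j\bigr)
\]
```

Every factor is at most 1, and the product for a child term contains all the factors of each of its parents, so the resulting probabilities respect the ontology constraints by construction.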

14 Isotonic regression Consider the squared Euclidean distance between two sets of probabilities. Find the closest set of probabilities to the logistic regression values that satisfy all the inequality constraints.
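
Written as an optimization problem, this is a projection onto the set of consistent probability vectors. The notation below is an assumption: \hat{p}_i is the logistic regression output for term i, q_i the reconciled probability, and E the set of is-a edges (j, i) with child j and parent i.

```latex
\[
\min_{q} \; \sum_{i} \bigl(q_i - \hat{p}_i\bigr)^2
\quad \text{subject to} \quad
q_j \le q_i \;\; \text{for every edge } (j, i) \in E .
\]
```

For a single edge whose child probability exceeds the parent probability, the solution sets both values to their average; values that already satisfy every constraint are left unchanged.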

16 Kullback-Leibler projection Kullback-Leibler projection onto the set of distributions that factorize according to the ontology graph. Two variants, depending on the direction of the edges.

17 Hybrid method Replace the Bayesian log posterior for Y_i by the marginal log posterior obtained from the logistic regression. Uses discriminative posteriors from logistic regression, but still uses a structural prior. [Diagram labels: BPAL, KLP, BPLR; likelihood ratios obtained from logistic regression.]

18 Axes of evaluation Ontology –biological process –cellular component –molecular function Term size –3-10 proteins –11-30 proteins –31-100 proteins –101-200 proteins Evaluation mode –Joint evaluation –Per protein –Per term Recall –1% –10% –50% –80%

19 Legend Belief propagation, asymmetric Laplace; Belief propagation, asymmetric Laplace, flipped; Belief propagation, logistic regression; Cascaded logistic regression; Isotonic regression; Logistic regression; Kullback-Leibler projection; Kullback-Leibler projection, flipped; Naïve Bayes, asymmetric Laplace

20 Precision = TP / (TP + FP); Recall = TP / (TP + FN). Joint evaluation, biological process ontology, large terms (101-200)
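
The two definitions can be computed at a fixed score threshold as in the sketch below. The function name, the example scores, and the convention for empty denominators are assumptions made for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)      # true positives
    fp = sum(1 for s, y in pairs if s >= threshold and not y)  # false positives
    fn = sum(1 for s, y in pairs if s < threshold and y)       # false negatives
    # Assumed convention: precision is 1.0 when nothing is predicted positive,
    # and recall is 0.0 when there are no true annotations.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted probabilities and true annotations (1 = annotated).
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
print(precision_recall(scores, labels, threshold=0.5))  # (0.5, 0.5)
```

Sweeping the threshold from high to low traces out a precision-recall curve, which is how precision at a fixed recall level (such as 1%, 10%, 50% or 80%) can be read off.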

21 Biological process ontology

22 Molecular function ontology

23 Cellular component ontology

24 Conclusions: Joint evaluation Reconciliation does not always help. Isotonic regression performs well overall, especially for recall > 20%. For lower recall values, both Kullback-Leibler projection methods work well.

25 Average precision per protein Biological process All term sizes

26 Biological process

27 Statistical significance Biological process Large terms

28 Biological process Large terms

29 Biological process 101-200 31-100 11-30 3-10 953 proteins 435 proteins 239 proteins 100 proteins

30 Molecular function 101-200 31-100 11-30 3-10 476 proteins 142 proteins 111 proteins 35 proteins

31 Cellular component 101-200 31-100 11-30 3-10 196 proteins 135 proteins 171 proteins 278 proteins

32 Conclusions: per protein Several methods perform well –Unreconciled logistic regression –Unreconciled naïve Bayes –Isotonic regression –Belief propagation with asymmetric Laplace For small terms –For molecular function and biological process, we do not observe many significant differences. –For cellular components, belief propagation with logistic regression works well.

33 Average precision per term Biological process All term sizes

34 Biological process 101-200 31-100 11-30 3-10 953 terms 435 terms 239 terms 100 terms

35 Molecular function 101-200 31-100 11-30 3-10 476 terms 142 terms 111 terms 35 terms

36 Cellular component 101-200 31-100 11-30 3-10 152 terms 97 terms 48 terms 30 terms

37 Conclusions Reconciliation does not always help. Isotonic regression (IR) performs well overall. For small biological process and molecular function terms, it is less clear that IR is one of the best methods.

38 Acknowledgments Guillaume Obozinski, Charles Grant, Gert Lanckriet, Michael Jordan. The mousefunc organizers: Tim Hughes, Lourdes Pena-Castillo, Fritz Roth, Gabriel Berriz, Frank Gibbons

39 Per term for small terms Biological process Molecular function Cellular component

