Evidence Integration in Bioinformatics Phil Long Columbia University.


1 Evidence Integration in Bioinformatics Phil Long Columbia University

2 A little molecular biology DNA consists of nucleotides (A, C, T and G) arranged linearly along chromosomes. Regions of DNA, called genes, encode proteins. Proteins are biochemical workhorses. Proteins are made up of amino acids, also strung together linearly, and fold up to form a 3D structure.

3 Problems What do proteins do? Which pairs of proteins work together?

4 Evidence of related function Similar protein sequences: BLAST, Smith-Waterman, FASTA. Evidence of interaction (see previous slide).

5 Evidence of protein-protein interaction Direct: yeast two-hybrid, in-vivo pulldown, … Indirect: co-expression, common essentiality, similar functional annotations, … All yield errors; complementary. How best to combine? Goals: automatic, high throughput.

6 Combining using machine learning Supervised analysis (useful gold standard designations available): Bayes Nets, Support Vector Machines (kernel fusion), Boosting (RankBoost). Unsupervised analysis.

7 Overfitting and inductive bias Overfitting: capturing insignificant details of the data at the expense of useful trends. Inductive bias: an a priori preference for some explanations of the data over others (e.g. "simple" explanations are better). Stronger inductive bias better protects against overfitting. Theme: exploit special structure of evidence integration problems to fashion appropriate inductive biases.

8 Supervised Learning with Bayes Nets

9 Bayesian Networks (a.k.a. Bayes Nets) Used to define the joint probability distribution of multiple variables. Useful for describing sparse direct dependence among variables: the graph describes which direct dependencies exist; the parameters describe what kind of direct dependencies they are. (Pearl, 1988) (Kasif, Koller, Friedman, Segal, Jordan, …)

10 Bayes Net - Example Earthquake and Burglar are parents of Alarm; Alarm is the parent of Neighbor Call. (Friedman and Goldszmidt)
Alarm parameters, Pr(A = 1 | B, E):
E  B  Pr(A=1|B,E)
0  0  0.1
0  1  0.81
1  0  0.9
1  1  0.91
Neighbor Call parameters, Pr(N = 1 | A):
A  Pr(N=1|A)
0  0.2
1  0.9
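
Below is a minimal sketch (in Python) of how this example's joint distribution factorizes. The conditional tables are taken from the slide; the priors on Burglar and Earthquake are made-up placeholders, since the slide does not give them.

```python
# Minimal sketch of the slide's alarm network: the joint factorizes as
# Pr(B) * Pr(E) * Pr(A | B, E) * Pr(N | A).
# The priors Pr(B=1) and Pr(E=1) below are made-up placeholders; the slide
# only gives the conditional tables for Alarm and Neighbor Call.

P_B = {1: 0.01, 0: 0.99}           # hypothetical prior on Burglar
P_E = {1: 0.02, 0: 0.98}           # hypothetical prior on Earthquake
P_A = {(0, 0): 0.1, (0, 1): 0.81,  # Pr(A=1 | E, B) from the slide, keyed by (E, B)
       (1, 0): 0.9, (1, 1): 0.91}
P_N = {0: 0.2, 1: 0.9}             # Pr(N=1 | A) from the slide

def joint(b, e, a, n):
    """Probability of one full assignment (b, e, a, n), each 0 or 1."""
    pa = P_A[(e, b)] if a == 1 else 1.0 - P_A[(e, b)]
    pn = P_N[a] if n == 1 else 1.0 - P_N[a]
    return P_B[b] * P_E[e] * pa * pn

print(joint(b=1, e=0, a=1, n=1))  # burglary, no quake, alarm rings, neighbor calls
```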

11 Naïve Bayes To use: learn the joint distribution; score with Pr(Interact | Co-expression, Essentiality, Function, …). (Diagram: Interact is the parent of Co-expression, Essentiality, Function, ….) (Jansen, et al, 2003)
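
A toy naïve Bayes scorer along these lines might look as follows; the prior and conditional tables are illustrative placeholders, not the values estimated by Jansen et al.

```python
# Illustrative naive Bayes scorer in the spirit of the slide: the evidence
# variables are assumed conditionally independent given "Interact".
# All probabilities below are made-up placeholders.

prior_interact = 0.001                     # Pr(Interact = 1), assumed
likelihood = {                             # Pr(evidence value | Interact) per variable
    "coexpression": {1: {"high": 0.6, "low": 0.4}, 0: {"high": 0.2, "low": 0.8}},
    "essentiality": {1: {"both": 0.5, "other": 0.5}, 0: {"both": 0.1, "other": 0.9}},
}

def interaction_score(evidence):
    """Return Pr(Interact = 1 | evidence) under the naive Bayes factorization."""
    p1, p0 = prior_interact, 1.0 - prior_interact
    for var, value in evidence.items():
        p1 *= likelihood[var][1][value]
        p0 *= likelihood[var][0][value]
    return p1 / (p1 + p0)

print(interaction_score({"coexpression": "high", "essentiality": "both"}))
```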

12 Hierarchical Naïve Bayes When a subset of variables is related: use Naïve Bayes for the related subset; form a new variable with the resulting scoring function; add it to the group and do a new Naïve Bayes. Possible interpretation: hidden Y2H variable. ~(Jansen, et al, 2003) (Diagram: Uetz Y2H and Ito Y2H are children of a hidden Y2H variable, which joins Co-expression, Essentiality and Function as children of Interact.)

13 Supervised learning with SVMs (kernel fusion)

14 Support Vector Machines for Classification Uses a similarity function K (called the "kernel function"). Special examples ("support vectors") are chosen and given weights. New items are assigned class predictions using the special examples and weights. Principles: each example "votes" for its class; more similar, bigger vote; bigger weight, bigger vote. Specifically, if the classes are {-1, 1} and the examples are (x_1, y_1), …, (x_m, y_m), then the output classifier takes the form h(x) = sign(α_1 y_1 K(x, x_1) + … + α_m y_m K(x, x_m)). (Boser, Guyon and Vapnik, 1992) (Haussler, Jaakkola, Vert, Leslie, Weston, …)
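
A small sketch of this decision rule, with an RBF kernel and made-up support vectors and weights standing in for the output of training:

```python
import numpy as np

# Sketch of the decision rule on the slide: support vectors x_i vote for their
# class y_i, weighted by alpha_i and by similarity K(x, x_i).
# The RBF kernel and the toy support vectors/weights are illustrative choices.

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def svm_predict(x, support_vectors, labels, alphas, kernel=rbf_kernel):
    score = sum(a * y * kernel(x, sv)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return np.sign(score)

support_vectors = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
labels = [-1, 1]
alphas = [0.7, 0.7]
print(svm_predict(np.array([0.9, 0.8]), support_vectors, labels, alphas))
```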

15 Support Vector Machine Training Weights and support vectors are chosen to optimize a scoring function that trades the complexity of the classifier against its fit to the data. Strong theoretical foundation. Globally optimal solution found fast (polynomial time). Successful in many domains, including bioinformatics.

16 Kernel fusion Suppose we have kernels K_1, K_2, …, K_n. Idea: use μ_1 K_1 + μ_2 K_2 + … + μ_n K_n; learn the μ_i's together with the α_i's; can be done in polynomial time (semidefinite programming). Implicit view: similarity according to the K_i's provides independent sources of evidence of the correct class. (Lanckriet, et al, 2004)
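
As a sketch, forming the fused kernel for fixed weights is just a weighted sum of Gram matrices; the actual method of Lanckriet et al. learns the μ_i's jointly with the α_i's by semidefinite programming, which is not shown here.

```python
import numpy as np

# Forming the fused kernel mu_1*K_1 + ... + mu_n*K_n for fixed weights.
# Learning the mu_i jointly with the alpha_i (the semidefinite program) is omitted.

def fuse_kernels(kernel_matrices, weights):
    """Weighted sum of precomputed Gram matrices (all m x m, weights >= 0)."""
    fused = np.zeros_like(kernel_matrices[0])
    for K, mu in zip(kernel_matrices, weights):
        fused += mu * K
    return fused

# Toy example: two 3x3 Gram matrices from different data sources.
K1 = np.eye(3)
K2 = np.full((3, 3), 0.5) + 0.5 * np.eye(3)
print(fuse_kernels([K1, K2], weights=[0.3, 0.7]))
```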

17 Evaluation - ROC Curve (Diagram: predictions sorted by score, e.g. F F F T T T T; true positives and false positives are counted as a threshold moves down the sorted list.)

18 Evaluation - ROC curve (Diagram: true positives plotted against false positives, both axes running from 0 to 1; the evaluation measure is the area under the ROC curve.)
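
The area under the ROC curve equals the probability that a randomly chosen positive is ranked above a randomly chosen negative, so a simple (if quadratic-time) sketch of the evaluation is:

```python
# Sketch of the ROC-area evaluation used throughout: rank examples by score and
# measure the probability that a random positive outranks a random negative.
# (The later slides report AOC = 1 - AUC.)

def roc_auc(scores, labels):
    """AUC = Pr(random positive scores higher than random negative); ties count 1/2."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy example: a perfect ranking gives AUC = 1.0.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0]
print(roc_auc(scores, labels))
```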

19 Results – membrane protein prediction (Results essentially unaffected by adding random kernels.) (Lanckriet, et al, 2004)

20 Supervised learning with boosting (RankBoost)

21 RankBoost Learns a scoring function for ranking. Input: a list of "order preferences" (u_1 should be ranked above v_1, u_2 should be ranked above v_2, …) and a list of "base ranking functions". Output: a single ranking function built from the base rankers. (Freund, et al 1999)

22 RankBoost behavior Add base rankers one at a time in rounds: assign weights to ranking preferences, giving higher priority to pairs often out of order in earlier rankings; choose the base ranker to minimize the weighted average "loss" (how wrong the ranking was about a pair); assign a weight to the base ranker based on how well it did. Output: weighted sum of base ranker scoring functions. (Freund, et al 1999)
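
A simplified sketch of these rounds, assuming base rankers with outputs in [0, 1] and the standard r-based rule for the base-ranker weight; the preference pairs and base rankers below are toy placeholders.

```python
import math

# Simplified RankBoost sketch following the round structure on the slide.
# Base rankers are assumed to return values in [0, 1].

def rankboost(pairs, base_rankers, rounds=10):
    """pairs: list of (u, v) where u should be ranked above v."""
    D = {p: 1.0 / len(pairs) for p in pairs}        # distribution over preference pairs
    ensemble = []                                   # list of (alpha, base ranker)
    for _ in range(rounds):
        # choose the base ranker that best orders the currently hard pairs
        def r_value(h):
            return sum(D[(u, v)] * (h(u) - h(v)) for (u, v) in pairs)
        h_best = max(base_rankers, key=lambda h: abs(r_value(h)))
        r = r_value(h_best)
        if abs(r) >= 1.0 - 1e-12:
            r = math.copysign(1.0 - 1e-12, r)
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        ensemble.append((alpha, h_best))
        # upweight pairs the new ranker still gets wrong, then renormalize
        for (u, v) in pairs:
            D[(u, v)] *= math.exp(alpha * (h_best(v) - h_best(u)))
        total = sum(D.values())
        for p in pairs:
            D[p] /= total
    return lambda x: sum(a * h(x) for a, h in ensemble)

# Toy usage: items are numbers; one good and one reversed base ranker.
pairs = [(5, 2), (4, 1), (3, 2), (5, 1)]
rankers = [lambda x: x / 10.0, lambda x: (10 - x) / 10.0]
final_score = rankboost(pairs, rankers, rounds=5)
print(sorted([1, 2, 3, 4, 5], key=final_score, reverse=True))
```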

23 RankBoost Theoretically well-founded. Works well on combining search engine results and on movie preference prediction. Bioinformatics?

24 Unsupervised Learning with Bayes Nets (Segal, et al 2003)

25 Regulation of Expression Gene expression is regulated. Major mechanism: expression of gene G is regulated by proteins binding DNA near G. Proteins that regulate expression by binding DNA are called transcription factors. Sites where a transcription factor binds approximately conform to a motif.

26 Transcriptional Modules Groups of coordinately regulated genes often work together. Goal: find such transcriptional modules and identify associated motifs. Two common approaches, both using the results of microarrays (which measure expression) and the DNA sequence near regulated genes. Approach 1: cluster expression behaviors, then look for short DNA sequences over-represented in the clusters. Approach 2: learn functions from nearby DNA to expression behaviors, then look for important "features".

27 Bayes Net for Transcription Modules Variables: DNA sequence S_1, …, S_5; motifs R_1, …, R_3; module M (values are 1, 2, 3, …); expression experiments E_1, …, E_4. The functions defining the conditional distributions for the R_i's and M are restricted; additional conditional independence is implicit. During training, alternately: clusters of experiments influence the motif definitions; motif definitions influence the cluster determination. (Segal, et al 2003)

28 Unsupervised Evidence Integration (Long, et al 2005)

29 Problem Gold standard designations are often not available. Where a few are available, they are enriched for conclusions based on popular approaches.

30 Examples Protein-protein interactions in higher organisms. Pairing equivalent genes ("orthologs") between newly sequenced genomes and human: direct sequence comparison; phylogenetic analysis; local conservation of gene order. Meta-analysis of multiple screens for differential expression using microarrays: for each gene, each study gives a score; how to combine?

31 More generally Have a (very large) list of candidate propositions. For each, several scores. Which propositions are most likely to be true?

32 Notation Y indicates whether the proposition is true or not; X_1, …, X_n are the scores.
Input (scores) and output (example):
X1  X2  X3  X4  X5   Output
2   3   0   0   1    0.3
0   1   0   ?   1    0.01
8   4   5   4   10   0.98

33 Isn’t this just clustering? Specific opportunities/challenges: Each X i positively associated with Y ; bigger X i → more likely that Y=1 Variables X 1,...,X n often complementary Minority class minute Classes overlap significantly

34 Related Theoretical Work [MV03] – Problem Training example generation: all variables are {0,1}-valued; Y is chosen randomly, then fixed; X_1, …, X_n are chosen independently with Pr(X_i = Y) = p_i, where p_i is unknown and is the same when Y is 0 or 1 (crucial for the analysis); only X_1, …, X_n are given to the training algorithm. Goal: given m training examples generated as above, output an accurate classifier h that, given scores, outputs a prediction of the corresponding Y.

35 Related Theoretical Work [MV03] – Results If n ≥ 3, can approach the Bayes error (best possible for the source) as m gets large. Idea: a variable is "good" if it often agrees with the others: we can determine how good variables are by looking at how often pairs agree; we can estimate how often pairs agree from training data; we can plug in to get estimates of how good the variables are; and we can use those to approximate the optimal classifier: use everybody, but listen to the good ones more.
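
A sketch of the agreement-rate idea on simulated data. Under the conditional-independence assumption above, Pr(X_i = X_j) = p_i p_j + (1 − p_i)(1 − p_j), so (2p_i − 1)(2p_j − 1) = 2·Pr(X_i = X_j) − 1, and each p_i can be recovered from the agreement rates involving two helper variables. The data below are simulated, not from the paper.

```python
import numpy as np

# [MV03]-style estimation: recover each variable's accuracy p_i from pairwise
# agreement rates, without ever seeing the hidden label Y.

def agreement(X, i, j):
    return np.mean(X[:, i] == X[:, j])

def estimate_accuracy(X, i, j, k):
    """Estimate p_i using helper variables j and k."""
    cij = 2 * agreement(X, i, j) - 1
    cik = 2 * agreement(X, i, k) - 1
    cjk = 2 * agreement(X, j, k) - 1
    ci = np.sqrt(max(cij * cik / cjk, 0.0))   # ci = 2*p_i - 1
    return (1 + ci) / 2

# Simulate: hidden Y, three noisy votes with accuracies 0.9, 0.8, 0.7.
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=20000)
p = [0.9, 0.8, 0.7]
X = np.column_stack([np.where(rng.random(Y.size) < pi, Y, 1 - Y) for pi in p])
print([round(estimate_accuracy(X, i, *[j for j in range(3) if j != i]), 3)
       for i in range(3)])
```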

36 In our problem(s)... Some of X_1, …, X_n are continuous-valued, others discrete-valued. Reasonable to assume X_1, …, X_n are conditionally independent given Y. Reasonable to assume Pr(Y = 1 | X_i = x) is increasing in x, for all i. Pr(Y = 1) is typically very small.

37 Conditional independence Captures non-redundancy in the variables ("different views"). Consequence: associations among variables are due to their common association with the class designation. Promoted by preprocessing (merge redundant variables). Likely some conditional dependencies remain. Steps to make the algorithm robust against departures from the assumption: "shrink" estimates (be skeptical of indications of very strong associations); evaluate a variable using many pairs of other variables and take the median of the results. If the main source of dependence is the targeted class, there is hope…

38 Our Approach (Diagram: for each candidate proposition, the scores X_1, …, X_5 are converted to binary variables U_1, …, U_5.)

39 Notes Estimate Pr(U_i = 1 | Y = 1) by using agreement rates of U_i with all other pairs of variables (new analysis needed), and taking the median of the results. Use Bayes' Rule to approximate Pr(Y = 1 | X_i's); use this for the score.
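
A sketch of the scoring step only: given estimates of Pr(U_i = 1 | Y = 1), Pr(U_i = 1 | Y = 0) and the prior, Bayes' rule under conditional independence yields the score. The agreement-rate estimation itself (the part requiring the new analysis) is not shown, and the numbers below are placeholders.

```python
import numpy as np

# Scoring a candidate proposition from its binarized evidence vector via
# Bayes' rule; the conditional rates and prior are illustrative placeholders,
# standing in for the quantities estimated from agreement rates.

prior = 0.001                                          # Pr(Y = 1), assumed small
p_u_given_1 = np.array([0.7, 0.6, 0.8, 0.5, 0.65])     # Pr(U_i = 1 | Y = 1), placeholders
p_u_given_0 = np.array([0.05, 0.1, 0.02, 0.2, 0.08])   # Pr(U_i = 1 | Y = 0), placeholders

def score(u):
    """u: binary vector of thresholded evidence; returns approximate Pr(Y = 1 | U)."""
    like1 = np.prod(np.where(u == 1, p_u_given_1, 1 - p_u_given_1))
    like0 = np.prod(np.where(u == 1, p_u_given_0, 1 - p_u_given_0))
    return prior * like1 / (prior * like1 + (1 - prior) * like0)

print(score(np.array([1, 1, 0, 0, 1])))
```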

40 Evaluation: Yeast protein-protein data Uses data from the Jansen, et al study: a mix of experimental and indirect evidence. Preprocessing: summed two yeast two-hybrid variables; summed two in-vivo pulldown variables; inverted and summed the functional annotation variables. Gold standard designations as in Jansen, et al. 20+ million examples, 8000+ interactions, 5 variables.

41 Evaluation: other algorithms k-means; spectral algorithm: project onto the principal component (direction of maximum variance).
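
A sketch of the spectral baseline: center the score matrix and project each row onto the first principal component. The sign flip is one simple way to orient the component; the toy matrix is only for illustration.

```python
import numpy as np

# Spectral baseline: score each proposition by its projection onto the first
# principal component (direction of maximum variance) of the score matrix.

def spectral_scores(X):
    """X: (propositions x variables) score matrix; returns one score per row."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    direction = vt[0]
    if direction.sum() < 0:    # orient so larger raw scores tend to get larger projections
        direction = -direction
    return Xc @ direction

X = np.array([[2., 3., 0., 0., 1.],
              [0., 1., 0., 0., 1.],
              [8., 4., 5., 4., 10.]])
print(spectral_scores(X))
```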

42 Evaluation Run algorithms without using gold standard, generate scores Evaluate using gold standard with ROC score

43 Results: Protein-protein data
Algorithm   ROC score
Peer        0.961
Spectral    0.859
k-means     0.865

44 Results: Protein-protein data
Algorithm   Area Over ROC Curve
Peer        0.039
Spectral    0.141
k-means     0.135
AOC = 1 – AUC = Pr(random 0 ranked above random 1)

45 Evaluation: Artificial data Inspired by Altschul's characterization of the BLAST score distribution. Five variables, the X_i's conditionally independent given Y, Pr(Y = 1) = 0.01. All variables are uniform on [0,1] when Y = 1; each variable is exponential when Y = 0, with the means of the exponentials chosen randomly when the source is defined. Results averaged over 100 random sources, with 100000 examples for each source.
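
A sketch of this generator; the range from which the exponential means are drawn is not stated on the slide, so the interval used below is an assumption.

```python
import numpy as np

# Artificial source from the slide: Pr(Y = 1) = 0.01, scores uniform on [0, 1]
# when Y = 1 and exponential when Y = 0, with the exponential means drawn once
# per source. The range [0.05, 0.5] for the means is an assumption.

rng = np.random.default_rng(0)

def make_source(n_vars=5):
    means = rng.uniform(0.05, 0.5, size=n_vars)   # assumed range for the means
    def sample(n_examples):
        y = (rng.random(n_examples) < 0.01).astype(int)
        X = np.where(y[:, None] == 1,
                     rng.random((n_examples, n_vars)),
                     rng.exponential(means, size=(n_examples, n_vars)))
        return X, y
    return sample

sample = make_source()
X, y = sample(100000)
print(X.shape, y.mean())   # roughly 1% positives
```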

46 Results: Artificial source
Algorithm   ROC score
Peer        0.997
Spectral    0.974
k-means     0.861

47 Results: Artificial source
Algorithm   Area Over ROC Curve
Peer        0.003
Spectral    0.026
k-means     0.139
AOC = 1 – AUC = Pr(random 0 ranked above random 1)

48 Paper and Software Includes paper, source, data, scripts to reproduce all results. Location: http://www.cs.columbia.edu/~plong/peer

