Dependency networks
Sushmita Roy, BMI/CS
Nov 25th, 2014
RECAP
Probabilistic graphical models provide a natural way to represent biological networks. So far we have seen Bayesian networks:
– Sparse candidates
– Module networks
Today we will focus on dependency networks.
What you should know
– What are dependency networks?
– How do they differ from Bayesian networks?
– The GENIE3 algorithm for learning a dependency network from expression data
– Different ways to represent conditional distributions
– Evaluation of various network inference methods
Graphical models for representing regulatory networks
Bayesian networks and dependency networks
[Figure: an example pathway (Msb2, Sho1, Ste20) and the corresponding graphs for both model types, with REGULATORS X_1, X_2 and TARGET Y_3]
– Structure: random variables encode expression levels; edges correspond to some form of statistical dependency
– Function: Y_3 = f(X_1, X_2)
Dependency network
A type of probabilistic graphical model. As in Bayesian networks, it has:
– A graph component describing the dependency structure between random variables
– A prediction function f_j associated with each variable X_j, used to predict X_j from the state of its neighbors
Unlike Bayesian networks, it can have cyclic dependencies (a toy cyclic structure is sketched below).
Reference: Dependency Networks for Inference, Collaborative Filtering and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000
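As a small illustration (not from the slides or the paper), here is a toy dependency-network structure containing a cycle; the variable names and edges are purely made up.

```python
# Toy dependency-network structure with a cycle (X1 -> X2 -> X3 -> X1),
# which a Bayesian network (a DAG) could not represent. Illustrative only.
neighbors = {
    "X1": ["X3"],  # f_1 predicts X1 from its neighbor X3
    "X2": ["X1"],  # f_2 predicts X2 from X1
    "X3": ["X2"],  # f_3 predicts X3 from X2
}
```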
Notation
– X_i: the i-th random variable
– X = {X_1, ..., X_p}: the set of p random variables
– x_i^k: an assignment of X_i in the k-th sample
– x_-i^k: the set of assignments to all variables other than X_i in the k-th sample
Learning dependency networks
The f_j functions can be of different types. Learning requires estimation of each of the f_j functions. In all cases, learning requires us to minimize an error of predicting X_j from its neighborhood, e.g. the squared error over samples:
err(X_j) = Σ_k (x_j^k − f_j(x_-j^k))^2
Different representations of the f_j function
If X_j is continuous:
– f_j can be a linear function
– f_j can be a regression tree
– f_j can be a random forest (an ensemble of trees)
If X_j is discrete:
– f_j can be a conditional probability table
– f_j can be a conditional probability tree
(A small sketch of the continuous choices follows.)
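A minimal sketch (not part of the original slides) of how the three continuous choices for f_j map onto standard scikit-learn regressors; the toy `expr` matrix and the gene index `j` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 10))   # toy expression matrix: 50 samples x 10 genes

j = 3                              # index of the target gene X_j
X = np.delete(expr, j, axis=1)     # inputs x_-j: all genes except X_j
y = expr[:, j]                     # output: expression of X_j

f_linear = LinearRegression().fit(X, y)                       # linear f_j
f_tree = DecisionTreeRegressor(max_depth=3).fit(X, y)         # single regression tree
f_forest = RandomForestRegressor(n_estimators=100).fit(X, y)  # ensemble of trees
```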
GENIE3: GEne Network Inference with Ensemble of trees
– Solves a set of regression problems, one per random variable
– Uses an ensemble of regression trees to represent f_j, which models non-linear dependencies
– Outputs a directed, possibly cyclic graph with a confidence for each edge
– Focuses on generating a ranking over edges rather than a graph structure and parameters
Reference: Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. Van Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Pierre Geurts, PLoS ONE 2010
Recall our very simple regression tree example
[Figure: a tree whose root tests X_2 > e_1; the YES branch leads to a second test X_2 > e_2; leaves give predictions, one of which depends on X_3]
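The same kind of tree can be written out as nested threshold tests; the thresholds and leaf values below are invented for illustration, since the slide only shows the structure.

```python
# A regression tree as nested threshold tests: each interior node tests one
# input, each leaf returns a prediction. Thresholds e1, e2 and the leaf
# values are illustrative, not taken from the slides.
def tree_predict(x2, x3, e1=0.5, e2=1.5):
    if x2 > e1:            # root test: X_2 > e_1
        if x2 > e2:        # second test: X_2 > e_2
            return 2.0     # leaf prediction
        return 1.0         # leaf prediction
    return 0.1 * x3        # leaf whose prediction uses X_3
```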
An ensemble of trees
A single tree is prone to overfitting. Instead of learning a single tree, ensemble models make use of a collection of trees.
A random forest: an ensemble of trees
[Figure, taken from the ICCV09 tutorial by Kim, Shotton and Stenger: T trees t_1, ..., t_T, each with split nodes and leaf nodes; the input x_-j is dropped down every tree and the prediction is averaged over the trees]
GENIE3 algorithm sketch
For each X_j, generate learning samples of input/output pairs:
– LS_j = {(x_-j^k, x_j^k), k = 1..N}
– On each LS_j, learn f_j to predict the value of X_j; f_j is either a random forest or Extra-Trees
– Estimate w_ij for all genes i ≠ j; w_ij quantifies the confidence of the edge between X_i and X_j
Generate a global ranking of edges based on the w_ij (a minimal code sketch follows).
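A minimal sketch of this loop using scikit-learn random forests, with `feature_importances_` standing in for the w_ij. This approximates the published method rather than reproducing the authors' implementation; the function name and variables are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def genie3_sketch(expr, n_trees=1000):
    """expr: (samples x genes) expression matrix. Returns the confidence
    matrix W (W[i, j] = w_ij for edge X_i -> X_j) and a global edge ranking."""
    n_samples, p = expr.shape
    W = np.zeros((p, p))
    for j in range(p):
        X = np.delete(expr, j, axis=1)          # learning sample inputs x_-j
        y = expr[:, j]                          # outputs x_j
        rf = RandomForestRegressor(n_estimators=n_trees,
                                   max_features="sqrt").fit(X, y)
        regulators = [i for i in range(p) if i != j]
        W[regulators, j] = rf.feature_importances_   # w_ij for all i != j
    # Global ranking: all candidate edges sorted by decreasing confidence
    edges = sorted(((W[i, j], i, j) for i in range(p) for j in range(p) if i != j),
                   reverse=True)
    return W, edges
```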
GENIE3 algorithm sketch
[Figure from Huynh-Thu et al.: the per-gene regression problems and the resulting predictor rankings]
Learning f_j in GENIE3
Random forests or Extra-Trees are used to represent f_j.
Learning the random forest:
– Generate M = 1000 bootstrap samples
– At each node to be split, search for the best split among K randomly selected variables
– K was set to p−1 or sqrt(p−1), where p is the number of regulators/parents
Learning the Extra-Trees:
– Learn 1000 trees
– Each tree is built from the original learning sample
– At each test node, the best split is determined among K random splits, each determined by randomly selecting one input (without replacement) and a threshold
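One plausible way to map these settings onto scikit-learn (an assumption about the mapping, not the authors' code): `max_features` plays the role of K, and Extra-Trees with `bootstrap=False` grows each tree from the original sample.

```python
# Hedged mapping of the GENIE3 hyperparameters onto scikit-learn estimators.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# K = p-1: consider all candidate regulators at every split
rf_all = RandomForestRegressor(n_estimators=1000, max_features=None)

# K = sqrt(p-1): consider a square-root-sized random subset at every split
rf_sqrt = RandomForestRegressor(n_estimators=1000, max_features="sqrt")

# Extra-Trees: 1000 trees, each grown on the original (un-bootstrapped) sample
et = ExtraTreesRegressor(n_estimators=1000, bootstrap=False)
```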
Computing the importance weight of a predictor
Importance is computed at each interior node (remember, there can be multiple interior nodes per regulator). For an interior node N, importance is given by the reduction in variance achieved by splitting at that node:
I(N) = #S · Var(S) − #S_t · Var(S_t) − #S_f · Var(S_f)
where:
– S: the set of data samples that reach node N
– #S: size of the set S
– Var(S): variance of the output variable in set S
– S_t: subset of S for which the test at N is true
– S_f: subset of S for which the test at N is false
Computing the importance weight of a predictor
For a single tree, the overall importance of a predictor is the sum of I(N) over all interior nodes where that predictor is used to split. For an ensemble, the importance is averaged over all trees. To avoid a bias towards highly variable genes, the expression of each gene is normalized to unit variance (a direct transcription of the node score follows).
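A direct transcription of the single-node score above into code (assuming both children of the node are non-empty); the function name and arguments are illustrative.

```python
import numpy as np

def node_importance(y, test_is_true):
    """Variance reduction I(N) for one interior node N.
    y: output values of the samples in S (those reaching node N);
    test_is_true: boolean mask, True for the samples falling in S_t."""
    S_t, S_f = y[test_is_true], y[~test_is_true]
    return (len(y) * np.var(y)
            - len(S_t) * np.var(S_t)
            - len(S_f) * np.var(S_f))
```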
Computational complexity of GENIE3
Complexity per variable: O(T·K·N·log N), where
– T is the number of trees
– K is the number of random attributes selected per split
– N is the learning sample size
Evaluation of network inference methods
Assume we know what the “right” network is. One can then use precision-recall curves to evaluate the predicted network; the area under the PR curve (AUPR) quantifies performance.
Precision = (# of correct edges) / (# of predicted edges)
Recall = (# of correct edges) / (# of true edges)
(A small sketch of the computation follows.)
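A minimal sketch of computing a PR curve and AUPR with scikit-learn, assuming a hypothetical flattened list of candidate edges with gold-standard labels (`y_true`) and predicted confidences (`scores`).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # hypothetical gold-standard edges
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])  # edge confidences

# PR curve over all confidence thresholds, then area under it
precision, recall, _ = precision_recall_curve(y_true, scores)
print(f"AUPR = {auc(recall, precision):.3f}")
```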
AUPR-based performance comparison
[Figure]
Some comments about expression-based network inference methods
We have seen two types of algorithms to learn these networks:
– Per-gene methods: Sparse candidates (learn regulators for individual genes), GENIE3
– Per-module methods: module networks (learn regulators for sets of genes/modules)
Other implementations of module networks exist:
– LIRNET: Learning a Prior on Regulatory Potential from eQTL Data. Su-In Lee et al., PLoS Genetics 2009
– LeMoNe: Learning Module Networks. Michoel et al., BMC Bioinformatics 2007
Many implementations of per-gene methods
– Mutual information: Context Likelihood of Relatedness (CLR), ARACNE
– Probabilistic methods: Bayesian networks (Sparse Candidates)
– Regression: TIGRESS, GENIE3
DREAM: Dialogue on Reverse Engineering Assessment and Methods
A community effort to assess regulatory network inference.
– DREAM5 challenge
– Previous challenges: 2006, 2007, 2008, 2009, 2010
Marbach et al. 2012, Nature Methods
Where do different methods rank?
[Figure from Marbach et al., 2010, ranking methods against “Community” and “Random” reference points]
Comparing module (LeMoNe) and per-gene (CLR) methods