The Bi-Module problem: new algorithms and applications Group meeting January 2013 David Amar.

Slides:

Advertisements

Similar presentations

Minimum Clique Partition Problem with Constrained Weight for Interval Graphs Jianping Li Department of Mathematics Yunnan University Jointed by M.X. Chen.

Advertisements

Social network partition Presenter: Xiaofei Cao Partick Berg.

Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.

. Exact Inference in Bayesian Networks Lecture 9.

~1~ Infocom’04 Mar. 10th On Finding Disjoint Paths in Single and Dual Link Cost Networks Chunming Qiao* LANDER, CSE Department SUNY at Buffalo *Collaborators:

Lauritzen-Spiegelhalter Algorithm

Optimal Rectangle Packing: A Meta-CSP Approach Chris Reeson Advanced Constraint Processing Fall 2009 By Michael D. Moffitt and Martha E. Pollack, AAAI.

FTP Biostatistics II Model parameter estimations: Confronting models with measurements.

Putting genetic interactions in context through a global modular decomposition Jamal.

BASIC METHODOLOGIES OF ANALYSIS: SUPERVISED ANALYSIS: HYPOTHESIS TESTING USING CLINICAL INFORMATION (MLL VS NO TRANS.) IDENTIFY DIFFERENTIATING GENES Basic.

A hub-attachment based method to detect functional modules from confidence-scored protein interactions and expression profiles Authors: Chia-Hao Chin 1,4,

Automatic Identification of ROIs (Regions of interest) in fMRI data.

Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol

© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.

VLSI Layout Algorithms CSE 6404 A 46 B 65 C 11 D 56 E 23 F 8 H 37 G 19 I 12J 14 K 27 X=(AB*CD)+ (A+D)+(A(B+C)) Y = (A(B+C)+AC+ D+A(BC+D)) Dr. Md. Saidur.

Recent Development on Elimination Ordering Group 1.

University of CreteCS4831 The use of Minimum Spanning Trees in microarray expression data Gkirtzou Ekaterini.

Modularity in Biological networks.  Hypothesis: Biological function are carried by discrete functional modules.  Hartwell, L.-H., Hopfield, J. J., Leibler,

Increasing graph connectivity from 1 to 2 Guy Kortsarz Joint work with Even and Nutov.

SubSea: An Efficient Heuristic Algorithm for Subgraph Isomorphism Vladimir Lipets Ben-Gurion University of the Negev Joint work with Prof. Ehud Gudes.

Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University

The community-search problem and how to plan a successful cocktail party Mauro SozioAris Gionis Max Planck Institute, Germany Yahoo! Research, Barcelona.

Systems Biology, April 25 th 2007Thomas Skøt Jensen Technical University of Denmark Networks and Network Topology Thomas Skøt Jensen Center for Biological.

Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.

Optimization Based Modeling of Social Network Yong-Yeol Ahn, Hawoong Jeong.

Chapter 5 Algorithm Analysis 1CSCI 3333 Data Structures.

Gene expression & Clustering (Chapter 10)

Efficient Model Selection for Support Vector Machines

MATISSE - Modular Analysis for Topology of Interactions and Similarity SEts Igor Ulitsky and Ron Shamir Identification.

Genetic Regulatory Network Inference Russell Schwartz Department of Biological Sciences Carnegie Mellon University.

Presenter ： Kuang-Jui Hsu Date ： 2011/5/23(Tues.).

Jesse Gillis 1 and Paul Pavlidis 2 1. Department of Psychiatry and Centre for High-Throughput Biology University of British Columbia, Vancouver, BC Canada.

Greedy Approximation Algorithms for finding Dense Components in a Graph Paper by Moses Charikar Presentation by Paul Horn.

UNC Chapel Hill Lin/Foskey/Manocha Minimum Spanning Trees Problem: Connect a set of nodes by a network of minimal total length Some applications: –Communication.

A Clustering Algorithm based on Graph Connectivity Balakrishna Thiagarajan Computer Science and Engineering State University of New York at Buffalo.

DATA MINING LECTURE 13 Pagerank, Absorbing Random Walks Coverage Problems.

CSCI 3160 Design and Analysis of Algorithms Chengyu Lin.

On Non-Disjoint Dominating Sets for the Lifetime of Wireless Sensor Networks Akshaye Dhawan.

Approximation Algorithms for NP-hard Combinatorial Problems Magnús M. Halldórsson Reykjavik University Local Search, Greedy and Partitioning

Randomized Composable Core-sets for Submodular Maximization Morteza Zadimoghaddam and Vahab Mirrokni Google Research New York.

Intel Confidential – Internal Only Co-clustering of biological networks and gene expression data Hanisch et al. This paper appears in: bioinformatics 2002.

Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.

CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.

1/24 Introduction to Graphs. 2/24 Graph Definition Graph : consists of vertices and edges. Each edge must start and end at a vertex. Graph G = (V, E)

Understanding Network Concepts in Modules Dong J, Horvath S (2007) BMC Systems Biology 2007, 1:24.

De novo discovery of mutated driver pathways in cancer Discussion leader: Matthew Bernstein Scribe: Kun-Chieh Wang Computational Network Biology BMI 826/Computer.

Biclustering of Expression Data by Yizong Cheng and Geoge M. Church Presented by Bojun Yan March 25, 2004.

GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function Sara Mostafavi, Debajyoti Ray, David Warde-Farley,

Fast test for multiple locus mapping By Yi Wen Nisha Rajagopal.

A comparative approach for gene network inference using time-series gene expression data Guillaume Bourque* and David Sankoff *Centre de Recherches Mathématiques,

Community structure in graphs Santo Fortunato. More links “inside” than “outside” Graphs are “sparse” “Communities”

Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla.

Advanced Gene Selection Algorithms Designed for Microarray Datasets Limitation of current feature selection methods: –Ignores gene/gene interaction: single.

Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!

Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.

Network applications Sushmita Roy BMI/CS 576 Dec 9 th, 2014.

1 Power Efficient Monitoring Management in Sensor Networks A.Zelikovsky Georgia State joint work with P. BermanPennstate G. Calinescu Illinois IT C. Shah.

Mining Coherent Dense Subgraphs across Multiple Biological Networks Vahid Mirjalili CSE 891.

Cohesive Subgraph Computation over Large Graphs

Finding Dense and Connected Subgraphs in Dual Networks

Spectral methods for Global Network Alignment

Semi-Supervised Clustering

Consensus Partition Liang Zheng 5.21.

Spectral methods for Global Network Alignment

Subnetwork State Functions Define Dysregulated Subnetworks in Cancer

SEG5010 Presentation Zhou Lanjun.

Graphs and Algorithms (2MMD30)

Boltzmann Machine (BM) (§6.4)

CSE 373: Data Structures and Algorithms

Minimum Spanning Trees

Presentation transcript:

The Bi-Module problem: new algorithms and applications Group meeting January 2013 David Amar

Motivation: biological networks, regulation Helps organize and understand the data Can help in a variety of machine learning problems ▫Patient classification ▫Gene regulation We have many networks, each can help understand a specific “concept” ▫Integrate networks

Example: differential coexpression Differential correlation: the difference in co-expression of a gene pair between two classes. In complex diseases, different regulatory factors may change the level of activity of group of genes. Here we analyze co-expression networks ▫Node for a gene ▫Edge weight – correlation, DC ▫Undirected ▫Can be large Control Case gene1 gene2

Recent advances Single gene analysis (Lai et al. 2004) CoXpress (Watson 2006) ▫find co-expression modules using one of the conditions (classes) in the data and then test if these modules show a different co-expression pattern in other classes. GSCA ▫ gene set co-expression analysis (Choi et al. 2009) DiffCoEx (Tesson et al. 2010): ▫detection of gene modules that manifest a marked change in the correlation and module-to-module changes.

DC “global” patterns Bi- Module DC- cluster

Cerebellum activity, 80 genes, p=3.7E-10 Un-annotated Spliceosome (8.4E-3), miRNA preprocessing

DICER analysis flow Main

DICER input graphs We developed a statistical model to measure significance of DC. The output of the statistical process three graphs on the same node set (genes) ▫Positive score: pair is significant ▫Up-DC ▫Down-DC ▫CG “DC” graph

Searching for clusters Cluster the DC graphs. We used simple average linkage hierarchical clustering.

Simulated data 150 genes and 60 conditions divided into two classes of 30 conditions each. In these data we planted a DC cluster of 30 genes and a 20-gene meta-module consisting of two 10-gene modules. We added noise to the data

Simulated data

DICER results Bi-module recovery was perfect. DICER detected the planted cluster, but reported additional clusters: ▫The meta-module itself (with three additional genes) ▫Some other small clusters

Improvement Use complete hierarchical linkage instead of average linkage In real data, as compared to average linkage: ▫Promoted homogeneity (“cliques”) ▫Clusters tend to be small Results are now perfect

Bi-modules A pair of gene modules ▫In each module: genes are correlated across all phenotypes ▫Genes that belong to different modules are differentially correlated.

Formal Definition We are given two edge-weighted graphs G=(V,E) and G'=(V,E') with the same vertex set V. A bi-module is two disjoint vertex sets S, T in V such that: ▫Sum of edge weights of S and T in G is positive. ▫Sum of weights between S and T is positive in G'. ▫Constraints Find a bi-module of maximum cardinality.

DICER’s heuristic approach Input: consistency correlations graph (CG) and DC graph. Start with an edge (u,v) in the DC graph and define its neighborhood: all nodes connected with edges to u or v. Discard nodes connected to u and v. Remove “bad” genes, until the bi-module is “legal”: ▫Negative sum of DC scores with the other side ▫Negative sum of CG scores with the other side Accept if the bi-module is “interesting” ▫C1 or C2 are not too small ▫Ratio between sizes is at least 1/4 C1C2

A heuristic approach

Simple testing framework We shall simulate the data ▫Discrete graphs Add complete bi-modules Add noise to graphs ▫“flip” edges randomly Additional “noise”: add cliques and bicliques to both graphs.

Random graph discrete model Let n be the size of the graph Let p be the noise parameter Let n1,n2 be the size range for module Let mm be the number of bi-modules Let m be the size of random cliques or bicliques Edge weights: 1 or -1

Random graph discrete model Let n be the size of the graph ▫1000 Let p be the noise parameter ▫0 to 0.2 Let n1,n2 be the size range for module ▫ Let mm be the number of meta-modules ▫ 1 or 10 Let m be the size of random cliques or bicliques ▫15 Edge weights: 1 or -1

Scoring a pair Agreement between a “real” bi-module (U1,V1) and (U2,V2) ▫U vs. U : 0.5(J(U1,U2) + J(V1,V2))) ▫U vs. V : 0.5(J(U1,V2) + J(V1,U2))) ▫Agreement score: Max(UvsU,UvsV) ▫Score is between 0 and 1 U1V1 U2V2 U1V1 U2V2 OR

Scoring a solution Running max average Jaccard Set1: S1,…,Sn Set2:Y1,…,Ym For each Si calculate the best score vs. Set2 For each Yi calculate the best score vs. Set1 Report the average S1 S2 S3 Y1 Y2 Y3 Y4

Performance? 1 MM No random cliques and bicliques

How can we improve DICER?

DICER variant 1 – DICER* Consider a simple variant of DICER: ▫While looking for seeds ignore small seeds ▫Module size at least 5

DICER variant 2: MBC-DICER Look for complete bicliques in the DC graph Use these as seed pairs for initial search Start with small bicliques of the DC graph and unite\expand them

Technical details We used the FP-MBC solver (Li et al. 2005, exhaustive search) Outputs all maximal induced bicliques of minimal size >= k1,k2 Maximal induced biclique ▫A complete biclique ▫Cannot be extended

Technical details Fast for sparse graphs ▫On PPI (>6000,>17000) – less than three minutes Not efficient for noisy or non sparse graphs ▫Time ▫Memory

Use the solver The solver can be very slow ▫We only need “good” start points Solve (minSize,maxBics) ▫Set minimal size to (minSize,minSize) ▫kill if the number of bicliques exceeds maxBics ▫Return the bics

MBC-DICER: find seeds findSeeds(G1,G2,minSize,maxNum) ▫Seeds= [] ▫While new bics are discovered  Bics<-solve(size,maxNum)  For bic in sorted(Bics)  Remove all edges within the bic in G2  Bic<-Greedy remove nodes(bic)  Add bic to Seeds

DICER variant 3 We formalize the discrete complete bi- module problem as a MILP problem. Assume the input graphs are undirected Edge weights are 1 or zero Output is one bi-module

DICER variant 3 We formalize the discrete complete bi-module problem as a MILP problem. If (A=1) and (B=1) => C=1 Gene i can get 1 for at most one set Binary variables Max cardinality

DICER variant 3 Too slow…

DICER variant 3 Other DICER variants needed less than 10 seconds on the same data. We can also formalize the continuous problem as a MIP problem with quadratic constraints. The constraints are not convex and cplex fails.

Discrete graph model 1 Bi-module Average J

Discrete graph model 1 Bi-module Running time

Discrete graph model 10 MMs Add a clique and a biclique to each graph (i.e., non-meta-module patterns)

Running time example 10 MMs with non-MM cliques and bicliques P=0.1

Lung cancer data Healthy (n=97) vs. sick (n=90) Top 3000 genes DICER*, MBC-DICER: minimal size is 5 Accept all modules Compare total coverage, miRNA enrichments

Lung cancer data

Plans - outline Understand how the start point affects the results. ▫Test on new random models Use several larger datasets (increase n to 10000). Other related biological questions ▫Genetic networks ▫MDN+PPI Other related computational questions ▫The module-map problem

The module map detection problem Instead of looking for specific bi-modules look for the best module map. A module map contains: ▫A set of modules M1,…,Mn ▫A set of un-assigned singletons X ▫A set of module pairs – E1,…,Em The goal is to maximize: ▫Sum of edges within modules in G1 plus ▫Sum of edges between module pairs in G2

PPI and genetic interactions data GI: information of a double KO mutants ▫Bad pair: lower fitness ▫Neutral ▫Good pair: healthier than expected Goal (Ulitsky et al. 2008) ▫Find a set of modules: bad mutations within, good between modules ▫Add PPI connectivity constraints.

Ulitsky et al Greedy heuristic Try to improve the solution by: ▫Add single nodes to modules ▫Merge modules ▫After merge\addition update the module-pairs list Algorithm convergence depends on the start point ▫Use step 1 (no pairs) first and then run the algorithm

Global vs. local DICER can be viewed as an algorithm for module-map detection ▫“Local” greedy approach – focus on local improvements of module-pairs ▫More constraints

Algorithms Compare “local” hill climber and “global” hill climber Starting point: single nodes, DICER seeds (different variants) Compare on differential co-expression data Compare on GI data