TORQUE: TOPOLOGY-FREE QUERYING OF PROTEIN INTERACTION NETWORKS. Sharon Bruckner (1), Falk Hüffner (1), Richard M. Karp (2), Ron Shamir (1), and Roded Sharan (1). (1) School of Computer Science, Tel Aviv University; (2) Int. Computer Science Institute, Berkeley, CA.
OUR GOAL: NETWORK QUERYING. Start with a protein-protein interaction network of some species A. We seek subnetworks that match complexes or pathways. Network querying: given a protein complex from another species B, identify the subnetwork of A that is most similar to it. Why network querying? A match hints at an evolutionarily conserved region, and lets us infer the functionality of the matched region.
Previous Methods: Assume knowledge of the interactions within the query complex (the topology), and look for a match in the network with the same topology. Examples: QNet (Dost et al., 2008), GraphFind (Ferro et al., 2008).
NO NEED FOR TOPOLOGY! Interaction information is noisy and incomplete, and for some species it is not available.
THE PROBLEM. Input: a graph G = (V, E) with |V| = n and |E| = m; a color set {1, 2, ..., k}; a coloring of the network vertices.
THE PROBLEM. We ask: is there a connected subgraph of G that has exactly one vertex of each color? Call such a subgraph “colorful”.
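To make the definition concrete, here is a minimal brute-force sketch for tiny instances. It is only an illustration of the problem statement, not the method of the talk; the names `is_connected` and `has_colorful_subgraph` and the adjacency-dict representation are assumptions made for this example.

```python
from itertools import combinations


def is_connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y in vertices and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vertices


def has_colorful_subgraph(adj, color, k):
    """Brute force: test every k-subset of vertices (exponential, tiny inputs only).

    adj   -- dict mapping each vertex to a list of its neighbors (assumed format)
    color -- dict mapping each vertex to a color in {1, ..., k}
    """
    wanted = set(range(1, k + 1))
    for subset in combinations(adj, k):
        if {color[v] for v in subset} == wanted and is_connected(subset, adj):
            return True
    return False
```

Enumerating all k-subsets is exactly the cost that the dynamic programming approach avoids.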
ABOUT THE PROBLEM. NP-complete; hard even when the graph is a tree with maximum degree 3, via reduction from 3SAT (Fellows et al., 2007). Our contributions: a fixed-parameter dynamic programming algorithm; an integer linear program; fast heuristics; an implementation using a combination of the above.
DEFINING THE BASIC DP ALGORITHM. Input: a graph where each vertex is colored by one of k colors. Output: a colorful tree. Every connected subgraph has a spanning tree, so every colorful connected subgraph has a colorful spanning tree. Instead of looking for a colorful subgraph, look for a colorful tree. Scored version: same input; output: the highest-scoring colorful tree.
DYNAMIC PROGRAMMING ALGORITHM (FELLOWS ET AL., 2008). The table has a row for each vertex and a column for each subset of colors, in increasing size. Example table (columns S1, S2, S3, S4): v1: 0, 0, None, 3.4; v2: 0, None, 2.32 [last two cells run together in the source]; v3: None, 0, 3.15, None; rows v4 and v5 continue likewise. The entry for (v3, S3) is the score of the best tree rooted in v3 that is colored exactly by S3. IDEA: instead of looking at all n^k possible subgraphs, look only at all 2^k color sets.
DYNAMIC PROGRAMMING ALGORITHM. The last column contains, for every vertex v, the score of the highest-scoring tree rooted in v that is colored by all the colors of the query. Running time: O(3^k |E|).
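The table-filling idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the TORQUE implementation: the slides do not define the scoring function, so the sketch assumes additive vertex weights; color sets are encoded as bitmasks; `best_colorful_tree_scores` is a hypothetical name. A tree rooted in v with color set S is built by joining a tree rooted in v (colors S1) to a tree rooted in a neighbor u (colors S2), where S1 and S2 partition S.

```python
def best_colorful_tree_scores(adj, color, weight, k):
    """Fill the DP table T[v][S] for all vertices v and color bitmasks S.

    adj    -- dict: vertex -> list of neighbors (assumed format)
    color  -- dict: vertex -> color in {1, ..., k}
    weight -- dict: vertex -> score contribution (assumption: additive scores)
    Returns T where T[v][S] is the best score of a tree rooted in v whose
    vertices carry exactly the colors in S, one vertex per color
    (float('-inf') plays the role of "None" in the table).
    """
    full = (1 << k) - 1
    none = float("-inf")
    T = {v: [none] * (full + 1) for v in adj}
    for v in adj:
        T[v][1 << (color[v] - 1)] = weight[v]  # single-vertex trees

    # Every proper submask is numerically smaller than its mask, so
    # iterating S upward guarantees that sub-results are already final.
    for S in range(1, full + 1):
        for v in adj:
            cv = 1 << (color[v] - 1)
            if not S & cv or S == cv:
                continue
            S1 = (S - 1) & S  # enumerate proper non-empty submasks of S
            while S1:
                if S1 & cv and T[v][S1] != none:
                    S2 = S ^ S1
                    for u in adj[v]:
                        if T[u][S2] != none:
                            T[v][S] = max(T[v][S], T[v][S1] + T[u][S2])
                S1 = (S1 - 1) & S
    return T
```

Enumerating all (S, S1) pairs visits 3^k combinations, and each is combined across the edges at v, which matches the O(3^k |E|) bound quoted above. Reading `T[v][(1 << k) - 1]` for every v gives the last column of the table.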
EXAMPLE. [Figure: the table entry T(v, { }) illustrated on vertices u, v, and w; the colors in the set were shown graphically.]
EXTENSION 1: ALLOWING DELETIONS (MATCHING WITH FEWER COLORS)
ALLOWING DELETIONS (MATCHING WITH FEWER COLORS). Simply look at all columns with color sets of size at least k - num_dels. [Figure: the dynamic programming table again, with rows v1 to v5 and columns S1 to S4.]
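A minimal sketch of that column scan, assuming the table is stored as a dict from vertex to a list indexed by color-set bitmask, with float('-inf') standing in for "None". The function name `best_with_deletions` and the storage layout are assumptions for illustration; the slides do not say whether deletions carry a score penalty, so none is applied here.

```python
def best_with_deletions(T, k, num_dels):
    """Pick the best table entry whose color set has size >= k - num_dels.

    T -- dict: vertex -> list of scores indexed by color-set bitmask
         (float('-inf') marks an infeasible entry; assumed layout)
    Returns (score, vertex, bitmask), or (-inf, None, None) if nothing is feasible.
    """
    best = (float("-inf"), None, None)
    for v, row in T.items():
        for S, score in enumerate(row):
            # bin(S).count("1") is the number of colors in the set S
            if bin(S).count("1") >= k - num_dels and score > best[0]:
                best = (score, v, S)
    return best
```

No extra table work is needed: the smaller color sets were already computed on the way to the full set.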
EXTENSION 2: ALLOWING INSERTIONS: SPECIAL NON-COLORED VERTICES, ARBITRARY VERTICES
ALLOWING NON-COLORED INSERTIONS. For j insertions, we would expect a running time of O(3^(k+j) m). Can show: O(3^k m j). Make j copies of each column, and recursively solve: B(v, S, j') = highest score of a tree, rooted in v, colored by S, using exactly j' insertions.
FORMULA & EXAMPLE. [Figure: the recurrence formula and an example graph with vertices a, b, c, d, e, f, g.] Running time: O(3^k m j).
DETAILS. For every vertex v and color subset S, the algorithm accurately finds the best tree among those with the minimal number of insertions. Once B(v,S,j) < ∞ for some j, the value for j+i will never be computed! We cannot guarantee that B(v,S,j+i) will have exactly j+i insertions.
EXTENSION 3: ALLOWING MULTIPLE COLORS PER VERTEX
MULTIPLE COLORS PER VERTEX. “List coloring” ([BFKN08]). Our solution, as used in color coding ([AYZ95]): run the dynamic programming many times, each time coloring each network vertex randomly by one of its possible colors. If we perform enough rounds, the correct solution should be colorful in at least one of them. How many times do we have to run this? It depends on the probability that a solution becomes colorful. If every vertex can be assigned any of the k colors: [formula missing]. In our case: [formula missing]. In practice, decrease the number of rounds using heuristics.
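The round count can be estimated with a short calculation. The sketch below is an assumption-laden illustration, not the formulas from the slide (which did not survive extraction). It uses the standard color-coding fact that k fixed vertices colored uniformly at random from k colors receive k distinct colors with probability k!/k^k, and, for list coloring, the simple lower bound that each solution vertex independently picks its intended color with probability 1/(size of its color list). All function names are hypothetical.

```python
import math


def uniform_colorful_probability(k):
    """Probability that k fixed vertices, each colored uniformly at random
    from k colors, end up with k distinct colors: k! / k^k."""
    return math.factorial(k) / k ** k


def list_colorful_probability(list_sizes):
    """Lower bound (assumption): each solution vertex must pick one specific
    color from its own list, independently of the others."""
    p = 1.0
    for size in list_sizes:
        p /= size
    return p


def rounds_needed(p, epsilon):
    """Rounds so that the solution is colorful at least once with
    probability >= 1 - epsilon, given per-round success probability p."""
    if p >= 1.0:
        return 1
    return math.ceil(math.log(epsilon) / math.log(1.0 - p))
```

When most vertices have a single possible color, the list-based probability is far larger than k!/k^k, which is one way to see why few rounds can suffice in practice.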
PUTTING IT TOGETHER …
A SECOND APPROACH Formulate the problem as an integer linear program (ILP). Use efficient ILP solvers.
ILP at a glance. Want: a subset T of the vertices. Formulate colorfulness: only vertices in T are colored; every vertex gets at most one color; every color is given to at most one vertex. Formulate connectivity: find a flow such that only vertices in T are involved in the flow; the flow has value k-1, with a single sink and k-1 sources; every source is connected to the sink via flow edges.
The Integer Linear Program
Heuristic Speedups. First do data reduction: only 5% of the vertices are associated with one or more query colors, and many non-colored vertices are too far from any colored vertex to be useful. For each remaining connected component, try a shortest-paths-based heuristic that does not allow mismatches. If this fails: if there are few colors but the instance is large, use dynamic programming; otherwise, use the ILP.
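The distance-based part of the data reduction can be sketched with a multi-source breadth-first search. This is an illustrative sketch only: the slides do not state the distance threshold, so `max_dist` is an assumed parameter (for instance, tied to the number of insertions allowed), and `prune_far_vertices` is a hypothetical name.

```python
from collections import deque


def prune_far_vertices(adj, colored, max_dist):
    """Keep only vertices within max_dist hops of some colored vertex.

    adj      -- dict: vertex -> list of neighbors (assumed format)
    colored  -- iterable of vertices that carry at least one query color
    max_dist -- assumed threshold; not specified in the source
    Returns the pruned adjacency dict.
    """
    dist = {v: 0 for v in colored}
    queue = deque(dist)
    while queue:
        x = queue.popleft()
        if dist[x] == max_dist:
            continue  # do not expand beyond the threshold
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    # Drop unreachable vertices and any edges that pointed to them.
    return {v: [u for u in adj[v] if u in dist] for v in adj if v in dist}
```

The connected components of the pruned graph can then be handled one at a time, as described above.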
COLOR CONSTRAINTS. Binary variables indicate whether vertex v gets a given color. Constraints: every vertex gets at most one color; every color is given to at most one vertex; a vertex gets a color only if it is selected.
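The three constraints can be written out as follows. The notation is an assumption introduced here, since the slide's own symbols were lost: x_{v,c} = 1 if vertex v gets color c, y_v = 1 if vertex v is selected into T, and L(v) is the set of colors allowed for v.

```latex
% Assumed notation (not from the slides):
%   x_{v,c} = 1 iff vertex v gets color c
%   y_v     = 1 iff vertex v is selected into T
%   L(v)    = set of colors allowed for vertex v
\begin{align}
  \sum_{c \in L(v)} x_{v,c} &\le 1
    && \forall v \in V
    && \text{(at most one color per vertex)} \\
  \sum_{v \,:\, c \in L(v)} x_{v,c} &\le 1
    && \forall c \in \{1,\dots,k\}
    && \text{(at most one vertex per color)} \\
  x_{v,c} &\le y_v
    && \forall v \in V,\ c \in L(v)
    && \text{(colored only if selected)} \\
  x_{v,c},\ y_v &\in \{0,1\}
\end{align}
```

The first and third constraints could also be merged into a single inequality, with the sum over c of x_{v,c} bounded by y_v; they are kept separate here to mirror the three statements on the slide.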
CONNECTED SUBGRAPHS AS ILP
IMPLEMENTATION, EXPERIMENTS & RESULTS
Experiments. We applied our method to query complexes within: yeast (5430 proteins; interaction count missing), fly (6650 proteins; interaction count missing), and human (7915 proteins; interaction count missing). Queries: yeast, fly, human, bovine, mouse, and rat.
COMPARISON WITH OTHER METHODS. Most previous work tested queries with a known topology. We compare our results with those of QNet (Dost et al., 2008), designed to tackle topology-based queries. QNet uses color coding to tackle the subgraph homeomorphism problem, allowing insertions and deletions.
Comparison with QNet
Results Evaluation. Functional coherence: used GO TermFinder for functional enrichment in T. Specificity: looked at the overlap between T and known complexes in the target species, compared to the overlap between random subgraphs and the known complexes; corrected for multiple testing using FDR (q < 0.05). Quality match: functionally coherent and specific.
SELECTED RESULTS
Evaluation - Comparison with QNet
Network  Complex  Functional coherence     Specificity              Novel matches
                  Torque       QNet        Torque       QNet        (Torque and QNet digits run together in the source)
Yeast    Fly      23 (100%)    2 (100%)    19 (82%)     2 (100%)    70
Yeast    Human    134 (95%)    49 (98%)    119 (85%)    47 (94%)    82
Fly      Yeast    8 (100%)     3 (60%)     8 (100%)     4 (80%)     10
Fly      Human    56 (90%)     21 (87%)    62 (100%)    23 (95%)    225
Human    Yeast    48 (84%)     25 (78%)    43 (75%)     23 (71%)    86
Human    Fly      21 (72%)     0 (0%)      21 (72%)     0 (0%)      70
Total    (values missing)
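TESTING SPECIES WITH UNKNOWN TOPOLOGY
Network  Complex  #Feasible  #Matches  Functional coherence  Specificity  Novel matches
Yeast    Bovine   4          4         4                     4            0
Yeast    Mouse    (values missing)
Yeast    Rat      (values missing)
Fly      Bovine   3          0         -                     -            -
Fly      Mouse    (values missing)
Fly      Rat      (values missing)
Human    Bovine   4          4         2                     1            0
Human    Mouse    (values missing)
Human    Rat      (values missing)
Total    (values missing)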
SUMMARY. The PPI network querying problem motivates the colorful connected subgraph problem. A fixed-parameter dynamic programming algorithm, allowing insertions, deletions, and multiple colors per vertex, along with an ILP formulation and heuristics, obtains good results. Thanks: Nir Yosef, the TAU Computational Genomics group, and the Computational System Biology group; Israel Science Foundation; Edmond J. Safra Bioinformatics Program, Tel Aviv Univ.
REFERENCES
[FFHV07] M. R. Fellows, G. Fertin, D. Hermelin, and S. Vialette. Borderlines for finding connected motifs in vertex-colored graphs. In Proc. ICALP’07, volume 4596, pages 340–351. Springer-Verlag, 2007.
[N06] R. Niedermeier. Invitation to Fixed-Parameter Algorithms. Number 31 in Oxford Lecture Series in Mathematics and Its Applications. Oxford University Press, 2006.
[BFKN08] N. Betzler, M. R. Fellows, C. Komusiewicz, and R. Niedermeier. Parameterized algorithms and hardness results for some graph motif problems. In Proc. 19th CPM, volume 5029 of LNCS, pages 31–43. Springer, 2008.
[AYZ95] N. Alon, R. Yuster, and U. Zwick. Color coding. Journal of the ACM, 42:844–856, 1995.
[DSGRBS08] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna, and R. Sharan. Qnet: A tool for querying protein interaction networks. Journal of Computational Biology, 15(7), 2008.