A Multi-Level Parallel Implementation of a Program for Finding Frequent Patterns in a Large Sparse Graph
Steve Reinhardt, Interactive Supercomputing
George Karypis, Dept. of Computer Science, University of Minnesota

Outline
- Problem definition
- Prior work
- Problem and Approach
- Results
- Issues and Conclusions

Graph Datasets
Flexible and powerful representation:
- Evidence extraction and link discovery (EELD)
- Social networks / Web graphs
- Chemical compounds
- Protein structures
- Biological pathways
- Object recognition and retrieval
- Multi-relational datasets

Finding Patterns in Graphs: Many Dimensions
- Structure of the graph dataset
  - many small graphs: graph-transaction setting
  - one large graph: single-graph setting
- Type of patterns
  - connected subgraphs
  - induced subgraphs
- Nature of the algorithm
  - Complete: finds all patterns that satisfy the minimum support requirement
  - Incomplete: finds some of the patterns
- Nature of the pattern's occurrence
  - Exact algorithms: the pattern occurs exactly in the input graph
  - Inexact algorithms: there is a sufficiently similar embedding of the pattern in the graph
- MIS calculation for frequency
  - exact
  - approximate
  - upper bound
- Algorithm
  - vertical (depth-first)
  - horizontal (breadth-first)
M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. In SIAM International Conference on Data Mining (SDM-04), 2004.

Single Graph Setting
Find all frequent subgraphs of a single sparse graph.
Choice of frequency definition matters.
[Figure: an input graph with two example patterns, a size-7 pattern with frequency 6 and a size-6 pattern with frequency 1.]

vSiGraM: Vertical Solution
- Candidate generation by extension
  - Add one more edge to a current embedding.
  - Solve MIS on embeddings in the same equivalence class.
  - No downward-closure-based pruning.
- Two important components
  - Frequency-based pruning of extensions
  - Treefication based on canonical labeling

vSiGraM: Connection Table
- Frequency-based pruning.
- Trying every possible extension is expensive and inefficient; a particular extension might have been tested before.
- Categorize extensions into equivalence classes (in terms of isomorphism) and record whether each class is frequent.
- If a class becomes infrequent, never try it again in later exploration (see the sketch below).
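A minimal sketch of the connection-table idea, assuming each extension class is identified by a canonical label string; the type and function names (ConnTable, ct_lookup, ct_mark, ct_worth_trying) are illustrative and not taken from the SIGRAM code:

#include <string.h>
#include <stdlib.h>

/* One entry per extension equivalence class, keyed by its canonical label. */
typedef struct {
    char *label;        /* canonical code of the extension class */
    int   is_frequent;  /* 0 = known infrequent, 1 = known frequent */
} ConnEntry;

typedef struct {
    ConnEntry *entries;
    int        count, cap;
} ConnTable;

/* Return the entry for a class, or NULL if it has not been seen yet. */
static ConnEntry *ct_lookup(ConnTable *t, const char *label) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].label, label) == 0)
            return &t->entries[i];
    return NULL;
}

/* Record whether a class turned out to be frequent. */
static void ct_mark(ConnTable *t, const char *label, int is_frequent) {
    ConnEntry *e = ct_lookup(t, label);
    if (e) { e->is_frequent = is_frequent; return; }
    if (t->count == t->cap) {
        t->cap = t->cap ? 2 * t->cap : 16;
        t->entries = realloc(t->entries, t->cap * sizeof(ConnEntry));
    }
    t->entries[t->count].label = strdup(label);
    t->entries[t->count].is_frequent = is_frequent;
    t->count++;
}

/* Before testing an extension: skip it if its class is already known infrequent. */
static int ct_worth_trying(ConnTable *t, const char *label) {
    ConnEntry *e = ct_lookup(t, label);
    return e == NULL || e->is_frequent;
}

In practice a hash table keyed by the canonical label would replace the linear scan, but the pruning logic is the same: an extension class already found infrequent is never re-tested.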

Parallelization
- Two clear sources of parallelism in the algorithm
  - Amount of parallelism from each source not known in advance
- The code is typical C code
  - structs, pointers, frequent mallocs/frees of small areas, etc.
  - nothing like the "Fortran"-like (dense linear algebra) examples shown for many parallel programming methods
- Parallel structures need to accommodate dynamic parallelism
  - Dynamic specification of parallel work
  - Dynamic allocation of processors to work
- Chose OpenMP taskq/task constructs
  - Proposed extensions to the OpenMP standard
  - Allow parallel work to be defined in multiple places in a program but placed on a single conceptual queue and executed accordingly
  - ~20 lines of code changes in a ~15,000-line program
- Electric Fence was very useful in finding coding errors

Algorithmic Parallelism

vSiGraM(G, MIS_type, f)
 1. F ← ∅
 2. F^1 ← all frequent size-1 subgraphs in G
 3. for each F1 in F^1 do
 4.     M(F1) ← all embeddings of F1
 5. for each F1 in F^1 do                      // high-level parallelism
 6.     F ← F ∪ vSiGraM-Extend(F1, G, f)
 7. return F

vSiGraM-Extend(Fk, G, f)
 1. F ← ∅
 2. for each embedding m in M(Fk) do           // low-level parallelism
 3.     C^{k+1} ← C^{k+1} ∪ {all (k+1)-subgraphs of G containing m}
 4. for each Ck+1 in C^{k+1} do
 5.     if Fk is not the generating parent of Ck+1 then
 6.         continue
 7.     compute Ck+1.freq from M(Ck+1)
 8.     if Ck+1.freq < f then
 9.         continue
10.     F ← F ∪ vSiGraM-Extend(Ck+1, G, f)
11. return F

Simple Taskq/Task Example

/* Intel's proposed (non-standard) taskq/task extension: the taskq in main()
 * creates the work queue, and each recursive call of fib() adds tasks to it. */
int fib(int n);

int main()
{
    int val;
    #pragma intel omp taskq
    val = fib(12345);
    return 0;
}

int fib(int n)
{
    int partret[2];
    if (n > 2) {
        #pragma intel omp task
        for (int i = n - 2; i < n; i++) {
            partret[n - 2 - i] = fib(i);
        }
        return partret[0] + partret[1];
    } else {
        return 1;
    }
}

High-Level Parallelism with taskq/task

// At the bottom of expand_subgraph, after all child
// subgraphs have been identified, start them all.
#pragma intel omp taskq
for (ii = 0; ii < sg_set_size(child); ii++) {
    #pragma intel omp task captureprivate(ii)
    {
        SubGraph *csg = sg_set_at(child, ii);
        expand_subgraph(csg, csg->ct, lg, ls, o);
    } // end-task
}
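The intel taskq/task pragmas above were proposed extensions and were never standardized in that form; OpenMP 3.0 later adopted a similar tasking model. A rough sketch of the same high-level pattern with standard OpenMP tasks, assuming the same SubGraph type and helper routines (sg_set_size, sg_set_at, expand_subgraph) and that this code runs inside an enclosing parallel region:

// firstprivate(ii) plays the role of captureprivate(ii);
// taskwait replaces the implicit completion point at the end of the taskq.
for (int ii = 0; ii < sg_set_size(child); ii++) {
    #pragma omp task firstprivate(ii)
    {
        SubGraph *csg = sg_set_at(child, ii);
        expand_subgraph(csg, csg->ct, lg, ls, o);
    }
}
#pragma omp taskwait   // wait for all child expansions spawned in this call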

Low-Level Parallelism with taskq/task

#pragma omp parallel shared(nt, priv_es)
{
    #pragma omp master
    {
        nt = omp_get_num_threads();   // #threads in parallel region
        priv_es = (ExtensionSet **)kmp_calloc(nt, sizeof(ExtensionSet *));
    }
    #pragma omp barrier
    #pragma intel omp taskq
    {
        for (i = 0; i < sg_vmap_size(sg); i++) {
            #pragma intel omp task captureprivate(i)
            {
                int th = omp_get_thread_num();
                if (priv_es[th] == NULL) {
                    priv_es[th] = exset_init(128);
                }
                expand_map(sg, ct, ams, i, priv_es[th], lg);
            }
        }
    }
} // end parallel section; next loop is serial reduction

for (i = 0; i < nt; i++) {
    if (priv_es[i] != NULL) {
        exset_merge(priv_es[i], es);
    }
}
kmp_free(priv_es);

Implementation due to Grant Haab and colleagues from the Intel OpenMP library group.

Experimental Results
- SGI Altix™
  - 32 Itanium2™ sockets (64 cores), 1.6 GHz
  - 64 GBytes of memory (though not memory limited)
  - Linux
  - No special dplace/cpuset configuration
- Minimum frequencies chosen to illuminate scaling behavior, not to provide maximum performance

Dataset 1 - Chemical (dtp)
Speed-ups at increasing processor counts (speed-ups only; absolute times omitted):

Frequency  Parallelism  Speed-ups
500        High         2.03  2.40  2.58  2.56  2.57
500        Low          0.98  1.01  0.83  0.74  0.63
500        Both         1.96  2.37  2.21  1.08  0.70
100        High         1.97  3.71  6.39  7.29  7.61
100        Low          1.00  1.02  0.83  0.70  0.80
100        Both         1.99  3.69  1.55  0.29  0.33
50         High         2.00  4.64  8.76  16.56  22.27  21.03
50         Low          1.00  0.96  0.70  1.07  1.44
50         Both         2.03  3.55  1.17  0.55  0.48

Dataset 2 – Aviation (air)
Speed-ups at increasing processor counts (speed-ups only; absolute times omitted):

Frequency  Parallelism  Speed-ups
—          High         7.19  22.30  27.29
—          Low          2.13
1500       High         7.20  22.89  27.30
1250       High         7.37  24.31  29.58
1000       High         8.06  26.13  25.65

Performance of High-Level Parallelism
When there is a sufficient quantity of work (i.e., the frequency threshold is low enough):
- Good speed-ups to 16P
- Reasonable speed-ups to 30P
- Little or no benefit above 30P
- No insight into the performance plateau

Poor Performance of Low-Level Parallelism
- Several possible effects ruled out
  - Granularity of data allocation
  - Barrier before master-only reduction
- Source: highly variable times for register_extension
  - ~100X slower in parallel than serial, …
  - but different instances from execution to execution
  - Apparently due to highly variable run-times for malloc
  - Not understood

Issues and Conclusions
- OpenMP taskq/task were straightforward to use in this program and implemented the desired model
- Performance was good up to a medium range of processor counts (best 26X on 30P)
- Difficult to gain insight into the lack of performance
  - High-level parallelism at 30P and above
  - Low-level parallelism

Backup

Datasets

Dataset    Connected Components  Vertices  Edges    Vertex Labels  Edge Labels
Aviation   2,703                 101,185   98,482   6,173          51
Credit     700                   14,700    14,…     …              …
Citation   16,999                29,014    42,…     …              …
VLSI       2,633                 12,752    11,542   23             1

Aviation Dataset
Generally, vSiGraM is 2-5 times faster than hSiGraM (with exact and upper-bound MIS).
Largest pattern contained 13 edges.

Credit Dataset
Generally, vSiGraM is 2-5 times faster than hSiGraM (with exact and upper-bound MIS).
Largest pattern contained 13 edges.

Citation Dataset
But hSiGraM can be more efficient, especially with upper-bound MIS (ub).
Largest pattern contained 16 edges.

Contact Map Dataset

DTP Dataset

VLSI Dataset
Exact MIS never finished.
Longest pattern contained 5 edges (a constraint).

SUBDUE
D. J. Cook and L. B. Holder. Journal of Artificial Intelligence Research, vol. 1, 1994.
- Heuristic pattern discovery system based on MDL, written in C.
- With the default settings, finds the 3 most interesting patterns.
- No overlaps are allowed.

Comparison with SUBDUE
[Table comparing SUBDUE and vSiGraM (approximate MIS) on the Credit, DTP, and VLSI datasets: frequency threshold, largest pattern size, number of patterns, and runtime in seconds.]
Similar results with SEuS.

Comparison with SEuS
S. Ghazizadeh and S. Chawathe. DS 2002.
- Pattern discovery algorithm using a summary data structure.
- Allows overlaps when counting frequency.
  - Tends to produce more patterns, because the frequency of each pattern is generally higher.
- Written in Java.
- On the Credit dataset, SEuS discovered 48 patterns in 50 seconds (support threshold unknown); vSiGraM (approximate MIS) spent 20 seconds to find 11,696 patterns.

Summary
- With approximate and exact MIS, vSiGraM is 2-5 times faster than hSiGraM.
- With upper-bound MIS, however, hSiGraM can prune a larger number of infrequent patterns.
  - The downward closure property plays a key role.
- For some datasets, using exact MIS for frequency counting is simply intractable.
- Compared to SUBDUE, SiGraM finds more and longer patterns in a shorter amount of runtime.

Thank You!
A slightly longer version of this paper is also available as a technical report.
SiGraM executables will be available for download soon.

Complete Frequent Subgraph Mining: Existing Work So Far
Input: a set of graphs (transactions) + a support threshold.
Goal: find all frequently occurring subgraphs in the input dataset.
- AGM (Inokuchi et al., 2000): vertex-based, patterns may not be connected.
- FSG (Kuramochi et al., 2001): edge-based, only connected subgraphs.
- AcGM (Inokuchi et al., 2002), gSpan (Yan & Han, 2002), FFSM (Huan et al., 2003), etc. follow FSG's problem definition.
Frequency of each subgraph:
- The number of supporting transactions.
- It does not matter how many embeddings are in each transaction.

Frequency Under the Transaction Setting
[Figure: three transactions T1, T2, T3; the pattern occurs in T1 and T2, so Frequency = 2 (T1, T2).]
Convenient assumption: no need to handle multiple embeddings per transaction.

Wait! What happens if there is no notion of transactions in the input dataset?
Many real graph datasets are not in transaction format:
- Network-related data, VLSI design, etc.
- Graphs created from data with a temporal nature (e.g., link discovery, intrusion detection)

What Is a Reasonable Frequency Definition?
Two reasonable choices:
- The frequency is determined by the total number of embeddings.
  - Not downward closed.
  - Too many patterns; artificially high frequency of certain patterns.
- The frequency is determined by the number of edge-disjoint embeddings (Vanetik et al., ICDM 2002).
  - Downward closed.
  - Since each occurrence uses a different set of edges, occurrence frequencies are bounded.
  - Solved by finding the maximum independent set (MIS) of the embedding overlap graph.

Embedding Overlap and MIS
- Edge-disjoint embeddings: {E1, E2, E3} and {E1, E2, E4}
- Create an overlap graph and solve MIS:
  - vertex = embedding
  - edge = overlap
[Figure: overlap graph over embeddings E1–E4.]
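As an illustration of the overlap graph (not the SIGRAM implementation), a minimal sketch that represents each embedding by the set of graph-edge IDs it uses and connects two embeddings whenever they share an edge:

#include <stdlib.h>

/* An embedding is represented here simply as the set of graph-edge IDs it uses. */
typedef struct {
    int *edge_ids;
    int  num_edges;
} Embedding;

/* Do two embeddings share at least one edge? (Quadratic scan; fine for small embeddings.) */
static int embeddings_overlap(const Embedding *a, const Embedding *b) {
    for (int i = 0; i < a->num_edges; i++)
        for (int j = 0; j < b->num_edges; j++)
            if (a->edge_ids[i] == b->edge_ids[j])
                return 1;
    return 0;
}

/* Build the overlap graph as an n x n adjacency matrix:
 * vertex = embedding, edge = the two embeddings overlap. */
static char *build_overlap_graph(const Embedding *emb, int n) {
    char *adj = calloc((size_t)n * n, sizeof(char));
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (embeddings_overlap(&emb[i], &emb[j]))
                adj[i * n + j] = adj[j * n + i] = 1;
    return adj;  /* caller frees */
}

Under the edge-disjoint definition, the pattern's frequency is the size of a maximum independent set of this overlap graph (see the greedy MIS sketch below).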

OK, the Definition Is Fine, but…
MIS-based frequency seems reasonable.
Next question: how to develop mining algorithms for the single-graph setting.

How to Handle the Single-Graph Setting?
- Issue 1: Frequency counting
  - Exact MIS is often intractable.
- Issue 2: Choice of search scheme
  - Horizontal (breadth-first)
  - Vertical (depth-first)

Issue 1: MIS-Based Frequency
- We also considered approximate (greedy) and upper-bound MIS.
  - Approximate MIS may underestimate the frequency.
  - Upper-bound MIS may overestimate the frequency.
- MIS is NP-complete and cannot be approximated within any constant factor in general.
  - In practice, a simple greedy scheme works quite well (Halldórsson and Radhakrishnan. Greed is good, 1997).

Approximate and Upper-Bound MIS
Greedy MIS: repeatedly pick a remaining vertex of lowest degree, add it to the independent set, and remove it together with its neighbors (see the sketch below).
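A minimal sketch of this greedy heuristic on an overlap graph stored as an n x n adjacency matrix (as in the earlier overlap-graph sketch); the independent-set size it returns is a lower bound on the exact MIS and hence an underestimate of the edge-disjoint frequency. Illustrative code, not the SIGRAM implementation:

#include <stdlib.h>

/* Greedy MIS: repeatedly pick a remaining vertex of minimum degree,
 * add it to the independent set, and delete it and its neighbors.
 * Returns the size of the set found (a lower bound on the exact MIS). */
static int greedy_mis(const char *adj, int n) {
    int *alive = malloc(n * sizeof(int));
    int *deg   = malloc(n * sizeof(int));
    int remaining = n, mis_size = 0;

    for (int i = 0; i < n; i++) {
        alive[i] = 1;
        deg[i] = 0;
        for (int j = 0; j < n; j++) deg[i] += adj[i * n + j];
    }
    while (remaining > 0) {
        /* pick the lowest-degree surviving vertex */
        int best = -1;
        for (int i = 0; i < n; i++)
            if (alive[i] && (best < 0 || deg[i] < deg[best]))
                best = i;
        mis_size++;
        /* remove the chosen vertex's neighbors, updating degrees as we go */
        for (int j = 0; j < n; j++) {
            if (j != best && alive[j] && adj[best * n + j]) {
                alive[j] = 0; remaining--;
                for (int k = 0; k < n; k++)
                    if (alive[k] && adj[j * n + k]) deg[k]--;
            }
        }
        /* finally remove the chosen vertex itself */
        alive[best] = 0; remaining--;
    }
    free(alive); free(deg);
    return mis_size;
}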

Issue 2: Search Scheme
Frequent subgraph mining is exploration in the lattice of subgraphs.
- Horizontal
  - Level-wise
  - Candidate generation and pruning
    - Joining
    - Downward closure property
  - Frequency counting
- Vertical
  - Traverse the lattice as if it were a tree.

Stop to Summarize for the Moment
- Type of MIS for frequency counting
  - Approximate (greedy)
  - Exact
  - Upper bound
- Search scheme
  - Horizontal
  - Vertical

hSiGraM: Horizontal Method
- Natural extension of FSG to the single-graph setting.
- Candidate generation and pruning
  - Downward closure property
  - Tighter pruning than the vertical method
- Two-phase frequency counting
  - Find all embeddings by subgraph isomorphism
    - Anchor-edge list intersection, instead of TID list intersection
    - Localize subgraph isomorphism
  - MIS over the embeddings
    - Approximate and upper-bound MIS give a subset and a superset, respectively.

TID List Recap
[Figure: lattice of subgraphs (size k and size k+1) over three transactions T1, T2, T3.]
- The TID list of a pattern is the set of transactions containing it, e.g., TID(P1) = {T1, T3} and TID(P2) = TID(P3) = {T1, T2, T3} for the size-k subpatterns in the figure.
- For a size-(k+1) candidate: TID(candidate) ⊆ TID(P1) ∩ TID(P2) ∩ TID(P3) = {T1, T3} (see the sketch below).
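For reference, a minimal sketch of TID-list intersection over sorted transaction-ID arrays (illustrative, not taken from FSG/hSiGraM). Because a candidate can only occur in transactions that contain all of its size-k subpatterns, a small intersection prunes the candidate before any subgraph-isomorphism work:

/* Intersect two sorted TID lists; the result is written to out
 * (which must have room for min(na, nb) entries) and its length is returned.
 * A candidate's support can be at most the size of this intersection. */
static int tid_intersect(const int *a, int na, const int *b, int nb, int *out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])        i++;
        else if (a[i] > b[j])   j++;
        else { out[k++] = a[i]; i++; j++; }
    }
    return k;
}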

Anchor Edges
- Every embedding of a subgraph must appear close together in the graph; keep one anchor edge for each embedding.
  - Storing complete embeddings requires too much memory.
  - Anchor edges localize subgraph isomorphism (see the sketch below).
[Figure: lattice of subgraphs, size k and size k+1.]
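A minimal sketch of the localization idea under simple assumptions: the graph is stored as adjacency lists, and every vertex of an embedding of a size-k pattern that contains the anchor edge (u, v) lies within k hops of that edge, so subgraph isomorphism only needs to examine this bounded neighborhood. The Graph type and function names are illustrative, not from the SIGRAM code:

#include <stdlib.h>

/* Simple adjacency-list graph: adj[v] lists the neighbors of v. */
typedef struct {
    int   num_vertices;
    int **adj;       /* adj[v][0 .. deg[v]-1] */
    int  *deg;
} Graph;

/* BFS from both endpoints of the anchor edge, limited to `hops` levels.
 * Marks visited[v] = 1 for every vertex in the local region and returns its size. */
static int anchor_neighborhood(const Graph *g, int u, int v, int hops, char *visited) {
    int *queue = malloc(g->num_vertices * sizeof(int));
    int *dist  = malloc(g->num_vertices * sizeof(int));
    int head = 0, tail = 0, count = 2;

    for (int i = 0; i < g->num_vertices; i++) { visited[i] = 0; dist[i] = 0; }
    visited[u] = visited[v] = 1;
    queue[tail++] = u;
    queue[tail++] = v;

    while (head < tail) {
        int x = queue[head++];
        if (dist[x] == hops) continue;           /* don't expand past the hop limit */
        for (int i = 0; i < g->deg[x]; i++) {
            int y = g->adj[x][i];
            if (!visited[y]) {
                visited[y] = 1;
                dist[y] = dist[x] + 1;
                queue[tail++] = y;
                count++;
            }
        }
    }
    free(queue); free(dist);
    return count;
}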

Treefication
[Figure: lattice of subgraphs and the treefied lattice, sizes k-1, k, and k+1; each node in the search space is a subgraph.]
- Based on the subgraph/supergraph relation.
- Avoids visiting the same node in the lattice more than once.