Optimizing and Parallelizing Ranked Enumeration. Konstantin Golenberg, Benny Kimelfeld, Yehoshua Sagiv. The Hebrew University of Jerusalem; IBM Research – Almaden.


Optimizing and Parallelizing Ranked Enumeration
Konstantin Golenberg (The Hebrew University of Jerusalem), Benny Kimelfeld (IBM Research – Almaden), Yehoshua Sagiv (The Hebrew University of Jerusalem)
VLDB 2011, Seattle, WA

2 Background: DB Search at HebrewU
[Screenshot: search box with the query "eu brussels"; demo in SIGMOD’10, implementation in SIGMOD’08, algorithms in PODS’06]
Initial implementation was too slow…
Purchased a multi-core server
Didn’t help: cores were usually idle
  – Due to the inherent flow of the enumeration technique we used
Needed deeper understanding of ranked enumeration to benefit from parallelization
  – This paper

Outline
Lawler-Murty’s Ranked Enumeration
Optimizing by Progressive Bounds
Parallelization / Core Utilization
Conclusions

4 Ranked Enumeration
[Figure: User → Problem → huge number (e.g., 2^|Problem|) of ranked answers: best answer, 2nd best answer, 3rd best answer, ... Callouts: "Here"; "Can’t afford to instantiate all answers"]
Examples:
Various graph optimizations
  – Shortest paths
  – Smallest spanning trees
  – Best perfect matchings
Top results of keyword search on DBs (graph search)
Most probable answers in probabilistic DBs
Best recommendations for schema integration
“Complexity”:
What is the delay between successive answers?
How much time to get top-k?

5 Abstract Problem Formulation
Input:
O = a collection of objects
A = the set of answers a ⊆ O (huge, described by a condition on subsets of O)
score(): score(a) is high means a is of high quality
Goal: Find top-k answers a_1, a_2, a_3, …, a_k

6 Graph Search in the Abstraction
Input: data graph G, set Q of keywords
O = edges of G
A = subtrees (edge sets) a containing all keywords in Q (w/o redundancy, see [GKS 2008])
score(a): 1/weight(a), IR measures, etc.
Goal: Find top-k answers

7 What is the Challenge?
[Figure: starting from O, answers with scores 32, 31, ...]
1st (top) answer: optimization problem
2nd answer: ?
j-th answer: ≠ previous (j-1) answers, best remaining answer: ?
Conceivably, much more complicated than top-1!
How to handle these constraints? (j may be large!)

8 Lawler-Murty’s Procedure [Murty, 1968] [Lawler, 1972]
Lawler-Murty’s procedure gives a general reduction: finding top-k answers reduces to finding the top-1 answer under simple constraints (if the latter is in PTIME, then so is the former)
We understand optimization much better!
Often, amounts to classical optimization, e.g., shortest path (but sometimes it may get involved, e.g., [KS 2006])
Other general top-k procedure: [Hamacher & Queyranne 84], very similar!

9 Among the Uses of Lawler-Murty’s
Graph/Combinatorial Algorithms:
  – Shortest simple paths [Yen 1972]
  – Minimum spanning trees [Gabow 1977, Katoh et al., 1981]
  – Best solutions in resource allocation [Katoh et al. 1981]
  – Best perfect matchings, best cuts [Hamacher & Queyranne 1985]
  – Minimum Steiner trees [KS 2006]
Bioinformatics:
  – Yen’s algorithm to find sets of metabolites connected by chemical reactions [Takigawa & Mamitsuka 2008]
Data Management:
  – ORDER-BY queries [KS 2006, 2007]
  – Graph/XML search [GKS 2008]
  – Generation of forms over integrated data [Talukdar et al. 2008]
  – Course recommendation [Parameswaran & Garcia-Molina 2009]
  – Querying Markov sequences [K & Ré 2010]

10 Lawler-Murty’s Method: Conceptual
[Figure: the space of all answers, labeled "start", next to an Output list]

11 1. Find & Print the Top Answer
In principle, at this point we should find the second-best answer. But instead…

12 2. Partition the Remaining Answers
Partition defined by a set of simple constraints
  – Inclusion constraint: “must contain” a given object
  – Exclusion constraint: “must not contain” a given object

13 3. Find the Top of Each Set

14 4. Find & Print the Second Answer
Next answer: best among all the top answers in the partitions

15 5. Further Divide the Chosen Partition
… and so on … (until k answers are printed)

16 Lawler-Murty’s: Actual Execution
[Figure: an Output list (printed already) next to the partition representatives, each stored with the best answer of its partition; the best of these (e.g., score 19) is taken next]

17 Lawler-Murty’s: Actual Execution
[Figure: Output; Partition Reps. + Best of Each; example score 24]
For each new partition, a task to find the best answer

18 Lawler-Murty’s: Actual Execution
[Figure: Output; Partition Reps. + Best of Each; the next best (score 24) is selected…]
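
The five conceptual steps and the "partition reps. + best of each" execution can be sketched on a toy problem. This is only an illustration under stated assumptions, not the authors' implementation: the answers are taken to be all r-element subsets of a weighted object set, the score is the weight sum, and the names `best_answer` and `lawler_murty` are hypothetical. The top-1 solver under inclusion/exclusion constraints is trivial here (take the heaviest allowed objects); in graph search it would be a constrained Steiner-tree computation.

```python
import heapq
from itertools import count

def best_answer(weights, r, include, exclude):
    """Top-1 under simple constraints (toy version): the heaviest r-subset
    that contains `include` and avoids `exclude`, or None if none exists."""
    free = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: (-weights[o], o))
    need = r - len(include)
    if need < 0 or need > len(free):
        return None
    answer = frozenset(include) | frozenset(free[:need])
    return sum(weights[o] for o in answer), answer

def lawler_murty(weights, r, k):
    """Yield up to k (score, answer) pairs in non-increasing score order."""
    tie = count()   # tie-breaker so the heap never has to compare sets
    heap = []       # partition representatives, each with its best answer

    def push(include, exclude):
        best = best_answer(weights, r, include, exclude)
        if best is not None:
            score, answer = best
            heapq.heappush(heap, (-score, next(tie), answer, include, exclude))

    push(frozenset(), frozenset())          # step 1: the unconstrained top answer
    printed = 0
    while heap and printed < k:
        neg_score, _, answer, include, exclude = heapq.heappop(heap)
        yield -neg_score, answer            # best among all partition tops
        printed += 1
        # Split the rest of this partition: the i-th new partition keeps the
        # first i-1 chosen objects and excludes the i-th one.
        fixed = include
        for o in sorted(answer - include):
            push(fixed, exclude | {o})
            fixed = fixed | {o}

top = list(lawler_murty({"a": 5, "b": 4, "c": 3, "d": 1}, r=2, k=4))
scores = [s for s, _ in top]   # [9, 8, 7, 6]
```

Every remaining answer of a partition differs from the printed one in some first object, so the new partitions are disjoint and together cover exactly the remaining answers.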

Outline
Lawler-Murty’s Ranked Enumeration
Optimizing by Progressive Bounds
Parallelization / Core Utilization
Conclusions

20 Typical Bottleneck
[Figure: Output; Partition Reps. + Best of Each; example score 24]

21 Typical Bottleneck
[Figure: Output; Partition Reps. + Best of Each; example score 24]
In top k?

22 Progressive Upper Bound
Throughout the execution, an optimization alg. often upper bounds its final solution’s score
Progressive: bound gets smaller in time
Often, nontrivial bounds, e.g.,
  – Dijkstra's algorithm: distance at the top of the queue
    Similarly: some Steiner-tree algorithms [DreyfusWagner72]
  – Viterbi algorithms: max intermediate probability
  – Primal-dual methods: value of dual LP solution
[Figure: over time the bound shrinks (≤24, ≤22, ≤18, ≤14) until the final score 12 is reached]
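
The Dijkstra example can be made concrete with a small sketch. It assumes non-negative edge weights and a score that grows as distance shrinks (e.g., score = 1/distance or -distance), so a lower bound on the final distance is an upper bound on the final score. The function name `dijkstra_with_bounds` and the event tuples are hypothetical, chosen for illustration only.

```python
import heapq

def dijkstra_with_bounds(graph, source, target):
    """Run Dijkstra from source to target, yielding ("bound", d) each time a
    node is settled at distance d (the final distance cannot be below d),
    and finally ("done", distance)."""
    dist = {source: 0}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue                      # stale queue entry
        settled.add(u)
        if u == target:
            yield ("done", d)
            return
        yield ("bound", d)                # distance at the top of the queue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    yield ("done", float("inf"))          # target unreachable

graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 5)], "b": [("t", 1)]}
events = list(dijkstra_with_bounds(graph, "s", "t"))
# events == [("bound", 0), ("bound", 1), ("bound", 2), ("done", 3)]
```

The distance bounds only grow (0, 1, 2), so the implied bound on the score only shrinks, which is the "progressive" property the slide relies on.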

23 Freezing Tasks (Simplified)
[Figure: Output; Partition Reps. + Best of Each; example score 24]

24 Freezing Tasks (Simplified)
[Figure: a running task's upper bound shrinks over time: ≤24, ≤23, …]

25 Freezing Tasks (Simplified)
[Figure: the bound has shrunk to ≤24, ≤23, ≤20; compared against a score of 22 (22 > 20)]

26 Freezing Tasks (Simplified)
[Figure: bound history ≤24, ≤23, ≤20, ≤18, ≤16, ≤15; final score 15, marked "best"]
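
A simplified version of freezing can be sketched as follows. The assumptions are loud ones: each task is modeled as a hypothetical generator (`task`) that yields shrinking upper bounds and then its final score, and `best_with_freezing` only finds the single next answer, whereas the procedure in the paper keeps frozen tasks around for later answers. The idea shown is that a task is advanced only while its bound is the highest one; otherwise it stays frozen.

```python
import heapq

def task(bounds, final):
    """Hypothetical stand-in for an optimization run on one partition:
    yields shrinking upper bounds, then the final score."""
    for b in bounds:
        yield ("bound", b)
    yield ("done", final)

def best_with_freezing(tasks):
    """tasks: dict name -> generator as above.
    Returns (name, score, steps), where steps counts task advances."""
    heap = [(-float("inf"), name) for name in tasks]   # no bound known yet
    heapq.heapify(heap)
    finished = {}
    steps = 0
    while heap:
        _, name = heapq.heappop(heap)      # task with the highest bound
        if name in finished:
            # Its exact score is at least every other task's upper bound.
            return name, finished[name], steps
        kind, value = next(tasks[name])    # resume this task for one step
        steps += 1
        if kind == "done":
            finished[name] = value
        heapq.heappush(heap, (-value, name))   # re-queue (frozen) at new bound
    return None, None, steps

tasks = {
    "A": task([24, 23], 22),
    "B": task([20, 18, 16], 15),
    "C": task([19], 17),
}
result = best_with_freezing(tasks)   # ("A", 22, 5)
```

Running all three tasks to completion would take 9 steps; here B and C are frozen after one step each because their bounds (≤20, ≤19) cannot beat A's final score of 22.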

27 Improvement of Freezing
Experiments: Graph Search. 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Chart: Simple Lawler-Murty vs. w/ Freezing on Mondial, DBLP (part), and DBLP (full), each for k = 10, 100]
On average, freezing saved 56% of the running time

Outline
Lawler-Murty’s Ranked Enumeration
Optimizing by Progressive Bounds
Parallelization / Core Utilization
Conclusions

29 Straightforward Parallelization
[Figure: Awaiting Tasks; Output]

30 Straightforward Parallelization
[Figure: Awaiting Tasks; Output]

31 Straightforward Parallelization
[Figure: Awaiting Tasks; Output]
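
One way to picture the straightforward parallelization is the sketch below, which hands the tasks of the newly created partitions to a thread pool and waits for all of them before choosing the next answer. It is an assumption-laden toy, not the authors' code: answers are r-element subsets of a weighted object set, `best_answer` and `parallel_lawler_murty` are hypothetical names, and in CPython the global interpreter lock means this shows the control flow rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import count

def best_answer(weights, r, include, exclude):
    """Toy top-1 solver: heaviest r-subset containing `include` and
    avoiding `exclude`, or None if no such subset exists."""
    free = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: (-weights[o], o))
    need = r - len(include)
    if need < 0 or need > len(free):
        return None
    answer = frozenset(include) | frozenset(free[:need])
    return sum(weights[o] for o in answer), answer

def parallel_lawler_murty(weights, r, k, workers=4):
    """Return up to k (score, answer) pairs, best first."""
    tie = count()
    heap, results = [], []
    first = best_answer(weights, r, frozenset(), frozenset())
    if first is not None:
        heap.append((-first[0], next(tie), first[1], frozenset(), frozenset()))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while heap and len(results) < k:
            neg, _, answer, include, exclude = heapq.heappop(heap)
            results.append((-neg, answer))
            # One task per new partition, all submitted at once.
            parts, fixed = [], include
            for o in sorted(answer - include):
                parts.append((fixed, exclude | {o}))
                fixed = fixed | {o}
            futures = [pool.submit(best_answer, weights, r, inc, exc)
                       for inc, exc in parts]
            # The main loop blocks here until every task has finished.
            for (inc, exc), fut in zip(parts, futures):
                best = fut.result()
                if best is not None:
                    heapq.heappush(heap, (-best[0], next(tie), best[1], inc, exc))
    return results

ranked = parallel_lawler_murty({"a": 5, "b": 4, "c": 3, "d": 1}, r=2, k=4)
scores = [s for s, _ in ranked]   # [9, 8, 7, 6]
```

Only the tasks spawned by the last printed answer run concurrently, and the next answer cannot be chosen until the slowest of them returns, so extra workers have little to do.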

Not so fast…
Typical: running time reduced by 30%
Same for 2, 3, …, 8 threads!

33 Idle Cores while Waiting
[Figure: Awaiting Tasks; Output]

34 Idle Cores while Waiting
[Figure: Awaiting Tasks; Output; cores marked "idle"]

35 Early Popping
[Figure: Awaiting Tasks; Output; bounds ≤24, ≤23, ≤22, ≤20, ≤19; comparison 22 > 20]
Skipped issues:
Thread synchronization
  – semaphores, locking, etc.
Correctness

36 Improvement of Early Popping
Experiments: Graph Search. 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial (short, medium-size & long queries); DBLP (part) (short, medium-size & long queries)]

37 Early Popping vs. (Serial) Freezing
Experiments: Graph Search. 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial (short, medium-size & long queries); DBLP (part) (short, medium-size & long queries)]
Need 4 threads to start gaining
And even then, fairly poor…

38 Combining Freezing & Early Popping
We discuss additional ideas and techniques to further utilize the cores
  – Not here, see the paper
Main speedup by combining early popping with freezing
  – Cores kept busy… on high-potential tasks
  – Thread synchronization is quite involved
At the high level, the final algorithm has the following flow:

39 Combining: General Idea
[Figure labels: Computed Answers (to-print); Partition Reps. as Frozen Tasks; Output; threads work on frozen tasks, yielding frozen + new tasks and computed answers]

40 Combining: General Idea
[Figure: same diagram as the previous slide]

41 Combining: General Idea
[Figure: same diagram as the previous slide]
Main task just pops computed results to print
… but validates: no better results by frozen tasks

42 Combined vs. (Serial) Freezing
Experiments: Graph Search. 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial; DBLP]
Now, significant gain (≈50%) already w/ 2 threads

43 Improvement of Combined
Experiments: Graph Search. 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial; DBLP; chart labels: 4%-5%, 3%-10%]
On average, with 8 threads we got 5.7% of the original running time

Outline
Lawler-Murty’s Ranked Enumeration
Optimizing by Progressive Bounds
Parallelization / Core Utilization
Conclusions

45 Conclusions
Considered Lawler-Murty’s ranked enumeration
  – Theoretical complexity guarantees
  – …but a direct implementation is very slow
  – Straightforward parallelization poorly utilizes cores
Ideas: progressive bounds, freezing, early popping
  – In the paper: additional ideas, combination of ideas
Most significant speedup by combining these ideas
  – Flow substantially differs from the original procedure
  – 20x faster on 8 cores
Test case: graph search; focus: general apps
  – Future: additional test cases
Questions?