Load Balancing for Partition-based Similarity Search
Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang
Department of Computer Science, University of California at Santa Barbara
SIGIR'14
All Pairs Similarity Search (APSS)
Definition: find all pairs of objects whose similarity is above a given threshold: Sim(d_i, d_j) = cos(d_i, d_j) ≥ τ
Application examples: document clustering (near duplicates, spam detection), query suggestion, advertisement fraud detection, collaborative filtering & recommendation
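A minimal brute-force sketch of the APSS definition above (not the parallel algorithm in this talk): vectors are represented as feature-to-weight dictionaries, and every pair is checked against the threshold τ. The data layout is an illustrative assumption.

```python
import math
from itertools import combinations

def cosine(di, dj):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * dj.get(f, 0.0) for f, w in di.items())
    ni = math.sqrt(sum(w * w for w in di.values()))
    nj = math.sqrt(sum(w * w for w in dj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

def apss(docs, tau):
    """Return all pairs (i, j) with cos(d_i, d_j) >= tau."""
    return [(i, j) for (i, di), (j, dj) in combinations(docs.items(), 2)
            if cosine(di, dj) >= tau]

docs = {1: {"a": 0.8, "b": 0.6}, 2: {"a": 0.6, "b": 0.8}, 3: {"c": 1.0}}
print(apss(docs, 0.8))   # [(1, 2)] since cos(d_1, d_2) = 0.96
```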
Big Data Challenges for Similarity Search
20M tweets fit in memory, but take days to process sequentially
Approximated processing: df-limit [Lin SIGIR'09] removes features whose document frequency exceeds an upper limit
(Table: sequential time in hours; values marked * are estimated by sampling)
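A hedged sketch of the df-limit approximation mentioned above: drop features whose document frequency exceeds an upper limit before any comparison. The limit value and data layout are illustrative assumptions.

```python
from collections import Counter

def apply_df_limit(docs, df_limit):
    # Count how many vectors each feature appears in (its document frequency),
    # then keep only features at or below the limit.
    df = Counter(f for vec in docs.values() for f in vec)
    return {d: {f: w for f, w in vec.items() if df[f] <= df_limit}
            for d, vec in docs.items()}

docs = {1: {"the": 0.9, "jazz": 0.4}, 2: {"the": 0.7, "rock": 0.7}, 3: {"the": 0.5}}
print(apply_df_limit(docs, df_limit=2))  # "the" (df = 3) is removed from all vectors
```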
Parallel Solutions for Exact APSS
Parallel score accumulation [Lin SIGIR'09; Baraglia et al. ICDM'10]
Partition-based Similarity Search (PSS) [Alabduljalil et al. WSDM'13]: 25x or more faster
Symmetry of Comparison
Partition-level comparison is symmetric. Example: should P_i compare with P_j, or P_j compare with P_i?
The choice of comparison direction affects the communication and load of the corresponding tasks, and thus the overall load balance
Similarity Graph → Comparison Graph
Load assignment process: transition from the undirected similarity graph to a directed comparison graph
Load Balance Measurement & Examples
Load balance metric: graph cost = max(task cost)
Task cost is the sum of
–Self-comparison cost, including computation and I/O cost
–Cost of comparisons with partitions whose edges point to it
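A simplified sketch of this metric: a task's cost is its self-comparison cost plus the cost of comparing with every partition whose edge points to it, and the graph cost is the maximum task cost. The quadratic size-based cost model (and the omission of the I/O term) is an illustrative assumption.

```python
def task_cost(k, sizes, in_edges):
    self_cost = sizes[k] * sizes[k]                    # self comparison (I/O term omitted)
    absorb = sum(sizes[k] * sizes[j] for j in in_edges.get(k, []))
    return self_cost + absorb

def graph_cost(sizes, in_edges):
    # Graph cost = maximum task cost over all partitions.
    return max(task_cost(k, sizes, in_edges) for k in sizes)

sizes = {"P1": 4, "P2": 2, "P3": 3}
in_edges = {"P2": ["P1", "P3"], "P3": ["P1"]}          # edges P1->P2, P3->P2, P1->P3
print(graph_cost(sizes, in_edges))                     # 21, the cost of task P3
```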
Challenges of Optimal Load Balance
Skewed distribution of node connectivity and node sizes, observed in empirical data
Two-Stage Load Balance
Key idea: tasks with small partitions or low connectivity should absorb more load
Stage 1: initial assignment of edge directions
Challenge: how to choose a sequence of assignment steps that balances the load?
Introduce the "Potential Weight" of a Task
How to find the potentially lightest task? Define the potential weight (PW) of a task as
–Self-comparison cost + comparison cost to ALL neighboring partitions
Each step picks the node with the lowest PW, lets it absorb the comparisons with its remaining neighbors, and updates the sub-graph
The lightest partitions thus absorb as much workload as possible
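A sketch of Stage 1 under the definitions above: repeatedly pick the node with the lowest potential weight, direct the edges to its still-unassigned neighbors toward it (it absorbs those comparisons), and remove it from the sub-graph. The size-product cost model and data structures are illustrative assumptions, not the paper's exact formulas.

```python
def stage1(sizes, neighbors):
    """Return in_edges[k]: partitions whose comparison task k absorbs."""
    remaining = set(sizes)
    in_edges = {k: [] for k in sizes}
    while remaining:
        def pw(k):   # potential weight: self cost + cost to ALL remaining neighbors
            return sizes[k] ** 2 + sum(sizes[k] * sizes[j]
                                       for j in neighbors[k] if j in remaining and j != k)
        lightest = min(remaining, key=pw)
        in_edges[lightest] = [j for j in neighbors[lightest]
                              if j in remaining and j != lightest]
        remaining.remove(lightest)       # each edge is assigned exactly once
    return in_edges

sizes = {"P1": 4, "P2": 2, "P3": 3}
neighbors = {"P1": ["P2", "P3"], "P2": ["P1", "P3"], "P3": ["P1", "P2"]}
print(stage1(sizes, neighbors))   # {'P1': [], 'P2': ['P1', 'P3'], 'P3': ['P1']}
```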
A Sequence of Absorption Steps: Example
(Diagrams: initial state, step 1, step 2)
Stage 2: Assignment Refinement
Key idea: gradually shift load from heavy tasks to their lightest neighbors
Only reverse an edge direction if it is beneficial
(Diagrams: result of Stage 1; a refinement step)
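A sketch of the Stage 2 refinement described above: repeatedly take the heaviest task and try to reverse one of its incoming edges toward a lighter neighbor, keeping the reversal only if it lowers the larger of the two task costs. The cost model, stopping rule, and round limit are illustrative assumptions.

```python
def refine(sizes, in_edges, rounds=100):
    def cost(k):
        return sizes[k] ** 2 + sum(sizes[k] * sizes[j] for j in in_edges[k])
    for _ in range(rounds):
        heavy = max(in_edges, key=cost)
        improved = False
        for j in list(in_edges[heavy]):
            before = max(cost(heavy), cost(j))
            in_edges[heavy].remove(j)        # tentatively reverse edge j -> heavy
            in_edges[j].append(heavy)
            if max(cost(heavy), cost(j)) < before:
                improved = True              # keep the beneficial reversal
                break
            in_edges[j].remove(heavy)        # not beneficial: undo
            in_edges[heavy].append(j)
        if not improved:
            break
    return in_edges

sizes = {"P1": 5, "P2": 1}
in_edges = {"P1": ["P2"], "P2": []}          # heavy P1 currently absorbs the comparison
print(refine(sizes, in_edges))               # edge reversed: lighter P2 takes it
```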
Gradual Mitigation of Load Imbalance
Load imbalance indicators: Maximum / Average and Std. Dev. / Average
(Plot: step-wise load imbalance mitigation while processing Twitter data)
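Tiny helpers for the two imbalance indicators on this slide, computed over a list of per-task costs; purely illustrative.

```python
import statistics

def max_over_avg(costs):
    return max(costs) / statistics.mean(costs)

def std_over_avg(costs):
    return statistics.pstdev(costs) / statistics.mean(costs)

costs = [21, 18, 16]
print(round(max_over_avg(costs), 2), round(std_over_avg(costs), 2))
```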
Competitive to Optimal Task Load Balancing
Is this two-stage algorithm competitive with the optimum? Optimum = minimum achievable (maximum task cost)
Result: two-stage solution ≤ (2 + δ) × Optimum, where δ is the ratio of I/O and communication cost over computation cost
In our tested cases, δ ≈ 10%
Competitive to Optimum Runtime Scheduler
Can the task assignment produced by the two-stage algorithm be competitive with the one produced by optimum runtime scheduling?
PT_opt = minimum parallel time on q cores
A greedy scheduler (e.g., Hadoop MapReduce) executes the tasks produced by the two-stage algorithm; the resulting schedule length is PT_q
Result: PT_q is bounded by a constant factor of PT_opt
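A sketch of the greedy list scheduler assumed above: each queued task is placed on the currently least-loaded of q identical cores, and the schedule length PT_q is the latest finish time. Hadoop's actual scheduling details are abstracted away; the task costs and q below are illustrative.

```python
import heapq

def greedy_schedule(task_costs, q):
    """Return the schedule length PT_q for q identical cores."""
    cores = [0.0] * q                     # current finish time of each core
    heapq.heapify(cores)
    for c in task_costs:                  # tasks in queue order
        earliest = heapq.heappop(cores)   # least-loaded core gets the next task
        heapq.heappush(cores, earliest + c)
    return max(cores)

print(greedy_schedule([21, 18, 16, 9, 7, 4], q=2))   # schedule length on 2 cores
```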
Evaluations
Implementation: Hadoop MapReduce
–Parallelized pre-processing and static partitioning
–Comparison graph stored in a distributed cache
–Hadoop scheduler executes queued tasks on available cores
–Exact similarity comparison without approximated preprocessing
Datasets: 100M Twitter, 40M ClueWeb, 625K Yahoo! Music
Clusters with Intel and AMD Opteron processors
Scalability: Parallel Time and Speedup
Efficiency declines as I/O overhead among machines grows in larger clusters
The YMusic dataset is not large enough to amortize the overhead when using more cores
Time Cost Distribution
Static partitioning is parallelized and its cost is insignificant
Computation cost dominates APSS and I/O overhead is relatively low (δ ≈ 10%), so the schedule is competitive with the optimum runtime
Comparison with Circular Load Assignment [Alabduljalil et al. WSDM'13]
Parallel time reduction: up to 39% from Stage 1 and up to 11% from Stage 2
(Charts: task cost and improvement percentage)
Contributions and Conclusions
Two-stage load assignment algorithm for APSS
–Converts the undirected similarity graph to a directed comparison graph
–Improvement: up to 41% for the tested cases
–Analysis of competitiveness with the optimum
–Scalable for large datasets on hundreds of cores
Improved dissimilarity detection and partitioning
–More dissimilarity detected
–Hierarchical static data partitioning method for more even partition sizes
–Contributes up to 18% end-performance gain
Q & A
Thank you! Presenter: Xun Tang (looking for challenging opportunities)
Backup Slides
Previous Work
Filtering
–Dynamic computation filtering [Bayardo et al. WWW'07]
–Prefix, positional, and suffix filtering [Xiao et al. WWW'08]
Similarity-based grouping
–Inverted indexing [Arasu et al. VLDB'06]
–Parallelization with MapReduce [Lin SIGIR'09]
–Feature-sharing groups [Vernica et al. SIGMOD'10]
–Locality-sensitive hashing [Gionis et al. VLDB'99; Ture et al. SIGIR'11]
–Partition-based similarity comparison [Alabduljalil et al. WSDM'13]
Load balancing and scheduling [focus of this paper]
–Index serving in search systems [Kayaaslan et al. TWEB'13]
–Division scheme with MapReduce [Wang et al. KDD'13]
–Greedy scheduling policy [Garey and Graham SIAM'75]
–Delay scheduling with Hadoop [Zaharia et al. EuroSys'10]
Contribution of This Paper
Improves the computation load balancing of PSS by an additional 41% for the given data
Key techniques
–Two-stage load assignment algorithm: the first stage constructs a preliminary load assignment; the second stage refines it
–Analytical results on competitiveness to support the design
–Improved dissimilarity detection with hierarchical data partitioning
Competitive to the Optimum for a Fully Connected Similarity Graph with Equal Sizes
For example, a 5-node graph
(Table: results for an n-node fully connected graph with equal partition sizes)
A Naïve Approach: Circular Load Balance
Each partition is compared with half of the other partitions, if they are potentially similar
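A sketch of this circular assignment: with n partitions arranged in a ring, partition i is assigned comparisons with the next floor((n-1)/2) partitions, so each pair is covered exactly once for odd n. The handling of even n (and of partitions that are not potentially similar) is an illustrative assumption.

```python
def circular_assignment(n):
    # Partition i compares with the next half of the ring.
    half = (n - 1) // 2
    return {i: [(i + k) % n for k in range(1, half + 1)] for i in range(n)}

print(circular_assignment(5))
# {0: [1, 2], 1: [2, 3], 2: [3, 4], 3: [4, 0], 4: [0, 1]} -- each pair appears once
```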
Function of each PSS task (coarse-grain task parallelism)
Read the assigned partition P_k and build an inverted index
Repeat
–Read vectors from a potentially similar partition
–Compare P_k with these vectors
–Write similar vector pairs
Until all potentially similar vectors are compared
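A compact sketch of the per-task loop above: build an inverted index for the assigned partition, then stream vectors from each potentially similar partition and accumulate dot products against the index. I/O is abstracted as plain Python iterables, and vectors are assumed to be unit-normalized so the dot product equals the cosine score.

```python
from collections import defaultdict

def pss_task(own_partition, other_partitions, tau):
    index = defaultdict(list)                    # feature -> [(doc id, weight)]
    for doc_id, vec in own_partition.items():
        for f, w in vec.items():
            index[f].append((doc_id, w))

    results = []
    for partition in other_partitions:           # each potentially similar partition
        for doc_id, vec in partition.items():
            scores = defaultdict(float)
            for f, w in vec.items():
                for own_id, own_w in index.get(f, []):
                    scores[own_id] += w * own_w  # accumulate partial dot products
            results += [(own_id, doc_id, s) for own_id, s in scores.items() if s >= tau]
    return results                               # similar vector pairs to write out
```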
Why do we follow PSS?
Normalized pair-wise comparison time, given similarity threshold 0.8
–PSS: 1.24 ns for Twitter and 0.74 ns for ClueWeb using 300 AMD cores
–Parallel score accumulation: 19.7x to 25x slower than PSS
–Alternate MapReduce solution: PQ [Lin SIGIR'09], measured with approximated processing
How Well Does It Work?
How does the comparison graph generated by the two-stage algorithm compare to the lowest-cost graph possible?
The optimum is defined as the smallest cost of a comparison graph derived from a similarity graph G
δ is the overhead ratio of I/O and communication over computation
How Well Does It Work?
How does our job completion time compare to the optimal solution, without knowing the allocated computing resources in advance?
PT_q is the job completion time of the two-stage load assignment with a greedy scheduler on q cores; PT_opt is that of an optimal solution
Improved Data Partitioning: r-norm
Improved Data Partitioning: Layer Size
Uniform: evenly-sized partitions
Non-uniform: size of partition L_k proportional to index k
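A hedged sketch of the non-uniform option: with m layers and N vectors in total, the size of layer L_k is made proportional to its index k. The exact rounding scheme is an illustrative assumption.

```python
def layer_sizes(total, m):
    weight = m * (m + 1) // 2                    # 1 + 2 + ... + m
    sizes = [total * k // weight for k in range(1, m + 1)]
    sizes[-1] += total - sum(sizes)              # give any rounding remainder to the last layer
    return sizes

print(layer_sizes(1000, 4))   # [100, 200, 300, 400]
```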
Effect of Recursive Hierarchical Partitioning
Recursively divide a large sublayer by detecting dissimilar vectors inside it
Each new partition inherits the dissimilarity relationships of its original sublayer
The new partitions, together with the undivided sublayers, form the undirected similarity graph