Load Balancing for Partition-based Similarity Search
Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang
Department of Computer Science, University of California at Santa Barbara
SIGIR'14
All Pairs Similarity Search (APSS)
Definition: find all pairs of objects whose similarity is above a given threshold: Sim(d_i, d_j) = cos(d_i, d_j) ≥ τ
Application examples: document clustering (near duplicates, spam detection), query suggestion, advertisement fraud detection, collaborative filtering & recommendation
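A minimal brute-force sketch of the APSS definition above (not the parallel algorithm in this talk): vectors are represented as feature-to-weight dictionaries, and every pair is checked against the threshold τ. The data layout is an illustrative assumption.

```python
import math
from itertools import combinations

def cosine(di, dj):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * dj.get(f, 0.0) for f, w in di.items())
    ni = math.sqrt(sum(w * w for w in di.values()))
    nj = math.sqrt(sum(w * w for w in dj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

def apss(docs, tau):
    """Return all pairs (i, j) with cos(d_i, d_j) >= tau."""
    return [(i, j) for (i, di), (j, dj) in combinations(docs.items(), 2)
            if cosine(di, dj) >= tau]

docs = {1: {"a": 0.8, "b": 0.6}, 2: {"a": 0.6, "b": 0.8}, 3: {"c": 1.0}}
print(apss(docs, 0.8))   # [(1, 2)] since cos(d_1, d_2) = 0.96
```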
Big Data Challenges for Similarity Search
20M tweets fit in memory, but take days to process sequentially
Approximated processing: df-limit [Lin SIGIR'09] removes features whose document frequency exceeds an upper limit
(Table: sequential time in hours; values marked * are estimated by sampling)
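A hedged sketch of the df-limit approximation mentioned above: drop features whose document frequency exceeds an upper limit before any comparison. The limit value and data layout are illustrative assumptions.

```python
from collections import Counter

def apply_df_limit(docs, df_limit):
    # Count how many vectors each feature appears in (its document frequency),
    # then keep only features at or below the limit.
    df = Counter(f for vec in docs.values() for f in vec)
    return {d: {f: w for f, w in vec.items() if df[f] <= df_limit}
            for d, vec in docs.items()}

docs = {1: {"the": 0.9, "jazz": 0.4}, 2: {"the": 0.7, "rock": 0.7}, 3: {"the": 0.5}}
print(apply_df_limit(docs, df_limit=2))  # "the" (df = 3) is removed from all vectors
```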
Parallel Solutions for Exact APSS
Parallel score accumulation [Lin SIGIR'09; Baraglia et al. ICDM'10]
Partition-based Similarity Search (PSS) [Alabduljalil et al. WSDM'13]: 25x or more faster
Symmetry of Comparison
Partition-level comparison is symmetric. Example: should P_i compare with P_j, or P_j compare with P_i?
The choice of comparison direction affects the communication and load of the corresponding tasks, and thus the overall load balance
Similarity Graph → Comparison Graph
Load assignment process: transition from the undirected similarity graph to a directed comparison graph
Load Balance Measurement & Examples
Load balance metric: graph cost = max(task cost)
Task cost is the sum of
–Self-comparison cost, including computation and I/O cost
–Cost of comparisons with partitions whose edges point to it
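A simplified sketch of this metric: a task's cost is its self-comparison cost plus the cost of comparing with every partition whose edge points to it, and the graph cost is the maximum task cost. The quadratic size-based cost model (and the omission of the I/O term) is an illustrative assumption.

```python
def task_cost(k, sizes, in_edges):
    self_cost = sizes[k] * sizes[k]                    # self comparison (I/O term omitted)
    absorb = sum(sizes[k] * sizes[j] for j in in_edges.get(k, []))
    return self_cost + absorb

def graph_cost(sizes, in_edges):
    # Graph cost = maximum task cost over all partitions.
    return max(task_cost(k, sizes, in_edges) for k in sizes)

sizes = {"P1": 4, "P2": 2, "P3": 3}
in_edges = {"P2": ["P1", "P3"], "P3": ["P1"]}          # edges P1->P2, P3->P2, P1->P3
print(graph_cost(sizes, in_edges))                     # 21, the cost of task P3
```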
Challenges of Optimal Load Balance
Skewed distribution of node connectivity and node sizes, observed in empirical data
Two-Stage Load Balance
Key idea: tasks with small partitions or low connectivity should absorb more load
Stage 1: initial assignment of edge directions
Challenge: how to choose a sequence of assignment steps that balances the load?
Introduce the "Potential Weight" of a Task
How to find the potentially lightest task? Define the potential weight (PW) of a task as
–Self-comparison cost + comparison cost to ALL neighboring partitions
Each step picks the node with the lowest PW, lets it absorb the comparisons with its remaining neighbors, and updates the sub-graph
The lightest partitions thus absorb as much workload as possible
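A sketch of Stage 1 under the definitions above: repeatedly pick the node with the lowest potential weight, direct the edges to its still-unassigned neighbors toward it (it absorbs those comparisons), and remove it from the sub-graph. The size-product cost model and data structures are illustrative assumptions, not the paper's exact formulas.

```python
def stage1(sizes, neighbors):
    """Return in_edges[k]: partitions whose comparison task k absorbs."""
    remaining = set(sizes)
    in_edges = {k: [] for k in sizes}
    while remaining:
        def pw(k):   # potential weight: self cost + cost to ALL remaining neighbors
            return sizes[k] ** 2 + sum(sizes[k] * sizes[j]
                                       for j in neighbors[k] if j in remaining and j != k)
        lightest = min(remaining, key=pw)
        in_edges[lightest] = [j for j in neighbors[lightest]
                              if j in remaining and j != lightest]
        remaining.remove(lightest)       # each edge is assigned exactly once
    return in_edges

sizes = {"P1": 4, "P2": 2, "P3": 3}
neighbors = {"P1": ["P2", "P3"], "P2": ["P1", "P3"], "P3": ["P1", "P2"]}
print(stage1(sizes, neighbors))   # {'P1': [], 'P2': ['P1', 'P3'], 'P3': ['P1']}
```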
A Sequence of Absorption Steps: Example
(Diagrams: initial state, step 1, step 2)
Stage 2: Assignment Refinement
Key idea: gradually shift load from heavy tasks to their lightest neighbors
Only reverse an edge direction if it is beneficial
(Diagrams: result of Stage 1; a refinement step)
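A sketch of the Stage 2 refinement described above: repeatedly take the heaviest task and try to reverse one of its incoming edges toward a lighter neighbor, keeping the reversal only if it lowers the larger of the two task costs. The cost model, stopping rule, and round limit are illustrative assumptions.

```python
def refine(sizes, in_edges, rounds=100):
    def cost(k):
        return sizes[k] ** 2 + sum(sizes[k] * sizes[j] for j in in_edges[k])
    for _ in range(rounds):
        heavy = max(in_edges, key=cost)
        improved = False
        for j in list(in_edges[heavy]):
            before = max(cost(heavy), cost(j))
            in_edges[heavy].remove(j)        # tentatively reverse edge j -> heavy
            in_edges[j].append(heavy)
            if max(cost(heavy), cost(j)) < before:
                improved = True              # keep the beneficial reversal
                break
            in_edges[j].remove(heavy)        # not beneficial: undo
            in_edges[heavy].append(j)
        if not improved:
            break
    return in_edges

sizes = {"P1": 5, "P2": 1}
in_edges = {"P1": ["P2"], "P2": []}          # heavy P1 currently absorbs the comparison
print(refine(sizes, in_edges))               # edge reversed: lighter P2 takes it
```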
Gradual Mitigation of Load Imbalance
Load imbalance indicators: Maximum / Average and Std. Dev. / Average
(Plot: step-wise load imbalance mitigation while processing Twitter data)
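Tiny helpers for the two imbalance indicators on this slide, computed over a list of per-task costs; purely illustrative.

```python
import statistics

def max_over_avg(costs):
    return max(costs) / statistics.mean(costs)

def std_over_avg(costs):
    return statistics.pstdev(costs) / statistics.mean(costs)

costs = [21, 18, 16]
print(round(max_over_avg(costs), 2), round(std_over_avg(costs), 2))
```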
Competitive to Optimal Task Load Balancing
Is this two-stage algorithm competitive with the optimum? Optimum = minimum achievable (maximum task cost)
Result: two-stage solution ≤ (2 + δ) × Optimum, where δ is the ratio of I/O and communication cost over computation cost
In our tested cases, δ ≈ 10%
Competitive to Optimum Runtime Scheduler
Can the task assignment produced by the two-stage algorithm be competitive with the one produced by optimum runtime scheduling?
PT_opt = minimum parallel time on q cores
A greedy scheduler (e.g., Hadoop MapReduce) executes the tasks produced by the two-stage algorithm; the resulting schedule length is PT_q
Result: PT_q is bounded by a constant factor of PT_opt
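A sketch of the greedy list scheduler assumed above: each queued task is placed on the currently least-loaded of q identical cores, and the schedule length PT_q is the latest finish time. Hadoop's actual scheduling details are abstracted away; the task costs and q below are illustrative.

```python
import heapq

def greedy_schedule(task_costs, q):
    """Return the schedule length PT_q for q identical cores."""
    cores = [0.0] * q                     # current finish time of each core
    heapq.heapify(cores)
    for c in task_costs:                  # tasks in queue order
        earliest = heapq.heappop(cores)   # least-loaded core gets the next task
        heapq.heappush(cores, earliest + c)
    return max(cores)

print(greedy_schedule([21, 18, 16, 9, 7, 4], q=2))   # schedule length on 2 cores
```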
Evaluations
Implementation: Hadoop MapReduce
–Parallelized pre-processing and static partitioning
–Comparison graph stored in a distributed cache
–Hadoop scheduler executes queued tasks on available cores
–Exact similarity comparison without approximated preprocessing
Datasets: 100M Twitter, 40M ClueWeb, 625K Yahoo! Music
Clusters with Intel and AMD Opteron processors
Scalability: Parallel Time and Speedup
Efficiency declines as I/O overhead among machines grows in larger clusters
The YMusic dataset is not large enough to amortize the overhead when using more cores
Time Cost Distribution
Static partitioning is parallelized and its cost is insignificant
Computation cost dominates APSS and I/O overhead is relatively low (δ ≈ 10%), so the schedule is competitive with the optimum runtime
Comparison with Circular Load Assignment [Alabduljalil et al. WSDM'13]
Parallel time reduction: up to 39% from Stage 1 and up to 11% from Stage 2
(Charts: task cost and improvement percentage)
Contributions and Conclusions
Two-stage load assignment algorithm for APSS
–Converts the undirected similarity graph to a directed comparison graph
–Improvement: up to 41% for the tested cases
–Analysis of competitiveness with the optimum
–Scalable for large datasets on hundreds of cores
Improved dissimilarity detection and partitioning
–More dissimilarity detected
–Hierarchical static data partitioning method for more even partition sizes
–Contributes up to 18% end-performance gain
Q & A
Thank you! Presenter: Xun Tang (looking for challenging opportunities)
Backup Slides
Previous Work
Filtering
–Dynamic computation filtering [Bayardo et al. WWW'07]
–Prefix, positional, and suffix filtering [Xiao et al. WWW'08]
Similarity-based grouping
–Inverted indexing [Arasu et al. VLDB'06]
–Parallelization with MapReduce [Lin SIGIR'09]
–Feature-sharing groups [Vernica et al. SIGMOD'10]
–Locality-sensitive hashing [Gionis et al. VLDB'99; Ture et al. SIGIR'11]
–Partition-based similarity comparison [Alabduljalil et al. WSDM'13]
Load balancing and scheduling [focus of this paper]
–Index serving in search systems [Kayaaslan et al. TWEB'13]
–Division scheme with MapReduce [Wang et al. KDD'13]
–Greedy scheduling policy [Garey and Graham SIAM'75]
–Delay scheduling with Hadoop [Zaharia et al. EuroSys'10]
Contribution of This Paper
Improves the computation load balancing of PSS by an additional 41% for the given data
Key techniques
–Two-stage load assignment algorithm: the first stage constructs a preliminary load assignment; the second stage refines it
–Analytical results on competitiveness to support the design
–Improved dissimilarity detection with hierarchical data partitioning
Competitive to the Optimum for a Fully Connected Similarity Graph with Equal Sizes
For example, a 5-node graph
(Table: results for an n-node fully connected graph with equal partition sizes)
A Naïve Approach: Circular Load Balance
Each partition is compared with half of the other partitions, if they are potentially similar
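A sketch of this circular assignment: with n partitions arranged in a ring, partition i is assigned comparisons with the next floor((n-1)/2) partitions, so each pair is covered exactly once for odd n. The handling of even n (and of partitions that are not potentially similar) is an illustrative assumption.

```python
def circular_assignment(n):
    # Partition i compares with the next half of the ring.
    half = (n - 1) // 2
    return {i: [(i + k) % n for k in range(1, half + 1)] for i in range(n)}

print(circular_assignment(5))
# {0: [1, 2], 1: [2, 3], 2: [3, 4], 3: [4, 0], 4: [0, 1]} -- each pair appears once
```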
Function of each PSS task (coarse-grain task parallelism)
Read the assigned partition P_k and build an inverted index
Repeat
–Read vectors from a potentially similar partition
–Compare P_k with these vectors
–Write similar vector pairs
Until all potentially similar vectors are compared
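A compact sketch of the per-task loop above: build an inverted index for the assigned partition, then stream vectors from each potentially similar partition and accumulate dot products against the index. I/O is abstracted as plain Python iterables, and vectors are assumed to be unit-normalized so the dot product equals the cosine score.

```python
from collections import defaultdict

def pss_task(own_partition, other_partitions, tau):
    index = defaultdict(list)                    # feature -> [(doc id, weight)]
    for doc_id, vec in own_partition.items():
        for f, w in vec.items():
            index[f].append((doc_id, w))

    results = []
    for partition in other_partitions:           # each potentially similar partition
        for doc_id, vec in partition.items():
            scores = defaultdict(float)
            for f, w in vec.items():
                for own_id, own_w in index.get(f, []):
                    scores[own_id] += w * own_w  # accumulate partial dot products
            results += [(own_id, doc_id, s) for own_id, s in scores.items() if s >= tau]
    return results                               # similar vector pairs to write out
```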
Why do we follow PSS?
Normalized pair-wise comparison time, given similarity threshold 0.8
–PSS: 1.24 ns for Twitter and 0.74 ns for ClueWeb using 300 AMD cores
–Parallel score accumulation: 19.7x to 25x slower than PSS
–Alternate MapReduce solution: PQ [Lin SIGIR'09], measured with approximated processing
How Well Does It Work?
How does the comparison graph generated by the two-stage algorithm compare to the lowest-cost graph possible?
The optimum is defined as the smallest cost of a comparison graph derived from a similarity graph G
δ is the overhead ratio of I/O and communication over computation
How Well Does It Work?
How does our job completion time compare to the optimal solution, without knowing the allocated computing resources in advance?
PT_q is the job completion time of the two-stage load assignment with a greedy scheduler on q cores; PT_opt is that of an optimal solution
Improved Data Partitioning: r-norm
Improved Data Partitioning: Layer Size
Uniform: evenly-sized partitions
Non-uniform: size of partition L_k proportional to index k
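A hedged sketch of the non-uniform option: with m layers and N vectors in total, the size of layer L_k is made proportional to its index k. The exact rounding scheme is an illustrative assumption.

```python
def layer_sizes(total, m):
    weight = m * (m + 1) // 2                    # 1 + 2 + ... + m
    sizes = [total * k // weight for k in range(1, m + 1)]
    sizes[-1] += total - sum(sizes)              # give any rounding remainder to the last layer
    return sizes

print(layer_sizes(1000, 4))   # [100, 200, 300, 400]
```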
Effect of Recursive Hierarchical Partitioning
Recursively divide a large sublayer by detecting dissimilar vectors inside it
Each new partition inherits the dissimilarity relationships of its original sublayer
The new partitions, together with the undivided sublayers, form the undirected similarity graph