Cache-Conscious Performance Optimization for Similarity Search
Maha Alabduljalil, Xun Tang, Tao Yang
Department of Computer Science, University of California at Santa Barbara
36th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013)
All Pairs Similarity Search (APSS)
Definition: finding all pairs of objects whose similarity is above a given threshold: Sim(d_i, d_j) = cos(d_i, d_j) ≥ τ.
Application examples: collaborative filtering, spam and near-duplicate detection, image search, query suggestions.
Motivation: APSS is still time-consuming for large datasets.
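As a concrete illustration, here is a minimal sketch (not from the paper) of the pairwise test, assuming each document is a sparse term-to-weight dictionary that is already L2-normalized so that the dot product equals the cosine:

# Minimal sketch: cosine similarity of two sparse, L2-normalized term-weight vectors.
def cosine(d_i, d_j):
    # d_i, d_j: dict mapping term -> normalized weight
    if len(d_i) > len(d_j):
        d_i, d_j = d_j, d_i              # iterate over the shorter vector
    return sum(w * d_j.get(t, 0.0) for t, w in d_i.items())

tau = 0.8                                 # example threshold (illustrative value)
# keep pair (d_i, d_j) when cosine(d_i, d_j) >= tau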
Previous Work
Approaches to speed up APSS:
Exact APSS:
– Dynamic computation filtering [Bayardo et al. WWW'07]
– Inverted indexing [Arasu et al. VLDB'06]
– Parallelization with MapReduce [Lin SIGIR'09]
– Partition-based similarity comparison [Alabduljalil et al. WSDM'13]
Approximate APSS via LSH: a tradeoff between precision and recall, plus added redundant computation.
Approaches that exploit the memory hierarchy:
– General query processing [Manegold VLDB'02]
– Other computing problems.
Baseline: Partition-based Similarity Search (PSS) [WSDM'13]
Partitioning with dissimilarity detection.
Similarity comparison with parallel tasks.
PSS Task
Memory areas: S = vectors owned by the task, B = buffer for other vectors, C = temporary accumulators.
Task steps:
Read the assigned partition into area S.
Repeat:
  Read some vectors v_i from other partitions into B.
  Compare the vectors in B with S.
  Output similar vector pairs.
Until all potentially similar vectors have been compared.
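A minimal sketch of this baseline task loop (not the authors' implementation; it reuses the cosine helper above and assumes S is a list of owned vectors and blocks is an iterable of vector lists read from the other partitions):

# Sketch of the PSS task loop: compare every incoming vector with the owned area S.
def pss_task(S, blocks, tau):
    for block in blocks:                  # area B: vectors read from other partitions
        for v in block:
            for d in S:                   # area S: vectors owned by this task
                score = cosine(d, v)      # accumulation corresponds to area C
                if score >= tau:
                    yield (d, v, score)   # output similar vector pairs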
Focus and Contribution
Contribution: analyze the memory-hierarchy behavior of PSS tasks and propose new data layout/traversal techniques for speedup:
① Splitting data blocks to fit in cache.
② Coalescing: read a block of vectors from other partitions and process them together.
Algorithms:
Baseline: PSS [WSDM'13]
Cache-conscious designs: PSS1 & PSS2
Problem 1: the PSS area S is too big to fit in cache
[Diagram: memory areas of a PSS task; S holds the inverted index of the owned vectors, B holds other vectors, C holds the accumulators for S. S is too long to fit in cache.]
PSS1: Cache-conscious data splitting
[Diagram: after splitting, S is divided into splits S_1, S_2, ..., S_q, each with its own accumulators in C and compared against B. Open question: what split size?]
PSS1 Task
Read S and divide it into many splits.
Read other vectors into B.
For each split S_x: Compare(S_x, B).
Output similarity scores.

Compare(S_x, B):
  for d_i in S_x:
    for d_j in B:
      for each shared feature t: sim(d_i, d_j) += w_{i,t} * w_{j,t}
      if sim(d_i, d_j) + maxw_{d_i} * sum_{d_j} < τ then skip d_j (dynamic filtering)
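A minimal sketch of the splitting idea (not the authors' code; the dynamic maxw/sum filtering is omitted, vectors are sparse dictionaries as above, and the split size s is assumed to be chosen so that one split plus its accumulators fits in cache):

# Sketch of PSS1: divide S into splits of s vectors and compare each split against B.
def pss1_task(S, B, tau, s):
    for start in range(0, len(S), s):
        split = S[start:start + s]        # split S_x, sized to stay cache-resident
        for d_i in split:
            for d_j in B:
                score = cosine(d_i, d_j)  # accumulates products over shared features
                if score >= tau:
                    yield (d_i, d_j, score)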
Modeling Memory/Cache Access of PSS1
Areas touched per comparison: S_i, B, and C (accumulation sim(d_i, d_j) += w_{i,t} * w_{j,t} and the filtering test sim(d_i, d_j) + maxw_{d_i} * sum_{d_j} < τ).
Total number of data accesses: D_0 = D_0(S_i) + D_0(B) + D_0(C).
Cache misses and data access time
Memory and cache access counts:
D_0: total memory data accesses.
D_1: accesses missed at L1; D_2: accesses missed at L2; D_3: accesses missed at L3.
Memory and cache access time:
δ_i: access time at cache level i; δ_mem: access time in memory.
Total data access time = (D_0 - D_1)δ_1 + (D_1 - D_2)δ_2 + (D_2 - D_3)δ_3 + D_3 δ_mem.
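A small sketch of this cost model as code (the L1 and L2 latencies match the cycle counts quoted on the following slides; the L3 and memory latencies are illustrative assumptions, not values from the paper):

# Data-access-time model; default latencies are in cycles and partly assumed.
def data_access_time(D0, D1, D2, D3,
                     d1=2, d2=8, d3=35, dmem=200):   # d3, dmem: assumed placeholders
    return (D0 - D1) * d1 + (D1 - D2) * d2 + (D2 - D3) * d3 + D3 * dmem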
Total data access time = (D_0 - D_1)δ_1 + (D_1 - D_2)δ_2 + (D_2 - D_3)δ_3 + D_3 δ_mem.
Data found in L1 costs ~2 cycles; data found in L2 costs 6-10 cycles; data found only in L3 or in memory costs correspondingly more cycles (latencies δ_3 and δ_mem).
Actual vs. Predicted
Avg. task time ≈ #features × (lookup + multiply + add) + memory access time.
Recall: split size s
[Diagram: S divided into splits S_1, ..., S_q of size s, each with its accumulators in C, compared against B.]
Ratio of Data Access to Computation
Avg. task time ≈ #features × (lookup + add + multiply) + memory access time.
[Plot: ratio of data access time to computation time as a function of split size s.]
PSS2: Vector coalescing
Issue: PSS1 focuses on splitting S to fit into cache, but does not consider cache reuse to improve temporal locality in memory areas B and C.
Solution: coalesce multiple vectors in B and process them together (a code sketch follows below).
PSS2: Example for improved locality
[Diagram: striped areas of S_i, B, and C stay in cache while a coalesced group of B vectors is compared against the split S_i.]
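A minimal sketch of PSS2-style coalescing (not the authors' code): b vectors from B are grouped, a small inverted index is built over the group, and the whole group is traversed against a cache-resident split so the split entries and the accumulators in C are reused while still in cache. Vectors are sparse dictionaries as in the earlier sketches.

from collections import defaultdict

# Sketch of coalesced comparison of one split against B, b vectors at a time.
def pss2_compare(split, B, tau, b):
    for start in range(0, len(B), b):
        group = B[start:start + b]                 # coalesced block of b vectors
        # small inverted index over the block: term -> [(j, w_{j,t}), ...]
        inv = defaultdict(list)
        for j, d_j in enumerate(group):
            for t, w in d_j.items():
                inv[t].append((j, w))
        for d_i in split:                          # d_i stays cache-resident
            acc = [0.0] * len(group)               # accumulators (area C)
            for t, w_i in d_i.items():
                for j, w_j in inv.get(t, ()):
                    acc[j] += w_i * w_j
            for j, score in enumerate(acc):
                if score >= tau:
                    yield (d_i, group[j], score)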
Evaluation
Implementation: Hadoop MapReduce.
Objectives: effectiveness of PSS1 and PSS2 over PSS; benefits of the cost modeling.
Datasets: Twitter, Clueweb, Enron emails, YahooMusic, Google news.
Preprocessing: stopword removal + df-cut; static partitioning for dissimilarity detection.
Improvement ratio of PSS1, PSS2 over PSS
[Chart: PSS1 and PSS2 achieve up to a 2.7x improvement over PSS.]
Recall: coalescing size b
[Diagram: a coalesced group of b vectors from B compared against split S_i; average number of sharing ≈ 2.]
Average number of shared features
Overall performance (Clueweb)
[Chart: overall performance comparison on Clueweb.]
Impact of split size s in PSS1
[Charts for Clueweb and Twitter; x-axis: split size s.]
Recall: split size s & coalescing size b
[Diagram: split S_i of size s compared against a coalesced group of b vectors from B.]
Effect of s & b on PSS2 performance (Twitter)
[Chart: the fastest setting of s and b is marked.]
Conclusions
Splitting hosted partitions to fit into cache reduces slow memory data accesses (PSS1).
Coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2).
Cost modeling of memory-hierarchy access guides the choice of parameter settings.
Experiments show the cache-conscious designs can be up to 2.74x as fast as the cache-oblivious baseline.