Published by Christiana Miller. Modified over 9 years ago.
1
HmSearch: An Efficient Hamming Distance Query Processing Algorithm
Xiaoyang Zhang (1), Jianbin Qin (1), Wei Wang (1), Yifang Sun (1), Jiaheng Lu (2)
(1) University of New South Wales, Australia
(2) Renmin University of China, China
2
School of Computer Science and Engineering
Motivation
– Identify near-duplicate webpages: simhash maps pages to codes such as 0012345679ABCDEF and 1012345679ABCDEF, which are similar.
– Chemical data: maps into binary codes such as 012345679ABCDEF0 and 012345679ABCDEF1, which are similar.
3
More Applications
– Iris recognition
– Image retrieval
– C2LSH
4
Outline
Problem Definition
Framework
HmSearch
–Partitioning Scheme
–Signature Generation
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement
Conclusion
Experiment
5
Hamming Distance Query
Hamming distance: the number of positions at which the corresponding symbols differ, for two vectors of equal length.
Example: q = ABCD, v = ACCD, so hd(q, v) = 1.
Hamming distance query: given a database V of vectors, a query vector q (all vectors have the same dimensionality N), and a Hamming distance threshold k, find all v_i in V such that hd(v_i, q) <= k.
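The definition above can be made concrete with a naive linear scan. This is only a baseline sketch for the problem statement, not the HmSearch algorithm; the function names and the small database `V` are made up for illustration.

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return sum(a != b for a, b in zip(u, v))


def hamming_query(database, q, k):
    """Naive scan: return every v in the database with hd(v, q) <= k."""
    return [v for v in database if hamming_distance(v, q) <= k]


# Hypothetical toy database; the first entry is the slide's example vector.
V = ["ACCD", "ABDC", "ABCD", "DCBA"]
print(hamming_distance("ABCD", "ACCD"))  # 1
print(hamming_query(V, "ABCD", 1))       # ['ACCD', 'ABCD']
```

The scan touches every vector for every query, which is the cost the partitioning and signature techniques in the rest of the talk aim to avoid.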
7
Basic Idea
General framework:
1. We can do k = 1 efficiently (shown later).
2. So we transform a larger-k problem into several small k = 1 problems by partitioning.
3. We do filtering by looking at each partition.
4. We do verification last.
If hd(q, v) <= 1, then hd(q_left, v_left) = 0 or hd(q_right, v_right) = 0. So if k = 1, a vector can be filtered by looking at each part.
– v = 1111, q = 1211: the right halves are the same, so v is kept as a candidate.
– v = 1111, q = 1221: neither half is the same, so v is filtered.
8
Framework
– Indexing: Data, Partitioning, Generating Signatures, Index
– Query: Query, Partitioning, Candidates0, Filtering, Candidates1, Verification, Results
– Techniques: General Partitioning Scheme; 1-variants and 1-deletion variants; Enhanced Filtering; Hierarchical Filtering and Verification; Dimension Rearrangement
10
Partitioning
Lower bound for the partitioning strategy: given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there should be at least m partitions such that hd(q_part, v_part) is bounded. The expressions for m, for the per-partition bound, and for the chosen κ are not preserved in this transcript.
In our algorithm:
– When k = 0 or 1: m = 1, hd = 0.
– When k >= 2: hd <= 1, with m = 1 when k is even and m = 2 when k is odd.
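The guarantees listed on this slide can be checked by brute force. The sketch below assumes κ = ⌊(k + 3) / 2⌋ partitions; that count is not in the surviving slide text and is an assumption chosen because it is consistent with the two-partition (k = 2) and three-partition (k = 3) examples on the later slides. The function names are hypothetical.

```python
from itertools import product


def num_parts(k):
    # Assumed partition count; not stated in the surviving slide text.
    return (k + 3) // 2


def guaranteed_parts(k):
    """(m, t): at least m partitions have at most t mismatches when hd <= k."""
    if k <= 1:
        return 1, 0
    return (1, 1) if k % 2 == 0 else (2, 1)


def bound_holds(k):
    """Try every way to spread at most k mismatches over the partitions."""
    kappa = num_parts(k)
    m, t = guaranteed_parts(k)
    for errs in product(range(k + 1), repeat=kappa):
        if sum(errs) <= k and sum(e <= t for e in errs) < m:
            return False  # a distribution that violates the claimed bound
    return True


for k in range(7):
    print(k, num_parts(k), guaranteed_parts(k), bound_holds(k))
```

Under that assumption the check succeeds for every k tried, which is the pigeonhole argument in executable form: too few "clean" partitions would force more than k mismatches in total.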
12
Signature Generation
1-variants: substituting each dimension with each domain value each time (plus the vector itself).
v = [1, 2, 3] and Σ (domain) = [1, 2, 3]
1-val(v) = [1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]
We index all 1-val(v), and when q comes in, we search q in the index.
OR
1-deletion-variants: substituting each dimension with '#' each time.
v = [1, 2, 3]
1-del-val(v) = [#, 2, 3], [1, #, 3], [1, 2, #]
We index all 1-del-val(v), and when q comes in, we generate 1-del-val(q) and search all of 1-del-val(q) in the index.
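Both signature schemes are small enough to sketch directly. The code below is an illustrative reading of the slide, with hypothetical function names and a plain dictionary standing in for the index; it shows the 1-deletion-variant route, where two vectors share a signature exactly when their Hamming distance is at most 1 (assuming '#' is not itself a domain value).

```python
from collections import defaultdict


def one_variants(v, domain):
    """v itself plus every vector that changes one dimension to another domain value."""
    v = tuple(v)
    out = [v]
    for i, old in enumerate(v):
        for x in domain:
            if x != old:
                out.append(v[:i] + (x,) + v[i + 1:])
    return out


def one_deletion_variants(v):
    """Every vector obtained by replacing one dimension with the placeholder '#'."""
    v = tuple(v)
    return [v[:i] + ("#",) + v[i + 1:] for i in range(len(v))]


def build_deletion_index(vectors):
    """Map each 1-deletion variant to the ids of the vectors that produce it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for sig in one_deletion_variants(v):
            index[sig].add(vid)
    return index


def candidates(index, q):
    """Ids of indexed vectors that share a 1-deletion variant with q."""
    found = set()
    for sig in one_deletion_variants(q):
        found |= index.get(sig, set())
    return found


data = [(1, 2, 3), (1, 1, 3), (3, 3, 3)]  # hypothetical toy data
index = build_deletion_index(data)
print(one_variants([1, 2, 3], [1, 2, 3]))
print(one_deletion_variants([1, 2, 3]))
print(candidates(index, (1, 2, 3)))  # {0, 1}
```

The trade-off visible even in this toy version: 1-variants put more entries in the index but need a single lookup per query, while 1-deletion variants index one entry per dimension and need one lookup per dimension at query time.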
14
Enhanced Filter (Even)
Based on the formula before: when k (k >= 1) is even, m = 1.
Example: v = 123456, q = 121423.
If k = 2, based on the formula before, m = 1 and hd(v_part, q_part) = 1, so this v becomes a false positive.
However, we find that if k (k >= 1) is even, v is qualified only in one of two situations:
1) m = 1, where hd(v_part, q_part) = 0
2) m = 2, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.
15
Enhanced Filter (Odd)
Based on the formula before: when k (k >= 1) is odd, m = 2.
Example: v = 123456, q = 111423.
If k = 3, based on the formula before, m = 2 and hd(v_part, q_part) = 1, so this v becomes a false positive.
However, we find that if k (k >= 1) is odd, v is qualified only in one of two situations:
1) m = 2, where hd(v_part, q_part) <= 1 and at least one of them = 0
2) m = 3, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.
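The even and odd rules can be written as one small predicate over per-partition distances. This is a sketch under assumptions: equal-width contiguous partitions, two partitions for the k = 2 example and three for the k = 3 example (the slides do not state the partition boundaries), and hypothetical function names. A real implementation would learn "hd = 0" and "hd <= 1" from signature matches rather than computing the distances directly.

```python
def partition_distances(v, q, num_parts):
    """Per-partition Hamming distances for equal-width contiguous partitions."""
    if len(v) != len(q) or len(v) % num_parts != 0:
        raise ValueError("expected equal lengths divisible by num_parts")
    width = len(v) // num_parts
    return [
        sum(a != b for a, b in zip(v[i:i + width], q[i:i + width]))
        for i in range(0, len(v), width)
    ]


def passes_basic_filter(k, dists):
    """Original rule: m partitions within the per-partition threshold."""
    if k <= 1:
        return sum(d == 0 for d in dists) >= 1
    needed = 1 if k % 2 == 0 else 2
    return sum(d <= 1 for d in dists) >= needed


def passes_enhanced_filter(k, dists):
    """Enhanced rule from the two slides above."""
    zero = sum(d == 0 for d in dists)
    close = sum(d <= 1 for d in dists)
    if k % 2 == 0:
        return zero >= 1 or close >= 2
    return (close >= 2 and zero >= 1) or close >= 3


even = partition_distances("123456", "121423", 2)  # [1, 2]
odd = partition_distances("123456", "111423", 3)   # [1, 1, 2]
print(passes_basic_filter(2, even), passes_enhanced_filter(2, even))  # True False
print(passes_basic_filter(3, odd), passes_enhanced_filter(3, odd))    # True False
```

In both examples the basic rule admits v as a candidate while the enhanced rule rejects it, matching the "false positive, then filtered" narrative on the slides.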
17
Hierarchical Filtering and Verification
We can use binary operations to do hierarchical filtering and verification.
|Σ| = 8, so each value takes 3 bits. Each row is one bit plane, with one column per dimension:
Significant bit   v = [5, 0, 3, 6]   q = [5, 2, 2, 5]   diff (XOR)
1st               1 0 0 1            1 0 0 1            0000
2nd               0 0 1 1            0 1 1 0            0101
3rd               1 0 1 0            1 0 0 1            0011
OR of the diffs: 0111, so hd(v, q) = 3.
With k = 1: hd(v, q) >= 2, so v is filtered.
Moreover, even if k = 4, it takes 4 comparisons to calculate hd(v, q) = 3.
18
Hierarchical Filtering and Verification
v = [5, 0, 3, 6], q = [5, 2, 3, 5]. Each row is one bit plane, with one column per dimension:
Significant bit   v         q
1st               1 0 0 1   1 0 0 1
2nd               0 0 1 1   0 1 1 0
3rd               1 0 1 0   1 0 1 1
cumdiff starts at 0000 and is updated plane by plane:
Bit plane   diff (XOR)   cumdiff (OR)   Number of 1s in cumdiff
3rd         0001         0001           1 (<= 1, continue)
2nd         0101         0101           2 (> 1, filtered)
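The XOR/OR procedure on these two slides can be sketched with Python integers as bit planes. The plane order (least significant bit first) follows the second slide, where the 3rd significant bit is processed first; the function names are hypothetical and the code is an illustration rather than the authors' implementation.

```python
def bit_planes(vec, bits):
    """planes[j] packs bit j (0 = least significant) of every dimension into one int."""
    planes = []
    for j in range(bits):
        plane = 0
        for x in vec:
            plane = (plane << 1) | ((x >> j) & 1)
        planes.append(plane)
    return planes


def hierarchical_hd(v_planes, q_planes, k):
    """Return hd(v, q), or None as soon as more than k dimensions are known to differ."""
    cumdiff = 0
    for pv, pq in zip(v_planes, q_planes):
        cumdiff |= pv ^ pq  # dimensions that differ in any plane seen so far
        if bin(cumdiff).count("1") > k:
            return None     # filtered before the remaining planes are read
    return bin(cumdiff).count("1")


v = bit_planes([5, 0, 3, 6], 3)
print(hierarchical_hd(v, bit_planes([5, 2, 3, 5], 3), 1))  # None (filtered at the second plane)
print(hierarchical_hd(v, bit_planes([5, 2, 2, 5], 3), 4))  # 3
```

Because cumdiff only ever gains bits, its population count is a lower bound on the true distance after every plane, which is what makes the early exit safe.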
20
Impact of Data Skewness
Given k = 2, then m = 1 and k' = 1.
– Original dimension order (1 2 3 4 5 6): all vectors are qualified.
– Rearranged dimension order (1 2 5 4 3 6): only v1 is qualified.
We propose to reset the order and the partition length to improve performance.
Table values as extracted (columns Dim, v2, v1, q; the cell layout is not recoverable):
– Order 1 2 3 4 5 6: Partition1 1 1 0 1 1 0 1 1 0; Partition2 1 0 2 0 0 0 0 0 0; v3 = 202000; v4 = 300000
– Order 1 2 5 4 3 6: Partition1 1 1 0 1 1 0 0 0 0; Partition2 1 0 2 1 1 0 0 0 0; v3 = 200020; v4 = 300000
21
Greedy Dimension Rearrangement
MaxFreq is the maximum frequency of any value in each dimension.
Our goal: minimize the global MaxFreq.
Table values as extracted (columns Dim, v2, v1; the cell layout is not recoverable):
– Order 1 2 3 4 5 6: Partition1 1 0 1 0 1 0; Partition2 0 2 0 0 0 0; v3 = 202000; v4 = 300000; MaxFreq for Dim: 1 3 3 3 4 4
– Order 5 1 2 6 3 4: Partition1 0 0 1 0 1 0; Partition2 0 0 1 0 0 2; v3 = 020020; v4 = 030000; MaxFreq for partition: 4 4 1 2 1 1
The rearranged order achieves the goal.
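The slides state the objective (minimize the global MaxFreq) but not the greedy procedure itself, so the following is one plausible heuristic and not necessarily the authors' method. It assumes that the MaxFreq of a partition is the count of the most common value combination over that partition's dimensions, and that partitions have equal capacity; all names and the toy data are hypothetical.

```python
from collections import Counter


def max_freq(vectors, dims):
    """Count of the most common value combination when projecting onto dims."""
    return max(Counter(tuple(v[d] for d in dims) for v in vectors).values())


def global_max_freq(vectors, parts):
    """Worst (largest) MaxFreq over all partitions."""
    return max(max_freq(vectors, p) for p in parts)


def greedy_rearrange(vectors, num_parts):
    """Assign dimensions to partitions one at a time, keeping the global MaxFreq low."""
    n = len(vectors[0])
    cap = -(-n // num_parts)  # ceiling division: maximum dimensions per partition
    parts = [[] for _ in range(num_parts)]
    # Visit the least skewed dimensions first so each partition gets a selective one.
    order = sorted(range(n), key=lambda d: max_freq(vectors, [d]))
    for d in order:
        best_i, best_score = None, None
        for i, p in enumerate(parts):
            if len(p) >= cap:
                continue
            trial = [q + [d] if j == i else q for j, q in enumerate(parts)]
            score = global_max_freq(vectors, trial)
            if best_score is None or score < best_score:
                best_i, best_score = i, score
        parts[best_i].append(d)
    return parts


# Toy data: dimensions 0 and 1 are selective, dimensions 2 and 3 are constant.
data = [[0, 0, 7, 7], [1, 1, 7, 7], [2, 2, 7, 7], [3, 3, 7, 7]]
print(global_max_freq(data, [[0, 1], [2, 3]]))  # 4 with the original order
print(greedy_rearrange(data, 2))                # [[0, 2], [1, 3]]
```

On this toy input the original contiguous layout leaves one partition in which every vector looks identical, while the rearranged layout gives each partition one selective dimension and brings the global MaxFreq down to 1.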
23
Conclusion
1. General Partition Scheme
2. 1-variants and 1-deletion-variants
3. Techniques that help boost the performance
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement
25
Experiment Settings
Environment
–Intel Xeon X3330 2.664GHz CPU, 4GB RAM
–Debian 5.0.6
–AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem)
–Ubuntu/Linaro 4.6.4-1ubuntu5
–All compiled with GCC 4.1.2 with -O3
Dataset
26
Experiment Settings
Terms
–EF, Enhanced Filtering
–HB, Hierarchical Binary Filter
–RD, Rearranging Dimensions
Our algorithms
1. HSD, HSV: our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures
2. HSD-nEB, HSV-nEB: variations that remove EF and HB
3. HSD-nB, HSV-nB: variations that remove HB
4. HSD-nR, HSV-nR: variations that remove RD
Baseline algorithm
1. Scancount (Li et al., ICDE08)
State-of-the-art algorithms
1. Google (Manku et al., WWW07)
2. Hengine (Liu et al., ICDE11)
27
Query Time: HSV has the best performance.
28
Candidate Size: HSV has the smallest candidate size.
29
Effect of EF and HB: EF and HB help improve the performance.
30
Effect of RD: RD boosts the performance for PubChem data.
31
Index Size: HSV and HSD have a larger index size.
32
Thank you