HmSearch: An Efficient Hamming Distance Query Processing Algorithm
Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2
1 University of New South Wales, Australia
2 Renmin University of China, China
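School of Computer Science and Engineering
Motivation
Identify near-duplicate webpages: webpages are mapped to codes with simhash, so similar webpages map to similar codes.
Similar chemical data: chemical data maps into binary codes, so similar data maps to similar codes.
[Figure: example codes ABCDEF and ABCDEF1 marked as similar.]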
More Applications
Iris recognition
Image retrieval
C2LSH
Outline
Problem Definition
Framework
HmSearch
–Partitioning Scheme
–Signature Generation
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement
Conclusion
Experiment
Hamming Distance Query
Hamming distance: the number of positions at which the corresponding symbols differ for two equal-length vectors.
Example: q = ABCD, v = ACCD, so hd(q, v) = 1.
Hamming distance query: given a database V of vectors, a query vector Q (all vectors have the same dimensionality N) and a Hamming distance threshold k, find all v_i in V such that hd(v_i, Q) <= k.
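The definition above can be sketched as a plain linear scan. This is only a reference baseline for checking results, not the HmSearch algorithm; the function names `hamming_distance` and `hamming_query_scan` are illustrative and do not come from the slides.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def hamming_query_scan(database: List[Sequence], query: Sequence, k: int) -> List[Sequence]:
    """Return every vector v in the database with hd(v, query) <= k."""
    return [v for v in database if hamming_distance(v, query) <= k]


# Toy database of 4-dimensional vectors (strings are sequences of symbols).
V = ["ABCD", "ACCD", "ACCE", "DCBA"]
print(hamming_distance("ABCD", "ACCD"))   # 1, as in the slide example
print(hamming_query_scan(V, "ABCD", 1))   # ['ABCD', 'ACCD']
```

Every query touches every vector here, which is the cost the partitioning and signature techniques on the following slides try to avoid.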
Basic Idea
General framework:
1. We can handle k = 1 efficiently (shown later).
2. So we transform a problem with larger k into several small k = 1 problems by partitioning.
3. We filter by looking at each partition.
4. We verify the candidates last.
Example: split q and v into a left part and a right part. If hd(q, v) <= 1, then hd(q_left, v_left) = 0 or hd(q_right, v_right) = 0, because a single mismatch can fall in only one part. So for k = 1, v can be filtered by looking at each part.
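The k = 1 observation can be sketched as a filter-then-verify step: index both halves of every vector, look up the two halves of the query, and verify only the vectors that share a half. The helper names and the dictionary-based index are assumptions made for this illustration; the slides do not specify an index layout.

```python
def split_halves(vec):
    """Split a vector into a left and a right part (hypothetical helper)."""
    mid = len(vec) // 2
    return tuple(vec[:mid]), tuple(vec[mid:])


def build_half_index(database):
    """Map each (side, part) pair to the ids of the vectors containing it."""
    index = {}
    for vid, v in enumerate(database):
        left, right = split_halves(v)
        index.setdefault(("L", left), []).append(vid)
        index.setdefault(("R", right), []).append(vid)
    return index


def query_k1(database, index, q):
    """Ids of all vectors within Hamming distance 1 of q."""
    left, right = split_halves(q)
    # Filtering: any v with hd(q, v) <= 1 must match q exactly on one half.
    candidates = set(index.get(("L", left), [])) | set(index.get(("R", right), []))
    # Verification: compute the true distance only for the candidates.
    return sorted(
        vid for vid in candidates
        if sum(a != b for a, b in zip(database[vid], q)) <= 1
    )


db = ["ABCD", "ACCD", "ABCE", "XYZW", "ABXY"]
idx = build_half_index(db)
print(query_k1(db, idx, "ABCD"))  # [0, 1, 2]
```

In this toy run "XYZW" is never touched, and "ABXY" becomes a candidate (its left half matches) but is rejected by verification.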
Framework
[Figure: framework diagram.]
Indexing: Data, Partitioning, Generating Signatures, Index
Query: Query, Partitioning, Candidates0, Filtering, Candidates1, Verification, Results
Techniques: General Partitioning Scheme; 1-variants and 1-deletion variants; Enhanced Filtering; Hierarchical Filtering and Verification; Dimension Rearrangement
Partitioning
Lower bound for the partition strategy: given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there should be at least [formula missing] partitions such that hd(q_part, v_part) <= [formula missing].
In our algorithm, we choose [formula missing]:
When k is even, m = 1
When k is odd, m = 2
When k = 0 or 1, m = 1, hd = 0
When k >= 2, hd <= 1
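The stated values of m can be checked by brute force. The sketch below enumerates every way to spread at most k mismatches over κ partitions and reports the worst-case number of partitions with at most a given number of mismatches. The partition count `kappa = (k + 3) // 2` is an assumption chosen because it reproduces the m values listed above; the slide text does not preserve the formula actually used.

```python
from itertools import product


def min_partitions_within(k: int, kappa: int, per_part: int) -> int:
    """Worst case, over all ways to spread at most k mismatches across
    kappa partitions, of the number of partitions that contain at most
    per_part mismatches."""
    worst = kappa
    for errors in product(range(k + 1), repeat=kappa):
        if sum(errors) <= k:
            worst = min(worst, sum(e <= per_part for e in errors))
    return worst


for k in range(2, 8):
    kappa = (k + 3) // 2  # ASSUMED partition count, not stated on the slide
    m = min_partitions_within(k, kappa, per_part=1)
    print(f"k={k} kappa={kappa} guaranteed partitions with hd<=1: {m}")
```

Under this assumption the guaranteed count is 1 for even k and 2 for odd k, and for k = 0 or 1 at least one partition matches exactly, which agrees with the cases on the slide.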
Signature Generation
1-variants: substitute each dimension with each domain value, one at a time (plus the vector itself).
v=[1, 2, 3] and Σ (domain) =[1, 2, 3]
1-val(v)=[1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]
We index all 1-val(v) and when q comes in, we search q in the index.
OR
1-deletion-variants: substitute each dimension with '#', one at a time.
v=[1, 2, 3]
1-del-val(v)=[#, 2, 3], [1, #, 3], [1, 2, #]
We index all 1-del-val(v) and when q comes in, we generate 1-del-val(q) and search all 1-del-val(q) in the index.
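Both signature types can be sketched in a few lines. The function names `one_variants` and `one_deletion_variants` are illustrative, and signatures are kept as plain lists for readability; a real index would presumably hash or encode them.

```python
def one_variants(v, domain):
    """v itself plus every vector obtained by substituting exactly one
    dimension with a different domain value."""
    variants = [list(v)]
    for i in range(len(v)):
        for value in domain:
            if value != v[i]:
                variants.append(list(v[:i]) + [value] + list(v[i + 1:]))
    return variants


def one_deletion_variants(v, placeholder="#"):
    """Every vector obtained by substituting one dimension with '#'."""
    return [list(v[:i]) + [placeholder] + list(v[i + 1:]) for i in range(len(v))]


v = [1, 2, 3]
print(one_variants(v, [1, 2, 3]))
print(one_deletion_variants(v))

# Two vectors at Hamming distance 1 share a 1-deletion variant.
q = [1, 2, 1]
shared = [s for s in one_deletion_variants(q) if s in one_deletion_variants(v)]
print(shared)  # [[1, 2, '#']]
```

The trade-off visible here is that 1-variants grow with the domain size but need a single lookup per query part, while 1-deletion-variants stay at one signature per dimension but require one lookup per dimension of the query part.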
Enhanced Filter (Even)
Based on the formula before: when k (k>=1) is even, m = 1.
Example [figure: v and q by partition]: if k = 2, then m = 1, and v has a partition with hd(v_part, q_part) = 1, so this v passes the basic filter and becomes a false positive.
However, we find that if k (k>=1) is even, a qualifying v must satisfy one of two situations:
1) m = 1: at least one partition where hd(v_part, q_part) = 0
2) m = 2: at least two partitions where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies to the example, so v is filtered.
Enhanced Filter (Odd)
Based on the formula before: when k (k>=1) is odd, m = 2.
Example [figure: v and q by partition]: if k = 3, then m = 2, and v has two partitions with hd(v_part, q_part) = 1, so this v passes the basic filter and becomes a false positive.
However, we find that if k (k>=1) is odd, a qualifying v must satisfy one of two situations:
1) m = 2: at least two partitions where hd(v_part, q_part) <= 1, and at least one of them = 0
2) m = 3: at least three partitions where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies to the example, so v is filtered.
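The even and odd rules can be sketched as two predicates over per-partition distances. As a simplifying assumption, `part_hds` holds the Hamming distance of each partition between v and q; a signature-based implementation would only learn whether each partition matched exactly, within one, or not at all. The function names are illustrative.

```python
def passes_basic_filter(part_hds, k):
    """Basic rule: at least m partitions with hd <= 1,
    where m = 1 for even k and m = 2 for odd k."""
    within_one = sum(d <= 1 for d in part_hds)
    m = 1 if k % 2 == 0 else 2
    return within_one >= m


def passes_enhanced_filter(part_hds, k):
    """Enhanced rule from the two 'Enhanced Filter' slides."""
    exact = sum(d == 0 for d in part_hds)        # partitions with hd = 0
    within_one = sum(d <= 1 for d in part_hds)   # partitions with hd <= 1
    if k % 2 == 0:
        # Even k: one exact partition, or two partitions within distance 1.
        return exact >= 1 or within_one >= 2
    # Odd k: two within distance 1 with one of them exact, or three within 1.
    return (within_one >= 2 and exact >= 1) or within_one >= 3


# k = 2: one partition at distance 1, the other at distance 3 (total 4 > k).
print(passes_basic_filter([1, 3], 2), passes_enhanced_filter([1, 3], 2))        # True False
# k = 3: two partitions at distance 1, one at distance 2 (total 4 > k).
print(passes_basic_filter([1, 1, 2], 3), passes_enhanced_filter([1, 1, 2], 3))  # True False
```

Both examples are false positives under the basic rule and are removed by the enhanced rule, while true results such as `[0, 2]` for k = 2 or `[1, 1, 1]` for k = 3 still pass.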
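Hierarchical Filtering and Verification
We can use binary operations to do hierarchical filtering and verification.
Example: v=[5, 0, 3, 6], q=[5, 2, 2, 5], |Σ| = 8, k=1.
[Table: for the 1st, 2nd and 3rd significant bit of every dimension, diff is the XOR of the bits of v and q, and the diffs are combined with OR; individual bit rows not recovered.]
Once the combined diff shows hd(v, q) >= 2, v is filtered.
Moreover, even if k=4, computing hd(v, q) = 3 dimension by dimension takes 4 comparisons, whereas XOR and OR yield the combined diff 0111, so hd(v, q) = 3.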
Hierarchical Filtering and Verification
Example: v=[5, 0, 3, 6], q=[5, 2, 3, 5].
[Table: per significant bit (1st, 2nd, 3rd), diff (XOR) and cumdiff (OR of the diffs so far); bit rows not recovered.]
Number of 1s in cumdiff: 1 (<= 1, continue), then 2 (> 1, filtered).
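The XOR/OR idea can be sketched with one packed integer per bit level. The layout below (most significant level first, one bit per dimension) and the function names are assumptions made for illustration; the slides do not fix the order in which levels are examined.

```python
def bit_planes(vec, bits):
    """Pack the vector into one integer per bit level, most significant
    level first; bit j of a plane belongs to one dimension of vec."""
    planes = []
    for level in range(bits - 1, -1, -1):
        plane = 0
        for value in vec:
            plane = (plane << 1) | ((value >> level) & 1)
        planes.append(plane)
    return planes


def hierarchical_hd(v_planes, q_planes, k):
    """Return (hd, levels_used) if hd <= k, otherwise (None, levels_used)
    as soon as the accumulated lower bound exceeds k."""
    cumdiff = 0
    for used, (pv, pq) in enumerate(zip(v_planes, q_planes), start=1):
        cumdiff |= pv ^ pq                   # diff = XOR, cumdiff = OR of diffs
        if bin(cumdiff).count("1") > k:      # popcount is a lower bound on hd
            return None, used
    return bin(cumdiff).count("1"), len(v_planes)


v = bit_planes([5, 0, 3, 6], bits=3)
q = bit_planes([5, 2, 2, 5], bits=3)
print(hierarchical_hd(v, q, k=1))  # (None, 2): filtered before the last level
print(hierarchical_hd(v, q, k=4))  # (3, 3): cumdiff is 0b0111, so hd = 3
```

A dimension differs if it differs at any bit level, so the popcount of `cumdiff` can only grow; that is what makes early termination safe.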
Impact of Data Skewness
Given k=2, then m = 1 and k'=1.
[Figure: vectors v1, v2 and query q by dimension under two different partitionings; values not recovered.]
Under one partitioning all vectors are qualified; under the other only v1 is qualified.
We propose to reset the dimension order and the partition lengths to improve performance.
Greedy Dimension Rearrangement
MaxFreq is the maximum frequency of any value in each dimension.
[Figure: vectors v1 and v2 by dimension under two partitionings, annotated with MaxFreq for each dimension and MaxFreq for each partition; the second partitioning is labeled "Achieve the goal"; values not recovered.]
Our goal: minimize the global MaxFreq.
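The MaxFreq measure can be sketched as follows. Two points are assumptions of this illustration: the MaxFreq of a partition is taken to be the largest number of vectors that agree on all dimensions of that partition, and the global MaxFreq is the maximum over partitions. The greedy search itself is not shown, because the slides do not describe its steps.

```python
from collections import Counter


def max_freq(vectors, dims):
    """Largest number of vectors sharing the same values on the given dimensions."""
    counts = Counter(tuple(v[d] for d in dims) for v in vectors)
    return max(counts.values())


def global_max_freq(vectors, partitions):
    """Largest MaxFreq over all partitions (ASSUMED definition)."""
    return max(max_freq(vectors, dims) for dims in partitions)


# Toy data: dimensions 0 and 1 are skewed, dimensions 2 and 3 are not.
data = [[0, 0, 1, 2], [0, 0, 3, 4], [0, 0, 5, 6], [0, 1, 7, 8]]
print([max_freq(data, [d]) for d in range(4)])      # [4, 3, 1, 1]
print(global_max_freq(data, [[0, 1], [2, 3]]))      # 3: skewed dims grouped together
print(global_max_freq(data, [[0, 2], [1, 3]]))      # 1: skewed dims spread out
```

Pairing each skewed dimension with a selective one lowers the global MaxFreq in this toy case, which is the effect the rearrangement aims for.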
Conclusion
1. General partitioning scheme
2. 1-variants and 1-deletion-variants
3. Techniques that help boost the performance
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement
Experiment Settings
Environment
–Intel Xeon X GHz CPU, 4GB RAM
–Debian
–AMD Opteron™ GHz CPU, 96GB RAM (for PubChem)
–Ubuntu/Linaro ubuntu5
–All compiled with GCC with -O3
Dataset
Experiment Settings
Terms
–EF, Enhanced Filtering
–HB, Hierarchical Binary Filter
–RD, Rearranging Dimensions
Our algorithms
1. HSD, HSV: our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures
2. HSD-nEB, HSV-nEB: variations that remove EF and HB
3. HSD-nB, HSV-nB: variations that remove HB
4. HSD-nR, HSV-nR: variations that remove RD
Baseline algorithm
1. Scancount (Li et al., ICDE08)
State-of-the-art algorithms
1. Google (Manku et al., WWW07)
2. Hengine (Liu et al., ICDE11)
Query Time: HSV has the best performance.
Candidate Size: HSV has the smallest candidate size.
Effect of EF and HB: EF and HB help improve the performance.
Effect of RD: RD boosts the performance for the PubChem data.
Index Size: HSV and HSD have a larger index size.
Thank you