Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
Jialong Han
Co-authored with Jiaheng Lu, Xiaofeng Meng
Renmin University of China
Introduction: An Example
We are given a dictionary of strings we are interested in, e.g. product names or postal addresses.
We want to locate their approximate appearances in a series of documents.
For instance, a document may mention "canon eos digital camera" where the dictionary holds "canon eos 5d digital camera"; under a suitable similarity threshold, the former counts as an approximate appearance of the latter.
Problem Definition
Given a dictionary R and a threshold δ, extract all proper substrings m from the input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k).
Here we call r a piece of evidence for m.
Similarity() is a function measuring the similarity of two strings.
Strings are viewed as sets of tokens (words).
An example for Sim(): Jaccard similarity, J(r, m) = |r ∩ m| / |r ∪ m|.
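As a small illustration of the Jaccard similarity on token sets, the sketch below may help. It assumes whitespace tokenization and lowercasing, and it treats two empty strings as identical; none of these choices comes from the slides, and the function names are hypothetical.

```python
def tokenize(s: str) -> frozenset[str]:
    """View a string as a set of lowercase word tokens (assumed tokenization)."""
    return frozenset(s.lower().split())


def jaccard(r: str, m: str) -> float:
    """Unweighted Jaccard similarity |r ∩ m| / |r ∪ m| on token sets."""
    a, b = tokenize(r), tokenize(m)
    if not a and not b:
        return 1.0  # convention chosen here for two empty strings
    return len(a & b) / len(a | b)


# Four shared tokens out of five distinct tokens in total.
sim = jaccard("canon eos 5d digital camera", "canon eos digital camera")
disjoint = jaccard("canon eos", "nikon slr")
```

With δ = 0.8, the first pair would just reach the threshold under this unweighted variant; a weighted variant would sum token weights instead of counting tokens.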
Outline
Introduction
State-of-the-art techniques
  The filtration-verification framework
  K-signature scheme
  Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion
Why pre-pruning is needed
We need to find evidence in order to decide whether a substring m should be extracted.
Simply verifying m against all dictionary strings may be inefficient.
Pruning first and verifying afterwards is beneficial.
But should the pruning be oriented toward running speed or toward filtering power: less time, or fewer survivors?
The issue of compromise comes again
A balance between the two stages should be reached:
More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time.
Overall performance = Tf + Tv, where Tf is the filtration time and Tv is the verification time. Which balance gives the best total?
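The two-stage structure can be sketched as follows. This is only an illustration of the filtration-verification idea, not the filter used in the talk: the filter here is a deliberately cheap stand-in (keep dictionary strings that share at least one token with m), which is safe for unweighted Jaccard whenever δ > 0. All names are hypothetical.

```python
import time


def tokens(s: str) -> frozenset[str]:
    return frozenset(s.lower().split())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cheap_filter(m_tokens, dictionary):
    """Stand-in filter: sharing a token is necessary for Jaccard > 0."""
    return [r for r in dictionary if m_tokens & tokens(r)]


def extract_evidence(m, dictionary, delta):
    """Return (evidence, number of candidates, (Tf, Tv)) for one substring m."""
    m_tokens = tokens(m)
    t0 = time.perf_counter()
    candidates = cheap_filter(m_tokens, dictionary)      # filtration stage
    t1 = time.perf_counter()
    evidence = [r for r in candidates
                if jaccard(tokens(r), m_tokens) >= delta]  # verification stage
    t2 = time.perf_counter()
    return evidence, len(candidates), (t1 - t0, t2 - t1)


R = ["canon eos 5d digital camera", "nikon digital slr camera",
     "canon slr camera", "sony alpha lens"]
evidence, n_candidates, (tf, tv) = extract_evidence(
    "canon eos digital camera", R, delta=0.8)
```

A stronger filter would spend more of Tf to shrink the candidate list, and therefore Tv; the weak filter above prunes only one of the four strings.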
K-signature scheme
Proposed by Chakrabarti et al. (SIGMOD 2008).
Choose several top-weighted tokens of a string as signatures to represent it: s => Sig(s).
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m.
k is a parameter for tuning the filtration power.
Problem: potential loss of evidence. We found a counter-example for k=3; we tried and could only prove that the scheme works for k=1 and k=
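A minimal sketch of picking top-weighted tokens as a signature is given below. It assumes that "several top-weighted tokens" means the k heaviest tokens, with ties broken alphabetically and unknown tokens given weight 0; these details are assumptions made for illustration and are not taken from the original scheme. The weights are the ones used in the dictionary example later in the talk.

```python
# Token weights from the talk's dictionary example.
WEIGHTS = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}


def k_signature(s: str, weights: dict[str, float], k: int) -> list[str]:
    """Return the k heaviest tokens of s (assumed reading of the scheme)."""
    toks = set(s.lower().split())
    # Sort by descending weight; the token itself breaks ties deterministically.
    ranked = sorted(toks, key=lambda t: (-weights.get(t, 0.0), t))
    return ranked[:k]


sig_r1 = k_signature("canon eos 5d digital camera", WEIGHTS, k=3)
sig_r3 = k_signature("canon slr camera", WEIGHTS, k=1)
```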
Inverted Signature-based Hashtable
Proposed by Chakrabarti et al. (SIGMOD 2008).
Each dictionary string is encoded into a solid 0-1 matrix, with a 1 for each occurrence of a tuple (a 1-rectangle).
All solid matrices are combined with bitwise OR to obtain the matrix of R.
Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
The check is formalized as an NP-complete problem.
The resulting solution has filtering power that is too weak.
Outline
Introduction
State-of-the-art techniques
Our algorithms and evaluations
  Corrected filtering conditions
  EvSCAN: Filtration by SIL
  EvITER: Incremental optimization on EvSCAN
  Supporting Dynamic Thresholds
Conclusion
If Sim(m, r) ≥ δ, what do we have?
  wt(Sig(m) ∩ Sig(r)) ≥ τ(m): too strict!
  wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}: our proposed theorem, proved by us.
So the threshold does not remain constant, and it involves the unknown evidence r.
Our solution: use inverted lists to count sig-token overlaps.
Note that sig-tokens usually have low document frequency (e.g. with IDF as weights).
Signature-based Inverted Lists
Lists are indexed by sig-tokens.
Each sig-token of a string creates a node (containing the string's id) in the corresponding list.
E.g. R = { r1 = "canon eos 5d digital camera", r2 = "nikon digital slr camera", r3 = "canon slr camera" }, with wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
List heads (sig-token, weight): (5d, 9.0), (canon, 2.0), (camera, 1.0), (eos, 7.0), (nikon, 2.0), (slr, 2.0).
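Building such lists can be sketched as below. The signatures passed in are hypothetical: the slide text does not preserve which strings appear in which list, so the sets here were simply chosen so that the resulting list heads coincide with the six sig-tokens named on the slide. `build_sil` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical signatures for r1, r2, r3 (not recoverable from the slide).
SIGNATURES = {
    1: {"5d", "eos", "canon"},
    2: {"nikon", "slr", "camera"},
    3: {"canon", "slr", "camera"},
}


def build_sil(signatures: dict[int, set[str]]) -> dict[str, list[int]]:
    """Map each sig-token to the ids of the strings whose signature contains it."""
    lists: dict[str, list[int]] = defaultdict(list)
    for rid in sorted(signatures):          # ascending ids keep each list sorted
        for token in signatures[rid]:
            lists[token].append(rid)        # one node per (sig-token, string)
    return dict(lists)


sil = build_sil(SIGNATURES)
```

Tokens that never occur in a signature, such as "digital" under these assumed signatures, get no list at all, which keeps the index small.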
Filtration by SIL
An array called the accumulator is used to compute the overlapping sig weight wt(Sig(m) ∩ Sig(r)) for each dictionary string r.
E.g. m = "canon eos digital camera", δ = 0.8.
The lists of m's sig-tokens are scanned, and each node adds its token's weight to the accumulator entry of its string id (rid 1, 2, 3).
A string r is qualified as potential evidence when wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}.
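The accumulator-based filtration might look like the sketch below. Several inputs are placeholders: the list contents, Sig(m), and all τ values are made-up numbers for illustration only, since the slides do not give them, and a dictionary stands in for the accumulator array. Only the qualifying condition wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)} comes from the talk.

```python
from collections import defaultdict

WEIGHTS = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}

# Hypothetical list contents: sig-token -> ids of dictionary strings.
SIL = {"5d": [1], "eos": [1], "canon": [1, 3],
       "nikon": [2], "slr": [2, 3], "camera": [2, 3]}


def filter_by_sil(sig_m, sil, weights, tau_m, tau_r):
    """Return ids r with wt(Sig(m) ∩ Sig(r)) >= min(tau_m, tau_r[r])."""
    accumulator = defaultdict(float)
    for token in sig_m:                      # scan only the lists of m's sig-tokens
        for rid in sil.get(token, ()):
            accumulator[rid] += weights[token]
    return sorted(rid for rid, overlap in accumulator.items()
                  if overlap >= min(tau_m, tau_r[rid]))


# Placeholder signature and thresholds, chosen only to exercise the code.
sig_m = {"eos", "canon"}
qualified = filter_by_sil(sig_m, SIL, WEIGHTS,
                          tau_m=5.0, tau_r={1: 6.0, 2: 3.0, 3: 3.0})
```

Under these placeholder numbers r1 accumulates 9.0 and qualifies, r3 accumulates only 2.0, and r2 is never touched. Strings that are never touched cost nothing, which is why low-frequency sig-tokens make this scan cheap.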
EvITER: Progressive Computation
Recall that we are checking all substrings.
Some of them are quite similar, so they share duplicate computation.
An intuition: if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r as well.
Formally, let ES(m) be the set of potential evidence for m, and let list[t] = { s | s is a dictionary string that contains token t }.
We proved that ES(m·t) ⊆ ES(m) ∪ list[t].
Example
Document M: … canon eos digital camera lens …, with m = "canon eos digital camera" and t = "lens".
ES(m) = {r1}, and list[t] is the list of "lens" (weight 3.0), which contains r22 and r53.
Hence we know that only r1, r22 and r53 can possibly match "canon eos digital camera lens".
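The incremental step can be sketched as follows, assuming the containment ES(m·t) ⊆ ES(m) ∪ list[t] stated in the talk. The function names, the token lists, and the predicate are hypothetical; in particular `shares_a_token` is a trivial stand-in for the real potential-evidence test, chosen because the containment clearly holds for it.

```python
TOKEN_LIST = {"canon": {1, 3}, "eos": {1}, "lens": {22, 53}}  # hypothetical lists


def candidate_pool(es_m: set[int], token_list: dict[str, set[int]], t: str) -> set[int]:
    """Superset of ES(m·t): only ES(m) and list[t] need to be re-checked."""
    return es_m | token_list.get(t, set())


def shares_a_token(rid: int, m_tokens: list[str]) -> bool:
    """Stand-in predicate: rid occurs in the list of some token of m."""
    return any(rid in TOKEN_LIST.get(t, ()) for t in m_tokens)


def iterate_evidence(doc_tokens, token_list, is_potential_evidence):
    """Extend a substring token by token, reusing the previous evidence set."""
    es: set[int] = set()
    m: list[str] = []
    trace = []
    for t in doc_tokens:
        m.append(t)
        pool = candidate_pool(es, token_list, t)
        es = {rid for rid in pool if is_potential_evidence(rid, m)}
        trace.append((" ".join(m), sorted(es)))
    return trace


# The numbers from the example slide: ES(m) = {r1}, list["lens"] = {r22, r53}.
pool = candidate_pool({1}, {"lens": {22, 53}}, "lens")
trace = iterate_evidence(["canon", "eos", "lens"], TOKEN_LIST, shares_a_token)
```

The saving comes from never looking at dictionary strings outside the pool when the substring grows by one token.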
Flow of Evidence
EvITER stands for Evidence ITERATION …
The Static Threshold Problem
How does this index work so far?
- Get ready for δ = 0.8, please.
- Please wait 30 min for index generation…
- Ready!
- Document M1, δ = 0.8. Go!
- …Extraction complete.
- Document M2, and I want δ = 0.9…
- Sorry, please wait another 30 min for index regeneration…
- :-(
The Static Threshold Problem: This One Seems Better
- Get ready for δ ≥ 0.8, please.
- Please wait 30 min for index generation…
- Ready!
- Document M1, δ = 0.8. Go!
- …Extraction complete.
- Document M2, and I want δ = 0.9…
- …Extraction complete. :-)
Supporting Dynamic Thresholds
An observation: as δ decreases, the tokens of a string r fall into Sig(r) one by one, in the order of their weight ranking.
I.e. any node is active when δ is below a certain threshold u.
We record u in each node and sort the nodes in each list in descending order of their u values.
For any given δ, we only need to retrieve a prefix of each list to get all active nodes.
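Prefix retrieval from a list sorted by descending u can be sketched with a binary search. The u values below are invented for illustration, how u is derived for each node is not shown here, and treating a node as active when δ ≤ u (rather than δ < u) is an assumption about the boundary case.

```python
from bisect import bisect_right


def build_list(nodes: list[tuple[int, float]]) -> tuple[list[int], list[float]]:
    """Sort (rid, u) nodes by descending u; keep negated keys for bisect."""
    ordered = sorted(nodes, key=lambda node: -node[1])
    rids = [rid for rid, _ in ordered]
    neg_u = [-u for _, u in ordered]   # ascending, as bisect requires
    return rids, neg_u


def active_nodes(index_list: tuple[list[int], list[float]], delta: float) -> list[int]:
    """Return the prefix of nodes with u >= delta (assumed activity rule)."""
    rids, neg_u = index_list
    cut = bisect_right(neg_u, -delta)  # counts entries with -u <= -delta
    return rids[:cut]


# Hypothetical nodes of one inverted list: (string id, threshold u).
one_list = build_list([(2, 0.70), (1, 0.95), (3, 0.85)])
at_08 = active_nodes(one_list, 0.8)
at_09 = active_nodes(one_list, 0.9)
```

The list is built once, and each query with a different δ costs only a binary search per list, so no index regeneration is needed when the threshold changes.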
Experimental Datasets
DBLP: 274,788 paper titles
1,838,973 URLs
Balance should be reached
Recall our two stages of filtration and verification.
Performance (DBLP)
Conclusion
Our method causes no false negatives.
Our method achieves a good balance between the two phases of filtration and verification.
We also propose EvITER to eliminate duplicate computation.
Our method is both effective and efficient.
References
[1] A. Arasu, V. Ganti, R. Kaushik. Efficient exact set-similarity joins. In VLDB.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.
[3] A. Chandel, P. C. Nagesh, S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28.
[4] S. Chaudhuri, V. Ganti, R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5.
[5] M. R. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
[7] C. Li, J. Lu, Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266.
[8] C. Li, B. Wang, X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable length grams. In VLDB.
[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88.
[10] S. Sarawagi, A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference.
[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–43.
[12] E. Sutinen, J. Tarhio. On using q-gram locations in approximate string matching. In ESA.
[13] W. Wang, C. Xiao, X. Lin, C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.