Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists. Jialong Han, co-authored with Jiaheng Lu and Xiaofeng Meng, Renmin University of China.

Presentation transcript:

Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
Jialong Han, co-authored with Jiaheng Lu and Xiaofeng Meng
Renmin University of China

Introduction: An Example
We have a dictionary of strings we are interested in, e.g. product names or postal addresses. We want to locate their approximate appearances in a series of documents. The slide's example (not preserved in this transcript) illustrates what an approximate appearance means.

Problem Definition
Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). Here we call r a piece of evidence for m.
Similarity() is a function measuring the similarity of two strings. Strings are viewed as sets of tokens (words).
An example for Sim() is Jaccard similarity: J(r, m) = |r ∩ m| / |r ∪ m|.
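To make the similarity function concrete, here is a minimal Python sketch of plain (unweighted) Jaccard similarity over word tokens. The function name `jaccard`, whitespace tokenization and the convention for two empty strings are assumptions made for illustration; the slides do not prescribe an implementation.

```python
def jaccard(r: str, m: str) -> float:
    """Jaccard similarity of two strings viewed as sets of word tokens."""
    r_tokens, m_tokens = set(r.split()), set(m.split())
    if not r_tokens and not m_tokens:
        return 1.0  # convention chosen here for two empty strings (assumption)
    return len(r_tokens & m_tokens) / len(r_tokens | m_tokens)


# A dictionary string and a document substring that drops the token "5d":
# 4 shared tokens out of 5 distinct tokens.
print(jaccard("canon eos 5d digital camera", "canon eos digital camera"))  # 0.8
```

With δ = 0.8, the substring above would count as an approximate appearance of the dictionary string.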

Outline
Introduction
State-of-the-art techniques
  The filtration-verification framework
  K-signature scheme
  Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion

Why pre-pruning is needed
To decide whether a substring m should be extracted, we need to find evidence for it in the dictionary. Simply verifying m against all dictionary strings can be inefficient, so pruning before verification is beneficial.
But should the pruning be oriented toward running speed or toward filtering power? Less time, or fewer survivors?

The issue of compromise comes again
A balance between the two stages should be reached: more (less) filtration time gives stronger (weaker) filtration power, hence fewer (more) candidates and less (more) verification time.
Overall cost = Tf + Tv (filtration time plus verification time). How should it be minimized?

K-signature scheme
Proposed by Chakrabarti et al. (SIGMOD 2008).
Choose several top-weighted tokens of a string s as signatures to represent it: s => Sig(s).
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m.
K is a parameter for tuning filtration power.
Problem: potential evidence loss. We found a counter-example for k=3, and could only prove that the scheme works for k=1 and k=…

Inverted Signature-based Hashtable
Proposed by Chakrabarti et al. (SIGMOD 2008).
Each dictionary string is encoded into a solid 0-1 matrix, with a 1 for each occurrence of a tuple (a 1-rectangle). All solid matrices are combined by bitwise OR to get the matrix of R.
Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
This check is formalized into an NP-complete problem, and its solution yields filtering power that is too weak.

Outline
Introduction
State-of-the-art techniques
Our algorithms and evaluations
  Corrected filtering conditions
  EvSCAN: Filtration by SIL
  EvITER: Incremental optimization on EvSCAN
  Supporting Dynamic Thresholds
Conclusion

If Sim(m, r) ≥ δ, what do we have?
wt(Sig(m) ∩ Sig(r)) ≥ τ(m): too strict!
wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}: our proposed theorem, proved by us.
So the threshold does not remain constant: it involves the unknown evidence r.
Our solution: use inverted lists to count sig-token overlaps. Note that sig-tokens usually have low document frequency (e.g. with IDF as weights).

Signature-based Inverted Lists
Lists are indexed by sig-tokens. Each sig-token of a string creates a node (containing the string's id) in the corresponding list.
E.g. R = {r1 = "canon eos 5d digital camera", r2 = "nikon digital slr camera", r3 = "canon slr camera"}, with wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
List heads (sig-token, weight): 5d, 9.0; canon, 2.0; camera, 1.0; eos, 7.0; nikon, 2.0; slr, 2.0. The nodes of each list are not preserved in this transcript.
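The index structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the transcript does not preserve which strings appear in which list, so the signature sets in `SIGNATURES` are hypothetical, chosen so that the resulting list keys match the six sig-tokens named on the slide. The names `WEIGHTS`, `SIGNATURES` and `build_sil` are likewise invented here.

```python
from collections import defaultdict

# Token weights taken from the slide's example.
WEIGHTS = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
           "slr": 2.0, "eos": 7.0, "5d": 9.0}

# ASSUMPTION: hypothetical signature sets per string id (r1, r2, r3).
# The slide's actual Sig(r) contents are not preserved in the transcript.
SIGNATURES = {
    1: {"5d", "eos", "canon"},
    2: {"nikon", "slr", "camera"},
    3: {"canon", "slr", "camera"},
}


def build_sil(signatures):
    """Map each sig-token to the ascending ids of strings whose signature holds it."""
    lists = defaultdict(list)
    for rid in sorted(signatures):          # ascending ids keep each list sorted
        for token in signatures[rid]:
            lists[token].append(rid)        # one node per (sig-token, string) pair
    return dict(lists)


sil = build_sil(SIGNATURES)
for token in sorted(sil):
    print(token, WEIGHTS[token], sil[token])
```

Tokens that belong to no signature (here "digital") get no list at all, which is what keeps the index small.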

Filtration by SIL
An array called the accumulator is used to compute the overlapped sig weight wt(Sig(m) ∩ Sig(r)) for each dictionary string r.
E.g. m = "canon eos digital camera", δ = 0.8. The lists of m's sig-tokens are scanned in the same index (5d, 9.0; canon, 2.0; camera, 1.0; eos, 7.0; nikon, 2.0; slr, 2.0), and the weights are accumulated per rid (1, 2, 3).
A string r is qualified when wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}.
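A minimal Python sketch of accumulator-based filtration is given below. It is an illustration, not the authors' implementation: the transcript defines neither Sig() nor τ() numerically, so the index contents in `SIL`, the signature of m, and every τ value are hypothetical placeholders, and `filter_candidates` is an invented name.

```python
from collections import defaultdict

# Token weights taken from the slide's example.
WEIGHTS = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
           "slr": 2.0, "eos": 7.0, "5d": 9.0}

# ASSUMPTION: hypothetical list contents; the slide's nodes are not preserved.
SIL = {"5d": [1], "eos": [1], "canon": [1, 3],
       "nikon": [2], "slr": [2, 3], "camera": [2, 3]}


def filter_candidates(sig_m, tau_m, tau_r, sil, weights):
    """Return ids r with wt(Sig(m) ∩ Sig(r)) >= min(tau(m), tau(r))."""
    accumulator = defaultdict(float)
    for token in sig_m:
        # Every node in this list belongs to a string whose signature
        # contains the token, so its weight counts toward the overlap.
        for rid in sil.get(token, ()):
            accumulator[rid] += weights[token]
    return sorted(rid for rid, overlap in accumulator.items()
                  if overlap >= min(tau_m, tau_r[rid]))


# ASSUMPTION: Sig(m) and all tau values below are made up for illustration.
survivors = filter_candidates(
    sig_m={"eos", "canon"},
    tau_m=5.0,
    tau_r={1: 8.0, 2: 3.0, 3: 3.0},
    sil=SIL,
    weights=WEIGHTS,
)
print(survivors)  # [1]
```

Only strings that share at least one sig-token with m ever receive an accumulator entry, so the rest of the dictionary is pruned without being touched. The survivors still have to pass the verification stage.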

EvITER: Progressive Computation
Recall that we are checking all substrings. Some of them are quite similar, which means they share duplicate computation.
An intuition: if m has potential evidence r, then m + t (m extended by the next token t) is very likely to match r.
Formally, let ES(m) be the set of potential evidence for m, and list[t] = {s | s is a dictionary string that contains token t}. We proved that ES(m + t) ⊆ ES(m) ∪ list[t].

Example
Document M: … canon eos digital camera lens …, with m = "canon eos digital camera" and t = "lens".
ES(m) = {r1} and list[lens] (weight 3.0) = {r22, r53}.
We know that only r1, r22 and r53 can possibly match "canon eos digital camera lens".
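The incremental step can be sketched in Python: the candidate evidence for a substring extended by one token t is bounded by the union of the previous evidence set and list[t]. The function name `extend_candidates` and the dictionary layout are assumptions made for illustration; the ids mirror the slide's example.

```python
def extend_candidates(es_m, token, token_lists):
    """Upper bound on ES(m + t): the union ES(m) ∪ list[t]."""
    return set(es_m) | set(token_lists.get(token, ()))


# list[t] holds the ids of all dictionary strings containing token t.
token_lists = {"lens": {22, 53}}

# ES(m) = {r1} for m = "canon eos digital camera"; extend m by t = "lens".
candidates = extend_candidates({1}, "lens", token_lists)
print(sorted(candidates))  # [1, 22, 53]
```

The union is only a superset of ES(m + t), so each candidate must still be checked against the filtering condition; the saving comes from not rescanning the lists of the tokens already in m.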

Flow of Evidence
EvITER stands for Evidence ITERation. (Flow diagram not preserved in this transcript.)

The Static Threshold Problem
How does this index work so far?
- Get ready for δ = 0.8, please.
- Please wait 30 min for index generation…
- Ready!
- Document M1, δ = 0.8. Go!
- …Extraction complete.
- Document M2, and I want δ = 0.9…
- Sorry, please wait another 30 min for index regeneration…
- :-(

The Static Threshold Problem
This one seems better:
- Get ready for δ ≥ 0.8, please.
- Please wait 30 min for index generation…
- Ready!
- Document M1, δ = 0.8. Go!
- …Extraction complete.
- Document M2, and I want δ = 0.9…
- …Extraction complete. :-)

Supporting Dynamic Thresholds
An observation: as δ descends, a string r's tokens fall into Sig(r) one by one, in the order of their weight ranking. I.e. any node is active when δ is below a certain threshold u.
We record u in each node and sort all nodes in each list in descending order of their u values.
For any given δ, we only need to retrieve a prefix of each list to get all active nodes.
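The prefix retrieval can be sketched in Python with a binary search over the recorded u values. This is a sketch under assumptions: the class name `ThresholdList`, the (rid, u) node layout, the sample u values, and the strict comparison δ < u at the boundary are all choices made here, not details given in the slides.

```python
from bisect import bisect_left


class ThresholdList:
    """One inverted list whose nodes are sorted by descending activation threshold u."""

    def __init__(self, nodes):
        # nodes: iterable of (rid, u); a node is active when delta < u (assumption).
        self.nodes = sorted(nodes, key=lambda node: -node[1])
        # Negated u values are ascending, which is what bisect expects.
        self._neg_u = [-u for _, u in self.nodes]

    def active(self, delta):
        """Return the ids in the prefix of nodes that are active for this delta."""
        cut = bisect_left(self._neg_u, -delta)  # number of nodes with u > delta
        return [rid for rid, _ in self.nodes[:cut]]


# Hypothetical nodes: (string id, threshold u below which the node is active).
lst = ThresholdList([(2, 0.70), (1, 0.95), (3, 0.85)])
print(lst.active(0.8))  # [1, 3]
print(lst.active(0.9))  # [1]
```

Because every list is built once with all u values recorded, changing δ between documents only moves the cut point and requires no index regeneration.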

Experimental Datasets
DBLP: 274,788 paper titles
1,838,973 URLs

Balance should be reached
Recall our two stages of filtration and verification.

Performance (DBLP)

Conclusion
Our method causes no false negatives.
Our method achieves a good balance between the two phases of filtration and verification.
We also propose EvITER to eliminate duplicate computation.
Our method is both effective and efficient.

References
[1] A. Arasu, V. Ganti, R. Kaushik. Efficient exact set-similarity joins. In VLDB.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.
[3] A. Chandel, P. C. Nagesh, S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28.
[4] S. Chaudhuri, V. Ganti, R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5.
[5] M. R. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
[7] C. Li, J. Lu, Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266.
[8] C. Li, B. Wang, X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable length grams. In VLDB.
[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88.
[10] S. Sarawagi, A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference.
[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–43.
[12] E. Sutinen, J. Tarhio. On using q-gram locations in approximate string matching. In ESA.
[13] W. Wang, C. Xiao, X. Lin, C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.