String Similarity Measures and Joins with Synonyms

String Similarity Measures and Joins with Synonyms Jiaheng Lu Renmin University of China Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang

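Motivation
Example (String Measure)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
The two strings share no token, so a purely syntactic measure finds no similarity even though they are semantically close.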

SIGMOD  ACM's Special Interest Group on Management Of Data Example (String Measure) S1=“International Conference on Management of Data NY USA” S2=“SIGMOD 2013 New York United States” Synonyms SIGMOD  International Conference on Management of Data NY  New York USA  United States SIGMOD  ACM's Special Interest Group on Management Of Data How to use the existing synonyms?

Research Problem 1 (String Measurements)
Input: Two strings s and t, and a set of synonyms R
Output: Using R, return the maximal Jaccard similarity Jaccard(s,t,R)

Problem 2 (String Similarity Join)
Input: Two sets of strings S and T, a set of synonyms R, and a threshold value
Output: All similar pairs (s,t), with s in S and t in T, such that Jaccard(s,t,R) >= the threshold
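
The join problem above can be sketched as a naive nested-loop baseline. This is only an illustration of the problem statement, not the join algorithm of the talk: the function names are hypothetical, and a plain token-based Jaccard (without synonyms) stands in for Jaccard(s,t,R), which would be passed in as the `sim` argument.

```python
from itertools import product


def plain_jaccard(s, t):
    """Token-set Jaccard similarity without synonyms (stand-in measure)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def similarity_join(S, T, sim, threshold):
    """Return every pair (s, t) from S x T whose similarity reaches the threshold."""
    return [(s, t) for s, t in product(S, T) if sim(s, t) >= threshold]


pairs = similarity_join(["VLDB Conf"], ["VLDB", "ICDE 2013"], plain_jaccard, 0.5)
print(pairs)
```

Every pair is compared here, which is the cost that a filter-and-verify design tries to avoid.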

An example of similarity join

Table S1
ID  String
q1  2013 ACM Intl Conf on Management of Data USA
q2  Very Large Data Bases Conf
q3  VLDB Conf
q4  ICDE 2013

Table S2
ID  String
s1  SIGMOD
s2  VLDB

Synonyms
SIGMOD -> International Conference on Management of Data
VLDB -> Very Large Data Bases

Existing works on approximate string match with synonyms
Transformation-based framework (JaccT) [1]: compared with our method.
Machine learning method [2]: a Hidden Markov Model-based measure. It depends on training data and is not efficient.
[1] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.
[2] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.

Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion

String Similarity Measures (Full-expansion)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using all synonyms
S1’=“International Conference on Management of Data NY USA SIGMOD New York United States”
S2’=“SIGMOD 2013 New York United States International Conference on Management of Data NY USA ACM's Special Interest Group on Management Of Data”
Jaccard(S1’,S2’) = 13/18 = 0.72
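
A minimal Python sketch of the full-expansion computation on this slide. It assumes whitespace tokenization, case-insensitive matching, and that each synonym pair may be applied in either direction (as the expanded strings on the slide suggest). The function names are illustrative and not taken from the original work.

```python
def tokens(s):
    """Lower-cased whitespace tokens of a string."""
    return s.lower().split()


def contains(seq, sub):
    """True if the token list `sub` occurs contiguously inside `seq`."""
    m = len(sub)
    return any(seq[i:i + m] == sub for i in range(len(seq) - m + 1))


def full_expansion(s, rules):
    """Token set of s plus the other side of every synonym pair found in s."""
    toks = tokens(s)
    expanded = set(toks)
    for lhs, rhs in rules:
        l, r = tokens(lhs), tokens(rhs)
        if contains(toks, l):
            expanded.update(r)
        if contains(toks, r):
            expanded.update(l)
    return expanded


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a | b else 1.0


RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"

full_sim = jaccard(full_expansion(S1, RULES), full_expansion(S2, RULES))
print(round(full_sim, 2))  # 13 shared tokens out of 18, i.e. 0.72
```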

String Similarity Measures (Selective-expansion)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using only good synonyms
S1’=“International Conference on Management of Data NY USA SIGMOD New York United States”
S2’=“SIGMOD 2013 New York United States International Conference on Management of Data NY USA”
Jaccard(S1’,S2’) = 13/14 = 0.93

String Similarity Measures (Selective)
Selective-expansion is NP-hard (reduction from 3-SAT).
Greedy algorithm: choose synonyms that can increase the current similarity by computing the similarity-gain.
Property: the greedy result is optimal in more than 70% of cases in practice.
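
The greedy idea can be illustrated with a simplified forward-selection sketch. It is an assumption-laden stand-in, not the exact procedure of the talk: it repeatedly applies the single applicable synonym that raises the current Jaccard value the most and stops when none helps. Tokenization, bidirectional rules, and all names are hypothetical choices made for the example.

```python
def tokens(s):
    return s.lower().split()


def contains(seq, sub):
    m = len(sub)
    return any(seq[i:i + m] == sub for i in range(len(seq) - m + 1))


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def greedy_selective_jaccard(s, t, rules):
    """Greedy approximation of the maximal Jaccard(s, t, R)."""
    a_toks, b_toks = tokens(s), tokens(t)
    a, b = set(a_toks), set(b_toks)
    # Candidate expansions: (side, tokens to add), rules usable in both directions.
    cands = []
    for lhs, rhs in rules:
        l, r = tokens(lhs), tokens(rhs)
        for side, toks in (("a", a_toks), ("b", b_toks)):
            if contains(toks, l):
                cands.append((side, frozenset(r)))
            if contains(toks, r):
                cands.append((side, frozenset(l)))

    def score(cand):
        side, add = cand
        return jaccard(a | add, b) if side == "a" else jaccard(a, b | add)

    best = jaccard(a, b)
    while cands:
        top = max(cands, key=score)
        top_score = score(top)
        if top_score <= best:  # no remaining synonym has a positive gain
            break
        best = top_score
        if top[0] == "a":
            a |= top[1]
        else:
            b |= top[1]
        cands.remove(top)
    return best


RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"
selective_sim = greedy_selective_jaccard(S1, S2, RULES)
print(round(selective_sim, 2))  # 0.93: the ACM expansion is never applied
```

On this example the sketch skips the "ACM's Special Interest Group" expansion because applying it would lower the similarity, which matches the 13/14 result of the selective-expansion slide.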

Similarity Joins (Filtering and Verification)
Steps: Generate signatures with full expansion, filtering candidates, verify candidates
Signature methods: Prefix method, LSH method
Similarity measures: Full expansion, Selective expansion

String Similarity Joins (SN-Join)
Prefix method
Global ordering: {a b c d e f g h i j k l}
Threshold=0.8
S1=“c, k, e, a, f”   S2=“d, b, f, e, k”
Order the strings: S1’=“a, c, e, f, k”   S2’=“b, d, e, f, k”
Get signatures: Sig(s1)=“a, c”   Sig(s2)=“b, d”
No overlap, so Jacc(s1,s2)<0.8
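
The prefix example can be reproduced with a short sketch. It assumes the standard prefix-filtering length for a Jaccard threshold, |s| - ceil(threshold * |s|) + 1 tokens taken in the global order; the slide does not state the formula, so this choice and the function names are assumptions.

```python
import math


def prefix_signature(tokens, order, threshold):
    """First |s| - ceil(threshold * |s|) + 1 tokens of s in the global order."""
    ranked = sorted(set(tokens), key=order.index)
    k = len(ranked) - math.ceil(threshold * len(ranked)) + 1
    return ranked[:k]


ORDER = list("abcdefghijkl")
sig1 = prefix_signature(["c", "k", "e", "a", "f"], ORDER, 0.8)
sig2 = prefix_signature(["d", "b", "f", "e", "k"], ORDER, 0.8)

# Disjoint signatures mean the pair cannot reach the threshold and is pruned.
print(sig1, sig2, set(sig1).isdisjoint(sig2))
```

The actual Jaccard value of the two strings is 3/7, so pruning the pair at threshold 0.8 is safe.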

Signature selection is important
How to select signatures to enhance the signature filtering power?
It is unrealistic to find a “one-size-fits-all” solution.

Estimation-based signature selection
Three steps to select signatures:
1. Generate multiple signature schemes for each data set.
2. Given two tables for join, quickly estimate the filtering power of each scheme.
3. Select the scheme with the best filtering power.

An example on estimator
Self-join:
ID  String                                        Signatures
q1  2013 ACM Intl Conf on Management of Data USA  ACM, International, Conference, on
q2  Very Large Data Bases Conf                    Conf, Conference
q3  VLDB Conf                                     Conf, Conference
q4  ICDE 2013                                     ICDE

Inverted lists:
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4

Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
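
A small sketch of how the candidates on this slide follow from the inverted lists: two strings become a candidate pair when they appear in the same list. The signatures of q3 are not legible in the transcript and are assumed here to be Conf and Conference, which is what the inverted lists imply. Function and variable names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures):
    """Pairs of record ids that share at least one signature."""
    inverted = defaultdict(list)
    for rid, sigs in signatures.items():
        for sig in set(sigs):
            inverted[sig].append(rid)
    pairs = set()
    for rids in inverted.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


SIGNATURES = {
    "q1": ["ACM", "International", "Conference", "on"],
    "q2": ["Conf", "Conference"],
    "q3": ["Conf", "Conference"],  # assumed, see the note above
    "q4": ["ICDE"],
}
candidates = candidate_pairs(SIGNATURES)
print(sorted(candidates))
```

The number of such pairs is the quantity a signature scheme tries to keep small, so an estimator only needs to approximate this count rather than enumerate the pairs.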

Applying FM sketches on inverted lists
Self-join example as on the previous slide: strings q1 to q4, filtering results (candidates) (q2,q3), (q1,q2), (q1,q3).
Using a Flajolet-Martin (FM) sketch for each inverted list (ACM, Conf, Conference, International, on, ICDE).

FM sketches (Flajolet and Martin, JCSS 1985)
Estimates the number of distinct items in a multi-set of values from [0,…, M-1].
Assume a hash function h(x) that maps incoming values x in [0,…, M-1] uniformly across [0,…, 2^L-1], where L = O(log M).
Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y.
A value x is mapped to lsb(h(x)).
Example stream: 3 0 5 3 0 1 7 5 1 0 3 7 (number of distinct values: 5)
x = 5, h(x) = 101100, lsb(h(x)) = 2, so bit 2 of the BITMAP (positions 5 4 3 2 1 0) is set to 1.
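
A compact sketch of a single FM bitmap, under stated assumptions: a 32-bit hash derived from SHA-256 stands in for the uniform hash h, and the estimate uses the usual correction constant 0.77351 from the Flajolet-Martin analysis. A single bitmap gives only a rough estimate; practical uses average several. All names are illustrative.

```python
import hashlib

L = 32  # number of hash bits (assumption for this sketch)
PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def h(x):
    """Deterministic stand-in for a uniform hash onto [0, 2^L - 1]."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def lsb(y):
    """Position of the least-significant 1 bit of y (L - 1 if y is 0)."""
    if y == 0:
        return L - 1
    return (y & -y).bit_length() - 1


def fm_bitmap(values):
    """Set bit lsb(h(x)) for every incoming value x."""
    bitmap = 0
    for x in values:
        bitmap |= 1 << lsb(h(x))
    return bitmap


def fm_estimate(bitmap):
    """Estimate the distinct count from the position of the lowest 0 bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return 2 ** r / PHI


stream = [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]
print(lsb(0b101100))  # 2, as in the slide's example for h(x) = 101100
print(fm_estimate(fm_bitmap(stream)))
```

Because duplicates set the same bit again, the bitmap depends only on the set of distinct values, which is why it can summarize an inverted list without storing it.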

Estimating the filtering power of a signature scheme
Constructing a two-dimensional hash sketch
Computing tighter upper and lower bounds of the candidate size

String Similarity Filtering with Length Filter
Steps: Generate signatures, compute lengths, filtering candidates, verify candidates
Filters: Prefix method, LSH method, Length filter
Similarity measures: Full expansion, Selective expansion

String Similarity Joins (SI-Join)
Length filtering
Strings: S1=“a b c d e”   S2=“x y z”
Synonyms: a->f g h   x->s   b->k
Full-expansion: S1’=“a b c d e f g h k”   S2’=“x y z s”
Length range: s1: [5, 9]   s2: [3, 4]
The similarity is at most 4/5 = 0.8, so Jacc(s1,s2,R)<0.9
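
The length filter can be sketched as follows. The sketch assumes single-token left-hand sides (as in the slide's example), takes the unexpanded token count as the minimum length and the full-expansion token count as the maximum, and relies on the general bound Jaccard(A,B) <= min(|A|,|B|) / max(|A|,|B|). Names are hypothetical.

```python
def length_range(s, rules):
    """(min, max) token-set size of s: unexpanded vs. fully expanded."""
    toks = s.split()
    expanded = set(toks)
    for lhs, rhs in rules.items():
        if lhs in toks:
            expanded.update(rhs.split())
    return len(set(toks)), len(expanded)


def jaccard_upper_bound(range1, range2):
    """Largest Jaccard value any two sets with sizes in these ranges can reach."""
    (lo1, hi1), (lo2, hi2) = range1, range2
    if hi1 < lo2:
        return hi1 / lo2
    if hi2 < lo1:
        return hi2 / lo1
    return 1.0  # ranges overlap, so lengths alone cannot prune


RULES = {"a": "f g h", "x": "s", "b": "k"}
r1 = length_range("a b c d e", RULES)
r2 = length_range("x y z", RULES)
bound = jaccard_upper_bound(r1, r2)
print(r1, r2, bound, bound < 0.9)
```

Since the bound of 0.8 is below the threshold 0.9, the pair is discarded without computing any expansion-based similarity.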

String Similarity Joins (SI-Join)
Steps: Generate signatures, compute lengths, filtering candidates, verify candidates
Filters: Prefix/LSH method, Length filter
Similarity measures: Full expansion, Selective expansion

String Similarity Joins (SI-tree)

Data sets and algorithms
Compared method: JaccT [Arasu et al. ICDE 2008]
Three datasets:
Data   # of strings  String Len (avg/max)  # of Synonyms  # of applied synonyms (avg/max)
USPS   1M            6.75/15               300            2.19/5
CONF   10K           5.84/14               1000           1.43/4
SPROT                10.32/20                             37.78/104

String Similarity Measures
Effectiveness of different similarity measurements
Selective-expansion (SE) achieves the best effectiveness.

String Similarity Joins
Efficiency of algorithms (S: selective expansion, F: full expansion)
SI-Join achieves the best performance.

Prefix scheme vs. LSH scheme
Figure labels: “Prefix is better”, “LSH is better”

Estimation effectiveness

Conclusion and future work
String similarity measure with synonyms
Two new measures and a new join algorithm
One estimator for signature selection
Future work: how to deal with synonym ambiguity
E.g. UW = University of Washington OR UW = University of Waterloo?
