Large-scale Similarity Join with Edit-distance Constraints ---BY Yu Haiyang 1/30.

Outline  Background  The introduction of Pass-Join-K  Combining Pass-Join-K with Hadoop 2/30

Background: similarity join. Find all similar pairs from two sets. Applications include data cleaning, query relaxation, and spellchecking. 3/30

Background: how to define similarity? Jaccard distance (bag-of-words model), cosine distance, or edit distance. 4/30

Background: edit distance. The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another. Examples: Baby → Body (substitution); Bod → Body (insertion). 5/30

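Background: how does edit distance compare with the other two? Accuracy: the pair {"abcdefg", "gfedcba"} contains exactly the same characters, so a bag-of-words measure rates the two strings as identical, while edit distance does not. Verification time: rises from O(m+n) to O(mn). 6/30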
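The accuracy point can be checked with a small sketch. It assumes the bag-of-words tokens are single characters, which the slide does not state, and `char_jaccard` is a name chosen here for illustration.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


first, second = "abcdefg", "gfedcba"
# Both strings use the same seven characters, so the set-based score is maximal.
print(char_jaccard(first, second))
# Yet the second string is the first one reversed, so the order differs almost everywhere.
print(second == first[::-1])
```

A set-based measure ignores character order entirely, which is why an order-sensitive measure such as edit distance is the more accurate choice for this pair.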

Background: find similar pairs. We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}. Find some candidate pairs, then verify each one (yes or no). 7/30

Background: so we have to (1) find candidate pairs, of which there are O(N^2) if we do not prune any, and (2) verify these pairs, at O(mn) per pair. 8/30

Outline  Background  The introduction of Pass-Join-K  Combining Pass-Join-K with Hadoop 9/30

Introduction of Pass-Join-K: partition-based pruning technique. We suppose the threshold tau = 2 and K = 1, and we have a pair 10/30

Introduction of Pass-Join-K: partition-based pruning technique. We suppose the threshold tau = 2 and K = 2, and we have a pair 11/30

Introduction of Pass-Join-K: some obvious pruning techniques. Length-based: with threshold = 2, two strings whose lengths differ by more than 2 cannot be similar. Shift-based: abcd, cdef. 12/30
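The length-based rule follows from the edit operations themselves: an insertion or deletion changes the length by one and a substitution leaves it unchanged, so tau operations can shift the length by at most tau. A minimal sketch, with a hypothetical function name and sample strings taken from the background slides:

```python
from itertools import product


def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Return True when the pair cannot be ruled out by length alone."""
    return abs(len(r) - len(s)) <= tau


left = ["vldb", "sigmod"]
right = ["pvldb", "icde", "transactions"]
tau = 2

# Only pairs that survive the filter need the expensive O(mn) verification.
survivors = [(r, s) for r, s in product(left, right)
             if passes_length_filter(r, s, tau)]
print(survivors)
```

The filter never discards a truly similar pair; it only removes pairs that are guaranteed to exceed the threshold.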

Introduction of Pass-Join-K: partition scheme. We have seen that the longer the substrings are, the harder they are to match. So we break the string into tau+k parts, each of length ⌊length/(tau+k)⌋ or ⌊length/(tau+k)⌋+1. 13/30
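The even partition can be sketched as below. Placing the longer parts first is an assumption; it is chosen because it reproduces the segments abc, def, ghi, jk listed for "abcdefghijk" on the inverted-index slide. The function name `partition` is illustrative.

```python
def partition(r: str, num_segments: int) -> list[str]:
    """Split r into num_segments consecutive parts of near-equal length.

    The first len(r) % num_segments parts get one extra character.
    If len(r) < num_segments, the trailing parts are empty strings.
    """
    base, extra = divmod(len(r), num_segments)
    segments = []
    start = 0
    for i in range(num_segments):
        length = base + 1 if i < extra else base
        segments.append(r[start:start + length])
        start += length
    return segments


tau, k = 3, 1
print(partition("abcdefghijk", tau + k))
```

Because every part has length ⌊length/(tau+k)⌋ or one more, the layout of the parts depends only on the string length, so all strings of the same length share the same segment boundaries.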

Introduction of Pass-Join-K  Partition Scheme 14/30

Introduction of Pass-Join-K: substring selection. Here we suppose tau = 3 and k = 1, with the strings abcdefghijk and abdefghk. 15/30

Introduction of Pass-Join-K: substring selection. So what we do is reduce the number of substrings. For more pruning techniques, please read our paper: 《Pass-Join-K 多分段匹配的相似性连接算法》 ("Pass-Join-K: A Similarity Join Algorithm with Multi-Segment Matching"). 20/30

Introduction of Pass-Join-K: verification by dynamic programming (DP). $D(m,n) = \min\bigl(D(m,n-1)+1,\ D(m-1,n)+1,\ D(m-1,n-1)+\mathit{flag}\bigr)$, where $\mathit{flag} = 0$ when $s_m = r_n$ and $\mathit{flag} = 1$ otherwise; $s$ and $r$ are both strings. 21/30
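The recurrence is the standard edit-distance DP, which takes the minimum over the three moves and charges the diagonal move only when the two characters differ. A minimal sketch that keeps two rows of the table; the function name is illustrative, and the early-termination tricks a production join would use are left out:

```python
def edit_distance(s: str, r: str) -> int:
    """Edit distance between s and r using the standard DP recurrence."""
    m, n = len(s), len(r)
    # prev[j] holds D(i-1, j); row 0 is the cost of inserting j characters.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        # curr[0] is the cost of deleting the first i characters of s.
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1,     # insertion
                          prev[j] + 1,         # deletion
                          prev[j - 1] + flag)  # match or substitution
        prev = curr
    return prev[n]


tau = 3
# A candidate pair is accepted when its distance is within the threshold.
print(edit_distance("abcdefghijk", "abdefghk") <= tau)
```

Filling the table takes O(mn) time per pair, which is why reducing the number of candidates before this step matters.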

Introduction of Pass-Join-K: verification. Here we suppose tau = 3 and k = 1, with the strings abcdefghijk and abdefghk and the segment def. Tau left = 3; Tau right = 3 - 3 = 0. 22/30

Outline  Background  The introduction of Pass-Join-K  Combining Pass-Join-K with Hadoop 23/30

Combining Pass-Join-K with Hadoop: big data means big files and a large number of files. 24/30

Combining Pass-Join-K with Hadoop: inverted index tree in Hadoop. The string r = abcdefghijk (length 11) yields the records (abc,1,11,r,IFlag), (def,2,11,r,IFlag), (ghi,3,11,r,IFlag), (jk,4,11,r,IFlag). 25/30

Combining Pass-Join-K with Hadoop: substrings in Hadoop. Suppose tau = 3, k = 1, and s = "abdefghk", so length(s) = 8. We have to generate records such as (a,1,5,s,SFlag), (a,2,6,s,SFlag), (a,3,7,s,SFlag), (ab,1,8,s,SFlag), …, (ab,1,11,s,SFlag), … 26/30
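The way index records and substring records meet can be sketched in memory. Everything below is a simplified illustration under stated assumptions, not the authors' Hadoop job: a dictionary stands in for the shuffle phase, the substring windows are a naive "start position ± tau" rule rather than the paper's selection rules (so the emitted records will not match the slide's list exactly), and K is read as the number of segments that must match. That reading rests on a pigeonhole argument: tau edits can alter at most tau of the tau+k segments, so at least k survive intact. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def partition(r, num_segments):
    """Return (start, segment) pairs; the first len(r) % num_segments are longer."""
    base, extra = divmod(len(r), num_segments)
    pieces, start = [], 0
    for i in range(num_segments):
        length = base + (1 if i < extra else 0)
        pieces.append((start, r[start:start + length]))
        start += length
    return pieces


def index_records(rid, r, tau, k):
    """Map step for an indexed string: one IFlag record per segment."""
    for seg_no, (_, seg) in enumerate(partition(r, tau + k), start=1):
        yield (seg, seg_no, len(r)), ("IFlag", rid)


def probe_records(sid, s, tau, k):
    """Map step for a probe string: SFlag records for substrings that could
    equal a segment of some string whose length is within tau of len(s).
    Strings shorter than tau + k would need separate handling."""
    for length in range(max(tau + k, len(s) - tau), len(s) + tau + 1):
        # The segment layout depends only on the length, so use a placeholder.
        layout = partition("x" * length, tau + k)
        for seg_no, (start, seg) in enumerate(layout, start=1):
            lo = max(0, start - tau)
            hi = min(len(s) - len(seg), start + tau)
            for pos in range(lo, hi + 1):
                yield (s[pos:pos + len(seg)], seg_no, length), ("SFlag", sid)


def candidates(indexed, probes, tau, k):
    """Group records by key (the shuffle), then pair IFlag with SFlag ids."""
    groups = defaultdict(lambda: (set(), set()))  # key -> (index ids, probe ids)
    for rid, r in indexed.items():
        for key, (_, ident) in index_records(rid, r, tau, k):
            groups[key][0].add(ident)
    for sid, s in probes.items():
        for key, (_, ident) in probe_records(sid, s, tau, k):
            groups[key][1].add(ident)
    matched = defaultdict(set)  # (rid, sid) -> matched segment numbers
    for (_, seg_no, _), (rids, sids) in groups.items():
        for rid, sid in product(rids, sids):
            matched[(rid, sid)].add(seg_no)
    return {pair for pair, segs in matched.items() if len(segs) >= k}


indexed = {"r1": "abcdefghijk", "r2": "zzzzzzzzzzz"}
probes = {"s1": "abdefghk"}
print(candidates(indexed, probes, tau=3, k=1))
```

Each surviving pair is only a candidate and still has to be verified with the edit-distance DP. Keying the records on the segment string, segment number, and string length means records can only meet when all three agree, which is what lets the grouping step do the pruning.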

Combining Pass-Join-K with Hadoop  Data flows in hadoop 27/30

Combining Pass-Join-K with Hadoop: record format. [segmentString, segmentNumber, stringLength, FLAG], [DirNumber, ID]
