Methods for High Degrees of Similarity: Index-Based Methods, Exploiting Prefixes and Suffixes, Exploiting Length


1 Methods for High Degrees of Similarity
- Index-Based Methods
- Exploiting Prefixes and Suffixes
- Exploiting Length

2 Overview
- LSH-based methods are excellent for similarity thresholds that are not too high.
  - Possibly up to 80% or 90%.
- But for similarities above that, there are other methods that are more efficient.
  - And they also give exact answers.

3 Setting: Sets as Strings
- We'll again talk about Jaccard similarity and distance of sets.
- However, we now represent sets by strings (lists of symbols):
  1. Enumerate the universal set.
  2. Represent a set by the string of its elements in sorted order.

4 Example: Shingles
- If the universal set is k-shingles, there is a natural lexicographic order.
- Think of each shingle as a single symbol.
- Then the 2-shingling of abcad, which is the set {ab, bc, ca, ad}, is represented by the list ab, ad, bc, ca of length 4.
- Alternative: hash shingles; order by bucket number.

5 Example: Words
- If we treat a document as a set of words, we could order the words alphabetically.
- Better: order words lowest-frequency-first.
- Why? We shall index documents based on the early words in their lists.
  - Documents spread over more buckets.
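
As a rough illustration of the lowest-frequency-first ordering, the sketch below counts how many documents contain each word and sorts each document's words by that count. The function name `order_by_rarity` and the alphabetical tie-break are assumptions made for this example; the slides do not prescribe either.

```python
from collections import Counter

def order_by_rarity(docs):
    """Turn each document (a set of words) into a list, rarest words first.

    Frequency here means the number of documents containing the word.
    Ties are broken alphabetically (an assumption of this sketch) so that
    every document uses the same global order.
    """
    freq = Counter(word for doc in docs for word in doc)
    return [sorted(doc, key=lambda w: (freq[w], w)) for doc in docs]

docs = [{"the", "cat"}, {"the", "dog"}, {"the", "cat", "zebra"}]
strings = order_by_rarity(docs)
# The common word "the" ends up last, so it never lands in a short prefix.
```

Because rare words come first, the prefixes that get indexed later consist of symbols that few documents share, which is what spreads documents over more buckets.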

6 Jaccard and Edit Distances
- Suppose two sets have Jaccard distance J and are represented by strings s1 and s2.
- Let the LCS (longest common subsequence) of s1 and s2 have length C, and let the edit distance of s1 and s2, counting insertions and deletions only, be E. Then:
  - 1 - J = Jaccard similarity = C/(C+E).
  - J = E/(C+E).
- This works because these strings never repeat a symbol, and symbols appear in the same order.
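
The relation between C, E, and Jaccard similarity can be checked numerically. The sketch below is an illustration only: the helper names are made up for this example, and it assumes the edit distance counts insertions and deletions only, so that E = |s1| + |s2| - 2C.

```python
from fractions import Fraction

def lcs_length(s, t):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(t) + 1)
    for x in s:
        cur = [0]
        for j, y in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def jaccard_similarity_via_lcs(s, t):
    """C / (C + E) for strings with no repeated symbols, in a common sorted order."""
    c = lcs_length(s, t)
    e = len(s) + len(t) - 2 * c      # insert/delete-only edit distance
    if c + e == 0:                   # both strings empty
        return Fraction(1)
    return Fraction(c, c + e)

s1, s2 = "abcd", "bcde"              # the sets {a,b,c,d} and {b,c,d,e}
sim = jaccard_similarity_via_lcs(s1, s2)
direct = Fraction(len(set(s1) & set(s2)), len(set(s1) | set(s2)))
```

Here C = 3 (the subsequence bcd) and E = 2, and both computations give 3/5.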

7 Indexes
- The general approach is to build some indexes on the set of strings.
- Then, visit each string once and use the index to find possible candidates for similarity.
- For thought: how does this approach compare with bucketizing and looking within buckets for similarity?

8 Length-Based Indexes
- The simplest thing to do is create an index on the length of strings.
- A string of length L can be within Jaccard distance J of a string of length M only if L(1-J) ≤ M ≤ L/(1-J).
- Example: if 1-J = 90% (Jaccard similarity), then M is between 90% and 111% of L.

9 Why the Limit on Lengths?
- A shortest candidate (M ≤ L): at best the string of length M is contained in the string of length L, so 1-J = M/L, giving M = L(1-J).
- A longest candidate (M ≥ L): at best the string of length L is contained in the string of length M, so 1-J = L/M, giving M = L/(1-J).

10 B-Tree Indexes
- The B-tree is a perfect index structure for a length-based index.
- Given a string of length L, we can find strings in the range L(1-J) to L/(1-J) without looking at any candidates outside that range.
- But just because strings are similar in length doesn't mean they are similar.
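
A minimal sketch of the length filter follows. It uses a sorted Python list with `bisect` as an in-memory stand-in for the B-tree (an assumption of this example, not what the slides propose for disk-resident data), and it takes J as a `Fraction` so the range endpoints are computed exactly.

```python
import bisect
import math
from fractions import Fraction

def length_candidates(sorted_lengths, L, J):
    """Return the stored lengths M with L(1-J) <= M <= L/(1-J).

    sorted_lengths: ascending list of string lengths (stand-in for a B-tree).
    J: limit on Jaccard distance, 0 <= J < 1, ideally a Fraction.
    """
    lo = math.ceil(L * (1 - J))       # smallest integer length allowed
    hi = math.floor(L / (1 - J))      # largest integer length allowed
    start = bisect.bisect_left(sorted_lengths, lo)
    stop = bisect.bisect_right(sorted_lengths, hi)
    return sorted_lengths[start:stop]

lengths = [5, 8, 9, 10, 11, 12, 15]
near = length_candidates(lengths, 10, Fraction(1, 10))
```

With L = 10 and J = 1/10 the admissible range is 9 to 11, matching the "90% to 111%" example. Strings passing this filter are only candidates and still need a real similarity check.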

11 Aside: B-Trees
- If you didn't take CS245 yet, a B-tree is a generalization of a binary search tree, where each node has many children, and each child leads to a segment of the range of values handled by its parent.
- Typically, a node is a disk block.

12 Aside: B-Trees (2)
[Figure: a B-tree node, reached by a pointer from its parent, holding the keys 50, 80, 145, 190, 225. The pointers between keys lead to values < 50; to values > 50 and < 80; to values > 80 and < 145; etc.]

13 Prefix-Based Indexing
- If two strings are 90% similar, they must share some symbol in their prefixes, where each prefix has length just above 10% of its string's length.
- Thus, we can index symbols in just the first ⌊JL+1⌋ positions of a string of length L.

14 Why the Limit on Prefixes?
- Consider a string of length L and its first E symbols.
- If a second string contains none of those first E symbols, then the Jaccard distance between the two strings is at least E/L.
- Extreme case: the second string has none of the first E symbols of the first string, but the two agree thereafter. The distance is then exactly E/L.
- Thus, E = JL is possible for a pair within distance J, but any larger E is impossible. So index one more position: the first ⌊JL+1⌋ positions.

15 Indexing Prefixes
- Think of a bucket for each possible symbol.
- Each string of length L is placed in the bucket for each of its first ⌊JL+1⌋ positions.
- A B-tree with symbol as key leads to pointers to the strings.

16 Lookup
- Given a probe string s of length L, with J the limit on Jaccard distance:
    for (each symbol a among the first ⌊JL+1⌋ positions of s)
        look for other strings in the bucket for a;

17 Example: Indexing Prefixes
- Let J = 0.2.
- String abcdef is indexed under a and b.
- String acdfg is indexed under a and c.
- String bcde is indexed only under b.
- If we search for strings similar to cdef, we need look only in the bucket for c.
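
The example above can be reproduced with a small in-memory sketch. A dictionary of sets stands in for the symbol-keyed B-tree, and the function names are invented for this illustration. J is passed as a `Fraction` so that ⌊JL+1⌋ is computed without floating-point rounding.

```python
import math
from collections import defaultdict
from fractions import Fraction

def prefix_length(L, J):
    """Number of leading positions to index: floor(J*L + 1)."""
    return math.floor(J * L + 1)

def build_prefix_index(strings, J):
    """Map each symbol to the strings that have it in their prefix."""
    index = defaultdict(set)
    for s in strings:
        for a in s[:prefix_length(len(s), J)]:
            index[a].add(s)
    return index

def prefix_candidates(index, probe, J):
    """Union of the buckets for the symbols in the probe's prefix."""
    found = set()
    for a in probe[:prefix_length(len(probe), J)]:
        found |= index.get(a, set())
    found.discard(probe)             # a string is not its own candidate
    return found

J = Fraction(1, 5)
index = build_prefix_index(["abcdef", "acdfg", "bcde"], J)
candidates = prefix_candidates(index, "cdef", J)
```

The probe cdef has a prefix of length 1, so only the bucket for c is read, and it returns acdfg. That string is a candidate only: its actual Jaccard similarity to cdef is 3/6, so a final check would reject it.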

18 Using Positions Within Prefixes
- If position i of string s is the first position to match a prefix position of string t, and it matches position j, then the edit distance between s and t is at least i + j - 2.
- The LCS of s and t is no longer than L - i + 1, where L is the length of s.

19 Positions/Prefixes (2)
- If J is the limit on Jaccard distance, then remember E/(E+C) ≤ J.
  - E = i + j - 2.
  - C = L - i + 1.
- Thus, (i + j - 2)/(L + j - 1) ≤ J.
- Or, j ≤ (JL - J - i + 2)/(1 - J).

20 Positions/Prefixes (3)
- We only need to find a candidate once, so we may as well:
  1. Visit positions of our given string in numerical order, and
  2. Assume that there have been no matches for earlier positions.

21 Positions/Prefixes: Indexing
- Create a 2-attribute index on (symbol, position).
- If string s has symbol a as the i-th position of its prefix, add s to the bucket (a, i).
- A B-tree index with keys ordered first by symbol, then by position, is excellent.

22 Lookup
- If we want to find matches for probe string s of length L, do:
    for (i=1; i<=J*L+1; i++) {
        let s have a in position i;
        for (j=1; j<=(J*L-J-i+2)/(1-J); j++)
            compare s with strings in bucket (a, j);
    }

23 Example: Lookup
- Suppose J = 0.2.
- Given probe string adegjkmprz, L = 10 and the prefix is ade.
- For the i-th position of the prefix, we must look at buckets where j ≤ (JL - J - i + 2)/(1 - J) = (3.8 - i)/0.8.
- For i = 1, j ≤ 3.5, so j ≤ 3; for i = 2, j ≤ 2.25, so j ≤ 2; and for i = 3, j ≤ 1.

24 Example: Lookup (2)
- Thus, for probe adegjkmprz we look in the following buckets: (a, 1), (a, 2), (a, 3), (d, 1), (d, 2), (e, 1).
- Suppose string t is in (d, 3). Either:
  - We saw t, because a is in position 1 or 2 of t, or
  - The edit distance is at least 3 and the length of the LCS is at most 9 (thus the Jaccard distance is at least 3/12 = 1/4, which exceeds J).
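
The bucket list in this example can be generated mechanically. The sketch below follows the nested loops of the lookup slide, treating J as an inclusive limit; the function name `probe_buckets` is an assumption of this illustration, and J is a `Fraction` so that boundary cases such as (3.8 - 3)/0.8 = 1 are evaluated exactly.

```python
import math
from fractions import Fraction

def probe_buckets(s, J):
    """List the (symbol, position) buckets to read for probe string s.

    i runs over the probe's prefix positions 1..floor(J*L + 1);
    j runs over 1..floor((J*L - J - i + 2) / (1 - J)).
    """
    L = len(s)
    buckets = []
    for i in range(1, math.floor(J * L + 1) + 1):
        max_j = math.floor((J * L - J - i + 2) / (1 - J))
        for j in range(1, max_j + 1):
            buckets.append((s[i - 1], j))
    return buckets

buckets = probe_buckets("adegjkmprz", Fraction(1, 5))
```

The inner bound shrinks as i grows, which is the triangular loop pattern: later prefix positions of the probe are compared against fewer positions in the indexed strings.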

25 We Win Two Ways
1. Triangular nested loops let us look at only half the possible buckets.
2. Strings that are much longer than the probe string, but whose prefixes have a symbol far from the beginning that also appears in the prefix of the probe string, are not considered at all.

26 Adding Length to the Mix
- We can index on three attributes:
  1. Character at a prefix position.
  2. Number of that position.
  3. Length of the suffix = number of positions in the entire string to the right of the given position.

27 Edit Distance
- Suppose we are given probe string s, and we find string t because its j-th position matches the i-th position of s.
- A lower bound on edit distance E is the sum of:
  1. i + j - 2, and
  2. The absolute difference of the lengths of the suffixes of s and t (what follows positions i and j, respectively).

28 Longest Common Subsequence
- Suppose we are given probe string s, and we find string t first because its j-th position matches the i-th position of s.
- If the suffixes of s and t have lengths k and m, respectively, then an upper bound on the length C of the LCS is 1 + min(k, m).

29 Bound on Jaccard Distance
- If J is the limit on Jaccard distance, then E/(E+C) ≤ J becomes:
  - i + j - 2 + |k - m| ≤ J(i + j - 2 + |k - m| + 1 + min(k, m)).
- Thus: j + |k - m| ≤ (J(i - 1 + min(k, m)) - i + 2)/(1 - J).
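
For readers who want the algebra behind the last step, it can be written out as follows. This is a sketch that treats J as an inclusive limit with 0 ≤ J < 1, and takes E = i + j - 2 + |k - m| and C = 1 + min(k, m) from the two preceding slides.

```latex
\begin{align*}
\frac{E}{E + C} \le J
  &\iff E \le J\,(E + C) \\
  &\iff (1 - J)\,E \le J\,C \\
  &\iff (1 - J)\bigl(i + j - 2 + |k - m|\bigr) \le J\bigl(1 + \min(k, m)\bigr) \\
  &\iff (1 - J)\bigl(j + |k - m|\bigr) \le J\bigl(1 + \min(k, m)\bigr) - (1 - J)(i - 2) \\
  &\iff j + |k - m| \le \frac{J\bigl(i - 1 + \min(k, m)\bigr) - i + 2}{1 - J}.
\end{align*}
```

The last line uses J(1 + min(k, m)) - (1 - J)(i - 2) = J(i - 1 + min(k, m)) - i + 2.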

30 Positions/Prefixes/Suffixes: Indexing
- Create a 3-attribute index on (symbol, position, suffix-length).
- If string s has symbol a as the i-th position of its prefix, and the length of the suffix relative to that position is k, add s to the bucket (a, i, k).

31 Example: Indexing
- Consider string abcde with J = 0.2.
- Prefix length = 2.
- Index in: (a, 1, 4) and (b, 2, 3).
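
The index keys for a string can be listed directly from the definition. In this sketch the function name `index_entries` is hypothetical and J is a `Fraction` to keep ⌊JL+1⌋ exact.

```python
import math
from fractions import Fraction

def index_entries(s, J):
    """Return the (symbol, position, suffix-length) keys under which s is indexed.

    Position i is 1-based; the suffix length is the number of symbols
    to the right of position i, that is, len(s) - i.
    """
    L = len(s)
    prefix_len = math.floor(J * L + 1)
    return [(s[i - 1], i, L - i) for i in range(1, prefix_len + 1)]

entries = index_entries("abcde", Fraction(1, 5))
```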

32 Lookup
- As for the previous case, to find candidate matches for a probe string s of length L, with limit J on Jaccard distance, visit the positions of s's prefix in order.
- If position i has symbol a and suffix length k, look in index bucket (a, j, m) for all j and m such that j + |k - m| ≤ (J(i - 1 + min(k, m)) - i + 2)/(1 - J).

33 Example: Lookup
- Consider probe string abcde with J = 0.2.
- Require: j + |k - m| ≤ (J(i - 1 + min(k, m)) - i + 2)/(1 - J).
- For i = 1, note k = 4. We want j + |4 - m| ≤ (0.2 min(4, m) + 1)/0.8.
  - Look in (a, 1, 3), (a, 1, 4), (a, 1, 5), (a, 2, 4).
- For i = 2, note k = 3. We want j + |3 - m| ≤ 0.2(1 + min(3, m))/0.8.
  - Look in (b, 1, 3).
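
The five buckets in this example can be enumerated by testing the bound for each (j, m) pair. The sketch below is one straightforward way to do so, not necessarily how the slides intend it to be implemented: the name `probe_buckets_3` and the way the search range for m is capped are assumptions of this illustration, and J is treated as an inclusive limit.

```python
import math
from fractions import Fraction

def probe_buckets_3(s, J):
    """List the (symbol, position, suffix-length) buckets to read for probe s."""
    L = len(s)
    buckets = []
    for i in range(1, math.floor(J * L + 1) + 1):
        a, k = s[i - 1], L - i
        # The right-hand side is largest when min(k, m) = k, so neither j
        # nor |k - m| can exceed this cap.
        cap = math.floor((J * (i - 1 + k) - i + 2) / (1 - J))
        for m in range(max(0, k - cap), k + cap + 1):
            rhs = (J * (i - 1 + min(k, m)) - i + 2) / (1 - J)
            for j in range(1, cap + 1):
                if j + abs(k - m) <= rhs:
                    buckets.append((a, j, m))
    return buckets

buckets3 = set(probe_buckets_3("abcde", Fraction(1, 5)))
```

For i = 1 the admissible buckets form a small diamond around m = k = 4, and for i = 2 only m = 3 with j = 1 survives.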

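34 Pattern of Search
[Figure: buckets searched for i = 1, plotted by position against length of suffix, with the probe's suffix length k marked.]

35 Pattern of Search
[Figure: buckets searched for i = 2, plotted by position against length of suffix, with the probe's suffix length k marked.]

36 Pattern of Search
[Figure: buckets searched for i = 3, plotted by position against length of suffix, with the probe's suffix length k marked.]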

37 Physical-Index Issues
- A B-tree on (symbol, position, length) isn't perfect.
  - For a given symbol and position, you only want some of the suffix lengths.
  - Similar problem for any order of the attributes.
- Several two-dimensional index structures might work better.