Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore VLDB’2005 * Liang Jin and.


Indexing Mixed Types for Approximate Retrieval
Liang Jin* (UC Irvine), Nick Koudas (University of Toronto), Chen Li* (UC Irvine), Anthony K.H. Tung (National University of Singapore)
VLDB’2005
* Liang Jin and Chen Li: supported by NSF CAREER Award IIS

2 Queries with Mixed-Type Predicates
Star            | Title                                        | Year | Genre
Keanu Reeves    | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson  | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger  | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson  | Goodfellas                                   | 1990 | Drama
…               | …                                            | …    | …
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
SIMILARTO:
– a domain-specific function
– returns a similarity value between two strings
Example: edit distance ed(Tom Hanks, Ton Hank) = 2
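
The edit-distance example on this slide can be checked with a short sketch. It assumes the standard Levenshtein definition (unit-cost insertions, deletions, and substitutions); the function name `edit_distance` is illustrative and not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m with n, delete s
```

The two-row formulation keeps memory proportional to the length of one string, which is enough when only the distance (not the alignment) is needed.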

3 Why fuzzy predicates?
Errors in queries:
– User doesn’t remember a string exactly
– User types a wrong string
Errors in databases:
– Data is not clean
– Especially true in data integration and cleansing
Relation R (Star): Samuel Jackson, …, Schwarzenegger, Samuel Jackson, Keanu Reeves
Relation S (Star): Samuel L. Jackson, …, Schwarzenegger, Samuel L. Jackson, Keanu Reeves

4 Problem Formulation
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
Given: a query on a single relation with fuzzy predicates on strings and range predicates on numeric attributes
Goal: answer the query efficiently

5 Rest of the talk
– Motivation: supporting queries with mixed-type predicates
– Our approach: MAT tree
– Construction and maintenance of MAT tree
– Experiments

6 Assumptions
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
– One fuzzy string predicate (edit distance)
– One numeric predicate
Query: (Qs, δs, Qn, δn), for example ('Schwarrzenger', 2, 1980, 5)

7 Intuition of MAT (Mixed-Attribute-Type) Tree
“2 > 1 + 1”: one integrated indexing structure is better than two independent indexing structures on two attributes
– Indexing numeric attributes: B-tree or R-tree
– Indexing strings as a tree to support fuzzy predicates?
– MAT tree

8 Answering a query (Qs, δs, Qn, δn)
Traverse the MAT-tree top-down. At each node, prune by checking:
– whether [Qn - δn, Qn + δn] overlaps the node’s numeric range
– whether minEditDistance(Qs, Tn) <= δs, where Tn is the node’s trie
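
The two pruning checks can be sketched as a recursive search. This is a minimal illustration under stated assumptions, not the MAT-tree implementation: each node keeps a plain set of the strings below it as a stand-in for the compressed trie Tn, so `minEditDistance` is computed by brute force, and the names `Node`, `make_leaf`, `make_inner`, and `search` are hypothetical.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int            # smallest numeric value in the subtree
    hi: int            # largest numeric value in the subtree
    strings: set       # ASSUMPTION: uncompressed stand-in for the trie Tn
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # (string, number), leaves only


def make_leaf(records):
    numbers = [n for _, n in records]
    return Node(min(numbers), max(numbers), {s for s, _ in records},
                records=list(records))


def make_inner(children):
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                set().union(*(c.strings for c in children)),
                children=list(children))


def search(node, qs, ds, qn, dn):
    """Return all (string, number) records matching the query (qs, ds, qn, dn)."""
    # Numeric pruning: [qn - dn, qn + dn] must overlap [lo, hi].
    if qn + dn < node.lo or qn - dn > node.hi:
        return []
    # String pruning: some string below this node must be within ds of qs.
    if min((edit_distance(qs, s) for s in node.strings), default=float("inf")) > ds:
        return []
    if not node.children:  # leaf: verify each record exactly
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    results = []
    for child in node.children:
        results.extend(search(child, qs, ds, qn, dn))
    return results


root = make_inner([
    make_leaf([("Keanu Reeves", 1999), ("Samuel Jackson", 2005)]),
    make_leaf([("Schwarzenegger", 1984), ("Samuel Jackson", 1990)]),
])
# Hypothetical query string, one deletion away from "Schwarzenegger".
print(search(root, "Schwarzeneger", 2, 1980, 5))
```

In this toy tree the first leaf is skipped by the numeric check alone (its range starts at 1999), which is the benefit the integrated structure aims for: either predicate can prune a subtree before the other is evaluated on its records.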

9 Challenge
How to represent strings so that they fit into a limited space (disk based) and still support fuzzy-predicate pruning

10 Existing Approaches to Indexing Strings as Trees
M-tree:
– Edit distance: metric space
Q-tree:
– Utilizes the q-gram property of strings
– See our paper for details

11 Representing strings as a trie
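
As a reminder of the data structure, here is a minimal trie sketch. The class and function names (`TrieNode`, `build_trie`, `accepted_strings`, `count_nodes`) are illustrative and the sample strings are made up; nothing here is specific to the MAT-tree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored string ends at this node


def build_trie(strings):
    """Insert every string character by character, sharing common prefixes."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def accepted_strings(node, prefix=""):
    """List the strings stored in the trie, in lexicographic order."""
    out = [prefix] if node.is_end else []
    for ch in sorted(node.children):
        out.extend(accepted_strings(node.children[ch], prefix + ch))
    return out


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


trie = build_trie(["tom", "ton", "tony"])
print(accepted_strings(trie))  # ['tom', 'ton', 'tony']
print(count_nodes(trie))       # 6: root, t, o, m, n, y
```

Shared prefixes are stored once, so the node count (6 here) is smaller than the total number of characters (10), but it still grows with the data, which is why the following slides compress the trie.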

12 Compressing a trie
Select k representative nodes (centers). Each center is in the format of.
A compressed trie accepts more strings than the original trie.

13 Minimum edit distance between a string and a trie
minEditDistance(Qs, Tn)?
– Convert the trie to an automaton.
– Compute the minimum distance between a string and an automaton [Myers and Miller, 1989].
– Early termination is possible.
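
For an uncompressed trie, the minimum distance can be sketched with a simpler technique than the automaton algorithm cited on the slide: carry one dynamic-programming row per trie node during a depth-first walk. The sketch below uses that row-per-node method, not the Myers and Miller algorithm, and its names (`make_trie`, `min_edit_distance`) are assumptions. Early termination relies on the fact that the smallest entry of a row never decreases as the walk goes deeper.

```python
def make_trie(strings):
    """Build a trie out of nested dicts: {"end": bool, "children": {...}}."""
    root = {"end": False, "children": {}}
    for s in strings:
        node = root
        for ch in s:
            node = node["children"].setdefault(ch, {"end": False, "children": {}})
        node["end"] = True
    return root


def min_edit_distance(query, trie):
    """Smallest edit distance between query and any string stored in trie."""
    best = [float("inf")]

    def visit(node, row):
        # row[j] = edit distance between the path to this node and query[:j]
        if node["end"]:
            best[0] = min(best[0], row[-1])
        # Early termination: no descendant can score below min(row).
        if min(row) >= best[0]:
            return
        for ch, child in node["children"].items():
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                cost = 0 if qc == ch else 1
                new_row.append(min(row[j] + 1,          # skip the trie character
                                   new_row[j - 1] + 1,  # skip the query character
                                   row[j - 1] + cost))  # match or substitute
            visit(child, new_row)

    visit(trie, list(range(len(query) + 1)))
    return best[0]


stars = make_trie(["Tom Hanks", "Keanu Reeves", "Samuel Jackson"])
print(min_edit_distance("Ton Hank", stars))  # 2
```

Strings that share a prefix share the rows computed for that prefix, so the cost is proportional to the number of trie nodes visited rather than to the number of stored strings.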

14 Compressed trie to automaton
– Each node is a state.
– Each edge becomes a transition between two states.
– A compressed node is converted into automaton nodes by expanding it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε.

15 Outline
– Motivation: supporting queries with mixed-type predicates
– Our approach: MAT tree
– Construction and maintenance of MAT tree
– Experiments

16 Constructing MAT-tree
Option 1: insert records one by one.
Option 2:
– bulk-load records
– construct the MAT-tree bottom-up

17 Compressing a trie
Important:
– Accurately represent strings in a limited space.
– Minimize “information loss”.
– Maintain the pruning power during a traversal.
Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings “clustered”
– (3) Combining (1) and (2)

18 Method (1): Reducing # of accepted strings
Intuition:
– reducing this # makes the compressed trie more accurate
Goodness function: # of accepted strings
Algorithm: “Randomized”
– Randomly select k initial centers
– Randomly select one of the centers
– Randomly select an unselected node
– Swap them if the swap improves the goodness function
– Repeat for a fixed number of iterations
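
The “Randomized” steps amount to a swap-based local search, sketched below. The real goodness function (the number of strings accepted by the compressed trie) depends on how centers compress the trie, which this transcript does not spell out, so the sketch takes the goodness function as a parameter and demonstrates it with a made-up `toy_goodness`. The names `randomized_centers`, `iterations`, and `seed` are assumptions.

```python
import random


def randomized_centers(nodes, k, goodness, iterations=1000, seed=0):
    """Swap-based random search for k centers; lower goodness is better."""
    rng = random.Random(seed)
    centers = rng.sample(nodes, k)              # k random initial centers
    best = goodness(centers)
    for _ in range(iterations):
        unselected = [n for n in nodes if n not in centers]
        if not unselected:
            break
        i = rng.randrange(k)                    # pick one current center
        candidate = rng.choice(unselected)      # pick one unselected node
        trial = centers[:i] + [candidate] + centers[i + 1:]
        score = goodness(trial)
        if score < best:                        # keep the swap only if it helps
            centers, best = trial, score
    return centers, best


# Toy stand-in: each node is an integer "cost"; a real goodness function
# would count the strings accepted by the compressed trie.
def toy_goodness(centers):
    return sum(centers)


centers, score = randomized_centers(list(range(20)), 3, toy_goodness)
print(sorted(centers), score)
```

Because a swap is accepted only when it improves the score, the search never gets worse than its random starting point, but it can stop at a local optimum; the iteration count trades construction time for quality.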

19 Method (2): Keeping accepted strings clustered
Intuition:
– keep the accepted strings similar to the original ones by letting them share a common prefix
– place the k centers as close to the root as possible
Algorithm: “BreadthFirst”
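
Placing centers as close to the root as possible can be sketched as a breadth-first walk that stops after k nodes. The sketch assumes that the root itself is not a candidate center and that ties within a level are broken alphabetically; the slide does not state either, and the names `make_trie` and `breadth_first_centers` are illustrative.

```python
from collections import deque


def make_trie(strings):
    """Build a trie out of nested dicts: {"end": bool, "children": {...}}."""
    root = {"end": False, "children": {}}
    for s in strings:
        node = root
        for ch in s:
            node = node["children"].setdefault(ch, {"end": False, "children": {}})
        node["end"] = True
    return root


def breadth_first_centers(trie, k):
    """Return the prefixes of the first k non-root nodes in breadth-first order."""
    centers = []
    # ASSUMPTION: start below the root, visiting siblings alphabetically.
    queue = deque((ch, trie["children"][ch]) for ch in sorted(trie["children"]))
    while queue and len(centers) < k:
        prefix, node = queue.popleft()
        centers.append(prefix)
        for ch in sorted(node["children"]):
            queue.append((prefix + ch, node["children"][ch]))
    return centers


print(breadth_first_centers(make_trie(["tom", "ton", "sam"]), 3))  # ['s', 't', 'sa']
```

Each selected node is identified by its prefix, so every string represented under a center shares that prefix with the original strings below it.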

20 Method (3): Combining (1) and (2)
Intuition:
– minimize the number of accepted strings and, at the same time, maintain their similarity to the originals
Algorithm: “BottomUp”
– Keep shrinking the trie bottom-up until we have k nodes
– At each step, compress the node that adds the fewest additional accepted strings

21 Dynamic maintenance
Insertion (s, n):
– Search the index for (s, n). If it’s not in the index, identify the correct leaf node.
– If no overflow: update the “MBR” of the leaf node and, recursively, of its ancestors if necessary.
– If overflow: split the leaf node, construct two compressed tries, and cascade the split to the ancestors if necessary.
Deletion and update are handled similarly.
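
The insertion steps can be sketched on a single level of leaves. Several parts are simplifying assumptions rather than details from the talk: the leaf is chosen by least enlargement of its numeric range (R-tree style), an overflowing leaf is split at the median of the numeric attribute, the capacity is 3, and neither the compressed tries nor the cascade to ancestors is modeled. `Leaf`, `CAPACITY`, `enlargement`, and `insert` are hypothetical names.

```python
CAPACITY = 3  # ASSUMPTION: maximum number of records per leaf


class Leaf:
    def __init__(self, records):
        self.records = list(records)  # (string, number) pairs
        self.refresh()

    def refresh(self):
        """Recompute the numeric bounding range of the leaf."""
        numbers = [n for _, n in self.records]
        self.lo, self.hi = min(numbers), max(numbers)


def enlargement(leaf, n):
    """How far the leaf's numeric range must grow to cover n."""
    return max(leaf.lo - n, 0, n - leaf.hi)


def insert(leaves, s, n):
    # Step 1: search for (s, n); nothing to do if it is already indexed.
    if any((s, n) in leaf.records for leaf in leaves):
        return
    if not leaves:
        leaves.append(Leaf([(s, n)]))
        return
    # Step 2: pick the leaf whose numeric range grows the least.
    leaf = min(leaves, key=lambda l: enlargement(l, n))
    leaf.records.append((s, n))
    if len(leaf.records) <= CAPACITY:
        leaf.refresh()  # no overflow: only the bounding range changes
        return
    # Step 3: overflow, so split at the median of the numeric attribute.
    ordered = sorted(leaf.records, key=lambda r: r[1])
    mid = len(ordered) // 2
    i = leaves.index(leaf)
    leaves[i:i + 1] = [Leaf(ordered[:mid]), Leaf(ordered[mid:])]


leaves = []
for record in [("a", 1990), ("b", 1984), ("c", 1999), ("d", 2005)]:
    insert(leaves, *record)
print([(leaf.lo, leaf.hi) for leaf in leaves])  # [(1984, 1990), (1999, 2005)]
```

In a full index each new leaf would also get its own compressed trie, and the parent's entries would be updated or split in the same way.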

22 Outline
– Motivation: supporting queries with mixed-type predicates
– Our approach: MAT tree
– Construction and maintenance of MAT tree
– Experiments

23 Setting
Data:
– IMDB: 100K movie star records (name and YOB)
– Customers: 50K records (name and YOB)
Test bed:
– PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
– Visual C++ compiler
Results were similar on both data sets; results are reported for IMDB.

24 Implemented approaches
– B-tree
– Q-tree
– B-tree & Q-tree
– BQ-tree
– BM-tree
– Sequential scan
“BBQ-tree”?

25 “2 > 1 + 1”
An integrated indexing structure is better than two separate indexing structures (δs=3, δn=4)

26 Scalability

27 Effect of numeric threshold δn

28 Effect of string threshold δs

29 Dynamic Maintenance: time

30 Dynamic maintenance: MAT quality

31 Number of centers
– Increasing the number of centers may not reduce the running time: pruning power has to be weighed against computational cost.
– BottomUp and BreadthFirst (compared to Randomized) place centers close to the root, so early termination is more likely.

32 Conclusion
– MAT-tree: an efficient indexing structure for queries with mixed-type predicates
– Can be efficiently constructed and maintained
– Future work: develop a uniform framework to support different kinds of similarity functions
Q&A?
The Flamingo Project :