
A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes Manolis Terrovitis (NTUA) Spyros Passas (NTUA) Panos Vassiliadis (UoI) Timos Sellis (NTUA)

Terrovitis et al., CIKM '06

Problem
We are interested in low-cardinality set values:
– Retail store transaction logs
– Web logs
– Biomedical databases, etc.
We address the efficient evaluation of containment queries:
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?
We propose the Hybrid Trie-Inverted file (HTI) index.

Outline
– Problem definition
– The HTI index
– Query evaluation
– Experiments
– Conclusions

Data and queries

tid  products         tid  products
1    {f,a}            9    {a,e}
2    {a,d,c}          10   {g,c,a}
3    {c,b,a}          11   {b,a,e}
4    {f,a,c}          12   {b,d,c}
5    {c,g}            13   {c,f,a,d,b}
6    {a,b,g,c,d,e}    14   {b,d}
7    {a,d,b}          15   {e}
8    {a,e,b}          16   {b,f,a}

– Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)
– Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)
– Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
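
The three query types can be stated directly as set predicates. The sketch below is a brute-force scan over the sample transactions, meant only to pin down the semantics; it is not the indexing method of the talk, and the function names are illustrative.

```python
# Sample transactions from the slide (tid -> set of products).
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def subset_query(db, q):
    """Transactions that contain every item of q (q is a subset of the record)."""
    return sorted(tid for tid, items in db.items() if q <= items)


def equality_query(db, q):
    """Transactions that contain exactly the items of q."""
    return sorted(tid for tid, items in db.items() if items == q)


def superset_query(db, q):
    """Transactions that contain only items from q (q is a superset of the record)."""
    return sorted(tid for tid, items in db.items() if items <= q)


Q = {"a", "b", "d"}
print(subset_query(DB, Q))    # [6, 7, 13]
print(equality_query(DB, Q))  # [7]
print(superset_query(DB, Q))  # [7, 14]
```

A scan like this touches every record, which is what the index structures discussed next try to avoid.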

Data and queries
Traditional methods:
– Signature files
– Inverted files
Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality, superset)

The HTI index: Background – The inverted file

HTI index: Inverted files – problems
The evaluation of containment queries relies on merge-joining the inverted lists.
The inverted lists become very long:
– when the database size is very big compared to the vocabulary
– when the items’ distribution is skewed
This is often the case in the real world!
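
A minimal sketch of the plain inverted file and of a subset query evaluated by merge-joining its lists, using the sample transactions from the earlier slide. The function names are illustrative; a real inverted file would keep the lists on disk, while this sketch keeps them in Python lists.

```python
from collections import defaultdict

# Sample transactions from the slides (tid -> set of products).
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_inverted_file(db):
    """Map each item to the sorted list of tids that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(db):  # visiting tids in order keeps every list sorted
        for item in db[tid]:
            inverted[item].append(tid)
    return dict(inverted)


def merge_join(left, right):
    """Intersect two sorted tid lists with a single two-pointer pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(inverted, q):
    """Tids of records containing every item of q (q assumed non-empty)."""
    lists = sorted((inverted.get(item, []) for item in q), key=len)
    if not lists:
        return []
    result = lists[0]  # start from the shortest list
    for lst in lists[1:]:
        result = merge_join(result, lst)
    return result


inverted = build_inverted_file(DB)
print(len(inverted["a"]))                      # 12 of the 16 records contain 'a'
print(subset_query(inverted, {"a", "b", "d"}))  # [6, 7, 13]
```

Even in this tiny example the list of ‘a’ covers 12 of the 16 records, which illustrates why skewed distributions make the merge-join expensive.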

HTI index: Solution?
We need to break up the lists! But how?
– Let's make a list for every combination of items!

HTI index: Solution?
We assume a total order on the items of the database, based on their frequency of appearance.
We order the items in each set value and transform it into a sequence.
We create a path in the access tree for each sequence.
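
The three steps above can be sketched as follows on the sample transactions. The slide does not state the direction of the order or how ties are broken, so this sketch assumes most-frequent-first with alphabetical tie-breaking; the nested-dictionary tree and the function names are illustrative choices.

```python
from collections import Counter

# Sample transactions from the slides (tid -> set of products).
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def frequency_order(db):
    """Total order on items: descending frequency, ties alphabetical (assumption)."""
    counts = Counter(item for items in db.values() for item in items)
    return sorted(counts, key=lambda item: (-counts[item], item))


def to_sequence(items, order):
    """Turn a set value into a sequence that follows the total order."""
    rank = {item: pos for pos, item in enumerate(order)}
    return sorted(items, key=rank.__getitem__)


def build_access_tree(db, order):
    """Insert one root-to-node path per sequence; nodes are nested dicts."""
    root = {}
    for items in db.values():
        node = root
        for item in to_sequence(items, order):
            node = node.setdefault(item, {})
    return root


order = frequency_order(DB)
print(order)                        # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(to_sequence(DB[13], order))   # ['a', 'b', 'c', 'd', 'f']
tree = build_access_tree(DB, order)
print(sorted(tree))                 # ['a', 'b', 'c', 'e']
```

Sequences that share a prefix share a path, so records 3, 6 and 13 all pass through the nodes a, b and c.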

HTI index: All combinations? Maybe not…

HTI index: An access tree for the frequent items

The HTI index

HTI index: The basic points
The access tree is used only for the most frequent items.
The inverted lists are restructured so that each node of the access tree points to a different inverted sublist.
We keep the access tree in main memory.
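
One possible reading of these three points, sketched on the sample transactions. Several details are assumptions, not taken from the slides: the threshold is modeled as "the `num_frequent` most frequent items", a record id is stored in the sublist of every node on the record's path, and non-frequent items keep ordinary inverted lists. `Node` and `build_hti` are hypothetical names.

```python
from collections import Counter, defaultdict

# Sample transactions from the slides (tid -> set of products).
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


class Node:
    """Access-tree node: children by item, plus the sublist it points to."""
    def __init__(self):
        self.children = {}
        self.sublist = []


def build_hti(db, num_frequent):
    """Build the access tree over the top items and plain lists for the rest."""
    counts = Counter(item for items in db.values() for item in items)
    order = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: pos for pos, item in enumerate(order[:num_frequent])}
    root = Node()
    inverted = defaultdict(list)
    for tid in sorted(db):  # sorted tids keep every (sub)list sorted
        node = root
        frequent_seq = sorted((i for i in db[tid] if i in rank),
                              key=rank.__getitem__)
        for item in frequent_seq:
            node = node.children.setdefault(item, Node())
            node.sublist.append(tid)  # assumption: id stored at each path node
        for item in db[tid]:
            if item not in rank:
                inverted[item].append(tid)
    return root, dict(inverted), rank


root, inverted, rank = build_hti(DB, num_frequent=3)
print(sorted(rank))                                  # ['a', 'b', 'c']
print(root.children["a"].children["b"].sublist)      # [3, 6, 7, 8, 11, 13, 16]
print(root.children["b"].sublist)                    # [12, 14]
print(inverted["d"])                                 # [2, 6, 7, 12, 13, 14]
```

Under these assumptions the single inverted list of ‘b’ (9 ids) is split into two sublists, one under the path a→b and one under the root-level node b.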

Query evaluation: Basic steps
1. Find the frequent items of the query set
2. Use the access tree to detect the sublists which might participate in the answer
3. Merge-join these sublists with the inverted lists of the non-frequent items
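
The three steps can be sketched for a subset query as follows. This is a simplified, self-contained illustration under stated assumptions, not the authors' implementation: the threshold is modeled as the `num_frequent` most frequent items, each node's sublist holds the ids of all records whose path passes through it, and all names (`Node`, `build_hti`, `collect_sublists`, `subset_query`) are hypothetical. Equality and superset queries are not covered.

```python
from collections import Counter, defaultdict
from heapq import merge

DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


class Node:
    def __init__(self):
        self.children = {}
        self.sublist = []


def build_hti(db, num_frequent):
    counts = Counter(item for items in db.values() for item in items)
    order = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: pos for pos, item in enumerate(order[:num_frequent])}
    root, inverted = Node(), defaultdict(list)
    for tid in sorted(db):
        node = root
        for item in sorted((i for i in db[tid] if i in rank), key=rank.__getitem__):
            node = node.children.setdefault(item, Node())
            node.sublist.append(tid)
        for item in db[tid]:
            if item not in rank:
                inverted[item].append(tid)
    return root, dict(inverted), rank


def collect_sublists(node, needed, rank):
    """Step 2: sublists of nodes whose path contains all of `needed` (rank order)."""
    found = []
    for item, child in node.children.items():
        if item == needed[0]:
            if len(needed) == 1:
                found.append(child.sublist)
            else:
                found.extend(collect_sublists(child, needed[1:], rank))
        elif rank[item] < rank[needed[0]]:
            found.extend(collect_sublists(child, needed, rank))
        # otherwise paths are rank-ordered, so needed[0] cannot occur below
    return found


def intersect(left, right):
    """Two-pointer merge-join of two sorted tid lists."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i, j = i + 1, j + 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(root, inverted, rank, q):
    # Step 1: split the query into frequent and non-frequent items.
    frequent_q = sorted((i for i in q if i in rank), key=rank.__getitem__)
    lists = [inverted.get(i, []) for i in q if i not in rank]
    if frequent_q:
        lists.append(list(merge(*collect_sublists(root, frequent_q, rank))))
    if not lists:
        return []
    # Step 3: merge-join the candidate ids with the non-frequent lists.
    lists.sort(key=len)
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result


root, inverted, rank = build_hti(DB, num_frequent=3)
print(subset_query(root, inverted, rank, {"b", "c", "d"}))  # [6, 12, 13]
```

With ‘a’, ‘b’ and ‘c’ treated as frequent, the query (‘b’, ‘c’, ‘d’) only reads the two sublists under the paths a→b→c and b→c, then joins them with the list of ‘d’.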

Subset – (‘b’, ‘c’, ‘d’)

Equality – (‘b’, ‘c’, ‘d’)

Superset – (‘b’, ‘c’, ‘d’)

Experiments: Setup
Real data from UCI:
– Web log from microsoft.com [320k records, 294 items]
– Web log from msnbc.com [1M records, 17 items]
Synthetic data:
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items

Experiments: Query performance – DB size

Experiments: Query performance – query length

Experiments: Access tree size – DB size

Experiments
The HTI scales much better than the inverted file as the query size and the database size grow.
A small threshold is enough for a performance gain of over an order of magnitude.
The main-memory requirements do not exceed 0.5M for the real data.

Conclusions
The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items.
The HTI index significantly outperforms the inverted file for small domains and skewed item distributions.
It has moderate memory requirements, which can be adjusted by choosing the right threshold.

The End. Thank you!

Experiments: Vocabulary size

Experiments: Threshold choice