A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes
Manolis Terrovitis (NTUA), Spyros Passas (NTUA), Panos Vassiliadis (UoI), Timos Sellis (NTUA)

Terrovitis et al., CIKM '06

Problem
We are interested in low-cardinality set values
– Retail store transaction logs
– Web logs
– Biomedical databases, etc.
We address the efficient evaluation of containment queries
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?
We propose the Hybrid Trie-Inverted file (HTI) index

Outline
– Problem definition
– The HTI index
– Query evaluation
– Experiments
– Conclusions

Data and queries
Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)
Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)
Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)

tid  products          tid  products
1    {f,a}             9    {a,e}
2    {a,d,c}           10   {g,c,a}
3    {c,b,a}           11   {b,a,e}
4    {f,a,c}           12   {b,d,c}
5    {c,g}             13   {c,f,a,d,b}
6    {a,b,g,c,d,e}     14   {b,d}
7    {a,d,b}           15   {e}
8    {a,e,b}           16   {b,f,a}

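The three query types can be stated as plain set predicates over the sample transactions. The following is a minimal brute-force sketch, not the indexed evaluation the talk proposes; the names `DB`, `subset_query`, `equality_query` and `superset_query` are illustrative and do not come from the slides.

```python
# Sample transactions from the slide, keyed by tid (1..16).
DB = {
    tid: set(items)
    for tid, items in enumerate(
        "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(),
        start=1,
    )
}


def subset_query(db, qs):
    """Tids of transactions that contain every item of qs."""
    return sorted(tid for tid, items in db.items() if qs <= items)


def equality_query(db, qs):
    """Tids of transactions whose item set is exactly qs."""
    return sorted(tid for tid, items in db.items() if items == qs)


def superset_query(db, qs):
    """Tids of transactions that contain only items drawn from qs."""
    return sorted(tid for tid, items in db.items() if items <= qs)


query = {"a", "b", "d"}
print(subset_query(DB, query))    # [6, 7, 13]
print(equality_query(DB, query))  # [7]
print(superset_query(DB, query))  # [7, 14]
```

Every call scans all transactions, which is the cost an index is meant to avoid.
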
Data and queries
Traditional methods
– Signature files
– Inverted files
Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality, superset)

The HTI index
Background – The inverted file

HTI index
Inverted files - problems
The evaluation of containment queries relies on merge-joining the inverted lists
The inverted lists become very long
– when the database size is very big compared to the vocabulary
– when the items’ distribution is skewed
This is often the case in the real world!

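To make the cost concrete, here is a small sketch of a plain inverted file over the sample transactions, with a subset query answered by merge-joining the lists. The function names are illustrative assumptions. Joining the shortest lists first is a common heuristic, not something the slides state.

```python
from collections import defaultdict

# Sample transactions from the talk, keyed by tid (1..16).
DB = {
    tid: set(items)
    for tid, items in enumerate(
        "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(),
        start=1,
    )
}


def build_inverted_file(db):
    """Map each item to the ascending list of tids that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(db):
        for item in db[tid]:
            inverted[item].append(tid)
    return inverted


def merge_join(left, right):
    """Intersect two ascending tid lists with a two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(inverted, qs):
    """Tids containing all items of qs (an empty qs yields an empty answer here)."""
    lists = sorted((inverted.get(item, []) for item in qs), key=len)
    if not lists:
        return []
    result = lists[0]
    for lst in lists[1:]:
        result = merge_join(result, lst)
    return result


INVERTED = build_inverted_file(DB)
print(len(INVERTED["a"]))                      # 12 of the 16 tids
print(subset_query(INVERTED, {"a", "b", "d"}))  # [6, 7, 13]
```

The list for ‘a’ already covers 12 of the 16 transactions, which illustrates how a skewed distribution makes the most frequent lists long.
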
HTI index
Solution?
We need to break up the lists! But how?
– Let's make a list for every combination of items!

HTI index
Solution?
We assume a total order on the items of the database, based on their frequency of appearance
We order the items in each set value and transform it into a sequence
We create a path in the access tree for each sequence

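The three steps above can be sketched as follows. This is an illustration under assumptions: items are ranked most frequent first, ties are broken alphabetically, and each tid is stored at the node where its sequence ends. None of these details are confirmed by the slides, and the names `Node`, `frequency_order`, `to_sequence` and `build_trie` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

# Sample transactions from the talk, keyed by tid (1..16).
DB = {
    tid: set(items)
    for tid, items in enumerate(
        "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(),
        start=1,
    )
}


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # item -> child Node
    tids: list = field(default_factory=list)      # tids whose sequence ends here


def frequency_order(db):
    """Rank items by descending frequency (assumed), ties alphabetical (assumed)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ranked)}


def to_sequence(items, rank):
    """Turn a set value into a sequence that follows the total order."""
    return sorted(items, key=lambda item: rank[item])


def build_trie(db, rank):
    """Create one root-to-node path per ordered set value."""
    root = Node()
    for tid in sorted(db):
        node = root
        for item in to_sequence(db[tid], rank):
            node = node.children.setdefault(item, Node())
        node.tids.append(tid)
    return root


RANK = frequency_order(DB)
TRIE = build_trie(DB, RANK)
print(sorted(RANK, key=RANK.get))                  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(to_sequence({"c", "f", "a", "d", "b"}, RANK))  # ['a', 'b', 'c', 'd', 'f']
```

On the sample data the frequency order happens to coincide with alphabetical order (a: 12, b: 9, c: 8, d: 6, e: 5, f: 4, g: 3 occurrences).
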
HTI index
All combinations? Maybe not…

HTI index
[Figure: An access tree for the frequent items]

[Figure: The HTI index]

HTI index
The basic points
The access tree is used only for the most frequent items
The inverted lists are restructured so that each node of the access tree points to a different inverted sublist
We keep the access tree in main memory

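One possible reading of these points is sketched below. It assumes that the list of a frequent item is split by the ordered frequent items that precede it in each record, so that every access-tree node, identified here by its root-to-node path, owns one sublist. The exact layout is not given on the slides. `build_hti` and `num_frequent` are hypothetical names, with `num_frequent` standing in for the threshold that selects the frequent items.

```python
from collections import Counter, defaultdict

# Sample transactions from the talk, keyed by tid (1..16).
DB = {
    tid: set(items)
    for tid, items in enumerate(
        "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(),
        start=1,
    )
}


def build_hti(db, num_frequent):
    """Return (frequent items, path -> sublist, non-frequent item -> list)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    frequent = ranked[:num_frequent]   # most frequent first (assumed order)
    sublists = defaultdict(list)       # access-tree path (tuple) -> ascending tids
    inverted = defaultdict(list)       # plain inverted lists for the other items
    for tid in sorted(db):
        path = ()
        for item in frequent:
            if item in db[tid]:
                path += (item,)        # descend one level in the access tree
                sublists[path].append(tid)
        for item in db[tid]:
            if item not in frequent:
                inverted[item].append(tid)
    return frequent, sublists, inverted


FREQUENT, SUBLISTS, INVERTED = build_hti(DB, num_frequent=3)
print(FREQUENT)                # ['a', 'b', 'c']
print(SUBLISTS[("a", "b")])    # [3, 6, 7, 8, 11, 13, 16]
print(SUBLISTS[("b",)])        # [12, 14]
```

In this sketch the inverted list of ‘b’ is split into two sublists, one for records that also contain ‘a’ and one for records that do not, while the list of ‘d’ stays a single ordinary inverted list.
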
Query Evaluation
Basic Steps
1. Find the frequent items of the query set
2. Use the access tree to detect the sublists which might participate in the answer
3. Merge-join these sublists with the inverted lists of the non-frequent items

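The three steps can be sketched for a subset query as follows. This is an illustration under assumptions rather than the authors' algorithm: the access tree is represented by its root-to-node paths, candidate nodes are found by scanning those paths instead of traversing a tree, and the top three items are treated as frequent. All function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict
from heapq import merge

# Sample transactions from the talk, keyed by tid (1..16).
DB = {
    tid: set(items)
    for tid, items in enumerate(
        "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(),
        start=1,
    )
}


def build_hti(db, num_frequent):
    """Sublists per access-tree path for frequent items, plain lists otherwise."""
    counts = Counter(item for items in db.values() for item in items)
    frequent = sorted(counts, key=lambda item: (-counts[item], item))[:num_frequent]
    sublists, inverted = defaultdict(list), defaultdict(list)
    for tid in sorted(db):
        path = ()
        for item in frequent:
            if item in db[tid]:
                path += (item,)
                sublists[path].append(tid)
        for item in db[tid]:
            if item not in frequent:
                inverted[item].append(tid)
    return frequent, sublists, inverted


def intersect(left, right):
    """Two-pointer merge-join of two ascending tid lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(index, qs):
    frequent, sublists, inverted = index
    # Step 1: split the query into frequent and non-frequent items.
    q_freq = [item for item in frequent if item in qs]
    lists = [inverted.get(item, []) for item in qs if item not in frequent]
    # Step 2: keep the nodes of the last frequent query item whose path
    # covers all frequent query items; their sublists are disjoint.
    if q_freq:
        nodes = [p for p in sublists if p[-1] == q_freq[-1] and set(q_freq) <= set(p)]
        lists.append(list(merge(*(sublists[p] for p in nodes))))
    if not lists:
        return []
    # Step 3: merge-join the candidates with the non-frequent lists.
    lists.sort(key=len)
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result


INDEX = build_hti(DB, num_frequent=3)
print(subset_query(INDEX, {"b", "c", "d"}))  # [6, 12, 13]
```

For the query (‘b’, ‘c’, ‘d’), only the sublists at paths (a, b, c) and (b, c) are read for the frequent part, four tids in total, instead of the full lists of ‘b’ and ‘c’.
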
[Figure: Subset - (‘b’, ‘c’, ‘d’)]

[Figure: Equality - (‘b’, ‘c’, ‘d’)]

[Figure: Superset - (‘b’, ‘c’, ‘d’)]

Experiments
Setup
Real data from UCI
– Web log from microsoft.com [320k records, 294 items]
– Web log from msnbc.com [1M records, 17 items]
Synthetic data
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items

Experiments
[Figure: Query performance – DB size]
[Figure: Query performance – query length]
[Figure: Access tree size – DB size]

Experiments
The HTI scales much better than the inverted file as the query length and the database size grow
A small threshold is enough for a performance gain of over an order of magnitude
The main memory requirements do not exceed 0.5M for the real data

Conclusions
The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items
The HTI index significantly outperforms the inverted file for small domains and skewed item distributions
It has moderate memory requirements that can be adjusted by choosing the right threshold

The End
Thank You!

Experiments
[Figure: Vocabulary size]
[Figure: Threshold choice]