Published by Richard Nichols. Modified over 9 years ago.
1
A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes
Manolis Terrovitis (NTUA)
Spyros Passas (NTUA)
Panos Vassiliadis (UoI)
Timos Sellis (NTUA)
2
Terrovitis et al., CIKM '06
Problem
We are interested in low cardinality set-values
– Retail store transaction logs
– Web logs
– Biomedical databases etc.
We address the efficient evaluation of containment queries
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?
We propose the Hybrid Trie-Inverted file (HTI) index
3
Outline
Problem definition
The HTI index
Query evaluation
Experiments
Conclusions
5
Data and queries
tid  products           tid  products
1    {f,a}              9    {a,e}
2    {a,d,c}            10   {g,c,a}
3    {c,b,a}            11   {b,a,e}
4    {f,a,c}            12   {b,d,c}
5    {c,g}              13   {c,f,a,d,b}
6    {a,b,g,c,d,e}      14   {b,d}
7    {a,d,b}            15   {e}
8    {a,e,b}            16   {b,f,a}
8
Data and queries
Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)
Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)
Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
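The three query types can be illustrated with a brute-force scan over the 16 example transactions. This is only a minimal sketch of the query semantics, not of any index; the function names are hypothetical and do not come from the talk.

```python
# The 16 example transactions from the slide, keyed by tid.
TRANSACTIONS = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def subset_query(query, transactions):
    """Tids whose set-value contains every query item."""
    return sorted(t for t, items in transactions.items() if query <= items)


def equality_query(query, transactions):
    """Tids whose set-value is exactly the query set."""
    return sorted(t for t, items in transactions.items() if items == query)


def superset_query(query, transactions):
    """Tids whose set-value contains only items drawn from the query set."""
    return sorted(t for t, items in transactions.items() if items <= query)


QUERY = {"a", "b", "d"}
print(subset_query(QUERY, TRANSACTIONS))    # [6, 7, 13]
print(equality_query(QUERY, TRANSACTIONS))  # [7]
print(superset_query(QUERY, TRANSACTIONS))  # [7, 14]
```

A full scan like this touches every record, which is what an index is meant to avoid.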
9
Data and queries
Traditional methods
– Signature files
– Inverted files
Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality, superset)
11
The HTI index
Background – The inverted file
12
HTI index
Inverted files – problems
The evaluation of containment queries relies on merge-joining the inverted lists
The inverted lists become very long
– when the database is very large compared to the vocabulary
– when the items’ distribution is skewed
This is often the case in the real world!
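To make the cost concrete, here is a minimal sketch of a plain inverted file and a merge-join subset query over the example transactions. The helper names are hypothetical, and the layout (one sorted list of tids per item) is the textbook form rather than anything specific to the talk.

```python
from collections import defaultdict
from functools import reduce

# The 16 example transactions from the "Data and queries" slide.
TRANSACTIONS = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_inverted_file(transactions):
    """Map each item to the sorted list of tids that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(transactions):  # ascending tids keep every list sorted
        for item in transactions[tid]:
            inverted[item].append(tid)
    return dict(inverted)


def merge_join(left, right):
    """Intersect two sorted tid lists with a two-pointer scan."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def subset_query(query, inverted):
    """Tids containing all query items: merge-join the items' lists."""
    lists = sorted((inverted.get(item, []) for item in query), key=len)
    return reduce(merge_join, lists) if lists else []


INVERTED = build_inverted_file(TRANSACTIONS)
print(len(INVERTED["a"]))                       # 12 of the 16 tids
print(subset_query({"a", "b", "d"}, INVERTED))  # [6, 7, 13]
```

The list for ‘a’ already holds 12 of the 16 tids. With a small vocabulary or a skewed distribution, every query that mentions a frequent item has to scan a list of this kind.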
13
HTI index
Solution?
We need to break up the lists! But how?
– Let’s make a list for every combination of items!
14
HTI index
Solution?
We assume a total order on the items of the database, based on their frequency of appearance
We order the items in each set-value and transform it into a sequence
We create a path in the access tree for each sequence
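The three steps above can be sketched as follows. This is an illustration under assumptions: the access tree is modelled as nested dictionaries, ties in frequency are broken alphabetically, and all function names are hypothetical.

```python
from collections import Counter

# The 16 example transactions from the "Data and queries" slide.
TRANSACTIONS = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def frequency_order(transactions):
    """Rank items from most to least frequent (assumed tie-break: item name)."""
    counts = Counter(item for items in transactions.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ranked)}


def to_sequence(items, rank):
    """Turn a set-value into a sequence ordered by the total order."""
    return sorted(items, key=rank.__getitem__)


def build_access_tree(transactions, rank):
    """Insert every sequence as a root-to-node path in a trie of nested dicts."""
    root = {}
    for items in transactions.values():
        node = root
        for item in to_sequence(items, rank):
            node = node.setdefault(item, {})
    return root


RANK = frequency_order(TRANSACTIONS)
TREE = build_access_tree(TRANSACTIONS, RANK)
print(to_sequence(TRANSACTIONS[13], RANK))  # ['a', 'b', 'c', 'd', 'f']
print(sorted(TREE))                         # ['a', 'b', 'c', 'e']
```

In the example data the frequencies are a=12, b=9, c=8, d=6, e=5, f=4, g=3, so the frequency order happens to coincide with the alphabetical one.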
15
HTI index
All combinations?
19
HTI index
All combinations? Maybe not…
20
HTI index
An access tree for the frequent items
22
The HTI index
26
HTI index
The basic points
The access tree is used only for the most frequent items
The inverted lists are restructured so that each node of the access tree points to a different inverted sublist
We keep the access tree in main memory
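One way to read these points is sketched below. It rests on assumptions the slides do not spell out: the threshold is modelled as a hypothetical `num_frequent` parameter (the number of most frequent items kept in the access tree), each access-tree node is identified by its path from the root, and a record is placed in the sublist of every node whose path is a prefix of the record's ordered frequent items. Non-frequent items keep ordinary inverted lists.

```python
from collections import Counter, defaultdict

# The 16 example transactions from the "Data and queries" slide.
TRANSACTIONS = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_hti(transactions, num_frequent):
    """Return (frequent items, node sublists, inverted lists of the rest).

    num_frequent is an assumed stand-in for the talk's threshold.
    """
    counts = Counter(item for items in transactions.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    frequent = ranked[:num_frequent]       # kept in frequency order
    sublists = defaultdict(list)           # node path (tuple) -> sorted tids
    inverted = defaultdict(list)           # non-frequent item -> sorted tids
    for tid in sorted(transactions):
        items = transactions[tid]
        path = ()
        for item in frequent:              # walk down the access tree
            if item in items:
                path += (item,)
                sublists[path].append(tid)
        for item in items - set(frequent):
            inverted[item].append(tid)
    return frequent, dict(sublists), dict(inverted)


FREQUENT, SUBLISTS, INVERTED = build_hti(TRANSACTIONS, num_frequent=3)
# The single inverted list of 'c' is now split across four nodes.
for path in sorted(p for p in SUBLISTS if p[-1] == "c"):
    print(path, SUBLISTS[path])
```

Under this reading the eight tids that contain ‘c’ are spread over the nodes (a,b,c), (a,c), (b,c) and (c), so each sublist holds records with a known combination of frequent items.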
28
Query Evaluation
Basic Steps
1. Find the frequent items of the query set
2. Use the access tree to detect the sublists which might participate in the answer
3. Merge-join these sublists with the inverted lists of the non-frequent items
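The three steps can be sketched for a subset query as follows. This is a hedged illustration, not the authors' implementation: the index layout (node path mapped to a sublist), the `num_frequent` threshold parameter and all function names are assumptions, and step 2 scans every node instead of traversing the tree, purely to keep the example short.

```python
from collections import Counter, defaultdict

# The 16 example transactions from the "Data and queries" slide.
TRANSACTIONS = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_hti(transactions, num_frequent):
    """Assumed layout: sublists per access-tree path, plain lists otherwise."""
    counts = Counter(i for items in transactions.values() for i in items)
    frequent = sorted(counts, key=lambda i: (-counts[i], i))[:num_frequent]
    sublists, inverted = defaultdict(list), defaultdict(list)
    for tid in sorted(transactions):
        path = ()
        for item in frequent:
            if item in transactions[tid]:
                path += (item,)
                sublists[path].append(tid)
        for item in transactions[tid] - set(frequent):
            inverted[item].append(tid)
    return frequent, dict(sublists), dict(inverted)


def merge_join(left, right):
    """Intersect two sorted tid lists."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i, j = i + 1, j + 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def hti_subset_query(query, frequent, sublists, inverted):
    # Step 1: split the query into frequent and non-frequent items.
    q_frequent = [item for item in frequent if item in query]
    lists = [inverted.get(i, []) for i in query if i not in frequent]
    if q_frequent:
        # Step 2: keep the sublists of nodes labelled with the last frequent
        # query item whose path covers all frequent query items.
        last, needed = q_frequent[-1], set(q_frequent)
        lists.append(sorted(
            tid
            for path, tids in sublists.items()
            if path[-1] == last and needed <= set(path)
            for tid in tids
        ))
    # Step 3: merge-join the candidates with the non-frequent items' lists.
    lists.sort(key=len)
    result = lists[0] if lists else []
    for other in lists[1:]:
        result = merge_join(result, other)
    return result


INDEX = build_hti(TRANSACTIONS, num_frequent=3)
print(hti_subset_query({"b", "c", "d"}, *INDEX))  # [6, 12, 13]
```

For the query (‘b’, ‘c’, ‘d’) with ‘a’, ‘b’ and ‘c’ treated as frequent, only the sublists of nodes (a,b,c) and (b,c) are read, four tids in total, before the join with the list of ‘d’.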
29
Subset – (‘b’, ‘c’, ‘d’)
34
Equality – (‘b’, ‘c’, ‘d’)
38
Superset – (‘b’, ‘c’, ‘d’)
46
Experiments
Setup
Real data from UCI
– web log from microsoft.com [320k records, 294 items]
– web log from msnbc.com [1M records, 17 items]
Synthetic data
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items
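A synthetic data set of this kind could be generated roughly as follows. The slides give only the distribution, the record counts and the vocabulary sizes, so everything else here is an assumption: the generator itself, the hypothetical `max_items` bound on record cardinality, and the use of integer item ids that double as frequency ranks.

```python
import random


def zipf_transactions(num_records, vocab_size, max_items, seed=0):
    """Draw set-valued records whose items follow a Zipf law of order 1."""
    rng = random.Random(seed)
    vocabulary = list(range(1, vocab_size + 1))     # item id == frequency rank
    weights = [1.0 / rank for rank in vocabulary]   # P(item) proportional to 1/rank
    transactions = {}
    for tid in range(1, num_records + 1):
        k = rng.randint(1, max_items)               # assumed cardinality range
        # Sampling with replacement, then deduplicating, keeps records small.
        transactions[tid] = set(rng.choices(vocabulary, weights=weights, k=k))
    return transactions


DATA = zipf_transactions(num_records=1000, vocab_size=100, max_items=5, seed=1)
print(len(DATA), DATA[1])
```

Scaling `num_records` and `vocab_size` up to the ranges on the slide only changes the arguments.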
47
Experiments
Query performance – DB size
48
Experiments
Query performance – query length
52
Experiments
Access tree size – DB size
54
Experiments
The HTI index scales much better than the inverted file as the query size and the database size grow
A small threshold is enough for a performance gain of over an order of magnitude
The main memory requirements do not exceed 0.5M for the real data
56
Conclusions
The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items
The HTI index significantly outperforms the inverted file for small domains and skewed item distributions
It has moderate memory requirements that can be adjusted by choosing the right threshold
57
The End
Thank You!
58
Experiments
Vocabulary size
59
Experiments
Threshold choice