1 A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes Manolis Terrovitis (NTUA) Spyros Passas (NTUA) Panos Vassiliadis (UoI) Timos Sellis (NTUA)

2 Terrovitis et al., CIKM '06: Problem
We are interested in low-cardinality set values:
– Retail store transaction logs
– Web logs
– Biomedical databases, etc.
We address the efficient evaluation of containment queries:
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?
We propose the Hybrid Trie-Inverted file (HTI) index.

3 Outline
– Problem definition
– The HTI index
– Query evaluation
– Experiments
– Conclusions

8 Data and queries
– Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)
– Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)
– Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)

tid  products        tid  products
1    {f,a}           9    {a,e}
2    {a,d,c}         10   {g,c,a}
3    {c,b,a}         11   {b,a,e}
4    {f,a,c}         12   {b,d,c}
5    {c,g}           13   {c,f,a,d,b}
6    {a,b,g,c,d,e}   14   {b,d}
7    {a,d,b}         15   {e}
8    {a,e,b}         16   {b,f,a}
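The three query types can be made concrete with a brute-force scan over the sample table. This is only an illustrative sketch, not the authors' method: the function names are made up here, and each product is encoded as a single character for brevity.

```python
# Sample transactions from the slide: tid -> products (one character per item).
TRANSACTIONS = {
    1: "fa", 2: "adc", 3: "cba", 4: "fac", 5: "cg", 6: "abgcde",
    7: "adb", 8: "aeb", 9: "ae", 10: "gca", 11: "bae", 12: "bdc",
    13: "cfadb", 14: "bd", 15: "e", 16: "bfa",
}

def subset_query(qs):
    """Transactions that contain every item of the query set qs."""
    return sorted(t for t, s in TRANSACTIONS.items() if set(qs) <= set(s))

def equality_query(qs):
    """Transactions whose item set is exactly qs."""
    return sorted(t for t, s in TRANSACTIONS.items() if set(s) == set(qs))

def superset_query(qs):
    """Transactions whose items all come from qs."""
    return sorted(t for t, s in TRANSACTIONS.items() if set(s) <= set(qs))

print(subset_query("abd"))    # [6, 7, 13]
print(equality_query("abd"))  # [7]
print(superset_query("abd"))  # [7, 14]
```

A full scan like this touches every record, which is the cost an index is meant to avoid.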

9 Data and queries
Traditional methods:
– Signature files
– Inverted files
Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality, superset)

11 The HTI index: Background – The inverted file

12 HTI index: Inverted files – problems
The evaluation of containment queries relies on merge-joining the inverted lists.
The inverted lists become very long:
– when the database is very large compared to the vocabulary
– when the item distribution is skewed
This is often the case in the real world!
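To see why list length matters, here is a minimal sketch of a plain inverted file and a subset query answered by merge-joining its lists. It uses the 16 sample transactions from the earlier slide (one character per item). The function names are illustrative, and the query is assumed to be non-empty.

```python
from collections import defaultdict

TRANSACTIONS = {
    1: "fa", 2: "adc", 3: "cba", 4: "fac", 5: "cg", 6: "abgcde",
    7: "adb", 8: "aeb", 9: "ae", 10: "gca", 11: "bae", 12: "bdc",
    13: "cfadb", 14: "bd", 15: "e", 16: "bfa",
}

def build_inverted_file(transactions):
    """Map each item to the sorted list of tids that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(transactions):
        for item in transactions[tid]:
            inverted[item].append(tid)
    return inverted

def merge_join(left, right):
    """Intersect two sorted tid lists with a two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(inverted, items):
    """Tids containing all query items (items must be non-empty)."""
    lists = sorted((inverted.get(item, []) for item in items), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = merge_join(result, other)
    return result

inverted = build_inverted_file(TRANSACTIONS)
print(len(inverted["a"]), "of", len(TRANSACTIONS), "tids are in the list of 'a'")
print(subset_query(inverted, "abd"))
```

Every merge step walks both lists in full, so a frequent item such as 'a' forces a scan over most of the database even when the answer is small.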

13 HTI index: Solution?
We need to break up the lists! But how?
– Let's make a list for every combination of items!

14 HTI index: Solution?
– We assume a total order on the items of the database, based on their frequency of appearance
– We order the items in each set-value and transform it into a sequence
– We create a path in the access tree for each sequence
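These three steps can be sketched on the sample transactions as follows. The nested-dictionary trie and the alphabetical tie-break are assumptions made for this illustration; the slides do not specify either.

```python
from collections import Counter

TRANSACTIONS = {
    1: "fa", 2: "adc", 3: "cba", 4: "fac", 5: "cg", 6: "abgcde",
    7: "adb", 8: "aeb", 9: "ae", 10: "gca", 11: "bae", 12: "bdc",
    13: "cfadb", 14: "bd", 15: "e", 16: "bfa",
}

def frequency_order(transactions):
    """Items from most to least frequent (ties broken alphabetically: an assumption)."""
    counts = Counter(item for items in transactions.values() for item in items)
    return sorted(counts, key=lambda item: (-counts[item], item))

def to_sequence(items, rank):
    """Turn a set-value into a sequence that follows the total order."""
    return sorted(items, key=rank.get)

def build_access_tree(transactions):
    """Insert every sequence as a root-to-node path in a trie of nested dicts."""
    order = frequency_order(transactions)
    rank = {item: pos for pos, item in enumerate(order)}
    root = {}
    for items in transactions.values():
        node = root
        for item in to_sequence(items, rank):
            node = node.setdefault(item, {})
    return order, rank, root

order, rank, root = build_access_tree(TRANSACTIONS)
print(order)                        # total order on the items
print(to_sequence("cfadb", rank))   # sequence for transaction 13
print(sorted(root))                 # first items of all paths
```

On this data the total order happens to coincide with alphabetical order, so transaction 13 becomes the path a, b, c, d, f.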

15 HTI index: All combinations?

19 HTI index: All combinations? Maybe not…

20 HTI index: An access tree for the frequent items

22 The HTI index

26 HTI index: The basic points
– The access tree is used only for the most frequent items
– The inverted lists are restructured so that each node of the access tree points to a different inverted sublist
– We keep the access tree in main memory
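One possible reading of these points, sketched on the sample transactions: the list of a frequent item is split into sublists, one per access-tree node, where a node is identified by the path of more frequent items leading to it. The layout below (a dictionary keyed by the root-to-node path, and a `num_frequent` cut-off standing in for the threshold) is an assumption for illustration, not the paper's storage format.

```python
from collections import Counter, defaultdict

TRANSACTIONS = {
    1: "fa", 2: "adc", 3: "cba", 4: "fac", 5: "cg", 6: "abgcde",
    7: "adb", 8: "aeb", 9: "ae", 10: "gca", 11: "bae", 12: "bdc",
    13: "cfadb", 14: "bd", 15: "e", 16: "bfa",
}

def build_hti(transactions, num_frequent):
    """Return (frequent items, rank, per-node sublists, plain inverted lists)."""
    counts = Counter(i for items in transactions.values() for i in items)
    order = sorted(counts, key=lambda i: (-counts[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    frequent = set(order[:num_frequent])  # hypothetical threshold: top-k items

    sublists = defaultdict(list)  # access-tree node (as its path) -> sorted tids
    inverted = defaultdict(list)  # ordinary inverted lists for the other items
    for tid in sorted(transactions):
        path = ()
        # Frequent items rank first, so they form the prefix of the sequence.
        for item in sorted(transactions[tid], key=rank.get):
            if item in frequent:
                path += (item,)
                sublists[path].append(tid)
            else:
                inverted[item].append(tid)
    return frequent, rank, sublists, inverted

frequent, rank, sublists, inverted = build_hti(TRANSACTIONS, num_frequent=3)
print(sublists[("a", "b")])  # tids with 'b' that also contain 'a'
print(sublists[("b",)])      # tids with 'b' but without 'a'
```

With the three most frequent items indexed, the nine-entry list of 'b' is split into a sublist of seven tids under the path (a, b) and one of two tids under (b), while 'd' keeps an ordinary inverted list.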

28 Query evaluation: Basic steps
1. Find the frequent items of the query set
2. Use the access tree to detect the sublists that might participate in the answer
3. Merge-join these sublists with the inverted lists of the non-frequent items
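The steps above can be sketched for a subset query as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: nodes are stored in a dictionary keyed by their root-to-node path, the threshold is a hypothetical top-k cut-off (`num_frequent`), and step 2 scans all node paths where a real implementation would descend and prune the tree.

```python
from collections import Counter, defaultdict

TRANSACTIONS = {
    1: "fa", 2: "adc", 3: "cba", 4: "fac", 5: "cg", 6: "abgcde",
    7: "adb", 8: "aeb", 9: "ae", 10: "gca", 11: "bae", 12: "bdc",
    13: "cfadb", 14: "bd", 15: "e", 16: "bfa",
}

def build_hti(transactions, num_frequent):
    counts = Counter(i for items in transactions.values() for i in items)
    order = sorted(counts, key=lambda i: (-counts[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    frequent = set(order[:num_frequent])
    sublists, inverted = defaultdict(list), defaultdict(list)
    for tid in sorted(transactions):
        path = ()
        for item in sorted(transactions[tid], key=rank.get):
            if item in frequent:
                path += (item,)
                sublists[path].append(tid)   # sublist of the node at this path
            else:
                inverted[item].append(tid)   # plain inverted list
    return frequent, rank, sublists, inverted

def merge_join(left, right):
    """Intersect two sorted tid lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(index, query):
    frequent, rank, sublists, inverted = index
    # Step 1: split the query into frequent and non-frequent items.
    q_freq = sorted((i for i in query if i in frequent), key=rank.get)
    q_rare = [i for i in query if i not in frequent]
    candidates = None
    if q_freq:
        # Step 2: keep nodes labelled with the least frequent of the frequent
        # query items whose path covers all frequent query items.
        last, tids = q_freq[-1], []
        for path, sub in sublists.items():
            if path[-1] == last and set(q_freq) <= set(path):
                tids.extend(sub)
        candidates = sorted(tids)
    # Step 3: merge-join with the inverted lists of the non-frequent items.
    for item in q_rare:
        lst = inverted.get(item, [])
        candidates = lst if candidates is None else merge_join(candidates, lst)
    return candidates or []

index = build_hti(TRANSACTIONS, num_frequent=3)
print(subset_query(index, {"b", "c", "d"}))
```

For the query (‘b’, ‘c’, ‘d’) with three frequent items, only the nodes at paths (a, b, c) and (b, c) qualify, so four tids are merge-joined with the list of 'd' instead of the full nine-entry list of 'b' and eight-entry list of 'c'.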

29 Subset - (‘b’, ‘c’, ‘d’)

34 Equality - (‘b’, ‘c’, ‘d’)

38 Superset - (‘b’, ‘c’, ‘d’)

46 Experiments: Setup
Real data from UCI:
– Web log from microsoft.com [320k records, 294 items]
– Web log from msnbc.com [1M records, 17 items]
Synthetic data:
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items
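For readers who want to reproduce a similar synthetic workload, the sketch below draws items with probability proportional to 1/rank, which is one common reading of "Zipfian distribution of order 1". The generator name, the uniform record length, and the `max_len` parameter are assumptions; the slides do not say how record sizes were chosen.

```python
import random

def zipf_transactions(num_records, vocab_size, max_len, seed=0):
    """Generate set-valued records whose items follow a Zipf(1) popularity law."""
    rng = random.Random(seed)
    items = list(range(vocab_size))
    # Item with rank r (1-based) gets weight 1/r.
    weights = [1.0 / (rank + 1) for rank in items]
    data = []
    for _ in range(num_records):
        length = rng.randint(1, max_len)  # assumption: uniform record length
        # Sampling with replacement; duplicates collapse, so a record may
        # end up with fewer than `length` distinct items.
        data.append(set(rng.choices(items, weights=weights, k=length)))
    return data

records = zipf_transactions(num_records=100_000, vocab_size=1_000, max_len=10)
print(len(records), "records generated")
```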

47 Experiments: Query performance – DB size

48 Experiments: Query performance – query length

52 Experiments: Access tree size – DB size

54 Experiments
– The HTI scales much better than the inverted file as the query and database sizes grow
– A small threshold is enough for a performance gain of over an order of magnitude
– The main memory requirements do not exceed 0.5M for the real data

56 Conclusions
– The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items
– The HTI index significantly outperforms the inverted file for small domains and skewed item distributions
– It has moderate memory requirements that can be adjusted by choosing the right threshold

57 The End. Thank you!

58 Experiments: Vocabulary size

59 Experiments: Threshold choice

