Published by Trevor McCarthy. Modified over 9 years ago.
1
Efficient Keyword Search over Virtual XML Views
Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar, Muthiah Chettiar, and Fan Yang (Cornell University)
Jayavel Shanmugasundaram (Yahoo! Research)
2008. 02. 21.
Summarized and presented by Dongmin Shin, IDS Lab., Seoul National University
2
Copyright 2007 by CEBT
Index
– Introduction
– Background
– System Overview
– QPT Generation Module
– PDT Generation Module
– Experiments
– Conclusion and Future Work
4
Introduction
Fundamental assumption of traditional information retrieval systems: the set of documents being searched is materialized.
5
Introduction
But the view is often virtual (unmaterialized):
– The aggregator may not have the resources to materialize all the data
– If the view is materialized, its contents may be out of date, or maintaining the view may be expensive
– The data sources may not wish to provide the entire data
6
Introduction
Examples
Personalized views: MyYahoo or Microsoft SharePoint
– There are many users and their content often overlaps
– Materializing a view per user could lead to data duplication and its associated space overhead
Information integration
7
Introduction
Need: efficiently evaluating keyword search queries over virtual XML views
10
Background
XML Scoring: TF-IDF method
– tf(e,k) : the number of distinct occurrences of the keyword k in element e and its descendants
– idf(k) = (formula not preserved in this transcript)
– score(e,Q) = (formula not preserved in this transcript)
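The tf definition above can be sketched in code. Because the idf and score formulas did not survive on this slide, the versions below are ASSUMED textbook TF-IDF forms (idf as the ratio of view elements to view elements containing the keyword, score as the sum of tf × idf over the query keywords), not necessarily the ones the authors use. The `Element` class and the sample data are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """A toy XML element: its own text tokens plus child elements."""
    tokens: list
    children: list = field(default_factory=list)

def tf(e, k):
    """Occurrences of keyword k in e and all of its descendants."""
    return e.tokens.count(k) + sum(tf(c, k) for c in e.children)

def idf(k, view_elements):
    """ASSUMED form: (# view elements) / (# view elements containing k)."""
    containing = sum(1 for e in view_elements if tf(e, k) > 0)
    return len(view_elements) / containing if containing else 0.0

def score(e, query, view_elements):
    """ASSUMED form: sum over query keywords of tf * idf."""
    return sum(tf(e, k) * idf(k, view_elements) for k in query)

# Hypothetical view with two elements; book1 has two children.
title = Element(["xml", "search"])
review = Element(["xml", "xml"])
book1 = Element([], [title, review])
book2 = Element(["databases"])
view = [book1, book2]

ranked = sorted(view, key=lambda e: score(e, ["xml", "search"], view), reverse=True)
```

Note that idf here depends on the set of view elements, which is exactly what is unavailable up front when the view is virtual.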
12
System Overview
(1) Keyword queries are issued over virtual views
(2) The parser redirects the query to the Query Pattern Tree (QPT) Generation Module
(3) The QPT is sent to the Pruned Document Tree (PDT) Generation Module
(4) PDTs are generated using only the path indices and inverted list indices
(5) The rewritten query and PDTs are sent to the Evaluator
(6) The Evaluator produces the view that contains all view elements with pruned content
(7) Elements are scored, and only those with the highest scores are fully materialized using document storage
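Step (7) can be illustrated with a minimal sketch: score everything cheaply, then touch document storage only for the top k. The function name `materialize_top_k`, the `fetch` callback, and the sample scores are hypothetical and stand in for whatever the real system uses.

```python
import heapq

def materialize_top_k(scored, k, fetch):
    """scored: iterable of (score, element_id); fetch: element_id -> full content.

    Only the k highest-scoring elements are fetched from storage.
    """
    top = heapq.nlargest(k, scored)
    return [(element_id, s, fetch(element_id)) for s, element_id in top]

# Hypothetical document storage keyed by Dewey ID.
storage = {
    (1, 1): "<book>A</book>",
    (1, 2): "<book>B</book>",
    (1, 3): "<book>C</book>",
}
fetched = []  # records which IDs actually hit storage

def fetch(element_id):
    fetched.append(element_id)
    return storage[element_id]

results = materialize_top_k([(0.4, (1, 1)), (2.5, (1, 2)), (1.1, (1, 3))], 2, fetch)
```

Here element (1, 1) is scored but never read from storage, which is the point of deferring materialization.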
13
System Overview
XML Storage
Dewey IDs
– Popular ID format
– Hierarchical numbering scheme
– The ID of an element contains the ID of its parent
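A minimal sketch of why Dewey IDs are convenient, assuming IDs are written as dot-separated integers and siblings are numbered in document order. The helper names are illustrative only.

```python
def parse_dewey(s):
    """'1.2.3' -> (1, 2, 3). Tuples compare component by component."""
    return tuple(int(p) for p in s.split("."))

def parent(dewey):
    """The parent ID is the ID minus its last component; the root has no parent."""
    return dewey[:-1] if len(dewey) > 1 else None

def is_ancestor(a, d):
    """a is a proper ancestor of d iff a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

book = parse_dewey("1.2")
isbn = parse_dewey("1.2.3")

# Sorting tuples (not strings) keeps 1.2 before 1.10.
in_order = sorted([parse_dewey("1.10"), parse_dewey("1.2.1"), parse_dewey("1.2")])
```

Parent and ancestor tests need no lookup at all, only a prefix comparison on the IDs themselves.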
14
System Overview
XML Indexing
Path indices
– Evaluate XML path and twig (i.e., branching path) queries
– Store XML paths with values in a relational table
– Use indices such as a B+-tree
– One row for each unique (Path, Value) pair
– IDList : the list of IDs of all elements on the path
– A B+-tree index is built on the (Path, Value) pair
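The table layout described above can be approximated in memory. This is only a sketch: a sorted key list searched with `bisect` stands in for the B+-tree, and the `PathIndex` class, its method names, and the sample paths are hypothetical.

```python
import bisect

class PathIndex:
    """One row per unique (path, value) pair, each holding a sorted IDList."""

    def __init__(self):
        self._keys = []   # sorted (path, value) pairs, standing in for the B+-tree
        self._rows = {}   # (path, value) -> sorted list of Dewey IDs

    def add(self, path, value, dewey_id):
        key = (path, value)
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = []
        bisect.insort(self._rows[key], dewey_id)

    def lookup(self, path, value):
        """Exact match on (path, value)."""
        return self._rows.get((path, value), [])

    def lookup_path(self, path):
        """All rows for a path, any value: a range scan over the sorted keys."""
        start = bisect.bisect_left(self._keys, (path,))
        out = []
        for key in self._keys[start:]:
            if key[0] != path:
                break
            out.append((key[1], self._rows[key]))
        return out

idx = PathIndex()
idx.add("/books/book/isbn", "111", (1, 1, 1))
idx.add("/books/book/isbn", "222", (1, 2, 1))
idx.add("/books/book/year", "1996", (1, 2, 2))
idx.add("/books/book/year", "1996", (1, 1, 2))
```

Because the key is ordered on path first, a lookup by path alone becomes a contiguous scan, while a lookup by path and value is a point query.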
15
System Overview
Inverted list indices
– For each keyword in the document collection, store the list of XML elements that directly contain the keyword
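A minimal sketch of such an inverted list, assuming whitespace tokenization and lowercase keywords. Each entry pairs a Dewey ID with the keyword's tf in the text that the element directly contains. The function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_lists(elements):
    """elements: dict mapping Dewey ID tuple -> text directly contained in that element.

    Returns keyword -> list of (dewey_id, tf), sorted by Dewey ID.
    """
    lists = defaultdict(list)
    for dewey_id in sorted(elements):
        counts = Counter(elements[dewey_id].lower().split())
        for keyword, tf in counts.items():
            lists[keyword].append((dewey_id, tf))
    return dict(lists)

inv = build_inverted_lists({
    (1, 1, 1): "XML search",
    (1, 1, 2): "xml views and xml queries",
    (1, 2, 1): "databases",
})
```

Only direct containment is stored; containment by an ancestor can be derived later from the Dewey ID prefixes.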
17
QPT (Query Pattern Tree) Generation Module
– V : used for query evaluation
– C : used for result materialization
19
PDT Generation Module
Output
– Only contains elements that correspond to nodes in the QPT
– Only contains element values that are required during query evaluation
Advantages
– Query evaluation is likely to be more efficient and scalable, since the PDT is much smaller than the underlying data
– Allows us to use the regular (unmodified) query evaluator, since the PDT is in regular XML format
20
PDT Generation Module
Key Idea
An element e in the document corresponding to a node n in the QPT is selected for inclusion only if it satisfies three types of constraints:
(1) Ancestor constraint – an ancestor element of e that corresponds to the parent of n in the QPT should also be selected
(2) Descendant constraint – for each mandatory edge from n to a child of n in the QPT, at least one child/descendant element of e corresponding to that child of n should also be selected
(3) Predicate constraint – if e is a leaf node, it satisfies all predicates associated with n
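The three constraints can be checked with a simple two-pass sketch: a bottom-up pass for the predicate and descendant constraints, then a top-down pass for the ancestor constraint. This is NOT the module's actual single-pass algorithm (which works over the candidate tree described on the following slides); it only illustrates what "selected" means. The `QPTNode` class, the assumption that node names are unique, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class QPTNode:
    name: str                      # assumed unique within the QPT
    predicate: object = None       # callable on the element value, leaf nodes only
    children: list = field(default_factory=list)  # (QPTNode, mandatory: bool)

def is_ancestor(a, d):
    """Dewey ID a is a proper prefix of Dewey ID d."""
    return len(a) < len(d) and d[:len(a)] == a

def bottom_up(node, cands, ok):
    """Fill ok[name] with IDs meeting the predicate and descendant constraints."""
    for child, _ in node.children:
        bottom_up(child, cands, ok)
    keep = []
    for dewey_id, value in cands.get(node.name, []):
        if not node.children and node.predicate and not node.predicate(value):
            continue  # (3) predicate constraint fails
        if all(any(is_ancestor(dewey_id, c) for c in ok[child.name])
               for child, mandatory in node.children if mandatory):
            keep.append(dewey_id)  # (2) descendant constraint holds
    ok[node.name] = keep

def top_down(node, ok, result, parent_ids=None):
    """(1) Keep only IDs with a selected ancestor matching the parent QPT node."""
    ids = [i for i in ok[node.name]
           if parent_ids is None or any(is_ancestor(p, i) for p in parent_ids)]
    result[node.name] = ids
    for child, _ in node.children:
        top_down(child, ok, result, ids)

def select(root, cands):
    ok, result = {}, {}
    bottom_up(root, cands, ok)
    top_down(root, ok, result)
    return result

# Hypothetical QPT: book with a mandatory year (> 1995) and an optional title.
qpt = QPTNode("book", children=[
    (QPTNode("year", predicate=lambda v: v > 1995), True),
    (QPTNode("title"), False),
])
cands = {
    "book": [((1, 1), None), ((1, 2), None), ((1, 3), None)],
    "year": [((1, 1, 2), 1996), ((1, 2, 2), 1990)],
    "title": [((1, 1, 1), "XML"), ((1, 2, 1), "DB"), ((1, 3, 1), "IR")],
}
selected = select(qpt, cands)
```

In this example only book (1, 1) survives: book (1, 2) has a year that fails the predicate, and book (1, 3) has no year at all, so their titles are dropped by the ancestor constraint.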
21
PDT Generation Module
PrepareList
(1) Issues a lookup on path indices for each QPT node that has no mandatory child edges
(2) Identifies nodes that have a ‘v’ annotation to obtain values and IDs
(3) Looks up inverted list indices and retrieves the list of Dewey IDs containing the keywords along with tf values
22
PDT Generation Module
Candidate Tree (CT)
23
PDT Generation Module
Step 1 : adding new IDs
– Adds the current minimum IDs in pathLists
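Repeatedly taking the minimum ID across several sorted lists amounts to a k-way merge. The sketch below shows only that consumption order, assuming each path list is already sorted by Dewey ID; it does not model the candidate tree itself, and the function name and sample lists are hypothetical.

```python
import heapq

def merged_ids(path_lists):
    """path_lists: {QPT node name: sorted list of Dewey ID tuples}.

    Yields (dewey_id, node_name) in increasing Dewey ID order, consuming
    every list in a single forward pass.
    """
    streams = [[(dewey_id, name) for dewey_id in ids]
               for name, ids in path_lists.items()]
    yield from heapq.merge(*streams)

order = list(merged_ids({
    "book": [(1, 1), (1, 2)],
    "year": [(1, 1, 2), (1, 2, 2)],
}))
```

Because Dewey order is document order, each element arrives after its ancestors, which is what lets ancestors be placed before their descendants are seen.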
24
PDT Generation Module
Step 2 : creating PDT nodes
– Create PDT nodes using CT nodes
– Top-down
– Check the DM value of each CT node
– If it is “1”, create the node in the pdt cache
– If not, check the children of that node; if the DM value of a child node is “1”, create it in the pdt cache of the parent node
25
PDT Generation Module
Step 3 : removing CT nodes
– Bottom-up
– Check whether each node satisfies the ancestor constraints
– If not, remove it
– If so, propagate it to the pdt cache of the ancestor
– If a node has no children and does not satisfy the descendant constraints, remove it
26
PDT Generation Module
– When we remove the root node “books”, all IDs in its pdt cache are propagated to the result PDT
29
Experiments
500MB INEX dataset
Varying parameters
– Size of data, # of keywords, selectivity of keywords
– # of joins, join selectivity, level of nesting
– # of results, average size of view element
Four alternative approaches
– Baseline
– GTP : general solution to integrate structure and keyword search queries
– Efficient : proposed architecture
– Proj : techniques of projecting XML documents
30
Experiments
EFFICIENT is a scalable and efficient solution
– The cost of generating PDTs scales gracefully
– The overhead of post-processing (scoring and materializing) is negligible
– The cost of the query evaluator dominates the entire cost
31
Experiments
– Run time for EFFICIENT increases slightly because it accesses more inverted lists to retrieve tf values
– Run time for EFFICIENT increases because the cost of query evaluation increases
33
Conclusion and Future Work
Conclusion
– A general technique for evaluating keyword search queries over views
– Efficient over a wide range of parameters
Future Work
– Instead of using the regular query evaluator, use the techniques proposed for ranked query evaluation
– Handle views that contain non-monotonic operators such as group-by