APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In.

APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In Ewha Womans University In Lab Seminar 2003.10.29

Data Warehousing Lab. 2 Outline Introduction Related work Overview of APEX Construction and management of APEX Experiment result Conclusion

Data Warehousing Lab. 3 Introducton (1/2) Traditional indexes generally record all label paths from the root element in XML data. APEX does not keep all paths starting from the root and utilizes frequently used paths to improve the query performance.

Data Warehousing Lab. 4 Introduction (2/2) contribution Efficient Processing of Partial Matching Queries Workload-Aware Path Indexes Incremental Update

Data Warehousing Lab. 5 Related work Strong Dataguide 1-index Index Fabric  Such path indexes may result in performance degradation due to large sizes and exhaustive navigations for partial matching path queries start with the self-or- descendent axis(“//”)

Data Warehousing Lab. 6 Preliminary (1/2) Figure1: A sample XML data

Data Warehousing Lab. 7 Preliminary (2/2) Definition 2 Label path Ex) movie.title, name: label path of node 7 Definition 3 Data path Ex) movie.8.title and name.11: data paths of node 7 Definition 4 Data path d is an instance of a label path Ex) movie.8.title.10 is an instance of movie.title and name.11 is an instance of name.

Data Warehousing Lab. 8 Overview of APEX APEX (H APEX and G APEX ) Example query: //actor/name Searches for all actor names Just looks up the hash tree with actor.name in reverse order

Data Warehousing Lab. 9 Overview of APEX The support of a label path p=l i …l j : sub(p) The ratio of the number of queries having p as a subpath to the total number queries. Definition 6. p=l i …l j in GXML is a frequently used path if sup(p)≥minSup. Let p be a required path if it is either a frequently used path or the length of p is one. Definition 7. For a label path p of the form l i. l i+1 …l j in G XML, a edge set, T(p), is { | l i o i. l j-1.o j-1 …l j. o j is a data path in G XML }. That is, a edge set T(p) is a set of pairs of nids for the incoming edges to the last nodes that are reachable by traversing a given label path p. Ex) edge set T(title)={, }

Data Warehousing Lab. 10 Construction and management of APEX Architecture of APEX management tool The system consists of 3 main components

Data Warehousing Lab. 11 APEX 0 : initial index structure APEX 0 is the initial structure to build APEX. This step is executed only once at the beginning. have not only the structural summary in G APEX but also the extents in the nodes of G APEX.

Data Warehousing Lab. 12 Frequently used path extraction Sequential pattern mining Used naïve algorithm simply counts all sequential subsequences that appear in the query workload by one scan. Suppose required path set:{A,B,C,D,B.D} Entry of hash table Label: key value for the entry Count: frequency of label path New: check a newly create entry Xnode: points to a node in G APEX whose incoming label path is represented by the entry Next: points another node in H APEX

Data Warehousing Lab. 13 The update with frequently used paths The basic idea of update traverse the nodes in G APEX update not only the structure of G APEX with frequently used paths but also the xnode field of entries in H APEX.

Data Warehousing Lab. 14 The update with frequently used paths

Data Warehousing Lab. 15 Experiment results (1/2) Data Sets Play: tree structured data FlixML: graph structured data GedML: graph structured data Query Workload //l 1 /l 2 /…/l m or //l 1 /…/l i  l i+1 /…/l m where l i is a tag or an attribute (with the prefix of ‘@’) : QTYPE1 //l i // l i : QTYPE2

Data Warehousing Lab. 16 Experiment results (2/2) the strong DataGuide is generally inefficient for complex XML data. The performance of APEX depends on the value of minSup. As minSup decreases, the number of frequently used paths increases and more entries is stored in the H APEX. Thus, more queries can be directly obtained by look-up of H APEX. best

Data Warehousing Lab. 17 Conclusion APEX review summary does not keep all paths starting from the root and utilizes frequently used paths to improve the query performance. Utilizing the data mining algorithm to summarize paths that appear frequently in the query workload. can be incrementally updated in order to minimize the overhead of construction whenever the query workload changes. guarantees all paths of length two so that any label path expression can be evaluated by joins of extents in APEX without scanning original data. consists of two structures the graph structure G APEX : represents the structural summary of XML data with extents the hash tree H APEX : keeps the information for frequently used paths and their corresponding nodes in G APEX.

APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In.

Similar presentations

Presentation on theme: "APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In.

Similar presentations

Presentation on theme: "APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In."— Presentation transcript:

Similar presentations

About project

Feedback