Presentation is loading. Please wait.

Presentation is loading. Please wait.

Master Informatique 1 Qiuyue WangXML Data Management Structure Indexes for XML.

Similar presentations


Presentation on theme: "Master Informatique 1 Qiuyue WangXML Data Management Structure Indexes for XML."— Presentation transcript:

1 Master Informatique 1 Qiuyue WangXML Data Management Structure Indexes for XML

2 Master Informatique 2 Qiuyue WangXML Data Management Structure Indexes for XML Kinds of Indexes Value Indexes – index atomic values; e.g., data(//emp/salary) – use B+ trees (like in relational world) – (integration into query optimizer more tricky) Structure Indexes – materialize results of path expressions – (pendant to join indexes, path indices) Full Text indexes – Keyword search, inverted files – (IR world, text extenders)

3 Master Informatique 3 Qiuyue WangXML Data Management Structure Indexes for XML Value Indexes: Open Questions What is the key of the index? (Physical Design) – singletons vs. sequences – string vs. typed-value which type? (even for homogeneous domains) heterogeneous domain – composite indexes Index for what comparison? (Physical Design) – =: virually impossible due to implicit cast + exists – eq, leq, …: problems with implicit casts When is a value index applicable? (Compiler)

4 Master Informatique 4 Qiuyue WangXML Data Management Structure Indexes for XML Structure Index: Examples DataGuides 1-Index APEX Index Fabric ……

5 Master Informatique 5 Qiuyue WangXML Data Management Structure Indexes for XML Semi-Structured Data Model Object Exchange Model (OEM) &r1 &c1&c2 &s2&s3&s6&s7 &s10 company name address name url address “Widget” “Trenton” “Gadget” “www.gp.fr” “Paris” &p2&p1&p3 &s0&s1&s4&s5&s8&s9 person “Smith” name position name phone name position “Manager” “Jones” “5552121” “Dupont” “Sales” &a1 &a2 &a3 &a4 &a5 &a6 &a7 description procurement salesrep contact task eval 1997 1998 “on target” “below target”

6 Master Informatique 6 Qiuyue WangXML Data Management Structure Indexes for XML SS vs. XML Data Models Semi-Structured Data – Edge-labeled graph XML data – Node-labeled ordered tree

7 Master Informatique 7 Qiuyue WangXML Data Management Structure Indexes for XML DataGuides Given a semistructured/XML database instance DB, a DataGuide for DB is a graph G such that: – Every label path in DB also occurs in G Complete coverage – Every label path in G also occurs in DB Accurate coverage (no bogus path) – Every label path in G (starting from a particular object) is unique (i.e., G is a DFA) Efficient search: to process a label path of length n, just examine n nodes in G

8 Master Informatique 8 Qiuyue WangXML Data Management Structure Indexes for XML DataGuide Example 12 1314 1516171819 Name Entree Phone Owner Manager RestaurantBar 13 ={2,3} 18=19={8} 14={4} 17={7} 15={5,9} 16={6,10,11} 12={1} Each node in the DataGuide can point to a set of database nodes

9 Master Informatique 9 Qiuyue WangXML Data Management Structure Indexes for XML Multiple DataGuides for Same Data

10 Master Informatique 10 Qiuyue WangXML Data Management Structure Indexes for XML Strong DataGuides Let p, p’ be two label path expressions and G a graph; define p ≡ G p’. if p(G) = p’(G) – That is, p and p’ are indistinguishable on G DG is a strong DataGuide for a database DB if the equivalence relations ≡ DG and ≡ DB are the same Example: G1 is strong; G2 is not A.C(DB) = { 5 }, B.C(DB) = { 6, 7 } A.C(G2) = { 20 }, B.C(G2) = { 20 }

11 Master Informatique 11 Qiuyue WangXML Data Management Structure Indexes for XML Size of DataGuides If DB is a tree, then | G | ≤ | DB | – Linear construction time In the worst case, however, the size of a strong DataGuide may be exponential in | DB |

12 Master Informatique 12 Qiuyue WangXML Data Management Structure Indexes for XML A First Attempt at 1-Index Equivalence relation ≡ on the nodes in DB: – u≡v if u and v are reachable by the exactly same set of paths starting from the root. Index is also a graph (no bigger than DB) – Each index node corresponds to an equivalent class; it points to the set of DB nodes in that equivalent class. – There is an index edge labeled e from s to s’. if there is a DB edge labeled e from a node in s to a node in s’. Any accurate index should have at least this many nodes Expensive to construct (PSPACE-complete)

13 Master Informatique 13 Qiuyue WangXML Data Management Structure Indexes for XML 1-Index Idea: use simulation/bi-simulation instead of ≡ Stronger conditions  finer equivalence classes  more index nodes Simulation and bi-simulation are much easier to compute (PTIME) – To be practical, still need External-memory construction algorithm Incremental index update algorithm

14 Master Informatique 14 Qiuyue WangXML Data Management Structure Indexes for XML Simulation Given two edge-labeled graphs G1, G2, a simulation is a binary relation on their nodes, denoted as ≤, s.t., – if x1 ≤ x2 and (x1, a, y1) is an edge in G1, then there exists an edge (x2, a, y2) in G2 (same label) such that y1 ≤ y2. x1x2 a ≤ G1G2 y1 a ≤ y2

15 Master Informatique 15 Qiuyue WangXML Data Management Structure Indexes for XML Bisimulation Given two edge-labeled graphs G1, G2, a bisimulation is a relation between their nodes, denoted as , s.t. – if x1  x2 and (x1, a, y1) is an edge in G1, then there exists an edge (x2, a, y2) in G2 (same label) such that y1  y2; and vice versa equivalence relation

16 Master Informatique 16 Qiuyue WangXML Data Management Structure Indexes for XML Simulation/Bisimulation Two nodes u and v are bisimilar (u ≈ b v) if they are related in some bisimulation Two nodes u and v are similar (u ≈ s v) if there are two simulations ~ and ~’ s.t. u ~ v and v ~’ u Fact: u ≈ b v ⇒ u ≈ s v ⇒ u ≡ v – Why?

17 Master Informatique 17 Qiuyue WangXML Data Management Structure Indexes for XML Computing a (Bi)Simulation The empty set is always a (bi)simulation If R, R’ are (bi)simulations, so is R U R’. Hence, there always exists a maximal (bi)simulation. Computing the maximal (bi)simulation: – start with R = nodes(G1) x nodes(G2) – while there exists (x1, x2) ∈ R that violates the definition, remove (x1, x2) from R This runs in polynomial time O(mn)! – Better: O((m+n)log(m+n)) for bisimulation [Paige and Tarjan 87] O(mn) for simulation [Henzinger, et a. 1995]

18 Master Informatique 18 Qiuyue WangXML Data Management Structure Indexes for XML 1-Index Example (a) A data graph, (b) its 1-index, (c) its strong DataGuide

19 Master Informatique 19 Qiuyue WangXML Data Management Structure Indexes for XML Analyzing 1-Index For a tree-structured DB, 1-indexes using ≈ b, ≈ s, ≡ are all identical to DataGuide Always: size(1-index) ≤ size(DB) – Unlike DataGuide – But we are back to NFA; is lookup time bounded? Always: can construct index in O(|DB| log|DB|) Still need: external-memory construction algorithm and incremental update algorithm Designed to answer arbitrarily complex path expressions, but such expressions may not show up often in queries

20 Master Informatique 20 Qiuyue WangXML Data Management Structure Indexes for XML More on Graph Indexing Graph indexing: 1.Partition nodes into equivalence classes 2.Store the extent of each equivalence class, use it as "pre-cooked" answer to some queries Equivalence notions: 1.Reachable by some common paths: DataGuide [MW97] 2.Reachable by exactly the same paths, or equivalently, indistinguishable by any forward path expression: 1-index [MS99] 3.Indistinguishable by any (forward and backward) path expression: F&B Index [ABS99,KBN+02] 4.Indistinguishable by the (forward and backward) path expressions in the set Q: covering index [KBN+02] 5.Indistinguishable by any path expression of length < k: A(k) index [KSB+02]

21 Master Informatique 21 Qiuyue WangXML Data Management Structure Indexes for XML Adaptive Path Indexing: APEX Most indexing work indexes all possible paths in the data, but few paths actually come up in queries. Index only the frequently used paths (mined from a query workload). Chung et al., “APEX: An Adaptive Path Index for XML Data”, SIGMOD 2002 – Efficient processing of partial matching queries starting with self-or-descendant axis. – Workload-aware path indexes. – Incremental update

22 Master Informatique 22 Qiuyue WangXML Data Management Structure Indexes for XML Components of APEX Graph structure: G APEX – Represents the structural summary of XML data with extents (a set of edges whose ending nodes are the result of a label path expression). Hash tree: H APEX – Keeps the information for frequently used paths and their corresponding nodes in G APEX.

23 Master Informatique 23 Qiuyue WangXML Data Management Structure Indexes for XML Example XML Data

24 Master Informatique 24 Qiuyue WangXML Data Management Structure Indexes for XML Example APEX H APEX G APEX Query = //actor/name search the query in hash tree must in reverse order extent: &1 = { } &2 = {,, } &3 = {,,, } …

25 Master Informatique 25 Qiuyue WangXML Data Management Structure Indexes for XML Construction & Management of APEX

26 Master Informatique 26 Qiuyue WangXML Data Management Structure Indexes for XML APEX0: Initial Index Structure H APEX G APEX &0 &1&1 &2&2 movieDB actor &3&3 name &4&4 @movie &5&5 movie actor &9&9 @actor &8&8 title &6&6 @director name &7&7 director movie director labelxnode xroot&0 movieDB&1 actor&2 name&3 @movie&4 movie&5 @director&6 dicrector&7 title&8 &actor&9

27 Master Informatique 27 Qiuyue WangXML Data Management Structure Indexes for XML Frequently Used Path Extraction Labelcountxnodenext xroot&0 A&1 B&2 C&3 D Labelcountxnodenext xroot0&0 A2&1 B0&2 C1&3 D2 Labelcountxnodenext B&4 Remainder&5 Labelcountxnodenext A2NULL B0&4 Remainder&5 (a) Current state (b) After frequency count Q workload ={ A.D, C, A.D } Extend H APEX with a count field. Simply counts all subsequences that appear in query workload.

28 Master Informatique 28 Qiuyue WangXML Data Management Structure Indexes for XML Frequently Used Path Extraction Labelcountxnodenext xroot0&0 A2&1 B0&2 C1&3 D2 Labelcountxnodenext A2NULL B0&4 RemainderNULL minsup = 2 (c) After pruning if count < minsup then delete. but if node is head-node then can't delete. if has nodes deleted or created then “Remainder” = NULL

29 Master Informatique 29 Qiuyue WangXML Data Management Structure Indexes for XML Update APEX Basic idea – traverse the nodes in G APEX. – update not only the structure of G APEX with frequently used paths but also the xnode field of entries in H APEX. Recursively calling updateAPEX(xnode, ΔEset, path); – xnode: a node in G APEX – ΔEset: set of new edges added to the extent of the xnode – path: the incoming label path from the root to the xnode

30 Master Informatique 30 Qiuyue WangXML Data Management Structure Indexes for XML Before Update &0 &1 &2 &4 &3 &5 A BC D D D Labelcountxnodenext xroot&0 A&1 B&2 C&3 D Labelcountxnodenext ANULL RemainderNULL extent &0: { } &1: { } &2: { } &3: { } &4: { } &5: {, } Labelcountxnodenext xroot&0 A&1 B&2 C&3 D After identifying frequently used paths and pruning

31 Master Informatique 31 Qiuyue WangXML Data Management Structure Indexes for XML Update APEX &0 &1 &2 &6 &3 &5 A BC D D D Labelcountxnodenext xroot&0 A&1 B&2 C&3 D Labelcountxnodenext ANULL Remainder&6 extent &0: { } &1: { } &2: { } &3: { } &4: { } &5: {, } &6: { }

32 Master Informatique 32 Qiuyue WangXML Data Management Structure Indexes for XML Update APEX &0 &1 &2 &6 &3 &7 A BC D D D Labelcountxnodenext xroot&0 A&1 B&2 C&3 D Labelcountxnodenext A&7 Remainder&6 extent &0: { } &1: { } &2: { } &3: { } &5: {, } &6: { } &7: { } &5

33 Master Informatique 33 Qiuyue WangXML Data Management Structure Indexes for XML After Update &0 &1 &2 &6 &3 &7 A BC D D D Labelcountxnodenext xroot&0 A&1 B&2 C&3 D Labelcountxnodenext A&7 Remainder&6 extent &0: { } &1: { } &2: { } &3: { } &6: {, } &7: { }

34 Master Informatique 34 Qiuyue WangXML Data Management Structure Indexes for XML Index Fabric B.Cooper, N.Sample, M.Franklin, et al., “A Fast Index for Semistructured Data”, VLDB 2001 Tree Structured Data Conceptual similar to strong DataGuide Layered structure Use Patricia trie to index a large number of search keys – The simple path of an element which has a data value is encoded as a special character sequence – Keeps the key which is the combination of encoded sequence and data value.

35 Master Informatique 35 Qiuyue WangXML Data Management Structure Indexes for XML Encoding Paths w/Designators  ABC Corp.  123 ABC Way  17 Main St.  Goods Inc.  widget  thingy  jobber Invoice as a tree Invoice Buyer Seller Itemlist Name Address Item ABC Corp.123 ABC Way Goods Inc. 17 Main St. widgetthingyjobber Name Address Item           

36 Master Informatique 36 Qiuyue WangXML Data Management Structure Indexes for XML Index Fabric An index structure for long strings. – Provides fast lookups – Handles long strings – Ideal substrate for designated keys Based on Patricia tries – Highly compressed string representation – Cost in index independent of string length – But, need to balance.

37 Master Informatique 37 Qiuyue WangXML Data Management Structure Indexes for XML Patricia Tries Indexes first point of difference between keys greenbeans greentea gc r w 0 22 corn cow a 2 grass 5 e b t greenbeansgreentea D. R. Morrison. “PATRICIA – Practical algorithm to retrieve information coded in alphanumeric.” J. ACM, 15 (1968) pp. 514-534

38 Master Informatique 38 Qiuyue WangXML Data Management Structure Indexes for XML Layered Approach Index Fabric improves Patricia tries and make it balanced and optimized for disk-based access like B- tree Each query accesses the same number of layers The index can have as many layers as necessary, the highest layer always contains one block The keys are stored very compactly, and blocks have a very high out-degree. In practice, three layers is enough to store billions of keys

39 Master Informatique 39 Qiuyue WangXML Data Management Structure Indexes for XML Balancing Patricia tries gc r w 0 22 corn cow a 2 grass 5 e b t greenbeansgreentea

40 Master Informatique 40 Qiuyue WangXML Data Management Structure Indexes for XML Balancing Patricia tries Step 1: divide trie into blocks gc r w 0 22 corn cow a 2 grass 5 e b t greenbeansgreentea

41 Master Informatique 41 Qiuyue WangXML Data Management Structure Indexes for XML Balancing Patricia tries Step 2: build another layer g 0 2 Layer 1 Layer 0 e gc r w 0 22 corn cow a 2 grass 5 e b t greenbeansgreentea “” “gr” “green” “”

42 Master Informatique 42 Qiuyue WangXML Data Management Structure Indexes for XML Two Kinds of Links Labeled far links – Like normal edges in a trie, but connects a node in one layer to a subtrie in the lower layer Unlabeled direct links – Connects a node in one layer to a block with a node representing the same prefix in the lower layer

43 Master Informatique 43 Qiuyue WangXML Data Management Structure Indexes for XML Balancing Patricia tries Layer 0 Data Search Layer 0 Layer 1 Layer 2 Layer 3

44 Master Informatique 44 Qiuyue WangXML Data Management Structure Indexes for XML Searching Search for “greenbeans” g 0 2 Layer 1 Layer 0 e greenbeans gc r w 0 22 corn cow a 2 grass 5 e b t greenbeansgreentea

45 Master Informatique 45 Qiuyue WangXML Data Management Structure Indexes for XML Searching g 0 2 Layer 1 Layer 0 e greenbeans gc r w 0 22 corn cow a 2 grass 5 e b t greenbeansgreentea Search for “greenbeans”

46 Master Informatique 46 Qiuyue WangXML Data Management Structure Indexes for XML Searching g 0 2 Layer 1 Layer 0 e greenbeans gc r w 0 22 corn cow a 2 grass 5 e b t greenbeansgreentea Search for “greenbeans”

47 Master Informatique 47 Qiuyue WangXML Data Management Structure Indexes for XML Designators Designator – A unique special character or characters assign to each tag in XML Designator dictionary – Maintain the mapping between designators and XML tags Insert the designator-encoded XML string into Index Fabric XML Tags in queries are translated into designators, and to form a search key over the Index Fabric

48 Master Informatique 48 Qiuyue WangXML Data Management Structure Indexes for XML Raw Paths Index the hierarchical structure of the XML by encoding a root-to-leaf path as a string Treat attribute like tagged children, but use different designators to distinguish the same name tag and attribute Can use alternate designators to encode the ordering of tags in the XML documents

49 Master Informatique 49 Qiuyue WangXML Data Management Structure Indexes for XML Example XML Data Doc 1: ABC Corp 1 Industrial Way Acme Inc 2 Acme Rd. saw drill Doc 2: Oracle Inc 555-1212 IBM Corp 4 nail

50 Master Informatique 50 Qiuyue WangXML Data Management Structure Indexes for XML Designators and Raw Paths =I =B =N =A =S =T =P =C Count attribute = C´

51 Master Informatique 51 Qiuyue WangXML Data Management Structure Indexes for XML Index Fabric

52 Master Informatique 52 Qiuyue WangXML Data Management Structure Indexes for XML Refined Paths Refined paths are specialized paths through XML that optimize frequent access patterns Support queries that have wildcards, alternates, and different constants DBA decides which refined paths are appropriate Both raw paths and refined paths are stored in the same index and accessed using string lookup

53 Master Informatique 53 Qiuyue WangXML Data Management Structure Indexes for XML Refined Paths Optimize specific access paths “Find invoices where X sold to Y ” “Find invoices where X bought Y and Z” “Find invoices where a buyer bought X, Y and Z ”  X Y  ABC Corp. Goods Inc.  XYZ Corp. Acme Inc.  ABC Corp. jobber widget  XYZ Corp. drill hammer  X Y Z  X Y Z  jobber thingy widget  drill hammer nail

54 Master Informatique 54 Qiuyue WangXML Data Management Structure Indexes for XML Refined Paths Two steps to create a refined path index: – The XML documents are parsed to extract information matching the access pattern of the refined path; – The information is encoded as an Index Fabric key and inserted into the index.

55 Master Informatique 55 Qiuyue WangXML Data Management Structure Indexes for XML Summary of Index Fabric Index Fabric combines the advantages of both Patricia tries and B-trees – Scaling property of Patricia tries – Balanced and optimized for disk-based access like B-trees Index Fabric can be built over relational database – Leverage the maturity of relational database Not need a priori knowledge of the schema of data – Can handle XML data without DTD

56 Master Informatique 56 Qiuyue WangXML Data Management Structure Indexes for XML Lineage of XML Indexing Schemes


Download ppt "Master Informatique 1 Qiuyue WangXML Data Management Structure Indexes for XML."

Similar presentations


Ads by Google