Presentation for Cmpe-521 VIST – Virtual Suffix Tree Prepared by: Evren CEYLAN – 2003700163 Aslı UYAR - 2003701321.



VIST: A Dynamic Index Method for Querying XML Data by Tree Structures Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003

What is XML? XML: Extensible Markup Language. It is of great importance in data exchange, so a lot of research has been done on flexible query mechanisms for extracting data from XML documents.

VIST: Virtual Suffix Tree. In this paper, VIST is proposed for searching XML documents. XML documents and XML queries are represented as structure-encoded sequences (explained in the following slides). With this kind of sequence, querying XML data is shown to be equivalent to finding subsequence matches.

Index Methods in XML. Previous index methods disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers.

What does VIST do? Converts both XML data and XML queries to structure-encoded sequences. Uses tree structures as the basic unit of query in order to avoid highly expensive join operations. In other words, it uses structure-encoded sequences instead of nodes or paths.

What does VIST do? Matches structured queries against structured data as a whole, without breaking down the queries into sub-queries of paths or nodes and relying on join operations. Supports dynamic index update.

What does VIST do? In this paper, it is shown that VIST is effective and efficient in supporting structural queries.

Introduction. XML has a growing importance in data exchange (extracting data from XML documents). XML provides a flexible way to define semi-structured data. In this paper a novel index structure called VIST (Virtual Suffix Tree) is introduced. VIST offers better performance and usability than previous approaches to XML indexing.

In XML query language design, expressing complex structural or graphical queries is one of the major concerns. (In Figure 2, four sample queries are displayed in graph form.)

In previous approaches: i. Indexes are created on paths (e.g. "/P/S/I/M" in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches). ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results. iii. So, these methods are inefficient in handling queries with branching structures.

In the VIST approach: Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries. Result: no need to perform expensive join operations.

Method: XML data and XML queries are transformed into "structure-encoded sequences". A virtual suffix tree is used to organize the structure-encoded sequences. VIST also speeds up the matching process.

Structure: VIST's index structure includes two parts: the D-Ancestor index and the S-Ancestor index (explained in the following slides). VIST unifies structural indexes and value indexes into a single index. To achieve this, a method called "dynamic virtual suffix tree labeling" is proposed, so that index updates can be performed directly on B+Trees.

Structure-Encoded Sequences Sequential representation of both XML Data and XML Queries.

Objective: Modeling XML queries as sequence matching lets us avoid unnecessary join operations in query processing. Result: Structure-encoded sequences are used instead of paths or nodes.

Mapping Data and Queries to Structure-Encoded Sequences: Stage 1: Let's consider the purchase record example in Figure 3. Notation: Capital letters represent names of attributes. Lowercase letters represent attribute values. To encode attribute values as integers we use a hash function h( ), e.g. v1 = h("dell") and v2 = h("ibm"), so v1 and v2 represent dell and ibm respectively.

Stage 2: Represent an XML document by the preorder sequence of its tree structure. E.g. the preorder sequence of the tree in Figure 3 is: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8

Stage 3: Definition: A structure-encoded sequence is a sequence of (symbol,prefix) pairs: D = (a1,p1), (a2,p2),..., (an,pn) ai: node in the XML doc tree. pi: path from the root node to node ai.

The tree in Figure 3 can be converted into the structure-encoded sequence D shown in Figure 4.
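The conversion described in Stages 2 and 3 can be sketched as a preorder walk that emits one (symbol, prefix) pair per node. This is a minimal illustration, not the authors' code: the `(symbol, children)` tuple layout, the function name `encode`, and the small example tree are assumptions made here, and the empty prefix that the slides write as "e" is represented by an empty tuple.

```python
def encode(tree, prefix=()):
    """Yield (symbol, prefix) pairs in preorder.

    `tree` is a hypothetical (symbol, children) tuple; `prefix` is the
    path of symbols from the root down to, but excluding, this node.
    """
    symbol, children = tree
    yield (symbol, prefix)
    for child in children:
        yield from encode(child, prefix + (symbol,))


# Small illustrative tree: P has child S; S has children N (value v1) and L (value v2).
tree = ("P", [("S", [("N", [("v1", [])]),
                     ("L", [("v2", [])])])])

sequence = list(encode(tree))
for symbol, prefix in sequence:
    print(symbol, "".join(prefix) or "e")
```

Each printed line is one element of the sequence, with the prefix written as a concatenated path in the same style as the slides.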

Benefits: By modeling XML queries as sequence matching, structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree). This matters because combining the results of sub-queries by join operations is expensive.

The VIST Approach: Presented in 3 stages: 1. A naïve algorithm based on suffix trees. 2. RIST: improves the naïve algorithm by using B+Trees to index suffix tree nodes. 3. VIST: an index structure that relies only on B+Trees.

Requirements: An XML indexing method should: Support structural queries directly. This is done by "structure-encoded sequences". Use well-supported indexing techniques such as B+Trees instead of relying on suffix trees. Allow dynamic data insertion and deletion, etc.

A Naïve Algorithm Based on Suffix Trees: The most widely used index structure for subsequence matching is the suffix tree.

Example: Two XML documents, Doc1 and Doc2, and two XML queries, Q1 and Q2, as structure-encoded sequences:
Doc1: (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)
Doc2: (P,e) (B,P) (L,PB) (v2,PBL)
Q1: (P,e) (B,P) (L,PB) (v2,PBL)
Q2: (P,e) (L,P*) (v2,P*L)

Example (Cont'd): A tree structure for Doc1 and Doc2 is shown in Figure 5.

Example (Cont'd): As shown above, elements in the sequences represent nodes in the suffix tree. Since the nodes are involved in two different trees (the XML document tree and the suffix tree), there are two kinds of ancestor-descendant relationships among the nodes: i) D-Ancestorship, e.g. (S,P) is a D-ancestor of (L,PS). ii) S-Ancestorship, e.g. (v1,PSN) is an S-ancestor of (L,PS).

Naïve Algorithm based on suffix trees: The NaiveSearch algorithm is a naïve method for non-contiguous subsequence matching on the suffix tree.

For example, to match Q2: Start with the root node, which matches the 1st element of Q2, that is (P,e). Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB). Finally, search for (v2,PSL) under the node labeled (L,PS) and for (v2,PBL) under the node labeled (L,PB). Algorithm 1 searches nodes first by S-Ancestorship, and then by D-Ancestorship.
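The walk-through above can be sketched in code. This is an illustrative reconstruction rather than Algorithm 1 from the paper: the trie layout, the helper names (`Node`, `insert`, `matches`, `naive_search`), the tuple encoding of prefixes, and the treatment of `*` as matching exactly one symbol are all assumptions made for the sketch. For each query element it scans every descendant of the current trie node (S-Ancestorship) and keeps those whose prefix is consistent with the earlier matches (D-Ancestorship).

```python
class Node:
    def __init__(self):
        self.children = {}    # (symbol, prefix) -> Node
        self.doc_ids = set()  # documents whose sequence ends at this node


def insert(root, sequence, doc_id):
    node = root
    for element in sequence:
        node = node.children.setdefault(element, Node())
    node.doc_ids.add(doc_id)


def descendants(node):
    for element, child in node.children.items():
        yield element, child
        yield from descendants(child)


def matches(pattern, prefix, bound):
    """Check a query prefix pattern against a concrete prefix.

    `bound` maps query paths that were already matched to the concrete
    paths they matched, so a wildcard keeps the value it was given earlier.
    """
    key = max((k for k in bound if pattern[:len(k)] == k), key=len)
    concrete = bound[key]
    if prefix[:len(concrete)] != concrete:
        return False
    rest_p, rest_c = pattern[len(key):], prefix[len(concrete):]
    return len(rest_p) == len(rest_c) and all(
        p in ("*", c) for p, c in zip(rest_p, rest_c))


def naive_search(node, query, i=0, bound=None):
    bound = {(): ()} if bound is None else bound
    if i == len(query):
        ids = set(node.doc_ids)
        for _, child in descendants(node):
            ids |= child.doc_ids
        return ids
    symbol, pattern = query[i]
    found = set()
    for (n_symbol, n_prefix), child in descendants(node):
        if n_symbol == symbol and matches(pattern, n_prefix, bound):
            new_bound = dict(bound)
            new_bound[pattern + (symbol,)] = n_prefix + (n_symbol,)
            found |= naive_search(child, query, i + 1, new_bound)
    return found


doc1 = [("P", ()), ("S", ("P",)), ("N", ("P", "S")), ("v1", ("P", "S", "N")),
        ("L", ("P", "S")), ("v2", ("P", "S", "L"))]
doc2 = [("P", ()), ("B", ("P",)), ("L", ("P", "B")), ("v2", ("P", "B", "L"))]
q1 = [("P", ()), ("B", ("P",)), ("L", ("P", "B")), ("v2", ("P", "B", "L"))]
q2 = [("P", ()), ("L", ("P", "*")), ("v2", ("P", "*", "L"))]

root = Node()
insert(root, doc1, 1)
insert(root, doc2, 2)
print(naive_search(root, q1), naive_search(root, q2))
```

The full scan in `descendants` is what makes the naïve method costly: every match step may revisit a large part of the subtree.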

Difficulties of the Naïve Algorithm: There are difficulties in using a suffix tree to index structure-encoded sequences. The major difficulty is that searching for nodes satisfying both S-Ancestorship and D-Ancestorship is extremely costly, because we need to traverse a large portion of the subtree for each match.

RIST: Indexing by Ancestor-Descendant Relationships. Improves the naïve algorithm by eliminating the expensive traversal operations in the suffix tree. When we reach a node X after matching, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor. So we no longer need to search among the descendants of X to find the Ys one by one.

RIST Algorithm: 1. Index the nodes in the suffix tree by their (Symbol,Prefix) pairs. This index is a B+Tree. i. It enables us to search nodes by their (Symbol,Prefix) pairs, that is, by D-Ancestorship. ii. This B+Tree is called the D-Ancestorship B+Tree.

RIST Algorithm: 2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones satisfying S-Ancestorship as well. i. Labels are created for suffix tree nodes in order to tell the relationship between two nodes. ii. We use B+Trees to index nodes by their labels. iii. Such a B+Tree is called an S-Ancestorship B+Tree.

Labeling Notation: n_x: the preorder traversal position of x in the suffix tree. size_x: the total number of descendants of x in the suffix tree. This kind of labeling is shown in Figure 5.

Labeling Notation: Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily. If x and y are labeled <n_x, size_x> and <n_y, size_y>, node x is an S-Ancestor of y if n_y ∈ (n_x, n_x + size_x].
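The labeling rule can be checked with a short sketch. The tree below is a made-up example, and the names `label` and `is_s_ancestor` are hypothetical; the only things taken from the slides are the two label components (preorder position and number of descendants) and the interval test.

```python
def label(tree):
    """Return {name: (n, size)} for a (name, children) tree.

    n is the preorder position (starting at 0) and size is the number of
    descendants, so the descendants of x occupy positions n+1 .. n+size.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        name, children = node
        n = counter
        counter += 1
        for child in children:
            visit(child)
        labels[name] = (n, counter - n - 1)

    visit(tree)
    return labels


def is_s_ancestor(x, y):
    """x and y are (n, size) labels; test n_y in (n_x, n_x + size_x]."""
    (n_x, size_x), (n_y, _) = x, y
    return n_x < n_y <= n_x + size_x


# Hypothetical tree: root -> a -> (b, c -> d), root -> e
tree = ("root", [("a", [("b", []), ("c", [("d", [])])]), ("e", [])])
labels = label(tree)
print(labels)
print(is_s_ancestor(labels["a"], labels["d"]))  # a is above d
print(is_s_ancestor(labels["a"], labels["e"]))  # e is a sibling of a
```

Because the test is a plain range check on n, it maps directly onto a range query over a sorted index, which is what the S-Ancestorship B+Tree provides.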

Constructing the B+Trees: Insert all suffix tree nodes into the D-Ancestorship B+Tree using their (Symbol,Prefix) pairs as keys. All nodes x inserted with the same (Symbol,Prefix) are indexed by an S-Ancestorship B+Tree, using the n_x values of their labels as keys. Shown in Figure 6.

Building the DocID B+Tree: The DocID B+Tree stores, for each node x (using n_x as key), the document IDs of those XML sequences that end at node x when they are inserted into the suffix tree. Shown in DocID B+Tree

In summary: Unlike the naïve algorithm, RIST does not use suffix trees for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Trees). From any node, instead of searching the entire subtree under the node, we can jump to the sub-nodes that match the next element in the query. So RIST supports non-contiguous subsequence matching efficiently.
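The two-level lookup summarized above can be imitated with ordinary Python containers. In this sketch a dictionary stands in for the D-Ancestorship B+Tree and sorted lists searched with `bisect` stand in for the S-Ancestorship and DocID B+Trees; these substitutions, the function names, and the hand-computed preorder labels for Doc1 and Doc2 are assumptions for illustration. Wildcards are left out to keep the example short.

```python
from bisect import bisect_right
from collections import defaultdict

# (symbol, prefix, n, size, ids of documents ending at the node).
# Labels are hand-computed preorder positions in the trie of Doc1 and Doc2,
# with a virtual root at n = 0.
NODES = [
    ("P", "", 1, 8, []),
    ("S", "P", 2, 4, []),
    ("N", "PS", 3, 3, []),
    ("v1", "PSN", 4, 2, []),
    ("L", "PS", 5, 1, []),
    ("v2", "PSL", 6, 0, [1]),
    ("B", "P", 7, 2, []),
    ("L", "PB", 8, 1, []),
    ("v2", "PBL", 9, 0, [2]),
]

d_index = defaultdict(list)  # (symbol, prefix) -> sorted list of (n, size)
doc_index = []               # sorted list of (n, doc_id)
for symbol, prefix, n, size, docs in NODES:
    d_index[(symbol, prefix)].append((n, size))
    doc_index.extend((n, doc) for doc in docs)
for entries in d_index.values():
    entries.sort()
doc_index.sort()

INF = float("inf")


def in_range(sorted_pairs, lo, hi):
    """Return the pairs whose first component lies in (lo, hi]."""
    start = bisect_right(sorted_pairs, (lo, INF))
    end = bisect_right(sorted_pairs, (hi, INF))
    return sorted_pairs[start:end]


def rist_search(query, i=0, lo=0, hi=INF):
    """Match query[i:] among nodes whose n lies in (lo, hi]."""
    if i == len(query):
        # lo is the n of the last matched node; include that node itself.
        return {doc for _, doc in in_range(doc_index, lo - 1, hi)}
    found = set()
    for n, size in in_range(d_index.get(query[i], []), lo, hi):
        found |= rist_search(query, i + 1, n, n + size)
    return found


print(rist_search([("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]))
print(rist_search([("P", ""), ("L", "PS"), ("v2", "PSL")]))
```

Each step is one key lookup followed by one range scan, so no subtree is traversed node by node.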

VIST: The Virtual Suffix Tree. RIST uses a static scheme to label suffix tree nodes, and that prevents it from supporting dynamic insertions. For any labeled node x, later insertions can change the number of nodes that appear before x (in preorder) as well as the size of the subtree rooted at x, which means neither n nor size can be fixed.

VIST: The Virtual Suffix Tree. The purpose of the suffix tree is to provide a labeling mechanism to encode S-Ancestorship. Suppose a node x is created for element di during the insertion of the sequence d1, …, di, …, dk.

VIST: The Virtual Suffix Tree. If we can estimate i. how many different elements may follow di in future insertions, and ii. the occurrence probability of each of these elements, then we can label x's child nodes immediately instead of waiting until all sequences are inserted.

VIST: The Virtual Suffix Tree (Cont'd). It also means that the suffix tree itself is no longer needed, because its labeling mechanism is inefficient. The new scheme supports dynamic data insertion and deletion.

Top down scope allocation: A tree structure defines nested scopes: the scope of a child node is a subscope of its parent node, and the root node has the max scope which covers the scope of each node.

Top down scope allocation: In dynamic scope allocation there is a parameter called λ, which is the expected number of child nodes of any node; λ is usually assumed to be 2. Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x's 1st inserted child. Child1 : Child2 :

Dynamic scope of a Suffix Tree Node: The dynamic scope of a node is a triple, where k is the number of subscopes allocated inside the current scope.

Algorithm of VIST: VIST uses the same sequence matching algorithm as RIST. A dynamic method for labeling suffix tree nodes is presented that works without building the suffix tree.

Algorithm of VIST: The method relies on rough estimates of the number of attribute values. For that reason, the labeling mechanism is based on a virtual suffix tree.

Example: Let's look at the index structure before and after an insertion.

Algorithm of VIST: Suppose that, before the insertion, the index structure already contains the following sequence: Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL). The sequence to be inserted is: Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

Assumptions of the Example: There are 2 assumptions for the algorithm: Max = Dynamic scope allocation method uses the parameter λ =2

The insertion process is much like that of inserting a sequence into a suffix tree. We follow the branches, and when there is no branch to follow we create one.
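The insertion step can be sketched as follows, using the Doc1 and Doc2 sequences from the example. The root scope size of 1024, the class name `VNode`, and the per-node child dictionary (which stands in for the B+Tree lookups VIST would perform) are assumptions for illustration only; labels are handed out by giving each new child 1/λ of the parent's remaining scope, rounded down.

```python
LAMBDA = 2        # expected number of children, as assumed in the example
ROOT_SIZE = 1024  # hypothetical maximum scope


class VNode:
    def __init__(self, n, size):
        self.n, self.size, self.k = n, size, 0
        self.free = n + 1   # first unallocated position inside this scope
        self.children = {}  # (symbol, prefix) -> VNode

    def child(self, key):
        """Follow the branch for `key`, creating and labeling it if missing."""
        node = self.children.get(key)
        if node is None:
            remaining = self.n + self.size - self.free + 1
            share = remaining // LAMBDA
            if share < 1:
                raise OverflowError("scope underflow: no room for a new child")
            node = VNode(self.free, share - 1)
            self.free += share
            self.k += 1
            self.children[key] = node
        return node


def insert(root, sequence):
    """Insert a structure-encoded sequence; return the node where it ends."""
    node = root
    for key in sequence:
        node = node.child(key)
    return node


root = VNode(0, ROOT_SIZE)
doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]
end1 = insert(root, doc1)
end2 = insert(root, doc2)
s_node = root.children[("P", "")].children[("S", "P")]
print((end1.n, end1.size), (end2.n, end2.size), s_node.k)
```

Doc2 shares the (P,e) and (S,P) branches with Doc1, then needs a new (L,PS) node directly under (S,P). That node gets its label from the unused part of (S,P)'s scope, so no existing label has to change.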

CONCLUSION: VIST (a dynamic index method) is developed for XML documents. XML data and XML queries are converted into sequences that encode their structural information.

VIST's Pros: Uses tree structures as the basic unit of query to avoid expensive join operations. Supports dynamic data insertion and deletion. Unlike some data structures used in other approaches, the index structure of VIST is based on B+Trees, which are well supported by DBMSs.

End of Presentation. Questions?