Manish Bhide, Manoj K Agarwal IBM India Research Lab India {abmanish, Amir Bar-Or, Sriram Padmanabhan IBM Software Group, USA


Manish Bhide, Manoj K Agarwal IBM India Research Lab India {abmanish, Amir Bar-Or, Sriram Padmanabhan IBM Software Group, USA Srinivas K. Mittapalli, Girish Venkatachaliah IBM Software Group India

XPEDIA - Introduction: XPEDIA stands for “XML Processing for Data Integration”. XML documents have become popular, and XPEDIA is designed to improve data integration for them. XPEDIA uses parallelization and an ELT flow.

ETL In Databases: Extract, transform, and load (ETL) means extracting data from outside sources, transforming it to fit operational needs, and loading it into the end target (a database or data warehouse).

Typical ETL Scenario With XML

Zoom-In Flow-1

The Read_XML_Table operator simply reads the XML Documents

XML Hierarchy Tree

The Equi-Hierarchical Join operator: The operator goes over every “Country” sub-tree in the XML document. It finds the set of employees working in each department in that country and creates a new element named “Dept2”, which contains a list of all employees working in that department.

The Aggregation operator: The operator calculates the total salary of all the employees in a department and adds the result to the XML document as “totalSalary”.

The Shredder Operator: The operator writes the totalSalary value from the modified XML document to the relational database.
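
A minimal sketch of what a shredder step could look like, assuming Python's standard `xml.etree.ElementTree` and `sqlite3` modules stand in for the XPEDIA runtime and the target RDBMS. The document shape, the `name` attributes, the `dept_salary` table and the function name `shred_total_salary` are all hypothetical; the slides only name the “Country”, “Dept2” and “totalSalary” elements.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document shape: only the element names Country, Dept2 and
# totalSalary come from the slides; everything else is assumed.
DOC = """
<Company>
  <Country name="India">
    <Dept2 name="Research"><totalSalary>300</totalSalary></Dept2>
    <Dept2 name="Sales"><totalSalary>120</totalSalary></Dept2>
  </Country>
</Company>
"""

def shred_total_salary(xml_text, conn):
    """Write one relational row per department holding its totalSalary."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dept_salary (country TEXT, dept TEXT, total REAL)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (country.get("name"), dept.get("name"), float(dept.findtext("totalSalary")))
        for country in root.iter("Country")
        for dept in country.iter("Dept2")
    ]
    conn.executemany("INSERT INTO dept_salary VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
inserted = shred_total_salary(DOC, conn)
stored = conn.execute(
    "SELECT country, dept, total FROM dept_salary ORDER BY dept"
).fetchall()
```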

Problem: Today, databases support only a limited representation of XML documents. Processing an XML document requires full extraction and parsing of the document. XML documents grow larger with time, and a need for complex transformations has arisen.

Problem – Computational Model: Relational data is represented in the form of rows and columns. In this model, each XML document is represented as a single row and a single column. There is a need for a technique that handles complex data flows while preserving the simple specification.

Problem – Scalability: In relational data, the size of a row/tuple is seldom larger than a few KB. XML documents, which are composed of many small objects, often grow to over 1 GB.

The Solution – XPEDIA: Computational model. ELT support. Scalability through parallelism.

XPEDIA Computational Model

XPEDIA uses a dataflow model consisting of operators and edges. The key difference in the XPEDIA model is that the data flowing between operators is an ordered list of XML documents that comply with a single XML schema.

Example. List:

XPEDIA Computational Model (cont.): Operators can iterate over a sub-vector of a document object. The iterated vector is defined as the “scope” vector of the operator.

XML Operators: The Filter operator filters one of the vectors within a scope. The Project operator iterates over a single vector and generates a new output vector that is based on a set of select expressions.
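
The two operators can be sketched over an in-memory tree, assuming `xml.etree.ElementTree` as a stand-in for XPEDIA's document objects. The function names `filter_op` and `project_op`, their parameters and the sample document are hypothetical, and having Project replace the input vector with the output vector is an assumption made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "Country" plays the scope, "Employee" the vector.
DOC = """
<Company>
  <Country name="India">
    <Employee><Name>A</Name><Salary>100</Salary><Phone>1</Phone></Employee>
    <Employee><Name>B</Name><Salary>40</Salary><Phone>2</Phone></Employee>
  </Country>
  <Country name="USA">
    <Employee><Name>C</Name><Salary>90</Salary><Phone>3</Phone></Employee>
  </Country>
</Company>
"""

def filter_op(root, scope_tag, vector_tag, predicate):
    """Within each scope item, drop the vector items that fail the predicate."""
    for scope in list(root.iter(scope_tag)):
        for item in scope.findall(vector_tag):
            if not predicate(item):
                scope.remove(item)
    return root

def project_op(root, scope_tag, vector_tag, out_tag, select):
    """Within each scope item, build a new vector from the select expressions.

    Assumption: the output vector replaces the input vector.
    """
    for scope in list(root.iter(scope_tag)):
        for item in scope.findall(vector_tag):
            out = ET.SubElement(scope, out_tag)
            for name, expr in select.items():
                ET.SubElement(out, name).text = expr(item)
            scope.remove(item)
    return root

root = ET.fromstring(DOC)
filter_op(root, "Country", "Employee", lambda e: float(e.findtext("Salary")) >= 50)
project_op(root, "Country", "Employee", "Emp", {"Name": lambda e: e.findtext("Name")})
names = [[e.findtext("Name") for e in c.findall("Emp")] for c in root.iter("Country")]
```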

XML Operators – Aggregate Operator: Produces statistics by aggregating one of the vectors. The aggregation restarts for each scope item.
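
A small sketch of the "restart for each scope item" behaviour, again assuming `xml.etree.ElementTree`. The function name `aggregate_op`, its parameters and the sample document are hypothetical, and a sum is used as the only statistic.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "Dept" is the scope, "Employee" the aggregated vector.
DOC = """
<Company>
  <Dept name="Research">
    <Employee><Salary>100</Salary></Employee>
    <Employee><Salary>50</Salary></Employee>
  </Dept>
  <Dept name="Sales">
    <Employee><Salary>70</Salary></Employee>
  </Dept>
</Company>
"""

def aggregate_op(root, scope_tag, vector_tag, value_tag, out_tag):
    """Sum value_tag over the vector, once per scope item."""
    for scope in list(root.iter(scope_tag)):
        total = 0.0  # the accumulator restarts for every scope item
        for item in scope.findall(vector_tag):
            total += float(item.findtext(value_tag))
        ET.SubElement(scope, out_tag).text = str(total)
    return root

root = ET.fromstring(DOC)
aggregate_op(root, "Dept", "Employee", "Salary", "totalSalary")
totals = [d.findtext("totalSalary") for d in root.iter("Dept")]
```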

XML Operators – Equi-Hierarchical Join: Performs an equality-based join between two vectors that are contained within the scope instance.

XML Operators – Read/Write Table: The Read Table operator reads all the rows of a single table and outputs a relational tuple or XML document. The Write Table operator is used for writing relational or XML data into a table.
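
One way to picture the join is a hash join that is rebuilt for every scope instance, so that matches never cross scope boundaries. This is a sketch under assumptions: the slides do not say which join algorithm XPEDIA uses, and the function name `equi_hierarchical_join`, the `key` attribute, the `DeptID` element and the sample document are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample: both countries use DeptID 1 to show the join is scoped.
DOC = """
<Company>
  <Country name="India">
    <Dept><DeptID>1</DeptID></Dept>
    <Employee><Name>A</Name><DeptID>1</DeptID></Employee>
    <Employee><Name>B</Name><DeptID>1</DeptID></Employee>
  </Country>
  <Country name="USA">
    <Dept><DeptID>1</DeptID></Dept>
    <Employee><Name>C</Name><DeptID>1</DeptID></Employee>
  </Country>
</Company>
"""

def equi_hierarchical_join(root, scope_tag, left_tag, right_tag, key_tag, out_tag):
    """Join two vectors on key_tag, separately inside each scope instance."""
    for scope in list(root.iter(scope_tag)):
        # Build the hash table from the right vector of this scope only.
        by_key = {}
        for right in scope.findall(right_tag):
            by_key.setdefault(right.findtext(key_tag), []).append(right)
        # Probe with the left vector and emit one output element per left item.
        for left in scope.findall(left_tag):
            key = left.findtext(key_tag)
            out = ET.SubElement(scope, out_tag)
            out.set("key", key)
            for right in by_key.get(key, []):
                out.append(copy.deepcopy(right))
    return root

root = ET.fromstring(DOC)
equi_hierarchical_join(root, "Country", "Dept", "Employee", "DeptID", "Dept2")
joined = [
    (c.get("name"), d.get("key"), [e.findtext("Name") for e in d.findall("Employee")])
    for c in root.iter("Country")
    for d in c.findall("Dept2")
]
```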

XML Operators – Output Stage Operator: Input: Department: /Company/Country/Dept; Project: /Company/Country/Emplyee/PName; Emp ID: /Company/Country/Emplyee/Einfo/EmpID

ELT: ELT (Extract, Load, Transform) takes parts of the ETL job flow and converts them into SQL/XML queries. ELT is a technique to gain efficiency and performance by shifting a significant share of the processing into the database.

ELT In XPEDIA: Databases such as DB2 9, Oracle 11g and SQL Server 2005 have built-in XQuery and SQL/XML query engines. XPEDIA applies rewriting techniques to transform parts of the ETL job flow into SQL/XML.

How XPEDIA converts ETL to ELT: Four tasks are required for converting ETL to ELT. 1. Rewrite the ETL flow in terms of simpler operators. 2. Convert each operator into a SQL/XML query. 3. Merge the SQL/XML queries of adjacent operators into a single SQL/XML query. 4. Convert the merged SQL/XML queries to an ELT job definition which can be executed on XPEDIA.
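
Tasks 2 and 3 can be illustrated with a toy query builder in which each operator wraps the query of the operator before it, so adjacent operators collapse into one statement. This is only a sketch: SQLite has no SQL/XML support, so plain relational SQL is swapped in for SQL/XML, and the helper names `read_table`, `filter_q` and `aggregate_q`, the `emp` table and the nesting strategy are assumptions, not XPEDIA's actual rewriting rules.

```python
import sqlite3

# Each helper turns one operator into a query over the previous operator's query.
# Plain SQL stands in for SQL/XML here (assumption for the sketch).
def read_table(table):
    return f"SELECT * FROM {table}"

def filter_q(inner, condition):
    return f"SELECT * FROM ({inner}) WHERE {condition}"

def aggregate_q(inner, group_col, value_col, out_col):
    return (
        f"SELECT {group_col}, SUM({value_col}) AS {out_col} "
        f"FROM ({inner}) GROUP BY {group_col}"
    )

# Merging adjacent operators: Read Table -> Filter -> Aggregate become one query.
merged = aggregate_q(
    filter_q(read_table("emp"), "salary >= 50"), "dept", "salary", "totalSalary"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("R", "A", 100), ("R", "B", 40), ("S", "C", 90)],
)
# The database now does the filtering and aggregation in a single pass.
result = conn.execute(merged + " ORDER BY dept").fetchall()
```

Only the aggregated rows leave the database, which is the data-movement saving that ELT aims for.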

Simplify The ETL Flow: Most of the operators in XPEDIA can be directly converted to a SQL/XML query. Complex operators, like the OutputStage, are difficult to translate to SQL/XML queries directly, so they need to be rewritten in terms of simpler operators.

Example: The algorithm to convert the OutputStage operator to a set of simpler operators. Step 1: Apply the XMLize operator on the relational data to obtain a flat XML document.
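
Step 1 can be sketched as follows, assuming that "flat" means one element per row with one child element per column. The function name `xmlize`, the `Rows`/`Row` wrapper tags and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def xmlize(columns, rows, root_tag="Rows", row_tag="Row"):
    """Turn relational rows into a flat XML document: one child per column."""
    root = ET.Element(root_tag)
    for row in rows:
        row_el = ET.SubElement(root, row_tag)
        for col, value in zip(columns, row):
            ET.SubElement(row_el, col).text = str(value)
    return root

# Hypothetical input rows; column names must be valid XML element names.
flat = xmlize(
    ["Department", "Project", "EmpID"],
    [("Research", "XPEDIA", 7), ("Sales", "CRM", 9)],
)
flat_text = ET.tostring(flat, encoding="unicode")
```

The later steps would then reshape this flat document into the required hierarchy.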

Example (cont.) Step 2:

Example (cont.) Step 3: Use the Project operator to add and drop nodes, so that every output node sits at its correct height. Step 4: Use the Project operator to change the names of nodes.

Query Generation and Merging: The XPEDIA ELT optimizer has a set of algorithms for converting operators to SQL/XML queries, and a set of rules for merging these SQL/XML queries.

Generating The ELT Job Definition: The generated SQL/XML queries are mapped to the XPEDIA job definition. XPEDIA translates the job definition to a Read Table operator, and the rest of the ETL flow remains the same.

The Result: A single SQL/XML query can now replace the operators between the XML data source and the RDBMS. ELT allows us to use only Read/Write Table operators. Benefit: a reduction in the size of the data that needs to be moved.

XPEDIA ELT Conclusion: XPEDIA is able to use the native XML processing capabilities of the database engine to greatly improve performance. If the database does not have native XML support, or the data is present in a flat file, XPEDIA cannot use the ELT optimizer.

Parallel Processing of XML Data: XPEDIA supports two types of job parallelism. Pipeline: each operator is handled by a different resource. Partitioning: the XML document is divided into several partitions, each processed separately.

Pipelining Limitations: Pipelining limits scalability, since it can only use as many resources as there are operators. In pipelining, each resource needs to work on the entire data. Partitioning allows better usage of available resources on large documents.

Partitions Generation: XPEDIA identifies which nodes are optimal for partitioning. The chosen partition is then divided between resources using one of the following methods: Round Robin or Chunking Scheme.
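
The two distribution methods can be contrasted with a short sketch. The slides do not define the chunking scheme, so contiguous, near-equal chunks are assumed here; the function names `round_robin` and `chunking` are hypothetical, and the integers stand in for partition nodes.

```python
def round_robin(items, n):
    """Deal items to n resources in turn: item i goes to resource i % n."""
    parts = [[] for _ in range(n)]
    for i, item in enumerate(items):
        parts[i % n].append(item)
    return parts

def chunking(items, n):
    """Give each of n resources one contiguous run of items (assumed scheme)."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        parts.append(items[start:end])
        start = end
    return parts

nodes = list(range(7))  # stand-ins for seven partition nodes
rr = round_robin(nodes, 3)
ch = chunking(nodes, 3)
```

Round robin interleaves neighbouring nodes across resources, while chunking keeps document order within each resource.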

Shallow Parsing: Dividing the work requires some parsing. The parsing that is done is only partial, from the root node to the partition node. Since the shallow parsing overhead is different for every partition, load balancing is sometimes done when choosing chunk sizes.
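
The idea of locating partition boundaries without building the document tree can be sketched with Python's `xml.parsers.expat`, which reports the byte offset of each start tag through `CurrentByteIndex`. This is a stand-in, not XPEDIA's parser: expat still tokenizes the whole input, and the function name `partition_offsets` and the sample document are hypothetical.

```python
import xml.parsers.expat

def partition_offsets(xml_bytes, partition_tag):
    """Return the byte offset at which each partition_tag element starts."""
    starts = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # No tree is built; only the position of each partition node is kept.
        if name == partition_tag:
            starts.append(parser.CurrentByteIndex)

    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)
    return starts

# Hypothetical sample with three partition nodes of different sizes.
DOC = (
    b"<Company><Country>"
    b"<Employee>A</Employee><Employee>BB</Employee><Employee>C</Employee>"
    b"</Country></Company>"
)
starts = partition_offsets(DOC, "Employee")
```

The gaps between consecutive offsets give the byte size of each partition, which is the kind of information a load balancer could use when choosing chunk sizes.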

What Have We Gained With XPEDIA: A performance gain of up to 70% by using XPEDIA ETL tools so that more processing is done inside the database engine.

Using XPEDIA to partition the ETL job across multiple nodes is scalable and can improve the processing speed of the ETL job by up to 2.9 times for a 4-processor configuration.
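
To put the 2.9 times figure in context, the arithmetic below computes the parallel efficiency it implies on 4 processors. The second function applies Amdahl's law to estimate a serial fraction; that model is an assumption added here for illustration and is not something the slides claim.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / processors

def amdahl_serial_fraction(speedup, processors):
    """Solve speedup = 1 / (s + (1 - s) / p) for the serial fraction s.

    Assumption: the job follows Amdahl's law.
    """
    return (processors / speedup - 1) / (processors - 1)

efficiency = parallel_efficiency(2.9, 4)          # 2.9 / 4
serial_fraction = amdahl_serial_fraction(2.9, 4)  # about 0.126 under Amdahl's law
```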

Summary: We saw how XPEDIA deals with the new problems that have arisen. Parallel processing techniques are used for handling large XML documents. The XPEDIA ELT system is able to take advantage of the native XML processing capabilities of the database engine and greatly improve performance.

Questions?