1 SCHEMALESS APPROACH OF MAPPING XML DOCUMENTS INTO RELATIONAL DATABASE Ibrahim Dweib, Ayman Awadi, Seif Elduola Fath Elrhman, Joan Lu CIT 2008 Sydney,

Presentation transcript:

1 SCHEMALESS APPROACH OF MAPPING XML DOCUMENTS INTO RELATIONAL DATABASE Ibrahim Dweib, Ayman Awadi, Seif Elduola Fath Elrhman, Joan Lu CIT 2008 Sydney, Australia 8-11 July 2008

2 Why schema-less
Many applications deal with highly flexible XML documents from different sources, which makes it difficult to define their structure with a fixed schema or a DTD. Schema-less approaches are therefore needed to handle such XML documents.

3 The method aims to overcome the challenges caused by fixed shredding:
- No information is lost while shredding.
- Reconstructing the original XML documents is easier and much faster.
- The XML document structure is maintained.
- The ordering of the XML data is preserved.

4 Theory guidance
The main mathematical concepts used in this method are:
Definition 1: An XML tree is composed of many sub-trees at different levels. It can be defined as follows:
[formula not preserved in the transcript]
where i = 1, 2, ..., n represents the levels of the XML tree, 0 represents the root, and:
- E_i is a finite set of elements in level i.
- A_i is a finite set of attributes in level i.
- X_i is a finite set of texts in level i.
- r_{i-1} is the root of the sub-tree of level i.

5 Theory guidance (Cont'd)
Definition 2: A dynamic fragment (shred) df(i) is defined as the attributes and texts (leaf children) of sub-tree i of the XML tree, plus its root r_{i-1}:
df(i) = (A_i, X_i, r_{i-1})
where:
- A_i is a finite set of attributes in level i.
- X_i is a finite set of texts in level i.
- r_{i-1} is the root of the sub-tree of level i.
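Definition 2 can be illustrated with a minimal Python sketch. It assumes one particular reading of the definition: the fragment of a sub-tree is the attributes of its root, the texts of its leaf children, and the root itself. The function name `dynamic_fragment` and the sample element are hypothetical and do not come from the slides.

```python
import xml.etree.ElementTree as ET


def dynamic_fragment(root):
    """Return (A, X, r) for the sub-tree rooted at `root`.

    A: attributes of the root, X: (tag, text) pairs of its leaf children,
    r: the root tag. This is one possible reading of Definition 2.
    """
    attributes = dict(root.attrib)
    texts = [
        (child.tag, (child.text or "").strip())
        for child in root
        if len(child) == 0  # a leaf child has no element children
    ]
    return attributes, texts, root.tag


# Hypothetical sample sub-tree, modelled on the case study's second book.
book = ET.fromstring(
    '<book id="11211"><author>A. Mark</author><name>Applied Math 101</name></book>'
)
fragment = dynamic_fragment(book)
```

Using a list of pairs for the texts, rather than a dictionary, keeps the document order and tolerates repeated child tags.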

6 Design framework
A master table, called the "documents" table, keeps information about the documents themselves:
documents(doc_id, doc_structure, ...)
Additional fields may be added to hold other information about a document, such as dates, statistics, and types.
- doc_id is a unique id generated for each document.
- doc_structure is a large text field containing a coded string that describes the document's structure. Any change to the document structure must be reflected in this field, such as adding a new tag or property, deleting an existing tag or property, or moving a tag or property to a different location in the same document.

7 Design framework (Cont'd)
A second table stores the actual contents of all documents. Documents are shredded into pieces of data called tokens; each document element, tag, or property is a token. The tokens table has at minimum this structure:
tokens(doc_id, token_id, token_name, token_value)
- token_id is the generated primary key for each token.
- doc_id is the foreign key linking the tokens table to the documents table.
- token_name is the tag name or property name as found in the original XML document.
- token_value is the text value of the XML tag or property.
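The two tables can be sketched as follows. The experiment in the slides targets Microsoft Access; this sketch swaps in SQLite through Python's `sqlite3` module purely for illustration. The column types and the sample rows are assumptions, since the slides give only the column names.

```python
import sqlite3

# In-memory database; column types are assumed, only the names come from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        doc_id        INTEGER PRIMARY KEY,
        doc_structure TEXT NOT NULL
    );
    CREATE TABLE tokens (
        token_id    INTEGER PRIMARY KEY,
        doc_id      INTEGER NOT NULL REFERENCES documents(doc_id),
        token_name  TEXT NOT NULL,
        token_value TEXT
    );
    """
)

# Hypothetical sample document: one element (token 1) with one property (token 2).
conn.execute("INSERT INTO documents (doc_id, doc_structure) VALUES (?, ?)", (1, "T1A2"))
conn.executemany(
    "INSERT INTO tokens (token_id, doc_id, token_name, token_value) VALUES (?, ?, ?, ?)",
    [(1, 1, "book", None), (2, 1, "category", "fiction")],
)

rows = conn.execute(
    "SELECT token_name, token_value FROM tokens WHERE doc_id = ? ORDER BY token_id",
    (1,),
).fetchall()
```

Keeping the structure string in `documents` and the values in `tokens` means a lookup by tag name or value touches only the `tokens` table.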

8 Design framework (Cont'd): "doc_structure" field construction rules
The doc_structure field is where the document structure is maintained. It consists of a long series of related keys.
Each key starts with a letter:
- 'T' for an element (child) and 'A' for an attribute.
- These letters are necessary to delimit the keys in the sequence.
- The letter is followed by a number giving the token_id that the key refers to.
Example: T120 is a key referring to the token in the tokens table whose token_id = 120.

9 Design framework: "doc_structure" field construction rules (Cont'd)
If a token has properties, the key representing it in doc_structure is followed by a set of keys defining those properties.
Example: T120A12A17A2 is a valid key string for token 120, which has three properties defined by tokens 12, 17, and 2. These properties appear in the original document in this order.

10 Design framework: "doc_structure" field construction rules (Cont'd)
If a token has child tags, these children are represented as a key string surrounded by angle brackets.
Example: T120<T12T7<T2T1>T77> is a valid string, read as follows: token 120 has three sub-tags in this order: token 12, then token 7, then token 77; token 7 itself has two sub-tags, 2 and 1, in that order.
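The construction rules amount to a small grammar: a 'T' key, optionally followed by 'A' keys, optionally followed by a bracketed sequence of child keys. A minimal recursive parser for that grammar might look like the sketch below. The function names and the dictionary layout are hypothetical; the slides do not describe how the key string is parsed.

```python
import re

KEY = re.compile(r"([TA])(\d+)")  # one key: a letter followed by a token_id


def parse_structure(structure):
    """Parse a doc_structure string into a list of nested node dictionaries."""
    nodes, pos = _parse_sequence(structure, 0)
    if pos != len(structure):
        raise ValueError(f"unexpected character at position {pos}")
    return nodes


def _parse_sequence(s, pos):
    nodes = []
    while pos < len(s) and s[pos] == "T":
        match = KEY.match(s, pos)
        if match is None:
            raise ValueError(f"key without a number at position {pos}")
        node = {"token": int(match.group(2)), "attributes": [], "children": []}
        pos = match.end()
        # Attribute keys follow their element key directly.
        while pos < len(s) and s[pos] == "A":
            match = KEY.match(s, pos)
            if match is None:
                raise ValueError(f"key without a number at position {pos}")
            node["attributes"].append(int(match.group(2)))
            pos = match.end()
        # Child keys are enclosed in angle brackets.
        if pos < len(s) and s[pos] == "<":
            node["children"], pos = _parse_sequence(s, pos + 1)
            if pos >= len(s) or s[pos] != ">":
                raise ValueError("missing closing '>'")
            pos += 1
        nodes.append(node)
    return nodes, pos


tree = parse_structure("T120<T12T7<T2T1>T77>")
with_attributes = parse_structure("T120A12A17A2")
```

Here `tree[0]` describes token 120 with children 12, 7, and 77, and the node for token 7 carries its own children 2 and 1.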

11 Theory implementation on a simple case study
<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>
Figure 1: XML document

12 Theory implementation on a simple case study
Books
  Book (Id "11210", Category "fiction")
    author (Id "a1", Sex "m"): M. John
    name: CS 101
  Book (Id "11211")
    author: A. Mark
    subject: Math
    name: Applied Math 101
Figure 2: A tree representation of the XML document in Figure 1

13 Theory implementation on a simple case study
doc_id | doc_structure
10 | T99 [...] T107A108 [...] >
(portions of the key string lost in the transcript are marked [...])
Figure 5: Documents table

14 Theory implementation on a simple case study
doc_id | token_id | token_name | token_value
10 | 99 | books | Null
10 | 101 | book id | Null
10 | 102 | category | fiction
10 | 103 | author | M. John
10 | 104 | id | a1
10 | 105 | sex | m
10 | 106 | name | Computer Science 101
10 | 108 | book id | Null
10 | 109 | author | A. Mark
10 | 110 | name | Applied Math 101
10 | 111 | subject | Math
Figure 6: Tokens table
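The mapping step of the case study can be sketched in Python as follows. Several things here are assumptions rather than the authors' implementation: the XML text is a reconstruction based on the tree in Figure 2 and the values in Figure 6, token ids are assigned in document order starting at 99 to line up with Figure 6, and the function name `shred` is hypothetical. The slides report a Visual Basic 6 implementation with DOM parsing; `xml.etree.ElementTree` is used here only for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Reconstructed from Figures 2 and 6; an assumption, not the original file.
XML = """<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>"""


def shred(root, doc_id, first_token_id):
    """Return (tokens, doc_structure) for the tree rooted at `root`."""
    ids = count(first_token_id)
    tokens = []  # rows of (doc_id, token_id, token_name, token_value)

    def visit(element):
        text = (element.text or "").strip() or None
        token_id = next(ids)
        tokens.append((doc_id, token_id, element.tag, text))
        key = f"T{token_id}"
        # Each attribute becomes its own token and an 'A' key after the element key.
        for name, value in element.attrib.items():
            attr_id = next(ids)
            tokens.append((doc_id, attr_id, name, value))
            key += f"A{attr_id}"
        # Child keys, if any, go inside angle brackets.
        children = "".join(visit(child) for child in element)
        return key + (f"<{children}>" if children else "")

    return tokens, visit(root)


tokens, doc_structure = shred(ET.fromstring(XML), doc_id=10, first_token_id=99)
```

Under these assumptions the sketch yields the key string `T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>`, which agrees with the fragments that survive in Figure 5. The rows it produces for the book elements and their id attributes (tokens 100/101 and 107/108) are this sketch's reading and may not match the slide's table cell for cell.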

15 Experimental environment
- An Intel Core 2 Duo computer with a 2 GHz CPU, 1 GB RAM, and 256 MB shared cache.
- OS: Windows Vista Home edition.
- Visual Basic 6 is used as the software development kit, with Microsoft Access 2003 as the target relational database.
- Five XML documents of different sizes are used in the experiment.
- The data is taken from the XML data repository available on the web site of the School of Computer Science and Engineering, University of Washington.
- The performance metrics are the time spent mapping XML documents to the relational database and the time spent reconstructing these documents from the relational database.
- The experiment is repeated five times and the mean of those times is reported, to obtain realistic and accurate results.
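The measurement procedure in the last point (repeat five times, report the mean) can be sketched as a small timing harness. The function name `mean_runtime` and the stand-in workload are hypothetical; the slides do not show how the timings were taken.

```python
import statistics
import time


def mean_runtime(func, repeats=5):
    """Run `func` `repeats` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Stand-in workload; in the experiment this would be the mapping or
# reconstruction step for one document.
calls = []
elapsed = mean_runtime(lambda: calls.append(sum(range(10_000))))
```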

16 Experimental results
Table 1: The time spent mapping XML documents to the RDBMS, and the time spent reconstructing them
Columns: Document size | Mapping time (secs) | Reconstructing time (secs)
Document sizes: 4 KB, 28 KB, 64 KB, 602 KB, 1 MB
[timing values not preserved in the transcript]

17 EXPERIMENTAL RESULTS

18 Conclusion (1)
With this method:
- The document structure is maintained easily and at low cost.
- Rebuilding the original document is straightforward.
- First-level semantic search is achievable, either on a single document or on all documents.
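To illustrate why rebuilding is straightforward, the sketch below walks a key string once and looks each token up by id. It is a minimal illustration under assumptions: the function name `rebuild`, the in-memory token dictionary, and the sample data are hypothetical, and a real implementation would read the tokens from the database.

```python
import re
import xml.etree.ElementTree as ET

KEY = re.compile(r"[TA](\d+)")  # one key: a letter followed by a token_id


def rebuild(doc_structure, tokens):
    """Rebuild an element tree from a key string.

    `tokens` maps token_id -> (token_name, token_value).
    """
    def read_id(pos):
        match = KEY.match(doc_structure, pos)
        if match is None:
            raise ValueError(f"bad key at position {pos}")
        return int(match.group(1)), match.end()

    def read_element(pos):
        token_id, pos = read_id(pos)
        name, value = tokens[token_id]
        element = ET.Element(name)
        element.text = value
        # 'A' keys directly after the element key are its attributes.
        while pos < len(doc_structure) and doc_structure[pos] == "A":
            attr_id, pos = read_id(pos)
            attr_name, attr_value = tokens[attr_id]
            element.set(attr_name, attr_value)
        # A bracketed run of 'T' keys holds the children, in document order.
        if pos < len(doc_structure) and doc_structure[pos] == "<":
            pos += 1
            while pos < len(doc_structure) and doc_structure[pos] == "T":
                child, pos = read_element(pos)
                element.append(child)
            if pos >= len(doc_structure) or doc_structure[pos] != ">":
                raise ValueError("missing closing '>'")
            pos += 1
        return element, pos

    if not doc_structure.startswith("T"):
        raise ValueError("a key string must start with an element key")
    root, end = read_element(0)
    if end != len(doc_structure):
        raise ValueError(f"unexpected character at position {end}")
    return root


# Hypothetical sample data.
sample_tokens = {
    1: ("book", None),
    2: ("id", "11211"),
    3: ("author", "A. Mark"),
    4: ("name", "Applied Math 101"),
}
xml_text = ET.tostring(rebuild("T1A2<T3T4>", sample_tokens), encoding="unicode")
```

Because the key string already records nesting and order, no joins or sorting are needed: each key costs one lookup.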

19 Conclusion (2)
Method limitations:
- Complex semantic search is not easily achievable with this structure.
- Document size is limited by memory size, since DOM-based parsing is used.

20 Future work
- Improving the method to support complex semantic search and to differentiate between XML data types (i.e., strings, dates, integers), so that less-than and greater-than queries can be applied.
- Testing the method intensively and comparing its performance with other methods in the literature.
- Using SAX parsing of XML documents to remove the document size limitation.
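The SAX idea in the last point can be sketched with Python's `xml.sax` module: tokens and key-string pieces are produced as parse events arrive, so the whole document never has to be held as a DOM tree. This is an illustration of the proposed direction, not something the slides implement. The class name `ShredHandler` and the sample input are hypothetical, and text is simply concatenated per element, which is a simplification for mixed content.

```python
import xml.sax


class ShredHandler(xml.sax.ContentHandler):
    """Builds token rows and the doc_structure string from SAX events."""

    def __init__(self, doc_id, first_token_id=1):
        super().__init__()
        self.doc_id = doc_id
        self.next_id = first_token_id
        self.tokens = []     # rows of [doc_id, token_id, token_name, token_value]
        self.structure = []  # pieces of the doc_structure string
        self.stack = []      # open elements: [token_row, has_children, text_parts]

    def _new_token(self, name, value):
        row = [self.doc_id, self.next_id, name, value]
        self.next_id += 1
        self.tokens.append(row)
        return row

    def startElement(self, name, attrs):
        # The first child of an element opens that element's bracket.
        if self.stack and not self.stack[-1][1]:
            self.structure.append("<")
            self.stack[-1][1] = True
        row = self._new_token(name, None)
        self.structure.append(f"T{row[1]}")
        for attr_name in attrs.getNames():
            attr_row = self._new_token(attr_name, attrs.getValue(attr_name))
            self.structure.append(f"A{attr_row[1]}")
        self.stack.append([row, False, []])

    def characters(self, content):
        if self.stack:
            self.stack[-1][2].append(content)

    def endElement(self, name):
        row, has_children, text_parts = self.stack.pop()
        row[3] = "".join(text_parts).strip() or None
        if has_children:
            self.structure.append(">")


handler = ShredHandler(doc_id=10, first_token_id=99)
xml.sax.parseString(
    b'<book id="11211"><author>A. Mark</author><subject>Math</subject></book>',
    handler,
)
doc_structure = "".join(handler.structure)
```

In this sketch the token rows still accumulate in a list; to remove the memory limit in practice, each row would be written to the database as its element closes.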

21 Thank You for Your Time