Getting Data out of XML Documents Bálint Joó School of Physics University of Edinburgh May 02, 2003.

Contents
- In search of a simple API for accessing DOM
- The multiple tag problem
  - What is it?
  - Is it a problem for us?
  - How can we get around it?
- XPath
- What is easy to parse?
- Software: XPathReader package
- Conclusions

Motivation (Starting Points)
- Lack of free data-binding tools for C/C++
- Desire to read ILDG metadata documents and marshal application data
  => we have to write our own tools
- Would like a simple API to get at document data
- Would like the same API to cope with ILDG metadata AND application data
- We got as far as reading into a DOM.

Start With Simple Idea
- Consider a simple API with functions:
    push(tagname) -- select tag with name tagname
    pop() -- move up a level
    getType(tagname, result), where Type = string | float | double | int | bool
- Equivalent API: a directory-like structure with no absolute paths:
    cd(tagname) = push(tagname), cd(..) = pop()
- Simple data: no attributes, no namespaces, no empty elements.

Example
    <foo>
      <bar>String</bar>
      <fred>5.0</fred>
    </foo>

    open("file.xml");
    push("foo");
    string bar;
    getString("bar", bar);
    double fred;
    getDouble("fred", fred);
    pop();
- So far so good: nice and simple
- Current UKQCD Schema has no attributes/namespaces
- Empty tags serve no purpose except as placeholders
- BUT soon we encounter...
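A minimal C++ sketch of how such a push/pop reader might behave. The Element struct, the LocalReader class and makeExampleDocument() are hypothetical names invented for this illustration, and the document is built in memory rather than parsed from file.xml; it is not the author's implementation.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory element: tag name, text content, child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Hypothetical cursor implementing the push/pop/getType idea.
// The document passed in must outlive the reader.
class LocalReader {
public:
    explicit LocalReader(const Element& root) { stack_.push_back(&root); }

    // Descend into the first child called tagname (like "cd tagname").
    void push(const std::string& tagname) { stack_.push_back(&find(tagname)); }

    // Move up one level (like "cd ..").
    void pop() {
        if (stack_.size() <= 1) throw std::runtime_error("already at the root");
        stack_.pop_back();
    }

    // Read the text of the first child called tagname as a string.
    void getString(const std::string& tagname, std::string& result) const {
        result = find(tagname).text;
    }

    // Read the text of the first child called tagname as a double.
    void getDouble(const std::string& tagname, double& result) const {
        result = std::stod(find(tagname).text);
    }

private:
    // Local lookup only: searches the children of the current element.
    const Element& find(const std::string& tagname) const {
        for (const Element& child : stack_.back()->children) {
            if (child.name == tagname) return child;
        }
        throw std::runtime_error("no child element named " + tagname);
    }

    std::vector<const Element*> stack_;
};

// Assumed example document: <foo><bar>String</bar><fred>5.0</fred></foo>
Element makeExampleDocument() {
    Element bar{"bar", "String", {}};
    Element fred{"fred", "5.0", {}};
    Element foo{"foo", "", {bar, fred}};
    return Element{"#document", "", {foo}};
}
```

With this sketch, constructing a LocalReader on makeExampleDocument(), calling push("foo") and then getString("bar", bar) and getDouble("fred", fred) fills bar and fred from the element content. Note that find() always returns the first matching child, which is exactly where repeated tags become a problem.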

The Multiple Tag Problem
- Consider the following snippet:
- Let's try our API:
    push("size");
- But what does
    push("axis");
  do?

Multiple Tag Problem (cont'd)
- push("axis") could select in document order
- We could add an index to push("axis"):
    push("axis", 1)
    push("axis", 2)
- We could add an index attribute to <axis>
  - But then we'd need a mechanism to match the index attribute
- We could change the names of axis:
- We could put the different <axis> elements into different namespaces -- effectively the same as adding an attribute
- We could try and match the tag.

The consequences
- Changing tag names for simplicity of parsing just seems wrong
- Matching the tag is not possible without first selecting an <axis> in our scheme (locality)
- Adding attributes/namespaces complicates the API
  - This use of different namespaces would be philosophically wrong.
- Adding an order-of-occurrence index to the API is cleanest
  - No need to change Schema, instance documents etc.
  - Document ordering removes random access capability
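The order-of-occurrence option can be sketched as a lookup that counts matching children in document order. This is only an illustration: Element, selectChild() and makeSizeExample() are hypothetical names, the index is taken to be 1-based (as in push("axis", 1)), and the axis contents 16 and 32 are made-up values.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory element: tag name, text content, child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Return the occurrence-th child (1-based, document order) called tagname.
const Element& selectChild(const Element& parent,
                           const std::string& tagname,
                           std::size_t occurrence) {
    std::size_t seen = 0;
    for (const Element& child : parent.children) {
        // Count only children with the requested name.
        if (child.name == tagname && ++seen == occurrence) return child;
    }
    throw std::out_of_range("not enough occurrences of " + tagname);
}

// Made-up document with a repeated tag: <size><axis>16</axis><axis>32</axis></size>
Element makeSizeExample() {
    return Element{"size", "", {Element{"axis", "16", {}},
                                Element{"axis", "32", {}}}};
}
```

Here selectChild(size, "axis", 2) picks the second <axis> without any change to the schema, but the caller has to know the document order in advance, which is the loss of random access noted above.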

In General
- For less simple (more general) XML documents, duplicate tags can be distinguished by:
  - Occurrence order
  - Name
  - Attributes
  - Content
  - Namespace
- An ideal, simple API should allow matching on all of these to interrogate any XML document.

What about Locality?
    push(namespace, tagname, attributes, occurrence)
    getType(ns, tagname, attributes, occurrence, result)
- But NO local parser can match on element content:
  - We need to open a tag based on the value of its content
  - BUT we can't get to the content without opening the tag
- Document order may not help here
  - Schema document still satisfied.
- Would like to match on tag
- Need to abandon locality

Lesson
In order to avoid ambiguity we must
- Restrict the form of markup we deal with
  - Force decisions onto our Schema writers
OR complicate our API:
- rely on tag ordering (either implicitly or explicitly)
- introduce attributes (forcing decision on Schema writers)
- give up locality in the API

Global Queries: XPath
- Would like a nice way to encode:
  - tag name
  - attributes
  - order of occurrence
  - attribute/content matching predicates
- Can this be done? YES! Using XPath

XPath Axes
- Node
- Parent axis: ..
- Attribute
- Child axis: ./
- Following sibling axis (no compact selector)
- Preceding sibling axis (no compact selector)
- XPath axes specify coordinates for DOM.
- Some axes can include more than one node:
  - ancestors: parent and all its ancestors

XPath Selectors
- tagname selects all children of current node called tagname
- * selects all children of current node
- @name selects all attribute nodes called name
- @* selects all attribute nodes of current node
- name[i] selects the i-th occurrence of child node called name
- .. selects parent of current node
- //name selects name with any set of ancestors

XPath Examples
- XPath Query: /

XPath Examples
- XPath Query: /size

XPath Examples
- XPath Query: /size/axis OR /size/* OR //axis

XPath Examples
- XPath Query: /size/axis[2] (query on order of occurrence)
- OR /size/axis[dimension="2"] (query on element content)

XPath Examples
- XPath Query: /size/bj:axis
- Supports namespaces

XPath Examples
- XPath Query:
- Attribute matching
- Visit: for more...

XPath Notes
- Can return sets of nodes, not just a unique node
- Has more features:
  - Functions to turn query results into strings, numbers, booleans
- Encodes all features we need
- C/C++ linkable XPath processors exist:
  - Xerces, Xalan, libxml
- Solves all our reader API problems in a nice way.

XPath Based Reader API
- Basic functions:
    open(file/stream);
    getType(xpath_string, result);
    getAttributeType(xpath_string, attributeName, result);
- Semantics: the xpath_string must identify a unique node.
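The unique-node rule can be sketched as a check on the size of the result set before any conversion takes place. MockXPathDocument below is a hypothetical stand-in for a real XPath engine: it simply stores, per query string, the text of every node that query is supposed to match, and the values 16 and 32 are made up.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an XPath engine: each query string maps to the
// text of every node it matches. No real XPath evaluation happens here.
class MockXPathDocument {
public:
    // Register one more node matched by the given query.
    void addMatch(const std::string& xpath, const std::string& text) {
        matches_.emplace(xpath, text);
    }

    // Number of nodes the query selects.
    std::size_t countXPath(const std::string& xpath) const {
        return matches_.count(xpath);
    }

    // Succeeds only if the query identifies exactly one node.
    void getString(const std::string& xpath, std::string& result) const {
        const std::size_t n = countXPath(xpath);
        if (n != 1) {
            throw std::runtime_error("XPath '" + xpath + "' matched " +
                                     std::to_string(n) +
                                     " nodes; exactly one is required");
        }
        result = matches_.find(xpath)->second;
    }

    // Same uniqueness rule, followed by conversion to int.
    void getInt(const std::string& xpath, int& result) const {
        std::string text;
        getString(xpath, text);
        result = std::stoi(text);
    }

private:
    std::multimap<std::string, std::string> matches_;
};
```

In this sketch an ambiguous query such as /size/axis (two matches) is rejected with an exception, while /size/axis[2] (one match) is converted and returned, so the caller resolves the ambiguity in the query string rather than in the API.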

What is Easy to Parse?
- Stylistic discussion on the Metadata mailing list.
- One particular question: "How should we mark up things?"
    4 X 16 Y 16
  - Chris' Way:
  - Tomoteru's Way:
- Known as the "Element vs. Attribute" debate in the XML world.

What is Easy to Parse?
- One statement is that the attribute way is perhaps easier to parse?
- With XPath, both ways are easy to parse. To get the length of the x dimension:
  - Chris' Way:
      number(//size/axis[normalize-space(string(name))="X"]/length)
      getInt("//size/axis[normalize-space(string(name))=\"X\"]/length", intValue);
  - Tomoteru's Way:
      getIntAttribute("//size/x", "value", intValue);
- Chris' Way has a more complex query, but an equally simple API call.

Element vs. Attribute Debate (aside)
- Looked on the Web
- Tomoteru's way is preferred in general by object modellers (e.g. database people)
  - Mark up most "atomic" data as attributes
  - Use tags to indicate "table structure"
- Chris' way is perhaps preferred by archivists or librarians (Go Kim!)
- Decide for yourself, a discussion is available at:
- Found no universally accepted best practice.

Software: XPathReader
- Wrote software to implement the XPath Reader API in C++
- Wraps around the free libxml2 (C) library
- Uses overloading and templating
- Two classes:
  - BasicXPathReader: use XPath to get at basic C++ types (ints, std::strings, etc.)
  - XPathReader: allows reading of complex numbers and arrays.

XPathReader Class Public Members
- open/close functions:
    void open(istream& is);
    void close(void);
- get value of attribute from node identified by XPath:
    template <typename T> void getXPathAttribute(const string& xpath_to_node, const string& attribute_name, T& result);
- get value of node identified by XPath:
    template <typename T> void getXPath(const string& xpath, T& result);
- count results of XPath query:
    int countXPath(const string& xpath_query);

Complex Numbers and Arrays
- XPathReader library provides classes for complex numbers and arrays:
    template <typename T> class TComplex { ... };
    template <typename T> class Array { ... };
- Can have complex numbers of arrays, e.g. for storing real/imaginary parts of arrays:
    TComplex< Array<T> >
- Can also have Complex-es templated on strings: mathematically not sensible...

Complex Number Markup & Marshal
- Invented simple mark up:
    <cmpx>
      <re>real part</re>
      <im>imag part</im>
    </cmpx>
- Can maintain the API through C++ function overloading and recursion:
    template <typename T>
    void getXPath(const string& path, TComplex<T>& result)
    {
      getXPath(path + "/cmpx/re", result.real());
      getXPath(path + "/cmpx/im", result.imag());
    }
- Similar but slightly more involved for Array.
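A self-contained sketch of the overloading-and-recursion idea, assuming a hypothetical MockReader that looks node text up in a map keyed by path instead of evaluating XPath through libxml2. The TComplex class here is a minimal stand-in with real() and imag() accessors, not the library's actual class.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Minimal stand-in for the library's complex type (hypothetical layout).
template <typename T>
class TComplex {
public:
    T& real() { return re_; }
    T& imag() { return im_; }

private:
    T re_{};
    T im_{};
};

// Hypothetical reader: maps an already-resolved path to the node's text.
class MockReader {
public:
    explicit MockReader(std::map<std::string, std::string> nodes)
        : nodes_(std::move(nodes)) {}

    // Basic case: convert the text of the node at path into a T.
    template <typename T>
    void getXPath(const std::string& path, T& result) const {
        auto it = nodes_.find(path);
        if (it == nodes_.end()) throw std::runtime_error("no node at " + path);
        std::istringstream in(it->second);
        if (!(in >> result)) {
            throw std::runtime_error("cannot convert node at " + path);
        }
    }

    // Overload for complex values: recurse into the re and im children.
    // Partial ordering of templates makes this the preferred match
    // whenever the result argument is a TComplex<T>.
    template <typename T>
    void getXPath(const std::string& path, TComplex<T>& result) const {
        getXPath(path + "/cmpx/re", result.real());
        getXPath(path + "/cmpx/im", result.imag());
    }

private:
    std::map<std::string, std::string> nodes_;
};
```

A caller writes reader.getXPath("/z", z) whether z is a double or a TComplex<double>; the compiler selects the overload, so the reading interface stays the same for compound types.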

Array Markup
- Arrays were marked up as follows:
    <array sizeName="size" elemName="el" indexName="idx" indexStart="x">
    N element[0] element[1] ... element[N-1]
- This is a general mark up, suitable for local parsers too

Array Mark-Up Example
    <array sizeName="num_dimensions" elemName="axis" indexName="dimension" indexStart="1">
- Minimally invasive:
  - Insert tags
  - Copy tag to attribute
- Easy to implement with an XSL transformation
- Working group needn't amend the current metadata schema for it.

Conclusions
- Discussed API issues for parsing XML without full "data binding" tools.
- Discussed the repeated tag problem
- Concluded that XPath is a simple and elegant way to solve the problem, and hopefully convinced you too.
- Discussed a C++ implementation of an XPathReader API
- Discussed how to parse compound data types
- Described markup for complex numbers and arrays
- Suggest that the Complex and Array markup be standardised by the Metadata Working Group (but not necessarily that it be used in metadata documents), to assist sharing of data.

References/Links
- XML, DOM, XPath:
- Tutorials (XPath/XSLT):
- libxml2:
- Attribute vs. Entities (and other discussions):
- XPathReader software: send to me:
- SciDAC CVS repository at JLAB (xpath_reader)
- SciDAC: