Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin

Slides:



Advertisements
Similar presentations
XML-XSL Introduction SHIJU RAJAN SHIJU RAJAN Outline Brief Overview Brief Overview What is XML? What is XML? Well Formed XML Well Formed XML Tag Name.
Advertisements

Advanced XSLT. Branching in XSLT XSLT is functional programming –The program evaluates a function –The function transforms one structure into another.
XML: Extensible Markup Language
1 XSLT – eXtensible Stylesheet Language Transformations Modified Slides from Dr. Sagiv.
XML & Data Structures for the Internet Yingcai Xiao.
© De Montfort University, XML – a meta language Howell Istance and Peter Norris School of Computing De Montfort University.
1 CP3024 Lecture 9 XML revisited, XSL, XSLT, XPath, XSL Formatting Objects.
31 Signs That Technology Has Taken Over Your Life: #6. When you go into a computer store, you eavesdrop on a salesperson talking with customers -- and.
XML Introduction What is XML –XML is the eXtensible Markup Language –Became a W3C Recommendation in 1998 –Tag-based syntax, like HTML –You get to make.
XML Primer. 2 History: SGML vs. HTML vs. XML SGML (1960) XML(1996) HTML(1990) XHTML(2000)
COS 381 Day 16. Agenda Assignment 4 posted Due April 1 There was no resubmits of Assignment Capstone Progress report Due March 24 Today we will discuss.
Introduction to XML This material is based heavily on the tutorial by the same name at
Technical Track Session XML Techie Tools Tim Bornholt.
PHP with XML Dequan Chen and Narith Kun ---Term Project--- for WSU 2010 Summer Course - CS366 s:
CS 174: Web Programming April 16 Class Meeting Department of Computer Science San Jose State University Spring 2015 Instructor: Ron Mak
XML Anisha K J Jerrin Thomas. Outline  Introduction  Structure of an XML Page  Well-formed & Valid XML Documents  DTD – Elements, Attributes, Entities.
XML eXtensible Markup Language by Darrell Payne. Experience Logicon / Sterling Federal C, C++, JavaScript/Jscript, Shell Script, Perl XML Training XML.
XML and its applications: 4. Processing XML using PHP.
CIT 383: Administrative ScriptingSlide #1 CIT 383: Administrative Scripting XML.
XP 1 CREATING AN XML DOCUMENT. XP 2 INTRODUCING XML XML stands for Extensible Markup Language. A markup language specifies the structure and content of.
1 © Netskills Quality Internet Training, University of Newcastle Introducing XML © Netskills, Quality Internet Training University.
SAX Parsing Presented by Clifford Lemoine CSC 436 Compiler Design.
Session IV Chapter 9 – XML Schemas
Processing of structured documents Spring 2002, Part 2 Helena Ahonen-Myka.
XML Parsers Overview  Types of parsers  Using XML parsers  SAX  DOM  DOM versus SAX  Products  Conclusion.
Tutorial 1: XML Creating an XML Document. 2 Introducing XML XML stands for Extensible Markup Language. A markup language specifies the structure and content.
Electronic Commerce COMP3210 Session 4: Designing, Building and Evaluating e-Commerce Initiatives – Part II Dr. Paul Walcott Department of Computer Science,
Intro to XML Originally Presented by Clifford Lemoine Modified by Box.
JSTL, XML and XSLT An introduction to JSP Standard Tag Library and XML/XSLT transformation for Web layout.
XSLT Kanda Runapongsa Dept. of Computer Engineering Khon Kaen University.
WEB BASED DATA TRANSFORMATION USING XML, JAVA Group members: Darius Balarashti & Matt Smith.
McGraw-Hill/Irwin © 2004 by The McGraw-Hill Companies, Inc. All rights reserved. Scripting with the DOM Ellen Pearlman Eileen Mullin Programming the Web.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation An Introduction to XML.
XML Documents Chao-Hsien Chu, Ph.D. School of Information Sciences and Technology The Pennsylvania State University Elements Attributes Comments PI Document.
Introduction to XML This presentation covers introductory features of XML. What XML is and what it is not? What does it do? Put different related technologies.
XP 1 Creating an XML Document Developing an XML Document for the Jazz Warehouse XML Tutorial.
School of Computing and Information Systems CS 371 Web Application Programming XML and JSON Encoding Data.
Lecture 3 : PHP5-DOM API UFCFR Advanced Topics in Web Development II 2014/15 SHAPE Hong Kong.
Web Technologies COMP6115 Session 4: Adding a Database to a Web Site Dr. Paul Walcott Department of Computer Science, Mathematics and Physics University.
What it is and how it works
XML Design Goals 1.XML must be easily usable over the Internet 2.XML must support a wide variety of applications 3.XML must be compatible with SGML 4.It.
1 Tutorial 11 Creating an XML Document Developing a Document for a Cooking Web Site.
XML Refresher Course Bálint Joó School of Physics University of Edinburgh May 02, 2003.
XML eXtensible Markup Language. XML A method of defining a format for exchanging documents and data. –Allows one to define a dialect of XML –A library.
CISC 3140 (CIS 20.2) Design & Implementation of Software Application II Instructor : M. Meyer Address: Course Page:
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
When we create.rtf document apart from saving the actual info the tool saves additional info like start of a paragraph, bold, size of the font.. Etc. This.
What is XML? eXtensible Markup Language eXtensible Markup Language A subset of SGML (Standard Generalized Markup Language) A subset of SGML (Standard Generalized.
XML CSC1310 Fall HTML (TIM BERNERS-LEE) HyperText Markup Language  HTML (HyperText Markup Language): December  Markup  Markup is a symbol.
Jackson, Web Technologies: A Computer Science Perspective, © 2007 Prentice-Hall, Inc. All rights reserved Chapter 7 Representing Web Data:
USING ANDROID WITH THE DOM. Slide 2 Lecture Summary DOM concepts SAX vs DOM parsers Parsing HTTP results The Android DOM implementation.
1 XSL Transformations (XSLT). 2 XSLT XSLT is a language for transforming XML documents into XHTML documents or to other XML documents. XSLT uses XPath.
XML Notes taken from w3schools. What is XML? XML stands for EXtensible Markup Language. XML was designed to store and transport data. XML was designed.
XML & JSON. Background XML and JSON are to standard, textual data formats for representing arbitrary data – XML stands for “eXtensible Markup Language”
XML 1.Introduction to XML 2.Document Type Definition (DTD) 3.XML Parser 4.Example: CGI Gateway to XML Middleware.
Week-9 (Lecture-1) XML DTD (Data Type Document): An XML document with correct syntax is called "Well Formed". An XML document validated against a DTD is.
Revision Lecture: HTML, CSS and XML Perl/CGI, Perl/DOM, Perl/DBI, Remote Procedure Calls Dr. Andrew C.R. Martin
XML. Contents  Parsing an XML Document  Validating XML Documents.
Extensible Markup Language (XML) Pat Morin COMP 2405.
XML Parsers Overview Types of parsers Using XML parsers SAX DOM
Unit 4 Representing Web Data: XML
Java XML IS
Intro to XML.
XML in Web Technologies
Chapter 7 Representing Web Data: XML
XML Parsers Overview Types of parsers Using XML parsers SAX DOM
CS 240 – Advanced Programming Concepts
Python and XML Styling and other issues XML
XML and its applications: 4. Processing XML using PHP
XML and Web Services (II/2546)
Presentation transcript:

Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin

Aims and objectives  Refresh the structure of a XML (or XHTML) document  Know problems in reading and writing XML  Understand the requirements of XML parsers and the two main types  Know how to write code using the DOM parser  PRACTICAL: write a script to read XML

An XML refresher! x-ray x-ray Tags: paired opening and closing tags May contain data and/or other (nested) tags un-paired tags use special syntax Attributes: optional – contained within the opening tag

Writing XML Writing XML is straightforward  Generate XML from a Perl script using print() statements.  However:  tags correctly nested  quote marks correctly paired  international character sets

Reading XML As simple or complex as you wish!  Full control over XML:  simple pattern may suffice  Otherwise, may be dangerous  may rely on un-guaranteed formatting x-ray x-ray <mutation res='L24’ native='ALA’ subs='ARG'/>

XML Parsers  Clear rules for data boundaries and hierarchy  Predictable; unambiguous  Parser translates XML into  stream of events  complex data object

XML Parsers  Different data sources of data  files  character strings  remote references  different character encodings  standard Latin  Japanese  checking for well-formedness errors Good parser will handle:

XML Parsers  Read stream of characters  differentiate markup and data  Optionally replace entity references  (e.g. < with <)  Assemble complete document  disparate (perhaps remote) sources  Report syntax and validation errors  Pass data to client program

XML Parsers  If XML has no syntax errors it is 'well formed'  With a DTD, a validating parser will check it matches: 'valid'

XML Parsers  Writing a good parser is a lot of work!  A lot of testing needed  Fortunately, many parsers available

Getting data to your program  Parser can generate 'events'  Tags are converted into events  Events triggered in your program as the document is read  Parser acts as a pipeline converting XML into processed chunks of data sent to your program:  an 'event stream'

Getting data to your program OR…  XML converted into a tree structure  Reflects organization of the XML  Whole document read into memory before your program gets access

Pros and cons  In the parser, everything is likely to be event-driven  tree-based parsers create a data structure from the event stream Data structure More convenient Can access data in any order Code usually simpler  May be impossible to handle very large files  Need more processor time  Need more memory Event stream Faster to access limited data Use less memory  Parser loses data at the next event  More complex code

SAX and DOM de facto standard APIs for XML parsing  SAX (Simple API for XML)  event-stream API  originally for Java, but now for several programming languages (including Perl)  development promoted by Peter Murray Rust,  DOM (Document Object Model)  W3C standard tree-based parser  platform- and language-neutral  allows update of document content and structure as well as reading

Perl XML parsers Many parsers available  Differ in three major ways:  parsing style (event driven or data structure)  'standards-completeness’  speed (implementation in C or pure Perl)

Perl XML parsers XML::Simple  Very easy to use  Designed for simple applications  Can’t handle 'mixed content'  tags containing both data and other tags This is mixed content

Perl XML parsers XML::Parser  Oldest Perl XML parser  Reasonably fast and flexibile  Not very standards-compliant.  Is a wrapper to 'expat’  probably the first C XML parser written by James Clark

XML::Parser use XML::Parser; my $xmlfile = # the file to parse # initialize parser object and parse the string my $parser = XML::Parser->new( ErrorContext => 2 ); eval { $parser->parsefile( $xmlfile ); }; # report error or success if( ) { =~ s/at \/.*?$//s; # remove module line number print STDERR "\nERROR in } else { print STDERR "'$xmlfile' is well-formed\n"; } Simple example - check well-formedness Eval used so parser doesn’t cause program to exit Remember 2 lines before an error stores the error

Perl XML parsers XML::DOM  Implements W3C DOM Level 1  Built on top of XML::Parser  Very good fast, stable and complete  Limited extended functionality XML::SAX  Implements SAX2 wrapper to Expat  Fast, stable and complete

Perl XML parsers XML::LibXML  Wrapper around GNOME libxml2  Very fast, complete and stable  Validating/non-validating  DOM and SAX support

Perl XML parsers XML::Twig  DOM-like parser, BUT  Allows you to define elements which can be parsed as discrete units  'twigs' (small branches of a tree)

Perl XML parsers Several others  More specialized, adding…  XPath (to select data from XML document)  re-formatting (XSLT or other methods) ...

XML::DOM  DOM is a standard API  once learned moving to a different language is straightforward  moving between implementations also easy  Suppose we want to extract some data from an XML file...

XML::DOM cat fruit fly We want: cat (Felix domesticus) not endangered fruit fly (Drosophila melanogaster) not endangered

#!/usr/bin/perl use XML::DOM; $file = $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Import XML::DOM module obtain filename initialize a parser parse the file Nested loops correspond to nested tags Returns each element matching ‘species’ Look at each species element in turn $species contains content and descendant elements Can now treat $species in the same way as $doc

#!/usr/bin/perl use XML::DOM; $file = $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); $common_name contains content and any descendant elements Here there is only one element per element, but parser can't know that. Have to specify that we want the first (and only) element The element contains only one child which is data. Obtain first (and only) child element using ->getFirstChild Extract the text from the element object using ->getNodeValue Returns each element matching ‘common-name’

->getElementsByTagName returns an array Here we work through the array: foreach $species ($doc->getElementsByTagName('species')) { } $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; Here we obtain an object and use the ->item() method to obtain the first = $species->getElementsByTagName('common-name'); $cname = $common_names[0]->getFirstChild->getNodeValue; Alternative is to access an individual array element:

$common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; cat There could have been more than one element within this element Obtain the actual text rather than an element object

#!/usr/bin/perl use XML::DOM; $file = $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Attributes much simpler! Can only contain text (no nested elements) Simply specify the attribute Attributes

cat $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); Extract the elements from this element DOM parser can't know there is only one element per species. Extract the first (and only) one. This is an empty element, there are no child elements so we don’t need ->getFirstChild Contains only a ‘ status’ attribute, extract its value with ->getAttribute()

#!/usr/bin/perl use XML::DOM; $file = $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Print the extracted information Loop back to next element Clean up and free memory

XML::DOM Note  Not necessary to use variable names that match the tags, but it is a very good idea!  There are many many more functions, but this set covers most needs

Writing XML with XML::DOM

#!/usr/bin/perl use XML::DOM; $nspecies = = ('Felix domesticus', 'Drosophila = ('cat', 'fruit = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Import XML::DOM Initialize data Create an XML document object Utility method to print XML header

#!/usr/bin/perl use XML::DOM; $nspecies = = ('Felix domesticus', 'Drosophila = ('cat', 'fruit = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a ‘root’ element for our data Loop through each of the species Create a element Set its ‘ name ’ attribute Join the element to the parent element

#!/usr/bin/perl use XML::DOM; $nspecies = = ('Felix domesticus', 'Drosophila = ('cat', 'fruit = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a element Create a text node containing the specified text Join this text node as a child of the element Join the element as a child of

cat

#!/usr/bin/perl use XML::DOM; $nspecies = = ('Felix domesticus', 'Drosophila = ('cat', 'fruit = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a element Set the ‘status’ attribute Join the element as a child of

cat

#!/usr/bin/perl use XML::DOM; $nspecies = = ('Felix domesticus', 'Drosophila = ('cat', 'fruit = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Loop back to handle the next species Finally print the resulting data structure

cat fruit fly

Summary - reading XML Create a parser $parser = XML::DOM::Parser->new(); Parse a file $doc = $parser->parsefile( ' filename ' ); Extract all elements matching tag-name $element_set = $doc->getElementsByTagName( ' tag-name') Extract first element of a set $element = $element_set->item(0); Extract first child of an element $child_element = $element->getFirstChild; Extract text from an element $text = $element->getNodeValue; Get the value of a tag’s attribute $text = $element->getAttribute( ' attribute-name');

Summary - writing XML Create an XML document structure $doc = XML::DOM::Document->new; Utility to create an XML header $header = $doc->createXMLDecl('1.0'); Create a tagged element $element = $doc->createElement('tag-name'); Set an attribute for an element $element->setAttribute('attrib-name', 'value'); Append a child element to an element $parent_element->appendChild($child_element); Create a text node element $element = $doc->createTextNode('text'); Print a document structure as a string print $root_element->toString;

Summary  Two types of parser  Event-driven  Data structure  Writing a good parser is difficult!  Many parsers available  XML::DOM for reading and writing data