1 Extensible Markup Language: XML
–XML was developed by the World Wide Web Consortium's (W3C's) XML Working Group (1996)
–XML is a portable, widely supported technology for describing data
–XML is quickly becoming the standard for data exchange between applications
XML Documents
–XML marks up data using tags, which are names enclosed in angle brackets
–All tags appear in pairs: <tag>..</tag>
–Elements: units of data (i.e., everything included between a start tag and its corresponding end tag)
–Root element contains all other document elements
–Tag pairs cannot appear interleaved: <a><b>..</a></b> must be: <a><b>..</b></a>
–Nested elements form hierarchies (trees)
–Thus: what defines an XML document is not its tag names but that its tags are formatted in this way
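The rule that tag pairs must be nested, not interleaved, can be checked with a parser. The following is a minimal sketch, written for Python 3 (unlike the Python 2 listings later in these slides); the helper name is_well_formed and the tag names a and b are made up for illustration.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        parseString(text)
        return True
    except ExpatError:
        return False


nested = "<a><b>data</b></a>"        # b opens and closes inside a
interleaved = "<a><b>data</a></b>"   # a closes while b is still open

print(is_well_formed(nested))
print(is_well_formed(interleaved))
```

The parser accepts the nested version and rejects the interleaved one, whatever the tag names are.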
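3 article.xml

<?xml version = "1.0"?>
<article>
   <title>Simple XML</title>
   <date>December 21, 2001</date>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
   <summary>XML is pretty easy.</summary>
   <content>In this chapter, we present a wide variety of examples
      that use XML.</content>
</article>

Notes on the listing:
–Optional XML declaration includes version information parameter
–XML comments are delimited by <!-- and -->
–Root element (article) contains all other document elements
–End tag has format </tagname>

Because of the nice <tag>..</tag> structure, the data can be viewed as organized in a tree:

article
   title
   date
   author
      firstName
      lastName
   summary
   content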
4 An I-sequence structured as XML
Tags: SEQUENCEDATA, SEQ, TYPE, NAME, ID, DATA
TYPE: dna
NAME: Aspergillus awamori
ID: U03518
DATA:
aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
acctcccatccgtgtctattgtaccctgttgcttcgg
cgggcccgccgcttgtcggccgccgggggggcgcctctg
ccccccgggcccgtgcccgccggagaccccaacacgaac
actgtctgaaagcgtgcagtctgagttgattgaatgcaat
cagttaaaactttcaacaatggatctcttggttccggc
5 Parsing and displaying XML
–XML is just another data format. Do we need to write yet another parser? (No more filters, please!)
–No! XML is becoming standard
–Many different systems can read XML; not many systems can read our I-sequence format
–Thus, parsers exist already
6 XML document opened in Internet Explorer: each parent element/node can be expanded (plus sign) and collapsed (minus sign)
7 XML document opened in Mozilla Again: Each parent element/node can be expanded and collapsed (here by pressing the minus, not the element)
8 letter.xml
Listing (character data only): Jane Doe, Box, Any Ave., Othertown, Otherstate; John Doe, Main St, Anytown, Anystate; Dear Sir:
Attributes
–Data can also be placed in attributes: name/value pairs
–Attribute (name-value pair, value in quotes): element contact has the attribute type, which has the value "from"
–Empty elements do not contain character data. The start and end tags of an empty element may be written as one tag ending in />
9 letter.xml (continued)
It is our privilege to inform you about our new database managed with XML. This new system allows you to reduce the load on your inventory list server by having the client machine perform the work of sorting and filtering the data.
Please visit our Web site for availability and pricing.
Sincerely
Ms. Doe
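The attribute and empty-element notes for letter.xml can be tried out on a small fragment. This is only a sketch in Python 3: the element contact, the attribute type and the value "from" come from the slide, while the name and flag elements are hypothetical stand-ins, since the markup of the original listing is not shown here.

```python
from xml.dom.minidom import parseString

# Hypothetical fragment: 'name' and 'flag' are illustrative element names.
text = ('<letter><contact type="from">'
        '<name>Jane Doe</name><flag/>'
        '</contact></letter>')

contact = parseString(text).documentElement.firstChild
contact_type = contact.getAttribute("type")       # value of the attribute
flag = contact.getElementsByTagName("flag")[0]    # the empty element

print(contact.nodeName, contact_type)
print(flag.hasChildNodes())   # an empty element holds no character data
```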
10 Intermezzo 1
1. Finish this i2xml.py filter so it translates a list of Isequence objects into XML (following the above structure) and saves it in a file. Assume the list contains only one Isequence object. Use your module with this driver program and translate this Fasta file into XML. Load the resulting XML file into a browser.
2. Change the XML structure defined by your filter so that TYPE is no longer a tag by itself but an attribute of the SEQ tag (see page 496).
3. Modify your i2xml filter so that it can now translate a list of several Isequence objects into one XML file, using the structure from part 2. Test your program with the same driver on this Fasta file.
All files can be found from the Example Programs page.
11 solution

from Isequence import Isequence
import sys

# Save a list of Isequences in XML
class SaveToFiles:
    """Stores a list of ISequences in XML format"""

    def save_to_files(self, iseqlist, savefilename):
        # Tag names follow the SEQUENCEDATA / SEQ (type) / NAME, ID, DATA
        # structure that the xml2i parser reads.
        try:
            savefile = open(savefilename, "w")
            print >> savefile, "<SEQUENCEDATA>"
            for seq in iseqlist:
                print >> savefile, '  <SEQ type="%s">' % seq.get_type()
                print >> savefile, "    <NAME>%s</NAME>" % seq.get_name()
                print >> savefile, "    <ID>%s</ID>" % seq.get_id()
                print >> savefile, "    <DATA>%s</DATA>" % seq.get_sequence()
                print >> savefile, "  </SEQ>"
            print >> savefile, "</SEQUENCEDATA>"
            savefile.close()
        except IOError, message:
            sys.exit(message)
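Writing tags with plain string formatting breaks if a name or ID contains a character such as & or <. The sketch below, in Python 3, shows one way to guard against that with the standard library's escape and quoteattr helpers. It is not the course solution: plain dicts stand in for Isequence objects, and the sample name "A & B sample" is made up to show the escaping.

```python
from xml.sax.saxutils import escape, quoteattr


def sequences_to_xml(records):
    """Return an XML string for dicts with keys type, name, id, sequence."""
    lines = ["<SEQUENCEDATA>"]
    for rec in records:
        # quoteattr adds the surrounding quotes and escapes the value
        lines.append("  <SEQ type=%s>" % quoteattr(rec["type"]))
        lines.append("    <NAME>%s</NAME>" % escape(rec["name"]))
        lines.append("    <ID>%s</ID>" % escape(rec["id"]))
        lines.append("    <DATA>%s</DATA>" % escape(rec["sequence"]))
        lines.append("  </SEQ>")
    lines.append("</SEQUENCEDATA>")
    return "\n".join(lines)


# Hypothetical record; a dict stands in for an Isequence object.
record = {"type": "dna", "name": "A & B sample",
          "id": "U03518", "sequence": "aacctg"}
print(sequences_to_xml([record]))
```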
12 solution XML file loaded in Internet Explorer
13 Parsers and trees
We've already seen that XML markup can be displayed as a tree. Some XML parsers exploit this. They
–parse the file
–extract the data
–return it organized in a tree data structure called a Document Object Model

article
   title
   date
   author
      firstName
      lastName
   summary
   content
Document Object Model (DOM)
–DOM parser retrieves data from XML document
–Hierarchical tree structure called a DOM tree
–Each component of an XML document is represented as a tree node
–Parent nodes contain child nodes
–Sibling nodes have the same parent
–Single root (or document) node contains all other document nodes
15 DOM tree of previous example
Fig. 15.6 Tree structure for article.xml: article is the single document root node; title, date, author, summary and content are its child nodes and siblings of one another; author is the parent node of firstName and lastName.
Text values: title "Simple XML", date "December 21, 2001", firstName "John", lastName "Doe", summary "XML is pretty easy.", content "In this chapter, we present a wide variety of examples that use XML."
16 Python provides a DOM parser!
–All nodes have a name (of tag) and a value
–Text (incl. whitespace) is represented in nodes with tag name #text

article
   title
      #text "Simple XML"
   date
      #text "December 21, 2001"
   author
      firstName
         #text "John"
      lastName
         #text "Doe"
   summary
      #text "XML is pretty easy."
   content
      #text "In this chapter, we present a wide variety of examples that use XML."
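That whitespace ends up in #text nodes is easy to see by listing the children of a root element. A small sketch in Python 3, using a shortened version of article.xml (only title and date are kept):

```python
from xml.dom.minidom import parseString

text = """<article>
   <title>Simple XML</title>
   <date>December 21, 2001</date>
</article>"""

root = parseString(text).documentElement

# The line breaks and indentation between the tags become #text nodes.
names = [node.nodeName for node in root.childNodes]
print(names)

# The text inside an element is itself a #text child of that element.
title = root.childNodes[1]
print(title.firstChild.nodeName, repr(title.firstChild.nodeValue))
```

Every second child of the root is a whitespace-only #text node, which is why tree-walking code has to either skip such nodes or test node names explicitly.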
17 revised fig16_04.py

import sys
from xml.dom.minidom import parse          # stuff we have to import
from xml.parsers.expat import ExpatError   # (the book uses an old version)

# open the XML file (file name assumed here: article.xml)
try:
    file = open( "article.xml" )
except IOError:
    sys.exit( "Error opening file" )

# parse XML document and load data into variable document
try:
    document = parse( file )
    file.close()
except ExpatError:
    sys.exit( "Error processing XML file" )

# get root element of the DOM tree:
# the documentElement attribute refers to the root node
rootElement = document.documentElement

# nodeName refers to the element's tag name
print "Here is the root element of the document: %s" % rootElement.nodeName

# traverse all child nodes of root element;
# childNodes is the list of a node's children
print "The following are its child elements:"
for node in rootElement.childNodes:
    print node.nodeName

# get first child node of root element
child = rootElement.firstChild
print "\nThe first child of root element is:", child.nodeName
print "whose next sibling is:",

# get next sibling of first child
sibling = child.nextSibling
print sibling.nodeName

# print text value of sibling
print 'Text inside "' + sibling.nodeName + '" tag is',
textnode = sibling.firstChild
print textnode.nodeValue

# print parent node of sibling
print "Parent node of %s is: %s" % ( sibling.nodeName, sibling.parentNode.nodeName )

Other node attributes: firstChild, nextSibling, nodeValue, parentNode
18 Program output

Here is the root element of the document: article
The following are its child elements:
#text
title
#text
date
#text
author
#text
summary
#text
content
#text

The first child of root element is: #text
whose next sibling is: title
Text inside "title" tag is Simple XML
Parent node of title is: article

The text value is printed by these lines of the program:

    print 'Text inside "' + sibling.nodeName + '" tag is',
    textnode = sibling.firstChild
    # print text value of sibling
    print textnode.nodeValue
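The #text entries in the output are the whitespace nodes between the elements. If only the element structure is of interest, the node type can be tested while walking the tree. The following is a sketch in Python 3; the function name outline is made up, and the document is a shortened article.xml.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def outline(node, depth=0):
    """Return indented tag names of node and its element descendants."""
    lines = ["  " * depth + node.nodeName]
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:   # skip #text nodes
            lines.extend(outline(child, depth + 1))
    return lines


text = """<article>
   <title>Simple XML</title>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
</article>"""

print("\n".join(outline(parseString(text).documentElement)))
```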
19 Parsing XML sequence?
–We have the i2xml filter; we want xml2i also
–Don't have to write an XML parser, Python provides one
–Thus, algorithm:
   –Open file
   –Use Python parser to obtain the DOM tree
   –Traverse tree to extract sequence information, build Isequence objects
Ignoring whitespace nodes, we have to search a tree like this:

SEQUENCEDATA
   SEQ (type)
      NAME
      ID
      DATA
   SEQ (type)
      NAME
      ID
      DATA
20 part 1:2

from Isequence import Isequence
import sys
from xml.dom.minidom import parse
from xml.parsers.expat import ExpatError

class Parser:
    """Parses xml file, stores sequences in Isequence list"""

    def __init__( self ):
        self.iseqlist = []  # make empty list

    def parse_file( self, loadfilename ):
        try:
            loadfile = open( loadfilename, "r" )
        except IOError, message:
            sys.exit( message )

        # Use Python's own xml parser to parse xml file:
        try:
            dom = parse( loadfile )
            loadfile.close()
        except ExpatError:
            sys.exit( "Couldn't parse xml file" )

        # now dom is our dom tree structure. Was the xml file a sequence file?
        if dom.documentElement.nodeName == "SEQUENCEDATA":
            # recursively search the parse tree:
            for child in dom.documentElement.childNodes:
                self.traverse_dom_tree( child )
        else:
            sys.exit( "This is not a sequence file" )

        return self.iseqlist
21 part 2:2

    def traverse_dom_tree( self, node ):
        """Recursive method that traverses the DOM tree"""
        if node.nodeName == "SEQ":
            # marks the beginning of a new sequence
            self.iseq = Isequence()            # make new Isequence object
            self.iseqlist.append( self.iseq )  # add to list
            newformat = 0
            # the type should be an attribute of the SEQ tag.
            # go through all attributes of this node:
            for i in range( node.attributes.length ):
                if node.attributes.item(i).name == "type":
                    # good, found a 'type' attribute
                    newformat = 1
                    # get the value of the attribute, put it in the Isequence:
                    self.iseq.set_type( node.getAttribute( "type" ) )
                    break
            if not newformat:
                # we didn't find any 'type' attribute, this is old format
                print "No 'type' attribute in element SEQ"
            # next recursively traverse the child nodes of this SEQ node:
            for child in node.childNodes:
                self.traverse_dom_tree( child )
        elif node.nodeName == "NAME":
            self.iseq.set_name( node.firstChild.nodeValue )
        elif node.nodeName == "ID":
            self.iseq.set_id( node.firstChild.nodeValue )
        elif node.nodeName == "DATA":
            self.iseq.set_sequence( node.firstChild.nodeValue )
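The recursive traversal is one way to pull the data out of the DOM tree. As an alternative sketch, not the course solution, the same information can be collected with getElementsByTagName. The code is Python 3; dicts stand in for Isequence objects, the helper names text_of and read_sequences are made up, and the sample data is the first line of the Aspergillus awamori sequence shown earlier.

```python
from xml.dom.minidom import parseString


def text_of(parent, tag):
    """Return the text of the first <tag> element under parent, or None."""
    found = parent.getElementsByTagName(tag)
    if found and found[0].firstChild is not None:
        return found[0].firstChild.nodeValue
    return None


def read_sequences(xml_text):
    """Return one dict per SEQ element (a stand-in for Isequence objects)."""
    dom = parseString(xml_text)
    if dom.documentElement.nodeName != "SEQUENCEDATA":
        raise ValueError("This is not a sequence file")
    records = []
    for seq in dom.getElementsByTagName("SEQ"):
        records.append({
            "type": seq.getAttribute("type"),
            "name": text_of(seq, "NAME"),
            "id": text_of(seq, "ID"),
            "sequence": text_of(seq, "DATA"),
        })
    return records


sample = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca</DATA>
  </SEQ>
</SEQUENCEDATA>"""

for record in read_sequences(sample):
    print(record)
```

Because getElementsByTagName only returns elements, the whitespace #text nodes never need to be handled explicitly here.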
22 What if the XML sequence format changes?
Now the name of the finder of the sequence is also stored as a new tag, FOUNDBY:

SEQUENCEDATA
   SEQ (type)
      NAME
      ID
      DATA
      FOUNDBY
   SEQ (type)
      NAME
      ID
      DATA
      FOUNDBY
23 Robustness of XML format
Our xml2i filter still works:
–Can't extract the finder information: it ignores the FOUNDBY node
–But: doesn't crash! Still extracts the other information
–Easy to incorporate new info (one more elif branch)

    def traverse_dom_tree( self, node ):
        """Recursive method that traverses the DOM tree"""
        if node.nodeName == "SEQ":
            ..
            # next recursively traverse the child nodes of this SEQ node:
            for child in node.childNodes:
                self.traverse_dom_tree( child )
        elif node.nodeName == "NAME":
            self.iseq.set_name( node.firstChild.nodeValue )
        elif node.nodeName == "ID":
            self.iseq.set_id( node.firstChild.nodeValue )
        elif node.nodeName == "DATA":
            self.iseq.set_sequence( node.firstChild.nodeValue )
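The point that unknown tags are skipped, and that new information is easy to incorporate, can be illustrated with a table-driven variant of the traversal. This is a sketch in Python 3 under some assumptions: dicts replace Isequence objects, the names FIELDS, collect and read are made up, and FOUNDBY is assumed to be a direct child of SEQ with the finder's name as its text.

```python
from xml.dom.minidom import parseString

# Map tag names to record keys; tags not listed here are simply skipped.
FIELDS = {"NAME": "name", "ID": "id", "DATA": "sequence"}


def collect(node, records, fields):
    """Recursively walk the DOM tree, filling one dict per SEQ element."""
    if node.nodeName == "SEQ":
        records.append({"type": node.getAttribute("type")})
        for child in node.childNodes:
            collect(child, records, fields)
    elif node.nodeName in fields and records and node.firstChild is not None:
        records[-1][fields[node.nodeName]] = node.firstChild.nodeValue


def read(xml_text, fields):
    records = []
    for child in parseString(xml_text).documentElement.childNodes:
        collect(child, records, fields)
    return records


text = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <FOUNDBY>BiRC</FOUNDBY>
    <DATA>aacctgcggaagg</DATA>
  </SEQ>
</SEQUENCEDATA>"""

old = read(text, FIELDS)                            # FOUNDBY is ignored
new = read(text, dict(FIELDS, FOUNDBY="found_by"))  # one extra table entry
print(old)
print(new)
```

The old reader still returns type, name, ID and sequence from the extended file; supporting the finder takes a single added entry, which corresponds to one more elif branch in the slide's version.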
24 Compare with extending Fasta format
Say that the Fasta format is modified so the finder appears in the second line after a >:

>HSBGPG Human gene for bone gla protein (BGP)
>BiRC
CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC
..

Our Fasta parser would go wrong:

    for line in lines:
        if line[0] == '>':
            # new sequence starts
            items = line.split()
            # put new Isequence obj. in list
            ..
        elif self.iseq:
            # we are currently building an iseq object, extend its sequence
            self.iseq.extend_sequence( line.strip() )  # skip trailing newline
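How the Fasta parser goes wrong can be seen with a simplified stand-in that follows the same rule (every line starting with > opens a new sequence). This is a Python 3 sketch, not the course's Fasta parser; dicts replace Isequence objects.

```python
def parse_fasta(lines):
    """Each line starting with '>' opens a new record; other lines extend it."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line[0] == ">":
            records.append({"header": line[1:], "sequence": ""})
        elif records:
            records[-1]["sequence"] += line
    return records


extended = [
    ">HSBGPG Human gene for bone gla protein (BGP)",
    ">BiRC",
    "CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC",
]

records = parse_fasta(extended)
print(len(records))             # the finder line is taken for a second sequence
print(records[0]["sequence"])   # the real sequence is attached to 'BiRC' instead
```

The parser does not crash, but it silently returns wrong data: one sequence with no data and a second, bogus sequence named after the finder.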
25 XML robust
–So, the good thing about XML is that it is robust because of its well-defined structure
–Widely used, so this overall tag structure won't change
–Parsers available in Python already:
   –Read XML into a DOM tree
   –DOM tree can be traversed but also manipulated (see next slide)
   –Read XML using the so-called SAX method
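For comparison with the DOM approach, here is a minimal sketch of the SAX method using Python's xml.sax module (Python 3). Instead of building a tree, the parser calls handler methods as it meets start tags, text and end tags. The class name NameCollector is made up, and the sample document reuses names that appear elsewhere in these slides; the type of the second sequence is an assumption.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Collects the text of every NAME element while the file is read."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.in_name = False

    def startElement(self, name, attrs):
        if name == "NAME":
            self.in_name = True
            self.names.append("")

    def characters(self, content):
        # Text may arrive in several pieces, so append to the current name.
        if self.in_name:
            self.names[-1] += content

    def endElement(self, name):
        if name == "NAME":
            self.in_name = False


text = b"""<SEQUENCEDATA>
  <SEQ type="dna"><NAME>Aspergillus awamori</NAME><ID>U03518</ID></SEQ>
  <SEQ type="dna"><NAME>Human gene for bone gla protein (BGP)</NAME><ID>HSBGPG</ID></SEQ>
</SEQUENCEDATA>"""

handler = NameCollector()
xml.sax.parseString(text, handler)
print(handler.names)
```

Since no tree is kept in memory, this style suits large files, at the cost of having to track the current position (here the in_name flag) by hand.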
26 See all the methods and attributes of a DOM tree on pages 537ff. It is possible to manipulate the DOM tree using these methods: add new nodes, remove nodes, set attributes etc.
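The three kinds of manipulation mentioned here (adding nodes, removing nodes, setting attributes) look roughly like this with xml.dom.minidom. A sketch in Python 3; the FOUNDBY element and the value "BiRC" are borrowed from the later slides purely as sample data.

```python
from xml.dom.minidom import parseString

dom = parseString('<SEQUENCEDATA><SEQ type="dna">'
                  '<NAME>Aspergillus awamori</NAME>'
                  '</SEQ></SEQUENCEDATA>')
seq = dom.getElementsByTagName("SEQ")[0]

# add a new node: nodes are created by the document, then attached
found_by = dom.createElement("FOUNDBY")
found_by.appendChild(dom.createTextNode("BiRC"))
seq.appendChild(found_by)

# set an attribute (overwrites the existing value)
seq.setAttribute("type", "DNA")

# remove a node
name = seq.getElementsByTagName("NAME")[0]
seq.removeChild(name)

print(seq.toxml())
```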
27 Remark: book uses old version of DOM parser
–XML examples in the book won't work (except the revised fig16.04)
–Look in the presented example programs to see what you have to import
–All the methods and attributes of a DOM tree on pages 537ff are the same
28 Intermezzo 2
1. Copy this file and take a look at it in your editor: /users/chili/CSS.E03/Intermezzi/data.xml. Any idea what this data is?
2. Open the file in a browser. Expand and collapse nodes by clicking the - and + symbols. Do you see the structure of the tree? Any idea what the data represents now?
3. Copy this program to the same directory. Run it and find the name of Jakob's mother's father's mother. See how the program works?
4. Modify the program so it reports the birth year of the current person as well as the name.
5. Enhance the program so the user can also go back to the son or daughter of the current person. See table on page
If you have time: Enhance the program so it prints the current person's mother-in-law, if she exists.
29 solution

    # body of the program's main loop (the loop header is not shown on this slide)
    name = person.getAttribute( "n" )
    print( "%s" %name )

    if name != 'Jakob':
        print "%s's mother in law is" %name,
        parentNode = person.parentNode
        # parentNode is either an 'm' or an 'f' node. If it is a mother
        # node, we need the father node, and vice versa:
        if parentNode.nextSibling:
            spouse = parentNode.nextSibling.firstChild
        else:
            spouse = parentNode.previousSibling.firstChild
        # Now we need the mother of the spouse:
        for childNode in spouse.childNodes:
            if childNode.nodeName == 'm':
                print childNode.firstChild.getAttribute( 'n' )
                break

    input = raw_input( "Report (m)other or (f)ather or (o)ffspring of %s? " %name )
    if input != 'm' and input != 'f' and input != 'o':
        break

    if input == 'o':
        print "\n" + name + "'s offspring is",
        person = person.parentNode.parentNode
    else:
        for child in person.childNodes:
            if child.nodeName == input:
                if input == 'm':
                    print "\nMother of " + name + " is",
                elif input == 'f':
                    print "\nFather of " + name + " is",
                person = child.firstChild
                break
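The navigation in the solution rests on one layout: a person element carries its name in attribute n, and its optional m and f child nodes each wrap one parent. The sketch below, in Python 3, walks such a tree non-interactively. The real data.xml is not shown on the slides, so the element name p and all names except Jakob are hypothetical, as are the helper names parent_of, offspring_of and follow.

```python
from xml.dom.minidom import parseString

# Hypothetical data in the layout the solution assumes (no whitespace nodes).
text = ('<p n="Jakob">'
        '<m><p n="Anna"><m><p n="Karen"/></m><f><p n="Hans"/></f></p></m>'
        '<f><p n="Peter"/></f>'
        '</p>')


def parent_of(person, which):
    """Return the mother ('m') or father ('f') of person, or None."""
    for child in person.childNodes:
        if child.nodeName == which:
            return child.firstChild
    return None


def offspring_of(person):
    """Go back down: person -> enclosing 'm'/'f' node -> the child's element."""
    wrapper = person.parentNode
    if wrapper is None or wrapper.nodeName not in ("m", "f"):
        return None
    return wrapper.parentNode


def follow(person, path):
    """Follow a string of steps such as 'mf' (mother's father)."""
    for step in path:
        if person is None:
            return None
        person = parent_of(person, step)
    return person


jakob = parseString(text).documentElement
grandfather = follow(jakob, "mf")
print(grandfather.getAttribute("n"))
print(offspring_of(grandfather).getAttribute("n"))
```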