Extensible Markup Language: XML


1 Extensible Markup Language: XML
HTML: a widely supported markup language for formatting data.
XML: a widely supported markup language for describing data.
XML is quickly becoming the standard for data exchange between applications.

2 XML Documents
XML marks up data using tags, which are names enclosed in angle brackets.
All tags appear in pairs: a start tag and a matching end tag.
Elements are units of data (i.e., a start tag, its corresponding end tag, and anything between them).
The root element contains all other document elements.
Tag pairs cannot be interleaved; they must be properly nested, so nested elements form trees.
What defines an XML document is not its tag names but the fact that its tags are formatted in this way.
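The nesting rule can be checked mechanically. The following is a minimal sketch, assuming Python's standard-library xml.dom.minidom parser; the helper name is_well_formed and the two sample strings are made up for illustration and are not from the slides.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        parseString(text)
        return True
    except ExpatError:
        return False


# Properly nested: the inner pair closes before the outer pair.
nested = "<a><b>data</b></a>"
# Interleaved: the outer pair closes while the inner pair is still open.
interleaved = "<a><b>data</a></b>"

print(is_well_formed(nested))
print(is_well_formed(interleaved))
```

The tag names a and b carry no special meaning; any names would do, which is the point made on the slide.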

3 The root element contains all other document elements.
The optional XML declaration includes a version information parameter (it MUST be the very first line of the file).
Because of the nice start-tag/end-tag structure, the data can be viewed as organized in a tree:
article
  title
  date
  author
    firstName
    lastName
  summary
  content

4 An I-sequence might be structured as XML like this.
Element contents shown on the slide:
dna
Aspergillus awamori
U03518
aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
acctcccatccgtgtctattgtaccctgttgcttcgg
cgggcccgccgcttgtcggccgccgggggggcgcctctg
ccccccgggcccgtgcccgccggagaccccaacacgaac
actgtctgaaagcgtgcagtctgagttgattgaatgcaat
cagttaaaactttcaacaatggatctcttggttccggc
Labels in the tree diagram: SEQUENCEDATA, TYPE, SEQ, NAME, ID, DATA, and a comment.

5 Parsing and displaying XML
XML is just another data format. Do we need to write yet another parser? No more filters, please!
No! XML is becoming standard: many different systems can read XML, whereas not many systems can read our I-sequence format. Thus, parsers exist already.

6 Attributes
Data can also be placed in attributes: name/value pairs, with the value in quotes.
In letter.xml, the element contact has the attribute type, which has the value "to".
Empty elements are elements with no character data between the tags. The two tags of an empty element may be combined into a single tag that ends in />.
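A short sketch of both points, assuming the standard-library xml.dom.minidom parser. Only the contact element and its type="to" attribute come from the slide; the surrounding letter fragment and the flag element are hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical fragment: contact is written with a start and an end tag,
# flag is written as a single combined tag.
doc = parseString('<letter><contact type="to"></contact><flag/></letter>')

contact = doc.getElementsByTagName("contact")[0]
flag = doc.getElementsByTagName("flag")[0]

# Read the value of the attribute named "type".
contact_type = contact.getAttribute("type")
print(contact_type)

# Both spellings of an empty element yield a node without children.
print(contact.hasChildNodes(), flag.hasChildNodes())
```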

7 Parsers and trees
We've already seen that XML markup can be displayed as a tree. Some XML parsers exploit this. They
– parse the file
– extract the data
– return it organized in a tree data structure called a Document Object Model
(Tree diagram: article with the children title, date, author, summary and content; author has the children firstName and lastName.)

8 Document Object Model (DOM)
A DOM parser retrieves the data from an XML document and returns it as a tree.
A single root node (the document node) contains all other nodes.

9 XML is standard: parsers exist already!
Standard browsers can format XML documents nicely!
Each parent element/node can be expanded and collapsed using the plus and minus signs shown next to it.

10 Python offers a Document Object Model parser!
A DOM parser returns the whole XML document represented as a tree.
All nodes have a name (of the tag) and a value (data).
Text (including whitespace) is represented in nodes with the name #text.
(Tree diagram: the article tree with #text nodes added. The text of each element sits in a #text child node: title holds "Simple XML", date holds "Dec..2001", summary holds "XML.. easy.", content holds "In this..XML.", firstName holds "John" and lastName holds "Doe". Whitespace between tags also produces #text nodes.)

11 deitel_fig16_04revised.py
Annotations on the program listing:
– Parse the XML document and load the data into the variable document
– The documentElement attribute refers to the root node
– nodeName refers to the element's tag name
– Various node attributes: firstChild, nextSibling, nodeValue, parentNode
NB: Changes since book!
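The listing itself is not part of this transcript. As a rough sketch of what such a program can look like, the following uses the standard-library xml.dom.minidom parser, which is an assumption about the module the revised program imports. The XML is parsed from a string rather than a file to keep the example self-contained, and the abbreviated values ("Dec..2001" and so on) are copied as they appear on the slides.

```python
from xml.dom.minidom import parseString

ARTICLE_XML = """<?xml version="1.0"?>
<article>
   <title>Simple XML</title>
   <date>Dec..2001</date>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
   <summary>XML.. easy.</summary>
   <content>In this..XML.</content>
</article>
"""

# Parse the XML and keep the whole tree in the variable document.
document = parseString(ARTICLE_XML)

# documentElement refers to the root node; nodeName is its tag name.
root = document.documentElement
print("Root element:", root.nodeName)

# Whitespace between the tags shows up as #text nodes among the children.
child_names = [child.nodeName for child in root.childNodes]
print("Child nodes:", " ".join(child_names))

# Walk around the tree with firstChild, nextSibling and parentNode.
first = root.firstChild
title = first.nextSibling
print("First child of root:", first.nodeName)
print("Its next sibling:", title.nodeName)
print('Text inside "title":', title.firstChild.nodeValue)
print("Parent of title:", title.parentNode.nodeName)
```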

12 Program output
The first child of root element is: #text
whose next sibling is: title
Text inside "title" tag is Simple XML
Parent node of title is: article
Here is the root element of the document: article
The following are its child elements:
#text title #text date #text author #text summary #text content #text
(The slide repeats the tree diagram from slide 10.)

13 Parsing XML sequence?
We have the i2xml filter (exercise); we want xml2i also.
New XML structure for Isequences: holds more than one sequence.
Algorithm:
– Open file
– Use Python parser to obtain the DOM tree
– Traverse tree to extract sequence information, build Isequence objects
Ignoring whitespace nodes, we have to search a tree like this:
SEQUENCEDATA
  SEQ (type)
    NAME
    ID
    DATA
  SEQ (type)
    NAME
    ID
    DATA
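The algorithm above can be sketched roughly as follows. This is not the course's xml2i.py: it assumes the standard-library xml.dom.minidom parser, assumes the sequence type is stored in an attribute named type, and uses plain dictionaries as a stand-in for Isequence objects, whose class is not shown in the transcript. The second SEQ entry and the shortened DATA strings are made up.

```python
from xml.dom.minidom import parseString

SAMPLE = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcggaaggatcatta</DATA>
  </SEQ>
  <SEQ type="dna">
    <NAME>hypothetical second entry</NAME>
    <ID>X00000</ID>
    <DATA>cgagacggcg</DATA>
  </SEQ>
</SEQUENCEDATA>"""

KNOWN_FIELDS = ("NAME", "ID", "DATA")


def text_of(element):
    """Concatenate the #text children of an element."""
    parts = [c.nodeValue for c in element.childNodes
             if c.nodeType == c.TEXT_NODE]
    return "".join(parts).strip()


def parse_sequences(xml_text):
    """Return one dict per SEQ element (stand-in for Isequence objects)."""
    root = parseString(xml_text).documentElement
    sequences = []
    for seq in root.childNodes:
        # Skip whitespace #text nodes and anything that is not a SEQ element.
        if seq.nodeType != seq.ELEMENT_NODE or seq.nodeName != "SEQ":
            continue
        record = {"type": seq.getAttribute("type")}
        for field in seq.childNodes:
            if field.nodeType == field.ELEMENT_NODE and field.nodeName in KNOWN_FIELDS:
                record[field.nodeName.lower()] = text_of(field)
        sequences.append(record)
    return sequences


records = parse_sequences(SAMPLE)
for record in records:
    print(record["id"], record["type"], record["name"])
```

Reading from a file instead of a string would use xml.dom.minidom.parse with a file name; the traversal stays the same.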

14 xml2i.py (part 1)
Annotations on the program listing:
– We're still being systematic: usual name for the parse method
– Obtain a parse tree with the XML data for free
– Convert this SEQ subtree (a SEQ (type) node under SEQUENCEDATA) to an Isequence object

15 xml2i.py (part 2)
Annotations on the program listing:
– Way of getting to all attributes of a node
– Way of getting to the value of a specific attribute
– Recall: text is kept in a #text node underneath the element (here underneath NAME, ID and DATA in the SEQ (type) subtree)
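The listing is not in the transcript, so the calls below are one plausible way to do the three things the annotations mention, assuming xml.dom.minidom; the one-element sample document is made up.

```python
from xml.dom.minidom import parseString

seq = parseString('<SEQ type="dna"><ID>U03518</ID></SEQ>').documentElement

# All attributes of a node: a map of name/value pairs.
all_attributes = dict(seq.attributes.items())

# The value of one specific attribute.
seq_type = seq.getAttribute("type")

# The text of ID lives in a #text node underneath the ID element.
id_element = seq.getElementsByTagName("ID")[0]
text_node = id_element.firstChild

print(all_attributes)
print(seq_type)
print(text_node.nodeName, text_node.nodeValue)
```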

16 What if the XML sequence format changes?
Now the name of the finder of the sequence is stored as a new tag, FOUNDBY.
(Tree diagram: SEQUENCEDATA with SEQ (type) children; each SEQ now has a FOUNDBY node next to NAME, ID and DATA.)

17 Robustness of XML format
Our xml2i filter still works because the DOM parser still works:
– It can't extract the finder information: it ignores the FOUNDBY node
– But: it doesn't crash! It still extracts the other information
– It is easy to update the filter to incorporate the new info (NB: the updated filter can also read the old format)
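A sketch of such an updated reader, assuming xml.dom.minidom and assuming that FOUNDBY holds the finder's name directly as text (the exact layout of the new tag is not recoverable from the transcript). The helper names and sample documents are hypothetical.

```python
from xml.dom.minidom import parseString

OLD = ('<SEQUENCEDATA><SEQ type="dna"><NAME>Aspergillus awamori</NAME>'
       '<ID>U03518</ID><DATA>aacc</DATA></SEQ></SEQUENCEDATA>')
NEW = ('<SEQUENCEDATA><SEQ type="dna"><NAME>Aspergillus awamori</NAME>'
       '<ID>U03518</ID><DATA>aacc</DATA><FOUNDBY>BiRC</FOUNDBY></SEQ>'
       '</SEQUENCEDATA>')


def first_text(parent, tag, default=None):
    """Text of the first <tag> element below parent, or default if absent."""
    found = parent.getElementsByTagName(tag)
    if not found or found[0].firstChild is None:
        return default
    return found[0].firstChild.nodeValue.strip()


def read_sequences(xml_text):
    """Read both formats; foundby is None when the tag is missing."""
    root = parseString(xml_text).documentElement
    return [
        {
            "type": seq.getAttribute("type"),
            "name": first_text(seq, "NAME"),
            "id": first_text(seq, "ID"),
            "data": first_text(seq, "DATA"),
            "foundby": first_text(seq, "FOUNDBY"),
        }
        for seq in root.getElementsByTagName("SEQ")
    ]


old_records = read_sequences(OLD)
new_records = read_sequences(NEW)
print(old_records[0]["foundby"], new_records[0]["foundby"])
```

A reader that never asks for FOUNDBY would behave like the original filter: the extra node is simply never visited.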

18 Compare with extending Fasta format
Say that the Fasta format is modified so the finder appears in the second line after a >:
>HSBGPG Human gene for bone gla protein (BGP)
>BiRC
CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC..
Our Fasta parser would go wrong!
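To make the failure concrete, here is a deliberately naive Fasta reader. It is a stand-in, since the course's own Fasta parser is not shown in the transcript; the function name parse_fasta is hypothetical, and the sequence is the visible part of the slide's example without the trailing dots.

```python
def parse_fasta(lines):
    """Naive Fasta reader: every line starting with '>' opens a new record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append({"header": line[1:], "sequence": ""})
        elif records:
            records[-1]["sequence"] += line
    return records


modified = [
    ">HSBGPG Human gene for bone gla protein (BGP)",
    ">BiRC",
    "CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC",
]

records = parse_fasta(modified)
for record in records:
    print(repr(record["header"]), len(record["sequence"]))
```

The reader reports two records: the real entry ends up with an empty sequence, and the finder line is mistaken for the header of a second entry that owns all the sequence data.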

19 XML robust
So, the good thing about XML is that it is robust because of its well-defined structure.
It is widely used, so this overall tag structure won't change and other applications can read your XML data.
A parser is available in Python already:
– Read XML into a DOM tree
– The DOM tree can be traversed but also manipulated (see next slide)

20 See all the methods and attributes of a DOM tree on pages 537ff of the book.
It is possible to manipulate the DOM tree using these methods (add new nodes, remove nodes, set attributes, etc.).
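The three kinds of manipulation mentioned here can be sketched as follows, assuming xml.dom.minidom. The sample element, the FOUNDBY node and the value BiRC are chosen only for illustration.

```python
from xml.dom.minidom import parseString

document = parseString(
    "<SEQ><ID>U03518</ID><NAME>Aspergillus awamori</NAME></SEQ>")
seq = document.documentElement

# Set an attribute on an existing node.
seq.setAttribute("type", "dna")

# Add a new node, with its text in a #text child.
foundby = document.createElement("FOUNDBY")
foundby.appendChild(document.createTextNode("BiRC"))
seq.appendChild(foundby)

# Remove a node from its parent.
name = seq.getElementsByTagName("NAME")[0]
seq.removeChild(name)

# Serialize the modified subtree back to XML text.
result = seq.toxml()
print(result)
```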

21 Convert old format XML sequence to new format
Old format: the sequence type has its own tag, TYPE.
(Tree diagram labels: SEQUENCEDATA, TYPE, SEQ, NAME, ID, DATA.)
New format: the sequence type is an attribute of the SEQ tag.
(Tree diagram labels: SEQUENCEDATA, SEQ (type), NAME, ID, DATA.)
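One way to express this conversion on a DOM tree is sketched below. It assumes xml.dom.minidom, assumes that the old format has a single TYPE element that applies to every SEQ in the file, and assumes the new attribute is named type. The function name convert_to_new_format and the shortened DATA string are hypothetical; this is not the method added in the course's old_xml2i.py.

```python
from xml.dom.minidom import parseString

OLD_FORMAT = """<SEQUENCEDATA>
  <TYPE>dna</TYPE>
  <SEQ>
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcggaaggatcatta</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def convert_to_new_format(document):
    """Move the text of each TYPE element into a type attribute on SEQ."""
    root = document.documentElement
    for type_node in root.getElementsByTagName("TYPE"):
        seq_type = type_node.firstChild.nodeValue.strip()
        for seq in root.getElementsByTagName("SEQ"):
            seq.setAttribute("type", seq_type)
        # The TYPE tag is no longer needed once its value is an attribute.
        type_node.parentNode.removeChild(type_node)
    return document


converted = convert_to_new_format(parseString(OLD_FORMAT))
converted_seq = converted.getElementsByTagName("SEQ")[0]
print(converted_seq.getAttribute("type"))
```

After the conversion, a reader written for the new format can process the tree without knowing that the file started out in the old layout.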

22 old_xml2i.py
Add a new method to the original xml2i.py and call it after parsing the XML file.

23 old_xml2phylip.py
Import the new module.
Check that the type information is saved in the Isequence (it is not used in the phylip format).

24 Testing on old format XML sequence
U03518b.xml (element contents shown on the slide):
dna
Aspergillus awamori
U03518
aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatc
cgtgtctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgcc
ccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgatt
gaatgcaatcagttaaaactttcaacaatggatctcttggttccggc
Command and output: python old_xml2phylip.py U03518b.xml U03518b sequence is of type dna

25 Remark: the book uses an old version of the DOM parser
The XML examples in the book won't work (except the revised fig16.04).
Look in the presented example programs to see what you have to import.
All the methods and attributes of a DOM tree on pages 537ff are the same.

26 ..on to the exercises

