1 Parsing XML sequence?
We have the i2xml filter (exercise); we want xml2i as well.
We don't have to write an XML parser: Python provides one.
Thus, the algorithm:
– Open file
– Use the Python parser to obtain the DOM tree
– Traverse the tree to extract sequence information and build Isequence objects
Ignoring whitespace nodes, we have to search a tree like this:
SEQUENCEDATA
    SEQ (type)
        DATA
        ID
        NAME
    SEQ (type)
        DATA
        ID
        NAME
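The steps on this slide can be sketched roughly as follows. This is a minimal sketch, not the course's xml2i.py: it assumes the standard-library xml.dom.minidom parser, and the Isequence class, the helper names and the sample XML are hypothetical stand-ins written for illustration.

```python
from dataclasses import dataclass
from xml.dom import minidom

# Made-up sample in the tree shape shown on the slide.
SAMPLE = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcggaaggatcattaccg</DATA>
  </SEQ>
</SEQUENCEDATA>"""


@dataclass
class Isequence:
    """Hypothetical stand-in for the course's Isequence class."""
    type: str
    id: str
    name: str
    data: str


def child_text(node, tag):
    """Return the text under the first <tag> element below node, or ''."""
    elements = node.getElementsByTagName(tag)
    if not elements:
        return ""
    # The text lives in #text nodes underneath the element.
    return "".join(
        c.data for c in elements[0].childNodes if c.nodeType == c.TEXT_NODE
    ).strip()


def build_sequences(dom):
    """Traverse the DOM tree and build one Isequence per SEQ node."""
    return [
        Isequence(
            type=seq.getAttribute("type"),
            id=child_text(seq, "ID"),
            name=child_text(seq, "NAME"),
            data=child_text(seq, "DATA"),
        )
        for seq in dom.getElementsByTagName("SEQ")
    ]


# minidom.parse(filename) does the same for a file on disk.
dom = minidom.parseString(SAMPLE)
sequences = build_sequences(dom)
```

Looking nodes up by tag name sidesteps the whitespace #text nodes between elements, so they never need to be filtered explicitly.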

2 xml2i.py (part 1)
We're still being systematic. Notes on the code:
– Usual name for the parse method
– Obtain a parse tree with the XML data for free
– Convert the SEQ (type) subtree of SEQUENCEDATA to an Isequence object

3 xml2i.py (part 2)
Notes on the code, for a SEQ (type) node with children DATA, ID and NAME:
– Way of getting to all attributes of a node
– Way of getting to a specific named attribute
– Recall: text is kept in a #text node underneath the element
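The three points on this slide might look like this with the standard-library xml.dom.minidom module. This is an assumption: the course's own code is not shown here and may import a different DOM parser. The one-line XML is a made-up sample.

```python
from xml.dom import minidom

dom = minidom.parseString('<SEQ type="dna"><ID>U03518</ID></SEQ>')
seq = dom.documentElement

# All attributes of a node: a map of name/value pairs.
all_attributes = dict(seq.attributes.items())

# One specific named attribute.
seq_type = seq.getAttribute("type")

# The ID element holds no text itself; the text sits in a #text child node.
id_node = seq.getElementsByTagName("ID")[0]
text_node = id_node.firstChild
id_text = text_node.data

print(all_attributes, seq_type, text_node.nodeName, id_text)
```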

4 What if the XML sequence format changes?
Now the name of the finder of the sequence is stored as a new tag, FOUNDBY:
SEQUENCEDATA
    SEQ (type)
        DATA
        ID
        FOUNDBY
        NAME
    SEQ (type)
        DATA
        ID
        FOUNDBY
        NAME

5 Robustness of XML format
Our xml2i filter still works because the DOM parser still works:
– It can't extract the finder information: it ignores the FOUNDBY node
– But it doesn't crash, and it still extracts the other information
– It is easy to update the filter to incorporate the new info
SEQ (type)
    DATA
    ID
    FOUNDBY
    NAME
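A small sketch of this behaviour, assuming xml.dom.minidom and hypothetical function names (extract_old, extract_new); the sample XML is made up for illustration. The old extraction code runs unchanged on the new format and simply never asks for FOUNDBY, while the updated version needs only one extra lookup.

```python
from xml.dom import minidom

# Made-up sample in the new format, with the extra FOUNDBY tag.
NEW_FORMAT = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Human gene for bone gla protein (BGP)</NAME>
    <ID>HSBGPG</ID>
    <FOUNDBY>BiRC</FOUNDBY>
    <DATA>CGAGACGGCGCGCGTCC</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def text_of(seq, tag):
    """Text under the first <tag> below seq, or None if the tag is absent."""
    nodes = seq.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.data.strip()


def extract_old(seq):
    """Extraction written before FOUNDBY existed: it never looks for it."""
    return {
        "type": seq.getAttribute("type"),
        "id": text_of(seq, "ID"),
        "name": text_of(seq, "NAME"),
        "data": text_of(seq, "DATA"),
    }


def extract_new(seq):
    """Updated extraction: one extra lookup picks up the finder."""
    info = extract_old(seq)
    info["foundby"] = text_of(seq, "FOUNDBY")
    return info


seq = minidom.parseString(NEW_FORMAT).getElementsByTagName("SEQ")[0]
old_info = extract_old(seq)   # works, finder silently ignored
new_info = extract_new(seq)   # also carries the finder
```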

6 Compare with extending Fasta format
Say that the Fasta format is modified so the finder appears on the second line, after a >:
>HSBGPG Human gene for bone gla protein (BGP)
>BiRC
CGAGACGGCGCGCGTCCCCTTCGGAGGCGCGGCGCTCTATTACGCGCGATCGACCC..
Our Fasta parser would go wrong!
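To see one way this can go wrong, here is a deliberately naive Fasta reader. It is a hypothetical illustration, not the course's parser: it assumes that every line starting with > opens a new record, which is a common way to read plain Fasta.

```python
def parse_fasta(text):
    """Naive Fasta reader: each '>' line starts a new record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append({"header": line[1:], "data": ""})
        elif records:
            records[-1]["data"] += line
    return records


# Shortened version of the modified format from the slide.
MODIFIED = """>HSBGPG Human gene for bone gla protein (BGP)
>BiRC
CGAGACGGCGCGCGTCC
"""

records = parse_fasta(MODIFIED)
# The finder line is mistaken for a second sequence header, so the
# real sequence ends up empty and a bogus record "BiRC" gets the data.
```

There is no crash, which is arguably worse: the parser quietly returns two wrong records instead of one correct one.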

7 XML robust
So, the good thing about XML is that it is robust because of its well-defined structure.
It is widely used, i.e. the overall tag structure won't change and other applications can read your XML data.
A parser is already available in Python:
– Read XML into a DOM tree
– The DOM tree can be traversed but also manipulated (see next slide)

8 See all the methods and attributes of a DOM tree on pages 537ff.
It is possible to manipulate the DOM tree using these methods (add new nodes, remove nodes, set attributes etc.).
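A short sketch of the three kinds of manipulation mentioned here, assuming the standard-library xml.dom.minidom module (the book's listing of methods may belong to a different DOM package). The sample XML is made up for illustration.

```python
from xml.dom import minidom

dom = minidom.parseString(
    "<SEQUENCEDATA><SEQ><ID>U03518</ID></SEQ></SEQUENCEDATA>"
)
seq = dom.getElementsByTagName("SEQ")[0]

# Add a new node: create the element, give it a #text child, attach it.
name = dom.createElement("NAME")
name.appendChild(dom.createTextNode("Aspergillus awamori"))
seq.appendChild(name)

# Remove a node (here the ID element, purely to show the call).
id_node = seq.getElementsByTagName("ID")[0]
seq.removeChild(id_node)

# Set an attribute.
seq.setAttribute("type", "dna")

# Serialise the modified subtree back to XML text.
print(seq.toxml())
```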

9 Convert old format XML sequence to new format
Old format: sequence type has its own tag TYPE
    Nodes: SEQUENCEDATA, TYPE, SEQ, DATA, ID, NAME
New format: sequence type is attribute of SEQ tag
    Nodes: SEQUENCEDATA, SEQ (type), DATA, ID, NAME

10 old_xml2i.py
Add a new method to the original xml2i.py and call it after parsing the XML file.
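Such a conversion step could look roughly like this. It is a sketch under assumptions: it uses xml.dom.minidom, the function name old_to_new is hypothetical, and it assumes the TYPE element sits somewhere below its SEQ element in the old format, which the slides do not state explicitly.

```python
from xml.dom import minidom

# Made-up old-format sample; the placement of TYPE inside SEQ is an assumption.
OLD_FORMAT = """<SEQUENCEDATA>
  <SEQ>
    <TYPE>dna</TYPE>
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcggaaggatcattaccg</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def old_to_new(dom):
    """Turn each TYPE element into a type attribute on its SEQ, in place."""
    for seq in dom.getElementsByTagName("SEQ"):
        for type_node in seq.getElementsByTagName("TYPE"):
            text = "".join(
                c.data for c in type_node.childNodes
                if c.nodeType == c.TEXT_NODE
            )
            seq.setAttribute("type", text.strip())
            # Detach the old TYPE element from whichever node holds it.
            type_node.parentNode.removeChild(type_node)
    return dom


dom = old_to_new(minidom.parseString(OLD_FORMAT))
seq = dom.getElementsByTagName("SEQ")[0]
```

After this step the tree has the new-format shape, so the unchanged extraction code that reads the type attribute can run on it directly.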

11 old_xml2phylip.py
– Import the new module
– Check that the type information is saved in the Isequence (not used in phylip format)

12 Testing on old format XML sequence
U03518b.xml (text content):
dna
Aspergillus awamori
U03518
aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatc
cgtgtctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgcc
ccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgatt
gaatgcaatcagttaaaactttcaacaatggatctcttggttccggc
Command and output:
python old_xml2phylip.py U03518b.xml U03518b sequence is of type dna

13 Remark: book uses old version of DOM parser
– XML examples in the book won't work (except the revised fig16.04)
– Look in the presented example programs to see what you have to import
– All the methods and attributes of a DOM tree on pages 537ff are the same

