XML Files and ElementTree


XML Files and ElementTree BCHB524 2013 Lecture 13 10/16/2013 BCHB524 - 2013 - Edwards

Outline
  XML: eXtensible Markup Language
  Python module ElementTree
  Exercises

XML: eXtensible Markup Language
  Ubiquitous in bioinformatics, internet, everywhere
  Most in-house data formats being replaced with XML
  Information is structured and named
  Can be checked for correct syntax and correct semantics (to a point)

XML: Advantages
  Structured - records, lists, trees
  Self-documenting, to a point
  Hierarchical
  Can be changed incrementally
  Good generic parsers exist
  Platform independent

XML: Disadvantages
  Verbose!
  Less good for binary data, numbers, sequence
  All data are strings
  Hierarchy isn't always a good fit to the data
  Many ways to represent the same data
  Problems of data semantics remain
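The "all data are strings" point shows up directly in ElementTree: attribute values and element text come back as strings, so numeric fields have to be converted explicitly. A minimal sketch, written for Python 3 (the slides' own code uses Python 2 print statements), using a single ingredient element in the style of this lecture's bread recipe; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# One ingredient element in the style of the lecture's bread recipe.
ele = ET.fromstring('<ingredient amount="8" unit="dL">Flour</ingredient>')

raw_amount = ele.attrib["amount"]   # attribute values are always strings
amount = int(raw_amount)            # convert explicitly before doing arithmetic

print(type(raw_amount).__name__, repr(raw_amount))
print(amount * 2, ele.attrib["unit"], ele.text)
```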

XML: Examples
    <?xml version="1.0"?>
    <!-- Bread recipe description -->
    <recipe name="bread" prep_time="5 mins" cook_time="3 hours">
      <title>Basic bread</title>
      <ingredient amount="8" unit="dL">Flour</ingredient>
      <ingredient amount="10" unit="grams">Yeast</ingredient>
      <ingredient amount="4" unit="dL" state="warm">Water</ingredient>
      <ingredient amount="1" unit="teaspoon">Salt</ingredient>
      <instructions>
        <step>Mix all ingredients together.</step>
        <step>Knead thoroughly.</step>
        <step>Cover with a cloth, and leave for one hour in warm room.</step>
        <step>Knead again.</step>
        <step>Place in a bread baking tin.</step>
        <step>Bake in the oven at 180(degrees)C for 30 minutes.</step>
      </instructions>
    </recipe>

XML: Examples
    recipe
      title: Basic bread
      ingredient: Flour
      ingredient: Salt
      instructions
        step: Mix all ingredients together.
        step: Bake in the oven at 180(degrees)C for 30 minutes.

XML: Well-formed XML
  All XML elements must have a closing tag
  XML tags are case sensitive
  All XML elements must be properly nested
  All XML documents must have a root tag
  Attribute values must always be quoted
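The well-formedness rules above can be checked mechanically: a parser rejects any document that breaks them. The following is a minimal sketch in Python 3 (the slides' own code uses Python 2); the helper name is_well_formed and the sample strings are made up for illustration, one per rule.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples: one valid document, then one violation per rule.
samples = {
    "ok": '<recipe name="bread"><title>Basic bread</title></recipe>',
    "missing closing tag": "<recipe><title>Basic bread</recipe>",
    "case mismatch": "<recipe><Title>Basic bread</title></recipe>",
    "bad nesting": "<a><b></a></b>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<recipe name=bread></recipe>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

Only the first sample parses. Note that this tests syntax only; checking that a document follows a particular schema is a separate step.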

XML: Bioinformatics
  All major bioinformatics sites provide some form of XML data
  Paul Gordon's List (a bit out of date): http://www.visualgenomics.ca/gordonp/xml/
  Let's look at SwissProt: http://www.uniprot.org/uniprot/Q9H400

XML: UniProt Entry
    <?xml version='1.0' encoding='UTF-8'?>
    <uniprot xmlns="http://uniprot.org/uniprot"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
      <entry dataset="Swiss-Prot" created="2005-12-20" modified="2011-09-21" version="77">
        <accession>Q9H400</accession>
        <accession>E1P5K5</accession>
        <accession>E1P5K6</accession>
        <accession>Q5JWJ2</accession>
        <accession>Q6XYB3</accession>
        <accession>Q9NX69</accession>
        <name>LIME1_HUMAN</name>
        <protein>
          <recommendedName>
            <fullName>Lck-interacting transmembrane adapter 1</fullName>
            <shortName>Lck-interacting membrane protein</shortName>
          </recommendedName>
          <alternativeName>
            <fullName>Lck-interacting molecule</fullName>
          </alternativeName>
        </protein>
        <gene>
          <name type="primary">LIME1</name>
          <name type="synonym">LIME</name>
          <name type="ORF">LP8067</name>
        </gene>
        ...
      </entry>
    </uniprot>

XML: UniProt Entry
  Web browsers can "layout" the XML document structure
  Elements can be collapsed interactively

ElementTree
  Access the contents of an XML file in a "pythonic" way
  Use iteration to access nested structure
  Use dictionaries to access attributes
  Each element/node is an "Element"
  Google "ElementTree python" for docs
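As a quick illustration of these points, here is a minimal sketch in Python 3 (the slides' own code uses Python 2 print statements). It parses a shortened copy of the bread recipe from a string rather than a file, so it runs without recipe.xml; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the lecture's bread recipe, held in a string.
xml_text = """
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
  <title>Basic bread</title>
  <ingredient amount="8" unit="dL">Flour</ingredient>
  <ingredient amount="4" unit="dL" state="warm">Water</ingredient>
</recipe>
"""

root = ET.fromstring(xml_text.strip())   # root is an Element

print(root.tag)                          # the element's name
print(root.attrib["cook_time"])          # attributes behave like a dictionary
for child in root:                       # iterating yields the direct children
    print(child.tag, child.attrib.get("state", "-"), child.text)
```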

Basic ElementTree Usage
    import xml.etree.ElementTree as ET

    # Parse the XML file and get the recipe element
    document = ET.parse("recipe.xml")
    root = document.getroot()

    # What is the root?
    print root.tag

    # Get the (single) title element contained in the recipe element
    ele = root.find('title')
    print ele.tag, ele.attrib, ele.text

    # All elements contained in the recipe element
    for ele in root:
        print ele.tag, ele.attrib, ele.text

    # Finds all ingredients contained in the recipe element
    for ele in root.findall('ingredient'):
        print ele.tag, ele.attrib, ele.text

    # Continued...

Basic ElementTree Usage
    # Continued...

    # Finds all steps contained in the root element
    # There are none!
    for ele in root.findall('step'):
        print "!",ele.tag, ele.attrib, ele.text

    # Gets the instructions element
    inst = root.find('instructions')

    # Finds all steps contained in the instructions element
    for ele in inst.findall('step'):
        print ele.tag, ele.attrib, ele.text

    # Finds all steps contained at any depth in the recipe element
    for ele in root.getiterator('step'):
        print ele.tag, ele.attrib, ele.text
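The difference between searching direct children and searching at any depth can be sketched in Python 3 as follows. This is a hedged restatement of the slide's code, not the lecture's own: it uses iter(), since getiterator() is no longer available in current Python 3 releases, and a shortened recipe held in a string instead of recipe.xml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the lecture's bread recipe.
xml_text = """<recipe name="bread">
  <title>Basic bread</title>
  <instructions>
    <step>Mix all ingredients together.</step>
    <step>Knead thoroughly.</step>
  </instructions>
</recipe>"""

root = ET.fromstring(xml_text)

direct_steps = root.findall("step")        # direct children only: none here
inst = root.find("instructions")
nested_steps = inst.findall("step")        # children of <instructions>
any_depth = list(root.iter("step"))        # any depth below the root
also_any_depth = root.findall(".//step")   # the same search written as a path

print(len(direct_steps), len(nested_steps), len(any_depth), len(also_any_depth))
```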

Basic ElementTree Usage
    import xml.etree.ElementTree as ET

    # Parse the XML file and get the recipe element
    document = ET.parse("recipe.xml")
    root = document.getroot()

    ele = root.find('title')
    print ele.text

    for ele in root.findall('ingredient'):
        print ele.attrib['amount'], ele.attrib['unit'],
        print ele.attrib.get('state',''), ele.text

    print "Instructions:"
    ele = root.find('instructions')
    for i,step in enumerate(ele.findall('step')):
        print i+1, step.text

Basic ElementTree Usage
    import xml.etree.ElementTree as ET

    # Parse the XML file and get the recipe element
    document = ET.parse("recipe.xml")
    root = document.getroot()

    ele = root.find('title')
    title = ele.text

    ingredients = []
    for ele in root.findall('ingredient'):
        ingredients.append([ele.text, ele.attrib['amount'], ele.attrib['unit']])
        if ele.attrib.get('state'):
            ingredients[-1].append(ele.attrib['state'])

    ele = root.find('instructions')
    steps = []
    for step in ele.findall('step'):
        steps.append(step.text)

    # Continued...

Basic ElementTree Usage
    # Continued...

    print "====",title,"===="

    print "Instructions:"
    for i,inst in enumerate(steps):
        print " ",i+1, inst

    print "Ingredients:"
    for indg in sorted(ingredients):
        print " "," ".join(indg[1:]+indg[:1])

Advanced ElementTree Usage
  Use iterparse when the file is mostly a long list of specific items (single tag) and you need to examine each one in turn.
  Call clear() when done with each item.

    import xml.etree.ElementTree as ET

    for event,ele in ET.iterparse("recipe.xml"):
        print event,ele.tag,ele.attrib,ele.text

    for event,ele in ET.iterparse("recipe.xml"):
        if ele.tag == 'step':
            print ele.text
            ele.clear()
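A self-contained sketch of the iterparse pattern in Python 3 (the slides' own code uses Python 2). An in-memory byte stream stands in for a large file, which is an assumption made so the example runs without recipe.xml; the steps list is illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large file: iterparse accepts a filename or a file object.
data = io.BytesIO(
    b"<instructions>"
    b"<step>Mix all ingredients together.</step>"
    b"<step>Knead thoroughly.</step>"
    b"<step>Knead again.</step>"
    b"</instructions>"
)

steps = []
# With "end" events (the default), each element is delivered once its
# closing tag has been read, so its text and children are complete.
for event, ele in ET.iterparse(data, events=("end",)):
    if ele.tag == "step":
        steps.append(ele.text)
        ele.clear()   # drop the element's text and children to free memory

print(len(steps), steps[0])
```

The text is copied out before clear() is called, because clear() resets the element's text, attributes, and children.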

XML Namespaces
    <?xml version='1.0' encoding='UTF-8'?>
    <uniprot xmlns="http://uniprot.org/uniprot"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
      <entry dataset="Swiss-Prot" created="2005-12-20" modified="2011-09-21" version="77">
        <accession>Q9H400</accession>
        <accession>E1P5K5</accession>
        <accession>E1P5K6</accession>
        <accession>Q5JWJ2</accession>
        <accession>Q6XYB3</accession>
        <accession>Q9NX69</accession>
        <name>LIME1_HUMAN</name>
        <protein>
          <recommendedName>
            <fullName>Lck-interacting transmembrane adapter 1</fullName>
            <shortName>Lck-interacting membrane protein</shortName>
          </recommendedName>
          <alternativeName>
            <fullName>Lck-interacting molecule</fullName>
          </alternativeName>
        </protein>
        <gene>
          <name type="primary">LIME1</name>
          <name type="synonym">LIME</name>
          <name type="ORF">LP8067</name>
        </gene>
        ...
      </entry>
    </uniprot>

Advanced ElementTree Usage
    import xml.etree.ElementTree as ET
    import urllib

    thefile = urllib.urlopen('http://www.uniprot.org/uniprot/Q9H400.xml')
    document = ET.parse(thefile)
    root = document.getroot()

    print root.tag,root.attrib,root.text
    for ele in root:
        print ele.tag,ele.attrib,ele.text

    # Without the namespace, the entry element is not found (prints None)
    entry = root.find('entry')
    print entry

    # Prefix the tag with the namespace in curly braces
    ns = '{http://uniprot.org/uniprot}'
    entry = root.find(ns+'entry')
    print entry
    print entry.tag,entry.attrib,entry.text
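The namespace behavior can be reproduced without a network connection. The sketch below is Python 3 (the slides' own code uses Python 2) and parses a cut-down UniProt entry from a string; the prefix "up" and the variable names are arbitrary choices for illustration. It shows both the brace-prefixed tag used on the slide and the prefix map that find and findall accept as a second argument.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the UniProt entry shown in the lecture.
xml_text = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>Q9H400</accession>
    <accession>E1P5K5</accession>
    <name>LIME1_HUMAN</name>
  </entry>
</uniprot>"""

root = ET.fromstring(xml_text)
print(root.tag)               # the tag carries the namespace in curly braces

print(root.find("entry"))     # None: the bare tag name does not match

# Option 1: prepend the namespace in braces, as on the slide.
ns = "{http://uniprot.org/uniprot}"
entry = root.find(ns + "entry")

# Option 2: pass a prefix map; the prefix "up" is an arbitrary label.
nsmap = {"up": "http://uniprot.org/uniprot"}
accessions = [a.text for a in root.findall("up:entry/up:accession", nsmap)]

print(entry.attrib["dataset"], accessions)
```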

Exercise
  Read through the ElementTree tutorials
  Write a program to pick out, and print, the references of an XML format UniProt entry, in a nicely formatted way.

Exercise (Bonus)
  Write a program to count the number of spectra in the file "Data1.mzXML.gz" using ElementTree's iterparse function.
  How many MS (attribute "msLevel" is 1) spectra (tag "scan") are there?
  How many MS/MS (attribute "msLevel" is 2) spectra (tag "scan") are there?
  How many MS/MS spectra have precursor m/z value between 750 and 1000 Da?

Homework 8
  Due Monday, October 21.
  Exercise from Lecture 13
  Bonus exercise from Lecture 13
    Optional!
    Excuse lowest homework score to-date!
  Rosalind exercise 14