XML eXtensible Markup Language
Introduction and Motivation
XML vs. HTML HTML is a HyperText Markup language –Designed for a specific application, namely, presenting and linking hypertext documents XML describes structure and content (“semantics”) –The presentation is defined separately from the structure and the content
An Address Book as an XML document Donald Duck Miki Mouse
Main Features of XML No fixed set of tags –New tags can be added for new applications An agreed upon set of tags can be used in many applications –Namespaces facilitate uniform and coherent descriptions of data For example, a namespace for address books determines whether to use or
Main Features of XML (cont’d) XML has the concept of a schema –DTD and the more expressive XML Schema XML is a data model –Similar to the semistructured data model XML supports internationalization (Unicode) and platform independence (an XML file is just a character file)
XML is Self-Describing Data Traditionally, a data file is just a bit stream Only a program that reads or writes this file has the details about –How to break the bit stream into records –How to break each record into fields –The type of each data field Over the years, companies retained valuable data (e.g., on magnetic tapes), but lost the programs that have the above information –As a result, the data was practically lost It cannot happen with XML data
XML is the Standard for Data Exchange Web services (e.g., ecommerce) require exchanging data between various applications that run on different platforms XML (augmented with namespaces) is the preferred syntax for data exchange on the Web
XML is not Alone XML Schemas strengthen the data-modeling capabilities of XML (in comparison to XML with only DTDs) XPath is a language for accessing parts of XML documents XLink and XPointer support cross-references XSLT is a language for transforming XML documents into other XML documents (including XHTML, for displaying XML files) –Limited styling of XML can be done with CSS alone XQuery is a language for querying XML documents
The Two Facets of XML Some XML files are just text documents with tags that denote their structure and include some metadata (e.g., an attribute that gives the name of the person who did the proofreading) –See an example on the next slide –XML is a subset of SGML (Standard Generalized Markup Language) Other XML documents are similar to database files (e.g., an address book)
XML can Describe the Structure of a Document Complexity of Computations M. O. Rabin Hebrew University …
XML Syntax W3Schools Resources on XML Syntax
The Structure of XML XML consists of tags and text Tags come in pairs... They must be properly nested –good –bad (You can’t do in HTML)
A Useful Abbreviation Abbreviating elements with empty contents: for For example: Lisa Simpson... Note that a tag may have a set of attributes, each consisting of a name and a value
XML Text XML has only one “basic” type – text It is bounded by tags, e.g., The Big Sleep 1935 – 1935 is still text XML text is called PCDATA –(for parsed character data) Characters are Unicode, and any character can be written as a numeric character reference, e.g., &#x05DE; for the Hebrew letter Mem
XML Structure Nesting tags can be used to express various structures, e.g., a tuple (record): Lisa Simpson
XML Structure (cont’d) We can represent a list by using the same tag repeatedly: …
XML Structure (cont’d) Donald Duck Miki Mouse
Terminology The segment of an XML document between an opening and a corresponding closing tag is called an element Bart Simpson 02 – – element element, a sub-element of not an element
An XML Document is a Tree person name tel Bart Simpson 02 – – Note that semistructured data models typically put the labels on the edges, and are arbitrary graphs and not just trees Leaves are either empty or contain PCDATA
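As a rough illustration of the tree view, the sketch below parses a small person document with Python's standard `xml.etree.ElementTree` module and lists one line per node. The phone number and the helper name `walk` are invented for the example and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the phone number is made up for illustration.
DOC = "<person><name>Bart Simpson</name><tel>02-555-0100</tel></person>"

def walk(element, depth=0):
    """Return one indented line per element, showing its tag and its text."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:  # iterating an Element yields its child elements
        lines.extend(walk(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
tree_lines = walk(root)
print("\n".join(tree_lines))
```

Here the labels sit on the nodes, and the PCDATA is attached to the leaf elements name and tel.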
Mixed Content An element may contain a mixture of sub-elements and PCDATA British Airways World’s favorite airline How many leaves are there in the corresponding tree? How many leaves are empty?
The Header Tag –standalone=“no” means that the document may rely on external markup declarations, such as an external DTD –You can leave out the encoding attribute; the processor then assumes UTF-8 unless it detects UTF-16
Processing Instructions Hello, world!
Using CDATA Entering a Kennel Club Member Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this: Sir Fredrick of Ledyard's End ]]> We want to see the text as is, even though it includes tags
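A minimal sketch of the CDATA behavior, using Python's standard `xml.etree.ElementTree`. The wrapper element `instructions` and the attribute values are assumptions made for this example; only the NAME tag and its common and breed attributes come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element and attribute values, invented for illustration.
DOC = (
    "<instructions>Your entry should look something like this: "
    "<![CDATA[<NAME common=\"freddy\" breed=\"spaniel\">"
    "Sir Fredrick of Ledyard's End</NAME>]]>"
    "</instructions>"
)

root = ET.fromstring(DOC)
children = list(root)  # markup inside a CDATA section creates no child elements
print(root.text)       # the NAME tag arrives as ordinary character data
```

The parser hands the content of the CDATA section to the application as text, so the tags inside it are displayed rather than interpreted.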
A Complete XML Document Lisa Simpson
Well-Formed XML Documents An XML document (with or without a DTD) is well-formed if –Tags are syntactically correct –Every start tag has a matching end tag (or is written as an empty-element tag) –Tags are properly nested –There is a single root element –A start tag does not have two occurrences of the same attribute An XML document must be well formed
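One way to try these rules out is to hand small documents to a parser and see which ones it rejects. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule.
samples = {
    "properly nested": "<a><b>text</b></a>",
    "badly nested": "<a><b>text</a></b>",
    "missing end tag": "<a><b>text</a>",
    "two roots": "<a/><b/>",
    "duplicate attribute": '<a x="1" x="2"/>',
}
results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(label, "->", ok)
```

Note that this only checks well-formedness; no DTD is consulted.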
DTD (Document Type Definition) Imposing Structure on XML Documents (W3Schools on DTDs)
Motivation A DTD adds syntactical requirements in addition to the well-formed requirement It helps in eliminating errors when creating or editing XML documents It clarifies the intended semantics It simplifies the processing of XML documents
An Example In an address book, where can a phone number appear? –Under, under or under both? If we have to check for all possibilities, processing takes longer and it may not be clear to whom a phone number belongs –We would like to know that a phone number is allowed to appear under both a department and the manager of that department –If we don’t know that and there is only one phone number, we may not know whether it serves both the department and its manager or just one of them
Document Type Definitions Document Type Definitions (DTDs) impose structure on XML documents There is some relationship between a DTD and a schema, but it is not close –Hence, the need for additional “typing” systems (XML schemas) The DTD is a syntactic specification
Example: An Address Book Homer Simpson Dr. H. Simpson 1234 Springwater Road Springfield USA, (321) (321) Mixed telephones and faxes As many as needed As many address lines as needed (in order) At most one greetingExactly one name
Specifying the Structure name to specify a name element greet? to specify an optional (0 or 1) greet elements name, greet? to specify a name followed by an optional greet
Specifying the Structure (cont’d) addr*to specify 0 or more address lines tel | faxa tel or a fax element (tel | fax)* 0 or more repeats of tel or fax *0 or more elements
Specifying the Structure (cont’d) So the whole structure of a person entry is specified by name, greet?, addr*, (tel | fax)*, * This is known as a regular expression Why is it important?
Summary of Regular Expressions AThe tag (i.e., element) A occurs e1,e2The expression e1 followed by e2 e*0 or more occurrences of e e?Optional: 0 or 1 occurrences e+1 or more occurrences e1 | e2either e1 or e2 (e)grouping
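To make the regular-expression reading concrete, the sketch below checks the children of a person element against the part of the content model whose element names are given, name, greet?, addr*, (tel | fax)*, by translating it by hand into a Python regular expression over the sequence of child tag names. The translation, the helper `matches_model`, and the phone numbers are assumptions for illustration; a real validating parser works from the DTD itself.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)* rewritten as a regex over the
# space-terminated sequence of child tag names.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*")

def matches_model(element, model=PERSON_MODEL):
    """Return True if the element's children appear in an allowed sequence."""
    child_tags = "".join(child.tag + " " for child in element)
    return model.fullmatch(child_tags) is not None

# Hypothetical entries; the phone numbers are made up.
good = ET.fromstring(
    "<person><name>Homer Simpson</name><greet>Dr. H. Simpson</greet>"
    "<addr>1234 Springwater Road</addr><addr>Springfield USA</addr>"
    "<tel>(321) 555-0100</tel><fax>(321) 555-0101</fax></person>"
)
bad = ET.fromstring(
    "<person><greet>Dr. H. Simpson</greet><name>Homer Simpson</name></person>"
)
print(matches_model(good), matches_model(bad))
```

Each DTD operator maps onto its regex counterpart: the comma becomes concatenation, and ?, *, | and parentheses keep their meaning.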
The Definition of an Element Consists of Exactly One of the Following A regular expression (as defined earlier) EMPTY means that the element has no content ANY means that the content can be any mixture of PCDATA and elements defined in the DTD Mixed content which is defined as described on the next slide (#PCDATA)
The Definition of Mixed Content Mixed content is described by a repeatable OR group (#PCDATA | element-name | …)* –Inside the group, no regular expressions – just element names –#PCDATA must be first, followed by 0 or more element names that are separated by | –The group can be repeated 0 or more times
An Address-Book XML Document with an Internal DTD <!DOCTYPE addressbook [ <!ELEMENT person (name, greet?, address*, (fax | tel)*, *)> ]> The name of the DTD is addressbook “Internal” means that the DTD and the XML Document are in the same file The syntax of a DTD is not XML syntax
The Rest of the Address-Book XML Document Jeff Cohen Dr. Cohen
Regular Expressions Each regular expression determines a corresponding finite-state automaton Let’s start with a simpler example: name, addr*, name addr This suggests a simple parsing program A double circle denotes an accepting state
Another Example name,address*,(tel | fax)*, * name address tel fax Adding in the optional greet further complicates things
Deterministic Requirement: Content Models must be Deterministic If element-type declarations are deterministic, it is easier to parse XML documents The W3C XML recommendation requires the Glushkov automaton to be deterministic The states of this automaton are the positions of the regular expression (semantic actions) The transitions are based on the “follows set”
Deterministic Requirement (cont’d) The associated automata are succinct A regular language may not have an associated deterministic grammar, e.g., <!ELEMENT ndeter ((movie|director)*,movie,(movie|director))> This is not allowed in a DTD
Some Things are Hard to Specify Each employee element should contain name, age and ssn elements in some order <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Suppose that there were many more fields!
Some Things are Hard to Specify (cont’d) <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Suppose there were many more fields! There are n! different orders of n elements It is not even polynomial
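The growth can be seen by generating the enumeration mechanically. The sketch below builds the flat list of alternatives for a given set of element names; the function name `any_order_model` is hypothetical. It is meant only to show the size of the enumeration: written this flatly, several alternatives start with the same element, so a real DTD would also have to factor them to keep the content model deterministic.

```python
from itertools import permutations
from math import factorial

def any_order_model(names):
    """Build a content model listing every ordering of the given elements."""
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(names)]
    return "(" + " | ".join(alternatives) + ")"

model = any_order_model(["name", "age", "ssn"])
print("<!ELEMENT employee " + model + ">")

# Number of alternatives the same approach would need for 10 fields.
print(factorial(10))
```

With three fields there are 6 alternatives; each added field multiplies the count by the new number of fields.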
Specifying Attributes in the DTD <!ATTLIST height dimension CDATA #REQUIRED accuracy CDATA #IMPLIED > The dimension attribute is required (e.g., dimension=“cm”) The accuracy attribute is optional CDATA is the “type” of the attribute – it means “character data,” and may take any literal string as a value
The Format of an Attribute Definition The default value is given inside quotes
Summary of Attribute Types CDATA (value | … | … ) is an enumeration of allowed values ID, IDREF, IDREFS –to be explained later ENTITY, ENTITIES –to be explained later NMTOKEN, NMTOKENS, NOTATION
Summary of Attribute Default Values #REQUIRED means that the attribute must be included in the element #IMPLIED means that the attribute is optional and has no default value #FIXED “value” –The given value (inside quotes) is the only possible one “value” –The default value of the attribute if none is given
Recursive DTDs <!DOCTYPE genealogy [ <!ELEMENT person (name, dateOfBirth, person, person)> ... ]> (the first nested person is the mother, the second is the father) What is the problem with this? A parser does not notice it! Each person should have a father and a mother. This leads to either infinite data or a person that is a descendant of herself.
Recursive DTDs (cont’d) <!DOCTYPE genealogy [ <!ELEMENT person (name, dateOfBirth, person?, person?)> ... ]> (the first nested person is the mother, the second is the father) What is now the problem with this? If a person only has a mother, how can you tell that he has a mother and does not have a father?
Using ID and IDREF Attributes <!DOCTYPE family [ <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>
IDs and IDREFs ID stands for identifier –No two ID attributes may have the same value (the value must be an XML name, not arbitrary CDATA) IDREF stands for identifier reference –Every value associated with an IDREF attribute must exist as the value of some ID attribute IDREFS specifies several (1 or more) identifier references, separated by whitespace
Some Conforming Data Lisa Simpson Bart Simpson Marge Simpson Homer Simpson
ID References do not Have Types The attributes mother and father are references to IDs of other elements However, those are not necessarily person elements! The mother attribute is not necessarily a reference to a female person
An Alternative Specification <!DOCTYPE family [ ]>
The Revised Data Marge Simpson Homer Simpson Bart Simpson Lisa Simpson
Consistency of ID and IDREF Attribute Values If an attribute is declared as ID –The associated value must be distinct, i.e., different elements (in the given document) must have different values for the ID attribute (no confusion) Even if the two elements have different element names If an attribute is declared as IDREF –The associated value must exist as the value of some ID attribute (no dangling “pointers”) Similarly for all the values of an IDREFS attribute ID, IDREF and IDREFS attributes are not typed
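The two consistency rules can be sketched as a small check written with Python's standard `xml.etree.ElementTree`, which does not validate against a DTD by itself. The attribute names id, mother, father and children follow the ATTLIST for person; the ID values, the wrapper element `family`, and the helper `check_references` are invented for illustration, and the last entry contains a deliberate dangling reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; "abe" is not the ID of any element in the document.
DOC = """
<family>
  <person id="marge" children="bart lisa"><name>Marge Simpson</name></person>
  <person id="homer" children="bart lisa"><name>Homer Simpson</name></person>
  <person id="bart" mother="marge" father="homer"><name>Bart Simpson</name></person>
  <person id="lisa" mother="marge" father="abe"><name>Lisa Simpson</name></person>
</family>
"""

def check_references(root, id_attr="id", ref_attrs=("mother", "father", "children")):
    """Return (duplicate IDs, dangling references) found under root."""
    seen, duplicates = set(), []
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)
            seen.add(value)
    dangling = []
    for element in root.iter():
        for attr in ref_attrs:
            # An IDREFS value is a whitespace-separated list; IDREF is the one-item case.
            for ref in element.get(attr, "").split():
                if ref not in seen:
                    dangling.append((element.get(id_attr), attr, ref))
    return duplicates, dangling

duplicates, dangling = check_references(ET.fromstring(DOC))
print(duplicates, dangling)
```

The check looks at values only, never at element names, which mirrors the point that ID, IDREF and IDREFS attributes are not typed.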
Adding a DTD to the Document A DTD can be internal –The DTD is part of the document file or external –The DTD and the document are in separate files –An external DTD may reside In the local file system (where the document is) In a remote file system
Connecting a Document with its DTD An internal DTD: … ]>... A DTD from the local file system: A DTD from a remote file system:
Valid Documents A well-formed XML document is valid if it conforms to its DTD, that is, –The document conforms to the regular-expression grammar, –The types of attributes are correct, and –The constraints on references are satisfied
DTDs are CFGs (Context-Free Grammars) Checking validity and parsing a document according to a DTD is in polynomial time, using a dynamic-programming algorithm –A element has the same rules regardless of whether it is under a element or a element Note that XML Schemas are capable of describing context-sensitive structures –The complexity is higher
Validating XML Parsers A validating parser is one that checks validity It also converts the XML document to some normal form –For example, it adds attributes that have fixed values (according to the DTD) if these attributes do not appear explicitly
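The normalization step can be observed even with a non-validating parser. The sketch below relies on the assumption that the expat parser behind Python's `xml.etree.ElementTree` reads attribute declarations in the internal DTD subset and supplies declared values for attributes that are missing from the element; it does not check validity. The height document is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose internal DTD fixes the value of one attribute.
DOC = """<!DOCTYPE height [
  <!ELEMENT height (#PCDATA)>
  <!ATTLIST height dimension CDATA #FIXED "cm">
]>
<height>180</height>"""

root = ET.fromstring(DOC)
# The start tag in the document carries no attributes; any attribute
# present here was supplied from the ATTLIST declaration.
print(root.attrib)
```

An application reading the parsed tree therefore sees the same attributes whether or not the author wrote them explicitly.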
DTDs vs. Schemas (or Types) DTDs are rather weak specifications by DB & programming-language standards –Only one base type – PCDATA –No useful “abstractions”, e.g., sets –IDREFs are untyped – the type of the object being referenced is not known –No constraints, e.g., child is inverse of parent –No inheritance –No methods –Tag definitions are global Some extensions of XML impose a schema or types on an XML document More problems with DTD