XML: An introduction in relation to semistructured data
Sebastian Bitzer
Seminar Semistructured Data, University of Osnabrueck
May 2, 2003
Overview
  Background / History
  Basic syntax
  XML and semistructured data
  Document type definitions
  Extensions for XML
  Paraphernalia
Background / History
  – SGML
  – SGML, HTML and XML
  – World Wide Web Consortium
Standard Generalized Markup Language (SGML)
  models information exclusively on the basis of its internal structure and its function
  platform-independent storage of structured information
  standard: ISO 8879, published in 1986
SGML, HTML and XML
  HTML is one special application of SGML (SGML applied to the web)
  XML is a subset of SGML
Why XML from SGML?
  SGML:
    – is exceedingly complex and difficult to understand
    – is formally so complex that online applications have difficulty processing it in reasonable time
    – has many properties that were not designed for use in network environments (remember that it is a standard from 1986)
World Wide Web Consortium
  Nov 1996: initial XML draft
  Dec 1997: XML 1.0 Proposed Recommendation
  Feb 1998: W3C Recommendation: Extensible Markup Language (XML) 1.0
  Oct 2000: XML 1.0, 2nd edition
Basic syntax
  – Elements
  – Attributes
  – Well-formed XML documents
Elements
  element = <tag> content </tag>; <tag> and </tag> are the markup
  content = the structures between the markup
  no predefined tags
  basic content (without markup) is treated as text: PCDATA (Parsed Character Data)
  abbreviation for empty elements: <tag/> for <tag></tag>
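Example
  <person>
    <name>John Cage</name>
    <function>Bearer</function>
  </person>
  <person>
    <name>Elaine Vassal</name>
    <function>chief secretary</function>
  </person>
  …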
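As a rough illustration of how such element structure looks to a program, the following sketch parses a small document with Python's standard xml.etree.ElementTree module. The element names person, name and function follow the ssd expression used later in the talk; the wrapper element staff and the empty element phone are assumptions added only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <staff> and <phone/> are illustrative assumptions.
DOC = """
<staff>
  <person><name>John Cage</name><function>Bearer</function></person>
  <person><name>Elaine Vassal</name><function>chief secretary</function><phone/></person>
</staff>
"""

root = ET.fromstring(DOC)

# Each element consists of markup (its tag) and content (text and/or subelements).
people = [(p.findtext("name"), p.findtext("function")) for p in root.findall("person")]

# The abbreviated empty element <phone/> parses like <phone></phone>:
# it has no text and no children.
phone = root.find("person/phone")
phone_is_empty = phone is not None and phone.text is None and len(phone) == 0
```

Here people holds one (name, function) pair per person element, in document order.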
Attributes
  sometimes called "property" in data models
  (name="value") pairs
  value is always a string (DTD types such as CDATA or NMTOKEN only restrict which strings are allowed)
  allow elements to be grouped
  ambiguity: should information be an attribute or an element?
Example
  John Cage, Bearer
  Elaine Vassal, chief secretary
  …
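The markup of the attribute example did not survive, so the following is only a sketch of the idea, not the slide's original. It assumes hypothetical attribute names id and function and a hypothetical wrapper element staff, and it shows the ambiguity named above: the same piece of information may arrive as an attribute or as a subelement.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the attributes "id" and "function" and the wrapper
# element <staff> are assumptions made for this sketch.
DOC = """
<staff>
  <person id="p1" function="Bearer"><name>John Cage</name></person>
  <person id="p2"><name>Elaine Vassal</name><function>chief secretary</function></person>
</staff>
"""

root = ET.fromstring(DOC)

def function_of(person):
    """Return the function whether it was encoded as an attribute or as a subelement."""
    return person.get("function") or person.findtext("function")

# Attribute values are always delivered as strings.
functions = {p.get("id"): function_of(p) for p in root.findall("person")}
```

A consumer that wants to be robust against both encodings has to check both places, which is exactly the cost of the ambiguity.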
Well-formed XML documents
  an XML document is well-formed if:
    – tags nest properly (not <a><b></a></b>)
    – attributes are unique within one element (not <a x="1" x="2">)
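Both conditions can be observed with any conforming parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree (a non-validating parser that rejects documents that are not well-formed); the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<a><b></b></a>")                     # proper nesting
bad_nesting = is_well_formed("<a><b></a></b>")            # tags overlap
duplicate_attribute = is_well_formed('<a x="1" x="2"/>')  # attribute repeated
```

Only the first document is accepted; the other two are rejected before any DTD comes into play.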
XML and semistructured data
  – Simple transformations
  – Differences that make transformation more difficult
  – Additional constructs
Simple transformations
  with basic XML syntax (no attributes, tree as data structure):
  from XML to ssd:
    <person><name>John Cage</name><function>Bearer</function></person>
    {person : {name : "John Cage", function : "Bearer"}}
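The direction from XML to ssd can be sketched in a few lines. This is an illustration under assumptions, not a definitive mapping: the function name to_ssd is hypothetical, attributes and mixed content are ignored, and an ssd object is represented as a list of (label, value) pairs so that repeated labels survive.

```python
import xml.etree.ElementTree as ET

def to_ssd(elem):
    """Map an attribute-free element to a (label, value) pair.

    Leaves become strings; inner elements become lists of (label, value) pairs.
    """
    children = list(elem)
    if not children:
        return (elem.tag, (elem.text or "").strip())
    return (elem.tag, [to_ssd(child) for child in children])

ssd = to_ssd(ET.fromstring(
    "<person><name>John Cage</name><function>Bearer</function></person>"
))
```

The result corresponds to {person : {name : "John Cage", function : "Bearer"}}, with element names turned into edge labels.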
Simple transformations II
  from ssd to XML (transformation function T):
    T(atomic value) = atomic value
    T({l_1 : v_1, …, l_n : v_n}) = <l_1> T(v_1) </l_1> … <l_n> T(v_n) </l_n>
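The transformation T wraps each value in an element named after its label. A minimal sketch, assuming that ssd objects are given as Python dicts (so labels cannot repeat) and that every label is a valid XML name; escaping of text is added here because real text may contain markup characters.

```python
from xml.sax.saxutils import escape

def T(value):
    """T(atomic) = atomic; T({l1: v1, ..., ln: vn}) = <l1>T(v1)</l1>...<ln>T(vn)</ln>."""
    if isinstance(value, dict):
        return "".join(f"<{label}>{T(v)}</{label}>" for label, v in value.items())
    # Atomic values become text; characters such as "<" and "&" are escaped.
    return escape(str(value))

xml_text = T({"person": {"name": "John Cage", "function": "Bearer"}})
```

Because Python dicts keep insertion order, the generated elements appear in the order in which the labels were written, although the ssd model itself does not prescribe one.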
Differences that make transformation more difficult
  different semantics of labels
  element or attribute
  order
  mixing elements and text
Semantics of labels
  XML: graphs with labels on nodes
  ssd: graphs with labels on edges
  [Figure: the record person (name "Alan", age 42) drawn once with node labels and once with edge labels]
  {person : {name : "Alan", age : 42}}
Element or attribute
  ambiguity between representing information as an element or as an attribute
  different possible encodings, in particular in combination with references
  [Figure: two alternative encodings of "some string", drawn as graphs with nodes a, b and c]
Order
  the ssd model is based on unordered collections
  XML elements are ordered
  but: XML attributes are not
  unordered data can be processed more efficiently
  for data exchange, applications can ignore the order of XML elements
Mixing elements and text
  XML allows mixing of PCDATA and subelements:
    XML - An introduction in relation to semistructured data  Sebastian Bitzer
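Mixed content is where the tree view and the ssd view drift apart, because text and subelements are interleaved in order. The sketch below assumes hypothetical element names talk and author and appends a date after the subelement so that text occurs on both sides of it. In xml.etree.ElementTree, text before the first child is stored in text and text after a child in that child's tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names "talk" and "author" and the trailing
# date are assumptions made for this sketch.
elem = ET.fromstring(
    "<talk>XML - An introduction in relation to semistructured data "
    "<author>Sebastian Bitzer</author>, May 2, 2003</talk>"
)

def mixed_content(e):
    """Return the content of e as an ordered list of strings and child elements."""
    parts = [e.text] if e.text else []
    for child in e:
        parts.append(child)
        if child.tail:
            parts.append(child.tail)
    return parts

parts = mixed_content(elem)
kinds = ["text" if isinstance(p, str) else p.tag for p in parts]
```

The content of talk comes back as text, then the author element, then text again, which a purely label-based ssd record has no direct slot for.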
Additional constructs in XML
  comments: <!-- … -->
  processing instructions: <?target … ?>
  CDATA sections (for escaping): <![CDATA[ … ]]>
  entities, e.g. "ä"
  external files can also be declared as entities, e.g. a gif file as "&pic-1;"
Document type definitions
  – DTDs as grammars
  – DTDs as schemas
  – Attributes
  – Valid XML documents
  – Limitations
DTDs as grammars
  a document type definition (DTD) serves as the grammar for the underlying XML document
  it is precisely a context-free grammar (non-terminal → ordered list of one or more terminals and non-terminals)
  it can be recursive
Definitions
  DTD: <!DOCTYPE root-element [ element definitions ]>
  element definition: <!ELEMENT name (content model)>
  …
  content model: ordered list of the names of elements that can occur in the outer element
Variations of content model
  <!ELEMENT r1 (a?, b*, (c | d+))>
  means that elements of type "r1" contain:
    – 0 or 1 "a" ("a" is optional) and
    – arbitrarily many "b" (0 to ∞) and
    – either exactly 1 "c" ("c" is obligatory) or at least 1 "d" ("d" is required)
  groups can be built, too: the content model ((a, b)+, c?)
  means: at least one sequence of "a" followed by "b" comes in front of the optional "c"
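A content model behaves like a regular expression over the names of the child elements. The sketch below assumes that the two models described above are (a?, b*, (c | d+)) and ((a, b)+, c?). The encoding of a child sequence as space-terminated names and the function name matches are choices made for this illustration only.

```python
import re

# Each child name is written as "name " so that names cannot run together.
R1 = re.compile(r"(a )?(b )*(c |(d )+)")   # (a?, b*, (c | d+))
R2 = re.compile(r"(a b )+(c )?")           # ((a, b)+, c?)

def matches(model, children):
    """Check whether the ordered list of child element names fits the content model."""
    return model.fullmatch("".join(name + " " for name in children)) is not None

r1_ok = matches(R1, ["a", "b", "b", "c"])       # optional a, two b, one c
r1_only_d = matches(R1, ["d", "d"])             # a and b omitted, d repeated
r1_missing = matches(R1, ["a"])                 # neither c nor d present
r1_wrong_order = matches(R1, ["b", "a", "c"])   # a must come before b
r2_ok = matches(R2, ["a", "b", "a", "b", "c"])  # two (a, b) groups, then c
r2_no_group = matches(R2, ["c"])                # at least one (a, b) group is required
```

The rejected sequences show that a content model constrains both which children occur and the order in which they occur.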
DTDs as schemas
  DTD:
    <!DOCTYPE db [
      <!ELEMENT r1 (a, b, c)>
      <!ELEMENT r2 (c, d)>
      …
    ]>
  can be seen as a representation of the relational schema r1(a,b,c), r2(c,d)
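To make the relational reading concrete, the sketch below stores two relations in one document and reads them back as tuples. The declarations for db and for the PCDATA leaves, as well as all data values, are assumptions made for this example. Note that xml.etree.ElementTree reads the internal DTD subset but does not validate against it.

```python
import xml.etree.ElementTree as ET

# Assumed DTD: only r1(a,b,c) and r2(c,d) come from the slide; the rest is illustrative.
DOC = """<!DOCTYPE db [
  <!ELEMENT db (r1*, r2*)>
  <!ELEMENT r1 (a, b, c)>
  <!ELEMENT r2 (c, d)>
  <!ELEMENT a (#PCDATA)>
  <!ELEMENT b (#PCDATA)>
  <!ELEMENT c (#PCDATA)>
  <!ELEMENT d (#PCDATA)>
]>
<db>
  <r1><a>a1</a><b>b1</b><c>c1</c></r1>
  <r1><a>a2</a><b>b2</b><c>c2</c></r1>
  <r2><c>c2</c><d>d2</d></r2>
</db>"""

root = ET.fromstring(DOC)

def rows(relation, columns):
    """Read every <relation> element as one tuple over the given columns."""
    return [tuple(t.findtext(col) for col in columns) for t in root.findall(relation)]

r1 = rows("r1", ("a", "b", "c"))
r2 = rows("r2", ("c", "d"))
```

Each r1 or r2 element plays the role of a tuple, and its children play the role of attribute values in the relational sense.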
Declaring attributes
  <!ATTLIST el.name
      att.name1 type1 spec1
      att.name2 type2 spec2
      … >
  el.name: the element that the attributes modify
  type: often "CDATA", but also more restricted, e.g. "(m|f)" for male or female in an attribute "sex"
  spec: #REQUIRED, #IMPLIED, #FIXED or a default value
Unique identifiers
  e.g.:
    <!ATTLIST person
        id       ID     #REQUIRED
        mom      IDREF  #IMPLIED
        dad      IDREF  #IMPLIED
        children IDREFS #IMPLIED>
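ID and IDREF attributes turn the document tree into a graph. The following sketch uses an instance invented for illustration (the family wrapper, the names and the id values are assumptions). Since xml.etree.ElementTree does not know attribute types, it treats id, mom, dad and children as identifiers and references purely by convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance; all names and id values are made up for illustration.
DOC = """
<family>
  <person id="mary" children="jane"><name>Mary</name></person>
  <person id="john" children="jane"><name>John</name></person>
  <person id="jane" mom="mary" dad="john"><name>Jane</name></person>
</family>
"""

root = ET.fromstring(DOC)

# Index every person by its ID value.
by_id = {p.get("id"): p for p in root.iter("person")}

def deref(person, attribute):
    """Follow an IDREF or IDREFS attribute; return the list of referenced elements."""
    return [by_id[ref] for ref in person.get(attribute, "").split()]

mom_of_jane = [p.findtext("name") for p in deref(by_id["jane"], "mom")]
children_of_john = [p.findtext("name") for p in deref(by_id["john"], "children")]
```

An IDREFS value is a whitespace-separated list of identifiers, which is why the same helper can handle both IDREF and IDREFS attributes.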
Valid XML documents
  an XML document is valid if it:
    – is well-formed
    – additionally has a DTD
    – conforms to that DTD:
        elements are nested only as described in the DTD
        only attributes allowed by the DTD are used
        all attributes of type ID have distinct values
        all IDREF and IDREFS values refer to existing identifiers
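The two identifier conditions can be checked without a full validator. The sketch below covers only those two conditions, not element nesting or attribute declarations. It hard-codes the attribute names from the person declaration of the previous slide instead of reading them from a DTD; the function name id_errors and the test documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ID_ATTR = "id"                          # attribute assumed to have type ID
REF_ATTRS = ("mom", "dad", "children")  # attributes assumed to be IDREF / IDREFS

def id_errors(root):
    """Report duplicate ID values and references to identifiers that do not exist."""
    ids = Counter(e.get(ID_ATTR) for e in root.iter() if e.get(ID_ATTR) is not None)
    errors = [f"duplicate ID {i!r}" for i, n in ids.items() if n > 1]
    for e in root.iter():
        for attr in REF_ATTRS:
            for ref in e.get(attr, "").split():
                if ref not in ids:
                    errors.append(f"dangling reference {attr}={ref!r}")
    return errors

good = ET.fromstring('<f><person id="a"/><person id="b" mom="a"/></f>')
bad = ET.fromstring('<f><person id="a"/><person id="a" dad="zed"/></f>')
good_errors = id_errors(good)
bad_errors = id_errors(bad)
```

Both test documents are well-formed; only the second one breaks the identifier conditions, which shows that validity is a strictly stronger requirement than well-formedness.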
Limitations of DTDs as schemas (summarized)
  order (content models fix the order of subelements)
  only one atomic type (PCDATA, but no INT etc.)
  names are global (partial solution: namespaces)
  IDREFs are not constrained to a certain type (a "mother" reference should point to a "person")
Extensions for XML
  – DCD
  – Document navigation
Document Content Definitions (DCD)
  goal: making typing more precise
  seems to have been abandoned
  more recent approach: XML Schema, which must e.g.:
    – provide for primitive data typing, including byte, date, integer, sequence, SQL & Java primitive data types, etc.
    – allow creation of user-defined datatypes, such as datatypes that are derived from existing datatypes and which may constrain certain of its properties
    – provide a mechanism for URI reference to a standard semantic understanding of a construct
    – …
XLink & XPointer
  point to arbitrary positions in documents, using IDs or relative positions
  links can be defined externally to both source and target (files)
Paraphernalia
  – RDF
  – Stylesheets
  – SAX and DOM
Resource Description Framework (RDF)
  for representing metadata
  consists of a data model and a syntax
  simple form: edge-labelled graph
  additionally:
    – containers (bag, sequence or alternative)
    – higher-order statements ("John says that …")
Stylesheets
  specify the presentation of data
  Cascading Style Sheets (CSS): associate a presentation with each element type
  Extensible Stylesheet Language (XSL): specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary
SAX and DOM
  Application Programming Interfaces
  Simple API for XML (SAX):
    – standard for event-based parsing
  Document Object Model (DOM): interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
    – parses the whole document and builds a tree representation of it
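The difference between the two interfaces shows up clearly in code. The sketch below uses Python's standard xml.sax and xml.dom.minidom modules as one possible binding of SAX and DOM. The sample document and the class name NameCollector are assumptions made for illustration.

```python
import xml.sax
from xml.dom import minidom

DOC = ("<staff><person><name>John Cage</name></person>"
       "<person><name>Elaine Vassal</name></person></staff>")

class NameCollector(xml.sax.ContentHandler):
    """SAX: react to parsing events as they stream by; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.names, self._in_name, self._buf = [], False, []

    def startElement(self, tag, attrs):
        if tag == "name":
            self._in_name, self._buf = True, []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered until the end tag.
        if self._in_name:
            self._buf.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._buf))
            self._in_name = False

handler = NameCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_names = handler.names

# DOM: parse the whole document into a tree, then navigate and update it.
dom = minidom.parseString(DOC)
dom_names = [n.firstChild.data for n in dom.getElementsByTagName("name")]
dom.getElementsByTagName("name")[0].firstChild.data = "J. Cage"
updated = dom.documentElement.toxml()
```

SAX needs memory only for the state the handler keeps, while DOM holds the entire document but allows random access and in-place updates.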
Outlook
  Database issues:
    – How are we going to model XML? (graphs)
    – How are we going to query XML? (XML-QL)
    – How are we going to store XML? (in a relational database? object-oriented?)
    – How are we going to process XML efficiently? (uh… well..., um..., ah..., get some good grad students!)
  Raghu Ramakrishnan
References
  S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Publishers, San Francisco, 2000
  H. Lobin, Informationsmodellierung in XML und SGML, Berlin, Heidelberg, 2000
  World Wide Web Consortium, Extensible Markup Language (XML)