XML & XML Schema Semantic Web - Spring 2007 Computer Engineering Department Sharif University of Technology
Semantic web - Computer Engineering Dept. - Spring

Outline
Markup Languages
  – SGML, HTML, XML
XML Building Blocks
XML Applications
Namespaces
XML Schema

SGML (ISO 8879)
Standard Generalized Markup Language
The international standard for defining descriptions of structure and content in text documents
Interchangeable: device-independent, system-independent
Tags are not predefined
Uses a DTD to validate the structure of the document
Large, powerful, and very complex
Heavily used in industrial and commercial applications for over a decade

HTML (RFC 1866)
HyperText Markup Language
A small SGML application used on the web (a DTD and a set of processing conventions)
Only uses a predefined set of tags

What is XML?
eXtensible Markup Language
A simplified version of SGML
Maintains the most useful parts of SGML
Designed so that SGML can be delivered over the Web
More flexible and adaptable than HTML
XHTML: a reformulation of HTML 4 in XML 1.0

HTML vs. XML
HTML is used to mark up text so it can be displayed to users. XML is used to mark up data so it can be processed by computers.
HTML describes both structure (e.g., <h1>, <p>) and appearance (e.g., <font>, <i>)
XML describes only content, or “meaning”
HTML uses a fixed, unchangeable set of tags. In XML, you make up your own tags.

HTML vs. XML (2)
HTML is for humans
  – HTML describes web pages
  – You don’t want to see error messages about the web pages you visit
  – Browsers ignore and/or correct as many HTML errors as they can, so HTML is often sloppy
XML is for computers
  – XML describes data
  – The rules are strict and errors are not allowed
  – In this way, XML is like a programming language
  – Current versions of most browsers can display XML
  – However, browser support of XML is spotty at best

XML-related technologies
DTD (Document Type Definition) and XML Schemas are used to define legal XML tags and their attributes for particular purposes
XSLT (eXtensible Stylesheet Language Transformations) and XPath are used to translate from one form of XML to another
SAX (Simple API for XML) is an event-driven API for parsing XML

XML Building blocks - Elements
Delimited by angle brackets
Identify the nature of the content they surround
General format: <tag>…</tag>
Empty element: <tag/>
XML elements have relationships
  – Elements are related as parents and children
Elements have content
  – Elements can have different content types: element, mixed, simple, empty

XML Building blocks - Attributes
Name-value pairs that occur inside start tags after the element name, like: <tag name="value">
Provide additional information about elements that often is not a part of the data
Attributes and elements are somewhat interchangeable
Should I use an element or an attribute?
  – Example using just elements: <name><first>David</first><last>Matuszek</last></name>
  – Example using attributes: <name first="David" last="Matuszek"/>
  – A common guideline: metadata (data about data) should be stored as attributes, and the data itself should be stored as elements

XML Building blocks - Entities
Five special characters must be written as entities:
  – &amp; for & (almost always necessary)
  – &lt; for < (almost always necessary)
  – &gt; for > (not usually necessary)
  – &quot; for " (necessary inside double quotes)
  – &apos; for ' (necessary inside single quotes)
These entities can be used even in places where they are not absolutely required
These are the only predefined entities in XML

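The five predefined entities can be seen together in a small fragment. This is only an illustrative sketch; the element and attribute names (note, title, condition, author) are made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical element and attribute names, for illustration only -->
<note title="The &quot;R&amp;D&quot; memo">
  <!-- Parsed text: price < 100 && stock > 0 -->
  <condition>price &lt; 100 &amp;&amp; stock &gt; 0</condition>
  <!-- &apos; is optional here; it is only needed inside single-quoted attribute values -->
  <author>O&apos;Brien</author>
</note>
```
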
XML Building blocks - Declaration
The XML declaration looks like this: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  – The XML declaration is optional in XML 1.0, but including it is recommended (so include it!)
  – If present, the XML declaration must come first; not even whitespace may precede it
  – Note that the brackets are <? and ?>
  – version="1.0" is required (XML 1.1 exists but is rarely used)
  – encoding can be "UTF-8" (a Unicode encoding that is a superset of ASCII), "UTF-16" (another Unicode encoding), or something else, or it can be omitted
  – standalone="yes" declares that the document does not depend on markup declarations in an external DTD

XML Building blocks - Processing instructions
PIs (Processing Instructions) may occur anywhere in the XML document after the XML declaration (but usually near the top)
A PI is a command to the program processing the XML document to handle it in a certain way
XML documents are typically processed by more than one program
Programs that do not recognize a given PI should just ignore it
General format of a PI: <?target instructions?>

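One widely used PI is xml-stylesheet, which asks a browser or other processor to apply a style sheet. The sketch below assumes a hypothetical style sheet file notes.css and a made-up note element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The target is "xml-stylesheet"; the rest is data for the program that understands it -->
<?xml-stylesheet type="text/css" href="notes.css"?>
<note>A processor that does not recognize the PI above simply skips it.</note>
```
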
XML Building blocks - Comments
Comments can be put almost anywhere in an XML document, but not inside a tag and not before the XML declaration
Comments are useful for:
  – Explaining the structure of an XML document
  – Commenting out parts of the XML during development and testing
The character sequence -- cannot occur in the comment
Comments are not displayed by browsers, but can be seen by anyone who looks at the source code

CDATA
By default, all text inside an XML document is parsed
You can force text to be treated as unparsed character data by enclosing it in <![CDATA[ and ]]>
Any characters, even & and <, can occur inside a CDATA section
Whitespace inside a CDATA section is (usually) preserved
The only real restriction is that the character sequence ]]> cannot occur inside a CDATA section
CDATA is useful when your text has a lot of illegal characters (for example, if your XML document contains some HTML text)

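A short sketch of a CDATA section holding an HTML fragment. The element names (example, description, code) are hypothetical; the point is that & and < need no escaping between the CDATA delimiters.

```xml
<?xml version="1.0"?>
<example>
  <description>An HTML fragment stored as plain text:</description>
  <!-- Everything between the CDATA delimiters is character data, not markup -->
  <code><![CDATA[<p>Tom & Jerry: if (a < b && b > c) ...</p>]]></code>
</example>
```
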
XML Syntax
All XML elements must have a closing tag
XML tags are case sensitive
All XML elements must be properly nested
All XML documents must have a root tag
Attribute values must always be quoted
With XML, white space is preserved
With XML, a new line is always stored as LF
Comments in XML: <!-- This is a comment -->

Well-formed XML
Every element must have both a start tag and an end tag, e.g. <name>...</name>
  – But empty elements can be abbreviated: <name/>
  – XML tags are case sensitive
  – XML tags may not begin with the letters xml, in any combination of cases
Elements must be properly nested, e.g. not <b><i>bold and italic</b></i>
Every XML document must have one and only one root element
The values of attributes must be enclosed in single or double quotes, e.g. <name attribute="value">
Character data cannot contain < or &

Displaying XML
XML documents do not carry information about how to display the data
We can add display information to XML with
  – CSS (Cascading Style Sheets)
  – XSL (eXtensible Stylesheet Language), which is preferred

XML Applications (1): Separate data
XML can separate data from HTML
  – Store data in separate XML files
  – Use HTML for layout and display
Using Data Islands
  – Data Islands can be bound to HTML elements
Benefits:
  – Changes in the underlying data will not require any changes to your HTML

XML Applications (2): Exchange data
XML is used to exchange data
  – Text format
  – Software-independent, hardware-independent
  – Exchange data between incompatible systems, given that they agree on the same tag definitions
  – Can be read by many different types of applications
Benefits:
  – Reduces the complexity of interpreting data
  – Makes it easier to expand and upgrade a system

XML Applications (3): Store data
XML can be used to store data
  – Plain text file
  – Store data in files or databases
  – Applications can be written to store and retrieve information from the store
  – Other clients and applications can access your XML files as data sources
Benefits:
  – Accessible to more applications

XML Applications (4): Create new languages
XML can be used to create new languages, e.g.:
  – WML (Wireless Markup Language), used to mark up Internet applications for handheld devices like mobile phones (WAP)
  – MusicXML, used for publishing musical scores

Names in XML
Names (as used for tags and attributes) must begin with a letter or underscore, and can consist of:
  – Letters, both Roman (English) and foreign
  – Digits, both Roman and foreign
  – . (dot)
  – - (hyphen)
  – _ (underscore)
  – : (colon), which should be used only for namespaces
  – Combining characters and extenders (not used in English)

Namespaces
Namespaces are a simple mechanism for creating globally unique names for the elements and attributes of your markup language
Benefits:
  – De-conflicts the meaning of identical names in different markup languages
  – Allows different markup languages to be mixed together without ambiguity
Namespaces are implemented by letting an XML name consist of two parts, a prefix and a local part, written prefix:localPart

Namespaces and URIs
A namespace is defined as a unique string
  – To guarantee uniqueness, typically a URI (Uniform Resource Identifier) is used, because the author “owns” the domain
  – It doesn't have to be a “real” URI; it just has to be a unique string
There are two ways to use namespaces:
  – Declare a default namespace
  – Associate a prefix with a namespace, then use the prefix in the XML to refer to the namespace

Namespace syntax
In any start tag you can use the reserved attribute name xmlns: <element xmlns="namespaceURI">
  – This namespace will be used as the default for all elements up to the corresponding end tag
  – You can override it with a specific prefix
You can use almost this same form to declare a prefix: <element xmlns:prefix="namespaceURI">
  – Use this prefix on every tag and attribute you want to use from this namespace, including end tags; it is not a default prefix
You can use the prefix in the start tag in which it is defined: <prefix:element xmlns:prefix="namespaceURI">

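Both ways of using namespaces can be combined in one document. In this sketch the two namespace URIs, the pub prefix, and all element names are invented for illustration.

```xml
<?xml version="1.0"?>
<!-- Both namespace URIs below are made-up example strings -->
<catalog xmlns="http://example.org/ns/catalog"
         xmlns:pub="http://example.org/ns/publisher">
  <!-- catalog, book and title have no prefix, so they are in the default namespace -->
  <book>
    <title>XML Basics</title>
    <!-- pub:name is in the publisher namespace; the prefix is repeated on the end tag -->
    <pub:name>Example Press</pub:name>
  </book>
</catalog>
```
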
Review of XML rules
Start with <?xml version="1.0"?>
XML is case sensitive
You must have exactly one root element that encloses all the rest of the XML
Every element must have a closing tag
Elements must be properly nested
Attribute values must be enclosed in double or single quotation marks
There are only five pre-declared entities

XML as a tree
An XML document represents a hierarchy; a hierarchy is a tree
novel
  foreword
    paragraph: This is the great American novel.
  chapter number="1"
    paragraph: It was a dark and stormy night.
    paragraph: Suddenly, a shot rang out!

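Written out as markup, the novel tree might look like the following. The nesting is an assumption reconstructed from the node labels on the slide (one paragraph under foreword, two under chapter).

```xml
<?xml version="1.0"?>
<!-- novel is the root; foreword and chapter are its children -->
<novel>
  <foreword>
    <paragraph>This is the great American novel.</paragraph>
  </foreword>
  <chapter number="1">
    <paragraph>It was a dark and stormy night.</paragraph>
    <paragraph>Suddenly, a shot rang out!</paragraph>
  </chapter>
</novel>
```
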
Extended document standards
You can define your own XML tag sets, but here are some already available:
  – XHTML: HTML redefined in XML
  – SMIL: Synchronized Multimedia Integration Language
  – MathML: Mathematical Markup Language
  – SVG: Scalable Vector Graphics
  – DrawML: Drawing MetaLanguage
  – ICE: Information and Content Exchange
  – ebXML: Electronic Business with XML
  – cXML: Commerce XML
  – CBL: Common Business Library

XML Schema
XML Validation
"Well Formed" XML document
  – Correct XML syntax
"Valid" XML document
  – “Well formed”
  – Conforms to the rules of a DTD (or schema)
XML DTD
  – Defines the legal building blocks of an XML document
  – Can be inline in the XML or an external reference
XML Schema
  – An XML-based alternative to DTD, more powerful
  – Supports namespaces and data types

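An Example XML with DTD
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, from, heading, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend</body>
</note>
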
XML Schemas
“Schema” is a general term
  – DTDs are a form of XML schemas
When we say “XML Schemas,” we usually mean the W3C XML Schema Language
  – This is also known as “XML Schema Definition” language, or XSD

XSD vs. DTD
DTDs provide a very weak specification language
  – You can’t put any restrictions on text content
  – You have very little control over mixed content (text plus elements)
  – You have little control over ordering of elements
DTDs are written in a strange (non-XML) format
  – You need separate parsers for DTDs and XML
The XML Schema Definition language solves these problems
  – XSD gives you much more control over structure and content
  – XSD is written in XML

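Referring to a schema
To refer to a DTD in an XML document, the reference goes before the root element:
  – <!DOCTYPE rootElement SYSTEM "url"> <rootElement> ... </rootElement>
To refer to an XML Schema in an XML document, the reference goes in the root element:
  – <rootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="url.xsd"> ... </rootElement>
  – The xsi:noNamespaceSchemaLocation attribute is where your XML Schema definition can be found
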
The XSD document
Since the XSD is written in XML, it can be unclear whether we are talking about the schema or the document it describes
The file extension is .xsd
The root element is <schema>
The XSD starts like this: <?xml version="1.0"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

The <schema> element may have attributes:
  – xmlns:xs="http://www.w3.org/2001/XMLSchema"
      This is necessary to specify where all our XSD tags are defined
  – elementFormDefault="qualified"
      This means that all XML elements in documents using the schema must be qualified (use a namespace)
      It is highly desirable to qualify all elements, or problems will arise when another schema is added

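A minimal schema skeleton using both attributes is sketched below. The targetNamespace attribute and the example.org URI are assumptions added for illustration (the slides do not discuss them); elementFormDefault only has a visible effect when the schema has a target namespace.

```xml
<?xml version="1.0"?>
<!-- xmlns:xs binds the xs prefix to the XML Schema namespace -->
<!-- targetNamespace and the example.org URI are hypothetical additions -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/ns/note"
           xmlns="http://example.org/ns/note"
           elementFormDefault="qualified">
  <!-- A single declaration so that the schema is not empty -->
  <xs:element name="note" type="xs:string"/>
</xs:schema>
```
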
“Simple” and “complex” elements
A “simple” element is one that contains text and nothing else
  – A simple element cannot have attributes
  – A simple element cannot contain other elements
  – A simple element is not used to declare an empty element (empty elements are declared as complex)
  – However, the text can be of many different types, and may have various restrictions applied to it
If an element isn’t simple, it’s “complex”
  – A complex element may have attributes
  – A complex element may be empty, or it may contain text, other elements, or both text and other elements

Defining a simple element
A simple element is defined as <xs:element name="name" type="type" /> where:
  – name is the name of the element
  – the most common values for type are xs:boolean, xs:integer, xs:date, xs:string, xs:decimal, xs:time
Other attributes a simple element may have:
  – default="default value" if no other value is specified
  – fixed="value" no other value may be specified

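A few simple element declarations, as a sketch. The element names and the default and fixed values are invented for the example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Plain simple elements: a name plus a built-in type -->
  <xs:element name="lastName" type="xs:string"/>
  <xs:element name="age" type="xs:integer"/>
  <xs:element name="birthDate" type="xs:date"/>
  <!-- default: used when the element is present but empty -->
  <xs:element name="country" type="xs:string" default="Iran"/>
  <!-- fixed: the only value the element may have -->
  <xs:element name="currency" type="xs:string" fixed="USD"/>
</xs:schema>
```

With these declarations, <age>27</age> would be accepted, while <age>twenty</age> would be rejected because it is not an xs:integer.
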
Defining an attribute
Attributes themselves are always declared as simple types
An attribute is defined as <xs:attribute name="name" type="type" /> where:
  – name and type are the same as for xs:element
Other attributes an attribute declaration may have:
  – default="default value" if no other value is specified
  – fixed="value" no other value may be specified
  – use="optional" the attribute is not required (default)
  – use="required" the attribute must be present

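Attribute declarations usually appear inside the complex type of the element that carries them. The following sketch uses a hypothetical product element with one required and one defaulted attribute.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="product">
    <xs:complexType>
      <!-- id must be present on every product element -->
      <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
      <!-- lang is optional and is treated as "en" when omitted -->
      <xs:attribute name="lang" type="xs:string" default="en"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under this sketch, <product id="42"/> would be valid and <product lang="fa"/> would not, since id is missing.
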
Restrictions, or “facets”
The general form for putting a restriction on a text value is:
  <xs:element name="name">    (or xs:attribute)
    <xs:simpleType>
      <xs:restriction base="type">
        ... the restrictions ...
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

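As a concrete instance of the general form, the sketch below restricts an integer to a range. The element name age and the bounds 0 and 140 are illustrative choices, not taken from the slides.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="age">
    <xs:simpleType>
      <!-- Start from xs:integer and narrow the allowed range -->
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="140"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
```
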
Restrictions on numbers
minInclusive -- number must be ≥ the given value
minExclusive -- number must be > the given value
maxInclusive -- number must be ≤ the given value
maxExclusive -- number must be < the given value
totalDigits -- number must have no more than value digits in total
fractionDigits -- number must have no more than value digits after the decimal point

Restrictions on strings
length -- the string must contain exactly value characters
minLength -- the string must contain at least value characters
maxLength -- the string must contain no more than value characters
pattern -- the value is a regular expression that the string must match
whiteSpace -- not really a “restriction”; tells what to do with whitespace
  – value="preserve" Keep all whitespace
  – value="replace" Change all whitespace characters to spaces
  – value="collapse" Remove leading and trailing whitespace, and replace all sequences of whitespace with a single space

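Two string restrictions, sketched with made-up element names: one limits the length, the other requires a pattern of two capital letters followed by four digits.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Between 3 and 16 characters -->
  <xs:element name="userName">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:minLength value="3"/>
        <xs:maxLength value="16"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <!-- The whole value must match the regular expression -->
  <xs:element name="courseCode">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z]{2}[0-9]{4}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
```
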
Enumeration
An enumeration restricts the value to be one of a fixed set of values

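An enumeration is written as a restriction with one xs:enumeration facet per allowed value. The season element and its four values are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="season">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <!-- Only these four strings are accepted -->
        <xs:enumeration value="Spring"/>
        <xs:enumeration value="Summer"/>
        <xs:enumeration value="Autumn"/>
        <xs:enumeration value="Winter"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
```
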
Complex elements
A complex element is defined as
  <xs:element name="name">
    <xs:complexType>
      ... information about the complex type ...
    </xs:complexType>
  </xs:element>
Inside the complex type, <xs:sequence> says that elements must occur in the order listed
Remember that attributes are always simple types

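A small complex element, sketched with invented names: a person element that must contain firstName followed by lastName.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <!-- Children must appear in exactly this order -->
      <xs:sequence>
        <xs:element name="firstName" type="xs:string"/>
        <xs:element name="lastName" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```
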
Declaration and use
So far we’ve been talking about how to declare types, not how to use them
To use a type we have declared, use it as the value of type="..."
  – Scope is important: you cannot use a type if it is local to some other type

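To make a type reusable, it can be declared with a name at the top level of the schema and then referenced from type="...". The names personType, student and professor are assumptions for this sketch.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- A named, top-level type is visible throughout the schema -->
  <xs:complexType name="personType">
    <xs:sequence>
      <xs:element name="firstName" type="xs:string"/>
      <xs:element name="lastName" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <!-- Two elements that use the same declared type -->
  <xs:element name="student" type="personType"/>
  <xs:element name="professor" type="personType"/>
</xs:schema>
```

A type written anonymously inside one element declaration has no name, so other declarations cannot refer to it.
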
xs:sequence
The elements declared inside an xs:sequence group must occur in the document in the order in which they are declared

xs:all
xs:all allows elements to appear in any order
Despite the name, the members of an xs:all group can occur once or not at all
You can use minOccurs="0" to specify that an element is optional (default value is 1)
  – In this context, maxOccurs is always 1

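A sketch of an xs:all group, using a hypothetical address element whose children may come in any order and whose postalCode is optional.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="address">
    <xs:complexType>
      <xs:all>
        <!-- street and city must each appear once, in either order -->
        <xs:element name="street" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <!-- minOccurs="0" makes postalCode optional -->
        <xs:element name="postalCode" type="xs:string" minOccurs="0"/>
      </xs:all>
    </xs:complexType>
  </xs:element>
</xs:schema>
```
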
Empty elements
Empty elements are (ridiculously) complex

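One way to declare an empty element is as a complex type that restricts xs:anyType to nothing; a complex type with no content model is a shorter equivalent. Both forms are sketched below with made-up element names.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Long form: restrict xs:anyType so that no content is allowed -->
  <xs:element name="pageBreak">
    <xs:complexType>
      <xs:complexContent>
        <xs:restriction base="xs:anyType"/>
      </xs:complexContent>
    </xs:complexType>
  </xs:element>
  <!-- Short form: a complex type with no content model -->
  <xs:element name="lineBreak">
    <xs:complexType/>
  </xs:element>
</xs:schema>
```
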
Mixed elements
Mixed elements may contain both text and elements
We add mixed="true" to the xs:complexType element
The text itself is not mentioned in the type definition, and may go anywhere between the child elements (the schema places no constraints on it)

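A mixed-content declaration, as a sketch. The letter, name and orderId names are hypothetical; only the child elements are declared, and free text may surround them.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="letter">
    <!-- mixed="true" allows text between the declared child elements -->
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="orderId" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A matching instance could be <letter>Dear <name>John Smith</name>, your order <orderId>1032</orderId> has shipped.</letter>
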
Extensions
You can base a complex type on another complex type
  <xs:complexType name="newType">
    <xs:complexContent>
      <xs:extension base="otherType">
        ...new stuff...
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

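A concrete extension, sketched with invented type names: studentType inherits the two children of personType and appends a studentId.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Base type -->
  <xs:complexType name="personType">
    <xs:sequence>
      <xs:element name="firstName" type="xs:string"/>
      <xs:element name="lastName" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <!-- Derived type: the new elements follow the inherited ones -->
  <xs:complexType name="studentType">
    <xs:complexContent>
      <xs:extension base="personType">
        <xs:sequence>
          <xs:element name="studentId" type="xs:positiveInteger"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name="student" type="studentType"/>
</xs:schema>
```
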
Predefined string types
Recall that a simple element is defined as: <xs:element name="name" type="type" />
Here are a few of the possible string types:
  – xs:string -- a string
  – xs:normalizedString -- a string that doesn’t contain tabs, newlines, or carriage returns
  – xs:token -- a string that doesn’t contain any whitespace other than single spaces
Allowable restrictions on strings:
  – enumeration, length, maxLength, minLength, pattern, whiteSpace

Predefined date and time types
xs:date -- A date in the format CCYY-MM-DD
xs:time -- A time in the format hh:mm:ss (hours, minutes, seconds)
xs:dateTime -- Format is CCYY-MM-DDThh:mm:ss
  – The T is part of the syntax
Allowable restrictions on dates and times:
  – enumeration, minInclusive, minExclusive, maxInclusive, maxExclusive, pattern, whiteSpace

Predefined numeric types
Here are some of the predefined numeric types:
  – xs:decimal
  – xs:byte
  – xs:short
  – xs:int
  – xs:long
  – xs:positiveInteger
  – xs:negativeInteger
  – xs:nonPositiveInteger
  – xs:nonNegativeInteger
Allowable restrictions on numeric types:
  – enumeration, minInclusive, minExclusive, maxInclusive, maxExclusive, fractionDigits, totalDigits, pattern, whiteSpace

Questions?