XML Validation III Schemas Robin Burke ECT 360
Outline
  Admin
  Namespaces review
  XML Schemas
    Elements
    Attributes
  Break
  Data types
Admin
  Due today
    Project milestone #3: Document analysis
  Due next week
    Homework #3
  Due 10/24
    Milestone #4: Schema or DTD
Quiz
Assessment Homework – 35% Participation (including in-class exercises and labs) – 15% Quizzes – 20% Final project – 30%
Namespaces
  A way to identify a set of labels
    element / attribute names
    attribute values
  As belonging to a particular application
  Example
    course "title"
    html "title"
Namespace idea Associate a short prefix with an application Use the prefix with a colon to "qualify" names html:title syll:title
Namespace idea, cont'd
  A namespace is an association between
    a set of names
    a unique identifier (URI)
    a prefix used to identify them
Namespace declaration
  Standalone (processing-instruction form from an early namespaces draft; not part of the final W3C recommendation)
    <?xml:namespace ns="http://bookpeople.com/book" prefix="book"?>
  Part of element
    <html xmlns="http://www.w3.org/1999/xhtml">
      in this case, no prefix
    <book:book xmlns:book="http://bookpeople.com/book">
Namespaces
  Essential if we must combine documents from different applications
  For example
    we want to define a new XML language using XML
    namespace for new language
    namespace for defining language
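A small sketch of two "title" elements living in one document. The syll namespace URI and the element names under it are made up for illustration; only the XHTML namespace URI is real.

```xml
<?xml version="1.0"?>
<!-- Hypothetical syllabus document: the syll URI is an assumption -->
<syll:course xmlns:syll="http://example.com/namespaces/syllabus"
             xmlns:html="http://www.w3.org/1999/xhtml">
  <!-- The course's own title element -->
  <syll:title>XML Validation</syll:title>
  <syll:web-page>
    <!-- An XHTML title element, told apart only by its namespace -->
    <html:title>ECT 360 home page</html:title>
  </syll:web-page>
</syll:course>
```

The prefixes themselves carry no meaning; a processor compares the URIs they are bound to.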
XML so far
  Languages defined by DTDs
    contain text elements
    string attributes
  OK for text documents
  Not enough for
    Databases
    Business process integration
  Need data types
Solution
  XML Schema
    Write language definition in XML
    More control over document contents
  XML document becomes a complex data type
  XML language definition becomes complex data type specification
XML Schema
  Always a separate document
    no internal option
  Written in XML
    very verbose
  Can be large and complex
Schemas and namespaces
  A schema uses elements from one application
    the XML Schema language
  to define another
  Namespaces are necessary
  Namespaces apply to elements, not values
Example 1, XML
  <grades assignment="Homework 1">
    <grade>
      <student id="1234-12345">Jane Doe</student>
      <assigned-grade>A</assigned-grade>
    </grade>
    <grade>
      <student id="5432-54321">John Doe</student>
      <assigned-grade>B</assigned-grade>
    </grade>
  </grades>
Example 1, DTD <!ELEMENT grades (grade*)> <!ATTLIST grades assignment CDATA #IMPLIED> <!ELEMENT grade (student, assigned-grade)> <!ELEMENT student (#PCDATA)> <!ATTLIST student id CDATA #REQUIRED> <!ELEMENT assigned-grade (#PCDATA)>
Data types
  grades
    a collection of items of type grade
    can never have more than 40 students
  grade
    a structure containing a student and an assigned grade
  student
    a structure containing an id and some text
    probably should constrain the student id
  assigned-grade
    is text
    probably should constrain to A-D, F, I
Built-in types
  Part of the schema language
  Base types
    19 fundamental types
    Examples: string, decimal
  Derived types
    25 more types that use the base types
    Examples: ID, positiveInteger
Built-in types, cont'd
To declare an element
  <xs:element name="assigned-grade" type="xs:string" />
  Equivalent to
  <!ELEMENT assigned-grade (#PCDATA)>
Simple data type
  A renaming of an existing data type
    <xs:element name="assigned-grade" type="xs:string" />
  Or a restriction of an existing type
    strings beginning with "A-D"
Complex datatype
  <xs:element name="name">
    <xs:complexType>
      compositor
      element declarations
      attribute declarations
    </xs:complexType>
  </xs:element>
Compositor sequence choice all
Sequence compositor
  like "," in DTD
  DTD
    <!ELEMENT foo (bar, baz)>
  Schema
    <xs:element name="foo">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="bar" />
          <xs:element ref="baz" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
Elements in sequences
  Can specify optional / # of occurrences
  ?
    <xs:element ref="bar" minOccurs="0" />
  *
    <xs:element ref="bar" minOccurs="0" maxOccurs="unbounded" />
  +
    <xs:element ref="bar" minOccurs="1" maxOccurs="unbounded" />
  What about...
    <xs:element ref="bar" minOccurs="2" maxOccurs="4" />
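One way to read the "2 to 4" case: a DTD has no counting operator, so the same constraint has to be spelled out by hand. The sketch below is a complete schema built around the slide's foo and bar names; declaring bar as a string is an assumption made only so the schema is self-contained.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- bar is assumed to hold plain text -->
  <xs:element name="bar" type="xs:string" />

  <xs:element name="foo">
    <xs:complexType>
      <xs:sequence>
        <!-- Between two and four bar elements, inclusive -->
        <xs:element ref="bar" minOccurs="2" maxOccurs="4" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- The closest DTD content model repeats the name:
       <!ELEMENT foo (bar, bar, bar?, bar?)>
  -->
</xs:schema>
```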
Choice compositor
  like "|" in DTD
  DTD
    <!ELEMENT foo (bar | baz)>
  Schema
    <xs:element name="foo">
      <xs:complexType>
        <xs:choice>
          <xs:element ref="bar" />
          <xs:element ref="baz" />
        </xs:choice>
      </xs:complexType>
    </xs:element>
All compositor
  no simple DTD equivalent
  DTD
    <!ELEMENT foo ( (bar, baz) | (baz, bar) )>
  Schema
    <xs:element name="foo">
      <xs:complexType>
        <xs:all>
          <xs:element ref="bar" />
          <xs:element ref="baz" />
        </xs:all>
      </xs:complexType>
    </xs:element>
Nesting
  Compositors can be combined
  DTD
    <!ELEMENT foo ( (bar | baz) , (thud | grunt) )>
  Schema
    <xs:element name="foo">
      <xs:complexType>
        <xs:sequence>
          <xs:choice>
            <xs:element ref="bar" />
            <xs:element ref="baz" />
          </xs:choice>
          <xs:choice>
            <xs:element ref="thud" />
            <xs:element ref="grunt" />
          </xs:choice>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
Exercise <!ELEMENT personal-info (person-name, job-title)> <!ELEMENT address (street, city, state, zip)> <!ELEMENT street (#PCDATA)>
Local naming
  Suppose we want to reuse an element name
    different place in the structure
  Example
    <!ELEMENT url-catalog (link*)>
    <!ELEMENT link (link, description?)>
    not a workable DTD: an element name gets only one declaration, so the inner link cannot have its own content model
  schema?
    legal
    local naming permitted
    maybe not wise, though
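A possible schema for the url-catalog example, using local element declarations so that the two link elements can differ. Typing the inner link as xs:anyURI and description as xs:string is an assumption; the slide does not say what they contain.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="url-catalog">
    <xs:complexType>
      <xs:sequence>
        <!-- Outer link: a structure, declared locally inside url-catalog -->
        <xs:element name="link" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <!-- Inner link: same name, different (simple) type -->
              <xs:element name="link" type="xs:anyURI" />
              <xs:element name="description" type="xs:string" minOccurs="0" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The two declarations do not clash because each sits in a different content model; a reader of the instance document still has to know which link is which, hence "maybe not wise".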
Using namespaces
  Schema must say
    that it uses the schema namespace
    what namespace it is defining
      targetNamespace
    what namespace it is using
  Document must say
    that it is using the Schema Instance namespace
    what namespace(s) it is using
    what prefix(es) are used
    where to find the relevant schemas for each namespace
Ugly
  Schema root element
    <xs:schema elementFormDefault="qualified"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://josquin.cs.depaul.edu/~rburke/namespaces/business-card"
      xmlns="http://josquin.cs.depaul.edu/~rburke/namespaces/business-card">
  Document root element
    <business-card
      xmlns="http://josquin.cs.depaul.edu/~rburke/namespaces/business-card"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://josquin.cs.depaul.edu/~rburke/namespaces/business-card biz-card.xsd">
Attributes
  DTD attribute types
    CDATA, enumeration, token
  Schema
    can be any of the basic or derived types
    can also be user-defined types
  Declaration
    <xs:attribute name="x" type="xs:string" use="optional" default="abc" />
    a default is allowed only when use is optional; a required attribute takes no default
Attribute declaration Part of complex type follows compositor (one exception) Declaration <xs:attribute name="foo" type="xs:positiveInteger" /> What if the attribute is a more complex type itself? we'll get to that
Exception: simple content
  If an element has "simple content"
    no compositor
  used instead
    simpleContent element
    and extension to declare type of the content
Example
  DTD
    <!ELEMENT student (#PCDATA)>
    <!ATTLIST student id CDATA #REQUIRED>
  Schema
    <xs:element name="student">
      <xs:complexType>
        <xs:simpleContent>
          <xs:extension base="xs:string">
            <xs:attribute name="id" type="xs:string" use="required"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:element>
How to read this
  student is a complex type
    it is not simply a renaming of an existing type
    because of the attribute
  its content is simple
    being of only one type
    string
  but with an attribute
    id
    of type string
    which is required
Exercise <!ATTLIST personal-info id ID #IMPLIED> <!ELEMENT contact (#PCDATA)> <!ATTLIST contact type (email | fax | phone | web) #REQUIRED> <!ELEMENT text (#PCDATA)> <!ATTLIST text type (endorsement | motto | services) #REQUIRED>
Named types
  We can name a complex type
    use it wherever a built-in type would work
  Example
    <xs:complexType name="barBaz">
      <xs:sequence>
        <xs:element ref="bar" />
        <xs:element ref="baz" />
      </xs:sequence>
    </xs:complexType>
    <xs:element name="foo" type="barBaz"/>
User-defined types Any use of complexType can be turned into a user-defined type usually called "standalone" Simple types can be derived from the built-in types
Standalone types A type can stand outside of an element definition must have a name <xs:complexType name="bar-n-baz"> <xs:sequence> <xs:element ref="bar" /> <xs:element ref="baz" /> </xs:sequence> </xs:complexType> Used in element definition <xs:element name="foo" type="bar-n-baz" />
Mixed content Can specify that an element has mixed content <xs:complexType name="bar-n-baz" mixed="true"> <xs:sequence> <xs:element ref="bar" /> <xs:element ref="baz" /> </xs:sequence> </xs:complexType>
Mixed content, cont'd Schema cannot control where the text appears If this is legal <foo>text here <bar>thud</bar><baz>grunt</baz></foo> So is this <foo><bar>thud</bar>more text<baz>grunt</baz>still more</foo>
Deriving types DTDs do not allow type restrictions beyond enumeration, CDATA, token for attributes PCDATA for content Schemas have built-in types also capability to create your own
Derivation operations
  list
    sequence of values
  union
    combine two types, allowing either
  restriction
    placing limits on the legal values
List
  <xs:element name="partList">
    <xs:simpleType>
      <xs:list itemType="partNo" />
    </xs:simpleType>
  </xs:element>
  <partList>PN334-04 PN223-89 PQ1112-03</partList>
  Must be separated by spaces
  probably more useful to do this with document structure
    partList -> partNo*
Union
  Allows data of either type to be used
  Example
    <xs:simpleType name="partNumberField">
      <xs:union memberTypes="partNumberType noPartNum" />
    </xs:simpleType>
  Database situation
    null is a possible value
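The union above names two member types without defining them. A sketch of what they might look like follows; the part-number pattern and the literal "none" standing in for null are assumptions, as is the partNumber element used to show the type in use.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Assumed format: two capital letters, 3 or 4 digits, hyphen, 2 digits -->
  <xs:simpleType name="partNumberType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z]{2}\d{3,4}-\d{2}" />
    </xs:restriction>
  </xs:simpleType>

  <!-- Assumed stand-in for a database null -->
  <xs:simpleType name="noPartNum">
    <xs:restriction base="xs:string">
      <xs:enumeration value="none" />
    </xs:restriction>
  </xs:simpleType>

  <!-- Either a real part number or the null marker -->
  <xs:simpleType name="partNumberField">
    <xs:union memberTypes="partNumberType noPartNum" />
  </xs:simpleType>

  <!-- Hypothetical element using the union type -->
  <xs:element name="partNumber" type="partNumberField" />
</xs:schema>
```

Under these assumptions both `<partNumber>PN334-04</partNumber>` and `<partNumber>none</partNumber>` would validate.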
Restriction
  Most useful
  Allows the designer to state exactly what values are legal
    prices must be non-negative
    SSN must follow a certain pattern
    in-stock must be yes or no
    etc.
Restriction, cont'd Restrict a base type according to "facets" Different facets available for different data types
Facets
Example: enumeration <xs:simpleType name="grade"> <xs:restriction base="xs:string"> <xs:enumeration value="A"/> <xs:enumeration value="B"/> <xs:enumeration value="C"/> <xs:enumeration value="D"/> <xs:enumeration value="F"/> <xs:enumeration value="I"/> </xs:restriction> </xs:simpleType>
Example: numeric <xs:simpleType name="drinkingAge"> <xs:restriction base="xs:positiveInteger"> <xs:minInclusive value="21"/> </xs:restriction> </xs:simpleType>
Example: pattern Regular expressions again derived from perl <xs:simpleType> <xs:restriction base="xs:string"> <xs:pattern value="([A-D]|F|I)(\+|\-)?" /> </xs:restriction> </xs:simpleType>
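Putting the pieces together, here is one possible schema for the grades document of Example 1, applying the constraints suggested on the data types slide. The type names gradeLetter and studentId are invented here, and the id format (four digits, hyphen, five digits) is inferred from the two sample ids rather than stated anywhere.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Letter grades limited to A-D, F, I -->
  <xs:simpleType name="gradeLetter">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-D]|F|I" />
    </xs:restriction>
  </xs:simpleType>

  <!-- Assumed id format: four digits, a hyphen, five digits -->
  <xs:simpleType name="studentId">
    <xs:restriction base="xs:string">
      <xs:pattern value="\d{4}-\d{5}" />
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="grades">
    <xs:complexType>
      <xs:sequence>
        <!-- At most 40 grade entries -->
        <xs:element name="grade" minOccurs="0" maxOccurs="40">
          <xs:complexType>
            <xs:sequence>
              <!-- Text content plus a required id attribute -->
              <xs:element name="student">
                <xs:complexType>
                  <xs:simpleContent>
                    <xs:extension base="xs:string">
                      <xs:attribute name="id" type="studentId" use="required" />
                    </xs:extension>
                  </xs:simpleContent>
                </xs:complexType>
              </xs:element>
              <xs:element name="assigned-grade" type="gradeLetter" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <!-- Optional, matching #IMPLIED in the DTD -->
      <xs:attribute name="assignment" type="xs:string" />
    </xs:complexType>
  </xs:element>
</xs:schema>
```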
Inheritance
  facet restrictions are inherited
    new type derivations must honor them
    but can restrict them further
  but new derivations can alter other facets
  For example
    monetary type
      fractionDigits facet = 2
    loan amount type
      monetary type + maxInclusive = 100000
    car loan amount
      loan amount type + maxInclusive = 30000
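The monetary example can be sketched as a chain of three restrictions. The type names are invented, xs:decimal is assumed as the starting base type, and the upper limits are expressed with the standard maxInclusive facet.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- At most two digits after the decimal point -->
  <xs:simpleType name="monetary">
    <xs:restriction base="xs:decimal">
      <xs:fractionDigits value="2" />
    </xs:restriction>
  </xs:simpleType>

  <!-- Inherits fractionDigits = 2 and adds an upper bound -->
  <xs:simpleType name="loanAmount">
    <xs:restriction base="monetary">
      <xs:maxInclusive value="100000" />
    </xs:restriction>
  </xs:simpleType>

  <!-- Tightens the inherited bound; loosening it would be an error -->
  <xs:simpleType name="carLoanAmount">
    <xs:restriction base="loanAmount">
      <xs:maxInclusive value="30000" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```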
Complex Types
  Possible to derive from complex types
    i.e. elements
  Use complexContent
  Possibilities
    extension
    restriction
    elements
    attributes
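A minimal sketch of derivation by extension with complexContent. The address fields echo the earlier exercise; the type names, the added country element, and the verified attribute are all hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Base complex type -->
  <xs:complexType name="addressType">
    <xs:sequence>
      <xs:element name="street" type="xs:string" />
      <xs:element name="city" type="xs:string" />
      <xs:element name="state" type="xs:string" />
      <xs:element name="zip" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <!-- Derived type: base content first, then the added element and attribute -->
  <xs:complexType name="internationalAddressType">
    <xs:complexContent>
      <xs:extension base="addressType">
        <xs:sequence>
          <xs:element name="country" type="xs:string" />
        </xs:sequence>
        <xs:attribute name="verified" type="xs:boolean" />
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:element name="address" type="internationalAddressType" />
</xs:schema>
```

With extension, the new elements are appended after the inherited sequence; they cannot be interleaved with it.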
Exercise <!ELEMENT state (#PCDATA)> <!ATTLIST contact type (email | fax | phone | web) #REQUIRED> <!ELEMENT street (#PCDATA)>
Design decisions Attribute vs element Level of granularity Naming Schema structure
Attribute vs element
  Some specific rules
    ID must be attribute
  General principle
    data vs metadata
    Element for document content
    Attribute for information about content
  Not always easy to tell!
Element Consists of document content Will be shown to a human user Contains substructure Sequence may be important Could be very long Presence depends on other values
Attribute (Opposite of above) Must be from an enumeration of values Also consistency
Level of granularity
  How detailed to model the data?
  Very detailed
    more work to mark up
    more detail in expressing the schema
    exceptions must be handled
  Less detailed
    easier to mark up
    easier to schematize
    document contents less accessible
Element content granularity Fine grained model salutation, first name, middle name, last name, appellation Coarse grained model name Tradeoff search / sort / organized document creation
Levels vs recursion
  Named levels
    <chapter> <section> <subsection> <subsubsection>
  Recursion
  Tradeoff
    ability to rearrange
    transparency of markup
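For comparison with the named levels, a recursive design might look like the sketch below: one section type that may contain further sections. The title and para elements and the sectionType name are assumptions added to make the schema complete.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- A section holds a title, some paragraphs, and any number of nested sections -->
  <xs:complexType name="sectionType">
    <xs:sequence>
      <xs:element name="title" type="xs:string" />
      <xs:element name="para" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="section" type="sectionType" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>

  <!-- The top level reuses the same structure under a different name -->
  <xs:element name="chapter" type="sectionType" />
</xs:schema>
```

A section can be moved to any depth without renaming, at the cost of markup that no longer shows the depth in the tag name.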
Naming
  Case convention
    UPPERCASE IS BAD
    lowercase better
  Multiple words
    CapCase
    camelCase
    Underline_Convention
Structure
  Nested
    "russian doll"
    schema looks like the document
    small schemas only
  Flat
    elements defined at global level
    references used in complex type definitions
  Type-based
    "venetian blind"
    all schema complexity in type definitions
    one global element
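As an illustration of the type-based style, the grades document from Example 1 could be laid out as below: every structure is a named type and only the root is a global element. The type names are invented. A nested version would instead declare each element inside its parent, and a flat version would declare every element globally and use ref.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Named types carry all of the structure -->
  <xs:complexType name="studentType">
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="id" type="xs:string" use="required" />
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:complexType name="gradeType">
    <xs:sequence>
      <xs:element name="student" type="studentType" />
      <xs:element name="assigned-grade" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="gradesType">
    <xs:sequence>
      <xs:element name="grade" type="gradeType" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
    <xs:attribute name="assignment" type="xs:string" />
  </xs:complexType>

  <!-- The one global element -->
  <xs:element name="grades" type="gradesType" />
</xs:schema>
```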