1 XML Data Management Document Type Definitions (DTDs) Werner Nutt.

Slides:



Advertisements
Similar presentations
1 DTD (Document Type Definition) Imposing Structure on XML Documents (W3Schools on DTDs)W3Schools on DTDs.
Advertisements

XML 6.3 DTD 6. XML and DTDs A DTD (Document Type Definition) describes the structure of one or more XML documents. Specifically, a DTD describes:  Elements.
History Leading to XHTML
XML Document Type Definitions ( DTD ). 1.Introduction to DTD An XML document may have an optional DTD, which defines the document’s grammar. Since the.
1 XML DTD & XML Schema Monica Farrow G30
1 XML Data Management 2. XML Syntax Werner Nutt. 2 HTML Designed for publishing hypertext on the Web Describes how a browser should arrange text, images,
XML Study-Session: Part II Validating XML Documents.
Document Type Definition DTDs CS-328. What is a DTD Defines the structure of an XML document Only the elements defined in a DTD can be used in an XML.
Document Type Definitions
Review Writing XML  Style  Common errors 1XML Technologies David Raponi.
1 XML: Document Type Definitions 2 Road Map  Introduction to DTDs  What’s a DTD?  Why are they important?  What will we cover?  Our First DTD 
 2002 Prentice Hall, Inc. All rights reserved. ISQA 407 XML/WML Winter 2002 Dr. Sergio Davalos.
1 Document Type Descriptors (DTDs) Imposing Structure on XML Documents.
Tutorial 9 Working with XHTML. XP Objectives Describe the history and theory of XHTML Understand the rules for creating valid XHTML documents Apply a.
Creating a Well-Formed Valid Document. 2 Objectives Introducing XHTML Creating a Well-Formed Document Creating a Valid Document Creating an XHTML Document.
Full declaration When an element is declared to have element content, the children element types must also be declared Example: to which the following.
XML eXtensible Markup Language.
XML Technologies and Applications Rajshekhar Sunderraman Department of Computer Science Georgia State University Atlanta, GA 30302
Document Type Definitions. XML and DTDs A DTD (Document Type Definition) describes the structure of one or more XML documents. Specifically, a DTD describes:
Introduction to XML This material is based heavily on the tutorial by the same name at
XML Validation I DTDs Robin Burke ECT 360 Winter 2004.
XP New Perspectives on XML Tutorial 3 1 DTD Tutorial – Carey ISBN
Validating DOCUMENTS with DTDs
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Document Type Definition.
XP The University of Akron Summit College Business Technology Department Computer Information Systems 2440: 140 Internet Tools Instructor: Enoch E. Damson.
Copyright © 2003 Pearson Education, Inc. Slide 3-1 Created by Cheryl M. Hughes, Harvard University Extension School — Cambridge, MA The Web Wizard’s Guide.
XML CPSC 315 – Programming Studio Fall 2008 Project 3, Lecture 1.
1 XML Data Management 3. Document Type Definitions (DTDs) Werner Nutt based on slides by Sara Cohen, Jerusalem.
Document Type Definitions Kanda Runapongsa Dept. of Computer Engineering Khon Kaen University.
XML 1 Enterprise Applications CE00465-M XML. 2 Enterprise Applications CE00465-M XML Overview Extensible Mark-up Language (XML) is a meta-language that.
XML Extensible Markup Language. What is XML? ● meta-markup language ● a language for defining a family of languages ● semantic/structured mark-up language.
XHTML. Introduction to XHTML What Is XHTML? – XHTML stands for EXtensible HyperText Markup Language – XHTML is almost identical to HTML 4.01 – XHTML is.
XML - Why: The HTML-Dilemma HTML, SGML, XML - How: Syntax, Concept, Language Elements Basics Well-formed XML-Documents (without DTD) Valid XML-Documents.
XP 1 DECLARING A DTD A DTD can be used to: –Ensure all required elements are present in the document –Prevent undefined elements from being used –Enforce.
What is XML?  XML stands for EXtensible Markup Language  XML is a markup language much like HTML  XML was designed to carry data, not to display data.
Introduction to XML Extensible Markup Language. What is XML XML stands for eXtensible Markup Language. A markup language is used to provide information.
FIGIS’ML Hands-on training - © FAO/FIGIS An introduction to XML Objectives : –what is XML? –XML and HTML –XML documents structure well-formedness.
1 Tutorial 13 Validating Documents with DTDs Working with Document Type Definitions.
Avoid using attributes? Some of the problems using attributes: Attributes cannot contain multiple values (child elements can) Attributes are not easily.
 2002 Prentice Hall, Inc. All rights reserved. Chapter 6 – Document Type Definition (DTD) Outline 6.1Introduction 6.2Parsers, Well-formed and Valid XML.
Lecture 6 XML DTD Content of.xml fileContent of.dtd file.
XML A web enabled data description language 4/22/2001 By Mark Lawson & Edward Ryan L’Herault.
1 Chapter 10: XML What is XML What is XML Basic Components of XML Basic Components of XML XPath XPath XQuery XQuery.
XML 2nd EDITION Tutorial 1 Creating An Xml Document.
CIS 451: XML DTDs Dr. Ralph D. Westfall February, 2009.
XML Data Management 10. Deterministic DTDs and Schemas Werner Nutt.
IS432 Semi-Structured Data Lecture 2: DTD Dr. Gamal Al-Shorbagy.
XML Validation I DTDs Robin Burke ECT 360 Winter 2004.
XP 1 Creating an XML Document Developing an XML Document for the Jazz Warehouse XML Tutorial.
Lecture 16 Introduction to XML Boriana Koleva Room: C54
1 Introduction to XML XML stands for Extensible Markup Language. Because it is extensible, XML has been used to create a wide variety of different markup.
An Introduction to XML Sandeep Bhattaram
XML Introduction. What is XML? XML stands for eXtensible Markup Language XML stands for eXtensible Markup Language XML is a markup language much like.
CSE3201 Information Retrieval Systems DTD Document Type Definition.
1 XML eXtensible Markup Language. 2 XML vs. HTML HTML is a HyperText Markup language HTML is a HyperText Markup language Designed for a specific application,
INFSY 547: WEB-Based Technologies Gayle J Yaverbaum, PhD Professor of Information Systems Penn State Harrisburg.
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
DTD Document Type Definition. Agenda Introduction to DTD DTD Building Blocks DTD Elements DTD Attributes DTD Entities DTD Exercises DTD Q&A.
Tutorial 9 Working with XHTML. New Perspectives on HTML, XHTML, and XML, Comprehensive, 3rd Edition 2 Objectives Describe the history and theory of XHTML.
Tutorial 9 Working with XHTML. XP Objectives Describe the history and theory of XHTML Understand the rules for creating valid XHTML documents Apply a.
Digital Multimedia, 2nd edition Nigel Chapman & Jenny Chapman Chapter 14 This presentation © 2004, MacAvon Media Productions XML.
XML – Basic Concepts (modified version from Dr. Praveen Madiraju) 2015, Fall Pusan National University Ki-Joune Li.
1 XML eXtensible Markup Language. 2 Introduction and Motivation Dr. Praveen Madiraju Modified from Dr.Sagiv’s slides.
XML eXtensible Markup Language.
XML – eXtensible Markup Language. The World Wide Web and What We Would Like to Do with It XML has a lot of hype surrounding it This week we discuss: –Why.
Extensible Markup Language (XML) Pat Morin COMP 2405.
Session III Chapter 6 – Creating DTDs
DTD (Document Type Definition)
Session II Chapter 6 – Creating DTDs
Document Type Definition (DTD)
Presentation transcript:

1 XML Data Management Document Type Definitions (DTDs) Werner Nutt

Document Type Definitions Document Type Definitions (DTDs) impose structure on an XML document Using DTDs, we can specify what a "valid" document should contain DTD specifications require more than being well-formed, e.g., what elements are legal, what nesting is allowed DTDs do not have limited expressive power, e.g., one cannot specify types

What is This Good for? DTDs can be used to define special languages of XML, i.e., restricted XML for special needs Examples: –MathML (mathematical markup) –SVG (scalable vector graphics) –XHTML (well-formed version of HTML) –RSS ("Really Simple Syndication", news feeds) Standards can be defined using DTDs, for data exchange and special applications can be written now, often replaced by XML Schema

Alphabet Soup HTML XML SGML MathML RSS XHTML

Example: MathML x 2 ⁢ y

Example: SVG <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" " <svg width="250px" height="250px" xmlns=" Hello, World! Hello, World! Hello, World!

Address Book DTD Suppose we want to create a DTD that describes legal address book entries This DTD will be used to exchange address book information between programs How should it be written? What is a legal address?

Example: An Address Book Entry Homer Simpson Dr. H. Simpson 1234 Springwater Road Springfield USA, (321) (321) mixed telephones and faxes at least one as many address lines as needed at most one greetingexactly one name

Specifying the Structure How do we specify exactly what must appear in a person element? A DTD specifies for each element the permitted content The permitted content is specified by a regular expression Our plan: –first, regular expression defining the content of person –then, general syntax

What’s in a person Element? Exactly one name, followed by at most one greeting, followed by an arbitrary number of address lines, followed by a mix of telephone and fax numbers, followed by at least one . Formally: name, greet?, addr*, (tel | fax)*, + regular expression

What’s in a person Element? (cntd) name, greet?, addr*, (tel | fax)*, + name = there must be a name element greet? = there is an optional greet element (i.e., 0 or 1 greet elements) name, greet? = the name element is followed by an optional greet element addr* = there are 0 or more address elements

What’s in a person Element? (cntd) name, greet?, addr*, (tel | fax)*, + tel | fax = there is a tel or a fax element (tel | fax)* = there are 0 or more repeats of tel or fax + = there are 1 or more elements

What’s in a person Element? (cntd) name, greet?, addr*, (tel | fax)*, + Does this expression differ from: name, greet?, addr*, tel*, fax*, + name, greet?, addr*, (fax|tel)*, + name, greet?, addr*, (fax|tel)*, , * name, greet?, addr*, (fax|tel)*, *,

aelement a e1?0 or 1 occurrences of expression e1 e1*0 or more occurrences of expression e1 e1+1 or more occurrences of expression e1 e1,e2expression e2 after expression e2 e1|e2either expression e1 or expression e2 (e)grouping #PCDATAparsed character data (i.e., after parsing) EMPTYno content ANYany content (#PCDATA | a 1 | … | a n )*mixed content Element Content Descriptions

addressbook as Internal DTD <!DOCTYPE addressbook [ ]>]>

Exercise Requirements A country must have a name as the first node. A country must have a capital city as the following node. A country may have a king. A country may have a queen. What about the following?

Deterministic DTDs E Deterministic Content Models (Non-Normative) As noted in Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors. For example, the content model ((b, c) | (b, d)) is non-deterministic, because given an initial b the XML processor cannot know which b in the model is being matched without looking ahead to see which element follows the b. In this case, the two references to b can be collapsed into a single reference, making the model read (b, (c | d)). An initial b now clearly matches only a single name in the content model. The processor doesn't need to look ahead to see what follows; either c or d would be accepted. … From: Extensible Markup Language (XML) 1.0 (Fifth Edition) W3C Recommendation 26 November 2008

Deterministic DTDs SGML requires that a DTD is deterministic, that is, when parsing a document, a parser only needs to look at the next element to know at which point it is in the regular expression Is this DTDs deterministic? Try ! Can we fix it? 1-step lookahead

Research Questions What are the typical research questions to ask about non-deterministic and deterministic DTDs? 1.Is there an algorithm to check whether a DTD is (non-)deterministic? 2.Is there an algorithm running in polynomial time? (Or is this problem NP-hard?) 3.What is the exact runtime of the best algorithm? 4.Is there for every (nondeterministic) DTD an equivalent deterministic DTD? Answers by Anne Brüggemann-Klein (1993): 1) yes, 2) yes, 3) quadratic, linear for expressions, 4) yes, but it may be exponential in the size of the input

Formalization An element definition specifies a language, i.e., the set of all legal series of children Example: Which of the following are in the language defined by a*, (b | c), a+ –aba –abca –aab –aaacaaa

Automata Languages can also be defined using automata An automaton consists of: –a set of states Q. –an alphabet  (i.e., a set of symbols) –a transition function , which maps every pair (q,a) to a set of states q’ –an initial state q 0 –a set of accepting states F A word a 1 …a n is in the language defined by an automaton if there is a path from q 0 to a state in F with edges labeled a 1,…,a n

q0q0 q1q1 q2q2 a a b q3q3 b c What Language Does This Define?

Non-Deterministic Automata An automaton is non-deterministic if there is a state q and a letter a such that there are at least two transitions from q via edges labeled with a Otherwise, it is deterministic What words are in the language of a non-deterministic automaton? We now create a Glushkov automaton from a regular expression

Creating a Glushkov Automaton from an Element Definition a*,(b|c),a+ Step 1: Normalize the expression by replacing any occurrence of an expression e+ with e,e* Step 2: Use subscripts to number each occurrence of each letter a 1 *,(b 1 |c 1 ),a 2,a 3 * a*,(b|c),a,a*

q 0 a 1 b 1 c 1 a 2 a 3 Creating a Glushkov Automaton from an Element Definition Step 3: Create a state q 0 and create a state for each subscripted letter a 1 *,(b 1 |c 1 ),a 2,a 3 * Step 4: Choose as accepting states all subscripted letters with which it is possible to end a word

q 0 a 1 b 1 c 1 a 2 a 3 Creating a Glushkov Automaton from an Element Definition Step 5: Create a transition from a state l j to a state k j if there is a word in which k j follows l i. Label the transition with k a 1 *,(b 1 |c 1 ),a 2,a 3 * Exercise!

1-Unambiguity A regular expression is 1-unambiguous if its Glushkov automaton is deterministic, otherwise it is 1-ambiguous Technically: An element definition is “deterministic” iff it is 1-unambigious! Exercise: Check whether the following expressions are 1-unambiguous by creating Glushkov automata for them –( a, b ) | ( a, c ) –a, (b | c) –a?, d+, b*, d*, ( c | b )+

Exercise Is this DTD deterministic? How can we fix it? <!ELEMENT country (president | king | (king,queen) | queen)>

Exercise: Payments Requirements: Customers at the till may pay with a combination of credit cards and cash. If cards and cash are both used the cards must come first. There may be more than one card. There must be no more than one cash element. At least one method of payment must be used. Task: Construct a deterministic DTD with the elements card and cash

Attributes How can we define the possible attributes of elements in XML documents? General Syntax: <!ATTLIST element-name attribute-name1 type1 default-value1 attribute-name2 type2 default-value2 … attribute-namen typen default-valuen> Example:

Attributes (cntd) <!ATTLIST element-name attribute-name1 type1 default-value1 … > type is one of the following: (there are additional possibilities that we don’t discuss) CDATAcharacter data (i.e., the string as it is) (en1 | en2 | …)value must be one from the given list IDvalue is a unique id IDREFvalue is the id of another element IDREFSvalue is a list of other ids

Attributes (cntd) <!ATTLIST element-name attribute-name1 type1 default-value1 … > default-value is one of the following: valuedefault value of the attribute #REQUIRED attribute must always be included in the element #IMPLIEDattribute need not be included #FIXED valueattribute value is fixed

Example: Attributes <!ATTLIST height dimension (cm | in) #REQUIRED accuracy CDATA #IMPLIED resizable CDATA #FIXED "yes" >

Specifying ID and IDREF Attributes <!DOCTYPE family [ <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>

Specifying ID and IDREF Attributes (cntd) Attributes mother and father are references to IDs of other elements However, those elements are not necessarily person elements the mother attribute is not necessarily a reference to a female person References to IDs have no type!

Some Conforming Data Lisa Simpson Bart Simpson Marge Simpson Homer Simpson

Consistency of ID and IDREF Attribute Values If an attribute is declared as ID the associated values must all be distinct (no confusion) That is, no two ID attributes can have the same value If an attribute is declared as IDREF the associated value must exist as the value of some ID attribute (no dangling "pointers") Similarly for all the values of an IDREFS attribute Which parallels do you see to relational databases?

Is this Legal? Clark Kent Linda Lee

Adding a DTD to a Document A DTD can be internal –the DTD is part of the document file or external –the DTD and the document are on separate files An external DTD may reside –in the local file system (where the document is) –in a remote file system (reachable using a URL)

Connecting a Document with its DTD Internal DTD: … ]>... DTD from the local file system: DTD from a remote file system:

Connecting a Document with its DTD Combination of external and internal DTD <!DOCTYPE db SYSTEM "schema.dtd" [ … ] >... internal subset

DTD Entities Entities are XML macros. They come in four kinds: Character entities: stand for arbitrary Unicode characters, like: <, ;, &, ©, … Named (internal) entities: macros in the document, can stand for any well-formed XML, mostly used for text External entities: like name entities, but refer to a file with with well-formed XML Parameter entities: stand for fragments of a DTD

Character Entities Macros expanded when the document is processed. Example: Special characters from XHTML1.0 DTD <!-- left single quotation mark, U+2018 ISOnum --> Can be specified in decimal (above) and in hexadecimal, e.g., ( x stands for hexadecimal)

Named Entities Declared in the DTD (or its local fragment, the “internal subset”) Entities can reference other entities … but must not form cycles (which the parser would detect) Example: Using dd in a document expands to Donald Duck

External Entities Represent the content of an external file. Useful when breaking a document down into parts. Example: <!DOCTYPE book SYSTEM book.dtd [ ]> &chap1;&chap2;&chap3; internal subset location of the file

Parameter Entities Can only be used in DTDs and the internal subset Indicated by percent (%) symbol instead of ampersand (&) Can be named or external entities  Modularization of DTDs Pattern:

Parameter Entities in the XHTML 1 DTD <!ENTITY % coreattrs "id ID #IMPLIED class CDATA #IMPLIED style %StyleSheet; #IMPLIED title %Text; #IMPLIED" > <!ENTITY % i18n "lang %LanguageCode; #IMPLIED xml:lang %LanguageCode; #IMPLIED dir (ltr|rtl) #IMPLIED" > …

Parameter Entities in the XHTML 1 DTD <!ATTLIST body %attrs; onload %Script; #IMPLIED onunload %Script; #IMPLIED > <!ENTITY % block "p | %heading; | div | %lists; | %blocktext; | fieldset | table">

Valid Documents A document with a DTD is valid if it conforms to the DTD, that is, the document conforms to the regular-expression grammar, types of attributes are correct, constraints on references are satisfied.

DTDs Support Document Interpretation How many children of the node will a DOM parser find?

DTDs Support Document Interpretation <!DOCTYPE a [ ]> How many children of the node will a DOM parser find now?

Not Every DTD Makes Sense <DOCTYPE genealogy [ <!ELEMENTperson ( name, dateOfBirth, person, person )>... ]> Is there a problem with this?

Not Every DTD Makes Sense (cntd) <DOCTYPE genealogy [ <!ELEMENTperson ( name, dateOfBirth, person?, person? )>... ]> Is this now okay?

Weaknesses of DTDs DTDs are rather weak specifications by DB & programming-language standards –Only one base type: PCDATA –No useful “abstractions”, e.g., sets –IDs and IDREFs are untyped –No constraints, e.g., child is inverse of parent –Tag definitions are global Some extensions impose a schema or types on an XML document, e.g., XML Schema

Weaknesses of DTDs (cntd) Questions: How would you say that element a has exactly the children c, d, e in any order? In general, can such validity of documents with respect to such definitions be checked efficiently?