SDPNotes 2: Document Grammars and Instances


2 Document Grammars and Instances
A look at the foundations of hierarchical document structures
2.1 Language-Theoretic Basis (a quick review)
  – regular expressions
  – extended context-free grammars (as document schemas), and
  – parse trees (as document instances)
2.2 Review of XML Basics
  – Practical realisation of the general model: XML document instances and DTDs

2.1 Regular Expressions
• Formalism to describe regular languages:
  – relatively simple sets of strings over a finite alphabet of
    » characters, events, document elements, …
• Many uses:
  – text searching patterns (e.g., emacs, ed, awk, grep, Perl)
  – as a part of grammatical formalisms (for programming languages, for XML, in XML DTDs and schemas, …)
• Relevance for document structures: what kind of structural content is allowed within various document components

Regular Expressions: Syntax
• A regular expression over an alphabet Σ is either
  – ∅, the empty set,
  – λ, lambda (sometimes epsilon, ε), the empty string,
  – a, any alphabet symbol a ∈ Σ,
  – (R | S), choice; sometimes written (R ∪ S),
  – (R S), concatenation, or
  – R*, Kleene closure or iteration,
  where R and S are regular expressions
• N.B.: different syntaxes exist, but the idea is the same

Regular Expressions: Grouping
• Conventions to simplify operator expressions
  – outermost parentheses may be eliminated:
    » (E) = E
  – binary operations are associative:
    » (A | (B | C)) = ((A | B) | C) = (A | B | C)
    » (A (B C)) = ((A B) C) = (A B C)
  – operations have priorities:
    » iteration first, concatenation next, choice last
    » for example, (A | (B (C D)*)) = A | B (C D)*

Regular Expressions: Semantics
• Regular expression E denotes a language (set of strings) L(E), defined inductively as follows:
  – L(∅) = {} (empty set)
  – L(λ) = {λ} (singleton set of the empty string, λ = '')
  – L(a) = {a} (a ∈ Σ; singleton set of the word "a")
  – L((R | S)) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)}
  – L((R S)) = L(R) L(S) = {xy | x ∈ L(R) and y ∈ L(S)}
  – L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k = 1, …, n; n ≥ 0}

Regular Expressions: Examples
• L(A | B (C D)*) = ?
  = L(A) ∪ L(B (C D)*)
  = {A} ∪ L(B) L((C D)*)
  = {A} ∪ {B} L(C D)*
  = {A} ∪ {B} {λ, CD, CDCD, CDCDCD, …}
  = {A, B, BCD, BCDCD, BCDCDCD, …}
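The inductive definition of L(E) and the worked example above can be sketched in Python. This is only an illustration: the tuple encoding of expressions and the function name `lang` are assumptions, and because L(R*) is usually infinite the function returns only the strings of length at most n.

```python
def lang(expr, n):
    """Return the strings of L(expr) whose length is at most n.

    expr is a tuple: ("empty",), ("lambda",), ("sym", a),
    ("alt", R, S), ("cat", R, S) or ("star", R).
    """
    kind = expr[0]
    if kind == "empty":                      # L(∅) = {}
        return set()
    if kind == "lambda":                     # L(λ) = {''}
        return {""}
    if kind == "sym":                        # L(a) = {a}
        return {expr[1]} if len(expr[1]) <= n else set()
    if kind == "alt":                        # union
        return lang(expr[1], n) | lang(expr[2], n)
    if kind == "cat":                        # pairwise concatenation
        return {x + y for x in lang(expr[1], n) for y in lang(expr[2], n)
                if len(x + y) <= n}
    if kind == "star":                       # zero or more repetitions
        base, result, frontier = lang(expr[1], n), {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")

# A | B (C D)*
example = ("alt", ("sym", "A"),
           ("cat", ("sym", "B"), ("star", ("cat", ("sym", "C"), ("sym", "D")))))
print(sorted(lang(example, 5)))
```

With the bound n = 5 the result contains exactly the first four words of the set derived on the slide.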

Regular Expressions: Examples
• Simplified top-level structure of a document:
  – Σ = {title, auth, date, sect}
  – title followed by an optional list of authors, followed by an optional date, followed by one or more sections:
      title auth* (date | λ) sect sect*
• Commonly used abbreviations:
  – E? = (E | λ); E+ = E E*
  ⇒ the above more compactly: title auth* date? sect+
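As a rough illustration of how such an expression constrains a sequence of document elements, the model title auth* date? sect+ can be written as a Python regular expression. The encoding is an assumption made for this sketch: each element name is followed by one space so that names are matched as whole tokens, and `conforms` is a hypothetical helper name.

```python
import re

# title auth* date? sect+ over space-terminated element names
TOP_LEVEL = re.compile(r"title (auth )*(date )?(sect )+")

def conforms(children):
    """Return True if the list of element names matches the model."""
    word = "".join(name + " " for name in children)
    return TOP_LEVEL.fullmatch(word) is not None

print(conforms(["title", "auth", "auth", "date", "sect"]))
print(conforms(["title", "auth"]))
```

The first call satisfies the model; the second does not, because at least one sect is required.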

Context-Free Grammars (CFGs)
• Widely used for syntax specification (programming languages, XML, …) and for parser/compiler generation (e.g. YACC/GNU Bison)
• A CFG G is formally a quadruple (V, Σ, P, S), where
  – V is the alphabet of the grammar G
  – Σ ⊆ V is a set of terminal symbols;
    » N = V \ Σ is a set of nonterminal symbols
  – P is a set of productions
  – S ∈ N is the start symbol

Productions and Derivations
• Productions: A → α, where A ∈ N, α ∈ V*
  – E.g.: A → aBa (production 1)
• Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  – γ = αAβ and δ = αωβ for some α, β ∈ V* and A → ω ∈ P
  – E.g. AA => AaBa (assuming production 1 above)
• NB: CFGs are often given simply by listing the productions (P); the start symbol (S) is then conventionally the left-hand side of the first production

Language Generated by a CFG
• α derives β, α =>* β, if there is a sequence of (0 or more) direct derivations that transforms α to β
• The language generated by a CFG G:
  – L(G) = {w ∈ Σ* | S =>* w}
• NB: L(G) is a set of strings;
  – to model document structures, we consider syntax trees
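Direct derivation and the definition of L(G) can be sketched as follows. Sentential forms are modelled as tuples of symbols; the function names and the small grammar S → aSb | λ are hypothetical, chosen only for illustration. The search bound on the length of sentential forms is an assumption that suits grammars like this one, where a form holds at most one nonterminal.

```python
from collections import deque

def derive_directly(form, productions):
    """Yield every form obtained by rewriting one nonterminal occurrence."""
    for i, symbol in enumerate(form):
        for lhs, rhs in productions:
            if symbol == lhs:
                yield form[:i] + rhs + form[i + 1:]

def language(start, productions, terminals, max_len):
    """Terminal strings w with S =>* w and |w| <= max_len (breadth-first)."""
    seen, queue, words = {(start,)}, deque([(start,)]), set()
    while queue:
        form = queue.popleft()
        if all(s in terminals for s in form):
            if len(form) <= max_len:
                words.add("".join(form))
            continue
        for nxt in derive_directly(form, productions):
            # Assumed bound: forms longer than max_len + 1 are not explored.
            if len(nxt) <= max_len + 1 and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words

# Production 1 from the slides: A -> aBa, applied to the form AA
step = set(derive_directly(("A", "A"), [("A", ("a", "B", "a"))]))

# Hypothetical grammar S -> a S b | λ
toy = [("S", ("a", "S", "b")), ("S", ())]
print(sorted(language("S", toy, {"a", "b"}, 4)))
```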

Syntax Trees
• Also called parse trees or derivation trees
• Ordered trees
  – consist of nodes that may have child nodes, which are ordered left to right
• Nodes are labelled by symbols of V:
  – internal nodes by nonterminals, root by the start symbol
  – leaves by terminal symbols (or the empty string λ)
• A node with label A can have children labelled by X1, …, Xk only if A → X1 … Xk ∈ P

Syntax Trees: Example
• CFG for simplified arithmetic expressions:
    V = {E, +, *, (, ), I}; Σ = {+, *, (, ), I}; N = {E}; S = E
    (I stands for an arbitrary integer)
    P = { E → E+E, E → E*E, E → I, E → (E) }
• Syntax tree for 2*(3+4)?

Syntax Trees: Example
[Figure: syntax tree for 2 * ( 3 + 4 ); the production applied at each internal node is given in parentheses]
E                        (E → E*E)
  E                      (E → I)
    I  2
  *
  E                      (E → (E))
    (
    E                    (E → E+E)
      E                  (E → I)
        I  3
      +
      E                  (E → I)
        I  4
    )
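The condition that makes a labelled ordered tree a syntax tree can be checked mechanically. The sketch below encodes the arithmetic grammar and the tree for 2*(3+4); the tuple representation and the helper names are assumptions, and the leaves carry the terminal I rather than the concrete integers.

```python
# Productions of the arithmetic grammar as (left-hand side, right-hand side)
P = {
    ("E", ("E", "+", "E")),
    ("E", ("E", "*", "E")),
    ("E", ("I",)),
    ("E", ("(", "E", ")")),
}
TERMINALS = {"+", "*", "(", ")", "I"}

def node(label, *children):
    return (label, children)

def is_syntax_tree(tree):
    """Leaves must be terminals; each internal node must match a production."""
    label, children = tree
    if not children:
        return label in TERMINALS
    child_labels = tuple(child[0] for child in children)
    return (label, child_labels) in P and all(is_syntax_tree(c) for c in children)

def frontier(tree):
    """Leaf labels from left to right, i.e. the derived string."""
    label, children = tree
    return [label] if not children else [s for c in children for s in frontier(c)]

operand = node("E", node("I"))
tree = node("E", operand, node("*"),
            node("E", node("("), node("E", operand, node("+"), operand), node(")")))
print(is_syntax_tree(tree), "".join(frontier(tree)))
```

A complete check would also require the root label to be the start symbol E, which holds for `tree`.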

CFGs for Document Structures
• Nonterminals represent document elements
  – E.g. model for items (Ref) of a bibliography list:
      Ref → AuthorList Title PublData
      AuthorList → Author AuthorList
      AuthorList → λ
• Notice:
  – the right-hand side of a production is a fixed string of grammar symbols
  – repetition is simulated using recursion
    » e.g. AuthorList above

Example: List of Three Authors
[Figure: syntax tree of a Ref with three authors]
Ref
  AuthorList
    Author  Aho
    AuthorList
      Author  Hopcroft
      AuthorList
        Author  Ullman
        AuthorList  (λ)
  Title  ...
  PublData  ...

Problems
• "Auxiliary nonterminals" (like AuthorList) obscure the model
  – the last Author is several levels apart from its intuitive parent element Ref
  – awkward to access and to count the Authors of a reference
  – avoided by extended context-free grammars

Extended CFGs (ECFGs)
• Like CFGs, but the right-hand sides of productions are regular expressions over V
  – E.g.: Ref → Author* Title PublData
• Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  – γ = αAβ and δ = αωβ for some α, β ∈ V* and A → E ∈ P such that ω ∈ L(E)
  – E.g. Ref => Author Author Author Title PublData

Language Generated by an ECFG
• L(G) is defined similarly to CFGs:
  – α derives β, α =>* β, if α = α0 => α1 => … => αn = β (for n ≥ 0)
  – L(G) = {w ∈ Σ* | S =>* w}
• Theorem: Extended and ordinary CFGs generate the same class of languages.
• Syntax trees of ECFGs and CFGs differ! (Next)

Syntax Trees of an ECFG
• Similar to parse trees of an ordinary CFG, except that
  – a node with label A can have children labelled X1, …, Xk when A → E ∈ P such that X1…Xk ∈ L(E)
  ⇒ an internal node may have arbitrarily many children (e.g., Authors below a Ref node)

Example: Three Authors of a Ref
[Figure: ECFG syntax tree of a Ref with three authors]
Ref
  Author  Aho
  Author  Hopcroft
  Author  Ullman
  Title  The Design and Analysis...
  PublData  ...
Ref → Author* Title PublData ∈ P, and Author Author Author Title PublData ∈ L(Author* Title PublData)
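The ECFG condition on a node's children, and the ease of reaching all Authors directly below Ref, can be sketched as follows. The tree encoding and names are assumptions; the content model is written as a Python regular expression over child labels, each followed by one space, and labels without a production are treated as leaves.

```python
import re

# Ref -> Author* Title PublData, over space-terminated child labels
GRAMMAR = {"Ref": re.compile(r"(Author )*Title PublData ")}

def valid(tree):
    """tree = (label, [subtrees]); check each internal node against its production."""
    label, children = tree
    if label not in GRAMMAR:          # assumed to be a leaf element
        return not children
    word = "".join(child[0] + " " for child in children)
    return (GRAMMAR[label].fullmatch(word) is not None
            and all(valid(child) for child in children))

ref = ("Ref", [("Author", []), ("Author", []), ("Author", []),
               ("Title", []), ("PublData", [])])

# All authors are direct children of Ref, so counting them is a single pass.
authors = [child for child in ref[1] if child[0] == "Author"]
print(valid(ref), len(authors))
```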

Terminal Symbols in Practice
• (Extended) CFGs:
  – leaves of parse trees are labelled by single terminal symbols (∈ Σ)
• Too granular for practice; instead, terminal symbols that stand for all values of a type are used
  – XML DTDs: #PCDATA for variable-length string content
  – proposed XML schema formalisms:
    » string, byte, integer, boolean, date, ...
  – explicit string constants are rare in document grammars

2.2 XML, eXtensible Markup Language
• W3C Recommendation 10-Feb-1998
  – not an official standard, but a stable industry standard
  – Second Edition, 6-Oct-2000
    » a revision, not a new version of XML
• A simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1986
  – what is said below about valid XML documents applies to SGML documents, too

What is XML?
• Extensible Markup Language is not a markup language!
  – does not fix semantics or a tag set (like, e.g., HTML does)
• A way to use markup to represent information
• A metalanguage
  – supports definition of specific markup languages
  – E.g. XHTML, a reformulation of HTML using XML

Next: Essential Features of XML
• An overview of the essentials of XML
  – many details skipped
    » some to be discussed in exercises or with other topics when the need arises
  – learn to consult the original sources (specifications, documentation etc.) for details

XML Encoding of Structure
• An XML document is essentially a parenthesized linear encoding of a parse tree (see next)
  – corresponds to a pre-order walk
  – start of an inner node (or element) A is denoted by a start tag <A>, its end by an end tag </A>
  – leaves are strings (or empty elements)
• + certain extensions (especially attributes)

XML Encoding of Structure: Example
[Figure: a parse tree and its linear XML encoding]
S
  W  Hello
  E
  W  world!
<S><W>Hello</W><E></E><W>world!</W></S>
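The encoding as a pre-order walk can be sketched in a few lines. The tree representation (a string for a text leaf, a pair of name and child list for an element) and the function name are assumptions, the sample tree is read off the slide's figure, and no escaping of markup characters is attempted.

```python
def to_xml(tree):
    """Emit a start tag, then the children left to right, then the end tag."""
    if isinstance(tree, str):        # text leaf; assumed free of '<' and '&'
        return tree
    name, children = tree
    return f"<{name}>" + "".join(to_xml(child) for child in children) + f"</{name}>"

doc = ("S", [("W", ["Hello"]), ("E", []), ("W", ["world!"])])
print(to_xml(doc))
```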

An XML Processor (Parser)
• Reads XML documents
• Passes data to an application
• XML Recommendation
  – tells how to read, what to pass
  – check the XML Rec for details; quite readable!

XML: Logical Document Structure
• Elements
  – correspond to internal nodes of the parse tree
  – unique root element ⇒ the document is a single parse tree
  – indicated by matching (case-sensitive!) tags <elemType> … </elemType>
  – can contain text and/or subelements
  – can be empty: <elemType></elemType> OR (e.g.) <elemType/>

Logical Document Structure (2)
• Attributes
  – name-value pairs attached to elements
  – "metadata", usually not treated as content
  – in the start-tag after the element type name: <elemType attrName="value" …> … </elemType>
• Also:
  – comments: <!-- … -->
  – processing instructions: <?target … ?>

CDATA Sections
• "CDATA sections" to include XML markup characters as textual content:
    <![CDATA[ Here we can easily include markup characters and, for example, code fragments: if (Count < … > 0) ]]>
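The effect of a CDATA section can be observed with Python's standard xml.etree.ElementTree module. The element name `code` and the code fragment are made up for this sketch.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Inside CDATA, '<' and '&' are ordinary character data.
wrapped = "<code><![CDATA[if (a < b && b > 0) run();]]></code>"
text = ET.fromstring(wrapped).text
print(text)

# The same fragment without CDATA is rejected by the parser.
print(parses("<code>if (a < b && b > 0) run();</code>"))
```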

Two Levels of Correctness
• Well-formed documents
  – roughly: follow the syntax of XML, markup is correct (elements properly nested, tag names match, attributes of an element have unique names, ...)
  – violation is a fatal error
• Valid documents
  – (in addition to being well-formed) obey their document type definition
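Well-formedness violations can be seen by feeding small documents to a non-validating parser. The sketch uses Python's xml.etree.ElementTree, which reports well-formedness errors but does not check validity against a DTD; the helper name `is_well_formed` is an assumption.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document):
    """Return True if the document parses without a well-formedness error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

samples = {
    "<a><b/></a>": "properly nested",
    "<a><b></a></b>": "improper nesting",
    "<a></A>": "tag names differ in case",
    '<a x="1" x="2"/>': "duplicate attribute name",
}
for document, description in samples.items():
    print(description, is_well_formed(document))
```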

Document Type Declaration
• Provides a grammar (document type definition, DTD) for a class of documents
• Syntax: <!DOCTYPE rootElemType SYSTEM "external-subset-URI" [ internal subset of markup declarations ]>
• The DTD is the union of the external and internal subsets; the internal subset has higher precedence
  – can override entity and attribute declarations (see next)

Markup Declarations
• A DTD consists of markup declarations
  – element type declarations
    » similar to productions of ECFGs
  – attribute-list declarations
    » for declared element types
  – entity declarations (see later)
  – notation declarations
    » to pass information about external (binary) objects to the application

Element Type Declarations
• The general form is <!ELEMENT elemTypeName (E)>, where E is a content model
  – essentially a regular expression over element names
• Content model operators:
    E | F : alternation      E, F : concatenation
    E?    : optional         E*   : zero or more
    E+    : one or more      (E)  : grouping
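Since a content model is a regular expression over element names, it can be translated into an ordinary regular expression and used to check the children of an element. The following sketch makes several assumptions: the function names are hypothetical, child names are encoded as space-terminated tokens, and only element content is handled (#PCDATA, EMPTY and ANY are not).

```python
import re

def content_model_to_regex(model):
    """Translate a DTD-style content model into a compiled Python regex."""
    compact = re.sub(r"\s+", "", model)
    # Wrap every element name in a group that also consumes its trailing space.
    named = re.sub(r"[A-Za-z_][\w.\-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # The DTD sequence operator ',' is plain juxtaposition in a Python regex.
    return re.compile(named.replace(",", ""))

def conforms(model, children):
    """Check a list of child element names against a content model."""
    word = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(word) is not None

print(conforms("(title, auth*, date?, sect+)", ["title", "auth", "sect"]))
print(conforms("(a | b)*", ["a", "c"]))
```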

Attribute-List Declarations
• Can declare attributes for elements:
  – name, data type and possible default value
  – general form: <!ATTLIST elemTypeName attrName attrType default>
• Semantics mainly up to the application
  – a validating processor checks that ID attributes are unique and that targets of IDREF attributes exist

Mixed, Empty and Arbitrary Content
• Mixed content: <!ELEMENT elemTypeName (#PCDATA | A | B)*>
  – may contain text (#PCDATA) and elements
• Empty content: <!ELEMENT elemTypeName EMPTY>
• Arbitrary content: <!ELEMENT elemTypeName ANY>
  (= mixed content over all declared element types)

Entities (1)
• Multiple uses:
  – character entities:
    » &lt;, &#60; and &#x3C; all expand to '<'
    » other predefined entities: &amp;, &gt;, &apos; and &quot; expand to &, >, ' and "
  – general entities are shorthand notations: declared as <!ENTITY name "replacement text">, referenced as &name;
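Entity expansion can be observed with Python's xml.etree.ElementTree, which expands character references, the predefined entities and general entities declared in the internal DTD subset. The entity name `uni` and its replacement text are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Three ways to write '<', followed by the other predefined entities.
predefined = ET.fromstring("<p>&lt;&#60;&#x3C; &amp;&gt;&apos;&quot;</p>").text
print(predefined)

# A general entity declared in the internal subset and used as shorthand.
document = ('<!DOCTYPE doc [<!ENTITY uni "University of Somewhere">]>'
            "<doc>Welcome to &uni;!</doc>")
expanded = ET.fromstring(document).text
print(expanded)
```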

Entities (2)
• Physical storage units comprising a document
  – parsed entities, e.g. an external one: <!ENTITY chap1 SYSTEM "…">
  – the document entity is the starting point of processing
  – entities and elements must nest properly:
      <!DOCTYPE doc [ <!ENTITY chap1 ( … as above …) > ]>
      <doc>&chap1;</doc>
    where chap1 contains complete elements: <sec>…</sec> <sec>…</sec>

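Unparsed Entities
• External (binary) files
• Declarations: a notation <!NOTATION notationName SYSTEM "…">, an entity <!ENTITY entityName SYSTEM "…" NDATA notationName>, and an attribute of type ENTITY
• Usage: the entity name as the value of the ENTITY attribute
  – the application receives information about the notation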

Parameter Entities
• A way to parameterize and modularize DTDs
  – declared as <!ENTITY % name …> and referenced as %name;
  – as a complete set of markup declarations, e.g. %table-dtd;
  – as a part of a markup declaration
• (The latter, parameter entities as a part of a markup declaration, is allowed only in the external, and not in the internal, DTD subset)

Speculations about XML Parsing
• Parsing involves two things:
  1. Checking the syntactic correctness of the input
  2. Building a parse tree for the input (à la DOM), or otherwise passing the document content to the application (e.g. à la SAX)
• Task 2 is simple, thanks to the simplicity of XML markup (see next)
• Slightly more difficult (?) to implement are
  – pulling the entities together
  – checking the well-formedness
  – checking the validity wrt the DTD (or a Schema)

Building an XML Parse Tree
[Figure: the document <S><W>Hello</W><E></E><W>world!</W></S> and the parse tree built from it]
S
  W  Hello
  E
  W  world!
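Why tree building is simple can be sketched with a stack driven by start-tag, end-tag and character-data events. The sketch uses Python's standard xml.parsers.expat module as the event source; the tuple representation of nodes and the name `build_tree` are assumptions, and attributes are ignored.

```python
import xml.parsers.expat

def build_tree(document):
    """Build (name, children) tuples from parser events using a stack."""
    top = []                               # receives the root element
    stack = [("#document", top)]

    def start(name, attrs):
        node = (name, [])
        stack[-1][1].append(node)          # attach to the currently open element
        stack.append(node)                 # the new element is now open

    def end(name):
        stack.pop()                        # close the current element

    def chars(data):
        stack[-1][1].append(data)          # text becomes a leaf

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True              # deliver adjacent text in one piece
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(document, True)
    return top[0]

tree = build_tree("<S><W>Hello</W><E></E><W>world!</W></S>")
print(tree)
```

The same handlers could pass each event straight to the application instead of storing nodes, which corresponds to the SAX style mentioned above.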