SDPNotes 2: Document Grammars and Instances

2 Document Grammars and Instances
A look at the foundations of hierarchical document structures
2.1 Language-Theoretic Basis (a quick review)
–regular expressions
–extended context-free grammars (as document schemas) and
–parse trees (as document instances)
2.2 Review of XML Basics
–Practical realisation of the general model: XML document instances and DTDs

2.1 Regular Expressions
Formalism to describe regular languages:
–relatively simple sets of strings over a finite alphabet of
»characters, events, document elements, …
Many uses:
–text searching patterns (e.g., emacs, ed, awk, grep, Perl)
–as a part of grammatical formalisms (for programming languages, for XML, in XML DTDs and schemas, …)
Relevance for document structures: what kind of structural content is allowed within various document components

Regular Expressions: Syntax
A regular expression over an alphabet Σ is either
–∅ the empty set,
–λ lambda (sometimes ε, epsilon), the empty string,
–a any alphabet symbol a ∈ Σ,
–(R | S) choice; sometimes written (R ∪ S),
–(R S) concatenation, or
–R* Kleene closure or iteration,
where R and S are regular expressions
N.B.: different syntaxes exist, but the idea is the same

Regular Expressions: Grouping
Conventions to simplify operator expressions
–outermost parentheses may be eliminated:
»(E) = E
–binary operations are associative:
»(A | (B | C)) = ((A | B) | C) = (A | B | C)
»(A (B C)) = ((A B) C) = (A B C)
–operations have priorities:
»iteration first, concatenation next, choice last
»for example, (A | (B (C D)*)) = A | B (C D)*

Regular Expressions: Semantics
Regular expression E denotes a language (set of strings) L(E), defined inductively as follows:
–L(∅) = {} (empty set)
–L(λ) = {λ} (singleton set of the empty string, λ = '')
–L(a) = {a} (a ∈ Σ, singleton set of the word "a")
–L((R | S)) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)}
–L((R S)) = L(R) L(S) = {xy | x ∈ L(R) and y ∈ L(S)}
–L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k = 1, …, n; n ≥ 0}

Regular Expressions: Examples
L(A | B (C D)*) = ?
= L(A) ∪ L(B (C D)*)
= {A} ∪ L(B) L((C D)*)
= {A} ∪ {B} L(C D)*
= {A} ∪ {B} {λ, CD, CDCD, CDCDCD, …}
= {A, B, BCD, BCDCD, BCDCDCD, …}
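
The derivation above can be cross-checked mechanically. The following is a minimal sketch, assuming single-letter symbols and Python's `re` syntax for the same expression; the helper name `words_up_to` is made up for this illustration.

```python
import re
from itertools import product

# The slide's expression A | B (C D)* written in Python's regex syntax.
PATTERN = re.compile(r"A|B(CD)*")

def words_up_to(max_len, alphabet="ABCD"):
    """List all words of length <= max_len over `alphabet` that the pattern accepts."""
    found = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if PATTERN.fullmatch(word):
                found.append(word)
    return found

print(words_up_to(5))   # the shortest members of L(A | B (C D)*)
```

Brute-force enumeration is only practical for tiny alphabets and short words, but it is enough to confirm the first few members of the language.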

Regular Expressions: Examples
Simplified top-level structure of a document:
–Σ = {title, auth, date, sect}
–title followed by an optional list of authors, followed by an optional date, followed by one or more sections:
title auth* (date | λ) sect sect*
Commonly used abbreviations:
–E? = (E | λ); E+ = E E*
The above more compactly: title auth* date? sect+
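
As an illustration of using such an expression as a structural constraint, here is a small sketch that checks a sequence of element names against title auth* date? sect+. The encoding (each name followed by a space) and the function name `conforms` are assumptions of this example, not part of the notes.

```python
import re

# Each element name is followed by one space so that names cannot run together.
CONTENT_MODEL = re.compile(r"title (auth )*(date )?(sect )+")

def conforms(children):
    """True if the list of element names matches title auth* date? sect+."""
    encoded = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(encoded) is not None

print(conforms(["title", "auth", "auth", "date", "sect", "sect"]))
print(conforms(["title", "date", "auth", "sect"]))
```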

Context-Free Grammars (CFGs)
Used widely for syntax specification (programming languages, XML, …) and for parser/compiler generation (e.g. YACC/GNU Bison)
CFG G is formally a quadruple (V, Σ, P, S)
–V is the alphabet of the grammar G
–Σ ⊆ V is a set of terminal symbols;
»N = V \ Σ is a set of nonterminal symbols
–P is a set of productions
–S ∈ N is the start symbol

Productions and Derivations
Productions: A → α where A ∈ N, α ∈ V*
–E.g.: A → aBa (production 1)
Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
–γ = αAβ and δ = αωβ for some α, β ∈ V* and A → ω ∈ P
–E.g. AA => AaBa (assuming production 1 above)
NB: CFGs are often given simply by listing the productions (P); the start symbol (S) is then conventionally the left-hand side of the first production
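
A direct derivation step is just a context-preserving replacement of one nonterminal occurrence. The sketch below, with the hypothetical helper `direct_derivations`, lists everything reachable from the string AA in one step using production 1 (A → aBa).

```python
def direct_derivations(string, productions):
    """Return every string reachable from `string` in one derivation step.

    `string` is a tuple of symbols; `productions` is a list of
    (nonterminal, right-hand side) pairs.
    """
    results = []
    for i, symbol in enumerate(string):
        for left, right in productions:
            if symbol == left:
                # Replace the occurrence at position i, keeping its context.
                results.append(string[:i] + tuple(right) + string[i + 1:])
    return results

PRODUCTION_1 = ("A", ("a", "B", "a"))            # A -> aBa
steps = direct_derivations(("A", "A"), [PRODUCTION_1])
print(["".join(s) for s in steps])
```

Both results are direct derivations of AA; which occurrence of A is rewritten corresponds to the choice of the contexts α and β.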

Language Generated by a CFG
γ derives δ, γ =>* δ, if there's a sequence of (0 or more) direct derivations that transforms γ to δ
The language generated by a CFG G:
–L(G) = {w ∈ Σ* | S =>* w}
NB: L(G) is a set of strings;
–To model document structures, we consider syntax trees

Syntax Trees
Also called parse trees or derivation trees
Ordered trees
–consist of nodes that may have child nodes, which are ordered left-to-right
Nodes are labelled by symbols of V:
–internal nodes by nonterminals, root by the start symbol
–leaves by terminal symbols (or the empty string λ)
A node with label A can have children labelled by X1, …, Xk only if A → X1 … Xk ∈ P

Syntax Trees: Example
CFG for simplified arithmetic expressions:
V = {E, +, *, (, ), I}; Σ = {+, *, (, ), I}; N = {E}; S = E
(I stands for an arbitrary integer)
P = { E → E+E, E → E*E, E → I, E → (E) }
Syntax tree for 2*(3+4)?

Syntax Trees: Example
[Figure: syntax tree for 2*(3+4). The root E has children E * E; the left E derives I (2); the right E has children ( E ); that inner E has children E + E, which derive I (3) and I (4).]
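
The tree for 2*(3+4) can also be written down and checked against the productions. This is a minimal sketch under the assumption that a tree is a (label, children) pair and that every integer is represented by the terminal I; the helper names are invented for this example.

```python
# Productions of the expression grammar; right-hand sides as tuples of symbols.
PRODUCTIONS = {
    "E": {("E", "+", "E"), ("E", "*", "E"), ("I",), ("(", "E", ")")},
}

def node(label, *children):
    return (label, list(children))

def is_syntax_tree(tree):
    """Check that every internal node expands by one of the productions."""
    label, children = tree
    if label not in PRODUCTIONS:           # terminal symbol: must be a leaf
        return not children
    rhs = tuple(child[0] for child in children)
    return rhs in PRODUCTIONS[label] and all(is_syntax_tree(c) for c in children)

def frontier(tree):
    """Leaf labels from left to right, i.e. the derived terminal string."""
    label, children = tree
    if not children:
        return [label]
    return [symbol for child in children for symbol in frontier(child)]

def integer():
    return node("E", node("I"))            # E -> I

tree = node("E",
            integer(),                                     # 2
            node("*"),
            node("E", node("("),
                 node("E", integer(), node("+"), integer()),   # 3 + 4
                 node(")")))

print(is_syntax_tree(tree), "".join(frontier(tree)))
```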

CFGs for Document Structures
Nonterminals represent document elements
–E.g. model for items (Ref) of a bibliography list:
Ref → AuthorList Title PublData
AuthorList → Author AuthorList
AuthorList → λ
Notice:
–the right-hand side of a production is a fixed string of grammar symbols
–repetition is simulated using recursion
»e.g. AuthorList above

Example: List of Three Authors
[Figure: CFG parse tree of a Ref with children AuthorList, Title, PublData. The AuthorList expands three times as Author AuthorList (authors Aho, Hopcroft, Ullman), and the innermost AuthorList derives the empty string.]

Problems
"Auxiliary nonterminals" (like AuthorList) obscure the model
–the last Author is several levels apart from its intuitive parent element Ref
–awkward to access and to count the Authors of a reference
–avoided by extended context-free grammars
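
The awkwardness can be seen in code. In this sketch (tree representation and helper names are assumptions of the example), collecting the authors of a Ref means walking down the chain of auxiliary AuthorList nodes instead of simply looking at the children of Ref.

```python
def node(label, *children):
    return (label, list(children))

def author_list(names):
    """Build AuthorList -> Author AuthorList ... ending in AuthorList -> (empty)."""
    if not names:
        return node("AuthorList")
    return node("AuthorList",
                node("Author", node(names[0])),
                author_list(names[1:]))

def authors(ref):
    """Collect author names by following the nested AuthorList nodes."""
    result = []
    current = ref[1][0]                 # first child of Ref is the AuthorList
    while current[1]:                   # the empty AuthorList ends the chain
        author, rest = current[1]
        result.append(author[1][0][0])  # label of the text leaf below Author
        current = rest
    return result

ref = node("Ref",
           author_list(["Aho", "Hopcroft", "Ullman"]),
           node("Title"), node("PublData"))
print(authors(ref))
```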

Extended CFGs (ECFGs)
Like CFGs, but right-hand sides of productions are regular expressions over V
–E.g.: Ref → Author* Title PublData
Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
–γ = αAβ and δ = αωβ for some α, β ∈ V* and A → E ∈ P such that ω ∈ L(E)
–E.g. Ref => Author Author Author Title PublData

Language Generated by an ECFG
L(G) defined similarly to CFGs:
–γ derives δ, γ =>* δ, if γ =>ⁿ δ for some n ≥ 0
–L(G) = {w ∈ Σ* | S =>* w}
Theorem: Extended and ordinary CFGs generate the same languages.
Syntax trees of ECFGs and CFGs differ! (Next)

Syntax Trees of an ECFG
Similar to parse trees of an ordinary CFG, except that
–a node with label A can have children labelled X1, …, Xk when A → E ∈ P such that X1 … Xk ∈ L(E)
–an internal node may have arbitrarily many children (e.g., Authors below a Ref node)

Example: Three Authors of a Ref
[Figure: ECFG syntax tree of a Ref whose direct children are Author (Aho), Author (Hopcroft), Author (Ullman), Title (The Design and Analysis...) and PublData (...).]
Ref → Author* Title PublData ∈ P,
Author Author Author Title PublData ∈ L(Author* Title PublData)
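
The membership condition for ECFG syntax trees can be sketched as follows. The encoding of child labels as a space-terminated string, and treating Author, Title and PublData as leaves (no productions are given for them here), are assumptions of this example.

```python
import re

# Right-hand sides as regular expressions over child labels; each label is
# followed by a space so that labels cannot run together.
ECFG = {"Ref": re.compile(r"(Author )*Title PublData ")}

def is_ecfg_tree(tree):
    """tree = (label, [children]); labels without a production must be leaves."""
    label, children = tree
    if label not in ECFG:
        return not children
    word = "".join(child[0] + " " for child in children)
    return bool(ECFG[label].fullmatch(word)) and all(is_ecfg_tree(c) for c in children)

def leaf(label):
    return (label, [])

ref = ("Ref", [leaf("Author"), leaf("Author"), leaf("Author"),
               leaf("Title"), leaf("PublData")])
author_count = sum(1 for child in ref[1] if child[0] == "Author")
print(is_ecfg_tree(ref), author_count)
```

Counting the authors is now a scan over the direct children of Ref, with no auxiliary nodes in between.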

Terminal Symbols in Practice
(Extended) CFGs:
–Leaves of parse trees are labelled by single terminal symbols (∈ Σ)
Too granular for practice; instead, terminal symbols stand for all values of a type
–XML DTDs: #PCDATA for variable-length string content
–Proposed XML schema formalisms:
»string, byte, integer, boolean, date, ...
–Explicit string constants are rare in document grammars

2.2 XML, eXtensible Markup Language
W3C Recommendation 10-Feb-1998
–not an official standard, but a stable industry standard
–Second Edition, 6-Oct-2000
»a revision, not a new version of XML
A simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987
–what is said below about valid XML documents applies to SGML documents, too

What is XML?
Extensible Markup Language is not a markup language!
–does not fix semantics or a tag set (like, e.g., HTML does)
A way to use markup to represent information
A metalanguage
–supports definition of specific markup languages
–E.g. XHTML, a reformulation of HTML using XML

Next: Essential Features of XML
An overview of the essentials of XML
–many details skipped
»some to be discussed in exercises or with other topics when the need arises
–learn to consult the original sources (specifications, documentation etc.) for details

XML Encoding of Structure
An XML document is essentially a parenthesized linear encoding of a parse tree (see next)
–corresponds to a pre-order walk
–start of an inner node (or element) A is denoted by a start tag <A>, its end by an end tag </A>
–leaves are strings (or empty elements)
+ certain extensions (especially attributes)

XML Encoding of Structure: Example
[Figure: parse tree with root S and children W ("Hello"), E (empty) and W ("world!"), together with its XML encoding:]
<S><W>Hello</W><E></E><W>world!</W></S>
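
The encoding is a straightforward pre-order walk. A minimal sketch, assuming a tree is either a text string or a (label, children) pair; the function name `to_xml` is hypothetical, and escaping of markup characters in text is deliberately left out.

```python
def to_xml(tree):
    """Serialize a tree by a pre-order walk: start tag, children, end tag."""
    if isinstance(tree, str):
        return tree                     # text leaf (no escaping in this sketch)
    label, children = tree
    return f"<{label}>" + "".join(to_xml(c) for c in children) + f"</{label}>"

tree = ("S", [("W", ["Hello"]), ("E", []), ("W", ["world!"])])
print(to_xml(tree))
```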

An XML Processor (Parser)
Reads XML documents
Passes data to an application
XML Recommendation
–tells how to read, what to pass
–check the XML Rec for details; quite readable!

XML: Logical Document Structure
Elements
–correspond to internal nodes of the parse tree
–unique root element, so a document is a single parse tree
–indicated by matching (case-sensitive!) tags <name> … </name>
–can contain text and/or subelements
–can be empty: <name></name> or (e.g.) <name/>

Logical document structure (2)
Attributes
–name-value pairs attached to elements
–"metadata", usually not treated as content
–in the start tag after the element type name: <name attribute="value" …> …
Also:
–comments: <!-- … -->
–processing instructions: <?target … ?>

CDATA Sections
"CDATA Sections" to include XML markup characters as textual content
<![CDATA[ Here we can easily include markup characters and, for example, code fragments: if (Count < 0) ]]>
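
A quick way to see the effect of a CDATA section is to parse one. This sketch uses Python's standard `xml.etree.ElementTree`; the element name `code` and the code fragment inside it are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Inside the CDATA section, '<', '>' and '&' are plain character data.
source = '<code><![CDATA[if (a < b && b > 0) print("<ok>");]]></code>'
text = ET.fromstring(source).text
print(text)

def parses(xml_text):
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The same characters outside a CDATA section are rejected as broken markup.
print(parses("<code>if (a < b)</code>"))
```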

Two levels of correctness
Well-formed documents
–roughly: follows the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names, ...)
–violation is a fatal error
Valid documents
–(in addition to being well-formed) obey their document type definition
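
The well-formedness conditions listed above can be tried out with any XML parser. The sketch below uses Python's `xml.etree.ElementTree`, which checks well-formedness only and does not validate against a DTD; the sample documents are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

samples = {
    "<a><b>text</b></a>": True,     # properly nested
    "<a><b>text</a></b>": False,    # elements overlap
    "<a>text</A>": False,           # tag names are case-sensitive
    '<a x="1" x="2"/>': False,      # attribute names must be unique
    "<a/><b/>": False,              # more than one root element
}
for text, expected in samples.items():
    print(text, is_well_formed(text), expected)
```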

Document Type Declaration
Provides a grammar (document type definition, DTD) for a class of documents
Syntax: <!DOCTYPE rootElementType SYSTEM "external-subset-URI" [ internal subset ]>
The DTD is the union of the external and internal subsets; the internal subset has higher precedence
–can override entity and attribute declarations (see next)

Markup Declarations
A DTD consists of markup declarations
–element type declarations
»similar to productions of ECFGs
–attribute-list declarations
»for declared element types
–entity declarations (see later)
–notation declarations
»to pass information about external (binary) objects to the application

Element type declarations
The general form is <!ELEMENT elementTypeName (E)>, where E is a content model
–a regular expression of element names
Content model operators:
E | F : alternation
E, F : concatenation
E? : optional
E* : zero or more
E+ : one or more
(E) : grouping
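
Since content models are regular expressions over element names, they translate almost symbol by symbol into the regex syntax of a programming language. The following is a rough sketch of such a translation; the function names are hypothetical, and #PCDATA, EMPTY and ANY are not handled.

```python
import re

def content_model_to_regex(model):
    """Translate a DTD content model over element names into a Python regex."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[()|,?*+]", model):
        if token == ",":
            continue                      # concatenation is implicit in regex syntax
        if token == "(":
            parts.append("(?:")
        elif token in ")|?*+":
            parts.append(token)           # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(token) + " )")   # name plus separator
    return re.compile("".join(parts))

def matches(model, children):
    """True if the sequence of child element names satisfies the content model."""
    encoded = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None

print(matches("(title, auth*, date?, sect+)", ["title", "auth", "sect"]))
```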

Attribute-List Declarations
Can declare attributes for elements:
–name, data type and possible default value
Example: …
Semantics mainly up to the application
–processor checks that ID attributes are unique and that targets of IDREF attributes exist
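
The ID/IDREF check can be sketched in a few lines. Python's `xml.etree.ElementTree` does not read attribute types from a DTD, so this example simply assumes that an attribute named `id` is declared as ID and one named `ref` as IDREF; those names, the element names and the function `check_ids` are all hypothetical.

```python
import xml.etree.ElementTree as ET

def check_ids(root, id_attr="id", idref_attr="ref"):
    """Report duplicate ID values and IDREF values without a matching ID."""
    seen, problems = set(), []
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    for element in root.iter():
        target = element.get(idref_attr)
        if target is not None and target not in seen:
            problems.append(f"dangling IDREF {target!r}")
    return problems

good = ET.fromstring('<doc><sec id="s1"/><xref ref="s1"/></doc>')
bad = ET.fromstring('<doc><sec id="s1"/><sec id="s1"/><xref ref="s2"/></doc>')
print(check_ids(good), check_ids(bad))
```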

Mixed, Empty and Arbitrary Content
Mixed content: <!ELEMENT name (#PCDATA | child1 | child2)*>
–may contain text (#PCDATA) and elements
Empty content: <!ELEMENT name EMPTY>
Arbitrary content: <!ELEMENT name ANY> (= mixed content over all declared element types)

Entities (1)
Multiple uses:
–character entities:
»&lt; &#60; and &#x3C; all expand to '<'
»other predefined entities: &amp; &gt; &apos; &quot; expand to &, >, ' and "
–general entities are shorthand notations: declared as <!ENTITY name "replacement text"> and referenced as &name;
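
Entity expansion is performed by the XML processor before the application sees the text. A small sketch with Python's `xml.etree.ElementTree`; the entity name `uni` and its replacement text are invented for the example.

```python
import xml.etree.ElementTree as ET

# Three ways to write the same character '<'.
less_than = ET.fromstring("<t>&lt;&#60;&#x3C;</t>").text

# The other predefined entities.
others = ET.fromstring("<t>&amp;&gt;&apos;&quot;</t>").text

# A general entity declared in the internal DTD subset (hypothetical name).
doc = '<!DOCTYPE doc [<!ENTITY uni "Some University">]><doc>Welcome to &uni;!</doc>'
expanded = ET.fromstring(doc).text

print(less_than, others, expanded)
```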

Entities (2)
Physical storage units comprising a document
–parsed entities, e.g. <!ENTITY chap1 …>
–the document entity is the starting point of processing
–entities and elements must nest properly:
<!DOCTYPE doc [ <!ENTITY chap1 ( … as above …) > ]>
<doc>&chap1;</doc>
where the replacement text of chap1 consists of complete elements: <sec>…</sec> <sec>…</sec>

Unparsed Entities
External (binary) files
Declarations: a notation declaration <!NOTATION … > and an entity declaration <!ENTITY … NDATA … >
Usage: as the value of an attribute declared to be of type ENTITY
–application receives information about the notation

Parameter entities
Way to parameterize and modularize DTDs:
<!ENTITY % table-dtd …> %table-dtd;
(Parameter-entity references as a part of a markup declaration are allowed only in the external, and not in the internal, DTD subset)

Speculations about XML Parsing
Parsing involves two things:
1. Checking the syntactic correctness of the input
2. Building a parse tree for the input (à la DOM), or otherwise passing the document content to the application (e.g. à la SAX)
Task 2 is simple, thanks to the simplicity of XML markup (see next)
Slightly more difficult (?) to implement are
–pulling the entities together
–checking the well-formedness
–checking the validity wrt the DTD (or a Schema)

Building an XML Parse Tree
[Figure: the parse tree with root S and children W ("Hello"), E (empty) and W ("world!"), built from the XML text:]
<S><W>Hello</W><E></E><W>world!</W></S>
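
Tree building from start-tag, end-tag and text events needs little more than a stack of open elements. The sketch below uses the event callbacks of Python's `xml.parsers.expat`; the tree representation (label, children) and the function name `build_tree` are assumptions of this example, and attributes are ignored.

```python
import xml.parsers.expat

def build_tree(text):
    """Build a (label, children) tree from XML text using a stack of open elements."""
    root = ("#document", [])
    stack = [root]

    def start(name, attrs):
        element = (name, [])
        stack[-1][1].append(element)    # attach to the innermost open element
        stack.append(element)           # the new element is now open

    def end(name):
        stack.pop()                     # close the innermost open element

    def chars(data):
        stack[-1][1].append(data)       # text becomes a leaf

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True           # deliver adjacent text as one chunk
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(text, True)
    return root[1][0]

print(build_tree("<S><W>Hello</W><E></E><W>world!</W></S>"))
```

The parser itself reports well-formedness errors; what the stack adds is only the bookkeeping of which element is currently open.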