Presentation is loading. Please wait.

Presentation is loading. Please wait.

SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic.

Similar presentations


Presentation on theme: "SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic."— Presentation transcript:

1 SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic Basis (a quick review) –regular expressions –extended context-free grammars (as document schemas) and –parse trees (as document instances) 2.2 Review of XML Basics –Practical realisation of the general model: XML document instances and DTDs

2 SDPNotes 2: Document Grammars and Instances 2.1 Regular Expressions n Formalism to describe regular languages: –relatively simple sets of strings over a finite alphabet of »characters, events, document elements, … n Many uses: –text searching patterns (e.g., emacs, ed, awk, grep, Perl) –as a part of grammatical formalisms (for programming languages, XML, in XML DTDs and schemas, …) n Relevance for document structures: what kind of structural content is allowed within various document components

3 SDPNotes 2: Document Grammars and Instances Regular Expressions: Syntax A regular expression over an alphabet  is either A regular expression over an alphabet  is either –  an empty set, –  lambda (sometimes epsilon  ), –a any alphabet symbol a  –(R | S) choice; sometimes (R  S), –(R S) concatenation, or – R *Kleene closure or iteration, where R and S are regular expressions N.B: different syntaxes exist but the idea is same

4 SDPNotes 2: Document Grammars and Instances Regular Expressions: Grouping n Conventions to simplify operator expressions –outermost parentheses may be eliminated: »(E) = E –binary operations are associative: »(A | (B | C)) = ((A | B) | C) = (A | B | C) »(A (B C)) = ((A B) C) = (A B C) –operations have priorities: »iteration first, concatenation next, choice last »for example, (A | (B (C D)*)) = A | B (C D)*

5 SDPNotes 2: Document Grammars and Instances Regular Expressions: Semantics Regular expression E denotes a language (set of strings) L(E), defined inductively as follows : Regular expression E denotes a language (set of strings) L(E), defined inductively as follows : –L(  ) = {} (empty set) –L(  (singleton set of empty string = ‘‘) –L(a) = {a}, ( a , singleton set of word "a"  –L((R | S)) = L(R)  L(S) = {w | w  L(R) or w  L(S)} –L((R S)) = L(R) L(S) = {xy | x  L(R) and y  L(S)} –L(R * ) = L(R) * = {x 1 …x n | x k  L(R), k = 1, …, n; n  0}

6 SDPNotes 2: Document Grammars and Instances Regular Expressions: Examples n L(A | B (C D)*) = ? = L(A)  L( B (C D)*) = {A}  L(B) L((C D)*) = {A}  {B} L(C D)* = {A}  {B} {  CD, CDCD, CDCDCD, …} = {A, B, BCD, BCDCD, BCDCDCD, …}

7 SDPNotes 2: Document Grammars and Instances Regular Expressions: Examples n Simplified top-level structure of a document: –  = {title, auth, date, sect} –title followed by an optional list of authors, fby an optional date, fby one or more sections: title auth* (date |  ) sect sect* Commonly used abbreviations: Commonly used abbreviations: –E? = (E |  ); E + = E E*  the above more compactly: title auth* date? sect+

8 SDPNotes 2: Document Grammars and Instances Context-Free Grammars (CFGs) n Used widely to syntax specification (programming languages, XML, …) and to parser/compiler generation (e.g. YACC/GNU Bison) CFG G formally a quadruple (V,  P, S) CFG G formally a quadruple (V,  P, S) –V is the alphabet of the grammar G –  V  i s a set of terminal symbols; »N = V  is a set of nonterminal symbols –P set of productions –S  V the start symbol

9 SDPNotes 2: Document Grammars and Instances Productions and Derivations Productions: A  where A  N,  V* Productions: A  where A  N,  V* –E.g: A  aBa (production 1) Let   V*. String   derives  directly,  if Let   V*. String   derives  directly,  if –      for some  V*  and  P –E.g. A  => AaBa (assuming production 1 above) n NB: CFGs are often given simply by listing the productions (P); The start symbol (S) is then conventionally the left-hand-side of the first production

10 SDPNotes 2: Document Grammars and Instances Language Generated by a CFG  derives , if there’s a sequence of (0 or more) direct derivations that transforms  to   derives , if there’s a sequence of (0 or more) direct derivations that transforms  to  The language generated by a CFG G: The language generated by a CFG G: –L(G) = {w  * | S =>* w } NB: L(G) is a set of strings; NB: L(G) is a set of strings; –To model document structures, we consider syntax trees

11 SDPNotes 2: Document Grammars and Instances Syntax Trees n Also called parse trees or derivation trees n Ordered trees –consist of nodes that may have child nodes which are ordered left-to-right nodes labelled by symbols of V : nodes labelled by symbols of V : –internal nodes by nonterminals, root by start symbol –leaves by terminal symbols (or empty string ) A node with label A can have children labelled by X 1, …, X k only if A  X 1, …, X k  P A node with label A can have children labelled by X 1, …, X k only if A  X 1, …, X k  P

12 SDPNotes 2: Document Grammars and Instances Syntax Trees: Example CFG for simplified arithmetic expressions: CFG for simplified arithmetic expressions: V = {E, +, *, I};  = {+, *, I}; N = {E}; S = E ( I stands for an arbitrary integer) P = { E  E+E, E  E*E, E  I, E  (E) } n Syntax tree for 2*(3+4)?

13 SDPNotes 2: Document Grammars and Instances Syntax Trees: Example 2 * ( 3 + 4 ) E EE E  E*E I E  I II EE E  E+E E E  ( E )

14 SDPNotes 2: Document Grammars and Instances CFGs for Document Structures n Nonterminals represent document elements –E.g. model for items ( Ref ) of a bibliography list: Ref  AuthorList Title PublData AuthorList  Author AuthorList AuthorList  –E.g. model for items ( Ref ) of a bibliography list: Ref  AuthorList Title PublData AuthorList  Author AuthorList AuthorList  n Notice: –right-hand-side of a production is a fixed string of grammar symbols –Repetition simulated using recursion »e.g. AuthorList above

15 SDPNotes 2: Document Grammars and Instances Example: List of Three Authors AuthorList Author Author Author AuthorList AuthorList AuthorList AhoHopcroftUllman Ref TitlePublData...

16 SDPNotes 2: Document Grammars and Instances Problems "Auxiliary nonterminals" (like AuthorList) obscure the model "Auxiliary nonterminals" (like AuthorList) obscure the model –the last Author several levels apart from its intuitive parent element Ref –awkward to access and to count Authors of a reference –avoided by extended context-free grammars

17 SDPNotes 2: Document Grammars and Instances Extended CFGs (ECFGs) like CFGs, but right-hand-sides of productions are regular expressions over V like CFGs, but right-hand-sides of productions are regular expressions over V –E.g: Ref  Author* Title PublData Let  V*. String   derives  directly,   if Let  V*. String   derives  directly,   if –      for some    V*  and  P such that  L(  ) –E.g. Ref => Author Author Author Title PublData

18 SDPNotes 2: Document Grammars and Instances Language Generated by an ECFG L(G) defined similarly to CFGs: L(G) defined similarly to CFGs: –  derives , if    n =  (for n  0) –L(G) = {w  * | S =>* w } n Theorem: Extended and ordinary CFGs allow to generate the same languages. Syntax trees of ECFGs and CFGs differ! (Next)Syntax trees of ECFGs and CFGs differ! (Next)

19 SDPNotes 2: Document Grammars and Instances Syntax Trees of an ECFG n Similar to parse trees of an ordinary CFG, except that.. –node with label A can have children labelled X 1, …, X k when A  E  P such that X 1 …X k  L(E)  an internal node may have arbitralily many children (e.g., Authors below a Ref node)

20 SDPNotes 2: Document Grammars and Instances Example: Three Authors of a Ref Ref AuthorAuthorAuthorTitle... PublData AhoHopcroftUllman The Design and Analysis... Ref  Author* Title PublData  P, Author Author Author Title PublData  L(Author* Title PublData)

21 SDPNotes 2: Document Grammars and Instances Terminal Symbols in Practise n (Extended) CFGs: –Leaves of parse trees are labelled by single terminal symbols (  ) n Too granular for practise; instead terminal symbols which stand for all values of a type –XML DTDs: #PCDATA for variable length string content –Proposed XML schema formalisms: »string, byte, integer, boolean, date,... –Explicit string constants rare in document grammars

22 SDPNotes 2: Document Grammars and Instances 2.2 XML, eXtensible Markup Language n W3C Recommendation 10-Feb-1998 –not an official standard, but a stable industry standard –Second Edition, 6-Oct-2000 »a revision, not a new version of XML n a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987 –what is said below about valid XML documents applies to SGML documents, too

23 SDPNotes 2: Document Grammars and Instances What is XML? n Extensible Markup Language is not a markup language! –does not fix semantics or a tag set (like, e.g., HTML does) n A way to use markup to represent information n A metalanguage –supports definition of specific markup languages –E.g. XHTML a reformulation of HTML using XML

24 SDPNotes 2: Document Grammars and Instances Next: Essential Features of XML n An overview of the essentials of XML –many details skipped »some to be discussed in exercises or with other topics when the need arises –learn to consult the original sources (specifications, documentation etc) for details

25 SDPNotes 2: Document Grammars and Instances XML Encoding of Structure n XML document essentially a parenthesized linear encoding of a parse tree (see next) –corresponds to a pre-order walk –start of inner node (or element) A denoted by a start tag, end denoted by end tag –start of inner node (or element) A denoted by a start tag, end denoted by end tag –leaves are strings (or empty elements) + certain extensions (especially attributes)

26 SDPNotes 2: Document Grammars and Instances XML Encoding of Structure: Example <S> S W W world! Hello E </S><W><W></W></W><E></E>Helloworld!

27 SDPNotes 2: Document Grammars and Instances An XML Processor (Parser) n Reads XML documents n Passes data to an application n XML Recommendation –tells how to read, what to pass –check the XML Rec for details; quite readable!

28 SDPNotes 2: Document Grammars and Instances XML: Logical Document Structure n Elements –correspond to internal nodes of the parse tree –unique root element  document a single parse tree –indicated by matching (case-sensitive!) tags … –indicated by matching (case-sensitive!) tags … –can contain text and/or subelements –can be empty: OR (e.g.) –can be empty: OR (e.g.)

29 SDPNotes 2: Document Grammars and Instances Logical document structure (2) n Attributes –name-value pairs attached to elements –‘‘metadata’’, usually not treated as content –in start-tag after the element type name … … n Also: – –

30 SDPNotes 2: Document Grammars and Instances CDATA Sections n “CDATA Sections” to include XML markup characters as textual content <![CDATA[ Here we can easily include markup characters and, for example, code fragments: if (Count 0) if (Count 0) ]]>

31 SDPNotes 2: Document Grammars and Instances Two levels of correctness n Well-formed documents – roughly: follows the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names,...) –violation is a fatal error n Valid documents –(in addition to being well-formed) obey their document type definition

32 SDPNotes 2: Document Grammars and Instances Document Type Declaration n Provides a grammar (document type definition, DTD) for a class of documents Syntax [ ]> Syntax [ ]> n DTD is the union of the external and internal subset; internal subset has higher precedence –can override entity and attribute declarations (see next)

33 SDPNotes 2: Document Grammars and Instances Markup Declarations n DTD consists of markup declarations –element type declarations »similar to productions of ECFGs –attribute-list declarations »for declared element types –entity declarations (see later) –notation declarations »to pass information about external (binary) objects to the application

34 SDPNotes 2: Document Grammars and Instances Element type declarations The general form is where E is a content model The general form is where E is a content model  regular expression of element names n Content model operators: E | F : alternationE, F: concatenation E? : optionalE* : zero or more E+ : one or more(E) : grouping

35 SDPNotes 2: Document Grammars and Instances Attribute-List Declarations n Can declare attributes for elements: –Name, data type and possible default value Example: Example: n Semantics mainly up to the application –processor checks that ID attributes are unique and that targets of IDREF attributes exist

36 SDPNotes 2: Document Grammars and Instances Mixed, Empty and Arbitrary Content Mixed content: Mixed content: –may contain text ( #PCDATA ) and elements Empty content: Empty content: Arbitrary content: (= ) Arbitrary content: (= )

37 SDPNotes 2: Document Grammars and Instances Entities (1) n Multiple uses: –character entities: »< < and < all expand to ‘ < ‘ »other predefined entities: & > &apos; &quote; expand to &, >, ' and " –general entities are shorthand notations: –general entities are shorthand notations:

38 SDPNotes 2: Document Grammars and Instances Entities (2) n physical storage units comprising a document –parsed entities –parsed entities –document entity is the starting point of processing –entities and elements must nest properly: <!DOCTYPE doc [ <!ENTITY chap1 ( … as above …) > ] <doc>&chap1;</doc> …</sec> …</sec>

39 SDPNotes 2: Document Grammars and Instances Unparsed Entities n External (binary) files Declarations: Declarations: Usage: Usage: –application receives information about the notation

40 SDPNotes 2: Document Grammars and Instances Parameter entities Way to parameterize and modularize DTDs %table-dtd; Way to parameterize and modularize DTDs %table-dtd; (The latter, parameter entities as a part of a markup declaration, is allowed only in the external, and not in the internal DTD subset)

41 SDPNotes 2: Document Grammars and Instances Speculations about XML Parsing n Parsing involves two things: 1. Checking the syntactic correctness of the input 2. Building a parse tree for the input (a'la DOM), or otherwise passing the document content to the application (e.g. a'la SAX) n Task 2 is simple, thanks to the simplicity of XML markup (see next) n Slightly more difficult (?) to implement are –pulling the entities together –checking the well-formedness –checking the validity wrt the DTD (or a Schema)

42 SDPNotes 2: Document Grammars and Instances Building an XML Parse Tree <S> E S </S><W><W></W></W><E></E>Helloworld! W Hello W world!


Download ppt "SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic."

Similar presentations


Ads by Google