
Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka.


1 Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka

2 Course organization
- 581290-5 laudatur course, 3 cu
- lectures (in Finnish)
  - 22.1.-21.2. Tue 12-14, Thu 10-12
  - not obligatory
- exercise sessions
  - 29.1.-27.2.
  - course assistants: Olli Lahti and Miro Lehtonen (new group Wed 12-14 A318)
  - not obligatory

3 Requirements
- Exam (Wed 6.3. at 16-20): 45 points
- Project: 15 points
- Exercises: 5 extra points
- Maximum of points: 60

4 Outline (preliminary)
1. Descriptions of structure
  - context-free grammars
  - namespaces, information sets
  - (XML DTD,) XML Schema
2. Programming interfaces
  - SAX, DOM
  - SOAP
3. Traversing documents
  - XPath

5 Outline...
4. Querying structured documents
  - XML Query
5. XML Linking
6. XML databases
7. Metadata: RDF
8. Compressing XML data
9. ...

6 Prerequisites
- You should know the basics of XML
  - DTD, elements, attributes, syntax
  - XSLT (basics), formatting
- some programming experience is needed

7 Group project
- Group of 4-5 students
  - groups are formed in the exercise sessions in the 2nd week
- Task: construct a toy B2B e-commerce application
  - a travel agency which sells packages containing hotel nights and concerts
  - a hotel (or several)
  - a concert ticket office

8 Group project
- Task continues
  - a customer can reserve packages using a web page
  - a reservation causes a query to the hotels and the ticket offices for the availability of rooms and tickets
  - for all the communication and for the storage of all the documents you should use XML

9 Group project
- Try to get some simple implementation to work
  - may depend on the support we can offer
- you don't have to consider all the real-life problems, like consistency of reservations
- concentrate on playing with XML
- state of the work is presented in the last exercise sessions (also by students who don't normally attend exercises)

10 Requirements for project
- More instructions follow later...
- return a report by 22.3. (as a URL)
- The report should include
  - (short) requirements analysis
  - descriptions of the structure (DTD, Schema)
  - other designs, architecture,...
- Some kind of a working prototype
  - not necessarily the whole system

11 1. Structure descriptions
- Regular expressions, context-free grammars -> What is XML?
- (XML Document type definitions)
- namespaces, information sets
- XML Schema

12 Regular expressions
- A way to describe a set of strings over an alphabet (of chars, events, elements…)
- many uses:
  - text searching (e.g. emacs, grep, perl)
  - in grammatical formalisms (e.g. XML DTDs)
- relevant for document structures: what kind of structural content is allowed for different document components

13 Regular expressions
- A regular expression over alphabet Σ is either
  - ∅ (an empty set)
  - ε (epsilon; sometimes lambda λ)
  - a, where a ∈ Σ
  - R | S (choice; sometimes R ∪ S)
  - R S (catenation) or
  - R* (Kleene closure)
- where R and S are regular expressions

14 Regular expressions
- Regular expression E denotes a language (a set of strings) L(E):
  - L(∅) = ∅ (empty set)
  - L(ε) = {ε} (singleton set of empty string)
  - L(a) = {a} (singleton set of a ∈ Σ)
  - L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)}
  - L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)}
  - L(R*) = L(R)* = {x_1…x_n | x_k ∈ L(R), k=1,…,n; n ≥ 0}
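The definition of L(E) can be tried out directly. The following is only a minimal sketch: the tuple encoding of expressions and the function name `language` are assumptions made for this illustration, and the language is cut off at `max_len` symbols so that R* stays finite.

```python
# Regular expressions as nested tuples (a hypothetical encoding for this sketch):
#   ("empty",)      the empty set
#   ("eps",)        epsilon
#   ("sym", a)      a single alphabet symbol a
#   ("alt", R, S)   choice R | S
#   ("cat", R, S)   catenation R S
#   ("star", R)     Kleene closure R*

def language(expr, max_len):
    """Return the strings of L(expr) with at most max_len symbols, as tuples."""
    kind = expr[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {()}
    if kind == "sym":
        return {(expr[1],)} if max_len >= 1 else set()
    if kind == "alt":
        # L(R|S) is the union of L(R) and L(S)
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":
        # L(RS) = {xy | x in L(R) and y in L(S)}, truncated to max_len
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        # Repeatedly append non-empty strings of L(R) until nothing new fits
        base = language(expr[1], max_len) - {()}
        result = {()}
        frontier = {()}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")
```

For instance, `language(("star", ("sym", "a")), 2)` yields the empty string, `a` and `aa`, each represented as a tuple of symbols.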

15 Example
- top-level structure of a document:
  - Σ = {title, author, date, sect}
  - title followed by an optional list of authors, followed by an optional date, followed by one or more sections:
  - title author* (date | ε) sect sect*
- common abbreviations:
  - E? = (E | ε); E+ = E E*
  - -> title author* date? sect+
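The top-level structure above can be checked with an ordinary regex engine. This is a sketch under one assumption: each alphabet symbol (title, author, date, sect) is written as a word followed by a single space, so that a character-level regex can stand in for a regular expression over symbols. The names `TOP_LEVEL` and `is_valid_top_level` are made up for the illustration.

```python
import re

# "title, optional list of authors, optional date, one or more sections",
# with every symbol encoded as a word plus a trailing space (assumption).
TOP_LEVEL = re.compile(r"title (author )*(date )?(sect )+")

def is_valid_top_level(symbols):
    """Check whether a sequence of symbol names matches the expression."""
    encoded = "".join(symbol + " " for symbol in symbols)
    return TOP_LEVEL.fullmatch(encoded) is not None
```

With this encoding, `["title", "author", "author", "date", "sect", "sect"]` is accepted, while `["title", "date", "author", "sect"]` is rejected because the date comes before an author.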

16 Context-free grammars
- Used widely for syntax specification (programming languages)
- G = (V, Σ, P, S)
  - V: the alphabet of the grammar G; V = Σ ∪ N
  - Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols
  - P: set of productions
  - S ∈ N: the start symbol

17 Productions and derivations
- Productions: A -> α, where A ∈ N, α ∈ V*
  - e.g. A -> aBa (1)
- Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  - γ = μAν, δ = μαν for some μ, ν ∈ V*, and A -> α is a production of the grammar
  - e.g. AA => AaBa (assuming prod. 1 above)
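A direct derivation step replaces one occurrence of a nonterminal by the right-hand side of one of its productions. A minimal sketch, assuming sentential forms are tuples of symbols and productions are kept in a dict (both are choices made for this illustration, as is the name `derive_directly`):

```python
def derive_directly(form, productions):
    """Return every sentential form reachable from `form` in one direct step."""
    results = set()
    for i, symbol in enumerate(form):
        # Only symbols with productions (nonterminals) can be rewritten
        for rhs in productions.get(symbol, []):
            results.add(form[:i] + tuple(rhs) + form[i + 1:])
    return results

# Production (1) from the slide: A -> aBa
productions = {"A": [("a", "B", "a")]}
steps = derive_directly(("A", "A"), productions)
```

Here `steps` contains both AaBa (the slide's example, rewriting the second A) and aBaA (rewriting the first A).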

18 Language generated by a context-free grammar
- γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ
- The language generated by a CFG G:
  - L(G) = {w ∈ Σ* | S =>* w}
- L(G) is a set of strings: to model structural elements, we consider parse trees

19 Parse trees of a CFG
- Aka syntax trees or derivation trees
- nodes labelled by symbols of V (or by ε):
  - internal nodes by nonterminals, root by start symbol
  - leaves using terminal symbols (or ε)
- parent with label A can have children labeled by X_1,…,X_k only if A -> X_1…X_k is a production

20 CFGs for document structures
- Nonterminals represent document structures
  - e.g.
      Ref -> AuthorList Title PublData
      AuthorList -> Author AuthorList
      AuthorList -> ε
- problem:
  - obscures the relation of elements (the last Author several hierarchical levels away from Ref) -> solution: extended CFGs

21 Extended CFGs (ECFGs)
- Like CFGs, but right-hand sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData
- Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  - γ = μAν, δ = μαν for some μ, ν ∈ V*, and A -> E is a production such that α ∈ L(E)
  - e.g. Ref => Author Author Author Title PublData

22 Language generated by an ECFG
- Defined similarly to CFGs
- Theorem: Languages generated by extended and ordinary CFGs are the same

23 Parse trees of an ECFG
- Similar to parse trees of an ordinary CFG, except that…
- parent with label A can have children labeled by X_1,…,X_k when A -> E is a production such that X_1…X_k ∈ L(E)
- -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
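The parse-tree condition for ECFGs is essentially what a validating parser checks for each element. The sketch below assumes a tree is a `(label, children)` pair and that each production's regular expression is written over symbol names followed by a space. Author, Title and PublData are treated as leaves here only because the slides give no productions for them.

```python
import re

# Hypothetical ECFG with the single production Ref -> Author* Title PublData
ECFG = {
    "Ref": re.compile(r"(Author )*Title PublData "),
}

def is_parse_tree(node, grammar):
    """Check that every internal node's children match its production."""
    label, children = node
    if label not in grammar:
        # No production for this label: it must be a leaf
        return not children
    encoded = "".join(child[0] + " " for child in children)
    if grammar[label].fullmatch(encoded) is None:
        return False
    return all(is_parse_tree(child, grammar) for child in children)

three_authors = ("Ref", [("Author", []), ("Author", []), ("Author", []),
                         ("Title", []), ("PublData", [])])
```

The tree `three_authors` is accepted, and all three Author nodes are direct children of Ref, which is the point of using an extended grammar.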

24 What is XML?
- metalanguage that can be used to define markup languages
  - gives syntax for defining extended context-free grammars
  - XML documents that adhere to an ECFG are strings in that language
  - document types (grammars) - document instances (strings in the language)

25 XML encoding of structure
- An XML document is essentially a parenthesized linear encoding of a parse tree
  - corresponds to a preorder walk
  - start of inner node (element) A denoted by a start tag, end denoted by end tag
  - leaves are strings (or empty elements)
- + certain extensions (especially attributes)
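The encoding can be illustrated by serializing a small tree with a preorder walk. This is only a sketch: the tree representation (a string for a text leaf, a `(name, children)` pair for an element) and the sample content are assumptions, and attributes are left out.

```python
from xml.sax.saxutils import escape

def to_xml(node):
    """Serialize a tree by a preorder walk: start tag, children, end tag."""
    if isinstance(node, str):
        # Text leaf: escape the characters that are special in XML
        return escape(node)
    name, children = node
    if not children:
        # Leaf element without content becomes an empty element
        return f"<{name}/>"
    inner = "".join(to_xml(child) for child in children)
    return f"<{name}>{inner}</{name}>"

tree = ("Ref", [("Author", ["A. Author"]),
                ("Title", ["A title"]),
                ("PublData", [])])
```

Calling `to_xml(tree)` gives `<Ref><Author>A. Author</Author><Title>A title</Title><PublData/></Ref>`.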

26 Terminal symbols in practice
- Leaves of parse trees are labeled by single characters (symbols of Σ)
- too granular in practice: instead, terminal symbols which stand for all values of a type
  - e.g. #PCDATA in XML for variable-length content of data characters
  - richer data types in XML schema formalisms

27 An example DTD
<!DOCTYPE invoice [
  <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
]>

28 And a document (element tags lost in extraction; only the text content remains):
19990121 19990125 Ashok Malhotra 123 IBM Ave. Hawthorne NY 10532-0000 555-1234 555-4321

29 XML processing model
- A processor (parser)
  - reads XML documents
  - passes data to an application
- XML Specification tells how to read, what to pass
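One way to see the division of labour between processor and application is an event-based parser. The sketch below uses the SAX interface from Python's standard library; the `Collector` class, the `parse_events` helper and the tiny sample document are assumptions made for illustration, with element names borrowed from the invoice DTD.

```python
import xml.sax

class Collector(xml.sax.ContentHandler):
    """The 'application': it just records what the parser passes to it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Ignore whitespace-only character data
        if content.strip():
            self.events.append(("text", content))

def parse_events(text):
    """Let the parser read the document and return the recorded events."""
    handler = Collector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events

events = parse_events("<invoice><fax>555-4321</fax></invoice>")
```

The parser decides how the character stream is read; the handler only sees start tags, end tags and character data in document order.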

30 Well-formed XML documents
- documents that adhere to the formal requirements (syntax) of the XML specification
- if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
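Well-formedness can be tested simply by trying to parse the text. A minimal sketch with Python's standard library, where the helper name `is_well_formed` is made up for this example; note that this parser does not validate against a DTD, so it checks well-formedness only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

For example, `<a><b/></a>` passes, whereas `<a><b></a>` (unclosed `b`) and `<a></a><b></b>` (two root elements) are rejected.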

31 Valid documents
- a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD
- an XML processor can be validating or non-validating
- sometimes validity is important, sometimes not

