Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka.



2 Course organization
- laudatur course, 3 cu
- lectures (in Finnish)
  - Tue 12-14, Thu
  - not obligatory
- exercise sessions
  - course assistants: Olli Lahti and Miro Lehtonen (new group Wed A318)
  - not obligatory

3 Requirements
- Exam (Wed 6.3. at 16-20): 45 points
- Project: 15 points
- Exercises: 5 extra points
- Maximum of points: 60

4 Outline (preliminary)
- 1. Descriptions of structure
  - context-free grammars
  - namespaces, information sets
  - (XML DTD,) XML Schema
- 2. Programming interfaces
  - SAX, DOM
  - SOAP
- 3. Traversing documents
  - XPath

5 Outline...
- 4. Querying structured documents
  - XML Query
- 5. XML Linking
- 6. XML databases
- 7. Metadata: RDF
- 8. Compressing XML data
- 9. ...

6 Prerequisites
- You should know the basics of XML
  - DTD, elements, attributes, syntax
  - XSLT (basics), formatting
- some programming experience is needed

7 Group project
- Group of 4-5 students
  - groups are formed in the exercise sessions in the 2nd week
- Task: construct a toy B2B e-commerce application
  - a travel agency which sells packages containing hotel nights and concerts
  - a hotel (or several)
  - a concert ticket office

8 Group project
- Task continues
  - a customer can reserve packages using a web page
  - a reservation causes a query to the hotels and the ticket offices for the availability of rooms and tickets
  - for all the communication and for the storage of all the documents you should use XML

9 Group project
- Try to get a simple implementation working
  - may depend on the support we can offer
- you don't have to consider all the real-life problems, like consistency of reservations
- concentrate on playing with XML
- the state of the work is presented in the last exercise sessions (also by students who don't normally attend exercises)

10 Requirements for project
- More instructions follow later...
- return a report by (as a URL)
- The report should include
  - (short) requirements analysis
  - descriptions of the structure (DTD, Schema)
  - other designs, architecture, ...
- Some kind of a working prototype
  - not necessarily the whole system

11 1. Structure descriptions
- Regular expressions, context-free grammars -> What is XML?
- (XML Document type definitions)
- namespaces, information sets
- XML Schema

12 Regular expressions
- A way to describe a set of strings over an alphabet (of chars, events, elements, ...)
- many uses:
  - text searching (e.g. emacs, grep, perl)
  - in grammatical formalisms (e.g. XML DTDs)
- relevant for document structures: what kind of structural content is allowed for different document components

13 Regular expressions
- A regular expression over alphabet Σ is either
  - ∅ (an empty set)
  - ε (epsilon; sometimes lambda, λ)
  - a, where a ∈ Σ
  - R | S (choice; sometimes R ∪ S)
  - R S (catenation) or
  - R* (Kleene closure)
- where R and S are regular expressions

14 Regular expressions
- Regular expression E denotes a language (a set of strings) L(E):
  - L(∅) = ∅ (empty set)
  - L(ε) = {ε} (singleton set of the empty string)
  - L(a) = {a} (singleton set of a ∈ Σ)
  - L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)}
  - L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)}
  - L(R*) = L(R)* = {x_1 … x_n | x_k ∈ L(R), k = 1, …, n; n ≥ 0}
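The definition of L(E) can be followed case by case in code. The following is a minimal Python sketch, not part of the original slides: the tuple encoding of expressions and the function name `language` are assumptions made for illustration, and because L(R*) is usually infinite the sketch only returns strings up to a given length.

```python
# Regular expressions as nested tuples (a hypothetical encoding for this sketch):
#   ("empty",)      the empty set
#   ("eps",)        the empty string
#   ("sym", "a")    a single symbol a
#   ("alt", R, S)   R | S
#   ("cat", R, S)   R S
#   ("star", R)     R*

def language(expr, max_len):
    """Return the strings of L(expr) whose length is at most max_len."""
    kind = expr[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {expr[1]} if len(expr[1]) <= max_len else set()
    if kind == "alt":
        # L(R|S) is the union of L(R) and L(S).
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":
        # L(RS) contains every xy with x in L(R) and y in L(S).
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        # L(R*): concatenate zero or more strings of L(R), up to max_len.
        base = language(expr[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


# (a | b)* restricted to strings of length at most 2
a_or_b_star = ("star", ("alt", ("sym", "a"), ("sym", "b")))
print(sorted(language(a_or_b_star, 2), key=lambda w: (len(w), w)))
```

The length bound is a design choice of the sketch only; the definition itself places no bound on n.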

15 Example
- top-level structure of a document:
  - Σ = {title, author, date, sect}
  - title followed by an optional list of authors, followed by an optional date, followed by one or more sections:
  - title author* (date | ε) sect sect*
- common abbreviations:
  - E? = (E | ε); E+ = E E*
  - -> title author* date? sect+
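The content model of the example (a title, any number of authors, an optional date, one or more sections) can be tried out with Python's `re` module. This is a sketch under assumptions: the alphabet consists of element names rather than characters, so each name is encoded as a token followed by one space; the helper `matches_model` is hypothetical and not from the slides.

```python
import re

# Content model: title author* date? sect+
# Each symbol is matched as a whole token followed by one space.
CONTENT_MODEL = re.compile(r"(?:title )(?:author )*(?:date )?(?:sect )+")

def matches_model(symbols):
    """Check whether a sequence of symbol names belongs to the language."""
    encoded = "".join(name + " " for name in symbols)
    return CONTENT_MODEL.fullmatch(encoded) is not None


print(matches_model(["title", "author", "author", "date", "sect", "sect"]))
print(matches_model(["title", "date", "author", "sect"]))
```

The first call uses a sequence in the expected order; the second puts the date before the author, which the expression does not allow.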

16 Context-free grammars
- Used widely for syntax specification (programming languages)
- G = (V, Σ, P, S)
  - V: the alphabet of the grammar G; V = Σ ∪ N
  - Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols
  - P: set of productions
  - S ∈ N: the start symbol

17 Productions and derivations
- Productions: A -> α, where A ∈ N, α ∈ V*
  - e.g. A -> aBa (1)
- Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  - γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar
  - e.g. AA => AaBa (assuming prod. 1 above)

18 Language generated by a context-free grammar
- γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ
- The language generated by a CFG G:
  - L(G) = {w ∈ Σ* | S =>* w}
- L(G) is a set of strings: to model structural elements, we consider parse trees
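To make "S =>* w" concrete, the following Python sketch enumerates the short strings of L(G) by applying direct derivation steps breadth-first, always rewriting the leftmost nonterminal. The grammar S -> aSb | ε, the function name `generate` and the bound `max_form_len` are assumptions chosen for illustration; they do not come from the slides.

```python
from collections import deque

def generate(productions, start, max_len, max_form_len=None):
    """Enumerate terminal strings w with start =>* w and len(w) <= max_len.

    productions maps each nonterminal to a list of right-hand sides,
    each a tuple of symbols; any symbol that is not a key is a terminal.
    """
    if max_form_len is None:
        # Heuristic bound on sentential forms so the search always ends.
        max_form_len = 2 * max_len + 2
    seen = {(start,)}
    queue = deque([(start,)])
    words = set()
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal, if any.
        index = next((i for i, s in enumerate(form) if s in productions), None)
        if index is None:
            words.add("".join(form))  # only terminals left: a string of L(G)
            continue
        for rhs in productions[form[index]]:
            # One direct derivation step: replace A by the right-hand side.
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            terminals = sum(1 for s in new_form if s not in productions)
            if terminals > max_len or len(new_form) > max_form_len:
                continue
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words


GRAMMAR = {"S": [("a", "S", "b"), ()]}  # S -> aSb | ε
print(sorted(generate(GRAMMAR, "S", 4), key=len))
```

Because of the heuristic bound on sentential forms, the sketch may miss strings for grammars whose derivations need long intermediate forms; for the small grammar above this does not matter.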

19 Parse trees of a CFG
- Aka syntax trees or derivation trees
- nodes labelled by symbols of V (or by ε):
  - internal nodes by nonterminals, root by start symbol
  - leaves using terminal symbols (or ε)
- parent with label A can have children labelled by X_1, …, X_k only if A -> X_1 … X_k is a production

20 CFGs for document structures
- Nonterminals represent document structures
  - e.g.
    Ref -> AuthorList Title PublData
    AuthorList -> Author AuthorList
    AuthorList -> ε
- problem:
  - obscures the relation of elements (the last Author is several hierarchical levels away from Ref)
  - -> solution: extended CFGs

21 Extended CFGs (ECFGs)
- Like CFGs, but right-hand sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData
- Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  - γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E)
  - e.g. Ref => Author Author Author Title PublData

22 Language generated by an ECFG
- Defined similarly to CFGs
- Theorem: Languages generated by extended and ordinary CFGs are the same

23 Parse trees of an ECFG
- Similar to parse trees of an ordinary CFG, except that...
- parent with label A can have children labelled by X_1, …, X_k when A -> E is a production such that X_1 … X_k ∈ L(E)
- -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
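The parse tree condition can be checked mechanically: at every internal node, the sequence of child labels must belong to the language of the production's regular expression. A minimal Python sketch follows. The grammar holds only the production Ref -> Author* Title PublData from the earlier example, and Author, Title and PublData are treated as terminals purely for illustration; the encoding (one trailing space per symbol) and the name `is_parse_tree` are assumptions of the sketch.

```python
import re

# Hypothetical ECFG: each nonterminal maps to a regular expression over
# symbol names, written with one trailing space per symbol.
ECFG = {
    "Ref": r"(?:Author )*Title PublData ",
}

def is_parse_tree(node, grammar):
    """node is a pair (label, children); leaves have an empty child list."""
    label, children = node
    if label not in grammar:
        # Terminal symbol: it must be a leaf.
        return not children
    # The child labels X_1 ... X_k must be in L(E) for the production A -> E.
    encoded = "".join(child[0] + " " for child in children)
    if re.fullmatch(grammar[label], encoded) is None:
        return False
    return all(is_parse_tree(child, grammar) for child in children)


tree = ("Ref", [("Author", []), ("Author", []), ("Author", []),
                ("Title", []), ("PublData", [])])
print(is_parse_tree(tree, ECFG))
```

Note that the three Author nodes are direct children of Ref, which is the point of using an extended grammar.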

24 What is XML?
- metalanguage that can be used to define markup languages
  - gives syntax for defining extended context-free grammars
  - XML documents that adhere to an ECFG are strings in that language
  - document types (grammars) and document instances (strings in the language)

25 XML encoding of structure
- XML document essentially a parenthesized linear encoding of a parse tree
  - corresponds to a preorder walk
  - start of inner node (element) A denoted by a start tag, end denoted by end tag
  - leaves are strings (or empty elements)
- plus certain extensions (especially attributes)
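The encoding can be sketched as a short recursive function: visit a node, write its start tag, serialize the children in order, then write the end tag. This Python sketch is an illustration only; the tree representation (a string for a text leaf, a pair of name and children for an element) and the name `to_xml` are assumptions, and attributes are left out.

```python
from xml.sax.saxutils import escape

def to_xml(node):
    """Serialize a tree by a preorder walk.

    node is either a string (a text leaf) or a pair (name, children).
    """
    if isinstance(node, str):
        return escape(node)      # escape &, < and > in character data
    name, children = node
    if not children:
        return f"<{name}/>"      # empty element
    inner = "".join(to_xml(child) for child in children)
    return f"<{name}>{inner}</{name}>"


tree = ("Ref", [("Author", ["A & B"]), ("Title", ["XML"]), ("PublData", [])])
print(to_xml(tree))
```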

26 Terminal symbols in practice
- Leaves of parse trees are labelled by single characters (symbols of Σ)
- too granular in practice: instead, use terminal symbols that stand for all values of a type
  - e.g. #PCDATA in XML for variable-length content of data characters
  - richer data types in XML schema formalisms

27 An example DTD
<!DOCTYPE invoice [ <!ELEMENT invoice (orderDate, shipDate, billingAddress voice*, fax?)> ]>

28 And a document:
Ashok Malhotra 123 IBM Ave. Hawthorne NY

29 XML processing model
- A processor (parser)
  - reads XML documents
  - passes data to an application
- XML Specification tells how to read, what to pass

30 Well-formed XML documents
- documents that adhere to the formal requirements (syntax) of the XML specification
- if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
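In practice a parser reports well-formedness errors as soon as it meets them. A small Python sketch using the standard library's `xml.etree.ElementTree` follows; the helper name `is_well_formed` is an assumption for illustration. The parser behind it is non-validating, so it checks syntax only and does not compare the document against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<invoice><fax/></invoice>"))   # properly nested tags
print(is_well_formed("<invoice><fax></invoice>"))    # fax is never closed
```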

31 Valid documents
- a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD
- an XML processor can be validating or non-validating
- sometimes validity is important, sometimes not