Presentation is loading. Please wait.

Presentation is loading. Please wait.

Database Management Systems, R. Ramakrishnan1 Introduction to Semistructured Data and XML Chapter 27.

Similar presentations


Presentation on theme: "Database Management Systems, R. Ramakrishnan1 Introduction to Semistructured Data and XML Chapter 27."— Presentation transcript:

1 Database Management Systems, R. Ramakrishnan1 Introduction to Semistructured Data and XML Chapter 27

2 Database Management Systems, R. Ramakrishnan2 How the Web is Today  HTML documents often generated by applications consumed by humans only easy access: across platforms, across organizations  No application interoperability: HTML not understood by applications Database technology: client-server

3 Database Management Systems, R. Ramakrishnan3 New Universal Data Exchange Format: XML A recommendation from the W3C  XML = data  XML generated by applications  XML consumed by applications  Easy access: across platforms, organizations

4 Database Management Systems, R. Ramakrishnan4 Paradigm Shift on the Web  From documents (HTML) to data (XML)  From information retrieval to data management  For databases, also a paradigm shift: from relational model to semistructured data from data processing to data/query translation from storage to transport

5 Database Management Systems, R. Ramakrishnan5 Semistructure data 1. Information integration: important new application that motivates what follows. 2. Semistructured data: a new data model designed to cope with problems of information integration. 3. XML ( Extensible Markup Language ) : a new Web standard that is essentially semistructured data. 4. XQUERY: an emerging standard query language for XML data.

6 Database Management Systems, R. Ramakrishnan6 Information Integration Problem: related data exists in many places. They talk about the same things, but differ in model, schema, conventions ( e.g., terminology). Example: In the real world, every bar has its own database.  Some may have relations like beer-price; others have an Microsoft Word file from which the menu is printed.  Some keep phones of manufacturers but not addresses.  Some distinguish beers and ales; others do not.

7 Database Management Systems, R. Ramakrishnan7 The Semistructured Data Model &o1 &o12&o24&o29 &o43 &96 &243 &206 &25 “Serge” “Abiteboul” 1997 “Victor” “Vianu” 122133 paper book paper references author title year http author title publisher author title page firstname lastname firstnamelastnamefirst last Bib Object Exchange Model (OEM) complex object atomic object

8 Database Management Systems, R. Ramakrishnan8 Characteristics of Semistructured Data  Missing or additional attributes  Multiple attributes  Different types in different objects  Heterogeneous collections Self-describing, irregular data, no a priori structure

9 Database Management Systems, R. Ramakrishnan9 Comparison with Relational Data { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } } row name phone “John”3634“Sue”“Dick”63436363

10 Database Management Systems, R. Ramakrishnan10 XML (Extensible Markup Language)  A W3C standard to complement HTML  Origins: Structured text SGML Large-scale electronic publishing Data exchange on the web  Motivation: HTML describes presentation XML describes content

11 Database Management Systems, R. Ramakrishnan11 From HTML to XML HTML describes the presentation

12 Database Management Systems, R. Ramakrishnan12 HTML Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteboul, Buneman, Suciu Morgan Kaufmann, 1999

13 Database Management Systems, R. Ramakrishnan13 XML Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … XML describes the content

14 Database Management Systems, R. Ramakrishnan14 Why are we DB’ers interested?  It’s data. That’s us.  Database issues: How are we going to model XML? (graphs). How are we going to query XML? (XQuery) How are we going to store XML (in a relational database? object-oriented? native?) How are we going to process XML efficiently? (many interesting research questions!)

15 Database Management Systems, R. Ramakrishnan15 XML Terminology  Tags: book, title, author, … start tag:, end tag:  Elements: …, … elements can be nested empty element: (Can be abbrv. )  XML document: Has a single root element  Well-formed XML document: Has matching tags  Valid XML document: conforms to a schema

16 Database Management Systems, R. Ramakrishnan16 Well-Formed XML 1. Declaration =. Normal declaration is “Standalone” means that there is no DTD specified. 2. Root tag surrounds the entire balance of the document.  is balanced by, as in HTML. 3. Any balanced structure of tags OK. Option of tags that don’t require balance, like in HTML.

17 Database Management Systems, R. Ramakrishnan17 XML: An Example Richard Feynman The Character of Physical Law 1980 R.K. Narayan Waiting for the Mahatma 1981 R.K. Narayan The English Teacher 1980

18 Database Management Systems, R. Ramakrishnan18 XML – Elements …  Xml is case and space sensitive  Element opening and closing tag names must be identical  Opening tags: “ ”  Closing tags: “ ”  Empty Elements have no data and no closing tag: They begin with a “ ” closing tag attribute attribute valuedata open tag element name

19 Database Management Systems, R. Ramakrishnan19 XML – Attributes …  Attributes provide additional information for element tags.  There can be zero or more attributes in every element; each one has the form: attribute_name =“ attribute_value” - There is no space between the name and the “=” - Attribute values must be surrounded by “ or ‘ characters  Multiple attributes are separated by white space (one or more spaces or tabs). closing tag attribute attribute valuedata open tag element name

20 Database Management Systems, R. Ramakrishnan20 Elements The segment of an XML document between an opening and a corresponding closing tag is called an element. Malcolm Atchison (215) 898 4321 mp@dcs.gla.ac.sc element not an element element, a sub-element of

21 Database Management Systems, R. Ramakrishnan21 XML – Data and Comments …  Xml data is any information between an opening and closing tag  Xml data must not contain the ‘ ’ characters  Comments: closing tag attribute attribute valuedata open tag element name

22 Database Management Systems, R. Ramakrishnan22 XML text XML has only one “basic” type -- text. It is bounded by tags, e.g. The Big Sleep 1935 --- 1935 is still text XML text is called PCDATA (for parsed character data). It uses a 16-bit encoding.

23 Database Management Systems, R. Ramakrishnan23 XML – Nesting & Hierarchy  Xml tags can be nested in a tree hierarchy  Xml documents can have only one root tag  Between an opening and closing tag you can insert: 1. Data 2. More Elements 3. A combination of data and elements Some Text More

24 Database Management Systems, R. Ramakrishnan24 Representing relational DBs: Two ways projects: title budget managedBy employees: name ssn age

25 Database Management Systems, R. Ramakrishnan25 Project and Employee relations in XML Pattern recognition 10000 Joe Joe 344556 34 Sandra 2234 35 Auto guided vehicle 70000 Sandra : Projects and employees are intermixed

26 Database Management Systems, R. Ramakrishnan26 Pattern recognition 10000 Joe Auto guided vehicles 70000 Sandra : Project and Employee relations in XML (cont’d) Joe 344556 34 Sandra 2234 35 : Employees follows projects

27 Database Management Systems, R. Ramakrishnan27 More XML: Oids and References Jane Mary John oids and references in XML are just syntax

28 Database Management Systems, R. Ramakrishnan28 XML Data Model (Graph)

29 Database Management Systems, R. Ramakrishnan29 Document Type Descriptors  Sort of like a schema but not really.  Inherited from SGML DTD standard  BNF grammar establishing constraints on element structure and content  Definitions of entities

30 Database Management Systems, R. Ramakrishnan30 DTD – An Example --------------------------------------------------------------------------------

31 Database Management Systems, R. Ramakrishnan31 DTD - !ELEMENT  !ELEMENT declares an element name, and what children elements it should have  Content types: Other elements #PCDATA (parsed character data) EMPTY (no content) ANY (no checking inside this structure) A regular expression NameChildren

32 Database Management Systems, R. Ramakrishnan32 DTD - !ELEMENT (Contd.)  A regular expression has the following structure: exp 1, exp 2, exp 3, …, exp k : A list of regular expressions exp*: An optional expression with zero or more occurrences exp+: An optional expression with one or more occurrences exp 1 | exp 2 | … | exp k : A disjunction of expressions

33 Database Management Systems, R. Ramakrishnan33 DTD - !ATTLIST <!ATTLIST Orange location CDATA #REQUIRED color ‘orange’>  !ATTLIST defines a list of attributes for an element  Attributes can be of different types, can be required or not required, and they can have default values. ElementAttributeTypeFlag

34 Database Management Systems, R. Ramakrishnan34 DTD – Well-Formed and Valid -------------------------------------------------------------------------------- Well-Formed and Valid Not Well-Formed Well-Formed but Invalid Home

35 Database Management Systems, R. Ramakrishnan35 Example: An Address Book MacNiel, John Dr. John MacNiel 1234 Huron Street Rome, OH 98765 (321) 786 2543 jm@abc.com Exactly one name At most one greeting As many address lines as needed (in order) Mixed telephones and faxes As many as needed

36 Database Management Systems, R. Ramakrishnan36 Specifying the structure  name to specify a name element  greet? to specify an optional (0 or 1) greet elements  name,greet? to specify a name followed by an optional greet

37 Database Management Systems, R. Ramakrishnan37 Specifying the structure (cont)  addr* to specify 0 or more address lines  tel | fax a tel or a fax element  (tel | fax)* 0 or more repeats of tel or fax  email* 0 or more email elements

38 Database Management Systems, R. Ramakrishnan38 A DTD for the address book <!DOCTYPE addressbook [ <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)> ]>

39 Database Management Systems, R. Ramakrishnan39 DTD for the example relational DB <!DOCTYPE db [... ]>

40 Database Management Systems, R. Ramakrishnan40 Summary of XML regular expressions  Each element name is a tag.  Its components are the tags that appear nested within, in the order specified.  AThe tag A occurs  e1,e2The expression e1 followed by e2  e*0 or more occurrences of e  e?Optional -- 0 or 1 occurrences  e+1 or more occurrences  e1 | e2either e1 or e2  (e)grouping

41 Database Management Systems, R. Ramakrishnan41 XML Namespaces  Namespaces are a simple and straightforward way to distinguish names used in XML documents, no matter where they come from  The only reason namespaces exist, is to give elements and attributes programmer-friendly names that will be unique across the whole Internet

42 Database Management Systems, R. Ramakrishnan42 Example <h:html xmlns:xdc="http://www.xml.com/books" xmlns:h="http://www.w3.org/HTML/1998/html4"> Book Review XML: A Primer Author Price Pages Date Simon St. Laurent 31.98 352 1998/01

43 Database Management Systems, R. Ramakrishnan43 XML Namespaces  The prefixes are linked to the full names using the attributes on the top element whose names begin xmlns :  The prefixes are just shorthand placeholders for the full names  Those full names are URIs, i.e. Web addresses

44 Database Management Systems, R. Ramakrishnan44 Extensibility in XML  Anyone can invent new tags and attach a meaning to those tags  But if every user creates its own XML definition for describing his data, it is not possible to achieve interoperability  For example, one may prefer to use the tag name “POR”, while another prefers using the tag name “PurchaseOrderReq”  In other words, a tagged document is not very useful without some kind of agreement on the tags among inter-operating applications

45 Database Management Systems, R. Ramakrishnan45 Many Efforts for Standardized Tags…  HL7 for healthcare  RosettaNet for supply chain integration in Information Technology and Electronic Components domain  MPEG7 in multimedia applications  ebXML for eBusiness  Common Business Library (CBL) for electronic catalogs, purchase orders, etc.  …

46 Database Management Systems, R. Ramakrishnan46 XML Parsers  A parser takes an XML document and makes its structure and content available to an application through an API  There are two main Application Programming Interfaces (APIs) for writing parsers: Document Object Model (DOM) and Simple API for XML (SAX)  Today, many parsers are both DOM and SAX compliant

47 Database Management Systems, R. Ramakrishnan47 XML DOM Parser Application Code Initialize Parser In memory DOM: Perform Processing XML Parser XML Document Begin parsing Parsing complete A parser validates and makes the data contained in an XML document available to the application

48 Database Management Systems, R. Ramakrishnan48 XSLT Processor XML Document XSL Style Sheet 2 Output from Style Sheet 1 Output from Style Sheet 2 XSL Style Sheet 1 Parser XSLT Processor Converts an XML document to another form An XSL style sheet is a set of transformation instructions for converting a source XML document to a target document

49 Database Management Systems, R. Ramakrishnan49 table.xsl bar.xsl art.xsl

50 Database Management Systems, R. Ramakrishnan50 XML Querying Path Expressions :  Bib.paper  Bib.book.publisher  Bib.paper.author.lastname Given an OEM instance, the value of a path expression p is a set of objects

51 Database Management Systems, R. Ramakrishnan51 Path Expressions Examples: DB = &o1 &o12&o24&o29 &o43 &o70&o71 &96 &243 &206 &25 “Serge” “Abiteboul” 1997 “Victor” “Vianu” 122133 paper book paper references author title year http author title publisher author title page firstname lastname firstnamelastnamefirst last Bib &o44&o45&o46 &o47&o48 &o49 &o50 &o51 &o52 Bib.paper={&o12,&o29} Bib.book.publisher={&o51} Bib.paper.author.lastname={&o71,&206} Bib.paper={&o12,&o29} Bib.book.publisher={&o51} Bib.paper.author.lastname={&o71,&206}

52 Database Management Systems, R. Ramakrishnan52 XQuery Emerging standard for querying XML documents. Basic form: FOR WHERE RETURN ;  Sets of elements described by paths, consisting of: 1.URL, if necessary. 2.Element names forming a path in the semistructured data graph, e.g., //BAR/NAME = “start at any BAR node and go to a NAME child.” 3.Ending condition of the form [ ]

53 Database Management Systems, R. Ramakrishnan53 XQuery Overview:  FOR-LET-WHERE-ORDERBY-RETURN = FLWOR FOR/LET Clauses WHERE Clause ORDERBY/RETURN Clause List of tuples Instance of Xquery data model

54 Database Management Systems, R. Ramakrishnan54 XQuery  FOR $x in expr -- binds $x to each value in the list expr  LET $x = expr -- binds $x to the entire list expr Useful for common subexpressions and for aggregations

55 Database Management Systems, R. Ramakrishnan55 FOR v.s. LET FOR $x IN document("bib.xml") /bib/book RETURN $x FOR $x IN document("bib.xml") /bib/book RETURN $x Returns:... LET $x IN document("bib.xml") /bib/book RETURN $x LET $x IN document("bib.xml") /bib/book RETURN $x Returns:...

56 Database Management Systems, R. Ramakrishnan56 XQuery Find all book titles published after 1995: FOR $x IN document("bib.xml") /bib/book WHERE $x/year > 1995 RETURN $x/title FOR $x IN document("bib.xml") /bib/book WHERE $x/year > 1995 RETURN $x/title Result: abc def ghi

57 Database Management Systems, R. Ramakrishnan57 XQuery For each author of a book by Morgan Kaufmann, list all books s/he published: FOR $a IN distinct( document("bib.xml") /bib/book[publisher=“Morgan Kaufmann”]/author) RETURN $a, FOR $t IN /bib/book[author=$a]/title RETURN $t FOR $a IN distinct( document("bib.xml") /bib/book[publisher=“Morgan Kaufmann”]/author) RETURN $a, FOR $t IN /bib/book[author=$a]/title RETURN $t distinct = a function that eliminates duplicates

58 Database Management Systems, R. Ramakrishnan58 XQuery Result: Jones abc def Smith ghi

59 Database Management Systems, R. Ramakrishnan59 XQuery count = a (aggregate) function that returns the number of elms FOR $p IN distinct(document("bib.xml")//publisher) LET $b := document("bib.xml")/book[publisher = $p] WHERE count($b) > 100 RETURN $p FOR $p IN distinct(document("bib.xml")//publisher) LET $b := document("bib.xml")/book[publisher = $p] WHERE count($b) > 100 RETURN $p

60 Database Management Systems, R. Ramakrishnan60 XQuery Find books whose price is larger than average: LET $a=avg( document("bib.xml") /bib/book/price) FOR $b in document("bib.xml") /bib/book WHERE $b/price > $a RETURN $b LET $a=avg( document("bib.xml") /bib/book/price) FOR $b in document("bib.xml") /bib/book WHERE $b/price > $a RETURN $b

61 Database Management Systems, R. Ramakrishnan61 Examples for XQuery queries  FOR $x IN doc(www.company.com/info.xml) //employee [employeeSalary gt 70000]/employeeName RETURN $x/firstName, $x/lastName  FOR $x IN doc(www.company.com/info.xml)/company/employee WHERE $x/employeeSalary gt 70000 RETURN $x/employeeName/firstName, $x/employeeName/lastName  FOR $x IN doc(www.company.com/info.xml)/company /project [projectNumber = 5]/projectWorker, $y IN doc(www.company.com/info.xml)/company/employee WHERE $x/hours gt 20.0 AND $y.ssn = $x.ssn RETURN $x/EmployeeName/firstName, $y/employeeName/lastName, $x/hours

62 Database Management Systems, R. Ramakrishnan62 transcript.xml … … cont’d … …

63 Database Management Systems, R. Ramakrishnan63 transcript.xml (cont’d)

64 Database Management Systems, R. Ramakrishnan64 XQuery  Example: FOR $t IN document(“http://xyz.edu/transcript.xml”)/Transcript WHERE $t/CrsTaken/@CrsCode = “MAT123” RETURN $t/Student  Result:

65 Database Management Systems, R. Ramakrishnan65 XQuery (cont’d)  Previous query doesn’t produce a well-formed XML document; the following does: { FOR $t IN document(“transcript.xml”)/Transcript WHERE $t/CrsTaken/@CrsCode = “MAT123” RETURN $t/Student } Transcript Student StudentList  FOR binds $t to Transcript elements one by one, filters using WHERE, then places Student-children as children of StudentList using RETURN FLWR inside XML

66 Database Management Systems, R. Ramakrishnan66 Document Restructuring with XQuery Transcript  Reconstruct lists of students taking each class using the Transcript records : FOR $c IN distinct(document(“transcript.xml”)/CrsTaken) RETURN { FOR $t IN document(“transcript.xml”)/Transcript WHERE $t/CrsTaken/@CrsCode = $c/CrsCode AND $t/CrsTaken/@Semester = $c/@Semester RETURN $t/Student ORDERBY ($t/Student/@StudId) } ORDERBY ($c/@CrsCode) FLWOR inside RETURN

67 Database Management Systems, R. Ramakrishnan67 Document Restructuring (cont’d)  Output elements have the form :  Problem : the above element will be output twice – once when $c is bound to and once when it is bound to Note : grades are different – distinct( ) won’t eliminate transcript records that refer to same class! John Doe’s Bart Simpson’s

68 Database Management Systems, R. Ramakrishnan68 Document Restructuring (cont’d)  Solution : instead of FOR $c IN distinct(document(“transcript.xml”)/CrsTaken) use FOR $c IN document(“classes.xml”)//Class where classes.xml lists course offerings (course code/semester) explicitly (no need to extract them from transcript records). Then $c is bound to each class exactly once, so each class roster will be output exactly once Document on next slide

69 Database Management Systems, R. Ramakrishnan69 http://xyz.edu/classes.xml SE Adrian Jones Circuits David Jones Databases Mary Doe TP John Smyth Algebra Ann White

70 Database Management Systems, R. Ramakrishnan70 XQuery Semantics  So far the discussion was informal  XQuery semantics defines what the expected result of a query is  Defined analogously to the semantics of SQL

71 Database Management Systems, R. Ramakrishnan71 XQuery Semantics (cont’d)  Step 1 : Produce a list of bindings for variables The FOR clause binds each variable to an ordered list of nodes specified by an XQuery expression. The expression can be: An XQuery query A function that returns a list of nodes End result of a FOR clause: Ordered list of tuples of document nodes Each tuple is a binding for the variables in the FOR clause

72 Database Management Systems, R. Ramakrishnan72 XQuery Semantics (cont’d) Example (bindings): Let FOR declare $A and $B Bind $A to document nodes {v,w}; $B to {x,y,z} Then FOR clause produces the following list of bindings for $A and $B: $A/v, $B/x $A/v, $B/y $A/v, $B/z $A/w, $B/x $A/w, $B/y $A/w, $B/z

73 Database Management Systems, R. Ramakrishnan73 XQuery Semantics (cont’d)  Step 2 : filter the bindings via the WHERE clause Use each tuple-binding to substitute its components for variables; retain those bindings that make WHERE true Example: WHERE $A/CrsTaken/@CrsCode = $B/Class/@CrsCode Binding: $A/w, where w = $B/x, where x = Then w/CrsTaken/@CrsCode = x/Class/@CrsCode, so the WHERE condition is satisfied & binding retained

74 Database Management Systems, R. Ramakrishnan74 XQuery Semantics (cont’d)  Step 3 : Construct result For each retained tuple of bindings, instantiate the RETURN clause This creates a fragment of the output document Do this for each retained tuple of bindings in sequence


Download ppt "Database Management Systems, R. Ramakrishnan1 Introduction to Semistructured Data and XML Chapter 27."

Similar presentations


Ads by Google