
1 2006.11.28- SLIDE 1IS 257 – Fall 2006 New Generation Database Systems: XML Databases University of California, Berkeley School of Information IS 257: Database Management

2 2006.11.28- SLIDE 2IS 257 – Fall 2006 Lecture Outline XML and RDBMS Native XML Databases

3 2006.11.28- SLIDE 3IS 257 – Fall 2006 Lecture Outline XML and DBMS Native XML Databases

4 2006.11.28- SLIDE 4IS 257 – Fall 2006 Standards: XML/SQL As part of SQL3, an extension called XML/SQL (standardized as SQL/XML, Part 14 of the SQL standard) is being created to provide a mapping between XML and the DBMS. The (draft) standard is very complex, but the ideas are actually pretty simple. Suppose we have a table called EMPLOYEE that has columns EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY.

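5 2006.11.28- SLIDE 5IS 257 – Fall 2006 Standards: XML/SQL That table can be mapped to:
    <EMPLOYEE>
      <row>
        <EMPNO>000020</EMPNO>
        <FIRSTNAME>John</FIRSTNAME>
        <LASTNAME>Smith</LASTNAME>
        <BIRTHDATE>1955-08-21</BIRTHDATE>
        <SALARY>52300.00</SALARY>
      </row>
      … etc. …
    </EMPLOYEE>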
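
The table-to-XML mapping can be sketched in a few lines of Python. This is only an illustration of the idea, not the SQL/XML standard's full mapping: the column names come from the EMPLOYEE table above, while the `row` wrapper element and the function name `table_to_xml` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Column names come from the EMPLOYEE table described on the previous slide.
COLUMNS = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]


def table_to_xml(table_name, columns, rows):
    """Build one element named after the table, with one child element per row."""
    table = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(table, "row")  # wrapper name "row" is an assumption
        for column, value in zip(columns, values):
            # Each column becomes a child element holding the value as text.
            ET.SubElement(row, column).text = str(value)
    return table


rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]
employee_xml = ET.tostring(table_to_xml("EMPLOYEE", COLUMNS, rows), encoding="unicode")
print(employee_xml)
```

Every value is emitted as element text, so the result stays "flat": one level for the table, one for the row, one for the columns.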

6 2006.11.28- SLIDE 6IS 257 – Fall 2006 Standards: XML/SQL In addition, the standard says that XML Schemas must be generated for each table, and it also allows relations to be managed by nesting records from tables in the XML. Variants of this are incorporated into the latest versions of Oracle. But what if you want to deal with more complex XML schemas (beyond “flat” structures)?

7 2006.11.28- SLIDE 7IS 257 – Fall 2006 The following slides are adapted from: XML to Relational Database Mapping, by Bhavin Kansara Slide from Bhavin Kansara

8 2006.11.28- SLIDE 8IS 257 – Fall 2006 Introduction XML/relational mapping means data transformation between the XML and relational data models. XML documents can be transformed to relational data models or vice versa. The mapping method is the way the mapping is done. Slide from Bhavin Kansara

9 2006.11.28- SLIDE 9IS 257 – Fall 2006 XML XML: Extensible Markup Language. Documents have tags giving extra information about sections of the document –E.g. XML – Introduction. XML has emerged as the standard for representing and exchanging data on the World Wide Web. The growing number of XML documents creates a need to store and query them efficiently. Slide from Bhavin Kansara

10 2006.11.28- SLIDE 10IS 257 – Fall 2006 XML vs. HTML HTML tags describe how to render things on the screen, while XML tags describe what things are. HTML tags are designed for the interaction between humans and computers, while XML tags are designed for the interactions between two computers. Unlike HTML, XML tags tell you what the data means, rather than how to display it. abc xyz def Title of page abc xyz def Slide from Bhavin Kansara

11 2006.11.28- SLIDE 11IS 257 – Fall 2006 XML Technologies Schema Languages: DTDs, XML Schemas. Query Languages: XPath, XQuery, XSLT. Programming APIs: DOM, SAX.
    XQuery example:
      <bib>
      {
        for $b in doc("http://bstore1.example.com/bib.xml")/bib/book
        where $b/publisher = "Addison-Wesley" and $b/@year > 1991
        return <book year="{ $b/@year }"> { $b/title } </book>
      }
      </bib>
    XML example:
      <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>two of our famous Belgian Waffles</description>
        <calories>650</calories>
      </food>
Slide from Bhavin Kansara
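
To make the query on this slide concrete, here is a rough Python equivalent using the limited XPath support in the standard library's `xml.etree.ElementTree`. It is a sketch under assumptions: the in-memory `BIB_XML` document is invented sample data standing in for `bib.xml`, and `titles_after` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml; titles, years and publishers are invented.
BIB_XML = """
<bib>
  <book year="1994"><title>Book A</title><publisher>Addison-Wesley</publisher></book>
  <book year="1990"><title>Book B</title><publisher>Addison-Wesley</publisher></book>
  <book year="2000"><title>Book C</title><publisher>Other Press</publisher></book>
</bib>
"""


def titles_after(bib_root, publisher, year):
    """Return titles of books from `publisher` whose year attribute exceeds `year`."""
    matches = []
    # ElementTree's XPath subset can test the publisher's text, but it has no
    # numeric '>' comparison, so the year test is done in Python.
    for book in bib_root.findall(f"./book[publisher='{publisher}']"):
        if int(book.get("year")) > year:
            matches.append(book.findtext("title"))
    return matches


root = ET.fromstring(BIB_XML)
result = titles_after(root, "Addison-Wesley", 1991)
print(result)
```

The split between the path expression and the Python `if` shows where a full query language such as XQuery does more work than a plain tree API.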

12 2006.11.28- SLIDE 12IS 257 – Fall 2006 DTD ( Document Type Definition ) DTD stands for Document Type Definition. The purpose of a Document Type Definition is to define the legal building blocks of an XML document. It formally defines the relationships between the various elements that form the document. A DTD allows computers to check that each component of a document occurs in a valid place within the document. Slide from Bhavin Kansara

13 2006.11.28- SLIDE 13IS 257 – Fall 2006 DTD ( Document Type Definition ) Slide from Bhavin Kansara

14 2006.11.28- SLIDE 14IS 257 – Fall 2006 XML vs. Relational Database
    Table CUSTOMER:
      Name  Age
      ABC   30
      XYZ   40
    XML document with the same customers: ABC, 30; XYZ, 40
Slide from Bhavin Kansara

15 2006.11.28- SLIDE 15IS 257 – Fall 2006 XML vs. Relational Database Slide from Bhavin Kansara

16 2006.11.28- SLIDE 16IS 257 – Fall 2006 XML vs. Relational Database Slide from Bhavin Kansara

17 2006.11.28- SLIDE 17IS 257 – Fall 2006 XML vs. Relational Database Slide from Bhavin Kansara

18 2006.11.28- SLIDE 18IS 257 – Fall 2006 When XML representation is not beneficial: when downstream processing of the data is relational; when the highest possible performance is required; when any normalized data components have value outside the XML representation, or the data need not be retained in XML form to have value; when the data is naturally tabular. Slide from Bhavin Kansara

19 2006.11.28- SLIDE 19IS 257 – Fall 2006 When XML representation is beneficial: when the schema is volatile; when data is inherently hierarchical in nature; when data represents business objects in which the component parts do not make sense when removed from the context of that business object; when applications have sparse attributes; when low-volume data is highly structured. Slide from Bhavin Kansara

20 2006.11.28- SLIDE 20IS 257 – Fall 2006 XML-to-Relational mapping Schema mapping: a database schema is generated from an XML schema or DTD for the storage of XML documents. Data mapping: an input XML document is shredded into relational tuples, which are inserted into the relational database whose schema was generated in the schema mapping phase. Slide from Bhavin Kansara
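
The two phases can be illustrated with a small sketch using Python's `sqlite3` and `xml.etree.ElementTree`. Everything specific here is an assumption for the example: the `<customers>` document, the `customer` table, and the fact that the schema is written by hand rather than derived from a DTD.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document; element names are illustrative only.
DOC = """
<customers>
  <customer><name>ABC</name><age>30</age></customer>
  <customer><name>XYZ</name><age>40</age></customer>
</customers>
"""

conn = sqlite3.connect(":memory:")

# Schema mapping: written by hand here; a real mapper would derive this
# statement from the DTD or XML Schema.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Data mapping: shred each <customer> element into one relational tuple.
tuples = [
    (c.findtext("name"), int(c.findtext("age")))
    for c in ET.fromstring(DOC).findall("customer")
]
conn.executemany("INSERT INTO customer (name, age) VALUES (?, ?)", tuples)

shredded = conn.execute("SELECT name, age FROM customer ORDER BY id").fetchall()
print(shredded)
```

Once the document is shredded, ordinary SQL applies, which is the main attraction of this approach when downstream processing is relational.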

21 2006.11.28- SLIDE 21IS 257 – Fall 2006 Schema Mapping Slide from Bhavin Kansara

22 2006.11.28- SLIDE 22IS 257 – Fall 2006 Simplifying DTD Slide from Bhavin Kansara

23 2006.11.28- SLIDE 23IS 257 – Fall 2006 DTD graph Slide from Bhavin Kansara

24 2006.11.28- SLIDE 24IS 257 – Fall 2006 Inlined DTD graph Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a normal edge. Slide from Bhavin Kansara
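
The inlinability rule is easy to state as code. The sketch below assumes a DTD graph represented as `(parent, child, edge_kind)` triples, where `"normal"` marks a single-occurrence edge and `"star"` marks an edge for a repeatable child; the graph itself and the `"star"` label are invented for illustration.

```python
from collections import defaultdict

# Hypothetical DTD graph as (parent, child, edge_kind) triples.
EDGES = [
    ("book", "title", "normal"),
    ("book", "author", "star"),
    ("author", "name", "normal"),
    ("article", "author", "star"),
    ("article", "heading", "normal"),
]


def inlinable_nodes(edges):
    """Return nodes with exactly one incoming edge, where that edge is normal."""
    incoming = defaultdict(list)
    for _parent, child, kind in edges:
        incoming[child].append(kind)
    # A node qualifies only if its list of incoming edge kinds is exactly ["normal"].
    return sorted(node for node, kinds in incoming.items() if kinds == ["normal"])


inlinable = inlinable_nodes(EDGES)
print(inlinable)
```

Here `author` is not inlinable because it has two incoming edges (and neither is normal), so it would get its own table, while `title`, `name` and `heading` can be folded into their parents' tables as columns.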

25 2006.11.28- SLIDE 25IS 257 – Fall 2006 Inlined DTD graph Slide from Bhavin Kansara

26 2006.11.28- SLIDE 26IS 257 – Fall 2006 Generated Database Schema Slide from Bhavin Kansara

27 2006.11.28- SLIDE 27IS 257 – Fall 2006 Data Mapping An XML file is used to insert data into the generated database schema. A parser is used to fetch data from the XML file. Slide from Bhavin Kansara

28 2006.11.28- SLIDE 28IS 257 – Fall 2006 Summary 1. Simplify the DTD. 2. Create a DTD graph from the simplified DTD. 3. Create an inlined DTD graph from the DTD graph. 4. Use the inlined DTD graph to generate the database schema. 5. Insert values from the XML file into the generated tables. Slide from Bhavin Kansara

29 2006.11.28- SLIDE 29IS 257 – Fall 2006 Issues So, we can convert the XML to a relational database, but can we then export it as an XML document? –This is equally challenging, but MOSTLY involves just re-joining the tables. How do you store and put back the wrapping tags for sets of subelements? Since the decomposition of the DTD was approximate, the output MAY not be identical to the input.
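
A small sketch shows both halves of the problem: re-joining tables is straightforward, but wrapper tags have to be supplied by the exporter because no table holds them. The `book` and `author` tables, the `<authors>` wrapper and the sample rows are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Hypothetical shredded schema: child rows are keyed to their parent row.
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (book_id INTEGER, position INTEGER, name TEXT);
    INSERT INTO book VALUES (1, 'Book A');
    INSERT INTO author VALUES (1, 1, 'Author One');
    INSERT INTO author VALUES (1, 2, 'Author Two');
""")


def export_books(connection):
    """Re-join book and author rows and rebuild a nested XML document."""
    root = ET.Element("books")
    query = """
        SELECT b.id, b.title, a.name
        FROM book AS b LEFT JOIN author AS a ON a.book_id = b.id
        ORDER BY b.id, a.position
    """
    current_id, authors = None, None
    for book_id, title, name in connection.execute(query):
        if book_id != current_id:
            current_id = book_id
            book = ET.SubElement(root, "book")
            ET.SubElement(book, "title").text = title
            # The <authors> wrapper is not stored in any table; the exporter
            # has to know that it must be put back.
            authors = ET.SubElement(book, "authors")
        if name is not None:
            ET.SubElement(authors, "author").text = name
    return ET.tostring(root, encoding="unicode")


exported = export_books(conn)
print(exported)
```

The `position` column is what preserves document order; without it, the re-joined authors could come back in any order, which is one way the output can differ from the input.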

30 2006.11.28- SLIDE 30IS 257 – Fall 2006 Lecture Outline XML and RDBMS Native XML Databases

31 2006.11.28- SLIDE 31IS 257 – Fall 2006 Native XML Database (NXD) Native XML databases have an XML-based internal model –That is, their fundamental unit of storage is XML. However, native XML databases differ in what they consider the fundamental unit of storage –Document vs. element or segment; and in how that information or its subelements are accessed, indexed and queried –E.g., SQL vs. XQuery or a special query language

32 2006.11.28- SLIDE 32IS 257 – Fall 2006 Database Systems supporting XQuery The following database systems offer XQuery support: –Native XML Databases: Berkeley DB XML, eXist, MarkLogic, Software AG Tamino, Raining Data TigerLogic, Documentum xDb (X-Hive/DB) –Relational Databases (also support SQL): IBM DB2, Microsoft SQL Server, Oracle

33 2006.11.28- SLIDE 33IS 257 – Fall 2006 Anatomy of a Native XML database The next set of slides (available on the class web site) comes from George Feinberg of Sleepycat Software –Sleepycat is now part of Oracle

34 2006.11.28- SLIDE 34IS 257 – Fall 2006 Further comments on NXD Native XML databases are most often used for storing “document-centric” XML documents –I.e. the unit of retrieval would typically be the entire document and not a particular node or subelement. This supports query languages like XQuery –Able to ask for “all documents where the third chapter contains a page that has a boldfaced word” –Very difficult to do that kind of query in SQL
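
The example query is structural and positional, which is why it is awkward in SQL. A path expression captures it directly, as in this sketch over a tiny in-memory collection using Python's `xml.etree.ElementTree`. The element names (`chapter`, `page`, `b`) and the three documents are assumptions for illustration, and a native XML database would evaluate such a path against its indexes rather than parsing every document.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection; the element names (chapter, page, b) are assumptions.
DOCUMENTS = {
    "doc1": "<book><chapter/><chapter/>"
            "<chapter><page>plain <b>bold</b></page></chapter></book>",
    "doc2": "<book><chapter><page><b>bold</b></page></chapter><chapter/>"
            "<chapter><page>plain</page></chapter></book>",
    "doc3": "<book><chapter><page><b>bold</b></page></chapter></book>",
}


def third_chapter_has_bold(xml_text):
    """True if the third <chapter> contains a <page> with a <b> at any depth."""
    root = ET.fromstring(xml_text)
    # chapter[3] selects the third <chapter> child; page//b finds <b> under a page.
    return root.find("./chapter[3]/page//b") is not None


# The unit of retrieval is the whole document, so only document ids are returned.
matching = sorted(
    doc_id for doc_id, text in DOCUMENTS.items() if third_chapter_has_bold(text)
)
print(matching)
```

Only `doc1` qualifies: `doc2` has its bold word in the first chapter, and `doc3` has no third chapter at all.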

35 2006.11.28- SLIDE 35IS 257 – Fall 2006 XML-Based IR - Cheshire II I thought I would take a little time to talk about how the Cheshire system (which I have been working on for nearly 20 years) uses XML, since it has some similarities (and many differences) to XML database systems. Cheshire II (and Cheshire 3) are document-centric and involve parsing the XML for the purposes of indexing (and sometimes for retrieval of partial documents)

36 2006.11.28- SLIDE 36IS 257 – Fall 2006 Cheshire II SGML/XML Support Underlying native format for all data is SGML or XML. The DTD defines the file format for each file. Full SGML/XML parsing. SGML/XML Format Configuration Files define the database. USMARC DTD and MARC to SGML conversion (and back again). Access to full-text via special SGML/XML tags.

37 2006.11.28- SLIDE 37IS 257 – Fall 2006 SGML/XML Support Example XML record for a DL document ELIB-v1.0 756 June 12, 1996 June 1996 Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada University of California report USDA Forest Service Neil H. Berg Ken B. Roby Bruce J. McGurk SNEP Vol 3 40 /elib/data/docs/0700/756/HYPEROCR/hyperocr.html /elib/data/docs/0700/756/OCR-ASCII-NOZONE

38 2006.11.28- SLIDE 38IS 257 – Fall 2006 00722 n a m 2 2 00229 4 5 0 001001400000005001700014008004100031010001400072035002000086035001700106100001900123 2450105001422500011002472600032002583000033002905040050003236500036003737000022004097000022004 31950003200453998000700485 CUBGGLAD1282B 19940414143202.0 830810 1983 nyu eng u 82019962 (CU)ocm08866667 (CU)GLAD1282 Burch, John G. Information systems : theory and practice / John G. Burch, Jr., Felix R. Strater, Gary Grudnitski 3rd ed New York : J. Wiley, 1983 xvi, 632 p. : ill. ; 24 cm Includes bibliographical references and index Management information systems.... SGML Support Example SGML/MARC Record

39 2006.11.28- SLIDE 39IS 257 – Fall 2006 SGML Support Mini-TREC document… FT931-3566 _AN-DCPCCAA3FT 930316 FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda By ROBERT GRAHAM OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties... …

40 2006.11.28- SLIDE 40IS 257 – Fall 2006 … Companies:- Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale. Countries:- ITZ Italy, EC. Industries:- P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC. Types:- …

41 2006.11.28- SLIDE 41IS 257 – Fall 2006 … CMMT Comment & Analysis. GOVT Legal issues. The Financial Times London Page 4

42 2006.11.28- SLIDE 42IS 257 – Fall 2006 SGML/XML Support Configuration files for the Server are also SGML/XML: –They include tags describing all of the data files and indexes for the database. –They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

43 2006.11.28- SLIDE 43IS 257 – Fall 2006 Cheshire Configuration Files /projects/is240/GroupX/indexes /projects/is240/GroupX trec /projects/is240/ft /projects/is240/ft.CONT /projects/is240/TREC.FT.DTD ft.assoc cheshire_index/TESTDATA.history …

44 2006.11.28- SLIDE 44IS 257 – Fall 2006 cheshire_index/trec.docno.index docno 12 1 12 2 12 6 DOCNO …

45 2006.11.28- SLIDE 45IS 257 – Fall 2006 cheshire_index/trec.topic.index topic 29 3 6 29 102 3 6 … cheshire_index/topicstoplist HEADLINE DATELINE BYLINE TEXT

46 2006.11.28- SLIDE 46IS 257 – Fall 2006 Cluster Definitions classcluster FLD950 ^a /usr3/cheshire2/data2/clasclusstoplist FLD245 ^[ab] FLD440 ^a FLD490 ^a FLD830 ^a FLD740 ^a titles FLD6.. ^[abcdxyz] subjects 5 subjsum

47 2006.11.28- SLIDE 47IS 257 – Fall 2006 Component Definitions TESTDATA/COMPONENT_DB1 NONE mainenty titles Fld300 TESTDATA/comp1index1.author …

48 2006.11.28- SLIDE 48IS 257 – Fall 2006 Result Formatting (Display) KEEP_ENTITIES DOCNO 28 #DOCID# 5

49 2006.11.28- SLIDE 49IS 257 – Fall 2006 Indexing Any SGML/XML tagged field or attribute can be indexed: –B-Tree and Hash access via Berkeley DB (Sleepycat) –Stemming, keyword, exact keys and “special keys” –Mapping from any Z39.50 Attribute combination to a specific index –Underlying postings information includes term frequency for probabilistic searching. –SGML may include address of full-text for indexing New indexes can be easily added, or old ones deleted
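
As a rough illustration of keyword indexing on a tagged field with term-frequency postings, the sketch below builds an in-memory index in Python. It is not Cheshire's implementation: the records are invented, there is no stemming or stoplist, and a plain dictionary stands in for the Berkeley DB B-Tree.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical records; only the tag names echo the TREC-style examples.
RECORDS = {
    1: "<DOC><HEADLINE>Corruption scandal</HEADLINE>"
       "<TEXT>Magistrates hold the key</TEXT></DOC>",
    2: "<DOC><HEADLINE>Scandal widens as scandal grows</HEADLINE>"
       "<TEXT>No new arrests</TEXT></DOC>",
}


def build_keyword_index(records, tag):
    """Map each lower-cased word found in `tag` to {record_id: term_frequency}."""
    postings = defaultdict(dict)
    for record_id, xml_text in records.items():
        words = []
        for element in ET.fromstring(xml_text).iter(tag):
            text = "".join(element.itertext()).lower()
            words.extend(re.findall(r"[a-z0-9]+", text))
        # Store how often each term occurs in this record's field.
        for term, frequency in Counter(words).items():
            postings[term][record_id] = frequency
    return dict(postings)


headline_index = build_keyword_index(RECORDS, "HEADLINE")
print(headline_index["scandal"])
```

Keeping the per-record frequency in the postings, rather than just the record id, is what lets a probabilistic ranking function weight record 2 differently from record 1 for the term "scandal".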

50 2006.11.28- SLIDE 50IS 257 – Fall 2006 Database Storage All data is stored as SGML/XML flat text files plus optional linked full-text files. File format is defined through an SGML/XML DTD (also a flat text file). “Associator” files provide indexed direct access to each record in SGML/XML files. –Contain offset and record length for each “record” –Associators can be built to index any conformant document in a directory sub-tree
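
The associator idea, an offset and a length per record, can be sketched as follows. The on-disk format Cheshire actually uses is not shown in these slides, so the record layout, the `</REC>` end tag and both function names are assumptions for illustration.

```python
import io

# Hypothetical data file: three records concatenated in one flat text stream.
DATA = b"<REC><ID>1</ID></REC>\n<REC><ID>22</ID></REC>\n<REC><ID>333</ID></REC>\n"
END_TAG = b"</REC>"


def build_associator(data, end_tag):
    """Return a list of (offset, length) pairs, one per record in `data`."""
    entries, start = [], 0
    while True:
        # Skip whitespace between records.
        while start < len(data) and data[start:start + 1].isspace():
            start += 1
        end = data.find(end_tag, start)
        if end == -1:
            return entries
        end += len(end_tag)
        entries.append((start, end - start))
        start = end


def fetch_record(stream, associator, record_number):
    """Seek straight to a record using its associator entry (0-based)."""
    offset, length = associator[record_number]
    stream.seek(offset)
    return stream.read(length)


associator = build_associator(DATA, END_TAG)
second = fetch_record(io.BytesIO(DATA), associator, 1)
print(associator, second)
```

Because each entry is a fixed pair of numbers, record N can be located without scanning or parsing the records before it, which is the point of keeping the data itself as flat text.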

51 2006.11.28- SLIDE 51IS 257 – Fall 2006 Database Storage [Diagram of the files that make up a database: Config File, DTD File, SGML/XML File, Associator Files, Index Files, Postings File, Prox data File, Page Data File, History File, Cluster File, Remote RDBMS]

52 2006.11.28- SLIDE 52IS 257 – Fall 2006 Client/Server Architecture Server Supports: –Database storage –Indexing –Z39.50 access to local data –Boolean and Probabilistic Searching –Relevance Feedback –External SQL database support. Client Supports: –Programmable (Tcl/Tk – Python soon) Graphical User Interface –Z39.50 access to remote servers –SGML & MARC formatting. Combined Client/Server CGI scripting via WebCheshire.

53 2006.11.28- SLIDE 53IS 257 – Fall 2006 Z39.50 Overview [Diagram: UI, Internet and Search Engine, connected through repeated Map Query and Map Results steps]

