2010.11.16 - SLIDE 1 - IS 257 – Fall 2010
New Generation Database Systems: IR Systems and the Grid/Cloud
University of California, Berkeley School of Information
IS 257: Database Management

SLIDE 2: Lecture Outline
XML and DBMS
  – Cheshire II as XML Database
The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS

SLIDE 4: Standards: XML/SQL
That table can be mapped to:
000020 John Smith 1955-08-21 52300.00
… etc. …
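The markup for this mapping did not survive in the transcript, so only the row values remain. As a rough illustration of the idea, the following Python sketch maps one relational row to an XML fragment. The table name `EMPLOYEE` and all column names are assumptions made for this example; they are not taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical table and column names; the slide's own tag names are unknown.
COLUMNS = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]
ROWS = [("000020", "John", "Smith", "1955-08-21", "52300.00")]


def table_to_xml(table_name, columns, rows):
    """Map each row to a <row> element with one child element per column."""
    table = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(table, "row")
        for col, value in zip(columns, row):
            ET.SubElement(row_el, col).text = value
    return table


xml_text = ET.tostring(table_to_xml("EMPLOYEE", COLUMNS, ROWS), encoding="unicode")
print(xml_text)
```

Each column becomes an element and each row becomes a repeating container, which is one common way to render a table as XML.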

SLIDE 5: XML to Relational Database Mapping
The following slides are adapted from: Bhavin Kansara

SLIDE 6: Introduction
XML/relational mapping means transforming data between the XML and relational data models.
XML documents can be transformed to relational data models, or vice versa.
The mapping method is the way the mapping is done.
Slide from Bhavin Kansara

SLIDE 7: DTD graph
Slide from Bhavin Kansara

SLIDE 8: Inlined DTD graph
Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a normal edge.
Slide from Bhavin Kansara
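The inlinability rule can be checked mechanically. The sketch below is only an illustration: the edge list is a made-up DTD graph, and the edge kinds `"normal"` and `"star"` (for a repeated child) are naming assumptions for this example.

```python
from collections import defaultdict

# Hypothetical DTD graph as (parent, child, edge_kind) triples.
EDGES = [
    ("book", "title", "normal"),
    ("article", "title", "normal"),  # title has two incoming edges
    ("book", "author", "star"),      # author is reached by a non-normal edge
    ("author", "name", "normal"),
    ("book", "year", "normal"),
]


def inlinable_nodes(edges):
    """Return nodes with exactly one incoming edge, where that edge is normal."""
    incoming = defaultdict(list)
    for _parent, child, kind in edges:
        incoming[child].append(kind)
    return {node for node, kinds in incoming.items() if kinds == ["normal"]}


print(sorted(inlinable_nodes(EDGES)))
```

Here `name` and `year` qualify. `title` fails because it has two incoming edges, and `author` fails because its single incoming edge is not a normal edge.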

SLIDE 9: Inlined DTD graph
Slide from Bhavin Kansara

SLIDE 10: Generated Database Schema
Slide from Bhavin Kansara

SLIDE 11: Data Mapping
The XML file is used to insert data into the generated database schema.
A parser is used to fetch data from the XML file.
Slide from Bhavin Kansara
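A minimal sketch of this data-mapping step, assuming a tiny hand-written schema rather than one generated from a real DTD. The element names, table names, and sample document are all hypothetical; single-valued children are stored as columns and the repeated `author` element gets its own table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document.
XML_DOC = """
<library>
  <book><title>Information Systems</title><year>1983</year>
    <author><name>Burch</name></author>
    <author><name>Strater</name></author>
  </book>
</library>
""".strip()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, year TEXT);
    CREATE TABLE author (id INTEGER PRIMARY KEY,
                         book_id INTEGER REFERENCES book(id),
                         name TEXT);
""")


def load(xml_text, conn):
    """Parse the XML and insert one row per book and one row per author."""
    root = ET.fromstring(xml_text)
    for book in root.iter("book"):
        cur = conn.execute(
            "INSERT INTO book (title, year) VALUES (?, ?)",
            (book.findtext("title"), book.findtext("year")),
        )
        for author in book.findall("author"):
            conn.execute(
                "INSERT INTO author (book_id, name) VALUES (?, ?)",
                (cur.lastrowid, author.findtext("name")),
            )
    conn.commit()


load(XML_DOC, conn)
rows = conn.execute(
    "SELECT b.title, a.name FROM book b JOIN author a ON a.book_id = b.id "
    "ORDER BY a.id"
).fetchall()
print(rows)
```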

SLIDE 12: Summary
Simplify DTD
Create DTD graph from simplified DTD
Create inlined DTD graph from DTD graph
Use inlined DTD graph to generate database schema
Insert values from XML file into generated tables
Slide from Bhavin Kansara

SLIDE 13: Issues
So, we can convert the XML to a relational database, but can we then export it as an XML document?
  – This is equally challenging
  – But it MOSTLY involves just re-joining the tables
How do you store and put back the wrapping tags for sets of subelements?
Since the decomposition of the DTD was approximate, the output MAY not be identical to the input
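To make the export problem concrete, here is a small sketch under assumed table and element names (none of them come from the slides). The rows are re-joined into XML, but the wrapping `authors` element is not stored in any table, so the exporter has to be told about it separately.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tables standing in for a schema generated from a DTD.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (id INTEGER PRIMARY KEY, book_id INTEGER, name TEXT);
    INSERT INTO book VALUES (1, 'Information Systems');
    INSERT INTO author VALUES (1, 1, 'Burch');
    INSERT INTO author VALUES (2, 1, 'Strater');
""")


def export_books(conn, wrapper="authors"):
    """Rebuild XML by re-joining book and author rows."""
    root = ET.Element("library")
    books = conn.execute("SELECT id, title FROM book ORDER BY id").fetchall()
    for book_id, title in books:
        book_el = ET.SubElement(root, "book")
        ET.SubElement(book_el, "title").text = title
        # The wrapper tag exists only in the exporter, not in the database.
        group = ET.SubElement(book_el, wrapper)
        names = conn.execute(
            "SELECT name FROM author WHERE book_id = ? ORDER BY id", (book_id,)
        ).fetchall()
        for (name,) in names:
            ET.SubElement(group, "author").text = name
    return ET.tostring(root, encoding="unicode")


xml_out = export_books(conn)
print(xml_out)
```

If the original document used a different wrapper name, or none at all, the exported XML would differ from the input even though no data was lost.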

SLIDE 14: Anatomy of a Native XML database
The next set of slides (available on the class web site) comes from George Feinberg of Sleepycat Software
  – Sleepycat is now part of Oracle

SLIDE 15: Further comments on NXD
Native XML databases are most often used for storing “document-centric” XML documents
  – I.e., the unit of retrieval would typically be the entire document and not a particular node or subelement
This supports query languages like XQuery
  – Able to ask for “all documents where the third chapter contains a page that has a boldfaced word”
  – Very difficult to do that kind of query in SQL
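The flavor of such a structural query can be sketched in Python using the limited XPath subset in the standard library's `xml.etree.ElementTree`. This is not XQuery and not a native XML database; the element names `chapter`, `page`, and `b` and the two sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document collection keyed by document id.
DOCS = {
    "doc1": "<book><chapter/><chapter/>"
            "<chapter><page>plain <b>bold</b> text</page></chapter></book>",
    "doc2": "<book><chapter><page><b>bold</b></page></chapter><chapter/>"
            "<chapter><page>plain</page></chapter></book>",
}


def third_chapter_has_bold(xml_text):
    """True if the third chapter contains a page with a <b> descendant."""
    root = ET.fromstring(xml_text)
    third = root.find("chapter[3]")  # positional predicate, 1-based
    if third is None:
        return False
    return any(page.find(".//b") is not None for page in third.iter("page"))


# The unit of retrieval is the whole document, not the matching node.
matches = [doc_id for doc_id, text in DOCS.items() if third_chapter_has_bold(text)]
print(matches)
```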

SLIDE 16: XML-Based IR - Cheshire II
I thought I would take a little time to talk about how the Cheshire system (which I have been working on for nearly 20 years) uses XML, since it has some similarities to (and many differences from) XML database systems
Cheshire II (and Cheshire 3) are document-centric and involve parsing the XML for the purposes of indexing (and sometimes for retrieval of partial documents)

SLIDE 17: Cheshire II SGML/XML Support
Underlying native format for all data is SGML or XML
The DTD defines the file format for each file
Full SGML/XML parsing
SGML/XML Format Configuration Files define the database
USMARC DTD and MARC to SGML conversion (and back again)
Access to full-text via special SGML/XML tags

SLIDE 18: SGML/XML Support
Example XML record for a DL document
ELIB-v1.0
756
June 12, 1996
June 1996
Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada
University of California
report
USDA Forest Service
Neil H. Berg
Ken B. Roby
Bruce J. McGurk
SNEP Vol 3
40
/elib/data/docs/0700/756/HYPEROCR/hyperocr.html
/elib/data/docs/0700/756/OCR-ASCII-NOZONE

SLIDE 19: SGML Support
Example SGML/MARC Record
00722nam  2200229   4500
001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485
CUBGGLAD1282B
19940414143202.0
830810 1983 nyu eng u
82019962
(CU)ocm08866667
(CU)GLAD1282
Burch, John G.
Information systems : theory and practice / John G. Burch, Jr., Felix R. Strater, Gary Grudnitski
3rd ed
New York : J. Wiley, 1983
xvi, 632 p. : ill. ; 24 cm
Includes bibliographical references and index
Management information systems....
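The long digit string in this record is the MARC directory: a run of 12-character entries, each giving a 3-digit field tag, a 4-digit field length, and a 5-digit starting offset. The sketch below decodes it, with the stray spaces from text extraction removed. It only illustrates the standard MARC (ISO 2709) directory layout and is not Cheshire's conversion code.

```python
# Values read from the leader of the record above.
RECORD_LENGTH = 722   # leader positions 00-04 ("00722")
BASE_ADDRESS = 229    # leader positions 12-16 ("00229")

# Directory from the record above, one 12-character entry per line.
DIRECTORY = (
    "001001400000" "005001700014" "008004100031" "010001400072"
    "035002000086" "035001700106" "100001900123" "245010500142"
    "250001100247" "260003200258" "300003300290" "504005000323"
    "650003600373" "700002200409" "700002200431" "950003200453"
    "998000700485"
)


def parse_directory(directory):
    """Split a MARC directory into (tag, field_length, start_offset) tuples."""
    entries = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        entries.append((entry[:3], int(entry[3:7]), int(entry[7:12])))
    return entries


entries = parse_directory(DIRECTORY)
for tag, length, start in entries:
    print(tag, length, start)

# Each field should begin exactly where the previous one ends.
contiguous = all(
    prev[2] + prev[1] == cur[2] for prev, cur in zip(entries, entries[1:])
)
# Base address + data fields + one record terminator byte.
computed_length = BASE_ADDRESS + entries[-1][2] + entries[-1][1] + 1
print(contiguous, computed_length)
```

The offsets are relative to the base address, so field 245 (the title statement) starts 142 bytes into the data area and is 105 bytes long.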

SLIDE 20: SGML Support
Mini-TREC document…
FT931-3566
_AN-DCPCCAA3FT
930316
FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda
By ROBERT GRAHAM
OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties...
…

SLIDE 21
…
Companies:-
  Ente Nazionale Idrocarburi.
  Ente Nazionale per L'Energia Electtrica.
  Ente Partecipazioni E Finanziamento Industria Manifatturiera.
  IRI Istituto per La Ricostruzione Industriale.
Countries:-
  ITZ Italy, EC.
Industries:-
  P9222 Legal Counsel and Prosecution.
  P91 Executive, Legislative and General Government.
  P13 Oil and Gas Extraction.
  P9631 Regulation, Administration of Utilities.
  P6719 Holding Companies, NEC.
Types:-
…

SLIDE 22
…
  CMMT Comment & Analysis.
  GOVT Legal issues.
The Financial Times
London Page 4

SLIDE 23: SGML/XML Support
Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

SLIDE 24: Cheshire Configuration Files
/projects/is240/GroupX/indexes
/projects/is240/GroupX
trec
/projects/is240/ft
/projects/is240/ft.CONT
/projects/is240/TREC.FT.DTD
ft.assoc
cheshire_index/TESTDATA.history
…

SLIDE 25
cheshire_index/trec.docno.index
docno
12 1 12 2 12 6
DOCNO
…

SLIDE 26
cheshire_index/trec.topic.index
topic
29 3 6 29 102 3 6
…
cheshire_index/topicstoplist
HEADLINE
DATELINE
BYLINE
TEXT

SLIDE 27: Cluster Definitions
classcluster
FLD950 ^a
/usr3/cheshire2/data2/clasclusstoplist
FLD245 ^[ab]
FLD440 ^a
FLD490 ^a
FLD830 ^a
FLD740 ^a
titles
FLD6.. ^[abcdxyz]
subjects
5
subjsum

SLIDE 28: Component Definitions
TESTDATA/COMPONENT_DB1
NONE
mainenty
titles
Fld300
TESTDATA/comp1index1.author
…

SLIDE 29: Result Formatting (Display)
KEEP_ENTITIES
DOCNO
28
#DOCID#
5

SLIDE 30: Indexing
Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching.
  – SGML may include address of full-text for indexing
New indexes can be easily added, or old ones deleted
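As a rough picture of what postings with term frequencies look like, the sketch below builds a small in-memory keyword index. It is not Cheshire's implementation: there is no stemming and no Berkeley DB storage, and the stoplist and record ids are assumptions. The record text is borrowed from the example records earlier in the deck.

```python
import re
from collections import Counter, defaultdict

# Illustrative stoplist; a real one would be read from a stoplist file.
STOPLIST = {"the", "of", "to", "per", "a", "in", "on"}

# Hypothetical record ids mapped to text taken from the example records.
RECORDS = {
    1: "Magistrates hold key to unlocking Tangentopoli",
    2: "Applicability of Available Methodologies to the Sierra Nevada",
    3: "Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica.",
}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stoplist words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPLIST]


def build_index(records):
    """Return {term: {record_id: term_frequency}}."""
    postings = defaultdict(dict)
    for rec_id, text in records.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][rec_id] = tf
    return postings


index = build_index(RECORDS)
print(index["ente"])    # term frequency per record
print(index["sierra"])
```

Keeping the term frequency in each posting is what lets a ranking function weight a record where a term occurs twice above one where it occurs once.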

SLIDE 31: Database Storage
All data stored as SGML/XML flat text files or in a BerkeleyDB database
File format is defined through SGML/XML DTD (also flat text file) or XML Schema
“Associator” files provide indexed direct access to each record in SGML/XML files.
  – Contain offset and record length for each “record”
  – Associators can be built to index any conformant document in a directory sub-tree
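The associator idea, an (offset, length) pair per record, can be sketched in a few lines. This is an illustration only: the `<DOC>` record tags, the sample data, and the in-memory list standing in for the associator file are all assumptions, not Cheshire's on-disk format.

```python
import io

# Hypothetical flat file holding two records.
DATA = b"<DOC>first record</DOC>\n<DOC>second record</DOC>\n"


def build_associator(data, start_tag=b"<DOC>", end_tag=b"</DOC>"):
    """Scan the data once and return a list of (offset, length) per record."""
    entries = []
    pos = 0
    while True:
        start = data.find(start_tag, pos)
        if start == -1:
            break
        end = data.find(end_tag, start)
        if end == -1:
            break
        end += len(end_tag)
        entries.append((start, end - start))
        pos = end
    return entries


def fetch(stream, associator, record_number):
    """Seek directly to a record using its associator entry."""
    offset, length = associator[record_number]
    stream.seek(offset)
    return stream.read(length)


associator = build_associator(DATA)
second = fetch(io.BytesIO(DATA), associator, 1)
print(associator, second)
```

Once the associator exists, fetching record N costs one seek and one read, regardless of how large the flat file is.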

SLIDE 32: Database Storage
[Figure: storage components. Labels: Config File, DTD File, SGML/XML File, Associator File (x2), Page Data File, History File, Cluster File, Postings File, Index File (x3), Prox data File, Remote RDBMS]

SLIDE 33: Client/Server Architecture
Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
Client Supports:
  – Programmable (Tcl/Tk – Python in C3) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting
Combined Client/Server CGI scripting via WebCheshire

SLIDE 34: Z39.50 Overview
[Figure: Z39.50 overview. Labels: UI, Map Query, Internet, Map Results, Search Engine]

SLIDE 35: Lecture Outline
XML and DBMS
The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS

SLIDE 36: Grid-based Digital Libraries
So what’s this Grid thing anyhow?
Data Grids and Distributed Storage
Grid-Based IR
Grid-Based Digital Libraries
Grid vs “Cloud”
This lecture borrows heavily from presentations by Ian Foster (Argonne National Laboratory & University of Chicago), Reagan Moore and others from San Diego Supercomputer Center

SLIDE 37: The Grid: On-Demand Access to Electricity
[Figure: quality and economies of scale plotted against time]
Source: Ian Foster

SLIDE 38: By Analogy, A Computing Grid
Decouples production and consumption
  – Enable on-demand access
  – Achieve economies of scale
  – Enhance consumer flexibility
  – Enable new devices
On a variety of scales
  – Department
  – Campus
  – Enterprise
  – Internet
Source: Ian Foster

SLIDE 39: What is the Grid?
“The short answer is that, whereas the Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet. The Grid goes well beyond simple communication between computers, and aims ultimately to turn the global network of computers into one vast computational resource.”
Source: The Global Grid Forum

SLIDE 40: Not Exactly a New Idea …
“The time-sharing computer system can unite a group of investigators …. one can conceive of such a facility as an … intellectual public utility.”
  – Fernando Corbato and Robert Fano, 1966
“We will perhaps see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.”
  – Len Kleinrock, 1967
Source: Ian Foster

SLIDE 41: But, Things are Different Now
Networks are far faster (and cheaper)
  – Faster than computer backplanes
“Computing” is very different than pre-Net
  – Our “computers” have already disintegrated
  – E-commerce increases size of demand peaks
  – Entirely new applications & social structures
We’ve learned a few things about software
Source: Ian Foster

SLIDE 42: Computing isn’t Really Like Electricity
I import electricity but must export data
“Computing” is not interchangeable but highly heterogeneous: data, sensors, services, …
This complicates things; but also means that the sum can be greater than the parts
  – Real opportunity: Construct new capabilities dynamically from distributed services
Raises three fundamental questions
  – Can I really achieve economies of scale?
  – Can I achieve QoS across distributed services?
  – Can I identify apps that exploit synergies?
Source: Ian Foster

SLIDE 43: Why the Grid? (1) Revolution in Science
Pre-Internet
  – Theorize &/or experiment, alone or in small teams; publish paper
Post-Internet
  – Construct and mine large databases of observational or simulation data
  – Develop simulations & analyses
  – Access specialized devices remotely
  – Exchange information within distributed multidisciplinary teams
Source: Ian Foster

SLIDE 44: Why the Grid? (2) Revolution in Business
Pre-Internet
  – Central data processing facility
Post-Internet
  – Enterprise computing is highly distributed, heterogeneous, inter-enterprise (B2B)
  – Business processes increasingly computing- & data-rich
  – Outsourcing becomes feasible => service providers of various sorts
Source: Ian Foster

SLIDE 45: The Information Grid
Imagine a web of data
Machine Readable
  – Search, Aggregate, Transform, Report On, Mine Data – using more computers, and fewer humans
Scalable
  – Machines are cheap – can buy 50 machines with 100 GB of memory and 100 TB disk for under $100K, and dropping
  – Network is now faster than disk
Flexible
  – Move data around without breaking the apps
Source: S. Banerjee, O. Alonso, M. Drake - ORACLE

SLIDE 46: The Foundations are Being Laid
[Figure: network map. Legend: Tier0/1 facility, Tier2 facility, Tier3 facility, 10 Gbps link, 2.5 Gbps link, 622 Mbps link, Other link. Sites: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Soton, London, Belfast, DL, RAL, Hinxton]

SLIDE 47: Data Grid Problem
“Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data”
Note that this problem:
  – Is common to many areas of science
  – Overlaps strongly with other Grid problems

SLIDE 48: Data Grids for High Energy Physics
[Figure: tiered data grid. Tier labels: Tier 0, Tier 1, Tier 2, Tier 4]
Components: Online System; Offline Processor Farm (~20 TIPS); CERN Computer Centre; FermiLab (~4 TIPS); France Regional Centre; Italy Regional Centre; Germany Regional Centre; Caltech (~1 TIPS); Tier2 Centres (~1 TIPS each); Institutes (~0.25 TIPS); Physics data cache; Physicist workstations
Link rates shown: ~PBytes/sec; ~100 MBytes/sec; ~622 Mbits/sec; ~622 Mbits/sec or Air Freight (deprecated); ~1 MBytes/sec
There is a “bunch crossing” every 25 nsecs.
There are 100 “triggers” per second.
Each triggered event is ~1 MByte in size.
Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.
1 TIPS is approximately 25,000 SpecInt95 equivalents.
Image courtesy Harvey Newman, Caltech
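The figures quoted on this slide fit together with simple arithmetic. The sketch below only restates the slide's own numbers (25 ns bunch crossings, 100 triggers per second, ~1 MByte per event); the variable names are illustrative.

```python
# Figures quoted on the slide.
BUNCH_CROSSING_INTERVAL_NS = 25
TRIGGERS_PER_SECOND = 100
EVENT_SIZE_MBYTES = 1

# One crossing every 25 ns means 40 million crossings per second.
crossings_per_second = 1_000_000_000 // BUNCH_CROSSING_INTERVAL_NS

# Only the triggered events are recorded.
recorded_rate_mbytes_per_sec = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES

# Number of crossings per recorded event.
crossings_per_trigger = crossings_per_second // TRIGGERS_PER_SECOND

print(crossings_per_second)          # 40000000
print(recorded_rate_mbytes_per_sec)  # 100
print(crossings_per_trigger)         # 400000
```

The recorded rate of 100 MBytes/sec matches the ~100 MBytes/sec link shown in the diagram, and it reflects keeping roughly one crossing in every 400,000.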

SLIDE 49: Grids and Open Standards
[Figure: increased functionality and standardization plotted against time]
Custom solutions
Globus Toolkit: de facto standards (GGF: GridFTP, GSI)
X.509, LDAP, FTP, …
Open Grid Services Arch: GGF: OGSI, … (+ OASIS, W3C); multiple implementations, including Globus Toolkit
Web services
App-specific Services

SLIDE 50: The Grid as Enabler of 21st Century Science
Entirely new approaches to enquiry based on
  – Deep analysis of huge quantities of data
  – Interdisciplinary collaboration
  – Large-scale simulation
  – Smart instrumentation
Enabled by an infrastructure that enables access to, and integration of, resources & services without regard for location

SLIDE 51: Not only Science…
The Database world is moving to the Grid for large-scale applications
Oracle 10g is specifically designed to exploit clustered/grid computing using RACs (Real Application Clusters)
An example from the Information/Publishing world…
  – Presentation from Oracle about Thomson Legal’s use of Oracle 10g and RACs
