Putting Semi-structured Data to Practice. Alon Levy, University of Washington, Seattle, Washington.



Semi-structured Data
In many applications, data does not have a rigid, predefined schema:
– e.g., structured files, scientific data, XML.
Managing such data requires rethinking the design of components of a DBMS:
– data model, query language, optimizer, storage system.
The emergence of XML data underscores the importance of semi-structured data.

Outline of the Talk Semi-formal definition and examples. Modeling semi-structured data Querying semi-structured data Challenges in practice: –Application: web-site management –The XML challenge –A DBMS challenge: query optimization Current research challenges

Main Characteristics
Schema is not what it used to be:
– not given in advance (often implicit in the data),
– descriptive, not prescriptive,
– partial,
– rapidly evolving,
– may be large (compared to the size of the data).
Types are not what they used to be:
– objects and attributes are not strongly typed,
– objects in the same collection have different representations.

Example: XML
[XML markup lost in extraction; only the element text survives]
Database Systems; Date; Addison-Wesley
Foundation for Object/Relational Databases; Date; Darwen

Example: Data Integration
Mediator: uniform access to multiple data sources.
[Figure: a user queries the mediator, which sits over four sources: RDBMS, OODBMS, structured file, legacy system.]
Each source represents data differently: different data models, different schemas.

Physical versus Logical Structure In some cases, data can be modeled in relational or object-oriented models, but extracting the tuples is hard –extracting data from HTML: [Ashish and Knoblock, 97], [Hammer et al., 97], [Kushmerick and Weld, 97]. Semi-structured data: when the data cannot be modeled naturally or usefully using a standard data model.

Managing Semi-structured Data
– How do we model it? (directed labeled graphs)
– How do we query it? (many proposals, all include regular path expressions)
– Optimize queries? (beginning to understand)
– Store the data? (looking for patterns)
– Integrity constraints, views, updates, …


Modeling Semi-Structured Data
Labeled directed graphs (from OEM [TSIMMIS]): nodes are objects; labels on the arcs are attribute names.
[Figure: example graph with object nodes b01, a1, a2; arc labels author, title, year, LastName, FirstName, url; atomic values “DBMS”, 1997, “Ullman”, “Widom”, “Jeff”.]
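As a rough illustration of the labeled-directed-graph model, the sketch below stores arcs as (label, target) pairs per object. The node names and values come from the slide's figure, but the exact wiring between them is an assumption, since the figure itself did not survive; the `Graph` class is hypothetical.

```python
class Graph:
    """Labeled directed graph: nodes are object ids, arcs carry attribute names."""

    def __init__(self):
        self.edges = {}  # source object id -> list of (label, target)

    def add(self, source, label, target):
        self.edges.setdefault(source, []).append((label, target))

    def get(self, source, label):
        """All targets reached from `source` over arcs named `label`."""
        return [t for (l, t) in self.edges.get(source, []) if l == label]


g = Graph()
# Assumed wiring: b01 is a publication, a1 and a2 are its authors.
g.add("b01", "title", "DBMS")
g.add("b01", "year", 1997)
g.add("b01", "author", "a1")
g.add("b01", "author", "a2")
g.add("a1", "LastName", "Ullman")
g.add("a1", "FirstName", "Jeff")
g.add("a2", "LastName", "Widom")  # no FirstName: objects need not share a shape

print(g.get("b01", "author"))    # the two author objects
print(g.get("a2", "FirstName"))  # empty: the attribute is simply absent
```

Nothing declares which attributes an object must have, so the missing FirstName on a2 is not an error; that is the "schema implicit in the data" point from the earlier slide.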

Querying Semi-structured Data Important features: –ability to navigate the data (regular path expressions), –querying the attribute names (arc variables), –create new structures, –type coercion. Languages: Lorel (Stanford), UnQL (U. Penn), StruQL (AT&T, INRIA, UW).

The StruQL Query Language
A StruQL query is a function from a set of input graphs to an output graph.
A StruQL expression contains two parts: a query component and a restructuring component.
Formally:
INPUT graph names
WHERE conjunction of regular path expression atoms
CREATE name the nodes in the output graph using Skolem functions
LINK specify the links in the resulting graph
OUTPUT resulting-graph name

Example Query: StruQL
WHERE Articles(art), art -> l -> value,
      l in { "Title", "Abstract", "Date", "Text", "Image", "Topimage", "RelatedSite" },
      art -> * -> art1, Articles(art1)
CREATE ArticlePage(art), ArticlePage(art1)
LINK ArticlePage(art) -> l -> value,
     ArticlePage(art) -> "related article" -> ArticlePage(art1)
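The example query can be approximated in ordinary code to show what the CREATE and LINK clauses produce. This is a minimal sketch, not Strudel's implementation: `skolem`, `reachable`, and `article_pages` are hypothetical helpers, the sample data and the "seeAlso" label are made up, and `*` is read as "any path of zero or more arcs", so every article is also related to itself.

```python
ATTRS = {"Title", "Abstract", "Date", "Text", "Image", "Topimage", "RelatedSite"}


def skolem(name, *args):
    """Skolem function: the same name and arguments always denote the same node."""
    return (name,) + args


def reachable(edges, start):
    """Nodes reachable from `start` over any path (the `*` expression)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for _label, target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def article_pages(edges, articles):
    """Build the output graph as a set of (source, label, target) links."""
    out = set()
    for art in articles:
        page = skolem("ArticlePage", art)            # CREATE ArticlePage(art)
        for label, value in edges.get(art, []):
            if label in ATTRS:                       # l in { "Title", ... }
                out.add((page, label, value))        # LINK page -> l -> value
        for art1 in reachable(edges, art):           # art -> * -> art1
            if art1 in articles:
                out.add((page, "related article", skolem("ArticlePage", art1)))
    return out


# Made-up input graph: article a1 points to article a2.
edges = {
    "a1": [("Title", "Semi-structured data"), ("seeAlso", "a2")],
    "a2": [("Title", "XML")],
}
out = article_pages(edges, {"a1", "a2"})
for link in sorted(out, key=repr):
    print(link)
```

Because `skolem("ArticlePage", "a2")` yields the same value wherever it is called, the page for a2 created as a link target and the page created for a2 itself are one node, which is the role Skolem functions play in the CREATE clause.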

StruQL Details
Regular path expressions are constructed by a grammar:
R <- “a” |  | R1.R2 | R1|R2 | R1* | L | _
Atoms in the WHERE clause are of the form X -> R -> Y or C(X).
The LINK clause includes atoms of the form:
LINK f(X) --> “new link” --> g(X) or LINK f(X) --> L --> g(X)
Queries can be nested, inheriting the WHERE clauses of their outer blocks.
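One way to see how an atom X -> R -> Y can be evaluated is a small interpreter over the grammar's constructors. The sketch below covers only a label, the wildcard `_`, concatenation, alternation, and star; arc variables are left out, and the tuple encoding of expressions and the sample graph are assumptions made for illustration.

```python
def follow(expr, edges, nodes):
    """Return the set of nodes reached from `nodes` along paths matching `expr`."""
    kind = expr[0]
    if kind == "label":  # "a": one arc with exactly this label
        return {t for n in nodes for (l, t) in edges.get(n, []) if l == expr[1]}
    if kind == "any":    # _: one arc with any label
        return {t for n in nodes for (_l, t) in edges.get(n, [])}
    if kind == "seq":    # R1.R2
        return follow(expr[2], edges, follow(expr[1], edges, nodes))
    if kind == "alt":    # R1|R2
        return follow(expr[1], edges, nodes) | follow(expr[2], edges, nodes)
    if kind == "star":   # R1*: zero or more repetitions, computed as a fixpoint
        seen, frontier = set(nodes), set(nodes)
        while frontier:
            frontier = follow(expr[1], edges, frontier) - seen
            seen |= frontier
        return seen
    raise ValueError(f"unknown expression kind: {kind!r}")


# Made-up graph with a cycle between b and b2 to exercise the star case.
edges = {
    "b": [("author", "a1"), ("ref", "b2")],
    "b2": [("ref", "b"), ("author", "a2")],
    "a1": [("LastName", "Ullman")],
    "a2": [("LastName", "Widom")],
}
# ref*.author.LastName
expr = ("seq", ("star", ("label", "ref")),
        ("seq", ("label", "author"), ("label", "LastName")))
names = follow(expr, edges, {"b"})
print(sorted(names))
```

The star case tracks visited nodes, so it terminates on cyclic graphs; this is the "limited form of recursion" that a query optimizer for such languages has to handle.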


Semi-Structured Data in Practice
A significant application area:
– Web-site management
An unexpected test:
– XML (Extensible Markup Language)
An important technical challenge:
– Query optimization

Web-Site Management Problem: designers are concerned with managing content, structure, and graphical presentation at the same time. Consequently it is hard to: –restructure web sites –enforce integrity constraints –easily create multiple sites from the same data –efficiently update a site.

Declarative Specification of Web-sites
Key idea: specify the structure of the Web-site declaratively:
– A Web-site as a view over an integrated collection of data.
Several systems have been built following this paradigm:
– Strudel (AT&T, INRIA, U. of Washington)
– Araneus (U. of Roma), YAT (INRIA), Autoweb (Milan), Tiramisu (UW)

Strudel Architecture

Strudel
Key ideas:
– Introduces an intermediate abstract representation of the web site: declaratively define the structure of the web site (pages, links between them, and their content).
– Integrates content from multiple sources.
Advantages:
– Derives multiple sites from the same data.
– Supports easy restructuring and modification.
– The declarative representation is a platform for specifying and enforcing integrity constraints, and for designing warehousing configurations that trade off site prematerialization against click-time computation.

Why Semi-structured Data?
– raw data is often semi-structured [e.g., DB&LP],
– convenient for data integration,
– web-sites are ultimately graphs,
– rapidly evolving schema of the web-site,
– schema of web-site does not enforce typing,
– iterative nature of web-site construction.


The Test of XML
XML (Extensible Markup Language) is emerging as a standard for exchanging data on the Web.
Enables separation of content (XML) and presentation (XSL).
DTDs (Document Type Definitions) provide partial schemas for XML documents.
Applications will need to manage XML data.
Can the database community and semi-structured data be of any help?

Semi-structured Data vs. XML
– attributes ---> tags
– objects ---> elements
– atomic values ---> CDATA (characters)
Differences:
– Order? Assumed in XML.
– XML attributes (fixable).
– References in XML.
Real problem: XML comes with no data model!
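The correspondence above can be made concrete by walking an XML document and emitting labeled arcs. This is a minimal sketch under stated assumptions: the tag names in the sample document are hypothetical, XML attributes and references are ignored, and child position is kept as an extra field because XML assumes order while the graph model does not.

```python
import itertools
import xml.etree.ElementTree as ET


def xml_to_edges(xml_text):
    """Map elements to objects, tags to arc labels, character data to atomic values.

    Returns (root_object_id, edges), where each edge is
    (source, label, target, position_among_siblings).
    """
    ids = itertools.count(1)
    edges = []

    def visit(elem):
        children = list(elem)
        if not children:
            # Leaf element: its text is the atomic value.
            return (elem.text or "").strip()
        oid = f"o{next(ids)}"
        for pos, child in enumerate(children):
            edges.append((oid, child.tag, visit(child), pos))
        return oid

    return visit(ET.fromstring(xml_text)), edges


# Hypothetical markup around the bibliography values used in the talk.
doc = "<bib><book><title>Database Systems</title><author>Date</author></book></bib>"
root, edges = xml_to_edges(doc)
print(root)
for edge in edges:
    print(edge)
```

Dropping the position field gives the unordered graph of the earlier slides, which is exactly where the question "Order?" becomes a modeling decision rather than a given.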

References and Attributes
[XML markup lost in extraction; only the element text survives]
Database Systems; Date; Addison-Wesley
Foundation for Object/Relational Databases; Date; Darwen

Semantics of Queries with Order
select N
from Bib.book X, X.reference Y, Y.reference Z, Y.author.lastname N, Z.year U
where X.publisher = "Addison-Wesley"
ordered-by U
Semantics of the answer is unclear!

XML-QL
[Query markup lost in extraction; surviving fragments:] where Addison-Wesley $t $a in " construct $a $t
Proposal submitted to the W3C (workshop to be held on December 3-4th).
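The surviving fragments suggest a pattern-match-then-construct query: select entries mentioning Addison-Wesley, bind $t and $a, and build a result from them. The sketch below mimics that shape in Python rather than in XML-QL syntax. Everything structural here is an assumption: the tag names (book, publisher, title, author, result), the reading of $t as a title and $a as an author, and the third, made-up book entry.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <book><publisher>Addison-Wesley</publisher>
        <title>Database Systems</title><author>Date</author></book>
  <book><publisher>Addison-Wesley</publisher>
        <title>Foundation for Object/Relational Databases</title>
        <author>Date</author><author>Darwen</author></book>
  <book><publisher>Example Press</publisher>
        <title>Placeholder Title</title><author>Placeholder</author></book>
</bib>"""


def addison_wesley_results(xml_text):
    """'where' part: match books by publisher and bind title/author pairs.
    'construct' part: emit one <result> element per binding."""
    results = ET.Element("results")
    for book in ET.fromstring(xml_text).iter("book"):
        if book.findtext("publisher") != "Addison-Wesley":
            continue
        for t in book.findall("title"):
            for a in book.findall("author"):
                result = ET.SubElement(results, "result")
                ET.SubElement(result, "author").text = a.text
                ET.SubElement(result, "title").text = t.text
    return results


results = addison_wesley_results(BIB)
pairs = [(r.findtext("author"), r.findtext("title")) for r in results]
for pair in pairs:
    print(pair)
```

A book with two authors yields two results, one per variable binding, which is the usual semantics for pattern variables in this family of query languages.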


Query Optimization: Challenges Statistics: –What do they even mean when the data is so irregular? –Data comes from external sources. Evaluation of regular path expressions: –need to optimize queries with limited forms of recursion. Mismatch between logical and physical schemas: –graphs are the logical model, but their storage varies considerably.

Logical vs. Physical Mismatch
Graphs can be stored by:
– materializing only forward pointers on edges,
– maintaining some backward pointers,
– indexing on collections.
We can model the storage by binding patterns:
– {title^bf}, {author^bf, author^fb}
Other storage patterns can be modeled by GMAPs (Tsatalos et al., 96).
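Binding patterns can be read as access restrictions: in title^bf the first argument must be bound (b) before the second can be fetched (f), while author is stored in both directions. The sketch below checks whether a left-to-right ordering of query atoms is executable under such patterns. The function name, the two-argument relation encoding, and the sample query are assumptions for illustration, not the optimizer described in the talk.

```python
def executable(order, patterns, bound):
    """Check that each atom, taken left to right, has a usable binding pattern.

    order:    list of (relation, (var1, var2, ...)) atoms
    patterns: relation -> list of strings over {'b', 'f'}, one letter per argument
    bound:    variables bound before execution starts (e.g. by query constants)
    """
    bound = set(bound)
    for relation, args in order:
        usable = any(
            all(var in bound for var, mode in zip(args, pattern) if mode == "b")
            for pattern in patterns.get(relation, [])
        )
        if not usable:
            return False
        bound.update(args)  # executing the atom binds all of its variables
    return True


# Storage from the slide: title only forward, author in both directions.
patterns = {"title": ["bf"], "author": ["bf", "fb"]}

# Hypothetical query: titles T of books B written by a given author A.
author_first = [("author", ("B", "A")), ("title", ("B", "T"))]
title_first = [("title", ("B", "T")), ("author", ("B", "A"))]

print(executable(author_first, patterns, bound={"A"}))
print(executable(title_first, patterns, bound={"A"}))
```

With only A bound, starting from the author atom works through the backward pattern author^fb, while starting from title fails because title^bf needs the book first. Removing "fb" from the author patterns makes both orders fail, which is how the physical storage constrains the space of valid plans.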

The Effect of Binding Patterns on the Search Space
Need to search the space of annotated query plans:
– every query execution plan is also annotated with the set of inputs it requires.
If only a few binding patterns are available:
– the search space becomes smaller.
Multiple binding patterns per relation:
– the size of the space grows.
Florescu et al.: pruning methods for searching this space.

Conclusions Semi-structured data is everywhere. XML imposes a sense of urgency. An opportunity for the DB community to impact the WWW. We know how to model and query such data. Challenges: optimization, storage, adding partial structure. How can we help users structure information?