Overview of XML Data Management Research at Cornell Jayavel Shanmugasundaram Cornell University.

Why XML? Internet data exchange –XML emerging as dominant standard for data interactions over the Internet (e.g., SOAP) –Consequently, web application developers deal with XML (e.g., WebSphere) Captures structured and unstructured data –Content management –Semi-structured data

Outline XML for data exchange XML for structured and unstructured data Conclusion

XML for Data Exchange Figure: Tires R Us (Order Fulfillment Application) and Cars R Us (Purchasing Application), each backed by a Relational Database System, exchange eXtensible Markup Language (XML) over the Internet

Key Challenges Publishing relational data as XML –For the foreseeable future, most business data will continue to be stored in relational databases –Need to publish relational data as XML Storing XML using relational database systems –Need to manage XML documents being transferred across the wire (purchase orders for auditing, etc.) –Do we need to build a specialized XML database? Or can we leverage relational technology?

Outline XML for data exchange –Publishing relational data as XML –Querying XML using relational databases XML for structured and unstructured data Conclusion

Example Relational Data
order (id, custname, custnum): (10, Smith Construction, 7734), (9, Western Builders, 7725)
item (oid, desc, cost): 10 generator backhoe 24000
payment (oid, due, amt): 10 1/10/ /10/

XML View for Users Smith Construction …

Allow Users to Query View Get all orders of customer ‘Smith…’ for $order in view(“orders”) where $order/customer/text() like ‘Smith%’ return $order

// First prepare all the SQL statements to be executed and create cursors for them
Exec SQL Prepare CustStmt From "select cust.id, cust.name from Customer cust where cust.name = 'Jack'"
Exec SQL Declare CustCursor Cursor For CustStmt
Exec SQL Prepare AcctStmt From "select acct.id, acct.acctnum from Account acct where acct.custId = ?"
Exec SQL Declare AcctCursor Cursor For AcctStmt
Exec SQL Prepare PorderStmt From "select porder.id, porder.acct, porder.date from PurchOrder porder where porder.custId = ?"
Exec SQL Declare PorderCursor Cursor For PorderStmt
Exec SQL Prepare ItemStmt From "select item.id, item.desc from Item item where item.poId = ?"
Exec SQL Declare ItemCursor Cursor For ItemStmt
Exec SQL Prepare PayStmt From "select pay.id, pay.desc from Payment pay where pay.poId = ?"
Exec SQL Declare PayCursor Cursor For PayStmt

// Now execute SQL statements in nested order of XML document result. Start with customer
XMLResult = ""
Exec SQL Open CustCursor
while (CustCursor has more rows) {
  Exec SQL Fetch CustCursor Into :custId, :custName
  XMLResult += " " + custName + " "

  // For each customer, issue sub-query to get account information and add to custAccts
  Exec SQL Open AcctCursor Using :custId
  while (AcctCursor has more rows) {
    Exec SQL Fetch AcctCursor Into :acctId, :acctNum
    XMLResult += " " + acctNum + " "
  }
  XMLResult += " "

  // For each customer, issue sub-query to get purchase order information and add to custPorders
  Exec SQL Open PorderCursor Using :custId
  while (PorderCursor has more rows) {
    Exec SQL Fetch PorderCursor Into :poId, :poAcct, :poDate
    XMLResult += " " + poDate + " "

    // For each purchase order, issue a sub-query to get item information and add to porderItems
    Exec SQL Open ItemCursor Using :poId
    while (ItemCursor has more rows) {
      Exec SQL Fetch ItemCursor Into :itemId, :itemDesc
      XMLResult += " " + itemDesc + " "
    }
    XMLResult += " "

    // For each purchase order, issue a sub-query to get payment information and add to porderPays
    Exec SQL Open PayCursor Using :poId
    while (PayCursor has more rows) {
      Exec SQL Fetch PayCursor Into :payId, :payDesc
      XMLResult += " " + payDesc + " "
    }
    XMLResult += " "
  } // End of looping over all purchase orders associated with a customer
  XMLResult += " "
  Return XMLResult as one result row; reset XMLResult = ""
} // loop until all customers are tagged and output
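
The nested-cursor approach above can be sketched in runnable form. This is a minimal illustration using Python's sqlite3 module, not the code from the talk: the table and column names (Customer, PurchOrder, Item, odate, descr), the element names, and the sample rows are all assumptions, and only the customer, purchase order, and item levels are shown.

```python
import sqlite3
from xml.sax.saxutils import escape


def demo_connection():
    """Create a tiny in-memory database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE PurchOrder (id INTEGER PRIMARY KEY, custId INTEGER, odate TEXT);
        CREATE TABLE Item (id INTEGER PRIMARY KEY, poId INTEGER, descr TEXT);
        INSERT INTO Customer VALUES (1, 'Jack');
        INSERT INTO PurchOrder VALUES (10, 1, '2001-01-10');
        INSERT INTO Item VALUES (100, 10, 'generator'), (101, 10, 'backhoe');
    """)
    return conn


def publish_customers(conn):
    """Return one XML string per customer, built with nested sub-queries."""
    results = []
    customers = conn.execute("SELECT id, name FROM Customer ORDER BY id").fetchall()
    for cust_id, cust_name in customers:
        parts = ["<customer>", f"<name>{escape(cust_name)}</name>"]
        # One sub-query per customer to fetch its purchase orders.
        orders = conn.execute(
            "SELECT id, odate FROM PurchOrder WHERE custId = ? ORDER BY id",
            (cust_id,),
        ).fetchall()
        for po_id, po_date in orders:
            parts.append(f"<porder><date>{escape(po_date)}</date>")
            # One sub-query per purchase order to fetch its items.
            items = conn.execute(
                "SELECT descr FROM Item WHERE poId = ? ORDER BY id", (po_id,)
            ).fetchall()
            for (item_descr,) in items:
                parts.append(f"<item>{escape(item_descr)}</item>")
            parts.append("</porder>")
        parts.append("</customer>")
        results.append("".join(parts))
    return results
```

The number of SQL statements issued grows with the number of customers and purchase orders, which is the cost of this style of application-level tagging.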

Previous Work SQL extensions for publishing relational data as XML –[Shanmugasundaram et al., VLDB 2000] –Prototyped in DB2 –Input into SQL/X working group XML publishing using XQuery –[Shanmugasundaram et al., VLDB 2001] –Initially XPERANTO prototype –Now XTables product initiative

Updates Updating XML views of relational data –Extend XQuery with update semantics –Translate XQuery updates to SQL updates (when possible!); efficiently! for $order in view(“orders”) where $order/customer/text() = ‘Smith’ update $order/cost = $order/cost

Recursion ‘//’ queries are very popular –Navigational recursion XQuery functions allow structural recursion –Part hierarchies –Nested catalogs How can we evaluate them using a relational database system? –View composition –Fix-point recursion in SQL
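
Fix-point recursion in SQL can be illustrated with a recursive common table expression over a part hierarchy. A minimal sketch, assuming a hypothetical Part(id, parent, name) table and made-up rows; it uses SQLite's WITH RECURSIVE through Python's sqlite3 module and is not the translation scheme from the talk.

```python
import sqlite3


def demo_parts():
    """Create a small part hierarchy (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Part (id INTEGER PRIMARY KEY, parent INTEGER, name TEXT);
        INSERT INTO Part VALUES (1, NULL, 'engine');
        INSERT INTO Part VALUES (2, 1, 'piston');
        INSERT INTO Part VALUES (3, 2, 'ring');
        INSERT INTO Part VALUES (4, NULL, 'wheel');
    """)
    return conn


def descendants(conn, root_id):
    """Ids of all parts below root_id, like a '//' step under that part."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id) AS (
            SELECT id FROM Part WHERE parent = ?      -- direct children
            UNION
            SELECT p.id FROM Part p JOIN sub s ON p.parent = s.id
        )
        SELECT id FROM sub ORDER BY id
        """,
        (root_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The recursive term is re-evaluated until no new rows appear, so the depth of the hierarchy does not need to be known in advance.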

Outline XML for data exchange –Publishing relational data as XML –Querying XML using relational databases XML for structured and unstructured data Conclusion

Native XML Documents

Querying Native XML Documents Native XML database systems –Specialized for XML document processing Extend relational (or object-oriented) database systems –Leverage > 30 years of research and development –Harness sophisticated functionality, tools

Previous Work [Shanmugasundaram et al., VLDB’99] [Shanmugasundaram et al., SIGMOD Record’01] Figure: an XML Translation Layer on top of a Relational Database System. The layer maps an XML Schema to a Relational Schema (keeping Translation Information), XML Documents to Tuples, an XML Query to an SQL Query, and the Relational Result to an XML Result
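
One simple way to picture the translation layer is a generic "edge table" shredding, where every element becomes one tuple and a path query becomes a self-join. The sketch below is only an illustration under that assumption: the Edge table, the helper names, and the sample document are hypothetical, and the cited papers study more refined, schema-driven mappings.

```python
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE = (
    "<orders>"
    "<order><customer>Smith Construction</customer></order>"
    "<order><customer>Western Builders</customer></order>"
    "</orders>"
)


def shred(xml_text, conn):
    """Store each element as one (id, parent, tag, text) tuple."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Edge "
        "(id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)"
    )
    next_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM Edge").fetchone()[0]

    def visit(elem, parent_id):
        nonlocal next_id
        next_id += 1
        elem_id = next_id
        conn.execute(
            "INSERT INTO Edge VALUES (?, ?, ?, ?)",
            (elem_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, elem_id)

    visit(ET.fromstring(xml_text), None)


def customers_like(conn, pattern):
    """SQL counterpart of: order/customer whose text is like the pattern."""
    rows = conn.execute(
        "SELECT c.text FROM Edge o JOIN Edge c ON c.parent = o.id "
        "WHERE o.tag = 'order' AND c.tag = 'customer' AND c.text LIKE ? "
        "ORDER BY c.id",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]
```

Each additional path step adds one more self-join on Edge, which is one reason the choice of shredding matters for performance.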

Query Workload Different XML shredding techniques can have a dramatic influence on performance –How can we choose appropriate shredding based on XML query and update workload? –“SMART for XML”

Typing XML Schemas have many sophisticated constraints –Min occurs, max occurs –Structural constraints How can we preserve these in relational database systems in presence of updates? –Relational constraints? –Materialized views?

Outline XML for data exchange XML for structured and unstructured data Conclusion

30000-foot view of Data Management Today Essentially two “camps” Structured camp: Relational database systems –Highly structured data –Precise and sophisticated queries over this data Unstructured camp: Information retrieval systems –Unstructured data –Keyword search queries returning ranked results

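Traditional Data Management Landscape Figure: a 2×2 grid of data (Structured, Unstructured) against queries (Complex and Structured, Ranked Keyword Search), with quadrants numbered 1 to 4. (Relational) Database Systems cover structured data with complex and structured queries; Information Retrieval Systems cover unstructured data with ranked keyword search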

Primary Advantages of Ranked Keyword Search Simple –As witnessed by popularity of keyword search over the Internet Facilitates information discovery –Ranks results in order of importance –Users do not need to know the schema of the underlying data (if there is any) “John Ithaca”

Ranked Keyword Search over Structured and Unstructured Data Content management –Semi-structured data Scientific documents, Shakespeare’s plays, … –Mix of structured and unstructured data Database with date and time of accident (structured data) and accident description (unstructured data) Support flexible keyword search interface

Semi-structured Document Example XML document: the workshop "XML and Information Retrieval: A SIGIR 2000 Workshop" (David Carmel, Yoelle Maarek, Aya Soffer), containing the paper "XQL and Proximal Nodes" (Ricardo Baeza-Yates, Gonzalo Navarro) with the text "Searching on structured text is becoming more important with XML …", "The XQL language …" and "… A Query Language …"

Key Challenges Generalize ranked keyword query semantics –Should work as usual for unstructured data –Should generalize to structured data too! –Allows users to query across both forms of data Generalized inverted lists –Indexing mix of structured and unstructured data More on this soon!

Complex Queries over Semi-structured Data: Motivation 1 Document repositories do not typically conform to a rigid schema –Scientific documents –PowerPoint presentations –Push to publish these in XML form Complex queries over such heterogeneous collections (in conjunction with structured data) –Find all documents on “XML” authored by Almaden employees

Complex Queries over Semi-structured Data: Motivation 2 Even “structured” data can have a widely varying structure –Example: electronic parts marketplace –2 million parts each having about attributes –A total of 5000 distinct attributes Structure changes very often (“schema chaos”) –New parts are added every day May be better to treat this data as “semi-structured” But … still need to ask structured queries –Find capacitors with capacitance between 10 and 20

Indexing and Query Processing Index both schema and data –/book[author/name = ‘Jane’] –Treat schema as a data value Benefits –Can capture arbitrarily heterogeneous schema –Easy schema evolution –Can implement it using a relational database system! (using regular B+-trees) Supports wildcard (‘//’) queries
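
The idea of treating schema as a data value can be sketched by storing each (path, value) pair as a row and indexing it with an ordinary B-tree. This is a simplified illustration, not the index structure from the talk: the PathValue table, the helper names, and the sample documents are assumptions, the lookup returns document ids rather than elements, and the '//' case uses a plain LIKE suffix match.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_index():
    """Create the (path, value) table with a regular B-tree index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PathValue (path TEXT, value TEXT, doc INTEGER)")
    conn.execute("CREATE INDEX pv ON PathValue (path, value)")
    return conn


def index_document(conn, doc_id, xml_text):
    """Record the root-to-element path of every element that carries text."""
    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:
            conn.execute("INSERT INTO PathValue VALUES (?, ?, ?)", (path, text, doc_id))
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")


def docs_matching(conn, path, value):
    """Documents with the given value at the path; '//x' matches any path ending in /x."""
    if path.startswith("//"):
        sql = "SELECT DISTINCT doc FROM PathValue WHERE path LIKE ? AND value = ? ORDER BY doc"
        args = ("%/" + path[2:], value)
    else:
        sql = "SELECT DISTINCT doc FROM PathValue WHERE path = ? AND value = ? ORDER BY doc"
        args = (path, value)
    return [row[0] for row in conn.execute(sql, args)]
```

Because the path is just another column, a document with a previously unseen structure only adds new path strings; no table definition has to change.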

Order [Tatarinov et al., SIGMOD’02] Shakespeare’s plays marked up as XML –Acts ordered one after the other –Cannot view this as an “unordered set” XQuery queries support ordered predicates –Find acts after Hamlet said “to be or not to be” Again, treat order as a data value –Order encoding methods –Can be implemented in a relational database system!
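
Treating order as a data value can be sketched with the simplest encoding, a global document-order position stored in a column. The Node table, the element names, and the sample play below are assumptions for illustration; the cited work compares several order encodings, of which this is only one.

```python
import sqlite3
import xml.etree.ElementTree as ET

PLAY = (
    "<play>"
    "<act><title>ACT I</title></act>"
    "<act><title>ACT II</title></act>"
    "<act><title>ACT III</title><speech>to be or not to be</speech></act>"
    "<act><title>ACT IV</title></act>"
    "<act><title>ACT V</title></act>"
    "</play>"
)


def load_ordered(xml_text):
    """Shred the document, storing each element's preorder position in pos."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Node (pos INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)"
    )
    counter = 0

    def visit(elem, parent_pos):
        nonlocal counter
        counter += 1
        pos = counter
        conn.execute(
            "INSERT INTO Node VALUES (?, ?, ?, ?)",
            (pos, parent_pos, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, pos)

    visit(ET.fromstring(xml_text), None)
    return conn


def acts_after(conn, phrase):
    """Titles of acts that start after the first speech containing the phrase."""
    rows = conn.execute(
        "SELECT t.text FROM Node a JOIN Node t ON t.parent = a.pos AND t.tag = 'title' "
        "WHERE a.tag = 'act' AND a.pos > "
        "(SELECT MIN(pos) FROM Node WHERE tag = 'speech' AND text LIKE ?) "
        "ORDER BY a.pos",
        ("%" + phrase + "%",),
    ).fetchall()
    return [row[0] for row in rows]
```

The ordered predicate turns into an ordinary comparison on pos, so a regular index on that column can serve it. The act containing the speech is excluded because its own position precedes that of the speech.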

Unifying IR and Database Systems: Motivation The Internet is opening the door to ad-hoc queries by end users –E.g., used car marketplace –Find all “bright red ford mustangs” that cost less than 20% of the average price of cars in their class Characteristics of queries –Ranked keyword search (for ease of use) –Complex query operations (information synthesis) –Want to see ranked results!

Main Challenge Integrate ranking with structured query operations Developing ~XQuery framework –Build ranking into core of language –Both keyword search and structured operators Open question –Will we be able to extend relational databases for this purpose? Find “bright red ford mustangs” that cost less than 20% of the average price of cars in their class

Outline XML for data exchange XML for structured and unstructured data –Overview –Ranked keyword search over XML documents Conclusion

XRANK: Ranked Keyword Search over XML Documents Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram

Design Principles 1) Return most specific element containing the query keywords

Design Principles 1) Return most specific element containing the query keywords 2) Ranking has to be done at the granularity of elements

Design Principles 1) Return most specific element containing the query keywords 2) Ranking has to be done at the granularity of elements 3) Two-dimensional keyword proximity –Height of result XML tree –Width of result XML tree

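System Architecture Figure: Input XML Documents feed ElemRank Computation, which produces XML Elements with ElemRanks; these are indexed in a Hybrid Dewey Inverted List; the Query Evaluator takes a Keyword query, performs Data access against the index, and returns Ranked Results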

Naïve Approach One main difference between document and XML keyword search is result granularity –Treat each element as a document –Build regular inverted list index structures over elements Drawbacks –Space overhead (depth of document) –Ranking (two-dimensional proximity) –Spurious query results
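
The drawbacks of the naïve approach can be seen in a small sketch that treats every element as its own document. The function name, the integer element ids, and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = (
    "<workshop>"
    "<title>XML and Information Retrieval</title>"
    "<paper><title>XQL and Proximal Nodes</title>"
    "<author>Ricardo Baeza-Yates</author></paper>"
    "</workshop>"
)


def naive_index(xml_text):
    """Map each keyword to the ids of ALL elements whose subtree contains it."""
    index = defaultdict(set)
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        elem_id = counter  # preorder id: workshop=1, title=2, paper=3, ...
        words = set((elem.text or "").lower().split())
        for child in elem:
            words |= visit(child)
            words |= set((child.tail or "").lower().split())
        # The keyword is posted once for this element and again for every ancestor.
        for word in words:
            index[word].add(elem_id)
        return words

    visit(ET.fromstring(xml_text))
    return index
```

With this index, the keyword "xql" is stored for the inner title, its paper, and the whole workshop, so list sizes grow with document depth. Intersecting the lists for "xql" and "ricardo" returns both the paper and the enclosing workshop, although only the paper is the most specific result.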

Main Problem with Naïve Approach Decouples representation of ancestors and descendants –Space overhead –Spurious query results Dewey encoding for ids –General knowledge classification (1870s) –LDAP, Ordered XML, …

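Dewey Encoding Figure: the example document drawn as a tree with a Dewey id on each element (e.g., 0.0 for the date element, “July …”); the other nodes shown include “XML and …”, “David Carmel …”, “XQL and …” and “Ricardo …”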
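
A Dewey id extends the parent's id with the element's position among its siblings, so an ancestor's id is always a prefix of its descendants' ids. A minimal sketch of assigning such ids; the function names and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<workshop>"
    "<title>XML and Information Retrieval</title>"
    "<paper><title>XQL and Proximal Nodes</title>"
    "<author>Ricardo Baeza-Yates</author></paper>"
    "</workshop>"
)


def dewey_ids(xml_text):
    """Return (dewey_id, tag) pairs in document order; the root gets id 0."""
    out = []

    def visit(elem, dewey):
        out.append((".".join(str(part) for part in dewey), elem.tag))
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(ET.fromstring(xml_text), (0,))
    return out


def is_ancestor(ancestor, descendant):
    """True if the first Dewey id is a proper prefix of the second."""
    a = ancestor.split(".")
    d = descendant.split(".")
    return len(a) < len(d) and d[: len(a)] == a
```

For the sample document, the paper element gets id 0.1 and its author 0.1.1, so the ancestor relationship can be checked from the ids alone, without storing a separate posting for each ancestor.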

Dewey Inverted List (DIL) Figure: one inverted list per keyword (shown for “XQL” and “Ricardo”); each entry holds a Dewey Id, an ElemRank, and a Position List, and each list is sorted by Dewey Id

Query Processing Can answer XML keyword search queries in single pass over DIL –Space savings –Time savings (smaller inverted lists) Ranking refinements –Ranked Dewey Inverted List (RDIL) –Hybrid Dewey Inverted List (HDIL)
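
The role of Dewey ids in query processing can be sketched as follows: each keyword's list holds only the elements that directly contain it, and common prefixes of the ids identify the elements that contain all keywords. The code below is a simplified illustration with assumed names and data; it ignores ranking and position lists, and it is not the single-pass merge algorithm mentioned above.

```python
from collections import defaultdict


def most_specific(dil, keywords):
    """Deepest elements (as Dewey tuples) whose subtree contains every keyword."""
    wanted = set(keywords)
    seen = defaultdict(set)  # Dewey prefix -> keywords found at or below it
    for keyword in wanted:
        for dewey in dil.get(keyword, []):
            # Every prefix of the id is an ancestor-or-self of this occurrence.
            for depth in range(1, len(dewey) + 1):
                seen[dewey[:depth]].add(keyword)
    hits = {dewey for dewey, found in seen.items() if found == wanted}
    # Drop any hit that has a more specific hit beneath it.
    return sorted(
        d for d in hits
        if not any(h != d and h[: len(d)] == d for h in hits)
    )


# Hypothetical lists: keyword -> Dewey ids of elements directly containing it.
EXAMPLE_DIL = {
    "xml": [(0, 0)],
    "xql": [(0, 1, 0)],
    "ricardo": [(0, 1, 1)],
}
```

For the query "xql ricardo" the sketch returns the element 0.1 rather than the root 0, which matches the first design principle of returning the most specific element.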

Outline XML for data exchange XML for structured and unstructured data Conclusion

Two main uses of XML –Data exchange –Managing structured and unstructured data Each of these gives rise to exciting data management opportunities Pursuing these actively at Cornell