Part One XML and Databases Soumen Chakrabarti CSE, IIT Bombay.



Form and content
The Web today
  – HTML generated by hand, WYSIWYG editors, ‘webified’ databases
  – HTML specifies rendering for human reading
  – Screen scraping required to consolidate data
The Web in the future
  – Common interchange format (XML)
  – Concentrate on content, not form
  – Represent data class broader than relations

Role of databases
Contribute
  – Data storage and indexing
  – Query processing and optimization
  – Views, transformations, integration
Adopt
  – Search modalities
  – Content-based approximate search
  – Linguistic analysis

Features of semi-structured data
No explicit schema, or volatile schema
Schema size comparable to data size
Structure changes without notice
Heterogeneous, deeply nested, irregular
Has nature of documents rather than tables

Semi-structured data model example
[Figure: Object Exchange Model (OEM) graph rooted at Bib, with objects &o1, &o12, &o24, &o29, &o43, &o96, &o243, &o206, &o25; edge labels paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first, last; atomic values “Serge”, “Abiteboul”, 1997, “Victor”, “Vianu”; legend: complex object, atomic object]

Syntax
{ paper: { author: “Abiteboul”,
           author: { firstname: “Victor”, lastname: “Vianu” },
           title: “Regular path queries …”,
           page: { first: 122, last: 133 } } }

Some observations
Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections
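These observations can be made concrete with a minimal Python sketch of the OEM fragment from the Syntax slide. It assumes an object is stored as a list of (label, value) pairs so that a label such as author may repeat; the helper name `children` is hypothetical and not part of OEM or Lore.

```python
# OEM-style object: a list of (label, value) pairs. A value is either an
# atomic Python value or another list of pairs (a complex object).
paper = [
    ("author", "Abiteboul"),                  # atomic author
    ("author", [("firstname", "Victor"),      # complex author
                ("lastname", "Vianu")]),
    ("title", "Regular path queries …"),      # title is elided on the slide
    ("page", [("first", 122), ("last", 133)]),
]
bib = [("paper", paper)]


def children(obj, label):
    """Return every value reachable from obj over an edge named label."""
    if not isinstance(obj, list):
        return []  # atomic objects have no outgoing edges
    return [value for (lab, value) in obj if lab == label]


authors = children(paper, "author")
print(len(authors))                          # multiple attributes
print([type(a).__name__ for a in authors])   # different types, same label
print(children(paper, "year"))               # missing attribute
```

A list of pairs rather than a dictionary is the deliberate choice here: a dictionary would silently drop the second author.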

Object IDs and references
[Figure: objects Jane, Mary, and John; object IDs o555, o456, o123; reference edges labeled children and mother]

Names and acronyms
OEM (Object Exchange Model): a semi-structured data model from Stanford, 1995
Lore: a system for storing data adhering to the OEM
Lorel: a query language for Lore
XML (eXtensible Markup Language): a simplification of SGML and a generalization of HTML
XML-QL: a query language for XML

Lorel query examples
select Bib.paper.title from Bib.paper where Bib.paper.year > 1995
select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? = “Ullman” and Y.reference+ X
Callouts on the second query: Alternative; Transitive closure; Navigating partially known structures
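The path operators in the second query can be illustrated with a small Python sketch. This is not Lorel's implementation: the graph, the node names, and the functions `step` and `evaluate` are all hypothetical, and a path is assumed to be given as a list of (label set, modifier) steps.

```python
# Hypothetical edge-labeled graph: node -> list of (label, target node).
graph = {
    "Bib": [("paper", "p1"), ("book", "b1"), ("paper", "p2")],
    "b1": [("author", "a1"), ("reference", "p1")],
    "p1": [("title", "t1"), ("reference", "p2")],
    "p2": [("title", "t2")],
    "a1": [("lastname", "Ullman")],
}


def step(nodes, labels):
    """Follow one edge whose label is in labels from every node in nodes."""
    return {dst for n in nodes
            for (lab, dst) in graph.get(n, []) if lab in labels}


def evaluate(start, path):
    """Evaluate a path given as a list of (labels, modifier) steps.

    modifier "" means exactly one edge, "?" zero or one, "+" one or more.
    """
    current = {start}
    for labels, modifier in path:
        if modifier == "":
            current = step(current, labels)
        elif modifier == "?":
            current = current | step(current, labels)
        elif modifier == "+":
            reached, frontier = set(), step(current, labels)
            while frontier:  # stops once no new node is reached
                reached |= frontier
                frontier = step(frontier, labels) - reached
            current = reached
        else:
            raise ValueError(f"unknown modifier: {modifier!r}")
    return current


print(evaluate("Bib", [({"paper", "book"}, "")]))           # (paper|book)
print(evaluate("b1", [({"reference"}, "+")]))               # reference+
print(evaluate("b1", [({"author"}, ""), ({"lastname"}, "?")]))  # lastname?
```

With the optional step, a test such as `= “Ullman”` succeeds whether the author is an atomic string or a complex object with a lastname edge.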

XML-QL query examples
where Morgan Kaufmann $a in “ construct $a
where $a in “ construct $a $l
[The XML element tags and the quoted document URL of both queries are not recoverable from the transcript]

XML storage in ternary relation
[Figure: data graph with objects &o1 through &o6; edge labels paper, title, author, year; atomic values “The Calculus”, “…”, “1986”; stored in relations Ref and Val]
Too many joins
Label name storage redundant
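A minimal sketch of why this layout needs many joins, using Python's built-in `sqlite3`. The column names (`src`, `label`, `dst`, `oid`, `value`) and the particular edges are assumptions for illustration; the slide only names the relations Ref and Val.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Ref (src TEXT, label TEXT, dst TEXT);  -- one row per edge
    CREATE TABLE Val (oid TEXT, value TEXT);            -- one row per atomic object
""")
# Hypothetical edges: a root o1 with one paper o2 and three leaf objects.
conn.executemany("INSERT INTO Ref VALUES (?, ?, ?)", [
    ("o1", "paper", "o2"),
    ("o2", "title", "o3"),
    ("o2", "author", "o4"),
    ("o2", "year", "o5"),
])
conn.executemany("INSERT INTO Val VALUES (?, ?)", [
    ("o3", "The Calculus"),
    ("o4", "…"),            # author value is elided on the slide
    ("o5", "1986"),
])

# Titles of papers whose year is 1986: one self-join of Ref per path step,
# plus one join with Val per atomic value that is read.
rows = conn.execute("""
    SELECT tv.value
    FROM Ref p
    JOIN Ref t  ON t.src = p.dst AND t.label = 'title'
    JOIN Val tv ON tv.oid = t.dst
    JOIN Ref y  ON y.src = p.dst AND y.label = 'year'
    JOIN Val yv ON yv.oid = y.dst
    WHERE p.label = 'paper' AND yv.value = '1986'
""").fetchall()
print(rows)
```

Even this two-attribute lookup touches five table instances, and every row of Ref repeats a label string, which is the redundancy the slide points at.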

Storage optimization through mining
[Figure: paper structure with labels author, title, year, fn, ln; relations Paper1 and Paper2]
Inline common cases
Tolerate a few nulls

Schema extraction
Schema: a template for type/semantics specification
Conformance
  – Does the data conform to a given schema?
Classification
  – If so, which objects belong to which classes/types?
Applications
  – Storage and query optimization

Graph simulation
Given two edge-labeled graphs G1 and G2, a simulation is a relation R between nodes such that if (x1, x2) is in R and (x1, a, y1) is in G1, then there exists (x2, a, y2) in G2 (same label a) such that (y1, y2) is in R.
[Figure: edge x1 –a→ y1 in G1 and edge x2 –a→ y2 in G2, with R relating x1 to x2 and y1 to y2]

Upper and lower bound schema
Lower bound schema
  – Conformance: find simulation R from S to D
  – Classification: check if (c, x) is in R
  – Used in storage optimization
Upper bound schema (data guides)
  – Conformance: find simulation R from D to S
  – Classification: check if (x, c) is in R
  – Used in path index generation and query optimization

Sample data
[Figure: data graph with root &r, company node &c, and nodes &p1 through &p8; edge labels company, employee, worksfor, manages, managedby]

Lower bound schema
[Figure: schema graph with classes Root {&r}, Bosses {&p1, &p4, &p6}, Regulars {&p2, &p3, &p5, &p7, &p8}, Company {&c}; edge labels company, employee, manages, managedby, worksfor]

Storage using lower bound schema
[Figure: lower-bound schema with nodes Root, Company, Employee, string; edge labels company, person, works-for, c.e.o., address, name, managed-by]
Store rest in overflow graph

Upper bound schema (DataGuides)
[Figure: schema graph with classes Root {&r}, Employees {&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8}, Bosses {&p1, &p4, &p6}, Regulars {&p2, &p3, &p5, &p7, &p8}, Company {&c}; edge labels company, employee, manages, managedby, worksfor]
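The slide names DataGuides without giving a construction. One common approach, assumed here, is the subset construction used for strong DataGuides in the Lore work: each guide node is the set of data nodes reached by some label path from the root. The function name, the triple representation, and the small data graph below are hypothetical and are not the slide's sample data.

```python
from collections import deque


def strong_dataguide(edges, root):
    """Subset construction over (src, label, dst) triples.

    Returns a dict: guide node (frozenset of data nodes) ->
    {label: guide node}.
    """
    out = {}
    for (src, label, dst) in edges:
        out.setdefault(src, {}).setdefault(label, set()).add(dst)

    guide = {}
    queue = deque([frozenset([root])])
    while queue:
        state = queue.popleft()
        if state in guide:
            continue  # this target set was already expanded
        targets = {}
        for node in state:
            for label, dsts in out.get(node, {}).items():
                targets.setdefault(label, set()).update(dsts)
        guide[state] = {lab: frozenset(ds) for lab, ds in targets.items()}
        queue.extend(guide[state].values())
    return guide


# Hypothetical data graph in the style of the sample data.
data = {
    ("r", "company", "c"),
    ("r", "employee", "p1"), ("r", "employee", "p2"),
    ("p1", "manages", "p2"), ("p2", "managedby", "p1"),
    ("p1", "worksfor", "c"), ("p2", "worksfor", "c"),
}

guide = strong_dataguide(data, "r")
print(sorted(guide[frozenset(["r"])]["employee"]))
print(len(guide))
```

Every label path in the data appears exactly once in the guide, which is what makes it usable as a path index; the cost is that the guide can grow exponentially on graph-shaped data.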

Query optimization issues
select x from A.B x where exists y in x.C: y = 5
[Figure: several example trees with nodes labeled A, B, C, D and leaf values 4 and 5]

What makes the problem difficult
Selectivity estimation
Index selection
Access cost models
Clustering choices

Part Two Information Retrieval and Databases Soumen Chakrabarti CSE, IIT Bombay

Information retrieval (IR)
Search
  – ‘Inverted’ index
  – Boolean match
  – Relevance ranking
Classification
  – Learn topics from examples
Clustering
  – Discover topics from a document collection
Never done inside a relational database
[Figure: inverted index for the terms cat and dog with positional postings D5: 3, 37, 50; D7: 9, 20; D7: 7, 90, 400; D20: 22, 533]
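A positional inverted index and a Boolean AND match can be sketched in a few lines of Python. The documents below are hypothetical and do not reproduce the postings in the figure; `build_index` and `boolean_and` are illustrative names, and tokenization is assumed to be a plain whitespace split.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, *terms):
    """Documents that contain every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {  # hypothetical documents
    "D5": "the cat sat",
    "D7": "cat chased dog and dog ran",
    "D20": "dog barked",
}
index = build_index(docs)
print(index["cat"])
print(index["dog"])
print(boolean_and(index, "cat", "dog"))
```

Keeping positions, not just document IDs, is what later allows proximity operators such as NEAR.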

Current style of loose integration
RDBMS provides hooks
Declare some columns as textual with keyword index
Inserts, updates, and deletes trigger an external program, e.g., the Verity search engine
Search engine maintains separate indices
Simple query rewriting to combine relational and text-match where-clauses

Reasons
Space
  – BLOB vs. pure relational representation
  – Average English word is only 5 bytes
Time
  – Most text engines are resigned to a flexible (i.e., no) model for data consistency
  – Much faster read-only access than relational database lookups

New features desired
Operations that are more complex than keyword search can benefit from tighter coupling with the RDBMS
Approximate search is essential (Anand Rajaraman, Amazon.com, SIGMOD 99)
  – Misspelled book titles and author names are common
  – Variants of an OEM edge label (author/writer/poet)
Similarity extends to structure as well (‘Travolta’ NEAR ‘Cage’ = ‘Face/Off’)

Case study: generalized ‘like’
SQL has limited string matching constructs
  – like ‘%x’, ‘x%’, ‘%x%’
  – x must be an exact match
Need more lenient match
  – Applications: LDAP, IR
String edit distance is not suitable
  – “Given query, order strings in database in increasing order of edit distance and pick top 5”

Sliding-window matching
[Figure: nascent → nas, asc, sce, cen, ent; pascal → pas, asc, sca, cal; rascal → ras]
Given a query, scan to get a set of 3-grams
Similarity of string in database to query = number of shared 3-grams
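The similarity measure can be written directly as a set intersection. In this Python sketch the function names are hypothetical, and treating rascal as the query against a database holding nascent and pascal is an assumption about how the figure is meant to be read.

```python
def trigrams(s):
    """Set of 3-grams obtained by sliding a window of width 3 over s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(a, b):
    """Number of 3-grams shared by the two strings."""
    return len(trigrams(a) & trigrams(b))


def top_matches(query, strings, k=5):
    """The k database strings that share the most 3-grams with the query."""
    return sorted(strings, key=lambda s: similarity(query, s),
                  reverse=True)[:k]


database = ["nascent", "pascal"]
print(sorted(trigrams("nascent")))
print(similarity("rascal", "pascal"))
print(similarity("rascal", "nascent"))
print(top_matches("rascal", database))
```

This version scans every string; an index from each 3-gram to the strings containing it would avoid the full scan, which is where the storage and update issues on the next slide come in.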

Issues
Minimally disruptive architecture
Low storage overheads
Fast query processing
Good selectivity estimates
Combining with other predicates for ranking
Efficiently handling updates