Integrating Keyword Search into XML Query Processing
Presentation by: Alex Kremer, Ariel Rosenblatt


Outline
- XML Query Language (XML-QL)
- Extending XML-QL with Keyword Search
- Extended XML-QL Implementation Using RDBMS

Bibliography (well-formed, but invalid)
- Article elements are from different sources
- Same information, but using different XML Schemas / DTDs (Document Type Definitions)

XML Queries
- XML is becoming the data storage and exchange format of choice in many applications
- Handling XML data requires a rich and powerful query language
- The language must allow querying both the content and the structure of an XML document
- Varying or unknown structures can make formulating queries very difficult

XML Queries: Why not SQL/OQL
- XML is not rigidly structured
- In XML the schema can exist with the data, as tag names
- If a DTD is not available, the schema is built while the document is parsed
- Elements may be missing or may occur multiple times
- This flexibility is crucial for EDI (Electronic Data Interchange)

XML Query Requirements
W3C Working Group goals:
- Support different usage scenarios
- Define data model + query operators
- Define query language syntax
- Interoperate with other XML working groups

XML Query Requirements: Usage Scenarios
- Human-readable documents: manuals, books, articles
- Data-oriented documents: XML representation of database data, object data, …
  - The XML representation might be either physical or virtual

XML Query Requirements: Usage Scenarios (contd.)
- Mixed-model documents: hybrid of document-oriented and data-oriented (catalogues, patient health records, …)
- Administrative data: configuration files, user profiles, administrative logs

XML Query Requirements: Usage Scenarios (contd.)
- Filtering streams: on-line filtering, extracting, transforming and routing of XML data streams (logs of messages, network packets, stock market data, newswire feeds)
- Document Object Model (DOM): perform queries on DOM structures to return sets of nodes that meet the specified criteria

XML Query Requirements: Usage Scenarios (contd.)
- Multiple syntactic environments: queries embedded in a URL, in XML, in JSP or ASP pages, or in a string in a general-purpose programming language, …

XML Query Requirements: Interoperability
- Results must be returned in a DOM-compatible manner
- XPath (used in XPointer and XSLT): XPath expressiveness and search facilities should be used in the query syntax
- Usage of XML Schema (XSDL) and/or DTD

XML Query Languages: Proposals to W3C
- XQL (heavily based on XPath)
- XML-QL

XML-QL
- It is declarative
- It is "relationally complete"; in particular, it can express joins
- It is simple enough to enable optimizations
- It can extract data from existing XML documents and construct new documents (transformations)

XML-QL: Syntax
- The WHERE clause specifies how to filter data from the input XML dataset
- The CONSTRUCT clause specifies how to assemble the query results in XML

    WHERE ( xml-pattern [ ELEMENT_AS $elem_var ] )* IN url, ( predicate )*
    CONSTRUCT xml-pattern | $variable

XML-QL: Example #1

    WHERE <article>
            <author> <name> $N </name> </author>
            <title> $T </title>
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       $N like *Florescu*
    CONSTRUCT $E

Yields the matching article elements ($E) as the result.

XML-QL Explained: The Data Model
- A set of XML documents must be represented (XML dataset)
- XML elements in a dataset can be partitioned according to their types
- Information must be represented in a lossless manner (the original dataset must be recreatable from the representation)

XML-QL Explained: Data Model Representation
[Figure: graph of the Bibliography dataset. The root node ID00 has article children ID01, ID04, ID08 and ID14; edges are labeled id, link, date, title, source, author and name; leaves hold data values such as "XML Query …", "W3C", "Daniela Florescu", "Alon L …" and "Donald K …".]

XML-QL Explained: Data Model Representation
Dataset D is represented as a graph G_D:
- Nodes:
  - Element e → node N_e, uniquely labeled ID_e
  - Data value v → leaf L_v, uniquely labeled v
- Edges:
  - (N_e, N_e') labeled with the tag of e', if e' is directly nested within e (<e> … <e'> … </e'> … </e>)
  - (N_e, L_v) labeled with "", if v is directly contained within e (<e> v </e>)
  - (N_e, L_v) labeled with attribute name a, if v is the value of attribute a of element e (<e a="v"> … </e>)
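The node and edge rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_graph`, the sequential "IDnn" labels and the sample document are hypothetical, and mixed content (text that follows a child element) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical sample document, loosely modeled on the bibliography example.
SAMPLE = """<bibliography>
  <article id="4">
    <title>Integrating Keyword Search</title>
    <author><name>Daniela Florescu</name></author>
  </article>
</bibliography>"""


def build_graph(xml_text):
    """Return (root_id, edges); each edge is a (source, label, target) triple.

    Element nodes get generated labels ("ID00", "ID01", ...);
    leaves are the data values themselves.
    """
    ids = count()
    edges = []

    def visit(elem):
        node = f"ID{next(ids):02d}"
        # Attribute a="v": edge labeled a, pointing at leaf v.
        for name, value in elem.attrib.items():
            edges.append((node, name, value))
        # Text directly contained in the element: edge with the empty label.
        text = (elem.text or "").strip()
        if text:
            edges.append((node, "", text))
        # Directly nested element e': edge labeled with the tag of e'.
        for child in elem:
            edges.append((node, child.tag, visit(child)))
        return node

    return visit(ET.fromstring(xml_text)), edges


root, edges = build_graph(SAMPLE)
for edge in edges:
    print(edge)
```

Keeping the edges as plain triples makes the later query step a matter of matching labeled edges, which is how the query processing slides describe it.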

XML-QL Explained: Query Processing
- An XML pattern can also be modeled as a graph G_q; some labels in this graph are variables
- The result of evaluating query q on input D is the set of mappings from G_q to G_D that preserve the constant labels
- Each such mapping induces a substitution of the query's variables by constant values
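As a rough illustration of this matching step, the sketch below hard-codes the pattern of Example #1 (an article with an author/name sub-element bound to $N and a title bound to $T) and enumerates the variable bindings over a small document. The function name `example_1`, the sample document and its first title are hypothetical; a real XML-QL processor would match arbitrary patterns rather than one fixed shape.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# Hypothetical data: the first article stores the author name as an attribute,
# the second one stores it as a <name> sub-element.
DOC = """<bibliography>
  <article id="3"><title>Hypothetical Title</title><author name="Daniela Florescu"/></article>
  <article id="4"><title>Integrating Keyword Search</title>
    <author><name>Daniela Florescu</name></author></article>
</bibliography>"""


def example_1(xml_text, name_pattern="*Florescu*"):
    """Return (article element, $N, $T) for every mapping of the pattern."""
    results = []
    for article in ET.fromstring(xml_text).iter("article"):
        for name in article.findall("author/name"):   # constant labels author, name
            for title in article.findall("title"):    # constant label title
                n = (name.text or "").strip()
                t = (title.text or "").strip()
                if fnmatchcase(n, name_pattern):       # predicate: $N like *Florescu*
                    results.append((article, n, t))
    return results


for elem, n, t in example_1(DOC):
    print(elem.get("id"), n, t)
```

Only the second article matches, because the pattern requires name to be a sub-element; this is the limitation that motivates the union query and, later, the contains predicate.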

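XML-QL Explained: A Query Graph for Example #1

    WHERE <article>
            <author> <name> $N </name> </author>
            <title> $T </title>
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       $N like *Florescu*
    CONSTRUCT $E

[Figure: query graph. The article node has a title edge leading to $T and an author edge leading to a node whose name edge leads to "*Florescu*".]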

XML-QL Explained: Query Processing, Example #1
[Figure: the query graph of Example #1 matched against the graph of the Bibliography dataset.]
- No match where "name" is an attribute
- Match! Add ID08 to the results: $E = ID08, $T = "Integrating Keyword Search …"

XML-QL: Advanced Queries, Example #2 (More Florescu)
We now look for articles where the author name can also be an attribute:

    WHERE <article>
            <author> <name> $N </name> </author>
            <title> $T </title>
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       $N like *Florescu*
    CONSTRUCT $E
    union
    WHERE <article>
            <author name=$N> </author>
            <title> $T </title>
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       $N like *Florescu*
    CONSTRUCT $E

XML-QL: Disadvantages
- We need to know the XML structure in order to query
- We can still write more thorough queries that get all the information available, but
- These queries can easily grow very complex, as seen previously

XML-QL: Keyword Search Extension
- Adds a special predicate called contains to XML-QL
- Tests for the existence of a given word within an XML element
- Works on partially known or unknown XML structure
- Allows querying several XML documents with different structures

Extended XML-QL: The contains Predicate
The contains predicate has 4 arguments, ($E, word, depth, location):
- $E is an XML element variable
- word is the word we are searching for
- depth is an integer expression limiting the depth at which the word is found within the element
- location is a boolean expression over the set of constants {tag_name, attribute_name, content, attribute_value}
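To make the four arguments concrete, here is a naive in-memory sketch of the predicate over a parsed element. It is not the RDBMS-based implementation the presentation goes on to describe. Several details are assumptions of this sketch: depth 0 means the element itself, the location expression is simplified to a set of allowed constants (an "or" of locations), words are split on whitespace, and matching is case-insensitive.

```python
import xml.etree.ElementTree as ET

# "any" in the slides is modeled here as the full set of location constants.
ANY = {"tag_name", "attribute_name", "content", "attribute_value"}


def contains(elem, word, depth, locations):
    """Return True if `word` occurs in `elem` within `depth` nesting levels."""
    word = word.lower()

    def hit(text):
        return word in (text or "").lower().split()

    def search(node, level):
        if level > depth:
            return False
        if "tag_name" in locations and hit(node.tag):
            return True
        if "content" in locations and hit(node.text):
            return True
        for name, value in node.attrib.items():
            if "attribute_name" in locations and hit(name):
                return True
            if "attribute_value" in locations and hit(value):
                return True
        # Descend one nesting level at a time until the depth limit is reached.
        return any(search(child, level + 1) for child in node)

    return search(elem, 0)


article = ET.fromstring(
    '<article><author name="Daniela Florescu"/>'
    "<title>Integrating Keyword Search</title></article>"
)
print(contains(article, "Florescu", 3, {"content", "attribute_value"}))
print(contains(article, "Florescu", 3, {"content"}))
```

The first call finds the word in an attribute value one level down; the second does not, because the search is restricted to element content.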

Extended XML-QL: Example #3
We can use the extended XML-QL to formulate a query that yields the same result as Example #2:

    WHERE <article>
            <author> </author> ELEMENT_AS $A
            <title> $T </title>
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       contains($A, "Florescu", 3, content or attribute_value)
    CONSTRUCT $E

Extended XML-QL: Example #4
We are able to query unstructured data (full-text search) within a set of articles:

    WHERE <article> </article> ELEMENT_AS $E
    IN "bibliography.xml",
       contains($E, "Florescu", 3, any)
    CONSTRUCT $E

Implementing the contains Predicate
The authors suggest implementing the XML-QL extension on top of a commercial RDBMS: Oracle 8, IBM DB2, MS-SQL, …

Implementation Using RDBMS
Reasons:
- An extended XML query processor is easy to implement
- RDBMSs are universally available
- An RDBMS allows mixing XML data with other (relational) data
- Very good performance over large volumes of data

Relational Support for Full-text Indexing
- Extended inverted files are used to implement:
  - The contains predicate
  - Finding relevant XML data sources (URLs) in a distributed environment
- We will use an RDBMS to implement the inverted files

Inverting Files
For our needs, the inverted file will contain tuples of the format (elID, word, depth, location), built from documents such as bibliography.xml.

Storing Inverted Files in RDBMS: Unique Internal elIDs
Unique element IDs are modeled as records containing:
- Document locators (URLs)
- Element locators within the document:
  - Using absolute positions (start, end)
  - Using unique identifiers specified by the DTD (explicit id attribute)
- Why not XPointer?

Storing Inverted Files in RDBMS: Unique elID Schemes
After normalization, the authors propose the following scheme:
- Elements(elID, docid, start_pos, end_pos, type, id_val)
- Documents(docid, URL)
From this point on, elID can be used as an internal key for faster processing
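One possible way to declare these two relations, sketched with Python's built-in sqlite3 module. The column types, the key constraints and the sample row (including its positions 410 and 655) are assumptions made for illustration; the slides give only the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (
    docid INTEGER PRIMARY KEY,
    URL   TEXT NOT NULL UNIQUE
);
CREATE TABLE Elements (
    elID      INTEGER PRIMARY KEY,   -- internal key used by the inverted files
    docid     INTEGER NOT NULL REFERENCES Documents(docid),
    start_pos INTEGER,               -- absolute start position in the document
    end_pos   INTEGER,               -- absolute end position in the document
    type      TEXT NOT NULL,         -- tag name of the element
    id_val    TEXT                   -- explicit id attribute, if the DTD defines one
);
""")

# Hypothetical rows: one document and one article element inside it.
conn.execute("INSERT INTO Documents VALUES (1, 'bibliography.xml')")
conn.execute("INSERT INTO Elements VALUES (8, 1, 410, 655, 'article', '4')")

# Resolving an elID back to its document and element locator.
row = conn.execute("""
    SELECT d.URL, e.type, e.id_val
    FROM Elements e JOIN Documents d ON d.docid = e.docid
    WHERE e.elID = 8
""").fetchone()
print(row)
```

With this layout, every other table only needs to carry the compact elID, and the locator details are looked up once, when an element actually has to be fetched.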

Storing Inverted Files in RDBMS
- Natural way: a single table with the scheme contains(elID, word, depth, location). Huge!
- We partition it into word tables, one for each keyword in the dataset, each with the scheme (elID, depth, location)
- Virtually all IR (Information Retrieval) systems use partitioning by word

Storing Inverted Files in RDBMS: Further Partitioning
We use further partitioning to optimize query processing:
- The type (tag) of the element is usually known at predicate evaluation time, by looking at the XML pattern of the query
- We further partition the individual word tables by the type of the element they are in: word-type(elID, depth, location)
- Table examples from bibliography.xml: Name-author, Florescu-name
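The word-by-type partitioning can be sketched with an in-memory dictionary standing in for the individual tables. This is an illustration only: `build_index`, the document-order elIDs, the lowercasing of words and the `max_depth` cut-off are assumptions of this sketch, not details taken from the slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

# Hypothetical sample document.
DOC = """<bibliography>
  <article id="4">
    <title>Integrating Keyword Search</title>
    <author><name>Daniela Florescu</name></author>
  </article>
</bibliography>"""


def build_index(xml_text, max_depth=3):
    """Map (word, element type) to a list of (elID, depth, location) postings."""
    index = defaultdict(list)
    ids = count(1)

    def postings(node, level):
        """Yield (word, depth, location) for each word at or below `node`."""
        yield node.tag.lower(), level, "tag_name"
        for word in (node.text or "").lower().split():
            yield word, level, "content"
        for name, value in node.attrib.items():
            yield name.lower(), level, "attribute_name"
            for word in value.lower().split():
                yield word, level, "attribute_value"
        if level < max_depth:
            for child in node:
                yield from postings(child, level + 1)

    # Every element gets an elID in document order and is indexed under its type.
    for elem in ET.fromstring(xml_text).iter():
        el_id = next(ids)
        for word, depth, location in postings(elem, 0):
            index[(word, elem.tag)].append((el_id, depth, location))
    return index


index = build_index(DOC)
print(index[("name", "author")])    # plays the role of table Name-author
print(index[("florescu", "name")])  # plays the role of table Florescu-name
```

Each dictionary key corresponds to one small table, so a contains predicate on an element whose type is known from the query pattern only has to read a single partition.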

Implementation: Extended XML-QL Query Processing
Two ways:
- Replicating the whole XML data in an RDBMS: XML-QL processing is performed entirely in the RDBMS
- Distributed XML query processing: only the index (contains) is stored in the RDBMS

Replicating the XML Data in an RDBMS
The binary table approach:
- For each type (tag name or attribute name) found in bibliography.xml, a table named after the type is built with the scheme (parent, element, value)
- parent is the element that contains the element of this type
- element is null if the element has no sub-elements or if the type is an attribute name (in that case we are usually interested in the value)
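A minimal sketch of loading a document into such binary tables, using Python's sqlite3 module. The helper name `shred`, the integer element IDs and the sample document are hypothetical. As a simplification, this sketch stores an element ID for every element and leaves element NULL only for attributes, so that joins on element, as in the SQL examples, work for text-only elements too. Table names are interpolated from tag names, which is acceptable only for trusted input.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical sample document.
DOC = """<bibliography>
  <article id="4">
    <title>Integrating Keyword Search</title>
    <author><name>Daniela Florescu</name></author>
  </article>
</bibliography>"""


def shred(xml_text, conn):
    """Load one XML document into binary tables, one table per type."""
    ids = count(1)

    def insert(type_name, parent, element, value):
        # One table per tag or attribute name; quoted because names are arbitrary.
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{type_name}" '
            "(parent INTEGER, element INTEGER, value TEXT)"
        )
        conn.execute(
            f'INSERT INTO "{type_name}" VALUES (?, ?, ?)', (parent, element, value)
        )

    def visit(elem, parent):
        el_id = next(ids)
        text = (elem.text or "").strip() or None
        insert(elem.tag, parent, el_id, text)
        for name, value in elem.attrib.items():
            insert(name, el_id, None, value)   # attribute: element is NULL
        for child in elem:
            visit(child, el_id)

    visit(ET.fromstring(xml_text), None)


conn = sqlite3.connect(":memory:")
shred(DOC, conn)
print(conn.execute('SELECT * FROM "name"').fetchall())
print(conn.execute('SELECT * FROM "id"').fetchall())
```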

Replicating the XML Data in an RDBMS: XML-QL Queries
- Every XML-QL query can be translated into an equivalent SQL query
- The SQL query processes the binary tables of the replicated XML data

XML-QL to SQL: Example #5 (from Example #1)

    WHERE <article>
            <author> <name> $N </name> </author>
            <title> $T </title>
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       $N like *Florescu*
    CONSTRUCT $E

translates to:

    SELECT article.element
    FROM article, author, name, title
    WHERE article.element = author.parent
      AND author.element = name.parent
      AND article.element = title.parent   /* title exists */
      AND name.value LIKE '%Florescu%'
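The translated query can be tried out against a tiny set of binary tables with Python's sqlite3 module. The rows below are hypothetical stand-ins for the bibliography data, and the XML-QL wildcard pattern *Florescu* is written as the SQL pattern '%Florescu%', which is an assumption about how the pattern is meant to be translated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE author  (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE name    (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE title   (parent INTEGER, element INTEGER, value TEXT);

-- Hypothetical rows: article 8 has two authors with <name> sub-elements,
-- article 1 has a title but no author.
INSERT INTO article VALUES (0, 8, NULL), (0, 1, NULL);
INSERT INTO title   VALUES (8, 9, 'Integrating Keyword Search'), (1, 2, 'Some Other Title');
INSERT INTO author  VALUES (8, 10, NULL), (8, 12, NULL);
INSERT INTO name    VALUES (10, 11, 'Daniela Florescu'), (12, 13, 'Another Author');
""")

rows = conn.execute("""
    SELECT article.element
    FROM article, author, name, title
    WHERE article.element = author.parent
      AND author.element  = name.parent
      AND article.element = title.parent   -- title exists
      AND name.value LIKE '%Florescu%'
""").fetchall()
print(rows)
```

Each nesting step in the XML pattern becomes one equality join between the element column of the outer table and the parent column of the inner one.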

Extended XML-QL to SQL: Keyword Search
- Processing the contains predicate involves the inverted file tables
- The word-type table has to be joined with the previous result
- The word-type table is the table produced by the word-by-type partitioning

Extended XML-QL to SQL: Example #6

    WHERE <article>
            <author> </author> ELEMENT_AS $A
            <title> $Ttext </title> ELEMENT_AS $T
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       contains($A, "Florescu", 3, any),
       contains($T, "Integrating", 3, any)
    CONSTRUCT $Ttext

translates to:

    SELECT title.value
    FROM article, author, title, "Florescu-author", "Integrating-title"
    WHERE article.element = author.parent
      AND author.element = "Florescu-author".elID
      AND article.element = title.parent
      AND title.element = "Integrating-title".elID
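A runnable sketch of this join, again with sqlite3 and hypothetical rows. The hyphenated word-type table names are quoted so that SQL accepts them, and the explicit depth <= 3 filter is an addition of this sketch to reflect the depth argument of contains; it does not appear in the query on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE author  (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE title   (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE "Florescu-author"   (elID INTEGER, depth INTEGER, location TEXT);
CREATE TABLE "Integrating-title" (elID INTEGER, depth INTEGER, location TEXT);

-- Hypothetical rows: article 8 matches both keywords, article 1 matches neither.
INSERT INTO article VALUES (0, 8, NULL), (0, 1, NULL);
INSERT INTO title   VALUES (8, 9, 'Integrating Keyword Search'), (1, 2, 'Some Other Title');
INSERT INTO author  VALUES (8, 10, NULL), (1, 3, NULL);
INSERT INTO "Florescu-author"   VALUES (10, 1, 'content');
INSERT INTO "Integrating-title" VALUES (9, 0, 'content');
""")

rows = conn.execute("""
    SELECT title.value
    FROM article, author, title,
         "Florescu-author" AS fa, "Integrating-title" AS it
    WHERE article.element = author.parent
      AND author.element  = fa.elID      -- contains($A, "Florescu", ...)
      AND article.element = title.parent
      AND title.element   = it.elID      -- contains($T, "Integrating", ...)
      AND fa.depth <= 3 AND it.depth <= 3
""").fetchall()
print(rows)
```

Each contains predicate costs one extra equality join against a small, pre-partitioned table instead of a scan over the element's text.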

Distributed XML Query Processing
- XML data can be indexed in an RDBMS, but the XML data itself cannot be stored in the RDBMS
  - Reasons: volume (the entire WWW) or legal restrictions
- The mediator (query interface):
  - Uses the inverted files in the RDBMS, but
  - Accesses the data sources to compute the full query result (expensive!)
  - Loads the relevant documents/elements into the RDBMS and processes the query as described before (XML-QL to SQL)

Distributed XML Query Processing: Elements Retrieval
Inverted files are used to retrieve the relevant documents/elements:
- Evaluate contains predicates to disqualify irrelevant elements
- Further reduce the dataset needed to process the remaining basic XML-QL query
  - This is an optimization, since retrieval of remote data is expensive
- Load the relevant documents/elements

Distributed XML Query Processing: Reducing Retrieval

    WHERE <article>
            <author> <name> $N </name> </author>
            <title> $T </title>
          </article> ELEMENT_AS $E
    IN "bibliography.xml",
       $T like *XML*
    CONSTRUCT $N

Get the intersection of the elID sets from the word-type tables: author-article, name-article, title-article, XML-article
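The intersection step can be sketched over an in-memory stand-in for the word-type tables. The function name `candidate_elements`, the dictionary layout and all postings below are hypothetical; in the setting described here the sets would come from the inverted file tables in the RDBMS.

```python
def candidate_elements(index, words, element_type):
    """Intersect the elID sets of the word-type tables for all given words."""
    id_sets = [
        {el_id for el_id, _depth, _location in index.get((word, element_type), [])}
        for word in words
    ]
    return set.intersection(*id_sets) if id_sets else set()


# Hypothetical postings: (word, type) -> [(elID, depth, location), ...]
index = {
    ("author", "article"): [(4, 1, "tag_name"), (8, 1, "tag_name")],
    ("name", "article"): [(4, 1, "attribute_name"), (8, 2, "tag_name"), (14, 2, "tag_name")],
    ("title", "article"): [(1, 1, "tag_name"), (4, 1, "tag_name"), (8, 1, "tag_name")],
    ("xml", "article"): [(1, 2, "content"), (8, 2, "content")],
}

# Only articles that mention all four words need to be fetched from the source.
print(candidate_elements(index, ["author", "name", "title", "xml"], "article"))
```

Only the surviving elIDs have to be fetched from the remote source, which is where the saving comes from when remote access is the expensive part.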

Conclusions
- XML-QL can be extended to support keyword search
- Use of RDBMS:
  - Inverted files can be stored and queried using an RDBMS
  - XML data itself can be replicated and queried in the RDBMS
  - Keyword search and overall XML query processing can be carried out very efficiently
- Data structure influence:
  - The more structure is known, the faster a query will be executed
  - Totally unstructured queries can be executed very fast
  - The more structure is known, the higher the quality of the query results