Querying Structured Text in an XML Database Shurug Al-Khalifa Cong Yu H. V. Jagadish (University of Michigan) Presented by Vedat Güray AFŞAR & Esra KIRBAŞ.

Slides:



Advertisements
Similar presentations
Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology,
Advertisements

Chapter 5: Introduction to Information Retrieval
Efficient Keyword Search for Smallest LCAs in XML Database Yu Xu Department of Computer Science & Engineering University of California, San Diego Yannis.
XML and Enterprise Computing. What is XML? Stands for “Extensible Markup Language” –similar to SGML and HTML –document “tags” are used to define content.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Min LuTIMBER: A Native XML DB1 TIMBER: A Native XML Database Author: H.V. Jagadish, etc. Presenter: Min Lu Date: Apr 5, 2005.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
1 Abdeslame ALILAOUAR, Florence SEDES Fuzzy Querying of XML Documents The minimum spanning tree IRIT - CNRS IRIT : IRIT : Research Institute for Computer.
TIMBER A Native XML Database Xiali He The Overview of the TIMBER System in University of Michigan.
Image Indexing and Retrieval using Moment Invariants Imran Ahmad School of Computer Science University of Windsor – Canada.
XSEarch XML Search Engine Jonathan MAMOU October 2002.
Information Retrieval in Practice
Search Engines and Information Retrieval
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Suggestion of Promising Result Types for XML Keyword Search Joint work with Jianxin Li, Chengfei Liu and Rui Zhou ( Swinburne University of Technology,
IR Models: Structural Models
TOSS: An Extension of TAX with Ontologies and Similarity Queries Edward Hung, Yu Deng, V.S. Subrahmanian Department of Computer Science University of Maryland,
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
Flexible and Efficient XML Search with Complex Full-Text Predicates Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research Emiran Curtmola - University.
XSEarch: A Semantic Search Engine for XML Sara Cohen Jonathan Mamou Yaron Kanza Yehoshua Sagiv Presented at VLDB 2003, Germany.
CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research Prototype to Industrial Tool Name: DH(Dong Hwi) kwak Date:
Query Operations: Automatic Global Analysis. Motivation Methods of local analysis extract information from local set of documents retrieved to expand.
Overview of Search Engines
Indexing XML Data Stored in a Relational Database VLDB`2004 Shankar Pal, Istvan Cseri, Gideon Schaller, Oliver Seeliger, Leo Giakoumakis, Vasili Vasili.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Modern Information Retrieval Chap. 02: Modeling (Structured Text Models)
NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and IR provides users with a wide range of high quality.
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
Selective and Authentic Third-Party distribution of XML Documents - Yashaswini Harsha Kumar - Netaji Mandava (Oct 16 th 2006)
Search Engines and Information Retrieval Chapter 1.
Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University.
A TREE BASED ALGEBRA FRAMEWORK FOR XML DATA SYSTEMS
DBXplorer: A System for Keyword- Based Search over Relational Databases Sanjay Agrawal Surajit Chaudhuri Gautam Das Presented by Bhushan Pachpande.
1 Holistic Twig Joins: Optimal XML Pattern Matching ACM SIGMOD 2002.
1 Ranking Inexact Answers. 2 Ranking Issues When inexact querying is allowed, there may be MANY answers –different answers have a different level of incompleteness.
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
1 Efficient Search Ranking in Social Network ACM CIKM2007 Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis,
Querying Structured Text in an XML Database By Xuemei Luo.
April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
Approximate XML Joins Huang-Chun Yu Li Xu. Introduction XML is widely used to integrate data from different sources. Perform join operation for XML documents:
ISP 433/533 Week 11 XML Retrieval. Structured Information Traditional IR –Unit of information: terms and documents –No structure Need more granularity.
Database Systems Part VII: XML Querying Software School of Hunan University
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Gökay Burak AKKUŞ Ece AKSU XRANK XRANK: Ranked Keyword Search over XML Documents Ece AKSU Gökay Burak AKKUŞ.
[ Part III of The XML seminar ] Presenter: Xiaogeng Zhao A Introduction of XQL.
Sept. 27, 2002 ISDB’02 Transforming XPath Queries for Bottom-Up Query Processing Yoshiharu Ishikawa Takaaki Nagai Hiroyuki Kitagawa University of Tsukuba.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Date: 2013/4/1 Author: Jaime I. Lopez-Veyna, Victor J. Sosa-Sosa, Ivan Lopez-Arevalo Source: KEYS’12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang KESOSD.
Information Retrieval CSE 8337 Spring 2005 Modeling (Part II) Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates.
Holistic Twig Joins Optimal XML Pattern Matching Nicolas Bruno Columbia University Nick Koudas Divesh Srivastava AT&T Labs-Research SIGMOD 2002.
1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.
XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database.
XML Query languages--XPath. Objectives Understand XPath, and be able to use XPath expressions to find fragments of an XML document Understand tree patterns,
1 Ranking Inexact Answers. 2 Ranking Issues When inexact querying is allowed, there may be MANY answers –different answers have a different level of incompleteness.
Information Retrieval in Practice
Querying and Transforming XML Data
Probabilistic Data Management
OrientX: an Integrated, Schema-Based Native XML Database System
Toshiyuki Shimizu (Kyoto University)
Structure and Content Scoring for XML
MCN: A New Semantics Towards Effective XML Keyword Search
Structure and Content Scoring for XML
Information Retrieval and Web Design
Introduction to XML IR XML Group.
CoXML: A Cooperative XML Query Answering System
Presentation transcript:

Querying Structured Text in an XML Database Shurug Al-Khalifa Cong Yu H. V. Jagadish (University of Michigan) Presented by Vedat Güray AFŞAR & Esra KIRBAŞ SIGMOD 2003, June 9-12, 2003, San Diego, CA. 1

Information Retrieval (IR) StyleStandart Database Style Natural language likeEfficient A well-suited query evaluation tecnique XML Databases includes structured text. 2

Example: Find Document components in “articles.xml” that are about “Search Engine”, RELEVANCE TO “Internet” is desirable but not necessary. RELEVANCE SCORING Not exact match, but relevant match ! Relevance is indicated by scores. 3

In the paper; An algebra, called TIX, is presented. -Used for the integration of IR techniques with standart DB query techniques TermJoin algorithm is introduced. -Stack-based algorithm -Used in XML DBs -Scores composite elements in an efficient way Experimental results. 4

If we know the exact schema of the DB, we can specify queries exactly. But this is not the case generally. We may sometimes need to “not exact but relevant” answers. Traditional databases are not good at dealing with fuzziness. 5

Relevance ranking is central in IR but IR is “document-centric” cannot manage XML’s granular structure 6

TIX (Text In XML) algebra, is based on the idea of a “scored tree”. Each operation in this algebra manipulates collections of scored trees. By using this algebra, deciding the granularity of the elements to be returned to the user is possible. The “Pick” operator in TIX, it is possible to select appropriate elements based on some relevance criteria among the others in an XML document. 7

The core of XML processing is “join operation”. Studies in this area showed that “stack-based” join access algorithms are superior to others. TermJoin algorithm is based on this idea. 8

MOTIVATING EXAMPLES Query 1: simple IR-style query Find document components in articles.xml that are about ‘‘search engine’’. Relevance to ‘‘internet’’ and ‘‘information retrieval’’ is desirable but not necessary. Query 2: structured IR-style query Find document components in articles.xml that are part of an article written by an author with last name ‘‘Doe’’ and are about ‘‘search engine’’. Relevance to ‘‘internet’’ and ‘‘information retrieval’’ is desirable but not necessary.

TIX ALGEBRA XML data is modeled by using “ordered labeled trees”. nodes  elements edges  inclusion relationships To model IR-style processing,“scored trees” are introduced. Selection, Projection and, Join operations can be defined on scored trees. THRESHOLD and PICK operations  to manipulate IR- style processing 10

SCORED TREES 11

Definition 1: Scored Data Tree: A scored data tree is a rooted ordered tree, such that each node carries data in the form of a set of attribute- value pairs, including at least a tag (indicating the type of the node) and a real number valued score (indicating the score of the node). The score of the tree is the score of the root node. 12

Definition 2: Scored Pattern Tree: A scored pattern tree is a triple P = (T, F, S), where T = (V,E) is a node-labeled and edge- labeled tree such that; Each node in V has a distinct integer as its label. Each edge is labeled pc (for parent child relationship), ad (for ancestor descendant relationship), or ad* (for self-or-descendant relationship). F is a formula of boolean combination of predicates applicable to nodes. S is a set of scoring functions specifying how to calculate the scores of each node (and their ancestors) being applied with an IR-style predicate, and the scores of each IR-style join condition match. 13

14 F: ($1.tag = article) & ($2.tag = author) & ($3.tag = sname) & ($3.content = "Doe") S: $4.score = {ScoreFoo({"search engine"},{"internet", "information retrieval"}) } $1.score = $4.score Find document components in articles.xml that are part of an article written by an author with last name ‘‘Doe’’ and are about ‘‘search engine’’. Relevance to ‘‘internet’’ and ‘‘information retrieval’’ is desirable but not necessary.

15 Article Article or any other descendant Author SurName

Primary IR node: if relevance finding applied to a node Secondary IR node: 1- if a node has primary IR nodes in its subtree 2- if a scoring function is defined based on the scores of other IR nodes Query IR node: IR nodes in the scored pattern tree Data IR node: IR nodes in the scored data tree 16

17 Primary IR node Secondary IR node not IR nodes !

18 A collection of data trees (C) A collection of scored data trees A scored pattern tree ( p ) SCORED SELECTION

19

A collection of data trees (C) 20 A collection of scored data trees A scored pattern tree ( p ) SCORED PROJECTION A projection list ( PL )

19 Scored Projection with PL = {$1, $3, $4}

A collection of data trees (C) 22 A collection of scored data trees A scored pattern tree ( p ) SCORED JOIN Another collection of data trees (C2)

23 IR-style join query Find relevant document components in articles.xml as specified in Query above. For articles containing such components, find reviews from reviews.xml for articles with similar titles. F: ($2.tag = article) & ($3.tag = article−title) & ($4.tag = author) & ($5.tag = sname) & ($5.content = "Doe") & ($7.tag = review) & ($8.title = title) & ($1.tag = tix_prod_root) S: $6.score = ScoreFoo({"search engine"}, {"internet", "information retrieval"}) $2.score = $6.score $joinScore = ScoreSim($3.content, $8.content) $1.score = ScoreBar($joinScore, $6.score)

24 Result tree for the query

25 THRESHOLD OPERATOR A collection of data trees (C) A collection of scored data trees A scored pattern tree ( p ) Threshold condition (TC) TC may be a real or an integer value. TC may be defined on one or more query IR-nodes. A SELECTION operation with a threshold value.

26 PICK OPERATOR A collection of data trees (C) A collection of scored data trees A scored pattern tree ( p ) Pick Criterion (PC) Very much similar to PROJECTION operation Pick cirterion may be defined on one or more query IR-nodes. Eleminates the redundant nodes.

27 EXTENSION of XQuery LANGUAGE For $a := document(‘‘articles.xml’’)// article[/author/sname/text()=‘‘Doe’’]/ descendant-or-self::* Score $a using ScoreFoo($a, {‘‘search engine’’}, {‘‘internet’’, ‘‘information retrieval’’}) Pick $a using PickFoo($a)) Return { $a } Sortby(score) Threshold ≥ 4 stop after 5 An example Query:

28 EXTENSION of XQuery LANGUAGE define function ScoreFoo(element* $a, term* A, term* B) return void { For $x in $a For each term α in A i += count(α, $a/alltext()) × 0.8 For each term β in B i += count(β, $a/alltext()) × 0.6 = i ScoreFoo Function:

Access Methods - We have identified a number of requirements for efective information retrieval from XML databases yet. - Now, we are going to study the effective evaluation of operators with IR-style functionality within the context of a set-oriented, pipelined, database-style query evaluation engine. The primary issue is how to manipulate scores – how to generate them, how to propagate and consolidate them.

Score Generating Methods Scores are generated early in a query evaluation plan, upon initial data or index access. An index look-up for an individual indexed term would at the very least return identifiers of XML elements in which this term occurs. Most queries may specify more than one term for IR relevance scoring.

Score Generating Methods(cont’d) - There are significant opportunities to do better, by developing new access methods for operator combinations that are likely to occur frequently. We have identified two such access methods: TermJoin PhraseFinder

Score-Modifying Methods A U w1,w2 B = {(x, s) | x Є¸ A U B Λ s = f(w1, sA, w2, sB)} A and B are the non-scored versions of input sets A and B. s is a score that gets assigned to an output tree x. w1 and w2 are weights each corresponds to one of the two input sets of witness trees. The function f that calculates s can be a weighted addition of the two scores, sA and sB associated with the input trees that composed x.

Score-Utilizing Methods Yet another important set of access methods are the ones that utilize the scores. Two such examples are the ones for evaluating Threshold and Pick. As discussed before, those two new TIX operators differ from existing operators in that their evaluation requires information not just local to the current node but also query-wide. For Threshold, the information possibly required (e.g., global ranking) can be efficiently generated from the input itself by employing techniques. The Pick operator poses much of a challenge.

Experimental Evaluation Experiments were run on a 1.8 GHz PC compatible machine with 256 MB of RAM running WindowsXP. It comprises 18 million XML elements with a total size of 500 MB. After loading into our database, the total size grew to 5 GB.

Term Join Evaluation To evaluate the TermJoin algorithm, we compared it to two other techniques: Composite and Meet. We also implemented and evaluated a variant we called Enhanced TermJoin.

Composite we showed how to express TermJoin in terms of standard operators. This operator expression can be evaluated directly to obtain a baseline implementation that is a composite of standard operators, we refer to this as Comp1 in the tables. pushing structural joins further down in the evaluation plan gives good performance. Therefore, we also implemented this technique and referred to it as Comp2 in the tables.

Generalized Meet Meet, to find the lowest common ancestor for a given set of elements (with term occurrences). We are interested in all common ancestors, which are easily obtained by traversing up the ancestor chain from the lowest common ancestor. We are also interested in ancestors that contain only some of the query terms but not all (with an appropriately lower score, of course). These are produced as intermediate results in the algorithm, and can be output. With these changes, the algorithm can be adapted to compute a TermJoin. We call the resulting algorithm Generalized Meet. It recursively obtains the ancestors of the text node containing any of the terms and output them along with the term occurrences after grouping based on node id. The term occurrences are then used for calculating the score of the node.

Enhanced TermJoin This is a slightly modi.ed algorithm from the original TermJoin. It uses an index structure to get a parent of a given node. Along with the parent information, the number of children of this parent is returned. In the original algorithm, a data access to the database is performed and some navigation is needed to get the number of children. Therefore, we expect savings when using this index structure.

Scoring Functions We use two different scoring functions: The simple scoring function is a weighted sum of the occurrences of each term under a given ancestor. The Complex scoring function examines the term distribution among child nodes. It assigns higher scores to nodes where the distances between terms are smaller.

Performance of different techniques

28 CONCLUSIONS TIX Algebra  integrates IR-style query processing into traditional piplined query evaluation relevance scoring granular structure Pick operator  user specified redundancy elemination TermJoin & PhraseFinder algoritms Score generation by stack based approach improve the performance 2-4 orders of magnitude