Text Search over XML Documents Jayavel Shanmugasundaram Cornell University.

The HTML World XML and Information Retrieval: A SIGIR 2000 Workshop The workshop was held on 28 July The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer XQL and Proximal Nodes The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. We consider the recently proposed language … The paper references the following papers: … …

The XML World XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel, Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language … Searching on structured text is becoming more important with XML … The XQL language … … …

Key Aspect of XML
Captures text and structure
Applications
– Digital libraries
– Content management
Many such XML repositories already available
– IEEE INEX collection
– Library of Congress documents
– Shakespeare’s plays
– SIGMOD, DBLP, …

Searching XML Repositories
Confluence of Information Retrieval (text) and Database (structure) techniques
A spectrum of possibilities
– “Pure” Keyword Search
– Full-Text + DB Queries
– Keyword Search in Context

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries Related Work and Conclusion

Keyword Search over HTML Query Keywords Ranked Results Hyperlinked HTML Documents

Keyword Search over XML [Guo, Shao, Botev, Shanmugasundaram, SIGMOD 2003] Query Keywords Ranked Results Mix of Hyperlinked XML and HTML Documents

Outline Pure Keyword Search –Design Principles –Indexing and Query Processing Keyword Search in Context Full-Text + DB Queries Conclusion

XML Document XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel, Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language … Searching on structured text is becoming more important with XML … The XQL language … … …

Design Principles
1) Return most specific element containing the query keywords
2) Ranking has to be done at the granularity of elements
3) Generalize HTML keyword search

Outline Pure Keyword Search –Design Principles –Indexing and Query Processing Keyword Search in Context Full-Text + DB Queries Conclusion

System Architecture
[Figure: components XML/HTML Documents, ElemRank Computation, XML Elements with ElemRanks, Hybrid Dewey Inverted List, Query Evaluator (data access to the inverted list), Keyword query, Ranked Results]
Query Evaluator: compute top-k query results as per definition of ranking

Naïve Method
Naïve inverted lists:
– Ricardo: 1; 5; 6; 8
– XQL: 1; 5; 6; 7
Problems:
1. Space overhead
2. Spurious results
Main issue: decouples representation of ancestors and descendants
[Figure: element tree of the example document with numbered nodes; text nodes include "28 July …", "XML and …", "David Carmel …", "XQL and …", "Ricardo …"]

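Dewey IDs [1870s]
Each element's ID is its parent's ID extended by the element's position among its siblings, so an ancestor's ID is always a prefix of its descendants' IDs
[Figure: the example element tree with every node labeled by its Dewey ID; text nodes include "28 July …", "XML and …", "David Carmel …", "XQL and …", "Ricardo …"]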
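To make the prefix property concrete, here is a minimal Python sketch. The helper names (`parse_dewey`, `is_ancestor_or_self`, `longest_common_prefix`) and the sample IDs are illustrative assumptions, not part of the talk or of XRANK's code.

```python
from typing import Tuple

# A Dewey ID is modelled as a tuple of integers, e.g. "5.0.3.0" -> (5, 0, 3, 0).
DeweyId = Tuple[int, ...]


def parse_dewey(text: str) -> DeweyId:
    """Parse a dotted Dewey ID string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def format_dewey(dewey: DeweyId) -> str:
    """Render a Dewey ID tuple back into dotted form."""
    return ".".join(str(component) for component in dewey)


def is_ancestor_or_self(a: DeweyId, d: DeweyId) -> bool:
    """a is an ancestor of d (or d itself) exactly when a is a prefix of d."""
    return d[:len(a)] == a


def longest_common_prefix(x: DeweyId, y: DeweyId) -> DeweyId:
    """ID of the deepest element that contains both x and y."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return x[:n]


# Hypothetical IDs of two sibling elements.
a = parse_dewey("5.0.3.0.0")
b = parse_dewey("5.0.3.0.1")
print(format_dewey(longest_common_prefix(a, b)))  # prints 5.0.3.0
```

Because Python compares tuples lexicographically, sorting such tuples yields document order, with every ancestor placed before its descendants.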

Dewey Inverted List (DIL)
[Figure: one inverted list per keyword (XQL, Ricardo); each entry holds Dewey Id, ElemRank, Position List; entries sorted by Dewey Id]
Store IDs of elements that directly contain keyword
– Avoids space overhead

DIL: Query Processing
Merge query keyword inverted lists in Dewey ID order
– Entries with common prefixes are processed together
Compute longest common prefix of Dewey IDs during the merge
– Longest common prefix ensures most specific results
– Also suppresses spurious results
Keep top-k results seen so far in output heap
– Output contents of output heap after scanning inverted lists
Algorithm works in a single scan over inverted lists
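The merge described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published XRANK algorithm: position lists are omitted, the `decay` factor and the "sum of best per-keyword ranks" score are made-up stand-ins for the real ranking function, and the function name `dil_search` and the sample postings are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

DeweyId = Tuple[int, ...]
Posting = Tuple[DeweyId, float]  # (Dewey ID, ElemRank); position lists omitted


def dil_search(lists: Dict[str, List[Posting]], k: int,
               decay: float = 0.8) -> List[Tuple[float, DeweyId]]:
    """Single scan over per-keyword lists that are each sorted by Dewey ID."""
    keywords = list(lists)
    # Tag each posting with its keyword, then merge all lists in Dewey ID order.
    streams = [[(dewey, kw, rank) for dewey, rank in lists[kw]] for kw in keywords]
    merged = heapq.merge(*streams)

    path: List[int] = []               # Dewey components of the current element
    best: List[Dict[str, float]] = []  # best[i]: keyword -> best rank under path[:i+1]
    top: List[Tuple[float, DeweyId]] = []  # min-heap holding the k best results

    def pop() -> None:
        found = best.pop()
        dewey = tuple(path)
        path.pop()
        if len(found) == len(keywords):
            # Most specific element containing every keyword: report it and
            # do not pass its keywords up, which suppresses spurious ancestors.
            heapq.heappush(top, (sum(found.values()), dewey))
            if len(top) > k:
                heapq.heappop(top)
        elif best:
            # Otherwise hand the (decayed) ranks to the parent element.
            parent = best[-1]
            for kw, rank in found.items():
                parent[kw] = max(parent.get(kw, 0.0), rank * decay)

    for dewey, kw, rank in merged:
        # Longest common prefix between the current path and the next entry.
        common = 0
        while common < len(path) and common < len(dewey) and path[common] == dewey[common]:
            common += 1
        while len(path) > common:
            pop()
        for component in dewey[common:]:
            path.append(component)
            best.append({})
        best[-1][kw] = max(best[-1].get(kw, 0.0), rank)
    while path:
        pop()
    return sorted(top, reverse=True)


# Made-up postings: the two keywords meet only under element 5.0.3.0.
lists = {
    "xql": [((5, 0, 3, 0, 0), 0.9), ((6, 0, 3, 8, 3), 0.4)],
    "ricardo": [((5, 0, 3, 0, 1), 0.8), ((8, 2, 1, 4, 2), 0.3)],
}
results = dil_search(lists, k=10)
```

With these postings the only result is element 5.0.3.0, the deepest element containing both keywords; its ancestors 5.0.3, 5.0 and 5 are not reported because the keywords are consumed at the most specific level.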

Ranked Dewey Inverted List (RDIL)
For each keyword (e.g. XQL, Ricardo):
– Inverted list sorted by ElemRank
– B+-tree on Dewey Id

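RDIL: Query Processing
[Figure: the Ricardo and XQL inverted lists, each with a B+-tree on Dewey Id, feeding a temp heap and an output heap. With entry P read from one list: threshold = ElemRank(P) + Max-ElemRank. With entry R also read from the other list: threshold = ElemRank(P) + ElemRank(R). Example heap entry: Rank(9.0.4).]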

Motivation for DIL/RDIL Hybrid
Correlation of query keywords: probability that the query keywords occur in the same element
– High correlation: RDIL likely to outperform DIL by stopping early
– Low correlation: DIL likely to outperform RDIL because RDIL has to scan most of (or the entire) inverted list
Dilemma
– Neither DIL nor RDIL wins on every query: each outperforms the other depending on keyword correlation
– But they require inverted lists to be sorted in different orders
Challenges
– Get benefits of DIL and RDIL without doubling space?
– How can keyword correlation be determined?

Hybrid Dewey Inverted List (HDIL)
[Figure: for keyword XQL, a full inverted list sorted by Dewey Id with a B+-tree on Dewey Id, plus a short list sorted by ElemRank]
RDIL is better only when it scans little of the inverted list
– Short list sorted by ElemRank: saves space!
Can reuse full inverted list as leaf level of B+-tree
– Saves space!

DBLP: High Correlation Keywords

DBLP: Low Correlation Keywords

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries Related Work and Conclusion

Shakespeare's Plays
[Figure: combined collection of INEX IEEE, SIGMOD Record, ..., and Shakespeare's Plays (<3%)]
Find relevant elements in Shakespeare’s plays about ‘the process of speech’
9 of the top 10 results for one repository were not in the top 10 results of the other repository
– XIRQL’s [Fuhr & Großjohann, SIGIR 2001] TF-IDF scoring

Explaining the Results
TF-IDF scoring for a keyword k:
– TF (Term Frequency): # occurrences of k in element
  Usually normalized by some factor
– IDF (Inverse Document Frequency): (# elements)/(# elements that contain k)
Score = sum of TF*IDF for all query keywords
Main reason for skewed results
– Language of engineers very different from language of Shakespeare!
– ‘process’ common in INEX, ‘speech’ uncommon
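The scoring formula above can be written out as a short Python sketch. It follows the slide's definitions literally (IDF as a plain ratio, without the logarithm many systems apply); normalizing TF by element length is one assumed choice of "some factor", and the element IDs and token lists are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def tf_idf_scores(elements: Dict[str, List[str]], query: List[str]) -> Dict[str, float]:
    """Score every element as the sum of TF * IDF over the query keywords."""
    n = len(elements)
    # df[k]: number of elements that contain keyword k at least once.
    df = {k: sum(1 for tokens in elements.values() if k in tokens) for k in query}
    scores: Dict[str, float] = {}
    for elem_id, tokens in elements.items():
        counts = Counter(tokens)
        score = 0.0
        for k in query:
            if df[k] == 0 or not tokens:
                continue
            tf = counts[k] / len(tokens)  # assumed normalization: element length
            idf = n / df[k]               # the slide's ratio form of IDF
            score += tf * idf
        scores[elem_id] = score
    return scores


# Invented mini-corpus: 'process' occurs in three elements, 'speech' in two.
elements = {
    "e1": ["process", "of", "speech"],
    "e2": ["process", "control"],
    "e3": ["speech"],
    "e4": ["the", "process"],
}
scores = tf_idf_scores(elements, ["process", "speech"])
```

Here 'process' gets IDF 4/3 while 'speech' gets IDF 2, so the rarer keyword dominates the score. Changing which elements count toward the corpus changes both IDFs, which is the skew the slide describes.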

[Figure: combined collection of INEX IEEE, SIGMOD Record, ..., and Shakespeare's Plays (<3%)]
Need a way to efficiently compute IDF (or other corpus scoring statistic) “on-the-fly”

Context-Sensitive Ranking [Botev & Shanmugasundaram, WebDB 2005]
Use Dewey inverted lists + context B+-trees
Two-pass algorithm
– First pass: collect statistics
– Second pass: compute results (entries cached from first pass)
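A rough Python sketch of the two-pass idea follows. It is an illustration under assumptions, not the WebDB 2005 algorithm: a sorted Python list with binary search stands in for a B+-tree, the context is given as a Dewey ID prefix, the statistic collected is the ratio-form IDF, and all names and data are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

DeweyId = Tuple[int, ...]
Posting = Tuple[DeweyId, int]  # (Dewey ID, term frequency)


def context_range(sorted_ids: List[DeweyId], context: DeweyId) -> Tuple[int, int]:
    """Index range of the IDs that have the (non-empty) context as a prefix."""
    lo = bisect_left(sorted_ids, context)
    upper = context[:-1] + (context[-1] + 1,)  # smallest ID after the subtree
    hi = bisect_left(sorted_ids, upper)
    return lo, hi


def search_in_context(all_ids: List[DeweyId],
                      postings: Dict[str, List[Posting]],
                      query: List[str],
                      context: DeweyId) -> List[Tuple[DeweyId, float]]:
    lo, hi = context_range(all_ids, context)
    n = hi - lo  # number of elements inside the context

    # First pass: restrict each inverted list to the context, cache the
    # entries, and derive the per-keyword statistic (document frequency).
    cached: Dict[str, List[Posting]] = {}
    for kw in query:
        plist = postings.get(kw, [])
        a, b = context_range([dewey for dewey, _ in plist], context)
        cached[kw] = plist[a:b]

    # Second pass: score the cached entries with the context-specific IDF.
    scores: Dict[DeweyId, float] = {}
    for kw, entries in cached.items():
        if not entries:
            continue
        idf = n / len(entries)
        for dewey, tf in entries:
            scores[dewey] = scores.get(dewey, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Two made-up repositories: subtree 1 and subtree 2.
all_ids = [(1,), (1, 0), (1, 1), (1, 2), (2,), (2, 0), (2, 1), (2, 2), (2, 3)]
postings = {
    "process": [((1, 0), 1), ((2, 0), 2), ((2, 1), 1), ((2, 2), 3)],
    "speech": [((1, 0), 1), ((1, 1), 2), ((2, 0), 1)],
}
in_first = search_in_context(all_ids, postings, ["process", "speech"], (1,))
in_second = search_in_context(all_ids, postings, ["process", "speech"], (2,))
```

In this toy data 'process' is the rarer keyword inside subtree 1 but the more common one inside subtree 2, so the same query weights the keywords differently depending on the context searched.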

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries Related Work and Conclusion

Motivation
Many new applications require sophisticated DB queries + “complex” full-text search
– Example: Library of Congress documents in XML
Current XML query languages are mostly “database” languages
– Examples: XQuery, XPath
Provide very rudimentary text/IR support
– fn:contains(e, keywords)
No support for complex IR queries
– Distance predicates, stemming, scoring, …

Example Queries
From XQuery Full-Text Use Cases Document
– Find the titles of the books whose body contains the phrases “Usability” and “Web site” in that order, in the same paragraph, using stemming if necessary to match the tokens
– Find the titles of the books published after 1999 whose body contains “Usability” and “testing” within a window of 3 words, and return them in score order

XQuery Full-Text [W3C Working Draft]
– Quark Full-Text Language (Cornell)
– TeXQuery (Cornell, AT&T)
– IBM, Microsoft, Oracle proposals
– XQuery Full-Text (Second Draft)

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries –XQuery Full-Text Overview –Quark Implementation Related Work and Conclusion

XQuery Primer
Find the titles of books:
  //book/title
Find the titles of books with price < 25:
  //book[./price < 25]/title
Find books written by Dawkins, in order of price:
  for $b in //book[./author = 'Dawkins'] order by $b/price return $b
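For readers without an XQuery processor at hand, the three primer queries can be mimicked in Python with the standard library's ElementTree. This is only a stand-in to show what each query selects; the sample document, its `bib` root element and its prices are invented, and the filtering is done in Python because ElementTree's XPath subset has no numeric comparisons.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the book/title/author/price names come from the queries.
DOC = """
<bib>
  <book><title>The Selfish Gene</title><author>Dawkins</author><price>30</price></book>
  <book><title>The Blind Watchmaker</title><author>Dawkins</author><price>20</price></book>
  <book><title>Origin of Species</title><author>Darwin</author><price>15</price></book>
</bib>
"""

root = ET.fromstring(DOC)
books = root.findall(".//book")

# //book/title
all_titles = [book.findtext("title") for book in books]

# //book[./price < 25]/title
cheap_titles = [book.findtext("title") for book in books
                if float(book.findtext("price")) < 25]

# for $b in //book[./author = 'Dawkins'] order by $b/price return $b
dawkins_books = sorted(
    (book for book in books if book.findtext("author") == "Dawkins"),
    key=lambda book: float(book.findtext("price")),
)
dawkins_titles = [book.findtext("title") for book in dawkins_books]
```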

Syntax Overview [Amer-Yahia, Botev, Shanmugasundaram, WWW 2004]
Two new XQuery constructs
1) FTContainsExpr
– Expresses “Boolean” full-text search predicates
– Seamlessly composes with other XQuery expressions
2) FTScoreClause
– Extension to FOR expression
– Can score FTContainsExpr and other expressions

FTContainsExpr
ContextExpr ftcontains FTSelection
– ContextExpr (any XQuery expression) is context spec
– FTSelection is search spec
– Returns true iff at least one node in ContextExpr satisfies the FTSelection
Examples
– //book ftcontains 'Usability' && 'testing' distance 5
– //book[./content ftcontains 'Usability' with stems]/title
– //book ftcontains /article[author='Dawkins']/title
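The Boolean flavour of a conjunctive search with a distance predicate can be illustrated with a small Python sketch. It is not the W3C semantics: tokenization is a simple lowercase word split, there is no stemming, and "distance d" is assumed to mean that the smallest token window covering all words spans at most d positions. The function names are hypothetical.

```python
import re
from typing import List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; a deliberately simple stand-in for real tokenization."""
    return re.findall(r"\w+", text.lower())


def min_cover_span(tokens: List[str], words: List[str]) -> Optional[int]:
    """Smallest (last - first) position span covering every word, or None."""
    wanted = {word.lower() for word in words}
    last_seen = {}
    best = None
    for pos, token in enumerate(tokens):
        if token in wanted:
            last_seen[token] = pos
            if len(last_seen) == len(wanted):
                # Tightest window ending here starts at the oldest last-seen position.
                span = pos - min(last_seen.values())
                if best is None or span < best:
                    best = span
    return best


def ft_contains(text: str, words: List[str], distance: Optional[int] = None) -> bool:
    """True if all words occur, optionally within the assumed distance."""
    span = min_cover_span(tokenize(text), words)
    if span is None:
        return False
    return distance is None or span <= distance


far = "Usability of a Web site is best improved by repeated testing"
near = "Usability testing finds problems early"
far_any = ft_contains(far, ["Usability", "testing"])
far_within_5 = ft_contains(far, ["Usability", "testing"], distance=5)
near_within_5 = ft_contains(near, ["Usability", "testing"], distance=5)
```

The first text contains both words but ten positions apart, so it satisfies the plain conjunction and fails the distance-5 variant; the second text satisfies both.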

FTScore Clause
FOR $v [SCORE $s]? IN Expr ORDER BY … RETURN
Example
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"] ORDER BY $s RETURN $b

FTScore Clause
FOR $v [SCORE $s]? IN Expr ORDER BY … RETURN
Example
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00] ORDER BY $s RETURN $b

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries –XQuery Full-Text Overview –Quark Implementation Related Work and Conclusion

Quark
An open-source C++ implementation of XQuery Full-Text
– Compiles on Linux and Windows
Key features
– Mix of structured and full-text predicates
– Score all of XQuery!
– Full-text search over views

Quark Architecture
[Figure: components XML Documents, Document Loader, File System Storage (Structure Index, Inverted List Index), Query Processing + Scoring, XQuery + XQFT input, Ranked Results output]

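Mix of Structure and Full-Text Queries
Query: /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
[Figure: the Structure Index handles /pub/book[./price < 10.00], the Inverted List Index handles the full-text predicate, and the Dewey IDs they produce are combined into Results]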

Scoring XQuery
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00] ORDER BY $s RETURN $b

Scoring XQuery
Extending XQuery data model (internal)
– Original: Sequence of items
– New: Sequence of scored items
Scoring predicates
– Full-text: IR-style probabilistic scoring
– Structured: Scoring functions
  E.g., a > 1000 (score = 1 when a = infinity)
Scoring XQuery expressions
– Probabilistic combination of scores [Fuhr and Rölleke]
  E.g., Exists is “or” of all input scores
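One way to read the probabilistic combination above is sketched below in Python. The formulas are assumptions chosen for illustration: scores are treated as independent probabilities, "or" is computed as 1 minus the product of complements, "and" as a plain product, and `score_greater_than` is a made-up scoring function that merely has the stated property of approaching 1 as a grows without bound. None of this is taken from Quark's source.

```python
import math
from typing import Iterable


def score_and(scores: Iterable[float]) -> float:
    """Conjunction of independent events: product of the scores."""
    return math.prod(scores)


def score_or(scores: Iterable[float]) -> float:
    """Disjunction of independent events; models Exists over its input scores."""
    return 1.0 - math.prod(1.0 - s for s in scores)


def score_greater_than(a: float, bound: float) -> float:
    """Hypothetical score for the predicate a > bound (bound must be positive).

    It is 0 when the predicate fails and tends to 1 as a goes to infinity.
    """
    if a <= bound:
        return 0.0
    return 1.0 - bound / a


exists_score = score_or([0.5, 0.5])          # 0.75
both_score = score_and([0.5, 0.5])           # 0.25
price_score = score_greater_than(2000, 1000)  # 0.5
```

Under these assumptions an Exists over two inputs scored 0.5 each yields 0.75, higher than either input alone, while requiring both yields 0.25.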

Full-Text Search Over Views
[Figure: Data Source 1 and Data Source 2 combined into an Integrated View]

Outline Pure Keyword Search Keyword Search in Context Full-Text + DB Queries Related Work and Conclusion

Related Work
Semi-structured ranked keyword search
– XIRQL [Fuhr and Großjohann]
– XXL [Theobald and Weikum, 2001]
– Commercial search engines [Luk et al.]
– INEX initiative
Keyword search over databases
– BANKS [Bhalotia et al.]
– DBXplorer [Agrawal et al.]
– DISCOVER [Hristidis et al.]
– LORE [Goldman et al.]

10000 Foot View of Data Management
[Figure: grid with a Data axis (Structured, Unstructured) and a Queries axis (Complex and Structured, Ranked Search), on which Database Systems and Information Retrieval Systems are placed]