1 IDAR 2007 Emiran Curtmola A Platform for Efficient Full-Text SEARCH on the Web.

Presentation transcript:

1 IDAR 2007 Emiran Curtmola A Platform for Efficient Full-Text SEARCH on the Web

2 Search Semi-structured Data (XML)
- Growing amount of XML data available for processing and exchange
- Need for text predicates that go beyond simple keyword search
- Existing applications need to query both the structure and the text of documents
- Full-Text (FT) queries
  - query structure + text
  - complex, composable predicates on the words in the text
  - window, distance, order, times, etc.

3 A Typical Scenario
- E.g., web service discovery in P2P or Grid
- Web services are typically described using XML (e.g., the WSDL standard)
- Autonomous service providers use non-uniform descriptions, with variable structure and text comments
- Query: “find web services providing info on a possible tsunami in Asia (within 10 words)”

4 Existing Approaches: DB & IR
- DB community
  - data centric (structure)
  - languages
  - efficient evaluation
  - XPath 2.0, XQuery 1.0, XSLT 1.0
- Information Retrieval (IR) community
  - document centric (text)
  - indices
  - ranking methods
  - Yahoo!, Google, XXL, JuruXML, Elixir, etc.
- [Figure: example XML document tree with element labels doc, newspapers, newspaper, newspaper-name, breaking news, overview, entertainment, sailing clubs, museums, …, sightseeing, and text leaves]

5 Query Languages for Structure + Text
- Challenge: a variety of competing proposals for querying XML on structure + text [BAS-06], with
  - variable expressive power
  - scoring methods
  - often fuzzy semantics
- Front-runner language: XQuery Full-Text (XQFT)
  - Proposed by a W3C task force
  - currently going to Last Call until June 22, 2007
  - could become a W3C Recommendation as early as 2008
  - Subsumes the expressivity of most of the proposed FT languages
  - Reference implementation: GalaTex [Curtmola et al. XIME-P 2005]
- Query in XQFT: doc/newspapers/newspaper/breaking_news[.//* ftcontains “tsunami” and “Asia” window <=10 words]/overview

6 Need to Optimize FT Queries
- Prior to our project, there was no work on FT query optimization; efficient evaluation was limited to
  - Conjunctive keyword search (no predicates)
  - Full-text predicates in isolation
- Need for efficient evaluation of FT queries
  - universal formal techniques to optimize

7 Outline
- Efficient evaluation of full-text queries
  - Query optimization
  - Impact of scoring methods on optimizations
- Query distributed data
- Summary and future work

8 A Novel Universal Optimization Framework
- XQFT semantics in the W3C proposal is given in a functional-language style
  - no apparent connection to (relational) database query languages
- We provide an alternative (yet equivalent) semantics captured by
  - Formalization of XML full-text languages in terms of
    - keyword patterns
    - pattern matches
    - predicates evaluated through matches
  - XFT algebra
    - matches are treated as relational tuples

9 XFT Algebra
- Example: query in XQFT: .//* ftcontains “tsunami” and “Asia” window <=10 words
- Plan steps:
  - all occurrences (matches) of “tsunami”
  - all occurrences (matches) of “Asia”
  - common ancestors of match pairs
  - keep only ancestors of close matches
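The four plan steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the XFT engine: each node is a flat list of words instead of an XML subtree, a window of N words is read as "the span covering both matches is at most N words", and the names `matches`, `ftcontains_window`, and the sample nodes are hypothetical.

```python
from itertools import product

def matches(tokens, keyword):
    """All positions (matches) of a keyword in a node's word list."""
    return [i for i, tok in enumerate(tokens) if tok.lower() == keyword.lower()]

def ftcontains_window(nodes, kw1, kw2, window):
    """Ids of nodes holding a (kw1, kw2) match pair that fits in `window` words."""
    result = []
    for node_id, tokens in nodes.items():
        # Join: pair every match of kw1 with every match of kw2 in the same node.
        pairs = product(matches(tokens, kw1), matches(tokens, kw2))
        # Selection: keep the node only if some pair is close enough.
        if any(abs(p1 - p2) + 1 <= window for p1, p2 in pairs):
            result.append(node_id)
    return result

# Hypothetical sample data: two "overview" nodes.
nodes = {
    "overview1": "a tsunami warning was issued for south Asia today".split(),
    "overview2": ("tsunami " + "filler " * 20 + "Asia").split(),
}
print(ftcontains_window(nodes, "tsunami", "Asia", 10))  # ['overview1']
```

Treating matches as tuples is what lets the window predicate act as an ordinary selection over a join result.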

10 Benefits of the Optimization Framework [Amer-Yahia et al. SIGMOD 2006]
- Enables leveraging tried-and-true relational-style evaluation & optimization techniques, including
  - Join re-ordering
  - Pushing selection predicates into joins
- Concise & clean formal semantics for all FT languages by translation to the XFT algebra
  - one-size-fits-all optimization for all FT languages
- Efficient algorithms for operator evaluation through a novel and successful marriage of IR & DB
- Measured speedup of at least two orders of magnitude over two reference XQFT engines

11 Outline
- Efficient evaluation of full-text queries
  - Query optimization
  - Impact of scoring methods on optimizations
- Query distributed data
- Summary and future work

12 Integrate with Universal Scoring
- Until now, scoring has been well understood on text only
- Challenge: score structure + text
  - Non-trivial
  - Many scoring proposals; sometimes hardcoded in the algorithm
- Extend the universal optimization framework to accommodate universal scoring

13 Requirements for Extending with Scores
- Documents carry “scores”
  - relevance of the documents matching the query
- XFT algebraic operators manipulate scores
- Requirements
  - Generic functions, not a particular scoring function
    - no scoring method is better than the others
  - Avoid re-computing scores: the score of a node can be derived solely from the scores of its descendants

14 Preliminary Results: Scoring Scheme
- Parameterized scoring scheme
  - scoreK(k, pos, n) = score keyword k at position pos in node n
  - scoreM(p, m) = score a match m with pattern p
    - aggregate scores from subpatterns of a pattern for the same node
  - scoreS(SM(n, p)) = score a set of matches SM corresponding to node n and pattern p
    - aggregate scores from children to parent
- The score of a node depends on scoring its set of matches (scoreS)
  - scoreK is used in scoring a match
  - scoreM is used in scoring a set of matches
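One way to read the parameterized scheme is as three pluggable functions. The sketch below is a guess at the shape, not the authors' implementation: the class `ScoringScheme`, the binary form of `score_m` (suggested by the nested scoreM calls in the deck's example), and the choice of sum and max as sample functions are all assumptions; only the keyword scores 10, 15, and 2 come from the slides.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, List

@dataclass
class ScoringScheme:
    score_k: Callable[[str, int, str], float]  # keyword, position, node -> score
    score_m: Callable[[float, float], float]   # combines two sub-pattern scores
    score_s: Callable[[List[float]], float]    # aggregates a set of match scores

    def match_score(self, pattern, match, node):
        """Score one match by folding score_m over its keyword scores."""
        keyword_scores = [self.score_k(k, pos, node) for k, pos in zip(pattern, match)]
        return reduce(self.score_m, keyword_scores)

    def node_score(self, pattern, match_set, node):
        """Score a node from the scores of all its matches."""
        return self.score_s([self.match_score(pattern, m, node) for m in match_set])

# Keyword scores taken from the deck's example; the lookup table is hypothetical.
TABLE = {("tsunami", 2): 10.0, ("Asia", 5): 15.0, ("danger", 40): 2.0}

scheme = ScoringScheme(
    score_k=lambda k, pos, node: TABLE[(k, pos)],
    score_m=lambda a, b: a + b,  # assumed combiner: sum
    score_s=max,                 # assumed aggregator: best match wins
)
print(scheme.node_score(("tsunami", "Asia", "danger"), [(2, 5, 40)], "node1"))  # 27.0
```

Swapping `score_m` or `score_s` changes the scoring method without touching the evaluation code, which is the point of keeping the scheme generic.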

15 Example: Using the Scoring Scheme
- Query: “tsunami” and “Asia” and “danger”
- Keyword scores in node1:
  - “tsunami” = scoreK(tsunami, 2, node1) = 10
  - “Asia” = scoreK(Asia, 5, node1) = 15
  - “danger” = scoreK(danger, 40, node1) = 2
- match (2, 5) for pattern (“tsunami”, “Asia”) = scoreM(10, 15)
- match (2, 5, 40) for pattern (“tsunami”, “Asia”, “danger”) = scoreM(scoreM(10, 15), 2)

16 Impact of Scores on Optimizations
- Challenge
  - Scoring breaks the expected relational “equivalent” query plans
  - scoring intermediate nodes might generate different score values

17 Pitfall: Scoring Breaks Equivalence
- Query: “tsunami” and “Asia” and “danger”
- Keyword scores: tsunami = 10, Asia = 15, danger = 2
- Plan 1 combines tsunami and Asia first: scoreM(10, 15), then scoreM(scoreM(10, 15), 2)
- Plan 2 combines danger and Asia first: scoreM(2, 15), then scoreM(scoreM(2, 15), 10)
- Different values if scoreM is the pairwise average function
  - There are functions that break the relational equivalence
- Need
  - Consistent scoring: same scores for equivalent plans
  - Consistent ranking: same ranks for equivalent plans
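Working the two plans through with the pairwise average makes the discrepancy concrete. The keyword scores are the ones from the slides; the function names and the comparison with a sum are added here for illustration.

```python
def avg(a, b):
    """Pairwise average, the scoreM named on the slide as a problem case."""
    return (a + b) / 2

def total(a, b):
    """A sum, used here only as a contrasting scoreM."""
    return a + b

tsunami, asia, danger = 10, 15, 2

# Plan 1: (tsunami, Asia) first, then danger.
plan1_avg = avg(avg(tsunami, asia), danger)
# Plan 2: (danger, Asia) first, then tsunami.
plan2_avg = avg(avg(danger, asia), tsunami)
print(plan1_avg, plan2_avg)  # 7.25 9.25

# With a sum, both join orders give the same score.
print(total(total(tsunami, asia), danger), total(total(danger, asia), tsunami))  # 27 27
```

The average is commutative but not associative, so reordering the joins changes the score of the same answer.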

18 Ongoing Work
- What are the properties of the scoring scheme such that the rewriting rule(s) hold?
- [Diagram: equivalent rewriting rules (RW) on one side; the scoring scheme (scoreK, scoreM, scoreS), each with properties to be determined, on the other]
- E.g., join reordering requires associative, commutative scoring functions
- E.g., top-K requires monotonicity
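The properties named here can be probed mechanically for a candidate scoreM. The sketch below is a brute-force check over a handful of sample scores, so it can refute a property but never prove it; the sample values, helper names, and candidate functions are assumptions made for illustration.

```python
from itertools import product

SAMPLES = [0, 1, 2, 10, 15]  # arbitrary test scores (assumption)

def close(x, y):
    return abs(x - y) < 1e-9

def is_commutative(f, samples=SAMPLES):
    return all(close(f(a, b), f(b, a)) for a, b in product(samples, repeat=2))

def is_associative(f, samples=SAMPLES):
    return all(close(f(f(a, b), c), f(a, f(b, c)))
               for a, b, c in product(samples, repeat=3))

def is_monotone(f, samples=SAMPLES):
    # Raising the first input must never lower the combined score;
    # for a commutative f this covers the second input as well.
    return all(f(a, c) <= f(b, c)
               for a, b, c in product(samples, repeat=3) if a <= b)

candidates = {
    "sum": lambda a, b: a + b,
    "max": max,
    "average": lambda a, b: (a + b) / 2,
}
for name, f in candidates.items():
    print(name, is_commutative(f), is_associative(f), is_monotone(f))
```

On these samples, sum and max pass all three checks, while the average fails associativity, which matches the pitfall shown for join reordering.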

19 Ongoing Work
- What rewriting rules hold under a particular scoring scheme?
- What are the properties of the scoring scheme such that the rewriting rule(s) hold?
- [Diagram: the two questions side by side, relating equivalent rewriting rules (RW) to a scoring scheme (scoreK, scoreM, scoreS) and its properties]
- Catalog all existing scoring methods for structure and text w.r.t. their compatibility with rewriting optimizations
  - Can we capture them in our framework?
  - E.g., the vector space model is consistent scoring for the relational-style rewritings

20 Ongoing Work
- Smart, configurable optimizer
  - Plug in a particular scoring scheme at run time
  - Is it consistent scoring / ranking? (are the rewritings sound?)
    - If yes, use the rewritings
    - If not, identify and disable all unsound rewritings
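A configurable optimizer of this kind could be as simple as a table from rewriting rules to the scoring properties they need. The sketch is hypothetical: the rule names and the `enabled_rewritings` helper are invented for illustration, and the two requirement entries are the only ones the deck states (join reordering needs associative and commutative functions, top-K needs monotonicity).

```python
# Properties each rewriting needs from the scoring functions.
REQUIRED = {
    "join_reordering": {"associative", "commutative"},
    "top_k_pruning": {"monotone"},
}

def enabled_rewritings(scheme_properties):
    """Rewritings that stay sound for a scheme with the given properties."""
    return sorted(rule for rule, needs in REQUIRED.items()
                  if needs <= set(scheme_properties))

# A sum-like scheme keeps every rewriting.
print(enabled_rewritings({"associative", "commutative", "monotone"}))
# → ['join_reordering', 'top_k_pruning']

# A pairwise-average scheme is not associative, so join reordering is disabled.
print(enabled_rewritings({"commutative", "monotone"}))
# → ['top_k_pruning']
```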

21 Outline
- Efficient evaluation of full-text queries
  - Query optimization
  - Impact of scoring methods on optimizations
- Distributed access methods
- Summary and future work

22 Query on Distributed Data
- Move from searching individual sources to highly distributed sources
- Challenges
  - Consumers and producers: many, dynamic
    - completely decentralized
  - Users unaware of data location
    - completely distributed data
- Our goal: efficient distributed computation
  - data discovery, evaluation, ranking of FT queries

23 P2P Network with XML Sources
- [Figure: peers, each with a local XML store, connected by network links; example queries Query1: (tsunami, Asia) and Query2: (concerts, NYC)]
- Each node can
  - produce and store XML data
  - answer queries over its local XML store
  - initiate queries on the actual content of documents
- Efficient and expressive querying of the global XML data?

24 Proposed Architecture
- [Figure: consumer's side and producers' side; peers with local XML stores and the XFT Algebraic Engine]
- Distributed access methods (index)
  - discover the relevant sources
  - answer the keyword/XPath part of the queries
- Locally, post-processing at a node leverages the XFT engine
- Return the answers to the FT query

25 Proposal: Leverage Query Dissemination Trees
- Route queries: move queries, not data
- Peers self-organize in query dissemination trees
  - Every node contains a summary of the XML documents stored in its subtrees
- Use the dissemination trees for query routing
  - Queries are always posed at the root
  - If a node's summary matches the query, then forward the query to its children
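The routing rule can be illustrated with a toy tree. Everything below is an assumption made for the sketch: summaries are plain keyword sets covering a peer's own store plus its subtrees, queries are conjunctive keyword sets, and the `Peer` class, `route` function, and peer names are hypothetical.

```python
class Peer:
    def __init__(self, name, local_keywords, children=()):
        self.name = name
        self.local = set(local_keywords)
        self.children = list(children)
        # Summary: keywords in this peer's store and in all of its subtrees.
        self.summary = set(self.local)
        for child in self.children:
            self.summary |= child.summary

def route(peer, query, visited):
    """Forward a conjunctive keyword query down the tree; return matching peers."""
    visited.append(peer.name)
    if not query <= peer.summary:
        return []  # summary does not match: stop, nothing below can answer
    hits = [peer.name] if query <= peer.local else []
    for child in peer.children:
        hits += route(child, query, visited)
    return hits

p1 = Peer("p1", {"tsunami", "asia"})
p2 = Peer("p2", {"concerts", "nyc"})
p3 = Peer("p3", {"museums"})
p4 = Peer("p4", {"sailing"}, [p2, p3])
root = Peer("root", set(), [p1, p4])

visited = []
print(route(root, {"tsunami", "asia"}, visited))  # ['p1']
print(visited)                                    # ['root', 'p1', 'p4']
```

The query stops at p4, so p2 and p3 are never contacted. A keyword-set summary can still give false positives for conjunctive queries (the keywords may sit in different documents), which the local evaluation then filters out.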

26 Define the Design Space
- 1 tree per keyword: less congestion, more control overhead
- 1 tree for all keywords: more congestion, less control overhead
- … but the overall throughput depends on the slowest node
- Challenge: relieve the traffic congestion

27 The Design Space To Explore
- Optimal solution lies between the extremes
- Proposal
  - Partition the set of keywords into blocks
  - Build one tree per keyword block
    - connect all keywords from the same block into one tree
- [Figure: partitioning the data space, from 1 tree per keyword to 1 tree for all keywords, with the optimal solution somewhere in between]
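The deck does not say how keywords are assigned to blocks. As a placeholder, the sketch below assumes a simple hash partition: `block_of`, `partition`, the use of CRC32, and the sample vocabulary are illustrative choices, not the authors' method.

```python
import zlib

def block_of(keyword: str, num_trees: int) -> int:
    """Map a keyword to one of num_trees blocks (one dissemination tree per block)."""
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(keyword.lower().encode("utf-8")) % num_trees

def partition(keywords, num_trees):
    """Group keywords into num_trees blocks."""
    blocks = [[] for _ in range(num_trees)]
    for kw in keywords:
        blocks[block_of(kw, num_trees)].append(kw)
    return blocks

vocabulary = ["tsunami", "Asia", "danger", "concerts", "NYC", "museums"]
print(partition(vocabulary, 1))  # one extreme: a single tree for all keywords
print(partition(vocabulary, 3))  # in between: three keyword blocks, three trees
```

Varying `num_trees` moves along the design space between the two extremes; a smarter partition could also take keyword popularity into account.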

28 Forces at Cross-purposes
- [Figure: congestion and control traffic plotted against the number of trees, between the extremes of 1 tree per keyword (less congestion, more control overhead) and 1 tree for all keywords (more congestion, less control overhead)]
- Tradeoff: congestion vs. control traffic
- Optimization problem: find the minimum number of trees that relieves congestion (improves the overall throughput), bringing the peak-to-average load within an approximation ε (acceptable ε = 20%)

29 Preliminary Results: Load Balancing
- Requirement
  - a node that appears high in one tree will appear in lower levels in all the other trees
  - guarantee a node appears on different tree levels in each tree
- Load is balanced when each node has been in the top levels at most once
- Our approach: circular permutation of the internal nodes among the different trees
  - peak load decreases drastically
  - peak-to-average processing load is within 15%
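The idea of the circular permutation can be sketched with a toy model. The assumptions are substantial and labeled as such: each tree is represented only by an ordering of the peers, the first `top_size` peers of an ordering form that tree's top levels, and "load" is just the number of trees in which a peer sits in the top levels. The function names are hypothetical, and the 15% figure from the slide is not reproduced by this model.

```python
from collections import deque

def build_orders(peers, num_trees, top_size, rotate=True):
    """One peer ordering per tree; with rotate=True each tree shifts the ring."""
    ring = deque(peers)
    orders = []
    for _ in range(num_trees):
        orders.append(list(ring))
        if rotate:
            ring.rotate(-top_size)  # next tree gets a fresh set of top-level peers
    return orders

def top_level_counts(orders, top_size):
    """How many trees place each peer in their top levels."""
    counts = {peer: 0 for peer in orders[0]}
    for order in orders:
        for peer in order[:top_size]:
            counts[peer] += 1
    return counts

def peak_to_average(loads):
    loads = list(loads)
    return max(loads) / (sum(loads) / len(loads))

peers = [f"p{i}" for i in range(12)]
same = top_level_counts(build_orders(peers, 4, 3, rotate=False), 3)
rotated = top_level_counts(build_orders(peers, 4, 3, rotate=True), 3)
print(peak_to_average(same.values()))     # 4.0
print(peak_to_average(rotated.values()))  # 1.0
```

With 12 peers, 4 trees, and 3 top-level slots per tree, the rotation puts every peer in the top levels exactly once, whereas reusing the same ordering loads the same 3 peers in all 4 trees.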

30 Future Directions
- For conjunctive query routing
  - Query selectivity estimation
- Scoring in distributed systems
  - E.g., IDF is inherently global
- Need an analytical cost model to better understand parameters for XML access methods in the design space

31 Summary
- A formalized approach to full-text queries for large-scale systems
- Efficiency
  - Relational-style optimizations of XFT algebraic plans
- Universal scoring
  - properties of scoring functions for scoring consistency
- Distributed computation
- Prototype (under construction)

32 Thank You!