Structured-Value Ranking in Update-Intensive Relational Databases
Jayavel Shanmugasundaram, Cornell University
(Joint work with: Lin Guo, Kevin Beyer, Eugene Shekita)

Case Study: Internet Archive

Internet Archive Database

Movies
Mid | Name            | Description
10  | Amateur Film    | … they stand on the golden gate bridge and …
20  | American Thrift | … golden gate bridge with statue of liberty …
…   | …               | …

SELECT * FROM Movies M
ORDER BY score(M.description, “golden gate”)
FETCH TOP 10 RESULTS ONLY

Main Issue
Traditional IR ranking methods would rank the two movies about the same
Example: TF-IDF
– “Golden Gate” appears exactly once in both descriptions
– The lengths of the text fields are about the same
– Hence: same normalized TF-IDF score
Larger issue: traditional IR scoring methods were developed for stand-alone document collections
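The tie described above can be checked with a few lines of Python. This is only a sketch: the two description strings are hypothetical stand-ins of equal length for the elided slide text, and the weighting (term frequency divided by document length, times log(1 + N/df)) is one common TF-IDF variant, not necessarily the one the talk has in mind.

```python
import math

# Hypothetical stand-ins for the two (elided) movie descriptions; both have 9 words.
docs = {
    10: "they stand on the golden gate bridge and wave",
    20: "golden gate bridge with statue of liberty in view",
}

def tf_idf(query, doc_id):
    """Length-normalized TF-IDF score of one document for a keyword query."""
    tokens = {d: text.split() for d, text in docs.items()}
    n_docs = len(tokens)
    total = 0.0
    for term in query.split():
        df = sum(1 for words in tokens.values() if term in words)
        if df == 0:
            continue  # term occurs nowhere, contributes nothing
        tf = tokens[doc_id].count(term) / len(tokens[doc_id])
        total += tf * math.log(1 + n_docs / df)
    return total

score_10 = tf_idf("golden gate", 10)
score_20 = tf_idf("golden gate", 20)
print(score_10, score_20)
```

Because each query term occurs once in each description and the descriptions have the same length, the two scores are identical, so text statistics alone cannot separate the two movies.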

Internet Archive Database: Structured Value Ranking (SVR)

Movies (Mid, Name, Description)
10 | Amateur Film    | … they stand on the golden gate bridge and …
20 | American Thrift | … golden gate bridge with statue of liberty …
Reviews (Rid, Mid, Rating, Name): reviewers include bleblanc, cooker, harry, alice, …
Statistics (Sid, Mid, Visits, Downloads): …

Structured Value Ranking
Use structured data values associated with text columns to score results
Main technical challenge
– Structured data values (and hence scores) change frequently and possibly dramatically!
  Number of visits, downloads, award announcements
  “SlashDot effect”
  Bursts and rapidly changing popularity [Kleinberg]
– Users still want to see results ordered by the latest score values

Dealing with Score Updates
Traditional top-k algorithms: order inverted lists by score
– Top-k queries answered efficiently by scanning only top part of inverted list
Not efficient if scores are updated
– Need to reorder inverted lists
Solution:
– New family of inverted lists that are maintained in approximate score order
– Correct for approximation during query processing

Summary of Proposed Techniques
SQL-based technique for specifying SVR in a relational database
New family of inverted lists that are robust to score updates, while still efficient for queries
– Can specify update-query tradeoff
– Combination of SVR and TF-IDF scores
Can be implemented using existing relational technology such as B+-trees

Outline
System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion

System Architecture
[Figure: RDBMS architecture diagram. Components shown: Relational Query Engine; Relational Tables and Indices; Text Management Component; Text Query Engine; SQL Specification of SVR Scores; Materialized Views for SVR Scores; Novel Indices using B+-trees. Labeled flows: Create Text Index, SQL/MM, Relational Sub-query, Keyword Query, Results & scores, Results.]

SQL-Based SVR Specification
create function S1 (id: integer) returns float
  return SELECT Avg(R.rating) FROM Reviews R WHERE R.Mid = id
create function S2 (id: integer) returns float
  return SELECT S.Visits FROM Statistics S WHERE S.Mid = id
create function S3 (id: integer) returns float
  return SELECT S.Downloads FROM Statistics S WHERE S.Mid = id
create function Agg (s1, s2, s3: float) returns float
  return (s1*100 + s2/2 + s3)
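A minimal sketch of how the S1, S2, S3 and Agg specification could be tried out, using Python's built-in sqlite3 module rather than the RDBMS used in the talk. The row values are invented for illustration, and the three component functions are written as correlated sub-queries because SQLite has no SQL-bodied `create function`; only Agg is registered as a user-defined function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movies(Mid INTEGER PRIMARY KEY, Name TEXT, Description TEXT);
CREATE TABLE Reviews(Rid INTEGER PRIMARY KEY, Mid INTEGER, Rating REAL, Name TEXT);
CREATE TABLE Statistics(Sid INTEGER PRIMARY KEY, Mid INTEGER, Visits INTEGER, Downloads INTEGER);
-- Hypothetical sample rows; the real values are not recoverable from the slides.
INSERT INTO Movies VALUES (10, 'Amateur Film', 'golden gate'), (20, 'American Thrift', 'golden gate');
INSERT INTO Reviews VALUES (1, 10, 2, 'a'), (2, 10, 1, 'b'), (3, 20, 4, 'c');
INSERT INTO Statistics VALUES (1, 10, 100, 20), (2, 20, 300, 50);
""")

def agg(s1, s2, s3):
    # Same combination as the Agg function in the specification above.
    return s1 * 100 + s2 / 2 + s3

conn.create_function("Agg", 3, agg)

rows = conn.execute("""
SELECT M.Mid,
       Agg((SELECT AVG(R.Rating) FROM Reviews R    WHERE R.Mid = M.Mid),  -- S1
           (SELECT S.Visits      FROM Statistics S WHERE S.Mid = M.Mid),  -- S2
           (SELECT S.Downloads   FROM Statistics S WHERE S.Mid = M.Mid))  -- S3
       AS score
FROM Movies M
ORDER BY score DESC
""").fetchall()
print(rows)
```

With these sample rows the two movies, which tie on text statistics, receive clearly different SVR scores.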

SQL-Based SVR Specification (with TF-IDF)
create function S1 (id: integer) returns float
  return SELECT Avg(R.rating) FROM Reviews R WHERE R.Mid = id
create function S2 (id: integer) returns float
  return SELECT S.Visits FROM Statistics S WHERE S.Mid = id
create function S3 (id: integer) returns float
  return SELECT S.Downloads FROM Statistics S WHERE S.Mid = id
create function Agg (s1, s2, s3, s4: float) returns float
  return (s1*100 + s2/2 + s3 + s4/2)
(s4 = TFIDF())

Efficiently Maintaining SVR Scores
One of the key challenges: SVR scores can change frequently
Solution: use materialized views
– Leverage relational technology
– Benefit of SQL-based SVR specification
create materialized view Score as
  SELECT Agg(S1(M.Mid), S2(M.Mid), S3(M.Mid)) FROM Movies M
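The effect of a maintained score view can be approximated in SQLite, which has no materialized views, by keeping a Score table fresh with a trigger. This is a sketch under that substitution, not the mechanism the talk relies on; the table contents are invented, and the score formula is reduced to the visits and downloads terms to keep the trigger short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Statistics(Mid INTEGER PRIMARY KEY, Visits INTEGER, Downloads INTEGER);
CREATE TABLE Score(Mid INTEGER PRIMARY KEY, score REAL);

-- Stand-in for materialized view maintenance: refresh the stored score
-- whenever the underlying statistics row changes.
CREATE TRIGGER refresh_score AFTER UPDATE ON Statistics
BEGIN
  UPDATE Score SET score = NEW.Visits / 2.0 + NEW.Downloads WHERE Mid = NEW.Mid;
END;

-- Hypothetical sample rows.
INSERT INTO Statistics VALUES (10, 100, 20), (20, 300, 50);
INSERT INTO Score SELECT Mid, Visits / 2.0 + Downloads FROM Statistics;
""")

before = conn.execute("SELECT Mid, score FROM Score ORDER BY score DESC").fetchall()

# A burst of visits for movie 10 changes its score without any explicit recomputation.
conn.execute("UPDATE Statistics SET Visits = 1000 WHERE Mid = 10")

after = conn.execute("SELECT Mid, score FROM Score ORDER BY score DESC").fetchall()
print(before, after)
```

The ranking flips after the update, which is exactly the kind of score change the inverted lists described later have to absorb.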

Index Operations
Document score updates
– Handle frequent updates to scores
Top-k keyword queries
– Conjunctive and disjunctive keyword queries
– Include IR-style (TF-IDF) scores
– Top-k query results
Content updates, insertions and deletions
– Update to document content
– Document insertions and deletions

Naïve Approach 1: ID Method
[Figure: inverted lists for “golden” and “gate”, ordered by Id, next to a Score Table with columns Id and Score]
Score updates: efficient (just update score table)
Top-k queries: inefficient (scan all of inverted list)
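A small sketch of the ID Method, assuming in-memory Python dictionaries in place of on-disk inverted lists and an invented set of documents. Score updates touch only the score table, while a conjunctive top-k query has to read every posting of every query term before it can rank anything.

```python
import heapq

# Inverted lists ordered by document id (hypothetical contents).
inverted = {
    "golden": [10, 20, 35],
    "gate": [10, 20, 40],
}
# Separate score table: document id -> current score.
score_table = {10: 7.0, 20: 9.5, 35: 3.0, 40: 8.0}

def update_score(doc_id, new_score):
    """Score update: the inverted lists are not touched at all."""
    score_table[doc_id] = new_score

def top_k(terms, k):
    """Conjunctive top-k: scan the full lists, then look up scores."""
    matches = set(inverted[terms[0]])
    for term in terms[1:]:
        matches &= set(inverted[term])
    return heapq.nlargest(k, ((score_table[d], d) for d in matches))

first = top_k(["golden", "gate"], 1)
update_score(10, 12.0)
second = top_k(["golden", "gate"], 1)
print(first, second)
```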

Naïve Approach 2: Score Method
[Figure: inverted lists for “golden” and “gate”, ordered by Score, with the score stored in each posting]
Top-k queries: efficient (top part of inverted list)
Score updates: inefficient (reorganize many lists)
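The opposite tradeoff can be sketched the same way. In this illustrative Python class (names and data are invented), each list is kept sorted by descending score, so a single-term top-k query is just a prefix read, but one score update has to reposition the document in every list that contains it.

```python
import bisect

class ScoreOrderedIndex:
    def __init__(self, postings, scores):
        self.scores = dict(scores)
        self.terms_of = {}
        for term, docs in postings.items():
            for d in docs:
                self.terms_of.setdefault(d, []).append(term)
        # Store (-score, doc) so that ascending order means descending score.
        self.lists = {t: sorted((-scores[d], d) for d in docs)
                      for t, docs in postings.items()}

    def top_k(self, term, k):
        """Read only the first k postings of the list."""
        return [(-neg, d) for neg, d in self.lists[term][:k]]

    def update_score(self, doc, new_score):
        """Reposition the document in every list it appears in."""
        old = self.scores[doc]
        touched = 0
        for term in self.terms_of[doc]:
            lst = self.lists[term]
            del lst[bisect.bisect_left(lst, (-old, doc))]
            bisect.insort(lst, (-new_score, doc))
            touched += 1
        self.scores[doc] = new_score
        return touched

index = ScoreOrderedIndex({"golden": [1, 2, 3], "gate": [1, 3]},
                          {1: 5.0, 2: 9.0, 3: 7.0})
top_before = index.top_k("golden", 2)
lists_touched = index.update_score(1, 10.0)
top_after = index.top_k("gate", 1)
print(top_before, lists_touched, top_after)
```

In a real collection a document contains many distinct terms, so the number of lists touched per update is what makes this method expensive.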

Dilemma
Want inverted lists ordered by score
– For top-k query performance
– Like in Score Method
But do not want to touch inverted lists for every score update
– For score update performance
– Like in ID Method
How can we address this apparent dilemma?

Score-Threshold Method
Extends Score Method in two key aspects
1) Allow inverted list scores to be out-of-date by up to a threshold
– Avoids having to frequently update inverted list
  Better score update performance
– Need to scan more of inverted list (by up to a threshold) to correct for out-of-date score
  Slightly reduced query performance
2) Use “short” inverted list for scores that exceed threshold
– More efficient than updating large inverted list
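The two aspects above can be put together in a simplified single-term sketch. It is an interpretation of the slides, not the authors' implementation: the class and field names are invented, the threshold is assumed to be the multiplicative form r * list_score, scores are assumed positive, and stale long-list postings are skipped at query time using an in_short flag.

```python
import heapq

class ScoreThresholdIndex:
    def __init__(self, postings, scores, r=1.5):
        self.r = r
        self.terms_of = {}
        for term, docs in postings.items():
            for d in docs:
                self.terms_of.setdefault(d, []).append(term)
        # Score table: current score, score stored in the lists, short-list flag.
        self.table = {d: {"score": s, "list_score": s, "in_short": False}
                      for d, s in scores.items()}
        # Long lists are built once in descending score order and never reordered.
        self.long = {t: sorted(((scores[d], d) for d in docs), reverse=True)
                     for t, docs in postings.items()}
        self.short = {t: [] for t in postings}

    def threshold(self, list_score):
        return self.r * list_score

    def update_score(self, doc, new_score):
        row = self.table[doc]
        row["score"] = new_score
        if new_score <= self.threshold(row["list_score"]):
            return False  # within threshold: only the score table changes
        for t in self.terms_of[doc]:
            kept = [(s, d) for s, d in self.short[t] if d != doc]
            self.short[t] = sorted(kept + [(new_score, doc)], reverse=True)
        row["list_score"], row["in_short"] = new_score, True
        return True

    def top_k(self, term, k):
        merged = heapq.merge(((s, d, False) for s, d in self.long[term]),
                             ((s, d, True) for s, d in self.short[term]),
                             reverse=True)
        best, scanned = [], 0  # best is a min-heap of (current score, doc)
        for list_score, doc, from_short in merged:
            # No unseen document can score above threshold(list_score).
            if len(best) == k and self.threshold(list_score) < best[0][0]:
                break
            scanned += 1
            row = self.table[doc]
            if row["in_short"] != from_short:
                continue  # long-list posting superseded by a short-list entry
            heapq.heappush(best, (row["score"], doc))
            if len(best) > k:
                heapq.heappop(best)
        return sorted(best, reverse=True), scanned

idx = ScoreThresholdIndex({"golden": [1, 2, 3, 4, 5]},
                          {1: 100, 2: 80, 3: 60, 4: 40, 5: 20})
small_change = idx.update_score(4, 50)   # 50 <= 1.5 * 40, lists untouched
big_change = idx.update_score(5, 90)     # 90 > 1.5 * 20, goes to short list
result, scanned = idx.top_k("golden", 2)
print(small_change, big_change, result, scanned)
```

The query stops early because every document it has not yet seen is bounded by the threshold of the current list score, which is the correction step that compensates for stale list scores.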

Score-Threshold Method (example)
[Figure, animated over five frames: inverted lists for “golden” and “gate” ordered by score, a short list, and a ListScore Table with columns Id, Score and InShortList. Threshold = 10. Document 12 receives a new score (95 in the first frame); its InShortList flag is shown as false in the intermediate frames and true in the last frame.]

Query-Update Tradeoff
Choice of threshold function
If threshold(score) = score
– Every update results in update to inverted list
– Similar to Score Method
If threshold(score) = infinity
– No inverted list update, but scan all of list
– Similar to ID Method
Can control query-update tradeoff using threshold function
– threshold(score) = r * score, r >= 1
– r: threshold ratio
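The effect of the threshold ratio on update cost can be illustrated with a toy count of how many score updates would force a list change for one document. The score stream is invented and only increasing scores are modeled; the function name is hypothetical.

```python
def list_updates(score_stream, r):
    """Count updates that exceed threshold(list_score) = r * list_score."""
    list_score = score_stream[0]
    touched = 0
    for score in score_stream[1:]:
        if score > r * list_score:
            list_score = score  # the list entry is refreshed
            touched += 1
    return touched

stream = [10, 11, 12, 14, 18, 25, 26, 40]  # hypothetical rising popularity

every_update = list_updates(stream, 1.0)          # behaves like the Score Method
some_updates = list_updates(stream, 1.5)          # intermediate setting
no_updates = list_updates(stream, float("inf"))   # behaves like the ID Method
print(every_update, some_updates, no_updates)
```

A larger r saves list updates but loosens the bound used to stop a query scan, so queries read more of the list.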

Score-Threshold Method: Critique
Provides good update-query tradeoff
But! Requires score to be stored in inverted list
– Increases size of inverted list
– Decreases query performance
Can we avoid storing scores in inverted list and still get update-query tradeoff?

Chunk Method
Main idea: divide document collection into “chunks” based on original document score
– Lowest 5000 documents in first chunk
– Next higher 3000 documents in second chunk
– Next higher 4000 documents in third chunk
– …
Organize inverted list by chunk, but order documents by Id within a chunk
– Ordered approximately by score (chunk) like Score Method
– Avoids storing scores like in ID Method

Chunk Method (example)
[Figure: inverted lists for “golden” and “gate” ordered by chunk, a short list, and a ListScore Table with columns Id, Score and InShortList]

Chunk Method: Details
Setting chunk boundaries
– highdoc(c) = highest score of document in chunk c
– For two successive chunks c1 and c2: highdoc(c1)/highdoc(c2) = r
  r = chunk ratio
Update document in short list only if document score exceeds 2 chunk boundaries
– 2 chunks handles boundary cases
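One way to read the boundary rule is as geometrically spaced score ranges. The sketch below makes several assumptions that the slides do not state: chunk 0 is the highest-scoring chunk, chunk c covers scores in (max/r^(c+1), max/r^c], scores are positive, and "exceeds 2 chunk boundaries" means the new score falls at least two chunks above the chunk the document is filed under. All names are hypothetical.

```python
def chunk_of(score, max_score, r):
    """Chunk number for a positive score; chunk 0 holds the highest scores."""
    chunk, lower = 0, max_score / r
    while score <= lower:
        chunk += 1
        lower /= r
    return chunk

def build_list(doc_scores, max_score, r):
    """Postings ordered by chunk, then by document id; no scores are stored."""
    return sorted((chunk_of(s, max_score, r), d) for d, s in doc_scores.items())

def needs_short_list(filed_chunk, new_score, max_score, r):
    """Move to the short list only after crossing two chunk boundaries upward."""
    return chunk_of(new_score, max_score, r) <= filed_chunk - 2

scores = {1: 100, 2: 60, 3: 30, 4: 10, 5: 55}  # hypothetical documents
postings = build_list(scores, max_score=100, r=2)
print(postings)
print(needs_short_list(3, 20, 100, 2), needs_short_list(3, 30, 100, 2))
```

Within a chunk the postings are in id order, so the list is only approximately score-ordered, and a document that drifts by a single chunk is left where it is.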

Chunk-TermScore Method
Support combination of SVR and TF-IDF
Combines Chunk Method with Fancy-ID Method [Long and Suel]
– In addition to long and short lists (ordered by chunk), have short fancy list (ordered by TF-IDF)
– Combined merge of all three lists
Details in ICDE paper

Summary of Alternatives
ID Method
– Efficient updates, slow queries
Score Method
– Efficient queries, slow updates
Score-Threshold Method
– Efficient updates, intermediate queries
Chunk Method
– Efficient updates, efficient queries
Chunk-TermScore Method
– Efficient updates, efficient queries, TF-IDF + SVR

Experimental Setup
Two primary performance metrics
– Time for a score update
  Only time to update inverted lists
– Time for a top-k query
Data sets
– Real (Internet Archive): 60MB
  Thanks to Brewster Kahle and Jon Aizen
– Synthetic: 805MB
Compared all five alternatives + ID-TermScore (baseline for Chunk-TermScore)

Implementation Details
Inverted lists implemented in BerkeleyDB
– Long inverted lists as CLOBs
  Read in a page at a time during query processing
– Short inverted lists as clustered B+ trees
  Since short inverted lists are updated
Query algorithms implemented in C++

Inverted List Size
ID Method: 145MB
Score Method: 2768MB
Score-Threshold Method: 847MB
Chunk Method: 146MB
ID-TermScore Method: 428MB
Chunk-TermScore Method: 430MB

Effect of Chunk Ratio
[Figure: times in milliseconds]

Varying # Updates
[Figure: times in milliseconds]

Varying k in Top-k
[Figure]

SVR + TF-IDF
[Figure: times in milliseconds]

Related Work
SQL/MM
– Integrating keyword search with databases
Banks, DBXplorer, Discover
– Search “across” tuples, but simple or traditional IR ranking
Top-k inverted lists and query processing
– Do not handle score updates
Inverted list updates
– Handle only content updates, not score updates
– Proposed techniques can handle content updates too

10000 foot view of Data Management
[Figure, shown in several frames: a grid with Data (Structured, Unstructured) on one axis and Queries (Complex and Structured, Ranked Keyword Search) on the other, with Database Systems and Information Retrieval Systems placed on it. One frame adds the labels “Text search in databases” and “Ranking based on structured values”.]

Towards Unifying DB and IR
XRank: Keyword search over semi-structured XML documents
– Extends keyword search to work over both structured and unstructured data
– SIGMOD 2003 [Guo et al.]
TeXQuery: Query language for structured and unstructured data, structured and keyword queries
– Precursor to W3C XQuery Full-Text
– WWW 2004 [Amer-Yahia et al.]

Questions?