Structured-Value Ranking in Update-Intensive Relational Databases
Jayavel Shanmugasundaram, Cornell University
(Joint work with Lin Guo, Kevin Beyer, Eugene Shekita)
Case Study: Internet Archive
Internet Archive Database
Movies
  Mid  Name             Description
  10   Amateur Film     … they stand on the golden gate bridge and …
  20   American Thrift  … golden gate bridge with statue of liberty …
  …    …                …
SELECT * FROM Movies M
ORDER BY score(M.description, 'golden gate')
FETCH TOP 10 RESULTS ONLY
Main Issue
Traditional IR ranking methods would rank the two movies about the same
Example: TF-IDF
  – “Golden Gate” appears exactly once in both descriptions
  – The lengths of the text fields are about the same
  – Hence: same normalized TF-IDF score
Larger issue: traditional IR scoring methods were developed for stand-alone document collections
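The tie can be checked with a few lines of arithmetic. The sketch below uses one common length-normalized TF-IDF variant; the function name `tf_idf` and every count (description length, collection size, document frequency) are hypothetical values chosen for illustration, not figures from the Internet Archive.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """Length-normalized term frequency times a smoothed inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(1 + num_docs / docs_with_term)
    return tf * idf

# Hypothetical statistics: the phrase occurs once in each description,
# both descriptions are about 50 words long, and 20 of 1000 movies match.
score_amateur_film = tf_idf(1, 50, 1000, 20)
score_american_thrift = tf_idf(1, 50, 1000, 20)

# Identical inputs give identical scores, so text statistics alone
# cannot separate the two movies.
print(score_amateur_film == score_american_thrift)  # True
```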
Internet Archive Database: Structured Value Ranking (SVR)
Movies (Mid, Name, Description)
  10   Amateur Film     … they stand on the golden gate bridge and …
  20   American Thrift  … golden gate bridge with statue of liberty …
Reviews (Rid, Name, Rating, Mid): …
Statistics (Sid, Visits, Downloads, Mid): …
Structured Value Ranking
Use structured data values associated with text columns to score results
Main technical challenge
  – Structured data values (and hence scores) change frequently and possibly dramatically!
      Number of visits, downloads, award announcements
      “SlashDot effect”
      Bursts and rapidly changing popularity [Kleinberg]
  – Users still want to see results ordered by latest score values
Dealing with Score Updates
Traditional top-k algorithms: order inverted lists by score
  – Top-k queries answered efficiently by scanning only the top part of the inverted list
Not efficient if scores are updated
  – Need to reorder inverted lists
Solution:
  – New family of inverted lists that are maintained in approximate score order
  – Correct for approximation during query processing
Summary of Proposed Techniques
SQL-based technique for specifying SVR in a relational database
New family of inverted lists that are robust to score updates, while still efficient for queries
  – Can specify update-query tradeoff
  – Combination of SVR and TF-IDF scores
Can be implemented using existing relational technology such as B+-trees
Outline
System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion
System Architecture
Figure: an RDBMS containing a Relational Query Engine, Relational Tables and Indices, and a Text Management Component (Create Text Index, SQL/MM, Text Query Engine)
Added for SVR: SQL Specification of SVR Scores, Materialized Views for SVR Scores, Novel Indices using B+-trees
Flows shown: Keyword Query, Relational Sub-query, Results & scores, Results
SQL-Based SVR Specification
create function S1 (id: integer) returns float
  return SELECT Avg(R.rating) FROM Reviews R WHERE R.Mid = id
create function S2 (id: integer) returns float
  return SELECT S.Visits FROM Statistics S WHERE S.Mid = id
create function S3 (id: integer) returns float
  return SELECT S.Downloads FROM Statistics S WHERE S.Mid = id
create function Agg (s1, s2, s3: float) returns float
  return (s1*100 + s2/2 + s3)
SQL-Based SVR Specification (combined with TF-IDF)
S1, S2 and S3 as before; Agg takes the TF-IDF score as a fourth input, s4 = TFIDF()
create function Agg (s1, s2, s3, s4: float) returns float
  return (s1*100 + s2/2 + s3 + s4/2)
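To see how the weights in Agg play out, here is a small Python mirror of the four-argument function. The two sets of input values are hypothetical; only the formula comes from the specification above.

```python
def agg(s1, s2, s3, s4=0.0):
    """Mirror of Agg: s1 = average rating, s2 = visits, s3 = downloads, s4 = TF-IDF."""
    return s1 * 100 + s2 / 2 + s3 + s4 / 2

# Hypothetical values for two movies that share the same TF-IDF score.
popular = agg(4.0, 1000, 300, s4=0.5)   # 400 + 500 + 300 + 0.25
obscure = agg(2.0, 100, 40, s4=0.5)     # 200 + 50 + 40 + 0.25

print(popular, obscure)  # 1200.25 290.25
```

With these inputs the structured values dominate the combined score, so the two movies are no longer tied even though their text scores are equal.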
Efficiently Maintaining SVR Scores
One of the key challenges: SVR scores can change frequently
Solution: use materialized views
  – Leverage relational technology
  – Benefit of SQL-based SVR specification
create materialized view Score as
  SELECT Agg(S1(M.Mid), S2(M.Mid), S3(M.Mid)) FROM Movies M
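The effect of the materialized view can be imitated in a few lines of Python. This is only a sketch of incremental maintenance under assumed names (`score_view`, `add_review`, `record_visit`, `record_download`, `refresh`); a real RDBMS maintains the view itself, and treating a movie without reviews as having average rating 0.0 is an assumption of this sketch.

```python
from collections import defaultdict

rating_sum = defaultdict(float)   # per-movie sum of review ratings
rating_count = defaultdict(int)   # per-movie number of reviews
visits = defaultdict(int)
downloads = defaultdict(int)
score_view = {}                   # stands in for the materialized view Score

def agg(s1, s2, s3):
    return s1 * 100 + s2 / 2 + s3

def refresh(mid):
    """Recompute the stored score of one movie after a base-table change."""
    count = rating_count[mid]
    avg = rating_sum[mid] / count if count else 0.0  # assumption: no reviews -> 0.0
    score_view[mid] = agg(avg, visits[mid], downloads[mid])

def add_review(mid, rating):
    rating_sum[mid] += rating
    rating_count[mid] += 1
    refresh(mid)

def record_visit(mid):
    visits[mid] += 1
    refresh(mid)

def record_download(mid):
    downloads[mid] += 1
    refresh(mid)

add_review(10, 4)
add_review(10, 2)
record_visit(10)
record_visit(10)
record_download(10)
print(score_view[10])  # 302.0  (average 3.0 * 100 + 2 / 2 + 1)
```

Only the row of the affected movie is recomputed on each change, which is the property the index structures rely on: every base-table update turns into a single score update.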
Index Operations
Document score updates
  – Handle frequent updates to scores
Top-k keyword queries
  – Conjunctive and disjunctive keyword queries
  – Include IR-style (TF-IDF) scores
  – Top-k query results
Content updates, insertions and deletions
  – Update to document content
  – Document insertions and deletions
Naïve Approach 1: ID Method
Figure: inverted lists for "golden", "gate", … (ordered by Id) next to a separate Score Table (Id, Score)
Score updates: efficient (just update the score table)
Top-k queries: inefficient (scan all of the inverted list)
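A minimal sketch of the ID Method, assuming in-memory Python structures stand in for the inverted lists and the score table. The postings, scores and function names are made up for illustration.

```python
import heapq

# Inverted lists ordered by document id; scores live in a separate table.
inverted = {"golden": [3, 12, 17, 25], "gate": [3, 12, 25, 40]}
scores = {3: 80.0, 12: 60.0, 17: 10.0, 25: 95.0, 40: 5.0}

def update_score(doc_id, new_score):
    """Score update: touches only the score table, never the inverted lists."""
    scores[doc_id] = new_score

def top_k(terms, k):
    """Conjunctive top-k: every posting of every query term is visited."""
    matches = set.intersection(*(set(inverted[t]) for t in terms))
    return heapq.nlargest(k, matches, key=lambda doc: scores[doc])

print(top_k(["golden", "gate"], 2))  # [25, 3]
update_score(12, 99.0)
print(top_k(["golden", "gate"], 2))  # [12, 25]
```

The update is a single dictionary write, while the query cost grows with the full length of the lists regardless of k.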
Naïve Approach 2: Score Method
Figure: inverted lists for "golden", "gate", … (ordered by Score), with the score stored in each list entry
Top-k queries: efficient (top part of the inverted list)
Score updates: inefficient (reorganize many lists)
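A minimal sketch of the Score Method for single-keyword queries, again with made-up in-memory data. Postings are stored as `(-score, doc_id)` so that ascending order means descending score; the helper names are hypothetical.

```python
import bisect

# Each posting carries the score; lists are kept in descending score order.
inverted = {
    "golden": [(-95.0, 25), (-80.0, 3), (-60.0, 12)],
    "gate": [(-95.0, 25), (-80.0, 3), (-60.0, 12), (-5.0, 40)],
}
scores = {25: 95.0, 3: 80.0, 12: 60.0, 40: 5.0}

def top_k_single(term, k):
    """Top-k for one keyword: read only the first k postings."""
    return [doc for _, doc in inverted[term][:k]]

def update_score(doc_id, new_score):
    """Move the document's posting in every list that contains it.

    For brevity this loops over all lists; a real system would visit only
    the lists of the terms that occur in the document.
    """
    old = (-scores[doc_id], doc_id)
    touched = 0
    for postings in inverted.values():
        i = bisect.bisect_left(postings, old)
        if i < len(postings) and postings[i] == old:
            postings.pop(i)
            bisect.insort(postings, (-new_score, doc_id))
            touched += 1
    scores[doc_id] = new_score
    return touched

print(top_k_single("golden", 2))  # [25, 3]
print(update_score(12, 99.0))     # 2 lists had to be reorganized
print(top_k_single("golden", 2))  # [12, 25]
```

The query reads k postings, but a single score change rewrites one posting per distinct term of the document.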
Dilemma
Want inverted lists ordered by score
  – For top-k query performance
  – Like in Score Method
But do not want to touch inverted lists for every score update
  – For score update performance
  – Like in ID Method
How can we address this apparent dilemma?
Score-Threshold Method
Extends Score Method in two key aspects
1) Allow inverted list scores to be out-of-date by up to a threshold
  – Avoids having to frequently update the inverted list
      Better score update performance
  – Need to scan more of the inverted list (by up to a threshold) to correct for out-of-date scores
      Slightly reduced query performance
2) Use “short” inverted list for scores that exceed threshold
  – More efficient than updating the large inverted list
Score-Threshold Method: Example
Figure: inverted lists for "golden" and "gate" (ordered by Score) with a short list; Score Table (Id, Score); ListScore Table (Id, Score, InShortList)
Threshold = 10
Animation frames: Doc 12 is given a new score (95 in the first frame); the InShortList flag shown for it is false in the intermediate frames and true in the final frame
Query-Update Tradeoff
Choice of threshold function
If threshold(score) = score
  – Every update results in an update to the inverted list
  – Similar to Score Method
If threshold(score) = infinity
  – No inverted list update, but scan all of the list
  – Similar to ID Method
Can control query-update tradeoff using threshold function
  – threshold(score) = r * score, r >= 1
  – r: threshold ratio
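The Score-Threshold idea with threshold(score) = r * score can be sketched as follows. This is a simplified single-keyword illustration, not the authors' implementation: the class and method names, the ratio r = 1.5, the sample data, the assumption of positive scores, and the way stale long-list postings are skipped are all assumptions of the sketch.

```python
import heapq

R = 1.5  # assumed threshold ratio

def threshold(list_score):
    return R * list_score

class ScoreThresholdIndex:
    def __init__(self, postings, scores):
        self.score = dict(scores)        # Score table: current scores
        self.list_score = dict(scores)   # ListScore table: score stored in the lists
        self.in_short = {doc: False for doc in scores}
        self.long = {t: sorted(((scores[d], d) for d in docs), reverse=True)
                     for t, docs in postings.items()}
        self.short = {t: [] for t in postings}
        self.terms_of = {}
        for t, docs in postings.items():
            for d in docs:
                self.terms_of.setdefault(d, []).append(t)

    def update_score(self, doc, new_score):
        """Return True only if the inverted (short) lists had to be touched."""
        self.score[doc] = new_score
        if new_score <= threshold(self.list_score[doc]):
            return False                 # within threshold: lists stay as they are
        for t in self.terms_of[doc]:
            kept = [(s, d) for s, d in self.short[t] if d != doc]
            self.short[t] = sorted(kept + [(new_score, doc)], reverse=True)
        self.list_score[doc] = new_score
        self.in_short[doc] = True
        return True

    def top_k(self, term, k):
        best = []                        # min-heap of (current score, doc)
        merged = heapq.merge(self.short[term], self.long[term], reverse=True)
        for list_score, doc in merged:
            # No later posting can have a current score above threshold(list_score).
            if len(best) == k and threshold(list_score) <= best[0][0]:
                break
            if list_score != self.list_score[doc]:
                continue                 # stale posting superseded by the short list
            heapq.heappush(best, (self.score[doc], doc))
            if len(best) > k:
                heapq.heappop(best)
        return [doc for _, doc in sorted(best, reverse=True)]

index = ScoreThresholdIndex({"golden": [3, 12, 25, 40]},
                            {3: 80.0, 12: 60.0, 25: 95.0, 40: 5.0})
print(index.update_score(12, 70.0))   # False: 70 <= 1.5 * 60
print(index.top_k("golden", 2))       # [25, 3]
print(index.update_score(12, 120.0))  # True: 120 > 1.5 * 60
print(index.top_k("golden", 2))       # [12, 25]
```

A larger R makes more updates fall under the threshold but forces the query to scan further before the stopping test succeeds, which is the tradeoff described above.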
Score-Threshold Method: Critique
Provides good update-query tradeoff
But! Requires score to be stored in inverted list
  – Increases size of inverted list
  – Decreases query performance
Can we avoid storing scores in the inverted list and still get the update-query tradeoff?
Chunk Method
Main idea: divide document collection into “chunks” based on original document score
  – Lowest 5000 documents in first chunk
  – Next higher 3000 documents in second chunk
  – Next higher 4000 documents in third chunk
  – …
Organize inverted list by chunk, but order documents by Id within a chunk
  – Ordered approximately by score (chunk) like Score Method
  – Avoids storing scores like in ID Method
Chunk Method: Example
Figure: inverted lists for "golden" and "gate" (ordered by Chunk) with a short list; Score Table (Id, Score); ListScore Table (Id, Score, InShortList)
Chunk Method: Details
Setting chunk boundaries
  – highdoc(c) = highest score of document in chunk c
  – For two successive chunks c1 and c2: highdoc(c1)/highdoc(c2) = r
  – r = chunk ratio
Update document in short list only if document score exceeds 2 chunk boundaries
  – 2 chunks handles boundary cases
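One way to read the chunk rules is sketched below. It assumes geometric chunk boundaries base * r**i with r > 1 and base > 0, numbers chunks from the lowest scores upward, and interprets "exceeds 2 chunk boundaries" as "the new score falls at least two chunks above the chunk the document is stored in". The names `chunk_of`, `needs_short_list`, `BASE` and `R` and all sample values are hypothetical.

```python
def chunk_of(score, base, r):
    """Index of the lowest chunk whose upper boundary base * r**i is >= score."""
    index, high = 0, base
    while score > high:
        high *= r
        index += 1
    return index

BASE, R = 10.0, 2.0  # hypothetical: chunk boundaries at 10, 20, 40, 80, 160, ...

original_scores = {3: 80.0, 12: 15.0, 17: 70.0, 25: 95.0, 40: 5.0}
list_chunk = {doc: chunk_of(s, BASE, R) for doc, s in original_scores.items()}

# Inverted list layout: highest chunk first, ascending document id inside a chunk.
# No score is stored in the postings themselves.
postings = sorted(original_scores, key=lambda doc: (-list_chunk[doc], doc))

def needs_short_list(doc, new_score):
    """True when the new score lies at least two chunks above the stored chunk."""
    return chunk_of(new_score, BASE, R) >= list_chunk[doc] + 2

print(postings)                    # [25, 3, 17, 12, 40]
print(needs_short_list(12, 35.0))  # False: chunk 1 -> chunk 2
print(needs_short_list(12, 70.0))  # True: chunk 1 -> chunk 3
```

Because only chunk membership matters, score changes that stay within a chunk or cross a single boundary leave the inverted lists untouched.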
Chunk-TermScore Method
Support combination of SVR and TF-IDF
Combines Chunk Method with Fancy-ID Method [Long and Suel]
  – In addition to long and short lists (ordered by chunk), have short fancy list (ordered by TF-IDF)
  – Combined merge of all three lists
Details in ICDE paper
Summary of Alternatives
ID Method
  – Efficient updates, slow queries
Score Method
  – Efficient queries, slow updates
Score-Threshold Method
  – Efficient updates, intermediate queries
Chunk Method
  – Efficient updates, efficient queries
Chunk-TermScore Method
  – Efficient updates, efficient queries, TF-IDF + SVR
Experimental Setup
Two primary performance metrics
  – Time for a score update
      Only time to update inverted lists
  – Time for a top-k query
Data sets
  – Real (Internet Archive): 60MB
      Thanks to Brewster Kahle and Jon Aizen
  – Synthetic: 805MB
Compared all five alternatives + ID-TermScore (baseline for Chunk-TermScore)
Implementation Details
Inverted lists implemented in BerkeleyDB
  – Long inverted lists as CLOBs
      Read in a page at a time during query processing
  – Short inverted lists as clustered B+ trees
      Since short inverted lists are updated
Query algorithms implemented in C++
Inverted List Size
  ID Method:               145MB
  Score Method:            2768MB
  Score-Threshold Method:  847MB
  Chunk Method:            146MB
  ID-TermScore Method:     428MB
  Chunk-TermScore Method:  430MB
Effect of Chunk Ratio (chart; times in milliseconds)
Varying # Updates (chart; times in milliseconds)
Varying k in Top-k (chart)
SVR + TF-IDF (chart; times in milliseconds)
Related Work
SQL/MM
  – Integrating keyword search with databases
BANKS, DBXplorer, Discover
  – Search “across” tuples, but simple or traditional IR ranking
Top-k inverted lists and query processing
  – Do not handle score updates
Inverted list updates
  – Handle only content updates, not score updates
  – Proposed techniques can handle content updates too
10,000-Foot View of Data Management
Figure: grid with Data (Structured, Unstructured) on one axis and Queries (Complex and Structured, Ranked Keyword Search) on the other; regions labeled Database Systems and Information Retrieval Systems
10,000-Foot View of Data Management (this work)
Figure: the same grid of Data (Structured, Unstructured) against Queries (Complex and Structured, Ranked Keyword Search), with Database Systems and Information Retrieval Systems, now annotated with: Text search in databases; Ranking based on structured values
Towards Unifying DB and IR
XRank: Keyword search over semi-structured XML documents
  – Extends keyword search to work over both structured and unstructured data
  – SIGMOD 2003 [Guo et al.]
TeXQuery: Query language for structured and unstructured data, structured and keyword queries
  – Precursor to W3C XQuery Full-Text
  – WWW 2004 [Amer-Yahia et al.]
Questions?