Project Description 198:541
Query Processing Project
  1. Exact query answering using standard indexes
  2. Advanced query processing
       Multidimensional
       Text data
       Top-k query model
Implementation Details
  Choice of C++ or Java
  Data storage
    You do not have to implement disk storage… but you can.
    You can use a DBMS for storage, but you have to implement your own indexes.
    You can simulate disk record IDs (rids) with a hash table that maps each rid to its full tuple.
    For the purpose of this project, a main-memory implementation is fine, but it might be easier for you to have something more persistent.
  Single or multiple tables
    You can have joins in the advanced query processing part of the project.
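
The rid simulation mentioned under data storage can be sketched as follows. This is only an illustration under stated assumptions: the class name TupleStore, its methods, and the choice of String[] as the tuple type are hypothetical, not part of the project specification. Index entries would store the returned rid, and fetch plays the role of the disk access.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: a hash table keyed by rid stands in for disk pages holding full tuples. */
class TupleStore {
    // rid -> full tuple; String[] is an assumed tuple representation
    private final Map<Integer, String[]> tuplesByRid = new HashMap<>();
    private int nextRid = 0;

    /** Stores a tuple and returns the rid that an index entry would keep. */
    int insert(String[] tuple) {
        int rid = nextRid++;
        tuplesByRid.put(rid, tuple);
        return rid;
    }

    /** Simulates a disk fetch: rid in, full tuple out (null if the rid is unknown). */
    String[] fetch(int rid) {
        return tuplesByRid.get(rid);
    }

    /** Number of tuples currently stored. */
    int size() {
        return tuplesByRid.size();
    }
}
```

With this layout the indexes only ever hold rids, so swapping the hash table for real disk storage later would not change the index code.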
Step 1: Finding Data
  You should find a dataset
    Multiple attributes (3-4 minimum)
    At least 1000 data points
  Domain
    Numeric values
    Some text fields if you want to look into IR techniques
  Find data on which you can ask meaningful queries (exact and advanced)
  Sources:
    Census data
    Weather statistics
    Bibliographical data
    Sales data (Amazon)…
Step 2: Exact Query Processing
  Deciding on meaningful indexes for your application
  Bulk loading indexes (type is data and query dependent)
    B+ tree
    Hash tables
  Answering exact queries
    Single-attribute
    Multi-attribute (merging single-attribute results)
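
One possible way to answer a multi-attribute exact query by merging single-attribute results is sketched below. It assumes rids are row positions, so each rid list is built in increasing order and two lists can be intersected with a linear merge. The class RidMerge and its method names are hypothetical, and the library HashMap is used here only to keep the sketch short; for the project itself you would substitute your own hash table or B+ tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RidMerge {
    /** Builds value -> rid list for one attribute; rids are assumed to be row positions. */
    static Map<String, List<Integer>> buildIndex(String[][] rows, int attr) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int rid = 0; rid < rows.length; rid++) {
            index.computeIfAbsent(rows[rid][attr], k -> new ArrayList<>()).add(rid);
        }
        return index; // each list is sorted because rids are appended in increasing order
    }

    /** Intersects two sorted rid lists with a linear merge. */
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    /** Answers "attr1 = v1 AND attr2 = v2" by merging two single-attribute lookups. */
    static List<Integer> query(Map<String, List<Integer>> idx1, String v1,
                               Map<String, List<Integer>> idx2, String v2) {
        return intersect(idx1.getOrDefault(v1, Collections.emptyList()),
                         idx2.getOrDefault(v2, Collections.emptyList()));
    }
}
```

The merge costs time linear in the two list lengths, which is why keeping rid lists sorted at bulk-load time can pay off.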
Step 3: Advanced Query Processing
Numeric Data
  Multidimensional indexes
    Multidimensional range query processing
  Skyline queries
    Find the best undominated tuples in the data set
    Related: maximize a function of the attribute values
  Top-k query processing, nearest-neighbor queries
    Smart index accesses based on preferred result values
  Join optimization using specific join indexes
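
To make "undominated" concrete, here is a minimal skyline sketch. It assumes that larger values are better on every attribute and that all tuples have the same number of attributes. It is a naive quadratic scan, not an index-based method, so it is best read as a correctness baseline against which a smarter implementation can be checked. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class Skyline {
    /** p dominates q if p is >= q on every attribute and > q on at least one. */
    static boolean dominates(double[] p, double[] q) {
        boolean strictlyBetter = false;
        for (int d = 0; d < p.length; d++) {
            if (p[d] < q[d]) {
                return false; // p is worse on attribute d, so it cannot dominate q
            }
            if (p[d] > q[d]) {
                strictlyBetter = true;
            }
        }
        return strictlyBetter;
    }

    /** Naive O(n^2) skyline: keep every tuple that no other tuple dominates. */
    static List<double[]> skyline(List<double[]> points) {
        List<double[]> result = new ArrayList<>();
        for (double[] candidate : points) {
            boolean dominated = false;
            for (double[] other : points) {
                if (dominates(other, candidate)) {
                    dominated = true;
                    break;
                }
            }
            if (!dominated) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Under this definition, exact duplicates do not dominate each other, so both copies stay in the skyline.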
Step 3: Advanced Query Processing
Numeric and Text Data
  IR techniques for text-only queries
    Inverted list indexes
      Exact queries
      Top-k queries (tf.idf scores)
  Text and value queries
    Exact queries: find articles written in 2004 with “XML Path Indexes” in their abstract
    Top-k queries
      Exact matching on text, ranking on numeric value
      Exact matching on numeric values, ranking on text
      Ranking on both numeric values and text
        More research-oriented
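
A minimal sketch of an inverted list index with tf.idf ranking follows. Several tf.idf variants exist; this one assumes the score of a document is the sum, over query terms, of tf * ln(N / df), where tf is the term's count in the document, N is the number of documents, and df is the number of documents containing the term. The tokenizer, the class name InvertedIndex, and the requirement that each document id is added exactly once are all assumptions of the sketch. It scores every matching document and then sorts, which is simple but does not attempt the smarter early-termination top-k strategies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InvertedIndex {
    // term -> (docId -> term frequency in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int numDocs = 0;

    /** Adds one document; assumes each docId is added exactly once. */
    void add(int docId, String text) {
        numDocs++;
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            postings.computeIfAbsent(term, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns up to k docIds ranked by the sum over query terms of tf * ln(N / df). */
    List<Integer> topK(String query, int k) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : query.toLowerCase().split("[^a-z0-9]+")) {
            Map<Integer, Integer> list = postings.get(term);
            if (list == null) {
                continue; // term appears in no document
            }
            double idf = Math.log((double) numDocs / list.size());
            for (Map.Entry<Integer, Integer> e : list.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}
```

Note that with this formula a term occurring in every document gets an idf of zero and therefore contributes nothing to the ranking.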