Data Frame Augmentation of Free Form Queries for Constraint Based Document Filtering
Andrew Zitzelberger
Problem
Constraint Based Queries
Queries
Test Queries
1) Find me a Wii game.
2) Find me a Honda for under 15 thousand dollars.
3) Roller Coaster more than 150 feet high
4) mountains at least 15K feet
5) games under $25
6) mountains less than 4 km
7) ps games < $40
8) coasters longer than 1000 feet
9) car for under 5 grand newer than 1990 with less than 115K miles
10) more than 15K miles under 5 grand newer than 2004
Keywords + Semantics
Semantic queries are computationally expensive
Keyword queries are fast and simple
  o People are used to keyword queries
Synergistic solution:
  o extract numerical constraints from the query
  o use keywords to quickly narrow the search space
  o use constraints as a filter
Data Frames
Price
  internal representation: Double
  external representation: \$[1-9]\d{0,2}(,\d{3})*|
  right units: (K)?\s*(cents|dollars|[Gg]rand|...)
  canonicalization method: toUSDollars
  comparison methods:
    LessThan(p1: Price, p2: Price) returns (Boolean)
    external representation: (less than|<|under|...)\s*{p2}|
end
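The Price data frame above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the data frame library itself: the regular expressions are simplified stand-ins for the slide's patterns, and the names PRICE, LESS_THAN, to_us_dollars, price_upper_bound and less_than are hypothetical.

```python
import re
from typing import Optional

# Simplified external representation: an optional "$", a number, and
# optional right units. These patterns are illustrative assumptions and
# cover fewer forms than the patterns on the slide.
NUMBER = r"(?P<num>\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)"
UNITS = r"(?P<kilo>K\b|thousand\b)?\s*(?P<unit>cents|dollars|grand)?"
PRICE = re.compile(r"\$?" + NUMBER + r"\s*" + UNITS, re.IGNORECASE)

# External representation of the LessThan comparison: a trigger phrase
# followed by the price that plays the role of p2.
LESS_THAN = re.compile(r"(?:less than|<|under)\s*" + PRICE.pattern, re.IGNORECASE)


def _canonicalize(match) -> float:
    """Canonicalization method: convert a recognized price to US dollars."""
    value = float(match.group("num").replace(",", ""))
    if match.group("kilo"):
        value *= 1000
    unit = (match.group("unit") or "dollars").lower()
    if unit == "grand":
        value *= 1000
    elif unit == "cents":
        value /= 100
    return value


def to_us_dollars(text: str) -> Optional[float]:
    """Return the first price found in text, in dollars, or None."""
    match = PRICE.search(text)
    return None if match is None else _canonicalize(match)


def price_upper_bound(query: str) -> Optional[float]:
    """Return p2 from a 'less than p2' style phrase in the query, or None."""
    match = LESS_THAN.search(query)
    return None if match is None else _canonicalize(match)


def less_than(p1: float, p2: float) -> bool:
    """Comparison method over the internal (Double-like) representation."""
    return p1 < p2
```

For example, price_upper_bound("Car under 6 grand") canonicalizes "6 grand" to 6000 dollars, and less_than(to_us_dollars("$4,500"), 6000.0) then holds for a page that lists that price.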
Data Frame Library
Free Form Query
Car under 6 grand newer than 1990 with less than 115K miles
Step 1: Condition Extraction
Car under 6 grand newer than 1990 with less than 115K miles
Extracted Conditions
  o (Price < 6000)
  o (Year > 1990)
  o (Distance < )
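Condition extraction for the running query can be sketched as follows. The single CONDITION pattern, the OPS table and the rule that picks a data frame from the trigger phrase or unit are all illustrative assumptions; an actual data frame library would keep separate recognizers per concept and canonicalize distance units, which this sketch does not.

```python
import re

# Hypothetical mapping from trigger phrases to comparison operators.
OPS = {
    "under": "<", "less than": "<", "<": "<",
    "more than": ">", "newer than": ">", ">": ">",
    "at least": ">=",
}

# One combined pattern: trigger phrase, number, optional multiplier,
# optional unit word. Assumed for illustration only.
CONDITION = re.compile(
    r"(?P<op>under|less than|more than|newer than|at least|<|>)\s*\$?"
    r"(?P<num>\d+(?:\.\d+)?)"
    r"(?P<mult>K\b|\s*grand\b|\s*thousand\b)?"
    r"(?:\s*(?P<unit>dollars|miles|feet|km)\b)?",
    re.IGNORECASE,
)


def extract_conditions(query):
    """Return a list of (data frame name, operator, numeric bound) tuples."""
    conditions = []
    for match in CONDITION.finditer(query):
        phrase = match.group("op").lower()
        value = float(match.group("num"))
        if match.group("mult"):  # K, grand, thousand all mean x1000
            value *= 1000
        unit = (match.group("unit") or "").lower()
        if phrase == "newer than":
            frame = "Year"
        elif unit in ("miles", "feet", "km"):
            frame = "Distance"  # units are NOT converted in this sketch
        else:
            frame = "Price"  # assumed default for unit-less bounds
        conditions.append((frame, OPS[phrase], value))
    return conditions


if __name__ == "__main__":
    query = "Car under 6 grand newer than 1990 with less than 115K miles"
    for condition in extract_conditions(query):
        print(condition)
```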
Step 2: Remove Condition Values
Car under newer than with less than
Step 3: Remove Stopwords
Car
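Steps 2 and 3 can be sketched together. The VALUE pattern and the STOPWORDS set are assumptions chosen to cover the running example; the stopword list actually used is not given on the slides.

```python
import re

# Assumed pattern for a condition value: number, optional multiplier,
# optional unit word.
VALUE = re.compile(
    r"\$?\d+(?:\.\d+)?(?:K\b|\s*grand\b|\s*thousand\b)?"
    r"(?:\s*(?:dollars|miles|feet|km)\b)?",
    re.IGNORECASE,
)

# Assumed stopword list, just large enough for the running example.
STOPWORDS = {
    "a", "an", "the", "for", "with", "than", "under", "over",
    "less", "more", "newer", "older", "at", "least",
}


def remove_condition_values(query: str) -> str:
    """Step 2: drop the numeric values (with their units) from the query."""
    return " ".join(VALUE.sub(" ", query).split())


def remove_stopwords(query: str) -> str:
    """Step 3: drop stopwords, leaving the keywords for the search engine."""
    return " ".join(w for w in query.split() if w.lower() not in STOPWORDS)


if __name__ == "__main__":
    query = "Car under 6 grand newer than 1990 with less than 115K miles"
    stripped = remove_condition_values(query)
    print(stripped)
    print(remove_stopwords(stripped))
```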
Step 4: Perform Keyword Search
Step 5: Filter Document on Constraints
Keep page if every constraint is satisfied by at least one extracted value
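The filtering rule can be written as a nested all/any check. In this sketch the constraint tuples and the per-page dictionary of extracted values are hypothetical data shapes; how values are extracted from a page is left out.

```python
import operator

# Comparison operators a constraint may use.
OPERATORS = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def keep_page(constraints, extracted):
    """Keep a page if every constraint is met by at least one extracted value.

    constraints: list of (data frame name, operator, bound) tuples.
    extracted:   dict mapping a data frame name to the values found on the page.
    """
    return all(
        any(OPERATORS[op](value, bound) for value in extracted.get(frame, []))
        for frame, op, bound in constraints
    )


if __name__ == "__main__":
    constraints = [("Price", "<", 6000), ("Year", ">", 1990)]
    print(keep_page(constraints, {"Price": [5500, 12000], "Year": [1995]}))
    print(keep_page(constraints, {"Price": [7500], "Year": [1995]}))
```

Because a single satisfying value is enough, any unrelated number on the page that a data frame recognizes can let the page through. With no constraints the check keeps every page, so the result falls back to the plain keyword search.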
Experimental Setup
300 web documents
  o 100 car+trucks pages from
  o 100 video gaming pages from
  o 50 mountain pages from
  o 50 roller coaster pages from
10 queries
  o 8 with usable conditions
2 data sets
  o test-development
  o blind test
Results Summary
Precision increase for 56% of queries
  o 75% for test-dev, 50% for blind-test
Precision never worse than keyword query
Most effective for short, focused documents
Discussion
Issues:
1. inadequate narrowing or ranking of search space
2. noise caused by other numbers
Distance <
Future Work
Scalability
  o Indexing data frame extracted terms
Precision vs Recall trade-offs
Pay-as-you-go search construction
Related Work
Question-Answering Systems
Keyword search over databases and semantic stores
Questions?
Results (Test-Dev Set)
Results (Blind Test Set)