Predicate-based Indexing of Annotated Data
Donald Kossmann, ETH Zurich
Observations
Data is annotated by apps and humans
–Word: versions, comments, references, layout, ...
–Humans: tags on del.icio.us
Applications provide views on the data
–Word: Version 1.3, text without comments, ...
–del.icio.us: provides search on tags + data
Search engines see and index the raw data
–E.g., treat all versions as one document
Search engine's view != user's view
–The search engine returns the wrong results,
–or the application hard-codes the right search logic (del.icio.us)
Desktop Search
[Figure: the application (e.g., Word, Wiki, Outlook, ...) reads and updates files in the file system (e.g., x.doc, y.xls, ...) and presents views to the user; the desktop search engine (e.g., Spotlight, Google Desktop, ...) crawls and indexes the raw files and answers the user's queries.]
Example 1: Bulk Letters
Raw data (x.doc, y.xls, ...):
  Template: "Dear ..., The meeting is at 12. CU, Donald"
  Recipient list: Peter..., Paul..., Mary...
View (the generated letters):
  "Dear Peter, The meeting is at 12. CU, Donald"
  "Dear Paul, The meeting is at 12. CU, Donald"
  ...
Example 1: Traditional Search Engines
Inverted File:
  DocId  Keyword  ...
  x.doc  Dear     ...
  x.doc  Meeting  ...
  y.xls  Peter    ...
  y.xls  Paul     ...
  y.xls  Mary     ...
  ...    ...      ...
Query: Paul, meeting
  Answer: -        Correct answer: x.doc
Query: Paul, Mary
  Answer: y.xls    Correct answer: x.doc
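To make the failure concrete, here is a minimal sketch of such a traditional inverted index over the raw files. The file contents, the `<name>` placeholder, and the tokenizer are simplifications of the bulk-letter example above, not the actual system.

```python
# Naive inverted index over the raw files (not over the generated letters).
from collections import defaultdict

raw_files = {
    "x.doc": "Dear <name>, The meeting is at 12. CU, Donald",  # template
    "y.xls": "Peter Paul Mary",                                # recipient list
}

index = defaultdict(set)
for doc_id, text in raw_files.items():
    for word in text.lower().replace(",", " ").replace(".", " ").split():
        index[word].add(doc_id)

def search(*keywords):
    """AND semantics: intersect the posting lists of all keywords."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings)

print(search("Paul", "meeting"))  # set(): misses x.doc, which the user expects
print(search("Paul", "Mary"))     # {'y.xls'}: no generated letter contains both
```

Both answers diverge from the user's view because the engine indexes the template and the recipient list, never the letters the user actually sees.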
Example 2: Versioning (OpenOffice)
Raw Data (one file stores both versions):
  Instance 1 (Version 1): "Mickey likes Minnie"
  Instance 2 (Version 2): "Donald likes Daisy"
Example 2: Versioning (OpenOffice)
Inverted File:
  DocId  Keyword  ...
  z.swx  Mickey   ...
  z.swx  likes    ...
  z.swx  Minnie   ...
  z.swx  Donald   ...
  z.swx  Daisy    ...
Query: Mickey likes Daisy
  Answer: z.swx    Correct answer: -
Query: Mickey likes Minnie
  Answer: z.swx    Correct answer: z.swx (V1)
Example 3: Personalization, Localization, Authorization
Raw data: one document with context-dependent alternatives, "{Donald | Mickey} likes {Daisy | Minnie}."
Views: "Donald likes Daisy." in one context; "Mickey likes Minnie." in another.
Example 4: del.icio.us
Pages:
  "Yahoo builds software."
  "Joe is a programmer at Yahoo."
Tag Table:
  user  tag       URL
  Joe   business  A
  Mary  software  B
Query: "Joe, software, Yahoo"
–Both A and B are relevant, but in different worlds
–If context info is available, a choice is possible
Example 5: Enterprise Search
Web Applications
–Application defined using "templates" (e.g., JSP)
–Data in both JSP pages and the database
–Content = JSP + Java + Database
–Content depends on context (roles, workflow)
–Links = URL + function + parameters + context
Enterprise Search
–Search: map Content to Link
–Enterprise search: Content and Link are complex
Example: Search Engine for J2EE PetStore
–(see demo at CIDR 2007)
Possible Solutions
Extend applications with search capabilities
–Re-invents the wheel for each application
–Not worth the effort for small apps
–No support for cross-app search
Extend search engines
–Application-specific rules for "encoded" data
–"Possible worlds" semantics of the data
–Materialize the view, normalize the view
–Index the normalized view
–Extended query processing
–Challenge: views become huge!
[Figure: the same desktop search architecture, extended with rules: rules define the applications' views over the raw files, and the search engine crawls and indexes those views instead of the raw data.]
Size of Views
One rule: the size of the view grows linearly with the size of the document
–E.g., one instance in the view for each version
–The constant can be high! (e.g., many versions)
Several rules: the size of the view grows exponentially with the number of rules
–E.g., #versions x #alternatives (see the sketch below)
Prototype, experiments: Wiki, Office, ...
–About 30 rules; 5-6 applicable per document
–View ~ 1000 x raw data
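A back-of-the-envelope sketch of that blowup; the per-rule instance counts below are assumed for illustration, while the 5 applicable rules and the ~1000x factor come from the slide.

```python
# Worked arithmetic for the view blowup (illustrative numbers).
versions = 10          # one rule: one view instance per version (linear)
alternatives = 4       # assumed instances produced by each further rule
applicable_rules = 5   # slide: 5-6 of ~30 rules apply per document

print(versions * alternatives)           # 40: two rules multiply together
print(alternatives ** applicable_rules)  # 1024: the order of the observed 1000x
```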
Solution Architecture
Rules and Patterns
Analogy: operators of relational algebra
Patterns are sufficient for LaTeX, MS Office, OpenOffice, TWiki, (Outlook)
Normalized View
Raw Data: one document encoding both "Donald likes Daisy." and "Mickey likes Minnie." [markup lost]
Rule: [lost]
Normalized View: "Donald Mickey likes Daisy Minnie.", with each variable word wrapped in markup recording its alternative
Normalized View
Raw Data: Mikey <insert date="5/1/2006">Mouse</insert> likes Minnie <insert date="5/16/2006">Mouse</insert>.
Rule: <version match="//insert" eq />
Normalized View: "Mikey Mouse likes Minnie Mouse.", with the inserted words marked up by version
General Idea: factor out the common parts ("Mickey likes Minnie.") and mark up the variable parts; a sketch follows below.
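As an illustration of that idea, a minimal sketch that factors out the common parts of two instances and marks up the variable parts. The difflib alignment and the <v> tag syntax are stand-ins for illustration, not the system's actual rules.

```python
# Factor out common parts of two instances; mark up the variable parts.
from difflib import SequenceMatcher

v1 = "Mickey likes Minnie".split()   # instance under version 1
v2 = "Donald likes Daisy".split()    # instance under version 2

normalized = []
for op, i1, i2, j1, j2 in SequenceMatcher(a=v1, b=v2).get_opcodes():
    if op == "equal":
        normalized.extend(v1[i1:i2])                            # common part, stored once
    else:
        normalized += [f'<v n="1">{w}</v>' for w in v1[i1:i2]]  # only in v1
        normalized += [f'<v n="2">{w}</v>' for w in v2[j1:j2]]  # only in v2

print(" ".join(normalized))
# <v n="1">Mickey</v> <v n="2">Donald</v> likes <v n="1">Minnie</v> <v n="2">Daisy</v>
```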
Normalization Algorithm
Step 1: Construct the Tagging Table
–Evaluate the "match" expression
–Evaluate the "key" expression
–Compute the operator from the pattern (e.g., > for version)
Step 2: Tagging Table -> Normalized View
–Embrace each match with tags
Tagging Table:
  Rule  NodeId  Key Value  Op
  R1    19      duck       =
  R1    19      mouse      =
  R1    22      duck       =
  R1    22      mouse      =
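A toy rendering of the two steps, assuming the match and key expressions have already been evaluated; the node ids mirror the tagging table above, while the node contents and the <alt> tag syntax are hypothetical.

```python
# Step 1: construct the tagging table (Rule, NodeId, Key Value, Op).
# The match results below are assumed; in the system they come from
# evaluating the rule's match and key expressions on the document.
matches = [("R1", 19, "duck"), ("R1", 19, "mouse"),
           ("R1", 22, "duck"), ("R1", 22, "mouse")]
operator_of = {"R1": "="}  # operator computed from the rule's pattern

tagging_table = [(r, n, k, operator_of[r]) for r, n, k in matches]

# Step 2: embrace each match with tags carrying "Rule Op KeyValue".
alternatives = {  # hypothetical node contents: node id -> key value -> text
    19: {"duck": "Donald", "mouse": "Mickey"},
    22: {"duck": "Daisy", "mouse": "Minnie"},
}
for rule, node, key, op in tagging_table:
    print(f'<alt cond="{rule}{op}{key}">{alternatives[node][key]}</alt>')
# <alt cond="R1=duck">Donald</alt>, <alt cond="R1=mouse">Mickey</alt>, ...
```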
Predicate-based Indexing
Normalized View: "Donald Mickey likes Daisy Minnie." (with per-alternative markup)
Inverted File:
  DocId  Keyword  Condition
  z.swx  Donald   R1=duck
  z.swx  Mickey   R1=mouse
  z.swx  likes    true
  z.swx  Daisy    R1=duck
  z.swx  Minnie   R1=mouse
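A sketch of how such an inverted file can be built from the normalized view; the (keyword, condition) token stream below is a simplification of parsing the marked-up view.

```python
# Build a predicate-based inverted file: each posting carries the
# condition under which its keyword is visible in some instance.
normalized_view = [  # (keyword, condition) tokens of z.swx, as on the slide
    ("Donald", "R1=duck"), ("Mickey", "R1=mouse"), ("likes", "true"),
    ("Daisy", "R1=duck"), ("Minnie", "R1=mouse"),
]

inverted_file = {}   # keyword -> list of (doc id, condition)
for keyword, cond in normalized_view:
    inverted_file.setdefault(keyword.lower(), []).append(("z.swx", cond))

print(inverted_file["donald"])  # [('z.swx', 'R1=duck')]
```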
Query Processing
Query: Donald likes Minnie
  Conditions: R1=duck ^ true ^ R1=mouse -> false (no result)
Query: Donald likes Daisy
  Conditions: R1=duck ^ true ^ R1=duck -> R1=duck (z.swx qualifies under R1=duck)
(Inverted file as on the previous slide.)
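A minimal sketch of query processing over such an index: conjoin the per-keyword conditions and return a document only if the conjunction is satisfiable. Conditions are restricted to equalities here, so satisfiability reduces to checking that no variable is bound to two different values; the index literal repeats the slide's table.

```python
# Keyword search over a predicate-based inverted file (toy version:
# one posting per keyword and document, equality conditions only).
index = {
    "donald": [("z.swx", "R1=duck")],
    "mickey": [("z.swx", "R1=mouse")],
    "likes":  [("z.swx", "true")],
    "daisy":  [("z.swx", "R1=duck")],
    "minnie": [("z.swx", "R1=mouse")],
}

def satisfiable(conds):
    """A conjunction of equalities holds iff no variable gets two values."""
    bindings = {}
    for cond in conds:
        if cond == "true":
            continue
        var, value = cond.split("=")
        if bindings.setdefault(var, value) != value:
            return False   # e.g. R1=duck AND R1=mouse
    return True

def search(*keywords):
    conds_per_doc = {}
    for kw in keywords:
        for doc, cond in index.get(kw.lower(), []):
            conds_per_doc.setdefault(doc, []).append(cond)
    return [doc for doc, conds in conds_per_doc.items()
            if len(conds) == len(keywords) and satisfiable(conds)]

print(search("Donald", "likes", "Minnie"))  # []: R1=duck ^ R1=mouse is false
print(search("Donald", "likes", "Daisy"))   # ['z.swx'], valid under R1=duck
```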
Qualitative Assessment
Expressiveness of rules/patterns
–Good enough for "desktop data"
–Extensible to other data
–Unclear how well it works for general applications (e.g., SAP)
Normalized View
–Size: O(n), with n the size of the raw data
–Generation time: depends on the complexity of the XQuery expressions in the rules; typically O(n)
Predicate-based Inverted File
–Size: O(n), same as traditional inverted files
–Generation time: O(n)
–But the constants can be large
Query Processing
–Polynomial in the number of keywords in the query (~ traditional)
–High constants!
Experiments
Data sets from my personal desktop
– , TWiki, LaTeX, OpenOffice, MS Office, ...
Data-set dependent rules
–different rule sets (here: conversations)
–LaTeX: include, footnote, exclude, ...
–TWiki: versioning, exclude, ...
Hand-crafted queries
–Vary selectivity and the degree to which instances are involved
Measured: sizes of data sets and indexes, precision & recall, query running times
Data Size (TWiki)
Raw Data: 4.77 MB
                          Traditional   Enhanced
  Normalized View (MB)        …             …
  Index (MB)                  …             …
  Creation Time (secs)        …             …
Data Size ( )
Raw Data: 51.46 MB
                          Traditional   Enhanced
  Normalized View (MB)        …             …
  Index (MB)                  …             …
  Creation Time (secs)        …             …
Precision (TWiki)
             Traditional   Enhanced
  Query 1        …             …
  Query 2        …             …
  Query 3        …             …
  Query 4        …             …
Recall is 1 in all cases.
TWiki is an example of "false positives".
Recall ( )
             Traditional   Enhanced
  Query 1        …             …
  Query 2        …             …
  Query 3        …             …
  Query 4        …             …
Precision is 1 in all cases.
This data set is an example of "false negatives".
Response Time in ms (TWiki)
             Traditional   Enhanced
  Query 1        …             …
  Query 2        …             …
  Query 3        …             …
  Query 4        …             …
Enhanced is one order of magnitude slower, but still within milliseconds.
Response Time in ms ( )
             Traditional   Enhanced
  Query 1        …             …
  Query 2        …             …
  Query 3        …             …
  Query 4        …             …
Enhanced is orders of magnitude slower, but still within milliseconds.
Conclusion & Future Work
See data with the eyes of users!
–Give search engines the right glasses
–Flexibility in search: reveal hidden data
–Compressed indexes using predicates
Future Work
–Other apps: e.g., JSP, tagging, Semantic Web
–Consolidate different view definitions (security)
–Search on streaming data