1
Predicate-based Indexing of Annotated Data
Donald Kossmann
ETH Zurich
http://www.dbis.ethz.ch
2
Observations
Data is annotated by apps and humans
–Word: versions, comments, references, layout, ...
–humans: tags on del.icio.us
Applications provide views on the data
–Word: Version 1.3, text without comments, ...
–del.icio.us: provides search on tags + data
Search engines see and index the raw data
–e.g., treat all versions as one document
Search engine's view != user's view
–search engine returns the wrong results
–or hard-code the right search logic (del.icio.us)
3
Desktop Search
[Architecture figure: the Application (e.g., Word, Wiki, Outlook, ...) reads & updates raw files in the File System (e.g., x.doc, y.xls, ...) and presents Views to the User; the Desktop Search Engine (e.g., Spotlight, Google Desktop, ...) crawls & indexes the raw files and answers the User's queries.]
4
Example 1: Bulk Letters
Raw data (x.doc, y.xls, ...):
–x.doc (letter template): "Dear, The meeting is at 12. CU, Donald"
–y.xls (recipient list): Peter, Paul, Mary, ...
View (the generated letters):
–"Dear Peter, The meeting is at 12. CU, Donald"
–"Dear Paul, The meeting is at 12. CU, Donald"
–...
5
Example 1: Traditional Search Engines
Inverted File:
DocId   Keyword   ...
x.doc   Dear      ...
x.doc   Meeting   ...
y.xls   Peter     ...
y.xls   Paul      ...
y.xls   Mary      ...
...     ...       ...
Query: Paul, meeting
–Answer: -        Correct Answer: x.doc
Query: Paul, Mary
–Answer: y.xls    Correct Answer: x.doc
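To make the failure concrete, here is a minimal sketch (my own illustration with hypothetical file contents, not code from the talk) of a conventional inverted index built over the raw files:

  # Build a toy inverted index over the raw files and run conjunctive queries.
  from collections import defaultdict

  raw_files = {
      "x.doc": "Dear, The meeting is at 12. CU, Donald",  # letter template
      "y.xls": "Peter Paul Mary",                          # recipient list
  }

  index = defaultdict(set)                  # keyword -> set of DocIds
  for doc_id, text in raw_files.items():
      for word in text.lower().replace(",", " ").split():
          index[word].add(doc_id)

  def search(*keywords):
      """Return DocIds containing all keywords (conjunctive query)."""
      hits = [index.get(k.lower(), set()) for k in keywords]
      return set.intersection(*hits) if hits else set()

  print(search("Paul", "meeting"))  # set(): no raw file has both words,
                                    # although the merged letter "Dear Paul, ..." does
  print(search("Paul", "Mary"))     # {'y.xls'}: the spreadsheet, not a letter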
6
Example 2: Versioning (OpenOffice)
Raw Data (one file with both versions): Mickey likes Minnie / Donald likes Daisy
Instance 1 (Version 1): Mickey likes Minnie
Instance 2 (Version 2): Donald likes Daisy
7
Example 2: Versioning (OpenOffice)
Inverted File:
DocId   Keyword   ...
z.swx   Mickey    ...
z.swx   likes     ...
z.swx   Minnie    ...
z.swx   Donald    ...
z.swx   Daisy     ...
Query: Mickey likes Daisy
–Answer: z.swx    Correct Answer: -
Query: Mickey likes Minnie
–Answer: z.swx    Correct Answer: z.swx (V1)
8
Example 3: Personalization, Localization, Authorization
Raw Data (contains both alternatives): Donald Daisy Mickey Minnie likes.
Possible views: "Donald likes Daisy." / "Mickey likes Minnie."
9
Example 4: del.icio.us
Tag Table:
user   tag        URL
Joe    business   A
Mary   software   B
Pages:
–http://A.com: "Yahoo builds software."
–http://B.com: "Joe is a programmer at Yahoo."
Query: "Joe, software, Yahoo"
–both A and B are relevant, but in different worlds
–if context info available, choice is possible
10
Example 5: Enterprise Search
Web Applications
–Application defined using "templates" (e.g., JSP)
–Data both in JSP pages and database
–Content = JSP + Java + Database
–Content depends on Context (roles, workflow)
–Links = URL + function + parameters + context
Enterprise Search
–Search: map Content to Link
–Enterprise Search: Content and Link are complex
Example: Search Engine for J2EE PetStore
–(see demo at CIDR 2007)
11
Possible Solutions
Extend Applications with Search Capabilities
–Re-invents the wheel for each application
–Not worth the effort for small apps
–No support for cross-app search
Extend Search Engines
–Application-specific rules for "encoded" data
–"Possible Worlds" semantics of data
–Materialize view, normalize view
–Index normalized view
–Extended query processing
–Challenge: Views become huge!
12
[Architecture figure, extended: same components as the Desktop Search figure, but rules now define the application Views over the raw files, so the Desktop Search Engine crawls & indexes these views rather than only the raw data.]
13
Size of Views
One rule: size of view grows linearly with the size of the document
–e.g., for each version, one instance in the view
–but the constant can be high (e.g., many versions)
Several rules: size of view grows exponentially with the number of rules
–e.g., #versions x #alternatives
Prototype, experiments: Wiki, Office, E-Mail, ...
–about 30 rules; 5-6 applicable per document
–View ~ 1000 x raw data
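As a purely hypothetical illustration of the blow-up: a document with 10 tracked versions and 3 personalization alternatives already has up to 10 x 3 = 30 possible instances, and every further applicable rule multiplies this count again, so with 5-6 applicable rules per document a materialized view around 1000 x the raw data is not surprising.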
14
Solution Architecture
15
Rules and Patterns
–Analogy: operators of relational algebra
–Patterns sufficient for LaTeX, MS Office, OpenOffice, TWiki, E-Mail (Outlook)
16
Normalized View
Raw Data: Donald Daisy Mickey Minnie likes.
Rule: (the personalization rule R1 with key values "duck" and "mouse")
Normalized View: Donald Mickey likes Daisy Minnie.
–"likes" is common to all views; Donald and Daisy are tagged with key "duck", Mickey and Minnie with key "mouse"
17
Normalized View
Raw Data: Mickey Mouse likes Minnie Mouse.
Rule: <version match="//insert" key="//inserted[@id eq $m/@id]/info/@date"/>
Normalized View: Mickey [Mouse, inserted 5/1/2006] likes Minnie [Mouse, inserted 5/16/2006].
General idea: factor out the common parts ("Mickey likes Minnie.") and mark up the variable parts with version tags.
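A rough sketch of the "factor out common parts, mark up variable parts" idea, using Python's difflib as a stand-in for the real normalizer (my own toy illustration, not the system's XQuery-based implementation):

  # Diff two versions of the text; common words are emitted once and
  # unconditionally, inserted words are marked with the version they belong to.
  import difflib

  v1 = "Mickey likes Minnie .".split()                 # Version 1
  v2 = "Mickey Mouse likes Minnie Mouse .".split()     # Version 2

  normalized = []
  for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=v1, b=v2).get_opcodes():
      if op == "equal":                                # common part
          normalized.extend((w, None) for w in v1[i1:i2])
      elif op == "insert":                             # variable part
          normalized.extend((w, "inserted in V2") for w in v2[j1:j2])

  print(normalized)
  # [('Mickey', None), ('Mouse', 'inserted in V2'), ('likes', None),
  #  ('Minnie', None), ('Mouse', 'inserted in V2'), ('.', None)]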
18
Normalization Algorithm
Step 1: Construct Tagging Table
–Evaluate "match" expression
–Evaluate "key" expression
–Compute Operator from Pattern (e.g., > for version)
Step 2: Tagging Table -> Normalized View
–Embrace each match with tags
Tagging Table:
Rule   NodeId   Key Value   Op
R1     19       duck        =
R1     19       mouse       =
R1     22       duck        =
R1     22       mouse       =
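A condensed sketch of Step 1 on the personalization example (toy in-memory data; the node ids 19 and 22 come from the table above, and the assumption that they hold the two groups of alternatives is mine, not from the talk):

  # One tagging-table row per (rule, matched node, key value); the operator is
  # derived from the rule's pattern, e.g. "=" for alternatives, ">" for versions.
  matches = {19: ["duck", "mouse"], 22: ["duck", "mouse"]}  # node -> key values

  def build_tagging_table(rule, op, matches):
      return [(rule, node, key, op)
              for node, keys in matches.items()
              for key in keys]

  for row in build_tagging_table("R1", "=", matches):
      print(row)   # ('R1', 19, 'duck', '='), ('R1', 19, 'mouse', '='), ...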
19
Predicate-based Indexing
Normalized View: Donald Mickey likes Daisy Minnie.
Inverted File:
DocId   Keyword   Condition
z.swx   Donald    R1=duck
z.swx   Mickey    R1=mouse
z.swx   likes     true
z.swx   Daisy     R1=duck
z.swx   Minnie    R1=mouse
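A minimal sketch of how such a predicate-based inverted file could be populated from the normalized view (assumed in-memory representation; the conditions are the ones from the table above):

  from collections import defaultdict

  # Normalized view of z.swx: (keyword, condition) pairs; "true" means the
  # keyword is visible in every possible world.
  normalized_view = [
      ("Donald", "R1=duck"), ("Mickey", "R1=mouse"), ("likes", "true"),
      ("Daisy", "R1=duck"), ("Minnie", "R1=mouse"),
  ]

  index = defaultdict(list)                 # keyword -> [(DocId, condition)]
  for keyword, condition in normalized_view:
      index[keyword.lower()].append(("z.swx", condition))

  print(index["donald"])                    # [('z.swx', 'R1=duck')]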
20
Query Processing
Query "Donald likes Minnie": R1=duck ^ true ^ R1=mouse -> false (no answer)
Query "Donald likes Daisy": R1=duck ^ true ^ R1=duck -> R1=duck (z.swx matches in the "duck" world)
Inverted File:
DocId   Keyword   Condition
z.swx   Donald    R1=duck
z.swx   Mickey    R1=mouse
z.swx   likes     true
z.swx   Daisy     R1=duck
z.swx   Minnie    R1=mouse
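The essential step is checking whether the conjunction of the per-keyword conditions is satisfiable. A sketch of that check (my simplification, assuming all conditions have the form rule=value or "true" as above):

  def satisfiable(conditions):
      """True iff the conjunction does not bind a rule to two different values."""
      bindings = {}
      for cond in conditions:
          if cond == "true":
              continue
          rule, value = cond.split("=")
          if bindings.setdefault(rule, value) != value:
              return False                  # e.g., R1=duck AND R1=mouse
      return True

  print(satisfiable(["R1=duck", "true", "R1=mouse"]))  # False: "Donald likes Minnie"
  print(satisfiable(["R1=duck", "true", "R1=duck"]))   # True:  "Donald likes Daisy"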
21
Qualitative Assessment
Expressiveness of rules / patterns
–Good enough for "desktop data"
–Extensible for other data
–Unclear how well it carries over to general applications (e.g., SAP)
Normalized View
–Size: O(n), with n the size of the raw data
–Generation time: depends on the complexity of the XQuery expressions in the rules; typically O(n)
Predicate-based Inverted File
–Size: O(n), same as traditional inverted files
–Generation time: O(n)
–But the constants can be large
Query Processing
–Polynomial in the number of keywords in the query (as for traditional search)
–High constants!
22
Experiments
Data sets from my personal desktop
–E-Mail, TWiki, LaTeX, OpenOffice, MS Office, ...
Data-set-dependent rules
–E-Mail: different rule sets (here: conversations)
–LaTeX: include, footnote, exclude, ...
–TWiki: versioning, exclude, ...
Hand-crafted queries
–vary selectivity and the degree to which queries involve specific instances
Measured: size of data sets and indexes, precision & recall, query running times
23
Data Size (Twiki)
                        Traditional   Enhanced
Raw Data (MB)           4.77          4.77
Normalized View (MB)    -             4.53
Index (MB)              0.56          1.07
Creation Time (secs)    9.00          9.62
24
Data Size (E-Mail)
                        Traditional   Enhanced
Raw Data (MB)           51.46         51.46
Normalized View (MB)    -             51.77
Index (MB)              2.86          12.61
Creation Time (secs)    106.61        132.62
25
Precision (Twiki)
           Traditional   Enhanced
Query 1    0.985         1
Query 2    0.071         1
Query 3    0.339         1
Query 4    0.875         1
Recall is 1 in all cases. Twiki: example for "false positives".
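For reference, these numbers follow the standard definitions: precision = (relevant answers returned) / (all answers returned), and recall = (relevant answers returned) / (all relevant answers). The traditional index loses precision on Twiki because it returns documents whose keywords never co-occur in any single view (false positives); on E-Mail, below, it loses recall, presumably because keywords that do co-occur in a conversation view are spread over several raw messages (false negatives, as in the bulk-letter example).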
26
Recall (E-Mail)
           Traditional   Enhanced
Query 1    0.322         1
Query 2    0.821         1
Query 3    0.499         1
Query 4    0.5           1
Precision is 1 in all cases. E-Mail: example for "false negatives".
27
Response Time in ms (Twiki)
           Traditional   Enhanced
Query 1    0.201         0.907
Query 2    0.218         1.224
Query 3    0.033         0.122
Query 4    0.054         0.212
Enhanced is one order of magnitude slower, but still within milliseconds.
28
Response Time in ms (E-Mail)
           Traditional   Enhanced
Query 1    0.003         0.864
Query 2    0.004         6.091
Query 3    0.020         1.845
Query 4    0.027         0.055
Enhanced is orders of magnitude slower, but still within milliseconds.
29
Conclusion & Future Work
See the data with the eyes of the users!
–Give search engines the right glasses
–Flexibility in search: reveal hidden data
–Compressed indexes using predicates
Future Work
–Other apps: e.g., JSP, tagging, Semantic Web
–Consolidate different view definitions (security)
–Search on streaming data