Predicate-based Indexing of Annotated Data Donald Kossmann ETH Zurich

Observations
Data is annotated by apps and humans
–Word: versions, comments, references, layout, ...
–Humans: tags on del.icio.us
Applications provide views on the data
–Word: Version 1.3, text without comments, …
–del.icio.us: provides search on tags + data
Search engines see and index the raw data
–E.g., treat all versions as one document
Search engine's view != user's view
–Search engine returns the wrong results
–or hard-code the right search logic (del.icio.us)

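Desktop Search
[Diagram: the user works with applications (e.g., Word, Wiki, Outlook, …), which read & update the file system (e.g., x.doc, y.xls, …) and present views; the desktop search engine (e.g., Spotlight, Google Desktop, …) crawls & indexes the file system and answers the user's queries.]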

Example 1: Bulk Letters
Raw data (x.doc, y.xls, …)
–Letter: Dear, The meeting is at 12. CU, Donald
–Recipients: Peter…, Paul…, Mary…
View
–Dear Peter, The meeting is at 12. CU, Donald …
–Dear Paul, The meeting is at 12. CU, Donald …

Example 1: Traditional Search Engines
Inverted file:
DocId   Keyword   …
x.doc   Dear      …
x.doc   Meeting   …
y.xls   Peter     …
y.xls   Paul      …
y.xls   Mary      …
…       …         …
Query: Paul, meeting
–Answer: -
–Correct answer: x.doc
Query: Paul, Mary
–Answer: y.xls
–Correct answer: x.doc
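The behaviour of a traditional engine on the raw files can be reproduced with a minimal sketch. The file contents come from the bulk-letter example; the tokenizer and the function names (`tokenize`, `build_inverted_file`, `search`) are assumptions made for illustration, not part of any real search engine.

```python
from collections import defaultdict

# Raw files as a traditional crawler sees them (contents from the example).
raw_files = {
    "x.doc": "Dear, The meeting is at 12. CU, Donald",
    "y.xls": "Peter Paul Mary",
}

def tokenize(text):
    """Lower-case the words and strip trailing punctuation (simplified)."""
    return [w.strip(".,").lower() for w in text.split() if w.strip(".,")]

def build_inverted_file(files):
    """Map every keyword to the set of raw documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in files.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, keywords):
    """Conjunctive keyword search: a document must contain all keywords."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

index = build_inverted_file(raw_files)
print(search(index, ["Paul", "meeting"]))  # no raw file holds both words
print(search(index, ["Paul", "Mary"]))     # only the recipient list matches
```

The first query finds nothing because "Paul" and "meeting" only meet in the view (the generated letter), never in a single raw file.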

Example 2: Versioning (OpenOffice)
Raw data: Mickey likes Minnie Donald likes Daisy
Instance 1 (Version 1): Mickey likes Minnie
Instance 2 (Version 2): Donald likes Daisy

Example 2: Versioning (OpenOffice)
Inverted file:
DocId   Keyword   …
z.swx   Mickey    …
z.swx   likes     …
z.swx   Minnie    …
z.swx   Donald    …
z.swx   Daisy     …
Query: Mickey likes Daisy
–Answer: z.swx
–Correct answer: -
Query: Mickey likes Minnie
–Answer: z.swx
–Correct answer: z.swx (V1)

Example 3: Personalization, Localization, Authorization Donald Daisy Mickey Minnie likes. Donald likes Daisy. Mickey likes Minnie. Donald Daisy Mickey Minnie likes.

Example 4: del.icio.us
Tag table:
user   tag        URL
Joe    business   A
Mary   software   B
Page contents: "Yahoo builds software." and "Joe is a programmer at Yahoo."
Query: "Joe, software, Yahoo"
–both A and B are relevant, but in different worlds
–if context info available, choice is possible

Example 5: Enterprise Search
Web Applications
–Application defined using "templates" (e.g., JSP)
–Data both in JSP pages and database
–Content = JSP + Java + Database
–Content depends on Context (roles, workflow)
–Links = URL + function + parameters + context
Enterprise Search
–Search: Map Content to Link
–Enterprise Search: Content and Link are complex
Example: Search Engine for J2EE PetStore
–(see demo at CIDR 2007)

Possible Solutions
Extend Applications with Search Capabilities
–Re-invents the wheel for each application
–Not worth the effort for small apps
–No support for cross-app search
Extend Search Engines
–Application-specific rules for "encoded" data
–"Possible Worlds" Semantics of Data
–Materialize view, normalize view
–Index normalized view
–Extended query processing
–Challenge: Views become huge!

[Diagram: the desktop search architecture (user, applications with views, file system, desktop search engine), extended with rules that provide views to the search engine.]

Size of Views
One rule: size of view grows linearly with size of document
–E.g., for each version, one instance in view
–Constant can be high! (e.g., many versions)
Several rules: size of view grows exponentially with number of rules
–E.g., #versions x #alternatives
Prototype, experiments: Wiki, Office, …
–About 30 rules; 5-6 applicable per document
–View ~ 1000 x raw data
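A small worked calculation illustrates the exponential growth: if every combination of rule choices is materialized, the number of instances is the product of the choices each applicable rule offers. The function name and the concrete numbers below are hypothetical, chosen only to show how a handful of rules reaches a factor of about 1000.

```python
from math import prod

def materialized_instances(choices_per_rule):
    """Instances in a fully materialized view: one per combination of choices."""
    return prod(choices_per_rule)

# Hypothetical document with 10 versions and 3 alternatives.
print(materialized_instances([10, 3]))   # 30

# Hypothetical document with 5 applicable rules offering 4 choices each.
print(materialized_instances([4] * 5))   # 1024
```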

Solution Architecture

Rules and Patterns
Analogy: Operators of relational algebra
Patterns sufficient for LaTeX, MS Office, OpenOffice, TWiki, (Outlook)

Normalized View Donald Mickey likes Daisy Minnie. Donald Daisy Mickey Minnie likes. Raw Data: Rule: Normalized View:

Normalized View <version match=„//insert“ eq /> Mikey =5/1/2006">Mouse likes Minnie =5/16/2006">Mouse. Mikey Mouse likes Minnie Mouse. Raw Data: Rule: Normalized View: General Idea: Factor out common parts: „Mickey likes Minnie.“ Markup variable parts:,

Normalization Algorithm
Step 1: Construct Tagging Table
–Evaluate "match" expression
–Evaluate "key" expression
–Compute Operator from Pattern (e.g., > for version)
Step 2: Tagging Table -> Normalized View
–Embrace each match with tags
Tagging table:
Rule   NodeId   Key Value   Op
R1     19       duck        =
R1     19       mouse       =
R1     22       duck        =
R1     22       mouse       =
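The two steps can be sketched as follows. This is only an illustration under strong assumptions: in the slides the "match" and "key" expressions are XQuery, whereas here they are plain Python callables; the node dictionaries, the `world` attribute, the node ids, and the bracketed tag syntax are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str                        # e.g. "R1"
    match: Callable[[dict], bool]    # stand-in for the "match" expression
    key: Callable[[dict], str]       # stand-in for the "key" expression
    op: str                          # operator derived from the pattern

def build_tagging_table(nodes, rules):
    """Step 1: one row (rule, node id, key value, op) per matching node."""
    table = []
    for node in nodes:
        for rule in rules:
            if rule.match(node):
                table.append((rule.name, node["id"], rule.key(node), rule.op))
    return table

def normalize(nodes, table):
    """Step 2: embrace each matched node with a tag carrying its condition."""
    conditions = {}
    for rule_name, node_id, key_value, op in table:
        conditions.setdefault(node_id, []).append(f"{rule_name}{op}{key_value}")
    parts = []
    for node in nodes:
        text = node["text"]
        for cond in conditions.get(node["id"], []):
            text = f"[{cond}]{text}[/]"   # hypothetical tag syntax
        parts.append(text)
    return " ".join(parts)

# Hypothetical raw document: some words belong to the "duck" or "mouse" world.
nodes = [
    {"id": 1, "text": "Donald", "world": "duck"},
    {"id": 2, "text": "Mickey", "world": "mouse"},
    {"id": 3, "text": "likes"},
    {"id": 4, "text": "Daisy", "world": "duck"},
    {"id": 5, "text": "Minnie", "world": "mouse"},
]
r1 = Rule("R1", lambda n: "world" in n, lambda n: n["world"], "=")

table = build_tagging_table(nodes, [r1])
normalized = normalize(nodes, table)
print(table)
print(normalized)
```

Words shared by all instances ("likes") stay untagged, which is what keeps the normalized view close to the size of the raw data.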

Predicate-based Indexing
Normalized view: Donald Mickey likes Daisy Minnie.
Inverted file:
DocId   Keyword   Condition
z.swx   Donald    R1=duck
z.swx   Mickey    R1=mouse
z.swx   likes     true
z.swx   Daisy     R1=duck
z.swx   Minnie    R1=mouse
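A minimal sketch of building such a predicate-based inverted file, assuming the normalized view is already available as (keyword, condition) pairs per document. That encoding and the name `build_predicate_index` are assumptions for illustration; the keywords and conditions are the ones shown above.

```python
from collections import defaultdict

# Normalized view of z.swx as (keyword, condition) pairs.
# The pair encoding is an assumption of this sketch.
NORMALIZED_VIEW = {
    "z.swx": [
        ("Donald", "R1=duck"),
        ("Mickey", "R1=mouse"),
        ("likes", "true"),
        ("Daisy", "R1=duck"),
        ("Minnie", "R1=mouse"),
    ],
}

def build_predicate_index(view):
    """Map each keyword to its postings: (doc id, condition) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in view.items():
        for keyword, condition in tokens:
            index[keyword.lower()].append((doc_id, condition))
    return dict(index)

index = build_predicate_index(NORMALIZED_VIEW)
for keyword, postings in index.items():
    for doc_id, condition in postings:
        print(doc_id, keyword, condition)
```

Compared with a traditional inverted file, the only addition is the condition stored with each posting.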

Query Processing
Query: Donald likes Minnie
–Conditions: R1=duck ^ true ^ R1=mouse
–Result: false
Query: Donald likes Daisy
–Conditions: R1=duck ^ true ^ R1=duck
–Result: R1=duck
Inverted file:
DocId   Keyword   Condition
z.swx   Donald    R1=duck
z.swx   Mickey    R1=mouse
z.swx   likes     true
z.swx   Daisy     R1=duck
z.swx   Minnie    R1=mouse
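The conjunction of conditions can be sketched as follows. This handles only equality conditions on a rule (the `=` operator); other operators mentioned in the slides, such as `>` for versions, are not covered. The data layout and the names `conjoin` and `query` are assumptions for illustration.

```python
from itertools import product

# Predicate-based inverted file: keyword -> [(doc id, condition)].
# A condition is a (rule, value) pair; None stands for "true".
INDEX = {
    "donald": [("z.swx", ("R1", "duck"))],
    "mickey": [("z.swx", ("R1", "mouse"))],
    "likes":  [("z.swx", None)],
    "daisy":  [("z.swx", ("R1", "duck"))],
    "minnie": [("z.swx", ("R1", "mouse"))],
}

def conjoin(conditions):
    """AND equality conditions; return the bindings, or None if contradictory."""
    bindings = {}
    for cond in conditions:
        if cond is None:          # "true" never constrains anything
            continue
        rule, value = cond
        if bindings.setdefault(rule, value) != value:
            return None           # e.g. R1=duck ^ R1=mouse
    return bindings

def query(index, keywords):
    """Return {doc id: [bindings, ...]} for every satisfiable combination."""
    per_keyword = [index.get(k.lower(), []) for k in keywords]
    results = {}
    for combo in product(*per_keyword):
        docs = {doc for doc, _ in combo}
        if len(docs) != 1:        # all keywords must occur in the same document
            continue
        bindings = conjoin(cond for _, cond in combo)
        if bindings is not None:
            results.setdefault(docs.pop(), []).append(bindings)
    return results

print(query(INDEX, ["Donald", "likes", "Minnie"]))  # contradictory conditions
print(query(INDEX, ["Donald", "likes", "Daisy"]))   # satisfiable with R1=duck
```

A document is returned only when some assignment of rule values satisfies all conditions, and the surviving bindings identify the instance (here the "duck" world) in which the match holds.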

Qualitative Assessment
Expressiveness of rules / patterns
–Good enough for "desktop data"
–Extensible for other data
–Unclear how good for general applications (e.g., SAP)
Normalized View
–Size: O(n), with n the size of the raw data
–Generation Time: depends on complexity of XQuery expressions in rules; typically O(n)
Predicate-based Inverted File
–Size: O(n), same as traditional inverted files
–Generation Time: O(n)
–But constants can be large
Query Processing
–Polynomial in #keywords in query (~ traditional)
–High constants!

Experiments
Data sets from my personal desktop
– , TWiki, LaTeX, OpenOffice, MS Office, …
Data-set dependent rules
– different rule sets (here conversations)
–LaTeX: include, footnote, exclude, …
–TWiki: versioning, exclude, …
Hand-cooked queries
–Vary selectivity, degree that involves instances
Measure size of data sets, indexes, precision & recall, query running times

Data Size (TWiki)
[Table: Traditional vs. Enhanced; rows: Raw Data (MB) 4.77, Normalized View (MB), Index (MB), Creation Time (secs)]

Data Size ( )
[Table: Traditional vs. Enhanced; rows: Raw Data (MB) 51.46, Normalized View (MB), Index (MB), Creation Time (secs)]

Precision (TWiki)
[Table: Traditional vs. Enhanced for four queries]
Recall is 1 in all cases. TWiki: example for "false positives".

Recall ( )
[Table: Traditional vs. Enhanced for four queries]
Precision is 1 in all cases. example for "false negatives".

Response Time in ms (TWiki)
[Table: Traditional vs. Enhanced for four queries]
Enhanced one order of magnitude slower, but still within milliseconds.

Response Time in ms ( )
[Table: Traditional vs. Enhanced for four queries]
Enhanced orders of magnitude slower, but still within milliseconds.

Conclusion & Future Work
See data with the eyes of users!
–Give search engines the right glasses
–Flexibility in search: reveal hidden data
–Compressed indexes using predicates
Future Work
–Other apps: e.g., JSP, Tagging, Semantic Web
–Consolidate different view definitions (security)
–Search on streaming data