The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration
Holger Bast, Ingmar Weber (Max-Planck-Institut für Informatik), CIDR 2007

Presentation transcript:

The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration
Holger Bast, Ingmar Weber
Max-Planck-Institut für Informatik
Conference on Innovative Data Systems Research (CIDR 2007)
Summarized by Jaehui Park, IDS Lab., Seoul National University
Presented by Jaehui Park, IDS Lab., Seoul National University

Copyright © 2008 by CEBT

Introduction

• Interactive search engine: CompleteSearch
  Variety of complex features
  – Automatic query completion (IR perspective)
  – Semi-structured retrieval
  – Semantic search
  – DB-style joins and grouping (DB perspective)
  – Range search (theorist’s perspective)
  – Combining IR-style with DB-style querying
    Example queries
    • ir db integration conference:sigmod author:
    • author[conference:sigmod] ir db integration
• Context-sensitive prefix search and completion
  For a given collection of documents, with a unique id for each document and a unique id for each of the words used in the collection, a context-sensitive prefix search and completion query is a pair (D, W), where D is a set of document ids and W is a range of word ids. To process the query means to compute a ranked list of all pairs (d, w), where word w occurs in document d, d is from D, and w is from W.
  Novel index data structure
  – HYB [SIGIR 2006][SPIRE 2006]
    Using no more space than a state-of-the-art compressed inverted index
    With 10 times faster query processing
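To make "W is a range of word ids" concrete, here is a minimal sketch in Python. It assumes word ids are positions in the lexicographically sorted vocabulary, so that all words sharing a prefix form one contiguous id range; that assumption, the function name `prefix_to_word_range`, and the toy vocabulary are illustrative and not taken from the slides.

```python
from bisect import bisect_left

def prefix_to_word_range(sorted_vocab, prefix):
    """Return (lo, hi) such that the words with ids lo..hi-1 are exactly
    the words in sorted_vocab that start with prefix."""
    # First position whose word is >= prefix in lexicographic order.
    lo = bisect_left(sorted_vocab, prefix)
    hi = lo
    # Words sharing the prefix are contiguous in a sorted vocabulary.
    while hi < len(sorted_vocab) and sorted_vocab[hi].startswith(prefix):
        hi += 1
    return lo, hi

# Hypothetical vocabulary; the id of a word is its index in this list.
vocab = sorted(["database", "db", "integration", "ir",
                "search", "semantic", "sigir", "sigmod"])

lo, hi = prefix_to_word_range(vocab, "sig")
print(lo, hi, vocab[lo:hi])
```

With this convention, a query whose last word is the prefix `sig` becomes a pair (D, W) where W is the id range returned above and D is the set of documents matching the preceding query words.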

Model

• Data is given as documents containing words
  documents have ids (D1, D2, …)
  words have ids (A, B, C, …)
• Query
  given a sorted list of doc ids and a range of word ids

[Figure: example collection of 16 documents, each shown with its id and its words (for example D98: E B A S); query: doc ids D13, D17, D88, … and word range C D E F G H]

Model

• Data is given as documents containing words
  documents have ids (D1, D2, …)
  words have ids (A, B, C, …)
• Query
  given a sorted list of doc ids and a range of word ids
• Answer
  all matching word-in-document pairs with scores

[Figure: example collection
  D98: E B A S      D78: K L S        D53: J D E A      D2: B F A
  D4: K L K A B     D9: E E R         D27: K L D F      D92: P U D E M
  D43: D Q          D32: I L S D H    D1: A O E W H     D88: P A E G Q
  D3: Q D A         D17: B W U K A    D74: J W Q        D13: A O E W H
 Query: doc ids D13, D17, D88, …; word range C D E F G H
 Answer: D13 E …, D88 E …, …, D88 G]
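The query model can be traced on the toy collection from the figure. The sketch below is a naive scan, not the HYB algorithm; the document contents are read off the slide, while the function name `process_query` is an assumption. The slide gives no scores, so the pairs are returned in document order rather than ranked.

```python
# Toy collection from the figure: document id -> words (one letter per word).
DOCS = {
    98: "EBAS", 78: "KLS", 53: "JDEA", 2: "BFA", 4: "KLKAB", 9: "EER",
    27: "KLDF", 92: "PUDEM", 43: "DQ", 32: "ILSDH", 1: "AOEWH",
    88: "PAEGQ", 3: "QDA", 17: "BWUKA", 74: "JWQ", 13: "AOEWH",
}

def process_query(doc_ids, first_word, last_word):
    """Return all (doc id, word) pairs with the doc in doc_ids and the word
    inside the inclusive range first_word..last_word."""
    pairs = []
    for d in sorted(doc_ids):
        # Each distinct word of the document is reported at most once.
        for w in sorted(set(DOCS[d])):
            if first_word <= w <= last_word:
                pairs.append((d, w))
    return pairs

# Query from the figure, restricted to the three doc ids that are shown.
result = process_query([13, 17, 88], "C", "H")
print(result)  # [(13, 'E'), (13, 'H'), (88, 'E'), (88, 'G')]
```

D17 contributes nothing because none of its words (B, W, U, K, A) fall in the range C to H, which matches the answer list on the slide containing only D13 and D88.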

HYB

• Indexing data structure
  Space usage: HYB ~= INV (inverted index)
  – Empirical entropy
    The inherent space complexity of an index
  Processing time: HYB < INV
  – The number of operations needed
  – The latencies of access to data
• Basic idea
  Precompute inverted lists for unions of words
  – The union of all lists for word range W
    W: arbitrary word range
  The basic unit of processing is a block
  – Block => a range of words
    Block consists of all pairs (w,d)
    Each block is sorted by document id
  – Effective gap encoding scheme
    Compressed multiset
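Gap encoding of a block's document ids can be sketched as follows. This only illustrates the general idea of storing differences between consecutive ids of a sorted multiset; the actual HYB encoding (how gaps and word ids are compressed into bits) is not described on the slide, and the function names and sample ids are assumptions.

```python
from itertools import accumulate

def gap_encode(sorted_doc_ids):
    """Replace each id by its difference to the previous id (first id: to 0)."""
    gaps, prev = [], 0
    for d in sorted_doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def gap_decode(gaps):
    """Invert gap_encode with a running sum."""
    return list(accumulate(gaps))

# Hypothetical doc ids of one block, sorted. An id repeats when several
# words of the block's range occur in the same document (a multiset).
doc_ids = [3, 3, 5, 5, 6, 7, 8, 8, 9, 11]
gaps = gap_encode(doc_ids)
print(gaps)  # [3, 0, 2, 0, 1, 1, 1, 0, 1, 2]
```

Because a block merges the lists of many words, its sorted doc ids are dense and the gaps are small numbers, which is what makes them cheap to store.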

HYB (An example)

• 10 documents (ids: 3,5,6,7,8,9,11,12,13,15)
• A block for the word range A-D
  HYB consists of a collection of such blocks
  Word-in-document pairs: (w,d)
• Two operations on the block
  Intersection with a sorted list of document ids
  Intersection with a list of word ids
  For example,
  – Query: ontol sem search
    The sorted list of ids of documents matching ontol sem
    The sorted list of document ids from the blocks containing all occurrences of the word search
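The two block operations can be combined in one sequential pass. The sketch below is an illustration under assumptions: the block content is made up (the slide's figure did not survive), pairs are stored as (doc id, word) sorted by doc id, and the name `intersect_block` is hypothetical.

```python
def intersect_block(block, doc_ids, word_lo, word_hi):
    """Keep the pairs of a block whose doc id is in the sorted list doc_ids
    and whose word lies in the inclusive range word_lo..word_hi.

    Both inputs are sorted by doc id, so one merge-style scan suffices."""
    out = []
    i = 0
    for d, w in block:
        # Advance in doc_ids until its current id is >= the block's doc id.
        while i < len(doc_ids) and doc_ids[i] < d:
            i += 1
        if i == len(doc_ids):
            break  # no candidate documents left
        if doc_ids[i] == d and word_lo <= w <= word_hi:
            out.append((d, w))
    return out

# Hypothetical block for the word range A-D over the ten documents.
block = [(3, "D"), (5, "A"), (5, "C"), (6, "A"), (7, "B"), (8, "A"),
         (9, "D"), (11, "A"), (12, "B"), (13, "A"), (15, "C")]

hits = intersect_block(block, [5, 8, 12, 13], "A", "B")
print(hits)  # [(5, 'A'), (8, 'A'), (12, 'B'), (13, 'A')]
```

Neither list is searched or sorted during the scan; each is read once from front to back.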

HYB

• Block volumes
  The number of pairs (w,d)
  A small fraction of the total number of documents
  – c < 1 (c ~= 0.2)
• Advantages
  Simple
  Can be compressed extremely well
  – It is proven both theoretically and empirically in [SIGIR 2006]
  Enables processing of the prefix search and completion queries
  – By mere sequential access
  – Without sorting or other non-linear operations
  Rank by a precomputed score
  – For each word-in-document pair
    Okapi BM25 score + IDF
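Ranking from precomputed per-pair scores might look like the sketch below. The slide does not say how pair scores are aggregated into scores for completions and documents; summing them is an assumption made here for illustration, as are the function name `rank` and the integer scores.

```python
from collections import defaultdict

def rank(scored_pairs, k=3):
    """Aggregate precomputed (doc id, word, score) triples by summation and
    return the top-k words and the top-k documents."""
    word_score = defaultdict(int)
    doc_score = defaultdict(int)
    for d, w, s in scored_pairs:
        word_score[w] += s
        doc_score[d] += s
    # Highest score first; ties are broken by the id to keep the order stable.
    top_words = sorted(word_score, key=lambda w: (-word_score[w], w))[:k]
    top_docs = sorted(doc_score, key=lambda d: (-doc_score[d], d))[:k]
    return top_words, top_docs

# Hypothetical scores attached to matching word-in-document pairs.
pairs = [(13, "E", 4), (13, "H", 1), (88, "E", 3), (88, "G", 5)]
top_words, top_docs = rank(pairs)
print(top_words, top_docs)  # ['E', 'G', 'H'] [88, 13]
```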

CompleteSearch’s feature set

• Context-sensitive autocompletion search
  Display completions of the last query word that
  – Would lead to good hits, as well as the best hits for any of these completions
  Compute all completions of the last query word
• Google Suggest, Apple’s Spotlight, AlltheWeb Live Search
  We remark that a prototype of our engine already existed when Google Suggest and Apple Spotlight were launched.
  Algorithmically easier
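Context-sensitive autocompletion can be illustrated with a plain inverted index. This is a baseline sketch, not the HYB-based algorithm of the engine; the index content and the function name `autocomplete` are assumptions, ranking is left out, and the query is assumed to be non-empty.

```python
def autocomplete(index, query):
    """Return {completion: sorted hit doc ids} for the last query word,
    restricted to documents that contain all preceding query words."""
    *context, prefix = query.split()
    # D: documents matching every context word (all documents if none).
    D = set().union(*index.values())
    for word in context:
        D = D & index.get(word, set())
    completions = {}
    for word, docs in index.items():
        if word.startswith(prefix):
            hits = docs & D
            if hits:  # only completions that lead to at least one hit
                completions[word] = sorted(hits)
    return completions

# Hypothetical inverted index: word -> set of ids of documents containing it.
index = {
    "ontology": {1, 2},
    "semantic": {2, 3},
    "search": {1, 2, 4},
    "sea": {3},
}

result = autocomplete(index, "ontology se")
print(result)
```

Here `sea` is a completion of `se` but is not offered, because its only document does not contain `ontology`. That is the "context-sensitive" part of the feature.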

CompleteSearch’s feature set

• Structured search in XML documents
  XML tags as special words
  .. (two dots)
  – Proximity operator
  Ex) tag: ..tag:subj..dbworld
  • Retrieve all messages mentioning dbworld in their subject line
• Semantic search
  Tag documents
  – Ex) politician:tony_blair
  – The necessity of semantic annotation (proactive behavior) tagging
  Which politician had a private audience with a pope?
  – Query: audience pope politician:
    Compute ranked list of completions of politician: which occur in the context of audience pope
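The idea of tags as special words can be sketched as follows: each annotation becomes one artificial word of the form category:entity that is indexed alongside the ordinary words, so a semantic query reduces to prefix completion. The three documents, their annotations, and the function names are made up for illustration, and ranking is omitted.

```python
def index_words(text, annotations):
    """Words of a document plus one artificial word per (category, entity)."""
    words = set(text.lower().split())
    return words | {f"{cat}:{ent}" for cat, ent in annotations}

def completions(docs, context, prefix):
    """Map each word starting with prefix to the ids of the documents that
    contain it together with all context words."""
    found = {}
    for doc_id, words in docs.items():
        if all(w in words for w in context):
            for w in words:
                if w.startswith(prefix):
                    found.setdefault(w, []).append(doc_id)
    return found

# Hypothetical annotated documents.
docs = {
    1: index_words("Tony Blair had a private audience with the pope",
                   [("politician", "tony_blair")]),
    2: index_words("Angela Merkel met the pope in a private audience",
                   [("politician", "angela_merkel")]),
    3: index_words("Tony Blair gave a speech on education",
                   [("politician", "tony_blair")]),
}

# Query: audience pope politician:
result = completions(docs, ["audience", "pope"], "politician:")
print(result)
```

Document 3 mentions a politician but neither `audience` nor `pope`, so it does not contribute a completion.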

CompleteSearch’s feature set

• DB-style joins
  Add structured data via special words
  – :
  – conference:sigmod
  DB-style join functionality
  – Intersect the two lists of completions == attribute-value pairs for the join attributes
  – Ex) query -> table:ABC attr_k: and table:XYZ attr_k: (select … where …)
  Something standard IR-style keyword search cannot handle
  – conference:sigir author:
  – conference:sigmod author:
  – The completion of the two queries
    Intersecting the two lists of authors
    • No document is a SIGIR paper and a SIGMOD paper at the same time
  – When the answer is spread over several pages
    Which German chancellors had an audience with the pope?
    • Combine information from the following
      • One page about Angela Merkel (current German chancellor)
      • Another page about the current pope having met Angela Merkel
    • Intersection of the following two queries
      • german chancellor politician:
      • audience pope politician:
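The join by intersecting two completion lists can be sketched on the SIGIR/SIGMOD example. The paper records and author names below are invented, and `completions` is a hypothetical helper that scans the documents directly instead of using an index.

```python
def completions(docs, context, prefix):
    """Set of words starting with prefix that occur in documents containing
    every word of the set context."""
    return {w
            for words in docs.values() if context <= words
            for w in words if w.startswith(prefix)}

# Hypothetical papers, each a set of special attribute:value words.
papers = {
    1: {"conference:sigir", "author:alice", "author:bob"},
    2: {"conference:sigmod", "author:bob", "author:carol"},
    3: {"conference:sigmod", "author:dave"},
}

# Query 1: conference:sigir author:     Query 2: conference:sigmod author:
sigir_authors = completions(papers, {"conference:sigir"}, "author:")
sigmod_authors = completions(papers, {"conference:sigmod"}, "author:")

# The join: authors that appear as completions of both queries.
both = sorted(sigir_authors & sigmod_authors)
print(both)  # ['author:bob']
```

A single keyword query containing both `conference:sigir` and `conference:sigmod` would return nothing here, since no paper carries both words. Intersecting the completion sets is what recovers the answer.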

Lessons Learned

• Locality of Access
  For efficiency
  – Access data as sequentially as possible
    Faster than random access (100 times)
    • average / maximal query time: 0.11 secs / 0.86 secs
  – Process as little data as possible per query
    Extensive use of compression
    • raw data: 426 GB; index size: 4.9 GB (within 1+ε of data entropy)
  – Hardware-aware implementation
    When it comes to algorithms highly optimized for sequential access to data, the choice of programming language is critical: C++
• An interactive web application
  AJAX
• User Feedback
  The vast majority of users is not willing to read even the tiniest bit of documentation
  – Make user interface intuitive and simple

Things to do

• We describe CompleteSearch
  novel data structure and query algorithm for context-sensitive prefix search and completion
  – highly compressible and high locality of access
  – IR-style ranked retrieval
  – DB-style selects and joins
  – Natural blend of the two
  Subsecond query times for up to a terabyte on a single machine
• Things to do
  We have not yet fully exploited the potential of top-k retrieval techniques
  How to deal with dynamics (many insertions or deletions)