
Fast and Intelligent Search In Very Large Amounts of Data
Hannah Bast
Max-Planck-Institute for Informatics, Saarbrücken
Kick-off meeting for Cluster of Excellence Multimodal Computing and Interaction
November 13th, 2008

General theme of my group
Searching for information: Fancy and Fast, On Lots of Data
–terabytes of data, hundreds of millions of documents
–query times in a fraction of a second
–beyond Google-style keyword search
+ always open for other real-world algorithmic problems
–currently: route planning in large transportation networks

Searching for Information
Problems we have recently worked on
–efficient prefix search
–efficient faceted search
–efficient error-tolerant search
–efficient semantic search
–efficient snippet generation
–efficient index construction
–efficient 3D shape retrieval
Our system: the CompleteSearch engine
–efficient
–does all of the above (not the shapes though)
There is a demo this afternoon at 2.30 pm
joint work with the graphics people
joint work with the database people
planned joint work with the CL people
planned: efficient music retrieval

Recent Output
Installations
–CompleteSearch DBLP (several million hits / month)
– uses CompleteSearch (job search)
–many more: mailing list archives, library search, …
Publications
–Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
–Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
–Jan’08: Meyer-Struckmann Award 15,000 €
–Oct’08: Alcatel-Lucent Award 20,000 €
–big press coverage (e.g., it was on the Heise newsticker)

Faceted Search Problem
–Data: objects with ids and labels
–Query: set of object ids
–Answer: multi-set of labels of the respective objects
–This talk: exactly one label per object
Example: objects 1–5 with labels year:2001, year:1997, year:2003, year:2001, year:2008
Query: I = {1, 3, 4} → Answer: {year:2001, year:2003, year:2001}
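
The query semantics of this slide can be sketched in a few lines of Python. This is only an illustration of the problem statement, not the CompleteSearch implementation; the dictionary `labels` and the function name `facet_counts` are hypothetical, and the data simply mirrors the slide's five-object example.

```python
from collections import Counter

# Hypothetical in-memory data: object id -> its single label (as on the slide).
labels = {
    1: "year:2001",
    2: "year:1997",
    3: "year:2003",
    4: "year:2001",
    5: "year:2008",
}

def facet_counts(query_ids, labels):
    """Return the multi-set of labels of the queried objects as a Counter."""
    return Counter(labels[i] for i in query_ids)

# Query I = {1, 3, 4}: year:2001 appears twice, year:2003 once.
answer = facet_counts({1, 3, 4}, labels)
print(answer)
```

A Counter is used because the answer is a multi-set: the same label may be returned several times.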

Faceted Search Problem
–Data: objects with ids and labels
–Query: set of object ids
–Answer: multi-set of labels of the respective objects
–This talk: exactly one label per object
Example: objects 1–5 with labels a_1, a_2, a_3, a_4, a_5
Query: I = {1, 3, 4} → Answer: {a_1, a_3, a_4}
Trivial if labels are in an array in main memory
–but if data is on disk, we have block access to the data
–each read gives us a whole block of B labels (typical: B = 10,000)
–we have to minimize the number of reads / IO operations

IO-efficient Faceted Search
Precomputation:
–given n elements a_1,…,a_n
–organize in array of size N ≥ n
Query:
–given I = {i_1,…,i_m} ⊆ {1,…,n}
–return elements a_{i_1},…,a_{i_m} using as few IOs as possible
Extreme solutions:
–space: n, #IOs: min{n / B, |I|} (optimal space)
–space: B ∙ (n choose B), #IOs: |I| / B (optimal #IOs)
How much space is needed for which IO-efficiency?
Example: n = 8, N = 24, B = 4, I = {1, 6, 8}
–array: a_1 a_2 a_3 a_4 | a_5 a_6 a_7 a_8 | a_4 a_7 a_5 a_3 | a_1 a_8 a_2 a_6 | a_3 a_6 a_4 a_2 | a_7 a_1 a_8 a_5
–get a_1, a_6, a_8 with 1 IO by reading the block a_1 a_8 a_2 a_6
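
The effect of storing elements redundantly can be sketched as follows. This is a minimal illustration under two assumptions that are not stated on the slide: reads are aligned to block boundaries, and blocks are chosen by a greedy set-cover heuristic (which is not guaranteed to be optimal). The function names `blocks_of` and `greedy_reads` are hypothetical; elements are represented by their indices.

```python
def blocks_of(array, B):
    """Split the on-disk array into aligned blocks of B elements (assumed layout)."""
    return [set(array[k:k + B]) for k in range(0, len(array), B)]

def greedy_reads(query, blocks):
    """Pick blocks greedily until every queried element is covered.

    Returns the indices of the blocks that are read. Greedy set cover is a
    heuristic: it may use more reads than the optimum.
    """
    remaining = set(query)
    reads = []
    while remaining:
        best = max(range(len(blocks)), key=lambda k: len(blocks[k] & remaining))
        if not blocks[best] & remaining:
            raise ValueError("query contains elements that are not stored")
        reads.append(best)
        remaining -= blocks[best]
    return reads

# The slide's example: n = 8 elements stored three times (N = 24), B = 4.
array = [1, 2, 3, 4, 5, 6, 7, 8,
         4, 7, 5, 3, 1, 8, 2, 6,
         3, 6, 4, 2, 7, 1, 8, 5]
blocks = blocks_of(array, B=4)

reads_replicated = greedy_reads({1, 6, 8}, blocks)   # uses all three copies
reads_plain = greedy_reads({1, 6, 8}, blocks[:2])    # only the first copy (space n)
print(len(reads_replicated), len(reads_plain))
```

With the three permuted copies, the query I = {1, 6, 8} is covered by a single block; with only the first copy, two reads are needed.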

A simple lower bound
Theorem:
–if we want < |I| IOs for every query I
–we need ≥ n^2 / (4∙B) space
Proof:
1. construct graph G with n vertices and an edge {i, j} iff a_i and a_j can be read in one IO → m ≤ 2B ∙ N
2. by assumption, every I = {i, j} can be read with 1 IO, hence edge {i, j} exists → m ≥ (n choose 2) ≈ n^2 / 2
The short queries alone make the problem hard
Example: n = 4, N = 8, B = 2
[Figure: array a_1 a_2 a_3 a_4 a_1 a_4 a_2 a_3 and the corresponding graph on the vertices a_1, a_2, a_3, a_4]
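
The graph G from the proof can be built explicitly for the small example. The sketch below assumes block-aligned reads (an assumption; the slide does not say whether reads must be aligned), and `co_read_graph` is a hypothetical helper name.

```python
from itertools import combinations

def co_read_graph(array, B):
    """Edges {i, j} such that a_i and a_j share an aligned block of size B."""
    edges = set()
    for k in range(0, len(array), B):
        block = sorted(set(array[k:k + B]))
        edges.update(combinations(block, 2))
    return edges

# Slide example: n = 4, N = 8, B = 2, elements given by their indices.
array = [1, 2, 3, 4, 1, 4, 2, 3]
n, B, N = 4, 2, len(array)

edges = co_read_graph(array, B)
all_pairs = set(combinations(range(1, n + 1), 2))
missing = all_pairs - edges   # two-element queries that still need 2 IOs

print(sorted(edges))
print(sorted(missing))
```

In this layout the graph has only 4 of the 6 possible edges, so the queries {1, 3} and {2, 4} cannot be answered with a single read. Making every pair an edge is what forces the quadratic space.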

Restrict to large queries
Theorem:
–if we want < |I| IOs for all queries with |I| ≥ M
–we need ≥ n^2 / (4∙B∙M) space
Proof sketch:
1. construct graph G as before → m ≤ 2B ∙ N
2. consider arbitrary I with |I| ≥ M → I not independent in G (otherwise |I| IOs necessary) → no independent set larger than M
3. Turán’s theorem implies m ≥ (n choose 2) / M
Example: n = 4, N = 8, B = 2
[Figure: array a_1 a_2 a_3 a_4 a_1 a_4 a_2 a_3 and the corresponding graph on the vertices a_1, a_2, a_3, a_4]
So there is hope for queries of size linear in n, and we indeed have a space-efficient algorithm for that case (but no time to explain it here, sorry)
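
The final step of the proof sketch is left implicit on the slide. Combining the upper bound on the number of edges m from step 1 with the lower bound from step 3 gives the claimed space bound. This is a reconstruction from the two inequalities stated on the slide, using the same approximation (n choose 2) ≈ n^2 / 2:

```latex
\[
  2BN \;\ge\; m \;\ge\; \frac{1}{M}\binom{n}{2} \;=\; \frac{n(n-1)}{2M}
  \quad\Longrightarrow\quad
  N \;\ge\; \frac{n(n-1)}{4BM} \;\approx\; \frac{n^{2}}{4BM}.
\]
```

The same chain with the bound m ≥ (n choose 2) in place of (n choose 2) / M yields the space bound n^2 / (4∙B) of the previous theorem.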

Turán numbers (extremal set theory)
Definition: for n ≥ k ≥ r, T(n, k, r) = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets
For r = 2: minimal number of edges in an n-vertex graph where all independent sets have size < k
Turán’s theorem:
–lim_{n → ∞} T(n, k, r) / (n choose r) exists
–exact value of limit unknown for any k > r > 2
Lower bound:
–T(n, k, r) ≥ (r / k)^(r-1) ∙ (n choose r)
Paul (Pál) Turán, *1910 in Budapest, †1976 in Budapest, Erdős number 1
Very natural application in the context of faceted search!
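
For very small parameters the definition of T(n, k, r) can be checked by exhaustive search. The following is a brute-force sketch written directly from the definition above; it runs in exponential time and is only meant to make the definition concrete. The function name `turan_number` is hypothetical.

```python
from itertools import combinations

def turan_number(n, k, r):
    """Brute-force T(n, k, r): the fewest r-subsets of {1..n} such that
    every k-subset contains at least one of them.

    Exponential time; only usable for tiny n.
    """
    ground = range(1, n + 1)
    r_subsets = [frozenset(c) for c in combinations(ground, r)]
    k_subsets = [frozenset(c) for c in combinations(ground, k)]
    for size in range(len(r_subsets) + 1):
        for family in combinations(r_subsets, size):
            # every k-subset must contain some member of the family
            if all(any(s <= big for s in family) for big in k_subsets):
                return size
    raise AssertionError("unreachable for n >= k >= r")

# r = 2: fewest edges on 5 vertices so that no 3 vertices are independent.
print(turan_number(5, 3, 2))
```

For r = 2 and k = 3 on five vertices the minimum is attained by a triangle plus a disjoint edge, which matches the graph interpretation given on the slide.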

Route Planning
Route planning in road networks
–from a single source to a single target (point-to-point)
–weighted graph, edge costs = travel times
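
The point-to-point problem on a weighted graph can be sketched with plain Dijkstra's algorithm. This is the textbook baseline, not transit node routing; the toy network `roads` and the function name `shortest_travel_time` are hypothetical.

```python
import heapq

def shortest_travel_time(graph, source, target):
    """Dijkstra's algorithm with early termination at the target.

    graph: dict mapping node -> list of (neighbor, travel_time), travel_time >= 0.
    Returns the minimum total travel time, or None if the target is unreachable.
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return None

# Hypothetical toy road network; edge costs are travel times in minutes.
roads = {
    "A": [("B", 7), ("C", 9)],
    "B": [("D", 15)],
    "C": [("D", 11)],
    "D": [],
}

print(shortest_travel_time(roads, "A", "D"))
```

On large road networks this baseline explores a large part of the graph for each query, which is why precomputation-based schemes are attractive.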

Transit Node Routing
We invented transit node routing
–100 times faster than previous best scheme
–Oct’08 SaarLB Award € (together with Stefan Funke, now University of Greifswald)
–integration with previous best scheme published in Science (joint work with P. Sanders and D. Schultes, Uni Karlsruhe)
–big press coverage
–we are currently trying to market the idea (via Algorithmic Solutions, a spin-off from MPII D1)
There is a demo this afternoon at 2.00 pm

Google Transit
I am at Google in Zürich
–as “visiting scientist”
–great experience; I can highly recommend it
–one of my projects there is Google Transit
–public transportation networks are completely different from road networks: both can be modeled as graphs, but that is about where the similarity ends
–the scale is an even bigger challenge there: one node per arrival / departure event
–will publish what I have done at the end of the year
Thank you!

Precomputation of the transit nodes
From distances to paths
[Figure: routes from start to destination, labeled 24 min, 20 min, 23 min]

Overview
How I work
Information retrieval
–overview of problems & results
–our CompleteSearch engine
–recent result: faceted search
Route planning
–ultrafast routing in road networks
–public transportation
Google

Recent Output
Installations
–CompleteSearch DBLP (several million hits / month)
– uses CompleteSearch (job search)
–many more: mailing list archives, library search, …
Publications
–Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
–Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
–Jan’08: Meyer-Struckmann Award 15,000 €
–Oct’08: Alcatel-Lucent Award 20,000 €
–Jul’09 : ,000 €

How I work
I grew up in theoretical computer science
–well-defined, standard problems
–the goals are theorems
–the more difficult / original, the better
–often art for art’s sake
–good for learning the art of clear & precise thinking
Then I moved to more applied problems
–work starts with a real problem
–finding the right abstraction is half of the challenge
–think about it, but keep in mind the real problem
–implement + experiment
–build a system and use it / let it be used
Necessity is the mother of all inventions