Fast and Intelligent Search in Very Large Amounts of Data
Hannah Bast
Max-Planck-Institute for Informatics, Saarbrücken
Kick-off meeting for the Cluster of Excellence Multimodal Computing and Interaction
November 13th, 2008
General theme of my group
Searching for information: Fancy and Fast, On Lots of Data
Terabytes of data, hundreds of millions of documents
Query times in a fraction of a second
Beyond Google-style keyword search
+ always open for other real-world algorithmic problems
currently: route planning in large transportation networks
Searching for Information
Problems we have recently worked on
–efficient prefix search
–efficient faceted search
–efficient error-tolerant search
–efficient semantic search
–efficient snippet generation
–efficient index construction
–efficient 3D shape retrieval
Our system: the CompleteSearch engine
–efficient
–does all of the above (not the shapes though)
There is a demo this afternoon at 2.30 pm
Slide annotations: joint work with the graphics people; joint work with the database people; planned joint work with the CL people; planned: efficient music retrieval
Recent Output
Installations
–CompleteSearch DBLP (several million hits / month)
– uses CompleteSearch (job search)
–many more: mailing list archives, library search, …
Publications
–Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
–Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
–Jan’08: Meyer-Struckmann Award 15,000 €
–Oct’08: Alcatel-Lucent Award 20,000 €
–big press coverage (e.g., it was on the Heise newsticker)
Faceted Search Problem
–Data: objects with ids and labels
–Query: set of object ids
–Answer: multi-set of labels of the respective objects
–This talk: exactly one label per object
[Figure: five objects with labels year:2001, year:1997, year:2003, year:2001, year:2008]
Query: I = {1, 3, 4}
Answer: {year:2001, year:2003, year:2001}
Faceted Search Problem
–Data: objects with ids and labels
–Query: set of object ids
–Answer: multi-set of labels of the respective objects
–This talk: exactly one label per object
[Figure: array of labels $a_1, a_2, a_3, a_4, a_5$]
Query: I = {1, 3, 4}
Answer: {$a_1$, $a_3$, $a_4$}
Trivial if labels are in an array in main memory
–but if data is on disk, we have block access to the data
–each read gives us a whole block of B labels (typical: B = 10,000)
–we have to minimize the number of reads / IO operations
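The block-access cost model above can be made concrete with a minimal sketch. It assumes, hypothetically, that label $a_i$ sits at position i - 1 of a single on-disk array cut into aligned blocks of B labels; the function names `blocks_touched` and `read_labels` are illustrative and not part of CompleteSearch.

```python
def blocks_touched(query_ids, block_size):
    """Return the sorted block numbers holding the labels of the queried objects.

    Assumption: label a_i is stored at 0-based position i - 1 of one array,
    which is cut into aligned blocks of `block_size` labels.
    """
    return sorted({(i - 1) // block_size for i in query_ids})


def read_labels(labels, query_ids, block_size):
    """Simulate answering a query with block reads; return (answer, number of IOs)."""
    wanted = set(query_ids)
    blocks = blocks_touched(wanted, block_size)
    answer = {}
    for b in blocks:
        start = b * block_size
        block = labels[start:start + block_size]  # one IO delivers a whole block
        for offset, label in enumerate(block):
            object_id = start + offset + 1
            if object_id in wanted:
                answer[object_id] = label
    # The answer is a multi-set: equal labels are reported once per object.
    return [answer[i] for i in sorted(wanted)], len(blocks)


LABELS = ["year:2001", "year:1997", "year:2003", "year:2001", "year:2008"]
answer, ios = read_labels(LABELS, {1, 3, 4}, block_size=2)
```

With the five example labels and a toy block size of 2, the query I = {1, 3, 4} touches two blocks, so it costs two IOs in this plain layout.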
IO-efficient Faceted Search
Precomputation:
–given n elements $a_1, \dots, a_n$
–organize in array of size $N \ge n$
Query:
–given $I = \{i_1, \dots, i_m\} \subseteq \{1, \dots, n\}$
–return elements $a_{i_1}, \dots, a_{i_m}$ using as few IOs as possible
Extreme solutions:
–space: $n$, #IOs: $\min\{n / B, |I|\}$ (optimal space)
–space: $B \cdot \binom{n}{B}$, #IOs: $|I| / B$ (optimal #IOs)
How much space is needed for which IO-efficiency?
[Figure: example with n = 8, N = 24, B = 4. Array contents: $a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8$ | $a_4 a_7 a_5 a_3 a_1 a_8 a_2 a_6$ | $a_3 a_6 a_4 a_2 a_7 a_1 a_8 a_5$. Query I = {1, 6, 8}: get $a_1, a_6, a_8$ with 1 IO by reading the block $a_1 a_8 a_2 a_6$.]
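The n = 8, N = 24, B = 4 example can be checked with a small brute-force sketch. It assumes aligned blocks of B array cells and represents each element $a_i$ by its index i; `split_into_blocks` and `min_ios` are hypothetical helper names, and the exhaustive search is only meant for toy inputs, not as the space-efficient algorithm mentioned in the talk.

```python
from itertools import combinations


def split_into_blocks(layout, block_size):
    """Cut the array into aligned blocks; each block is the set of ids it holds."""
    return [set(layout[p:p + block_size])
            for p in range(0, len(layout), block_size)]


def min_ios(layout, block_size, query):
    """Smallest number of block reads that covers a non-empty query (brute force)."""
    query = set(query)
    useful = [b & query for b in split_into_blocks(layout, block_size) if b & query]
    # |query| reads always suffice if every queried id occurs in the layout.
    for k in range(1, len(query) + 1):
        for chosen in combinations(useful, k):
            if set().union(*chosen) == query:
                return k
    return None  # some queried id is not stored anywhere


# Array contents from the slide example: three arrangements of a_1..a_8.
LAYOUT = [1, 2, 3, 4, 5, 6, 7, 8,
          4, 7, 5, 3, 1, 8, 2, 6,
          3, 6, 4, 2, 7, 1, 8, 5]

one_io = min_ios(LAYOUT, 4, {1, 6, 8})        # block (1, 8, 2, 6) holds all three
plain = min_ios(LAYOUT[:8], 4, {1, 6, 8})     # space-optimal layout needs more
```

Tripling the space lets this particular query finish in one IO, whereas the plain layout of size n needs two.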
A simple lower bound
Theorem:
–if we want < |I| IOs for every query I
–we need ≥ $n^2 / (4 \cdot B)$ space
Proof:
1. construct graph G with n vertices and m edges: edge {i, j} iff $a_i$ and $a_j$ can be read in one IO; hence $m \le 2B \cdot N$
2. by assumption, every I = {i, j} can be read with 1 IO, hence edge {i, j} exists; hence $m \ge \binom{n}{2} \approx n^2 / 2$
Together: $2B \cdot N \ge n^2 / 2$, that is, $N \ge n^2 / (4 \cdot B)$
The short queries alone make the problem hard
[Figure: example with n = 4, N = 8, B = 2; labels shown: $a_1 a_2 a_3 a_4$ | $a_1 a_4 a_2 a_3$ | $a_1 a_2 a_4 a_3$]
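The graph G from the proof can be built explicitly for a toy layout. The sketch below assumes aligned blocks and uses a hypothetical array for n = 4, N = 8, B = 2 (one possible reading of the slide figure, not a confirmed one); `coread_edges` and `uncovered_pairs` are illustrative names.

```python
from itertools import combinations


def coread_edges(layout, block_size):
    """Edges {i, j} of G: a_i and a_j lie in a common aligned block (one IO)."""
    edges = set()
    for p in range(0, len(layout), block_size):
        block = set(layout[p:p + block_size])
        edges.update(frozenset(pair) for pair in combinations(sorted(block), 2))
    return edges


def uncovered_pairs(layout, block_size, n):
    """Two-element queries that still need two IOs under this layout."""
    edges = coread_edges(layout, block_size)
    return [pair for pair in combinations(range(1, n + 1), 2)
            if frozenset(pair) not in edges]


# Hypothetical layout with n = 4, N = 8, B = 2: blocks (1,2) (3,4) (1,4) (2,3).
SMALL = [1, 2, 3, 4, 1, 4, 2, 3]
# Layout that stores every pair in its own block: N = 12.
ALL_PAIRS = [1, 2, 1, 3, 1, 4, 2, 3, 2, 4, 3, 4]

missing = uncovered_pairs(SMALL, 2, 4)
complete = uncovered_pairs(ALL_PAIRS, 2, 4)
```

With N = 8 the graph has only four of the six possible edges, so the queries {1, 3} and {2, 4} still cost two IOs; serving every pair in one IO takes N = 12 here.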
Restrict to large queries
Theorem:
–if we want < |I| IOs for all queries with |I| ≥ M
–we need ≥ $n^2 / (4 \cdot B \cdot M)$ space
Proof sketch:
1. construct graph G as before; hence $m \le 2B \cdot N$
2. consider arbitrary I with |I| ≥ M: I is not independent in G (otherwise |I| IOs necessary); hence no independent set larger than M
3. Turán’s theorem implies $m \ge \binom{n}{2} / M$
[Figure: example with n = 4, N = 8, B = 2; labels shown: $a_1 a_2 a_3 a_4$ | $a_1 a_4 a_2 a_3$ | $a_1 a_2 a_4 a_3$]
So there is hope for queries of size linear in n, and we indeed have a space-efficient algorithm for that case (but no time to explain it here, sorry)
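Spelled out, the proof sketch chains the two edge bounds as follows. This is a sketch that treats $\binom{n}{2} \approx n^2 / 2$ and ignores rounding:

```latex
\[
  \frac{1}{M}\binom{n}{2} \;\le\; m \;\le\; 2B \cdot N
  \quad\Longrightarrow\quad
  N \;\ge\; \frac{n(n-1)}{4 \cdot B \cdot M} \;\approx\; \frac{n^2}{4 \cdot B \cdot M}.
\]
```

For queries of linear size, say $M = c \cdot n$ for a constant $c$, the bound drops to about $n / (4 \cdot B \cdot c)$, which is below the trivial space $n$ and therefore rules nothing out.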
Turán numbers (extremal set theory)
Definition: for n ≥ k ≥ r,
$T(n, k, r)$ = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets
For r = 2: minimal number of edges in an n-vertex graph where all independent sets have size < k
Turán’s theorem:
–$\lim_{n \to \infty} T(n, k, r) / \binom{n}{r}$ exists
–exact value of limit unknown for r ≥ 3
Lower bound
–$T(n, k, r) \ge (r / k)^{r-1} \cdot \binom{n}{r}$
Paul (Pál) Turán, *1910 in Budapest, †1976 in Budapest, Erdős number 1
Very natural application in the context of faceted search!
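For very small parameters, the definition of T(n, k, r) can be evaluated directly by exhaustive search. The sketch below is only an illustration of the definition; `turan_number` is a hypothetical helper name, and the running time grows exponentially, so it is usable for toy values of n only.

```python
from itertools import combinations


def turan_number(n, k, r):
    """Brute-force T(n, k, r) for n >= k >= r: the fewest r-subsets of {1..n}
    such that every k-subset contains at least one of them."""
    vertices = range(1, n + 1)
    r_subsets = [frozenset(s) for s in combinations(vertices, r)]
    k_subsets = [frozenset(s) for s in combinations(vertices, k)]
    for size in range(len(r_subsets) + 1):
        for family in combinations(r_subsets, size):
            # `c <= ks` is the subset test on frozensets.
            if all(any(c <= ks for c in family) for ks in k_subsets):
                return size
    return None  # unreachable for n >= k >= r: all r-subsets always work


t_4_3_2 = turan_number(4, 3, 2)
t_5_3_2 = turan_number(5, 3, 2)
```

For r = 2 and k = 3 this asks for the fewest edges such that every three vertices span an edge; two disjoint cliques achieve it, which gives 2 edges for n = 4 and 4 edges for n = 5.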
Route Planning
Route planning in road networks
–from a single source to a single target (point-to-point)
–weighted graph, edge costs = travel times
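As a baseline for the point-to-point problem stated above, here is a minimal sketch of plain Dijkstra search on a graph whose edge costs are travel times. This is the textbook algorithm, not transit node routing; the tiny road network and the name `travel_time` are made up for illustration.

```python
import heapq


def travel_time(graph, source, target):
    """Shortest travel time from source to target with Dijkstra's algorithm.

    `graph` maps each node to a list of (neighbor, minutes) pairs with
    non-negative minutes. Returns None if the target is unreachable.
    """
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        minutes, node = heapq.heappop(queue)
        if node == target:
            return minutes
        if minutes > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter way to `node` was found
        for neighbor, cost in graph.get(node, []):
            candidate = minutes + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return None


# Hypothetical road network; edge costs are travel times in minutes.
ROADS = {
    "A": [("B", 10), ("C", 24)],
    "B": [("C", 10)],
    "C": [],
}
fastest = travel_time(ROADS, "A", "C")
```

In this toy network the direct road from A to C takes 24 minutes, while the detour via B takes 20 minutes and is returned as the answer.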
Transit Node Routing
We invented transit node routing
–100 times faster than previous best scheme
–Oct’08 SaarLB Award € (together with Stefan Funke, now University of Greifswald)
–integration with previous best scheme published in Science (joint work with P. Sanders and D. Schultes, Uni Karlsruhe)
–big press coverage
–we are currently trying to market the idea (via Algorithmic Solutions, a spin-off from MPII D1)
There is a demo this afternoon at 2.00 pm
Google Transit
I am at Google in Zürich
–as “visiting scientist”
–great experience; I can highly recommend it
–one of my projects there is Google Transit
–public transportation networks are completely different from road networks: both can be modeled as graphs, but that is about where the similarity ends
–the scale is an even bigger challenge there: one node per arrival / departure event
–will publish what I have done at the end of the year
Thank you!
Precomputation of the transit nodes
From distances to paths
[Figure: routes from Start to Destination with travel times 24 min, 20 min, 23 min]
Overview
How I work
Information retrieval
–overview of problems & results
–our CompleteSearch engine
–recent result: faceted search
Route planning
–ultrafast routing in road networks
–public transportation Google
Recent Output
Installations
–CompleteSearch DBLP (several million hits / month)
– uses CompleteSearch (job search)
–many more: mailing list archives, library search, …
Publications
–Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
–Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
–Jan’08: Meyer-Struckmann Award 15,000 €
–Oct’08: Alcatel-Lucent Award 20,000 €
–Jul’09 : ,000 €
How I work
I grew up in theoretical computer science
–well-defined, standard problems
–the goals are theorems
–the more difficult / original, the better
–often art for art’s sake
–good to learn the art of clear & precise thinking
Then I moved to more applied problems
–work starts with a real problem
–finding the right abstraction is half of the challenge
–think about it, but keep in mind the real problem
–implement + experiment
–build a system and use it / let it be used
Necessity is the mother of invention