Information Retrieval WS 2012 / 2013 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Sequential Minimal Optimization Advanced Machine Learning Course 2012 Fall Semester Tsinghua University.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Analysis of Algorithms
Counting the bits Analysis of Algorithms Will it run on a larger problem? When will it fail?
Theory of Computing Lecture 3 MAS 714 Hartmut Klauck.
MS 101: Algorithms Instructor Neelima Gupta
Lecture 12: Revision Lecture Dr John Levine Algorithms and Complexity March 27th 2006.
Fundamentals of Python: From First Programs Through Data Structures
Advanced Topics in Algorithms and Data Structures Page 1 Parallel merging through partitioning The partitioning strategy consists of: Breaking up the given.
Computational problems, algorithms, runtime, hardness
CSE332: Data Abstractions Lecture 12: Introduction to Sorting Tyler Robison Summer
Randomized Algorithms and Randomized Rounding Lecture 21: April 13 G n 2 leaves
DAST, Spring © L. Joskowicz 1 Data Structures – LECTURE 1 Introduction Motivation: algorithms and abstract data types Easy problems, hard problems.
Parallel Merging Advanced Algorithms & Data Structures Lecture Theme 15 Prof. Dr. Th. Ottmann Summer Semester 2006.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
Data Structures, Spring 2004 © L. Joskowicz 1 Data Structures – LECTURE 5 Linear-time sorting Can we do better than comparison sorting? Linear-time sorting.
DAST, Spring © L. Joskowicz 1 Data Structures – LECTURE 1 Introduction Motivation: algorithms and abstract data types Easy problems, hard problems.
Lower Bounds for Comparison-Based Sorting Algorithms (Ch. 8)
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
Sorting in Linear Time Lower bound for comparison-based sorting
Analysis CS 367 – Introduction to Data Structures.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Today  Table/List operations  Parallel Arrays  Efficiency and Big ‘O’  Searching.
Chapter 12 Recursion, Complexity, and Searching and Sorting
Complexity of algorithms Algorithms can be classified by the amount of time they need to complete compared to their input size. There is a wide variety:
Jessie Zhao Course page: 1.
David Luebke 1 10/13/2015 CS 332: Algorithms Linear-Time Sorting Algorithms.
CSC 41/513: Intro to Algorithms Linear-Time Sorting Algorithms.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Advance Data Structure 1 College Of Mathematic & Computer Sciences 1 Computer Sciences Department م. م علي عبد الكريم حبيب.
COMP Recursion, Searching, and Selection Yi Hong June 12, 2015.
CS 61B Data Structures and Programming Methodology July 28, 2008 David Sun.
Fall 2015 Lecture 4: Sorting in linear time
Analysis of algorithms Analysis of algorithms is the branch of computer science that studies the performance of algorithms, especially their run time.
Asymptotic Notation (O, Ω, )
CSC 211 Data Structures Lecture 13
Mudasser Naseer 1 11/5/2015 CSC 201: Design and Analysis of Algorithms Lecture # 8 Some Examples of Recursion Linear-Time Sorting Algorithms.
Review 1 Arrays & Strings Array Array Elements Accessing array elements Declaring an array Initializing an array Two-dimensional Array Array of Structure.
Sorting: Implementation Fundamental Data Structures and Algorithms Klaus Sutner February 24, 2004.
1 Algorithms  Algorithms are simply a list of steps required to solve some particular problem  They are designed as abstractions of processes carried.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Prof. Amr Goneid, AUC1 Analysis & Design of Algorithms (CSCE 321) Prof. Amr Goneid Department of Computer Science, AUC Part 1. Complexity Bounds.
2IS80 Fundamentals of Informatics Fall 2015 Lecture 5: Algorithms.
Chapter 9 Sorting. The efficiency of data handling can often be increased if the data are sorted according to some criteria of order. The first step is.
Algorithm Analysis (Big O)
Searching and Sorting Searching: Sequential, Binary Sorting: Selection, Insertion, Shell.
27-Jan-16 Analysis of Algorithms. 2 Time and space To analyze an algorithm means: developing a formula for predicting how fast an algorithm is, based.
E.G.M. PetrakisAlgorithm Analysis1  Algorithms that are equally correct can vary in their utilization of computational resources  time and memory  a.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
CS321 Data Structures Jan Lecture 2 Introduction.
1 Ch. 2: Getting Started. 2 About this lecture Study a few simple algorithms for sorting – Insertion Sort – Selection Sort (Exercise) – Merge Sort Show.
CES 592 Theory of Software Systems B. Ravikumar (Ravi) Office: 124 Darwin Hall.
Lecture 2 Algorithm Analysis Arne Kutzner Hanyang University / Seoul Korea.
Searching CSE 103 Lecture 20 Wednesday, October 16, 2002 prepared by Doug Hogan.
Program Performance 황승원 Fall 2010 CSE, POSTECH. Publishing Hwang’s Algorithm Hwang’s took only 0.1 sec for DATASET1 in her PC while Dijkstra’s took 0.2.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
CS6045: Advanced Algorithms Sorting Algorithms. Sorting So Far Insertion sort: –Easy to code –Fast on small inputs (less than ~50 elements) –Fast on nearly-sorted.
Sorting and Runtime Complexity CS255. Sorting Different ways to sort: –Bubble –Exchange –Insertion –Merge –Quick –more…
Algorithmic Foundations COMP108 COMP108 Algorithmic Foundations Algorithm efficiency Prudence Wong
Lecture 2 Algorithm Analysis
Week 9 - Monday CS 113.
May 17th – Comparison Sorts
COMP108 Algorithmic Foundations Algorithm efficiency
Sorting by Tammy Bailey
Chapter 12: Query Processing
13 Text Processing Hongfei Yan June 1, 2016.
Algorithm design and Analysis
Searching, Sorting, and Asymptotic Complexity
Presentation transcript:

Information Retrieval WS 2012 / 2013 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture 3, Wednesday November 7 th, 2012 (List intersection: fancy algorithms + lower bounds)

Overview of this lecture Organizational –Your experiences with Exercise Sheet 2 (Ranking) List intersection –Key algorithm in every search engine –Some algorithm engineering tips –Fancier algorithms + runtime analysis –Matching lower bound –Exercise Sheet 3: Implement the asymptotically optimal algorithm Is it really faster than the simple linear-time algorithm? Some optional theoretical tasks, good for exam preparation 2

Experiences with ES2 (ranking) Summary / excerpts last checked November 7, 15:43 –Feasible in reasonable time for most –Manual relevance assessment is quite time-consuming –Wiki table for comparing results please, like last semester Ok, we will have one again for this exercise ! –Some of you are quite passionate about the coding Great, I hope you become addicted to it ! –Lecture pen too thick / too many colors... let's see today –How to write unit tests... I'll live code an example today –More coding, less theory please... ok, ok –Great recordings / streamed from iPhone to TV... yeah! 3

Your results for ES2 (ranking) Some interesting observations –Binary (k = 0, b = 0) was clearly the worst –In the tf.idf list, some linguists and other theory guys Many occurrences of either "relativity" or "theory" can boost the score –Albert Einstein was top for tf.idf, but not even in the top- 10 for BM25 with standard settings BM25 favors short articles that are mainly about rel. theo. To get AE at the top, a global document score (like the PageRank for web pages) would be more appropriate... later lecture 4

List Intersection 1/5 Let's first revisit the linear-time algorithm... –... and see what we can improve there –Java: ArrayList much worse than native int[] –C++: vector is as good as int[] with option –O3 –Minimize the number of branches within small-bodied loops –Note 1: hard to predict / understand some of the effects To understand what is really going on, look at the machine code (in C++) or byte code (Java)... see seminar Java vs. C++ on Wiki –Note 2: when comparing algorithms, always repeat each run at least 3 times, to make caching effects transparent 5

List Intersection 2/5 Algorithmic improvement 1 –Call the smaller list A, and the longer list B –Search the elements from A in B, using binary search k = #elements in A, n = #elements in B –This has time complexity Θ(k ∙ log n) 6

List Intersection 3/5 Algorithmic improvement 2 –Observation: when we have located element A[i] in B, that is, we have found j with B[j-1] < A[i] ≤ B[j] then for all elements A[i'] with i' > i, it suffices to look at elements B[j'] with j' ≥ j –Time complexity in the best case is Θ(log n) –Time complexity in the worst case is Θ(k ∙ log n) –Time complexity in the average case is Θ(k ∙ log n/k) 7

List Intersection 4/5 Algorithmic improvement 3 –Goal: when elements A[i] and A[i+1] are located at postions j 1 and j 2 in B, then, with d:= j 2 – j 1 ("gap") spend only time Θ(log d) to locate element A[i+1] –Idea: first do an exponential search, to get an upper bound on the range, then a binary search as before 8

List Intersection 5/5 Time complexity of this algorithm –Let d 1,..., d k be the gaps between the locations of the k elements of A in B d 1 = from beginning to first location –Note that Σ i d i ≤ n = the number of elements in B –Then the time complexity is O(Σ i log d i ) –Goal: find a formula that is independent of the d i –Idea: maximize Σ i log d i under the constraint Σ i d i ≤ n –This is called optimization with side constraints or Lagrangian optimization... next slide 9

Lagrangian Optimization By example of the task from the previous slide (there is an optional theoretical exercise, where you can practice the technique for yourself) 10

Skip Pointers A heuristic approach to increase performance –Idea: place pointers at "strategic" positions in the list, which allow to "skip" large parts during intersection –Question: where to place how many pointers? 11

Lower bounds 1/5 Let's first look at both union and intersection –For list union, the simple linear-time algorithm has a time complexity of Θ(k + n) This is optimal, because the result of the union contains k + n elements, and every algorithm has to output them –For list intersection, the exponential-binary-search algorithm has a time complexity of Θ(k ∙ log (n/k)) Exercise (optional): prove that k ∙ log (n/k) never worse than n + k (and usually much better, e.g. for k = const) But maybe we can do even better? (Note: the output size is at most k for intersection) 12

Lower bounds 2/5 Recall: lower bound for comparison-based sorting –There are n! possible outputs –The algorithm has to distinguish between all of them –Each comparison distinguishes between two cases –Hence we need at least log 2 n! comparisons –By Stirling’s formula (n/e) n ≤ n! ≤ n n –Hence log 2 n! ~ n ∙ log n –Hence every comparison-based sorting algorithm has a running time of Ω(n ∙ log n) –Note: not true for non-comparison based algorithms For example, 0-1 sequences can be sorted by counting 13

Lower bounds 3/5 Similar argument for intersecting two lists A and B –As before, k = #elements in A, n = #elements in B –How many different ways are there to locate the k elements from A within the n elements from B –Observation: each such way corresponds to a tuple (j 1,…, j k ) where 0 ≤ j 1 ≤ … ≤ j k ≤ n j i is simply the location of the i-th element of A in B location 0 means before the first element location i > 0 means after the i-th element How many such tuples are there? 14

Lower bounds 4/5 There is a similar quantity which is easy to count –The number of tuples (i 1,…, i k ) where 1 ≤ i 1 < … < i k ≤ m –This is just the number of size-k subsets of {1,…,m} –Their number is just m over k = m! / (k! ∙ (m – k)!) ≥ (m/k) k –Let us relate this to the number we are interested in: The number of tuples (i 1,…, i k ) where 0 ≤ i 1 ≤ … ≤ i k ≤ n 15

Lower bounds 5/5 Now it's easy to derive the lower bound –By the previous arguments, the number of ways to locate the k elements from A in the n element of B is m over k ≥ (m/k) k, where m = n + k –With the same argument as in the sorting lower bound, any comparison-based algorithm hence has to make log 2 (m over k) ≥ log 2 (m/k) k comparisons –This is k ∙ log 2 (1+n/k) and hence Ω(k ∙ log 2 (n/k)) 16

References In the Raghavan/Manning/Schütze textbook Section 2.3: Faster intersection with skip pointers Relevant Papers A simple algorithm for merging two linearly ordered sets F.K. Hwang and S. Lin SICOMP 1(1):31–39, 1980 A fast set intersection algorithm for sorted sequences R. Baeza-Yates CPM, LNCS 3109, 31–39, 2004 Relevant Wikipedia articles 17

18