
Exact Nearest Neighbor Algorithms

Sabermetrics Who is the next Derek Jeter? Derek Jeter, one of the best players ever: .310 batting average, 3,465 hits, 260 home runs, 1,311 RBIs, 14x All-Star, 5x World Series winner. Source: Wikipedia

Sabermetrics A classic example of a nearest neighbor application. Hits, home runs, RBIs, etc. are dimensions in "baseball-space," and every individual player is a unique point in this space. The problem reduces to finding the closest point in this space.

POI Suggestions A simpler example, in only a 2d space: given a known set of interest points on a map, we want to suggest the closest one. (Note: we could make this more complicated by adding dimensions for ratings, category, newness, etc.) How do we figure out what to suggest? Brute force: just compute the distance to all known points and pick the lowest, which is O(n) in the number of points. Feels wasteful... why look at the distance to the Eiffel Tower when we know you're in NYC? The fix: space partitioning.
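First, a sketch of the brute-force O(n) scan just described (Python; the POI list and query point here are made up for illustration):

    import math

    def brute_force_nn(points, query):
        # Compute the distance to every known point and keep the minimum: O(n).
        return min(points, key=lambda p: math.dist(p, query))

    pois = [(1, 4), (2, 2), (3, 5), (4, 3), (6, 2), (7, 2), (8, 5)]
    print(brute_force_nn(pois, (5, 4)))   # -> (4, 3)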

2d trees [Figure: step-by-step construction of a 2d tree over the points (1,4), (2,2), (3,5), (4,3), (6,2), (7,2), (8,5), drawn on a grid. The root (4,3) splits the plane vertically on x; its children (1,4) and (6,2) split their half-planes horizontally on y; the remaining points (2,2), (3,5), (7,2), (8,5) become leaves.]
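A compact construction sketch in Python, splitting on the median along alternating axes (names are illustrative; with tied coordinates the resulting tree may differ slightly from the figure's):

    class Node:
        def __init__(self, point, axis, left=None, right=None):
            self.point, self.axis = point, axis
            self.left, self.right = left, right

    def build_2d_tree(points, depth=0):
        # Alternate the splitting axis by depth: 0 = x, 1 = y.
        if not points:
            return None
        axis = depth % 2
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        # Sorting at every level costs O(n log^2 n); a linear-time median
        # select achieves the O(n log n) construction bound quoted below.
        return Node(points[mid], axis,
                    build_2d_tree(points[:mid], depth + 1),
                    build_2d_tree(points[mid + 1:], depth + 1))

    root = build_2d_tree([(1, 4), (2, 2), (3, 5), (4, 3), (6, 2), (7, 2), (8, 5)])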

2d trees - Nearest Neighbor [Figure: animated walkthrough of nearest-neighbor search on the tree above for two example query points. The search descends to the leaf region containing the query, tracks the best distance found so far (values such as √10, √5, 3, 2, √2, √17, √18 appear across the steps), then backtracks, visiting a sibling subtree only when the splitting line is closer to the query than the current best and pruning it otherwise.]
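The walkthrough corresponds to a short recursive search. A sketch continuing the illustrative code from the construction slide:

    import math

    def nearest(node, query, best=None):
        # Depth-first search with pruning; returns the closest stored point.
        if node is None:
            return best
        if best is None or math.dist(node.point, query) < math.dist(best, query):
            best = node.point
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        best = nearest(near, query, best)
        # Only cross the splitting line if the best-so-far circle reaches it.
        if abs(diff) < math.dist(best, query):
            best = nearest(far, query, best)
        return best

    print(nearest(root, (5, 4)))   # -> (4, 3)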

2d trees Construction complexity: O(n log n). Complication: how do you decide how to partition? It requires being smart about picking pivots (e.g. median selection along the splitting axis). Adding/removing an element: O(log n), because we know the tree is balanced from the median selection... except adding/removing might make it unbalanced over time; there are variants that handle this. Nearest neighbor: O(log n) average case, O(n) worst case. Not great... but no worse than brute force. 2d trees can also be used effectively for range finding.

k-d trees 2d trees can be extended to k dimensions Image Source: Wikipedia

k-d trees Same algorithm for nearest neighbor! Remember sabermetrics! ...except there's a catch: the Curse of Dimensionality. The higher the dimension, the "sparser" the data gets in the space. It becomes harder to rule out portions of the tree, so many searches end up being fancy brute force. In general, k-d trees are useful when n >> 2^k.

Sabermetrics (reprise) Finding a single neighbor could be noise-prone: are we sure this year's "Derek Jeter" will be next year's too? What if there are lots of close points... are we sure the relative distances matter? We could instead ask for a set of the most likely players. Alternate question: will a player make it to the Hall of Fame? Still a k-dimensional space, but we're not comparing with an individual point: am I in the "neighborhood" of Hall of Famers? This is a classic example of a "classification problem."

k-Nearest Neighbors (kNN) New plan: find the k closest points. Each can "vote" for a classification... or you can do some other kind of averaging. Can we modify our k-d tree NN algorithm to do this? Yes (see the sketch below): track the k closest points seen so far in a max-heap (priority queue), keep the heap at size k, and only the k'th closest point needs to be considered for tree pruning.
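A sketch of that modification, reusing the tree nodes from the earlier slides. Python's heapq is a min-heap, so distances are stored negated to get max-heap behavior (a standard trick, not from the slides):

    import heapq, math

    def k_nearest(node, query, k, heap=None):
        # heap holds (-distance, point); heap[0] is the k'th closest so far.
        if heap is None:
            heap = []
        if node is None:
            return heap
        d = math.dist(node.point, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.point))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, node.point))
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        k_nearest(near, query, k, heap)
        # Prune with the k'th-closest distance (always recurse if heap not full).
        if len(heap) < k or abs(diff) < -heap[0][0]:
            k_nearest(far, query, k, heap)
        return heap

For instance, k_nearest(root, (5, 4), 3) on the earlier tree returns the three closest points, paired with their negated distances.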

Voronoi Diagrams A useful visualization of nearest neighbors; good when you have a known set of comparison points. Wide-ranging applications: Epidemiology (cholera victims all near one water pump); Aviation (nearest airport for flight diversion); Networking (capacity derivation); Robotics (points are obstacles, edges are safest paths). Image Source: Wikipedia

Voronoi Diagrams Also helpful for visualizing the effects of different distance metrics, e.g. Euclidean distance vs. Manhattan distance. Image Source: Wikipedia
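For reference, the two metrics as illustrative Python definitions (any dimension):

    def euclidean(p, q):
        # Straight-line distance: square root of summed squared differences.
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def manhattan(p, q):
        # Grid distance: summed absolute differences along each axis.
        return sum(abs(a - b) for a, b in zip(p, q))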

Voronoi Diagrams The polygon construction algorithm is a little tricky, but conceptually you can think of balls expanding simultaneously around the points; where two expanding balls meet, a cell boundary forms. Image Source: Wikipedia

k-means Clustering Goal: group n data points into k groups based on nearest neighbors. Algorithm (a sketch of this loop follows below):
1. Pick k data points at random to be the starting "centers"; call each center c_i.
2. For each point p, calculate which of the k centers is its nearest neighbor and add p to that center's set S_i.
3. Compute the mean of all points in each S_i to generate a new c_i.
4. Go back to (2) and repeat with the new centers, until the centers converge.
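A self-contained sketch of the loop in plain Python (the parameter names and the max_iters safety cap are illustrative additions):

    import math, random

    def k_means(points, k, max_iters=100):
        centers = random.sample(points, k)        # step 1: random starting centers
        for _ in range(max_iters):
            # Step 2: assign each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda i: math.dist(p, centers[i]))
                clusters[i].append(p)
            # Step 3: move each center to the mean of its cluster.
            new_centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                           else centers[i] for i, cl in enumerate(clusters)]
            if new_centers == centers:            # step 4: stop on convergence
                break
            centers = new_centers
        return centers, clusters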

k-means clustering Notice: the algorithm basically creates Voronoi diagrams for the centers! Image Source: Wikipedia

k-means clustering Does this always converge? It depends on the distance function; for Euclidean distance it always does, since each step can only decrease the total squared distance and there are finitely many assignments. It converges quickly in practice, but the worst case can take an exponential number of iterations. Does it give the optimal clustering? NO! Well, at least not always: it can get stuck in a local optimum that depends on the starting centers. Image Source: Wikipedia

Other space partitioning data structures
Leaf point k-d trees: only store points in the leaves, but leaves can store more than one point. Split space at the middle of the longest axis. This effectively "buckets" points, and can be used for approximate nearest neighbor.
Quadtrees: split space into quadrants (i.e. every tree node has four children). A quadrant can contain at most q points; if there are more than q, split that quadrant again into quadrants (see the insertion sketch below). Octrees are the extension to 3d.
Applications: collision detection (video games); image representation/processing (transforming/comparing/etc. nearby pixels); sparse data storage (spreadsheets).
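A minimal quadtree point-insertion sketch following the rules above. The square-region representation (center (cx, cy), half-width half) and bucket capacity q are assumptions for illustration:

    class Quadtree:
        def __init__(self, cx, cy, half, q=4):
            self.cx, self.cy, self.half, self.q = cx, cy, half, q
            self.points = []        # points stored here while this is a leaf
            self.children = None    # four sub-quadrants once split

        def insert(self, p):
            if self.children is not None:
                return self._child_for(p).insert(p)
            self.points.append(p)
            if len(self.points) > self.q:
                # Overflow: split this quadrant into four and push points down.
                h = self.half / 2
                self.children = [Quadtree(self.cx + dx, self.cy + dy, h, self.q)
                                 for dx in (-h, h) for dy in (-h, h)]
                for old in self.points:
                    self._child_for(old).insert(old)
                self.points = []

        def _child_for(self, p):
            # Children are ordered (-x,-y), (-x,+y), (+x,-y), (+x,+y).
            return self.children[2 * (p[0] >= self.cx) + (p[1] >= self.cy)]

For example, Quadtree(4, 4, 4) would cover the 8x8 grid from the 2d tree slides.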