K-Nearest Neighbors (kNN)
Given: a case base CB, a new problem P, and a similarity metric sim
Obtain: the k cases in CB that are most similar to P according to sim
Reminder: we used a priority list holding the top k most similar cases obtained so far
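To make the reminder concrete, here is a minimal sketch of sequential kNN retrieval that keeps the top k most similar cases in a priority list (Python's heapq); the case representation and the sim function below are illustrative assumptions, not part of the slides.

```python
import heapq

def knn_retrieve(case_base, problem, sim, k):
    """Sequential kNN: scan CB once, keeping the k most similar cases in a min-heap.

    The heap is ordered by similarity, so its smallest entry is the weakest of the
    current top-k candidates and can be evicted cheaply.
    """
    heap = []  # entries are (similarity, index, case); the index breaks ties
    for i, case in enumerate(case_base):
        s = sim(problem, case)
        if len(heap) < k:
            heapq.heappush(heap, (s, i, case))
        elif s > heap[0][0]:                     # better than the weakest retained case
            heapq.heapreplace(heap, (s, i, case))
    # return the k cases ordered from most to least similar
    return [case for s, _, case in sorted(heap, reverse=True)]

# Hypothetical usage with 2-d cases and similarity = negative squared Euclidean distance
cities = [("Denver", (5, 45)), ("Omaha", (25, 35)), ("Chicago", (35, 40)),
          ("Mobile", (50, 10)), ("Atlanta", (85, 15)), ("Miami", (90, 5))]
sim = lambda p, c: -((p[0] - c[1][0]) ** 2 + (p[1] - c[1][1]) ** 2)
print(knn_retrieve(cities, (32, 45), sim, k=3))
```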

Forms of Retrieval
- Sequential Retrieval
- Two-Step Retrieval
- Retrieval with Indexed Cases

Sources:
- Bergman's book
- Davenport & Prusack's book on Advanced Data Structures
- Samet's book on Data Structures

Range Search
(figure: a sequence of diagnostic tests over the space of known problems: "Red light on? Yes", "Beeping? Yes", ..., "Transistor burned!")

K-D Trees
Idea:
- Partition the case base into smaller fragments
- Represent a k-dimensional space as a binary tree
- Similar to a decision tree: comparisons at the nodes
- During retrieval: search for a leaf, but unlike decision trees, backtracking may occur

Definition: K-D Trees
Given:
- k types T1, ..., Tk for the attributes A1, ..., Ak
- A case base CB containing cases in T1 × ... × Tk
- A parameter b (the bucket size)
A k-d tree T(CB) for a case base CB is a binary tree defined as follows:
- If |CB| < b, then T(CB) is a leaf node (a bucket)
- Otherwise, T(CB) defines a tree such that:
  - the root is marked with an attribute Ai and a value v in Ai, and
  - the two k-d trees T({c ∈ CB : c.Ai < v}) and T({c ∈ CB : c.Ai ≥ v}) are the left and right subtrees of the root
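A minimal construction sketch following this definition, assuming numeric attributes stored as Python tuples; the policy of cycling through the attributes and splitting at the median is a common choice assumed here, not something the slides prescribe.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Case = Tuple[float, ...]  # one value per attribute A1, ..., Ak

@dataclass
class KDNode:
    attribute: Optional[int] = None        # split attribute index i (internal nodes only)
    value: Optional[float] = None          # split value v
    left: Optional["KDNode"] = None        # cases with c[attribute] <  value
    right: Optional["KDNode"] = None       # cases with c[attribute] >= value
    bucket: Optional[List[Case]] = None    # only set on leaves

def build_kdtree(cb: List[Case], k: int, b: int, depth: int = 0) -> KDNode:
    """Build T(CB): a leaf (bucket) if |CB| < b, otherwise split on one attribute."""
    if len(cb) < b:
        return KDNode(bucket=list(cb))
    i = depth % k                                  # cycle through the k attributes (assumed policy)
    v = sorted(c[i] for c in cb)[len(cb) // 2]     # median value of attribute Ai (assumed policy)
    left_cases = [c for c in cb if c[i] < v]
    right_cases = [c for c in cb if c[i] >= v]
    if not left_cases or not right_cases:          # degenerate split (duplicate values): make a bucket
        return KDNode(bucket=list(cb))
    return KDNode(attribute=i, value=v,
                  left=build_kdtree(left_cases, k, b, depth + 1),
                  right=build_kdtree(right_cases, k, b, depth + 1))
```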

Example
Cities (x, y) in a plane from (0,0) to (100,100): Denver (5,45), Omaha (25,35), Chicago (35,40), Mobile (50,10), Atlanta (85,15), Miami (90,5), Toronto (60,75), Buffalo (80,65)
(figure: the k-d tree splits on A1 at 35, with {Denver, Omaha} in the "< 35" bucket; the "≥ 35" side splits on A2 at 40, whose "< 40" branch splits on A1 at 85 into {Mobile} and {Atlanta, Miami}, and whose "≥ 40" branch splits on A1 at 60 into {Chicago} and {Toronto, Buffalo})
Notes:
- Supports Euclidean distance
- May require backtracking: closest city to P(32,45)?
- Priority lists are used for computing kNN
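The backtracking noted above can be sketched as follows, reusing the hypothetical KDNode structure from the previous sketch: after descending toward the leaf that would contain the query, the opposite branch must still be visited whenever the splitting plane is closer than the best case found so far. The usage line at the end is illustrative only.

```python
import math
# Assumes the KDNode structure from the build sketch above and Euclidean distance.

def nearest_neighbor(node, p, best=None, best_d=math.inf):
    """Return (case, distance) for the case in the k-d tree closest to p."""
    if node is None:
        return best, best_d
    if node.bucket is not None:                    # leaf: scan the bucket sequentially
        for c in node.bucket:
            d = math.dist(p, c)
            if d < best_d:
                best, best_d = c, d
        return best, best_d
    i, v = node.attribute, node.value
    near, far = (node.left, node.right) if p[i] < v else (node.right, node.left)
    best, best_d = nearest_neighbor(near, p, best, best_d)    # descend toward p's side first
    if abs(p[i] - v) < best_d:                                # backtracking test: the splitting
        best, best_d = nearest_neighbor(far, p, best, best_d) # plane is closer than the best case
    return best, best_d

# e.g. nearest_neighbor(build_kdtree(city_coordinates, k=2, b=3), (32, 45))
```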

Using Decision Trees as Index
(figure: a standard decision tree branches on an attribute Ai with one edge per value v1, v2, ..., vn; the InReCA-tree variant adds an extra "unknown" edge, and can be combined with numeric attributes by branching on thresholds such as ≤ v1, > v1, ..., > vn, plus "unknown")
Notes:
- Supports Hamming distance
- May require backtracking
- Operates in a similar fashion to kd-trees
- Priority lists are used for computing kNN

Variation: Point Quadtree
- Particularly suited for performing range search (i.e., similarity assessment)
- Adequate when there are few attributes, all numerical and known to be important
- A node in a (point) quadtree contains:
  - 4 pointers: quad['NW'], quad['NE'], quad['SW'], and quad['SE']
  - point, of type DataPoint, which in turn contains:
    - name
    - (x, y) coordinates
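A minimal Python rendering of this node layout, mirroring the quad pointers and DataPoint fields listed above; the class and field names are assumptions reused by the later sketches.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

QUADRANTS = ("NW", "NE", "SW", "SE")

@dataclass
class DataPoint:
    name: str
    x: float
    y: float

@dataclass
class QuadNode:
    point: DataPoint
    # one child pointer per quadrant, all initially empty
    quad: Dict[str, Optional["QuadNode"]] = field(
        default_factory=lambda: {q: None for q in QUADRANTS})
```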

Example
(figure: the same eight cities plotted in the plane from (0,0) to (100,100))
Insertion order: Chicago, Mobile, Toronto, Buffalo, Denver, Omaha, Atlanta, and Miami

Insertion in Quadtree
(figure: the resulting quadtree: Chicago at the root, with Denver (NW), Toronto (NE), Omaha (SW), and Mobile (SE) as its children; Buffalo under Toronto, and Atlanta and Miami under Mobile)

Insertion Procedure
We define a new type:
  quadrant: 'NW', 'NE', 'SW', 'SE'

function PT_compare(DataPoint dP, dR): quadrant
  // returns the quadrant where dP belongs relative to dR
  if (dP.x < dR.x) then
    if (dP.y < dR.y) then return 'SW'
    else return 'NW'
  else
    if (dP.y < dR.y) then return 'SE'
    else return 'NE'

Insertion Procedure (Cont.)
procedure PT_insert(Pointer P, R)
  // inserts P in the tree rooted at R
  Pointer T     // points to the current node being examined
  Pointer F     // points to the parent of T
  Quadrant Q    // auxiliary variable
  T ← R
  F ← null
  while not(T == null) && not(equalCoord(P.point, T.point)) do
    F ← T
    Q ← PT_compare(P.point, T.point)
    T ← T.quad[Q]
  if (T == null) then
    F.quad[Q] ← P
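A runnable Python counterpart of PT_compare and PT_insert, written against the hypothetical QuadNode and DataPoint classes sketched earlier; it follows the slide's logic (walk down comparing quadrants, then attach the new node to its parent) and simply does nothing if a point with the same coordinates already exists.

```python
def pt_compare(dp: DataPoint, dr: DataPoint) -> str:
    """Quadrant of dp relative to dr, mirroring PT_compare above."""
    if dp.x < dr.x:
        return "SW" if dp.y < dr.y else "NW"
    return "SE" if dp.y < dr.y else "NE"

def pt_insert(p: QuadNode, r: QuadNode) -> None:
    """Insert node p into the quadtree rooted at r (r is assumed to be non-empty)."""
    t, f, q = r, None, None          # current node, its parent, and the quadrant taken
    while t is not None and (p.point.x, p.point.y) != (t.point.x, t.point.y):
        f = t
        q = pt_compare(p.point, t.point)
        t = t.quad[q]
    if t is None:                    # fell off the tree: hang p under its parent
        f.quad[q] = p

# Hypothetical usage, reproducing the example's insertion order
root = QuadNode(DataPoint("Chicago", 35, 40))
for name, x, y in [("Mobile", 50, 10), ("Toronto", 60, 75), ("Buffalo", 80, 65),
                   ("Denver", 5, 45), ("Omaha", 25, 35), ("Atlanta", 85, 15),
                   ("Miami", 90, 5)]:
    pt_insert(QuadNode(DataPoint(name, x, y)), root)
```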

Search
Typical query: "find all cities within 50 miles of Washington, DC"
In the initial example: "find all cities within 8 data units of (83,13)"
Solution:
- Discard the NW, SW, and NE quadrants of Chicago (that is, only examine SE)
- There is no need to search the NW and SW quadrants of Mobile
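A sketch of this range search over the same hypothetical QuadNode structure: at each node, a quadrant is visited only if the query circle can reach it, which is exactly what discards NW, SW, and NE at Chicago in the example above.

```python
import math

def range_search(node, qx, qy, r, found=None):
    """Collect all points of the quadtree within distance r of (qx, qy)."""
    if found is None:
        found = []
    if node is None:
        return found
    px, py = node.point.x, node.point.y
    if math.hypot(px - qx, py - qy) <= r:
        found.append(node.point)
    # A quadrant needs to be examined only if the query circle overlaps its region.
    if qx - r < px:                       # circle reaches the western half-plane (x < px)
        if qy + r >= py: range_search(node.quad["NW"], qx, qy, r, found)
        if qy - r < py:  range_search(node.quad["SW"], qx, qy, r, found)
    if qx + r >= px:                      # circle reaches the eastern half-plane (x >= px)
        if qy + r >= py: range_search(node.quad["NE"], qx, qy, r, found)
        if qy - r < py:  range_search(node.quad["SE"], qx, qy, r, found)
    return found

# e.g. range_search(root, 83, 13, 8) visits only SE of Chicago and finds Atlanta
```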

Search (II)
(figure: a query point A with search radius r, and the plane around the query circle divided into numbered regions)
Let R be the root of the quadtree. Which of R's quadrants need to be inspected if R lies in region:
- 1: SE
- 2: SW, SE
- 8: NW
- 11: NW, NE, SE

Priority Queues
- Typical example: printing in a Unix/Linux environment. Printing jobs have different priorities, and these priorities may override the FIFO policy of the queue (i.e., jobs with the highest priority get printed first).
- Operations supported by a priority queue:
  - Insert a new element
  - Extract/delete the element with the lowest priority
- In search trees, the priority is based on the distance
- Insertion and deletion can be done in O(log N), and inspecting the head element in O(1)
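A small illustration using Python's heapq module as the priority queue (a binary heap keyed here by distance, matching the search-tree use described above); the entries are placeholders.

```python
import heapq

pq = []                                   # the heap lives in a plain list
heapq.heappush(pq, (25, "Toronto"))       # insert: O(log N); the key is a distance
heapq.heappush(pq, (0, "Mobile"))
heapq.heappush(pq, (60, "Omaha"))

print(pq[0])                              # inspect the head element: O(1) -> (0, 'Mobile')
closest = heapq.heappop(pq)               # extract the element with the smallest key: O(log N)
```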

Nearest-Neighbor Search
Problem: Given a point quadtree T and a point P, find the node in T that is closest to P
Idea: traverse the quadtree maintaining a priority list, candidates, based on the distance from P to the quadrants containing the candidate nodes
(figure: the same eight cities, with the query point P(95,15))

Distance from P to a Quadrant
(figure: a node at (x,y) splits the plane into four quadrants; the query point P lies in the SE quadrant, and P1, ..., P4 mark the nearest points of the quadrants to P)
Let f^-1 be the inverse of the distance-similarity compatible function, so that distances can be recovered from similarity values. For a P lying in the SE quadrant:
- distance(P, SE) = 0
- distance(P, NE) = f^-1(sim(P, (P.x, y)))  (nearest point of NE to P)
- distance(P, SW) = f^-1(sim(P, (x, P.y)))  (nearest point of SW to P)
- distance(P, NW) = f^-1(sim(P, (x, y)))    (the corner itself)
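A sketch of this computation for the plain Euclidean case, assuming the quadrant conventions used in the pt_compare sketch (e.g. SE means x ≥ node.x and y < node.y): the distance from P to a quadrant is the distance from P to its clamped image inside that quadrant's region.

```python
import math

def quadrant_distance(px, py, nx, ny, quadrant):
    """Distance from P=(px,py) to the nearest point of the given quadrant of node (nx,ny)."""
    # Clamp P onto the quadrant's region; the distance to the clamped point is the answer.
    if quadrant == "NW":
        cx, cy = min(px, nx), max(py, ny)
    elif quadrant == "NE":
        cx, cy = max(px, nx), max(py, ny)
    elif quadrant == "SW":
        cx, cy = min(px, nx), min(py, ny)
    else:  # "SE"
        cx, cy = max(px, nx), min(py, ny)
    return math.hypot(px - cx, py - cy)

# For P=(95,15) and the root Chicago (35,40):
# quadrant_distance(95, 15, 35, 40, "SE") == 0.0   (P lies in SE)
# quadrant_distance(95, 15, 35, 40, "NE") == 25.0  (the Toronto quadrant in the example)
```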

Idea of the Algorithm
(figure: the quadtree rooted at Chicago, with the query point P = (95,15))
Initially:
- Candidates = [Chicago (4225)]
- Buffer: null (∞)
After expanding Chicago:
- Candidates = [Mobile (0), Toronto (25), Omaha (60), Denver (4225)]
- Buffer: Chicago (4225)

List of Candidates
(figure: Mobile (50,10) with Atlanta (85,15) and Miami (90,5) in its subtree, and the query point P(95,15))
Examine the quadrant at the top of candidates (Mobile) and make it the new buffer:
- Buffer: Mobile (1625)
- distance(P, NE) = 0, distance(P, SE) = 5
Termination test: Buffer.distance < distance(candidates.top, P)?
- If "yes", then return Buffer
- If "no", then continue
In this particular example the answer is "no", since Mobile is closer to P than Chicago

Finally the Nearest Neighbor is Found
Candidates = [Atlanta (0), Miami (5), Toronto (25), Omaha (60), Denver (4225)]
A new iteration:
- Buffer: Atlanta (100)
- Candidates = [Miami (5), Toronto (25), Omaha (60), Denver (4225)]
The algorithm terminates, since the distance from Atlanta to P is less than the distance from Miami to P
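Putting the pieces together, here is a hedged sketch of the whole best-first search: a priority queue of (quadrant distance, node) candidates plus a buffer holding the best point seen so far, stopping once the buffer is closer than the nearest unexplored quadrant. It builds on the hypothetical QuadNode, QUADRANTS, and quadrant_distance sketches above and uses plain Euclidean distance throughout, rather than the slides' mix of squared and plain values.

```python
import heapq
import itertools
import math

def nearest_neighbor_quadtree(root, px, py):
    """Best-first search: return the DataPoint in the quadtree closest to (px, py)."""
    counter = itertools.count()                 # tie-breaker so the heap never compares nodes
    candidates = [(0.0, next(counter), root)]   # priority = distance from P to the node's quadrant
    best, best_d = None, math.inf               # the "buffer": best point seen so far
    while candidates:
        quad_d, _, node = heapq.heappop(candidates)
        if quad_d >= best_d:                    # termination test: buffer beats all remaining quadrants
            break
        d = math.hypot(node.point.x - px, node.point.y - py)
        if d < best_d:                          # this node's point becomes the new buffer
            best, best_d = node.point, d
        for q in QUADRANTS:                     # push the node's non-empty quadrants as candidates
            child = node.quad[q]
            if child is not None:
                heapq.heappush(candidates,
                               (quadrant_distance(px, py, node.point.x, node.point.y, q),
                                next(counter), child))
    return best

# e.g. nearest_neighbor_quadtree(root, 95, 15).name -> 'Atlanta'
```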

Complexity
- Experiments show that random insertion of N nodes is roughly O(N log_4 N)
- Thus, insertion of a single node is O(log_4 N)
- But the worst case (actual complexity) can be much worse
- Range search can be performed in O(2·N^(1/2))

Delete
First idea:
- Find the node N that you want to delete
- Delete N and all of its descendants ND
- For each node N' in ND, add N' back into the tree
A terrible idea; it is too inefficient!
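A sketch of this first (inefficient) idea over the hypothetical QuadNode structure, included only to make its cost concrete: every descendant of the deleted node is collected and re-inserted one by one with pt_insert.

```python
def collect_subtree(node, out):
    """Gather node and all of its descendants into the list out."""
    if node is None:
        return
    out.append(node)
    for q in QUADRANTS:
        collect_subtree(node.quad[q], out)

def naive_delete(root, target):
    """Delete target by re-inserting the whole subtree below it (deliberately slow)."""
    parent, q_taken, node = None, None, root        # locate target and its parent
    while node is not None and node is not target:
        parent, q_taken = node, pt_compare(target.point, node.point)
        node = node.quad[q_taken]
    if node is None:
        return root                                 # target not found
    orphans = []
    for q in QUADRANTS:                             # everything below the deleted node
        collect_subtree(node.quad[q], orphans)
    if parent is None:                              # deleting the root: promote an orphan
        if not orphans:
            return None
        root = orphans.pop(0)
        root.quad = {q: None for q in QUADRANTS}
    else:
        parent.quad[q_taken] = None
    for orphan in orphans:                          # re-insert each orphan from scratch
        orphan.quad = {q: None for q in QUADRANTS}
        pt_insert(orphan, root)
    return root
```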

Idealized Deletion in Quadtrees
If a point A is to be deleted, find a point B such that the region between A and B is empty, and replace A with B.
(figure: A and B with the empty "hatched region" between them)
Why? Because all the remaining points will lie in the same quadrants relative to B as they do relative to A. For example, Omaha could replace Chicago as the root.

Problems with the Idealized Situation
First problem: a lot of effort is required to find such a B.
(figure: points C, A, D, E, F)
In this example, which point (C, D, E, or F) has an empty hatched region with A? Answer: none!
Second problem: no such B may exist!

Problem with Defining a New Root
Several points will have to be re-positioned.
(figure: old root vs. new root; some points change quadrant, e.g. SW → NE, NW → NE, SW → NW, SE → NE, and SW → SE)

Deletion Process
Delete P:
1. If P is a leaf, then just delete it!
2. If P has a single child C, then replace P with C
3. For all other cases:
   3.1 Compute 4 candidate nodes, one for each quadrant under P
   3.2 Select one of the candidate nodes, N, according to certain criteria
   3.3 Delete several nodes under P and collect them in a list, ADD; also delete N
   3.4 Make N.point the new root: P.point ← N.point
   3.5 Re-insert all nodes in ADD

A Word of Warning About Deletion
- In databases, deletion is frequently not done immediately because it is so time-consuming. Sometimes even insertions are not done immediately!
- Instead, a log of all deletions (and additions) is kept, and periodically (e.g., every night or weekend) the log is traversed to update the database. This technique is called differential databases.
- Deleting cases is part of the general problem of case-base maintenance.

Properties of Retrieval with Indexed Cases
Advantages:
- Efficient retrieval
- Incremental: no need to rebuild the index every time a new case is entered
- The α-error does not occur
Disadvantages:
- Cost of construction is high
- Only works for monotonic similarity relations