Forms of Retrieval
– Sequential Retrieval
– Two-Step Retrieval
– Retrieval with Indexed Cases
Sources:
– Textbook, Chapter 7
– Davenport & Prusack's book on Advanced Data Structures
– Samet's book on Data Structures
Range Search
[Figure: a diagnosis dialogue (Red light on? Yes; Beeping? Yes; …; Transistor burned!) illustrating how range search narrows down the space of known problems]
k-d Trees
Idea:
– Partition the case base into smaller fragments
– Represent a k-dimensional space as a binary tree
– Similar to a decision tree: the query is compared against the inner nodes
– During retrieval: search descends to a leaf, but unlike in decision trees, backtracking may occur
Definition: k-d Trees
Given:
– k types T1, …, Tk for the attributes A1, …, Ak
– A case base CB containing cases in T1 × … × Tk
– A parameter b (bucket size)
A k-d tree T(CB) for a case base CB is a binary tree defined as follows:
– If |CB| < b, then T(CB) is a leaf node (a bucket)
– Else T(CB) defines a tree such that:
  – The root is marked with an attribute Ai and a value v from the domain of Ai, and
  – The two k-d trees T({c ∈ CB : c.Ai < v}) and T({c ∈ CB : c.Ai ≥ v}) are the left and right subtrees of the root
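The recursive construction above can be sketched as follows (a minimal, hypothetical implementation assuming numeric attributes, cases as tuples, and the median as split value, which is one common heuristic):

```python
def build_kd_tree(cases, b, depth=0):
    """Build a k-d tree: a leaf bucket if |cases| < b, else an inner
    node marked with an attribute index i and a split value v."""
    if len(cases) < b:
        return {"bucket": list(cases)}
    i = depth % len(cases[0])                  # cycle through attributes A1..Ak
    ordered = sorted(cases, key=lambda c: c[i])
    v = ordered[len(ordered) // 2][i]          # median value as split point
    left = [c for c in ordered if c[i] < v]    # {c in CB : c.Ai < v}
    right = [c for c in ordered if c[i] >= v]  # {c in CB : c.Ai >= v}
    if not left or not right:                  # all values equal: stop splitting
        return {"bucket": list(cases)}
    return {"attr": i, "value": v,
            "left": build_kd_tree(left, b, depth + 1),
            "right": build_kd_tree(right, b, depth + 1)}

def bucket_cases(node):
    """Collect all cases stored in the tree's buckets."""
    if "bucket" in node:
        return node["bucket"]
    return bucket_cases(node["left"]) + bucket_cases(node["right"])
```

Every case ends up in exactly one bucket, so the tree merely indexes the case base without losing cases.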
BWB-Check
Ball-Within-Bounds check:
– Suppose the algorithm reaches a leaf node M (with at most b cases) while searching for the most similar case to P
– Let c be the case in M such that dist(c, P) is smallest; then c is a candidate NN for P
– If dist(P, B) > dist(c, P) for every boundary B of M, then c is the NN
– But if dist(P, B) < dist(c, P) for some boundary B of M, the algorithm needs to backtrack and check whether the region beyond B contains a better candidate
– For computing distances, simply let f⁻¹ be the inverse of the distance-similarity compatible function: dist(P, C) = f⁻¹(sim(P, C))
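The BWB test can be sketched as follows (a hypothetical helper, assuming Euclidean distance and the leaf's region given as per-attribute (low, high) bounds):

```python
import math

def euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def bwb_check(P, c, bounds):
    """Ball-Within-Bounds: True iff the ball around query P with radius
    r = dist(P, c) lies strictly inside the leaf region, i.e.
    dist(P, B) > r for every boundary B. Then c is guaranteed the NN."""
    r = euclid(P, c)
    for i, (low, high) in enumerate(bounds):
        if P[i] - low <= r or high - P[i] <= r:
            return False   # ball crosses boundary i: backtracking is needed
    return True
```

For example, with query P(32,45) and candidate Omaha (25,35) in a leaf region bounded by A1 < 35, the ball crosses the A1 = 35 boundary, so the check fails and backtracking is required.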
BOB-Check
Ball-Out-of-Bounds check:
– Used during backtracking
– Checks whether, for the boundary B defined in the node, dist(P, B) < dist(c, P), where c is the current candidate for best case (e.g., the closest case to P in the initial bucket)
– If the condition is true, the algorithm needs to check whether the region on the other side of B contains a better candidate
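For axis-parallel boundaries (a k-d tree node splitting attribute Ai at value v), the distance from P to the boundary is just |P[i] − v|, so the BOB test reduces to a one-liner (sketch, with hypothetical names):

```python
def bob_check(P, r, attr, value):
    """Ball-Out-of-Bounds: does the ball of radius r = dist(P, c) around
    the query P cross the splitting boundary A_attr = value?  If so, the
    subtree on the other side may contain a case closer than the current
    candidate c and must still be searched."""
    return abs(P[attr] - value) < r
```

E.g., for P(32,45) with candidate Omaha (r ≈ 12.21), bob_check(P, 12.21, 0, 35) is True, so the region A1 ≥ 35 must still be examined.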
Example
Case base (cities as points in a (0,0)–(100,100) coordinate space):
Denver (5,45), Omaha (25,35), Chicago (35,40), Mobile (50,10), Atlanta (85,15), Miami (90,5), Buffalo (80,65), Toronto (60,75)
k-d tree (bucket leaves in braces):
– Root: A1, split at 35
  – A1 < 35: {Denver, Omaha}
  – A1 ≥ 35: A2, split at 40
    – A2 < 40: A1, split at 85
      – A1 < 85: {Mobile}
      – A1 ≥ 85: {Atlanta, Miami}
    – A2 ≥ 40: A1, split at 60
      – A1 < 60: {Chicago}
      – A1 ≥ 60: {Toronto, Buffalo}
Query: P(32,45)
Notes: Priority lists are used for computing kNN
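For the query P(32,45), retrieval first reaches the bucket containing Denver and Omaha; since the BWB test fails, backtracking crosses the A1 = 35 boundary and finds Chicago. A brute-force check over the example data (assuming Euclidean distance) confirms this:

```python
import math

CITIES = {"Omaha": (25, 35), "Denver": (5, 45), "Chicago": (35, 40),
          "Mobile": (50, 10), "Miami": (90, 5), "Atlanta": (85, 15),
          "Buffalo": (80, 65), "Toronto": (60, 75)}
P = (32, 45)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# nearest neighbour by exhaustive search, for comparison with the tree result
nearest = min(CITIES, key=lambda name: dist(CITIES[name], P))
print(nearest, round(dist(CITIES[nearest], P), 2))   # Chicago 5.83
```

Chicago at distance ≈ 5.83 beats both Omaha (≈ 12.21) and Denver (27), even though Chicago lies in a different branch of the tree than the bucket the query descends into.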
Using Decision Trees as Index
Standard decision tree: an inner node tests an attribute Ai and has one branch per value v1, v2, …, vn
Variant: InReCA tree: an additional "unknown" branch is taken when the value of Ai is missing
Can be combined with numeric attributes: branches test intervals such as < v1, ≥ v1 and < v2, …, ≥ vn, plus "unknown"
Notes:
– Supports Hamming distance
– May require backtracking (using the BOB-check)
– Operates in a similar fashion to k-d trees
– Priority lists are used for computing kNN
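Branch selection in such an InReCA-style node for a numeric attribute can be sketched as follows (hypothetical helper; the split values v1 < … < vn define the intervals, and None stands for an unknown attribute value):

```python
def inreca_branch(value, split_points):
    """Pick the branch index in an InReCA-style node.
    split_points v1 < v2 < ... < vn define the intervals
    (< v1), (>= v1 and < v2), ..., (>= vn); a value of None
    routes the case down the extra 'unknown' branch."""
    if value is None:
        return len(split_points) + 1   # the extra 'unknown' branch
    for i, v in enumerate(split_points):
        if value < v:
            return i                   # falls into interval i
    return len(split_points)           # >= vn
```

The extra branch is what lets the index handle cases with missing attribute values without discarding them.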
Properties of Retrieval with Indexed Cases
Advantages:
– Efficient retrieval
– Incremental: no need to rebuild the index every time a new case is entered
– Retrieval error does not occur (the most similar case is never missed)
Disadvantages:
– Cost of construction is high
– Only works for monotonic similarity relations