1
An Adaptive Nearest Neighbor Classification Algorithm for Data Streams
Yan-Nei Law & Carlo Zaniolo
University of California, Los Angeles
PKDD, Porto, 2005
2
Outline
Related Work
ANNCAD
Properties of ANNCAD
Conclusion
3
Classifying Data Streams
Problem Statement: We seek an algorithm for classifying data streams with numerical attributes; it will work for totally ordered domains too.
Desiderata:
  Fast update speed for newly arriving records.
  Only a single pass over the data is required.
  Incremental algorithms are needed.
  Coping with concept changes.
Classical mining algorithms were not designed for data streams and need to be replaced or modified.
Fast update speed? Incremental algorithm? Numeric or ordered?
4
Classifying Data Streams: Related Work
Hoeffding trees:
  VFDT and CVFDT: build a decision tree incrementally.
  Require a large number of examples to obtain a classifier with fair performance.
  Unsatisfactory performance when the training set is small.
Ensembles:
  Combine base models by a voting technique.
  Suitable for coping with concept drift.
  Fail to provide a simple model and understanding of the problem.
What besides decision trees?
5
State of the Art: Nearest Neighbor Classifiers
Pros and cons:
  +: Strong intuitive appeal and simple to implement
  -: Fail to provide simple models/rules
  -: Expensive computations
ANN: Approximate Nearest Neighbor with error guarantee 1+ε:
  Idea: pre-process the data by devising a data structure (e.g. ring-cover tree) to speed up the search.
  Designed for stored data only.
  The time to update the pre-processing step depends on the size of the data set, which may be infinite.
Time for updates: not clear
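The "simple to implement" and "expensive computations" points can both be seen in a brute-force nearest neighbor classifier. The sketch below is only an illustration, not code from the talk; the function name and the `(point, label)` record layout are assumptions.

```python
import math

def nearest_neighbor_label(train, x):
    """Return the label of the stored point closest to x (Euclidean distance).

    `train` is a list of (point, label) pairs; this layout is an assumption.
    """
    best_label, best_dist = None, math.inf
    for point, label in train:  # one distance computation per stored record
        dist = math.dist(point, x)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Every query scans all stored records, so the cost grows with the amount of data seen so far, which is a problem when the stream is unbounded.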
6
Our Algorithm: ANNCAD (Adaptive NN Classification Algorithm for Data Streams)
Model building:
  Pre-assign classes to obtain an approximate result and provide simple models/rules.
  Decompose the feature space to make classification decisions. Akin to wavelets.
Classification:
  Find the NN for classification adaptively.
  Progressively expand the search of the area near a test point (star).
Wavelets? Why adaptive?
7
Quantize Feature Space and Compute Multi-resolution Coefficients
Quantize the feature space and record the information in data arrays.
[Figure: a set of 100 two-class training points (I: blue, II: red) and the multi-resolution representation of this two-class data set; coarser coefficients are averages of the form ( )/4.]
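A minimal sketch of the quantize-then-average step, assuming a 2-d feature space scaled to the unit square and a 2 x 2 averaging scheme (suggested by the "/4" on the slide). The function names and the dictionary layout are hypothetical, not taken from the talk.

```python
from collections import defaultdict

def quantize(points, labels, g):
    """Count training points per class in a g x g grid over the unit square."""
    counts = defaultdict(lambda: defaultdict(float))  # cell -> class -> count
    for (x, y), c in zip(points, labels):
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        cell = (min(int(x * g), g - 1), min(int(y * g), g - 1))
        counts[cell][c] += 1.0
    return counts

def coarsen(counts):
    """Average each 2 x 2 group of cells into one block of the next level.

    ASSUMPTION: a coarser coefficient is the mean of its four finer cells.
    """
    coarse = defaultdict(lambda: defaultdict(float))
    for (i, j), per_class in counts.items():
        for c, v in per_class.items():
            coarse[(i // 2, j // 2)][c] += v / 4.0
    return coarse
```

Under this assumption, four cells holding 8, 10, 9 and 0 points of one class average to 6.75, a block value that also appears on the following slides. Calling `coarsen` repeatedly yields the coarser levels of the multi-resolution representation.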
8
Hierarchical structure of ANNCAD Classifier
Building a Classifier
Label each block with its majority class.
Label a block only if |C1st| - |C2nd| > 80%.
Example blocks:
  B=6.75; R=0.6: Blue
  B=2; R=4.25: M(ix)
  B=3; R=3.25: M(ix)
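The labeling rule can be sketched as follows. The slide does not say how the 80% margin is normalised, so this sketch assumes the gap between the two largest classes is taken relative to the largest one; that choice, the function name, and the `None` return for empty blocks are assumptions.

```python
def label_block(class_weights, threshold=0.8):
    """Return the majority class, "M" for a mixed block, or None if empty.

    ASSUMPTION: the margin is (first - second) / first, compared to `threshold`.
    """
    ranked = sorted(class_weights.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return None  # empty block: left unclassified
    first = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    return ranked[0][0] if (first - second) / first > threshold else "M"
```

With this normalisation the three example blocks on the slide come out as shown there: `{"B": 6.75, "R": 0.6}` is labeled `"B"`, while `{"B": 2, "R": 4.25}` and `{"B": 3, "R": 3.25}` are tagged `"M"`.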
9
Decision Algorithm on the ANNCAD Hierarchy
Compute the distance between the test point and the center of every nonempty neighboring block.
  Classified block: label class I.
  Unclassified block: go to the next level.
  Block with tag “M”: go back to the previous level.
  Classified block: label class II.
The combined classifier over multiple levels.
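One possible reading of these rules, sketched for a 2-d unit square. It assumes that level 0 is the finest grid with g cells per side (g a power of two), that each later level halves the resolution, and that "neighboring blocks" means the 3 x 3 neighborhood of the block containing the test point. All names and these choices are assumptions, not the authors' implementation.

```python
import math

def nearest_label(level, side, x, y):
    """Label of the nearest classified block in the 3 x 3 neighborhood."""
    ci, cj = min(int(x * side), side - 1), min(int(y * side), side - 1)
    best, best_dist = None, math.inf
    for i in range(ci - 1, ci + 2):
        for j in range(cj - 1, cj + 2):
            label = level.get((i, j))
            if label is None or label == "M":
                continue  # skip empty and mixed blocks
            dist = math.hypot(x - (i + 0.5) / side, y - (j + 0.5) / side)
            if dist < best_dist:
                best, best_dist = label, dist
    return best

def classify(levels, g, x, y):
    """levels[k] maps block -> class label or "M"; empty blocks are absent."""
    for k, level in enumerate(levels):
        side = g >> k  # blocks per dimension at level k
        cell = (min(int(x * side), side - 1), min(int(y * side), side - 1))
        label = level.get(cell)
        if label is None:
            continue  # unclassified block: go to the next (coarser) level
        if label != "M":
            return label  # classified block
        finer = max(k - 1, 0)  # mixed block: go back to the previous level
        return nearest_label(levels[finer], g >> finer, x, y)
    return None
```

The search area grows only when the finer levels have nothing to say about the test point, which is one way to interpret "adaptive" here.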
10
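Incremental Update
[Figure: a new training point arrives and the data arrays are updated along the path from its cell to the top level. Values shown: cells 8, 10, 9, 2, 1; block averages 6.75, 2, 3, with 0.25 changing to 0.5; top-level average 3 changing to 3.0625.]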
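A minimal sketch of the incremental update for a single class in 2-d, assuming each coarser coefficient is the average of its four children (so one point contributes 1, 1/4, 1/16, ... going up the levels). The function names and the nested-list layout are hypothetical.

```python
def build_levels(g):
    """Return empty grids with sides g, g/2, ..., 1 (g must be a power of two)."""
    levels, side = [], g
    while side >= 1:
        levels.append([[0.0] * side for _ in range(side)])
        side //= 2
    return levels

def add_point(levels, i, j, weight=1.0):
    """Add a training point that falls in finest-level cell (i, j).

    Only one entry per level is touched. Returns the number of entries updated.
    """
    touched = 0
    for level in levels:
        level[i][j] += weight
        touched += 1
        # Move to the parent block; its coefficient averages four children.
        i, j, weight = i // 2, j // 2, weight / 4.0
    return touched
```

For a 4 x 4 grid this touches three entries per new point. Under the averaging assumption, adding one point to a block whose average is 0.25 raises it to 0.5 and raises a top-level average of 3 to 3.0625, consistent with the values in the figure.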
11
Concept Drift: Adaptation by Exponential Forgetting
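The data array is updated with a forgetting factor between 0 and 1 that discounts the old contents when new data arrive.
No effect if there are no concept changes.
Adapts quickly (exponentially) if the concept changes.
No extra memory needed (sliding window required).
Sliding window required?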
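The exact update formula did not survive on this slide, so the sketch below assumes a common form of exponential forgetting: multiply every stored count by the factor, then add the new point with weight 1. The function name, the dictionary layout, and the default of 0.98 (a value that appears on a later experiment slide) are assumptions.

```python
def forget_and_add(counts, cell, label, factor=0.98):
    """Decay all stored counts by `factor`, then add one new training point.

    ASSUMPTION: new = factor * old + contribution of the new point.
    `counts` maps cell -> {class: weight} and is modified in place.
    """
    for per_class in counts.values():
        for c in per_class:
            per_class[c] *= factor
    per_class = counts.setdefault(cell, {})
    per_class[label] = per_class.get(label, 0.0) + 1.0
```

With a factor of 0.98, a cell holding 10 points of one class is outweighed after roughly ten consecutive points of another class, whereas with a factor of 1 (no forgetting) it would take more than ten. No window of past records is stored; only the decayed counts are kept.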
12
Grid Position and Resolution
Problem: The neighborhood decision strongly depends on the grid position.
Solution: Build several classifiers by shifting the grid position by 1/n. Then combine the results by voting.
Thm. Let x be a test point, with $n^d$ classifiers, and let b(x) be the blocks containing x. Then for $z \notin b(x)$, $y \in b(x)$: $\mathrm{dist}(x,y) < \left(1 + \frac{1}{n-1}\right)\mathrm{dist}(x,z)$.
In practice, only 2-3 classifiers are needed to achieve a good result.
Example: 4 different grids for building 4 classifiers.
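The shift-and-vote idea can be sketched with single-level grid classifiers. As a simplification, this sketch shifts all dimensions together (n grids along the diagonal) rather than building one grid per combination of per-dimension shifts, and each grid votes with the majority class of the block containing the test point. Names and data layout are hypothetical.

```python
from collections import Counter

def cell_of(point, cell_size, shift):
    """Index of the grid cell containing `point` when the grid is shifted."""
    return tuple(int((coord + shift) // cell_size) for coord in point)

def train(points, labels, cell_size, n_grids):
    """Build one count table per grid; grid k is shifted by k/n_grids of a cell."""
    grids = []
    for k in range(n_grids):
        shift = k * cell_size / n_grids
        table = {}
        for p, c in zip(points, labels):
            table.setdefault(cell_of(p, cell_size, shift), Counter())[c] += 1
        grids.append((shift, table))
    return grids

def predict(grids, cell_size, point):
    """Each grid votes with the majority class of its block; ties are not handled."""
    votes = Counter()
    for shift, table in grids:
        counts = table.get(cell_of(point, cell_size, shift))
        if counts:
            votes[counts.most_common(1)[0][0]] += 1
    return votes.most_common(1)[0][0] if votes else None
```

A point that sits near a block boundary in one grid sits closer to a block center in a shifted grid, so the vote is less sensitive to where the grid happens to be placed.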
13
Properties of ANNCAD
Compact support: locality property allows fast update.
Dealing with noise: can set a threshold for the classification decision.
Multi-resolution: to control the fineness of the result, or to optimize the system resources.
Low complexity ($g^d$ = total number of cells):
  Building classifier: $O(\min(N, g^d))$
  Testing: $O(\log_2(g) + 2^d)$
  Updating: $\log_2(g) + 1$
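As a worked example of the building and updating costs, take the letter-recognition setup from the experiments, assuming that "Grid size: 16 units" there means g = 16 and that the 16 attributes give d = 16, with N = 15,000 training examples:

```latex
\begin{align*}
g^d &= 16^{16} = 2^{64} \approx 1.8 \times 10^{19} \\
\min(N, g^d) &= \min(15000,\ 2^{64}) = 15000 \\
\log_2(g) + 1 &= \log_2(16) + 1 = 5
\end{align*}
```

Under these assumptions the building cost is bounded by the number of training points rather than by the number of cells, and each new training point touches five entries, one per level.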
14
Experiments: Synthetic Data
3-d unit cube. Class distribution:
  class 0 inside the sphere with radius 0.5
  class 1 outside
3000 training examples, 1000 test examples
Exact ANN: Expand the search area by doubling the radius until reaching some training point. Classify the test point with the majority class.
[Figure panels: (a) different initial resolutions; (b) different numbers of ensembles.]
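A minimal sketch of the synthetic data and of the radius-doubling baseline described on this slide. It assumes the sphere is centered at the cube center (0.5, 0.5, 0.5) and that points are drawn uniformly; the initial radius `r0`, the function names, and the brute-force scan are illustrative choices, not the authors' setup.

```python
import random
from collections import Counter

def make_data(n, rng):
    """Uniform points in the 3-d unit cube; class 0 inside the sphere, else 1."""
    data = []
    for _ in range(n):
        p = tuple(rng.random() for _ in range(3))
        # ASSUMPTION: the sphere of radius 0.5 is centered at (0.5, 0.5, 0.5).
        inside = sum((c - 0.5) ** 2 for c in p) <= 0.5 ** 2
        data.append((p, 0 if inside else 1))
    return data

def classify_by_doubling(train, x, r0=0.01):
    """Double the search radius until it contains at least one training point,
    then return the majority class among the points found."""
    if not train:
        raise ValueError("training set is empty")
    r = r0
    while True:
        found = [c for p, c in train
                 if sum((a - b) ** 2 for a, b in zip(p, x)) <= r * r]
        if found:
            return Counter(found).most_common(1)[0][0]
        r *= 2
```

A run matching the slide would use `make_data(3000, rng)` for training and `make_data(1000, rng)` for testing with `rng = random.Random(seed)`. Each query here scans the whole training set, which is the cost that the grid-based structure is meant to avoid.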
15
Experiments (Cont'd): Real Data 1 -- Letter Recognition
Objective: identify a pixel display as one of the 26 letters.
16 numerical attributes describe each pixel display.
15,000 training examples, 5,000 test examples
Add 5% noise by randomly assigning classes.
Grid size: 16 units
#Classifiers: 2
[Figure axis: number of rescans]
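The noise step can be sketched as below. The slide only says that classes are assigned at random for 5% of the examples, so the details here are assumptions: which examples are picked, and that the random class may coincide with the original one. The function name is hypothetical.

```python
import random

def add_label_noise(labels, classes, rate, rng):
    """Return a copy of `labels` where a fraction `rate` of the entries is
    reassigned to a class drawn uniformly from `classes`.

    ASSUMPTION: the drawn class may equal the original label.
    """
    noisy = list(labels)
    n_noisy = round(rate * len(noisy))
    for idx in rng.sample(range(len(noisy)), n_noisy):
        noisy[idx] = rng.choice(classes)
    return noisy
```

For the letter data this would be called with `rate=0.05` and the 26 letters as `classes`.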
16
ANNCAD vs VFDT (Very Fast Decision Tree)
Real Data 2 -- Forest Cover Type
Objective: predict forest cover type.
10 numerical attributes.
12,000 training examples, 9,000 test examples
Grid size: 32 units
#Classifiers: 2
17
Concept Shift: ANNCAD vs CVFDT
Real Data 3 -- Adult
Objective: determine whether a person has salary > 50K.
Concept shift simulation: group by races.
Forgetting factor = 0.98
Grid size: 64
#Classifiers: 2
CVFDT: concept-adapting VFDT
CVFDT: not understood
18
Conclusion and Future Work
ANNCAD: an incremental classification algorithm that finds the NN adaptively.
Suitable for mining data streams: fast update speed.
Exponential forgetting for concept shift/drift.
Future Work: Detect concept shift/drift by changes in the class labels of blocks.
19
THANK YOU!