1 Data Stream Management Systems
Checkpoint
CS240B Notes by Carlo Zaniolo, UCLA CSD
With slides from a KDD04 tutorial by Haixun Wang, Jian Pei & Philip Yu
2 Mining Data Streams: Challenges
- On-line response (NB), limited memory, most recent windows only
- Fast & Light algorithms needed:
  - Must minimize usage of memory and CPU
  - Require only one (or a few) passes through the data
- Concept shift/drift: changes in mining-set statistics
  - Render previously learned models inaccurate or invalid
  - Robustness and adaptability: quickly recover/adjust after concept changes
- Popular machine learning algorithms are no longer effective:
  - Neural nets: slow learners that require many passes
  - Support Vector Machines (SVM): computationally expensive
  - Apriori: many passes and expensive (association rule mining is difficult on data streams)
3 The Decision Tree Classifier
- Learning (Training):
  - Input: a data set of (a, b), where a is a vector, b a class label
  - Output: a model (decision tree)
- Testing:
  - Input: a test sample (x, ?)
  - Output: a class label prediction for x
4 Decision Tree Classifiers
- A divide-and-conquer approach
  - Simple algorithm, intuitive model
- Typically a decision tree grows one level for each scan of the data
  - Multiple scans are required
  - But if we can use small samples, this problem disappears
- But the data structure is not ‘stable’
  - Subtle changes in the data can cause global changes in the data structure
5 Stable Trees Using Samples
How many samples do we need to build, in constant time, a tree that is nearly identical to the tree a batch learner (C4.5, Sprint,...) would build?
Nearly identical?
- Categorical attributes:
  - With high probability, the attribute we choose for a split is the same attribute as would be chosen by a batch learner
  - Identical decision tree
- Continuous attributes:
  - Discretize them into categorical ones
Forget concept changes for now
6 Hoeffding Trees
- The Hoeffding bound is applied to the information gain
- Error decreases when n (# of samples) increases
- At each node, we accumulate enough samples (n) before we make a split
- Scales better than traditional DT algorithms
  - Incremental: the nodes are created incrementally as new samples stream in
  - Sub-linear with sampling
  - Small memory requirement
- Cons:
  - Only considers the top 2 attributes
  - Tie breaking takes time
  - Growing a deep tree takes time
  - Discrete attributes only
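The split test above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the measured quantity (for information gain with two classes, R = 1 bit), delta is the allowed failure probability, and n is the number of samples seen at the node. The function names `hoeffding_bound` and `should_split` are hypothetical.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability at least 1 - delta, the true mean
    is within epsilon of the mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split once the observed gap between the top 2 attributes exceeds epsilon.

    When it does, the best attribute on the sample is, with high probability,
    also the one a batch learner would pick on the full data.
    """
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Epsilon shrinks as more samples accumulate at the node.
    for n in (100, 1_000, 10_000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

Because epsilon shrinks as 1/sqrt(n), a node with two nearly tied attributes may wait a long time before splitting, which is the "tie breaking takes time" drawback listed above.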
7 VFDT
- Very Fast Decision Tree [Domingos, Hulten 2000]
  - Several improvements: faster and less memory
- Concept changes? A naïve approach:
  - Place a sliding window on the stream
  - Reapply C4.5 or VFDT whenever the window moves
  - Time consuming!
8 CVFDT
- Concept-adapting VFDT
  - Hulten, Spencer, Domingos, 2001
- Goal
  - Classifying concept-drifting data streams
- Approach
  - Make use of the Hoeffding bound
  - Incorporate “windowing”
  - Monitor changes of information gain for attributes. If the change reaches a threshold, generate an alternate subtree with the new “best” attribute, but keep it in the background.
  - Replace the old subtree if the new one becomes more accurate.
9 Classifiers for Data Streams
- Fast and Light Classifiers:
  - Naïve Bayesian: one pass to count occurrences
    - Sliding windows, tumbles and slides
  - Adaptive Nearest Neighbor Classification Algorithm (ANNCAD)
- Ensembles of Classifiers (decision trees or others)
  - Bagging Ensembles and
  - Boosting Ensembles
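To see why a naïve Bayesian classifier needs only one pass, here is a minimal sketch for categorical attributes. Each arriving tuple only increments counters, and prediction reads them back. The class name `CountingNaiveBayes` and the use of Laplace (add-one) smoothing are assumptions made for illustration; the slides do not specify either.

```python
import math
from collections import defaultdict


class CountingNaiveBayes:
    """Naive Bayes over categorical attributes, trained by counting in one pass."""

    def __init__(self):
        self.total = 0
        self.class_counts = defaultdict(int)    # label -> count
        self.feature_counts = defaultdict(int)  # (attr index, value, label) -> count
        self.values_seen = defaultdict(set)     # attr index -> distinct values

    def update(self, x, label):
        """Absorb one labeled tuple; the cost is linear in the number of attributes."""
        self.total += 1
        self.class_counts[label] += 1
        for i, v in enumerate(x):
            self.feature_counts[(i, v, label)] += 1
            self.values_seen[i].add(v)

    def predict(self, x):
        """Return the label with the highest smoothed log-posterior."""
        best_label, best_score = None, -math.inf
        for label, c in self.class_counts.items():
            score = math.log(c / self.total)
            for i, v in enumerate(x):
                k = len(self.values_seen[i]) or 1
                count = self.feature_counts.get((i, v, label), 0)
                score += math.log((count + 1) / (c + k))  # Laplace smoothing (assumed)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = CountingNaiveBayes()
    stream = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
              (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes")]
    for x, y in stream:
        nb.update(x, y)
    print(nb.predict(("sunny", "hot")))
```

Since the model is just a set of counters, a sliding window could be supported by decrementing the counters for tuples that expire; that extension is not shown here.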
10 Basic Ideas
- Stream partitioned into sequential chunks
- Train a classifier from each chunk
- Accuracy of voting ensembles is normally better than that of a single classifier
- Method 1: Bagging
  - Weighted voting: weights are assigned to classifiers based on their recent performance on the current test examples
  - Only the top K classifiers are used
- Method 2: Boosting
  - Majority voting
  - Classifiers retired by age
  - Boosting used in training
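The weighted-voting scheme of Method 1 can be sketched as follows. This is an illustration under stated assumptions, not the tutorial's implementation: classifiers are modeled as plain callables, each weight is simply the classifier's accuracy on a batch of recent labeled examples, and the names `recent_accuracy` and `weighted_vote` are hypothetical.

```python
from collections import defaultdict


def recent_accuracy(classifier, labeled_examples):
    """Fraction of recent labeled examples the classifier gets right (assumed weight)."""
    hits = sum(1 for x, y in labeled_examples if classifier(x) == y)
    return hits / len(labeled_examples)


def weighted_vote(classifiers, weights, x, top_k):
    """Predict with the top_k highest-weighted classifiers; each vote counts by weight."""
    ranked = sorted(range(len(classifiers)),
                    key=lambda i: weights[i], reverse=True)[:top_k]
    tally = defaultdict(float)
    for i in ranked:
        tally[classifiers[i](x)] += weights[i]
    return max(tally, key=tally.get)


if __name__ == "__main__":
    # Three toy classifiers, one per chunk; each is just a function of x.
    chunk_models = [lambda x: "a", lambda x: "b", lambda x: "b"]
    recent = [(0, "b"), (1, "b"), (2, "a")]
    weights = [recent_accuracy(m, recent) for m in chunk_models]
    print(weights, weighted_vote(chunk_models, weights, x=3, top_k=3))
```

Restricting the vote to the top K members keeps prediction cost bounded and leaves out classifiers trained on chunks that no longer match the current concept.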
11 Bagging Ensemble Method
12 Mining Streams with Concept Changes
- Changes detected by a drop in accuracy or by other methods
  - Build new classifiers on new windows
  - Search among old ones for those that have now become accurate
13 Boosting Ensembles for Adaptive Mining of Data Streams
Andrea Fang Chu, Carlo Zaniolo [PAKDD2004]
14 Mining Data Streams: Desiderata
- Fast learning (preferably in one pass of the data)
- Light requirements (low time complexity, low memory requirement)
- Adaptation (model always reflects the time-changing concept)
15 Adaptive Boosting Ensembles
- Training stream is split into blocks (i.e., windows)
- Each individual classifier is learned from a block
- A boosting ensemble (7 to 19 members) is maintained over time
- Decisions are taken by simple majority
- As the (N+1)-th classifier is built, boost the weight of the tuples misclassified by the first N
- Change detection is explored to achieve adaptation
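The block-by-block procedure above can be sketched as follows. This is a rough illustration, not the algorithm of the PAKDD 2004 paper: the weak learner is a one-attribute decision stump over categorical attributes, the boost is a fixed multiplicative factor (the hypothetical `boost` parameter) rather than the paper's weighting formula, and the oldest member is retired once the ensemble is full. All names are hypothetical.

```python
from collections import Counter, deque


def train_stump(block, weights):
    """Weak learner: pick the single attribute with the lowest weighted error."""
    best = None
    for a in range(len(block[0][0])):
        table = {}  # attribute value -> weighted class totals
        for (x, y), w in zip(block, weights):
            table.setdefault(x[a], Counter())[y] += w
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        err = sum(w for (x, y), w in zip(block, weights) if rule[x[a]] != y)
        if best is None or err < best[0]:
            best = (err, a, rule)
    _, attr, rule = best
    default = Counter(y for _, y in block).most_common(1)[0][0]
    return lambda x: rule.get(x[attr], default)


class BoostingEnsemble:
    def __init__(self, max_members=7, boost=2.0):
        self.members = deque(maxlen=max_members)  # oldest member drops out when full
        self.boost = boost                        # assumed fixed boost factor

    def predict(self, x):
        """Simple majority vote over the current members."""
        return Counter(m(x) for m in self.members).most_common(1)[0][0]

    def learn_block(self, block):
        """Train one new member, up-weighting tuples the current ensemble gets wrong."""
        weights = [1.0] * len(block)
        if self.members:
            for i, (x, y) in enumerate(block):
                if self.predict(x) != y:
                    weights[i] *= self.boost
        self.members.append(train_stump(block, weights))


if __name__ == "__main__":
    block = [(("a", "p"), 1), (("a", "q"), 1), (("b", "p"), 0), (("b", "q"), 0)]
    ens = BoostingEnsemble(max_members=3)
    for _ in range(4):
        ens.learn_block(block)
    print(len(ens.members), ens.predict(("a", "q")))
```

Each member sees only one block and is a very shallow model, so training cost and memory stay small no matter how long the stream runs.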
16 Fast and Light
- Experiments show that boosting ensembles of “weak learners” provide accurate prediction
- Weak learners:
  - An aggressively pruned decision tree, e.g., a shallow tree (this means fast!)
  - Trained on a small set of examples (this means light in memory requirements!)
17 Adaptation
Detect changes that cause significant drops in ensemble performance:
- Gradual changes: concept drift
- Abrupt changes: concept shift
18 Adaptability
- The error rate is viewed as a random variable
- When it rises significantly above the recent average, the whole ensemble is dropped
- And a new one is quickly re-learned
- Cost/performance of boosting ensembles is better than that of bagging ensembles [KDD04]
- BUT ???
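One simple way to treat the error rate as a random variable is to compare each new block's error against the mean and standard deviation of recent blocks. The sketch below uses a plain "mean plus k standard deviations" threshold; this is an assumed stand-in, not the statistical test used in the cited papers, and the class name `ErrorRateMonitor` and its parameters (`history`, `threshold`, `min_std`) are hypothetical.

```python
import statistics
from collections import deque


class ErrorRateMonitor:
    """Flags a concept change when a block's error rate is far above the recent average."""

    def __init__(self, history=10, threshold=3.0, min_std=0.01):
        self.recent = deque(maxlen=history)  # error rates of recent blocks
        self.threshold = threshold           # number of standard deviations tolerated
        self.min_std = min_std               # floor, so a flat history is not oversensitive

    def change_detected(self, error_rate):
        """Return True if error_rate is significantly worse than the recent average."""
        if len(self.recent) >= 2:
            mean = statistics.mean(self.recent)
            std = max(statistics.pstdev(self.recent), self.min_std)
            if error_rate > mean + self.threshold * std:
                self.recent.clear()  # history belongs to the old concept
                return True
        self.recent.append(error_rate)
        return False


if __name__ == "__main__":
    monitor = ErrorRateMonitor()
    for err in [0.10, 0.11, 0.09, 0.10, 0.45]:
        if monitor.change_detected(err):
            print("change detected at error rate", err, "-> drop and re-learn ensemble")
```

On a detected change the caller would discard the ensemble and start training a fresh one from the next blocks, as the slide describes.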
19 References
- Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept Drifting Data Streams using Ensemble Classifiers. In the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
- Pedro Domingos, Geoff Hulten. Mining High Speed Data Streams. In the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
- Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data Streams. In the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
- Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data Streams. In the SIAM International Conference on Data Mining (SIAM DM), 2004.
- Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data Streams. 4th IEEE International Conference on Data Mining (ICDM), 2004.
- Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams. PAKDD 2004.
- Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification Algorithm for Data Streams. 2005 ECML/PKDD Conference, Porto, Portugal, October 3-7, 2005.