A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
Jing Gao, Wei Fan, Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign and IBM T. J. Watson Research Center
Introduction (1)
Data Stream
– Continuously arriving data flow
– Applications: network traffic, credit card transaction flows, phone call records, etc.
[Figure: a continuously arriving stream of binary records]
Introduction (2)
Stream Classification
– Construct a classification model from past records
– Use the model to predict labels for new data
– Helps decision making, e.g., flagging a transaction as fraud
[Figure: labeled past records train a classification model, which labels incoming transactions as fraud or not]
Framework
[Figure: past data chunks train a classification model, which predicts labels for the incoming chunk]
Concept Drifts
Changes in P(x, y)
– P(x, y) = P(y|x) P(x), where x is the feature vector and y is the class label
– Four cases: no change, feature change (only P(x) changes), conditional change (only P(y|x) changes), dual change (both change); a sketch of the drift types follows below
– Expected error is not a good indicator of concept drift
– Training on the most recent data could help reduce expected error
[Figure: evolving data distributions at time stamps 1, 11, and 21]
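The slide lists the drift types abstractly; the following sketch (ours, not from the paper) generates a tiny 1-D synthetic stream in which P(x) is a Gaussian and P(y|x) is a threshold boundary, so each drift type corresponds to moving the mean, the boundary, or both. All names here are illustrative.

```python
# A minimal sketch of the three drift types on a 1-D synthetic stream:
# feature change alters P(x), conditional change alters P(y|x),
# dual change alters both.
import numpy as np

rng = np.random.default_rng(0)

def make_chunk(n, mean, boundary):
    """Draw a chunk: P(x) is Normal(mean, 1); P(y|x) labels x > boundary as 1."""
    x = rng.normal(mean, 1.0, size=n)
    y = (x > boundary).astype(int)
    return x, y

# Time stamp 1: baseline distribution and concept.
x0, y0 = make_chunk(1000, mean=0.0, boundary=0.0)
# Feature change: P(x) shifts, the boundary (P(y|x)) stays fixed.
x1, y1 = make_chunk(1000, mean=2.0, boundary=0.0)
# Conditional change: P(x) stays, the boundary moves.
x2, y2 = make_chunk(1000, mean=0.0, boundary=1.0)
# Dual change: both the distribution and the boundary move.
x3, y3 = make_chunk(1000, mean=2.0, boundary=1.0)

print("positive rate per chunk:", y0.mean(), y1.mean(), y2.mean(), y3.mean())
```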
Issues in Stream Classification (1)
Generative Model
– Assumes P(y|x) follows some known distribution
Descriptive Model
– Lets the data decide the form of the model
Stream Data
– The underlying distribution is unknown and evolving, so descriptive models are preferred
Issues in Stream Classification (2)
Label Prediction
– Classify x into a single class
Probability Estimation
– Assign x to every class with an estimated probability (see the sketch below)
Stream Applications
– Are stochastic in nature, so prediction confidence information is needed
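A minimal sketch contrasting the two outputs; the library choice (scikit-learn) and the toy data are ours, not the paper's.

```python
# Hard label prediction vs. probability estimation with a decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
x_new = np.array([[0.1, -0.05]])

print("hard label:", clf.predict(x_new))                  # a single class
print("class probabilities:", clf.predict_proba(x_new))   # confidence per class
```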
Mining Skewed Data Streams
Skewed Distribution
– Positive examples are rare: credit card frauds, network intrusions
Existing Stream Classification Algorithms
– Evaluated on balanced data
Problems
– Minority examples are ignored; a model trained on skewed data may classify every leaf node as negative
– The cost of misclassifying a minority example is usually huge
Stream Ensemble Approach (1)
Step 1: Sampling
– The current chunk alone contains insufficient positive examples for training
– Keep the positive examples seen in past chunks and sample negative examples from the current chunk (see the sketch below)
[Figure: positives accumulated from past chunks are combined with sampled negatives from the current chunk to form the training set]
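A minimal sketch of the sampling step as these slides describe it: positives accumulate across chunks because they are scarce, while negatives come only from the current chunk so that they reflect the current concept. The function name and chunk format are ours, for illustration only.

```python
# Pool positives from all chunks with sampled negatives from the current chunk.
import numpy as np

rng = np.random.default_rng(0)

def collect_training_data(past_chunks, current_chunk, n_negatives):
    """Build a training set: all positives seen so far, plus a random
    sample of negatives drawn from the current chunk only."""
    X_cur, y_cur = current_chunk
    pos_parts = [X[y == 1] for X, y in past_chunks] + [X_cur[y_cur == 1]]
    X_pos = np.vstack(pos_parts)

    neg_idx = np.flatnonzero(y_cur == 0)
    sample = rng.choice(neg_idx, size=min(n_negatives, len(neg_idx)), replace=False)
    X_neg = X_cur[sample]

    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    return X, y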
Stream Ensemble Approach (2)
Step 2: Ensemble
– Partition the sampled negative examples into k disjoint subsets
– Train k classifiers C1, C2, ..., Ck, each on all positives plus one negative subset
– Average the k probability outputs to classify new data (see the sketch below)
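A minimal sketch of the ensemble step (our own rendering, not the authors' code): k classifiers share all positives but see disjoint negative subsets, and their probability estimates are averaged.

```python
# Train k classifiers on disjoint negative slices; average their probabilities.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def train_ensemble(X_pos, X_neg, k=5):
    """Each member sees all positives plus one disjoint slice of negatives."""
    neg_parts = np.array_split(rng.permutation(len(X_neg)), k)
    models = []
    for part in neg_parts:
        X = np.vstack([X_pos, X_neg[part]])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(part))])
        models.append(DecisionTreeClassifier().fit(X, y))
    return models

def ensemble_proba(models, X_new):
    """Average the positive-class probability across ensemble members."""
    return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
```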
Why does this approach work?
Incorporation of old positive examples
– Increases the training set size and reduces variance
– The negative examples reflect the current concept, so the increase in boundary bias is small
Ensemble
– Reduces the variance caused by a single model
– Disjoint sets of negative examples mean the classifiers make uncorrelated errors (see the variance bound below)
Bagging & Boosting
– Their running cost is much higher
– They cannot generate reliable probability estimates on skewed distributions
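A standard bias-variance argument (our summary of the variance-reduction claim, not the paper's exact derivation): averaging k probability estimates whose errors are uncorrelated divides the error variance by k.

```latex
% Averaging k estimates f_1(x), ..., f_k(x) of P(y|x), each with error
% variance \sigma^2 and pairwise-uncorrelated errors (from the disjoint
% negative subsets), reduces the variance of the ensemble estimate:
\[
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} f_i(x)\right)
  = \frac{1}{k^2}\sum_{i=1}^{k}\operatorname{Var}\big(f_i(x)\big)
  = \frac{\sigma^2}{k}.
\]
```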
Analysis
Error Reduction
– Sampling: more positive examples reduce variance
– Ensemble: averaging k models reduces variance further
Efficiency Analysis
– A single model trained on all sampled data vs. an ensemble of k models trained on disjoint negative subsets
– The ensemble is more efficient: each member trains on a much smaller set, and training cost typically grows faster than linearly in training-set size
Experiments
Measures
– Mean squared error of the estimated probabilities
– ROC curve
– Recall-precision curve
Baseline Methods (see the evaluation sketch below)
– NS: no sampling + single model
– SS: sampling + single model
– SE: sampling + ensemble (the proposed approach)
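A minimal sketch computing the three reported measures from probability estimates; the library and toy data are our choices, not the paper's setup.

```python
# Evaluate probability estimates p_hat against labels y_true.
import numpy as np
from sklearn.metrics import mean_squared_error, roc_curve, precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
p_hat = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.9])

mse = mean_squared_error(y_true, p_hat)                # mean squared error
fpr, tpr, _ = roc_curve(y_true, p_hat)                 # points on the ROC curve
prec, rec, _ = precision_recall_curve(y_true, p_hat)   # recall-precision points

print(f"MSE={mse:.3f}")
```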
Experimental Results (1)
Mean squared error on synthetic data
– Feature change: only P(x) changes
– Conditional change: only P(y|x) changes
– Dual change: both P(x) and P(y|x) change
[Figure: MSE comparison under each drift type]
Experimental Results (2)
[Figure: mean squared error on real data]
Experimental Results (3)
[Figures: ROC curve and recall-precision plot on synthetic data]
Experimental Results (4)
[Figures: ROC curve and recall-precision plot on real data]
Experimental Results (5)
[Figure: training time comparison]
Conclusions
General issues in stream classification
– Concept drifts
– Descriptive models
– Probability estimation
Mining skewed data streams
– Sampling and ensemble techniques
– Accurate and efficient
Wide applications
– Graph data
– Air force data
Thanks! Any questions?