Published by Dandre Mileham. Modified over 10 years ago.
1 Active Mining of Data Streams
Wei Fan, Yi-an Huang, Haixun Wang and Philip S. Yu
Proc. SIAM International Conference on Data Mining, 2004
Speaker: Pei-Min Chou
Date: 2005/01/14
2 Introduction
In most real-world problems, labelled data streams are rarely immediately available.
Models are refreshed periodically.
We propose a new concept of demand-driven active data mining.
3 Method
Step 1: Detect potential changes in the data stream (a "guess").
Step 2: If the guessed loss or error rate is higher than the tolerable maximum, choose a small number of data records and investigate their true class labels.
Step 3: If the statistically estimated loss is higher than the tolerable maximum, reconstruct the old model.
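The three steps can be read as a gated check, sketched below in Python. This is only an illustration under stated assumptions: `active_mining_step`, `sample_losses` and `rebuild_model` are hypothetical names, and the guessed loss is passed in as a plain number rather than computed from a tree.

```python
from typing import Callable, Sequence


def active_mining_step(
    guessed_loss: float,
    tolerable_max: float,
    sample_losses: Callable[[], Sequence[float]],
    rebuild_model: Callable[[], None],
) -> str:
    """Run one pass of the three-step check and report where it stopped."""
    # Step 1: the guess needs no labels; here it is simply passed in.
    if guessed_loss <= tolerable_max:
        return "keep model (guess within tolerance)"

    # Step 2: the guess is too high, so pay for true labels on a small sample.
    losses = sample_losses()
    estimated_loss = sum(losses) / len(losses)

    # Step 3: rebuild only if the sampled estimate is also too high.
    if estimated_loss > tolerable_max:
        rebuild_model()
        return "model reconstructed"
    return "keep model (estimate within tolerance)"


rebuilds = []
outcome = active_mining_step(
    guessed_loss=0.30,
    tolerable_max=0.10,
    sample_losses=lambda: [1, 0, 1, 0],  # hypothetical 0-1 losses
    rebuild_model=lambda: rebuilds.append("rebuilt"),
)
print(outcome)  # prints: model reconstructed
```

The labelled sample is only requested when the label-free guess already exceeds the tolerance, which is what makes the scheme demand-driven.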
4 Definition(1)
D_c: complete data set
D: training set
S: data stream
dt: decision tree constructed from D
Tolerable maximum: exact values are defined entirely by each application
5 Definition(2)
n_l: number of instances classified by leaf l
N: size of the data stream
Statistic at leaf l: p(l) = n_l / N
The statistics over all leaves sum to one: Σ_l p(l) = 1
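As a small illustration of the leaf statistic, the sketch below counts how many records fall into each leaf and divides by the total. The function name `leaf_statistics` is hypothetical; the leaf assignments mirror the seven labelled training records of the Example slide.

```python
from collections import Counter
from fractions import Fraction


def leaf_statistics(leaf_of_record, leaves):
    """Return p(l) = n_l / N for every leaf, including leaves with no record."""
    counts = Counter(leaf_of_record)
    total = len(leaf_of_record)
    return {leaf: Fraction(counts[leaf], total) for leaf in leaves}


# One leaf label per record: Mary, John, Billy, Ella, Paul, Tom, Amy.
assignments = ["C2", "C4", "C1", "C3", "C6", "C1", "C6"]
stats = leaf_statistics(assignments, ["C1", "C2", "C3", "C4", "C5", "C6"])
print(stats["C1"], stats["C5"], sum(stats.values()))  # prints: 2/7 0 1
```

Exact fractions are used so that the sum-to-one property can be checked without rounding error.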
6 Example
D_c: complete set
Name   Bank  Local  Price
Mary   ICE   A      500
John   IAE   B      700
Billy  ICE   A      100
Ella   ICE   B      300
Bob    IDE   C      500
Paul   IBE   B      700
Tom    ICE   A      100
Amy    IBE   B      700
D: training set
Name   Bank  Local  Price  Class
Mary   ICE   A      500    C2
John   IAE   B      700    C4
Billy  ICE   A      100    C1
Ella   ICE   B      300    C3
Paul   IBE   B      700    C6
Tom    ICE   A      100    C1
Amy    IBE   B      700    C6
7 Example: decision tree
Node tests (each with a yes and a no branch): Bank is ICE; Local is A; Price is 100; Bank is IBE; Local is B
Leaf  Training records  p_D(l)
C1    Billy, Tom        2/7
C2    Mary              1/7
C3    Ella              1/7
C4    John              1/7
C5    (none)            0
C6    Paul, Amy         2/7
8 Observable Statistics(1)
p_S(l): statistic at leaf l in S
p_D(l): statistic at leaf l in D
PS: change of leaf statistics on the data stream
A large PS means that a significant change has occurred.
9 Example(2)
S: new data stream
Name  Bank  Local  Price
Erin  ICE   A      500
JoJo  IAE   B      700
Boss  IBE   C      500
Hebe  ICE   A      500
Sam   IBE   C      500
Leaf  Stream records  p_S(l)
C1    (none)          0
C2    Erin, Hebe      2/5
C3    (none)          0
C4    Boss, Sam       2/5
C5    JoJo            1/5
C6    (none)          0
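The change in leaf statistics between the training set and the stream can be worked through numerically. The slide's formula for PS did not survive, so the sketch below assumes PS is half the summed absolute difference between p_S(l) and p_D(l), which keeps the value between 0 and 1; treat that form, and the name `leaf_change`, as assumptions.

```python
from fractions import Fraction as F


def leaf_change(p_d, p_s):
    """Assumed form of PS: half the summed absolute difference per leaf."""
    leaves = set(p_d) | set(p_s)
    return sum(abs(p_s.get(leaf, 0) - p_d.get(leaf, 0)) for leaf in leaves) / 2


# Leaf statistics from the training example (7 records) and the stream example (5 records).
p_d = {"C1": F(2, 7), "C2": F(1, 7), "C3": F(1, 7), "C4": F(1, 7), "C5": F(0), "C6": F(2, 7)}
p_s = {"C1": F(0), "C2": F(2, 5), "C3": F(0), "C4": F(2, 5), "C5": F(1, 5), "C6": F(0)}

ps = leaf_change(p_d, p_s)
print(ps)  # prints: 5/7
```

Under this assumed form, identical distributions give 0 and distributions with no leaf in common give 1, so 5/7 would indicate a substantial shift.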
10 Observable Statistics(2)
L_a: validation loss
L_e: sum of the expected loss at every leaf
LS: potential change in loss due to changes in the data stream
Difference from PS: LS takes the loss function into account.
11 Example(3)
S: new data stream
Name  Bank  Local  Price
Hebe  ICE   A      -
Sam   IBE   C      700
Leaves: C2: Erin, Hebe (30%); C4: Boss, Sam; C5: JoJo
Majority class proportion at C2: 0.7
L_e(C2) = (1 - 0.7) * 30% = 9%
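The arithmetic for leaf C2 can be reproduced with a short sketch. It assumes a 0-1 loss, so the expected loss at a leaf is its share of the stream times one minus its majority-class proportion, and it assumes LS is the absolute gap between L_e and L_a. Both forms, and the function names, are assumptions rather than the slide's exact definitions.

```python
def expected_loss(leaf_share, majority_prob):
    """Sum over leaves of (share of stream at leaf) * (1 - majority-class proportion)."""
    return sum(share * (1.0 - majority_prob[leaf]) for leaf, share in leaf_share.items())


def loss_change(expected, validation):
    """Assumed form of LS: absolute gap between expected and validation loss."""
    return abs(expected - validation)


# Only leaf C2 is given on the slide: share 30%, majority proportion 0.7.
c2_term = expected_loss({"C2": 0.30}, {"C2": 0.7})
print(round(c2_term, 4))  # prints: 0.09
```

Summing the same term over every leaf would give L_e for the whole tree; the remaining leaves are omitted here because the slide does not state their values.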
12 Loss Estimation
Triggered when the two statistics above exceed the tolerable maximum.
Investigate the true class labels of a selected number of examples.
Assume the loss of each example is {l_1, l_2, l_3, ..., l_n}.
Average loss: (Σ_i l_i) / n
Standard error: computed from the same n losses
Investigation cost: true labels are not free.
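A minimal sketch of the sampling estimate follows. The slide's standard-error formula is missing, so this uses the usual standard error of a sample mean (sample standard deviation divided by the square root of n); the n - 1 denominator, the `z` multiplier of 1.96 and the name `estimate_loss` are assumptions of the sketch.

```python
import math
import statistics


def estimate_loss(losses, z=1.96):
    """Return (mean, standard error, (low, high)) for sampled per-example losses."""
    n = len(losses)
    mean = sum(losses) / n
    # Sample standard deviation (n - 1 denominator) is an assumption of this sketch.
    std_err = statistics.stdev(losses) / math.sqrt(n)
    return mean, std_err, (mean - z * std_err, mean + z * std_err)


# Hypothetical 0-1 losses for ten investigated examples.
sample = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
mean, std_err, (low, high) = estimate_loss(sample)
print(round(mean, 2))  # prints: 0.3
```

Because each investigated label has a cost, the sample size trades off against the width of the estimated range: a larger n shrinks the standard error. `statistics.stdev` needs at least two losses.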
13 Experiment(1)
Changes in the statistics are a good indicator of change in the data stream.
14 Experiment(2)
15 Experiment(3)
16 Experiment(4)
17 Experiment: Result
The two statistics are very well correlated with the amount of change.
The statistically estimated loss range is very close to the true value.
18 Conclusion
The method estimates the error without knowing the true class labels.
A statistical sampling method estimates the range of the true loss.
The model is reconstructed whenever the estimated loss is higher than the tolerable maximum.