Data Stream Mining and Incremental Discretization John Russo CS561 Final Project April 26, 2007
Overview
- Introduction
- Data Mining: A Brief Overview
- Histograms
- Challenges of Streaming Data to Data Mining
- Using Histograms for Incremental Discretization of Data Streams
- Fuzzy Histograms
- Future Work
Introduction
- Data mining: a class of algorithms for knowledge discovery (patterns, trends, predictions)
- Utilizes statistical methods, neural networks, genetic algorithms, decision trees, etc.
- Streaming data presents unique challenges to traditional data mining:
  - Non-persistence: one opportunity to mine
  - High data rates
  - Non-discrete values
  - Data that changes over time
  - Huge volumes of data
Data Mining: Types of Relationships
- Classes: predetermined groups
- Clusters: groups of related data
- Sequential patterns: used to predict behavior
- Associations: rules are built from associations between data
Data Mining Algorithms
- K-means clustering: an unsupervised learning algorithm that partitions a data set into a pre-defined number of clusters
- Decision trees: used to generate rules for classification; two common types are CART and CHAID
- Nearest neighbor: classifies a record in a dataset based upon similar records in a historical dataset
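As a concrete illustration of the clustering idea, here is a minimal k-means sketch (the 1-D sample points, function name, and iteration count are my own choices; real mining data would be multidimensional):

```python
# Minimal 1-D k-means sketch: alternate assignment and mean-update steps.
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)   # pick k initial centers from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)
```

With two well-separated groups of points, the centers converge to the two group means regardless of which points are sampled as initial centers.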
Data Mining Algorithms (continued)
- Rule induction: uses statistical significance to find interesting rules
- Data visualization: uses graphics for mining
Histograms and Data Mining
Histograms and Supervised Learning – An Example
We have two classes:
- Mortgage approval = "Yes": P(mortgage approval = "Yes") = 5/10 = 0.5
- Mortgage approval = "No": P(mortgage approval = "No") = 5/10 = 0.5

Let's calculate some of the conditional probabilities from the training data:
- P(age <= 30 | approval = "Yes") = 2/5 = 0.4
- P(age <= 30 | approval = "No") = 2/5 = 0.4
- P(income = "Low" | approval = "Yes") = 2/5 = 0.4
- P(income = "Low" | approval = "No") = 2/5 = 0.4
- P(income = "Medium" | approval = "Yes") = 1/5 = 0.2
- P(income = "Medium" | approval = "No") = 1/5 = 0.2
- P(marital status = "Married" | approval = "Yes") = 3/5 = 0.6
- P(marital status = "Married" | approval = "No") = 3/5 = 0.6
- P(credit rating = "Good" | approval = "Yes") = 1/5 = 0.2
- P(credit rating = "Good" | approval = "No") = 2/5 = 0.4
Histograms and Supervised Learning – An Example
We will use Bayes' rule with the naïve assumption that all attributes are conditionally independent given the class. The evidence term P(A1 = a1, ..., Ak = ak) is irrelevant, since it is the same for every class. Now let's predict the class for one observation: X = (age <= 30, income = "Medium", marital status = "Married", credit rating = "Good").
Histograms and Supervised Learning – An Example
P(X | approval = "Yes") = 0.4 * 0.2 * 0.6 * 0.2 = 0.0096
P(X | approval = "No") = 0.4 * 0.2 * 0.6 * 0.4 = 0.0192
Multiplying each by its prior, P(X | C = c) * P(C = c):
- "Yes": 0.0096 * 0.5 = 0.0048
- "No": 0.0192 * 0.5 = 0.0096
X belongs to the "No" class.

The probabilities are determined by frequency counts, and the frequencies are tabulated in bins. Two common types of histograms:
- Equal-width: the range of observed values is divided into k equal intervals
- Equal-frequency: each bin holds (approximately) the same number of observations
The difficulty is determining the number of bins k. Rules of thumb include Sturges' rule (k = 1 + log2 n) and Scott's rule (bin width h = 3.49 s n^(-1/3)). Determining k for a data stream is problematic.
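The arithmetic above can be checked with a few lines of Python (the dictionary layout and attribute labels are my own; the probabilities are the training-data fractions from the slides, taking 2/5 = 0.4 for credit rating "Good" given "No"):

```python
# Naive Bayes scoring for the mortgage example: multiply the class prior by
# the conditional probability of each observed attribute value.
priors = {"Yes": 0.5, "No": 0.5}
likelihoods = {
    "Yes": {"age<=30": 0.4, "income=Medium": 0.2,
            "married": 0.6, "credit=Good": 0.2},
    "No":  {"age<=30": 0.4, "income=Medium": 0.2,
            "married": 0.6, "credit=Good": 0.4},
}

scores = {}
for c in priors:
    p = priors[c]
    for value_prob in likelihoods[c].values():
        p *= value_prob
    scores[c] = p              # unnormalized posterior for class c

prediction = max(scores, key=scores.get)
print(scores, prediction)      # "No" wins
```

The evidence term is omitted, as on the slide: it scales both scores equally and cannot change which class wins.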
Challenges of Data Streaming to Data Mining
- Determining k for a histogram or machine learning
- Concept drift: data from the past is no longer valid for the model today
- Several approaches:
  - Incremental learning (e.g., CVFDT)
  - Ensemble classifiers
  - Ambiguous decision trees
- What about the "ebb and flow" problem?
Incremental Discretization
- A way to create discrete intervals from a data stream
- Partition Incremental Discretization (PID) algorithm (Gama and Pinto)
- A two-level algorithm:
  - Level 1 creates intervals in only one pass over the stream
  - Level 2 aggregates level 1 intervals into level 2 intervals
Incremental Discretization Example
Sensor data reports air temperature, soil moisture, and the flow of water in a sprinkler. The data shown on the previous slide is training data. Once trained, the model can predict what the sprinkler should be set to based upon conditions: a 4-class problem.
Incremental Discretization Example
We will walk through level 1 for the temperature attribute:
- Decide an estimated range: 30–85
- Pick a number of intervals (11), so the step is set to 5
- Maintain two vectors: breaks and counts
- Set a threshold for splitting an interval: 33% of all observed values
- Work through the training set:
  - If a value falls below the lower bound of the range, add a new interval before the first interval
  - If a value falls above the upper bound of the range, add a new interval after the last interval
  - If an interval's count reaches the threshold, split it evenly and divide the count between the old interval and the new one
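The steps above can be sketched in Python. This is my own simplified reading of PID's first layer, not Gama and Pinto's exact code; the function names and the `alpha` threshold parameter are assumptions:

```python
# Layer 1 of PID (sketch): maintain breaks and counts in one pass over the stream.
def make_layer1(lo=30.0, hi=85.0, step=5.0):
    breaks = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    counts = [0.0] * (len(breaks) - 1)   # one count per interval
    return breaks, counts

def update(breaks, counts, x, alpha=1/3):
    # Grow the range with a new interval when x falls outside it.
    if x < breaks[0]:
        breaks.insert(0, x)
        counts.insert(0, 0.0)
    elif x > breaks[-1]:
        breaks.append(x)
        counts.append(0.0)
    # Find the interval containing x and count it.
    for i in range(len(counts)):
        if breaks[i] <= x <= breaks[i + 1]:
            counts[i] += 1
            # Split the interval if it holds more than alpha of all points,
            # dividing its count between the old half and the new half.
            if counts[i] / sum(counts) > alpha and counts[i] > 1:
                mid = (breaks[i] + breaks[i + 1]) / 2
                breaks.insert(i + 1, mid)
                half = counts[i] / 2
                counts[i] = half
                counts.insert(i + 1, half)
            break

breaks, counts = make_layer1()
for t in [72, 81, 83, 84, 61, 28]:       # sample temperature readings (assumed)
    update(breaks, counts, t)
```

Splitting halves a count, which is how fractional counts such as the 2.5 and 3.5 in the example vectors arise.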
Incremental Discretization Example
Breaks and counts vectors for our sample after training:

Breaks: 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 82.5, 85, 90, 95
Counts:   1,  1,  0,  0,  0,  0,  1,  0,  1,  0, 2.5, 3.5,   2,  1,  0

(each count is the tally for the interval between consecutive breaks)
Second Layer
The second layer is invoked whenever necessary:
- User intervention
- Changes in the intervals of the first layer
Input:
- Breaks and counters from layer 1
- Type of histogram to be generated
Second Layer
The objective is to create a smaller number of intervals from the layer 1 intervals.
- Equal-width histograms:
  - Compute the number of intervals based upon the range observed in layer 1
  - Traverse the vector of breaks once, adding the counters of consecutive intervals
- Equal-frequency histograms:
  - Compute the exact number of data points wanted in each interval
  - Traverse the counts vector, adding counts of consecutive intervals
  - Close each layer 2 interval when the target frequency is reached
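The equal-frequency pass can be sketched as follows. This is a greedy simplification under my own naming; the break and count values are the ones from the earlier temperature example:

```python
# Layer 2 of PID (sketch): collapse layer-1 intervals into k intervals
# holding roughly equal counts.
def equal_frequency(breaks, counts, k):
    total = sum(counts)
    target = total / k                  # points wanted per layer-2 interval
    out_breaks = [breaks[0]]
    acc = 0.0
    for i, c in enumerate(counts):
        acc += c
        # Close the current layer-2 interval once the target is reached.
        if acc >= target and len(out_breaks) < k:
            out_breaks.append(breaks[i + 1])
            acc = 0.0
    out_breaks.append(breaks[-1])       # final interval runs to the last break
    return out_breaks

breaks = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 82.5, 85, 90, 95]
counts = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2.5, 3.5, 2, 1, 0]
layer2 = equal_frequency(breaks, counts, 4)
```

Because layer-2 boundaries must fall on layer-1 breaks, the resulting frequencies are only approximately equal, and the greedy pass may produce fewer than k intervals.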
Application of PID for Data Mining
- Add a data structure to both layer 1 and layer 2: a matrix with intervals as columns and classes as rows
- Naïve Bayesian classification can then be done easily
Example Matrix – Temperature Attribute

Class | 25 | 30 | 35 | 40 | 45 | 50 | 55 | 60 | 65 | 70 | 75 | 80 | 82.5 | 85 | 90 | 95
High  |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |    1 |  1 |  1 |  0
Med   |  0 |  0 |  0 |  0 |  1 |  0 |  1 |  0 |  0 |  0 |  1 |  0 |    2 |  1 |  0 |  0
Low   |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  1 |  0 |  0 |  2 |    1 |  0 |  0 |  0
Off   |  1 |  1 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |    0 |  0 |  0 |  0

(columns are labeled by the lower bound of each temperature interval)
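A sketch of how this matrix feeds naive Bayes: row totals give class counts, and each cell divided by its row total estimates P(interval | class). The rows and columns are transcribed from the slide; the helper function and its interval convention are my own:

```python
# Estimate P(temperature interval | class) from the class-by-interval matrix.
breaks = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 82.5, 85, 90, 95]
matrix = {
    "High": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    "Med":  [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 2, 1, 0, 0],
    "Low":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0],
    "Off":  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

def p_interval_given_class(x, cls):
    # Column i covers [breaks[i], breaks[i+1]); the last column is unused here.
    row = matrix[cls]
    total = sum(row)
    for i in range(len(breaks) - 1):
        if breaks[i] <= x < breaks[i + 1]:
            return row[i] / total
    return 0.0

p = p_interval_given_class(83, "High")   # temperature 83 falls in [82.5, 85)
```

Repeating this per attribute and multiplying by the class prior gives the same naive Bayes scoring used in the mortgage example.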
Dealing with Concept Drift
What happens when the training is no longer valid (for example, in winter)? Assume the sensors are still on in winter but the sprinklers are not.
Dealing with Concept Drift: Fuzzy Histograms
- Fuzzy histograms are used for visual content representation
- A given attribute value can be a member of more than one interval, with varying degrees of membership
- The degree of membership is determined by a membership function
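One possible reading of a fuzzy histogram update, assuming a triangular membership function (this is my own illustration, not necessarily the function used by Doulamis and Doulamis; the bin centers and readings are invented):

```python
# Fuzzy histogram sketch: each value contributes partial counts to
# neighbouring bins according to its degree of membership.
def triangular(x, center, width):
    # Membership is 1 at the bin center and falls linearly to 0 at +/- width.
    return max(0.0, 1.0 - abs(x - center) / width)

centers = [20, 40, 60, 80]          # fuzzy bin centers (assumed)
counts = [0.0] * len(centers)

for x in [35, 50, 78]:              # sample soil-moisture readings (assumed)
    memberships = [triangular(x, c, 20.0) for c in centers]
    s = sum(memberships) or 1.0
    for i, m in enumerate(memberships):
        counts[i] += m / s          # normalized partial count per bin
```

Each observation still contributes a total weight of 1, but spread across the intervals it partially belongs to, which is the "member of more than one interval" idea above.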
Fuzzy Histograms with PID
- Use a membership function to build layer 2 intervals based upon a determinant in layer 1
- Sprinkler example:
  - Soil moisture is potentially a member of more than one interval, one of which is a high-value interval
  - During winter, ensure that all values of moisture fall into the highest end of the range
References
[1] Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
[2] Sturges, H. (1926). The choice of a class-interval. J. Amer. Statist. Assoc., 21, 65–66.
[3] Scott, D.W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610.
[4] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453–476.
[5] Zhang, J., Liu, H., and Wang, P.P. (2006). Some current issues of streaming data mining. Information Sciences, 176(14), Streaming Data Mining, 22 July 2006, 1949–1951.
[6] Hulten, G., Spencer, L., and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01, San Francisco, CA, August 26–29, 2001). ACM Press, New York, NY, 97–106.
[7] Wang, H., Fan, W., Yu, P.S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03, Washington, D.C., August 24–27, 2003). ACM Press, New York, NY, 226–235.
[8] Natwichai, J. and Li, X. (2004). Knowledge maintenance on data streams with concept drifting. In: Zhang, J., He, J., and Fu, Y. (eds.), 2004, 705–710, Shanghai, China.
[9] Gama, J. and Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06, Dijon, France, April 23–27, 2006). ACM Press, New York, NY, 662–667.
[10] Doulamis, A. and Doulamis, N. (2001). Fuzzy histograms for efficient visual content representation: application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME '01), p. 227. IEEE Press.
[11] Gaber, M.M., Zaslavsky, A., and Krishnaswamy, S. (2005). Mining data streams: a review. SIGMOD Rec., 34(2), 18–26.
Questions?