Data Stream Mining and Incremental Discretization John Russo CS561 Final Project April 26, 2007.


1 Data Stream Mining and Incremental Discretization John Russo CS561 Final Project April 26, 2007

2 Overview
 Introduction
 Data Mining: A Brief Overview
 Histograms
 Challenges of Streaming Data to Data Mining
 Using Histograms for Incremental Discretization of Data Streams
 Fuzzy Histograms
 Future Work

3 Introduction
 Data mining
   A class of algorithms for knowledge discovery: patterns, trends, predictions
   Uses statistical methods, neural networks, genetic algorithms, decision trees, etc.
 Streaming data presents unique challenges to traditional data mining
   Non-persistence – only one opportunity to mine each item
   High data rates
   Non-discrete (continuous) values
   Distributions that change over time
   Huge volumes of data

4 Data Mining: Types of Relationships
 Classes – predetermined groups
 Clusters – groups of related data
 Sequential patterns – used to predict behavior
 Associations – rules are built from associations between data items

5 Data Mining Algorithms
 K-means clustering
   Unsupervised learning algorithm
   Partitions a data set into k pre-defined clusters
 Decision trees
   Used to generate rules for classification
   Two common types: CART and CHAID
 Nearest neighbor
   Classifies a record based upon similar records in a historical dataset
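A minimal one-dimensional k-means sketch in plain Python may make the cluster idea concrete (illustrative only; the initialization and names here are my own, not from the slides):

```python
def kmeans(points, k, iters=10):
    # Initialize centroids with the first k distinct sorted values.
    centroids = sorted(set(points))[:k]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

cents = kmeans([1, 2, 10, 11, 12], 2)
```

Running on the toy data above, the two centroids settle at the means of the two obvious groups.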

6 Data Mining Algorithms (continued)
 Rule induction
   Uses statistical significance to find interesting rules
 Data visualization
   Uses graphics to explore data for mining

7 Histograms and Data Mining

8 Histograms and Supervised Learning – An Example

9  We have two classes:
 Mortgage approval = "Yes"
   P(mortgage approval = "Yes") = 5/10 = .5
 Mortgage approval = "No"
   P(mortgage approval = "No") = 5/10 = .5
 Let's calculate some of the conditional probabilities from the training data:
   P(age <= 30 | approval = "Yes") = 2/5 = .4
   P(age <= 30 | approval = "No") = 2/5 = .4
   P(income = "Low" | approval = "Yes") = 2/5 = .4
   P(income = "Low" | approval = "No") = 2/5 = .4
   P(income = "Medium" | approval = "Yes") = 1/5 = .2
   P(income = "Medium" | approval = "No") = 1/5 = .2
   P(marital status = "Married" | approval = "Yes") = 3/5 = .6
   P(marital status = "Married" | approval = "No") = 3/5 = .6
   P(credit rating = "Good" | approval = "Yes") = 1/5 = .2
   P(credit rating = "Good" | approval = "No") = 2/5 = .4

10 Histograms and Supervised Learning – An Example
 We will use Bayes' rule and the naïve assumption that all attributes are independent:
   P(C = c | X) ∝ P(X | C = c) · P(C = c)
 The denominator P(A1 = a1 ∧ ... ∧ Ak = ak) is irrelevant, since it is the same for every class.
 Now, let's predict the class for one observation:
   X = (age <= 30, income = "medium", marital status = "married", credit rating = "good")

11 Histograms and Supervised Learning – An Example
 P(X | approval = "Yes") = .4 * .2 * .6 * .2 = .0096
 P(X | approval = "No") = .4 * .2 * .6 * .4 = .0192
 Multiply by the priors, P(X | C = c) * P(C = c):
   "Yes": .0096 * .5 = .0048
   "No": .0192 * .5 = .0096
 X belongs to the "No" class.
 The probabilities are determined by frequency counts, and the frequencies are tabulated in bins.
 Two common types of histograms:
   Equal-width – the range of observed values is divided into k equal intervals
   Equal-frequency – each bin holds (approximately) the same number of observations
 The difficulty is determining the number of bins, k:
   Sturges' rule
   Scott's rule
 Determining k for a data stream is problematic.
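The arithmetic above can be checked with a short script (a sketch only; the probability tables are hard-coded from the training counts, taking P(credit = "Good" | "No") = 2/5 = .4):

```python
# Naive Bayes score for X = (age<=30, income="medium", married, credit="good").
priors = {"Yes": 0.5, "No": 0.5}
likelihoods = {          # P(attribute value | class), from the frequency counts
    "Yes": [0.4, 0.2, 0.6, 0.2],
    "No":  [0.4, 0.2, 0.6, 0.4],
}

def score(cls):
    # Naive independence assumption: multiply the per-attribute likelihoods.
    p = priors[cls]
    for q in likelihoods[cls]:
        p *= q
    return p

scores = {c: score(c) for c in priors}
prediction = max(scores, key=scores.get)
```

The "No" posterior score is twice the "Yes" score, so X is assigned to the "No" class.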

12 Challenges of Data Streaming to Data Mining
 Determining k for a histogram or for machine learning
 Concept drift
   Data from the past is no longer valid for the model today
 Several approaches:
   Incremental learning – CVFDT
   Ensemble classifiers
   Ambiguous decision trees
 What about the "ebb and flow" problem?

13 Incremental Discretization
 A way to create discrete intervals from a data stream
 Partition Incremental Discretization (PID) algorithm (Gama and Pinto)
 Two-level algorithm:
   Level 1 creates intervals, using only one pass over the stream
   Level 2 aggregates level 1 intervals into level 2 intervals

14 Incremental Discretization Example

15  Sensor data reporting on air temperature, soil moisture and the flow of water in a sprinkler.
 The data shown on the previous slide is training data.
 Once trained, the model can predict what the sprinkler should be set to based upon conditions.
 A 4-class problem.

16 Incremental Discretization Example
 We will walk through level 1 for the temperature attribute.
 Decide an estimated range: 30 – 85
 Pick a number of intervals (11), so the step is set to 5
 Maintain two vectors: breaks and counts
 Set a threshold for splitting an interval: 33% of all observed values
 Work through the training set:
   If a value falls below the lower bound of the range, add a new interval before the first interval
   If a value falls above the upper bound of the range, add a new interval after the last interval
   If an interval's count reaches the threshold, split it evenly and divide the count between the old interval and the new one
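The level-1 steps above can be sketched as follows (a simplified reading of Gama and Pinto's PID; the function names and the exact split rule are my own assumptions):

```python
def init_layer1(lo, hi, n_intervals):
    # breaks[i] is the upper bound of interval i; counts[i] its frequency.
    step = (hi - lo) / n_intervals
    breaks = [lo + step * (i + 1) for i in range(n_intervals)]
    counts = [0.0] * n_intervals
    return breaks, counts, step

def update(x, breaks, counts, step, threshold=1/3):
    if x < breaks[0] - step:            # below the range: prepend an interval
        breaks.insert(0, breaks[0] - step)
        counts.insert(0, 1.0)
        return
    if x > breaks[-1]:                  # above the range: append an interval
        breaks.append(breaks[-1] + step)
        counts.append(1.0)
        return
    i = next(j for j, b in enumerate(breaks) if x <= b)
    counts[i] += 1.0
    # Split a crowded interval once it holds more than `threshold` of all
    # observations (only meaningful after many values have been seen).
    if counts[i] / sum(counts) > threshold and counts[i] > 1:
        lower = breaks[i - 1] if i > 0 else breaks[0] - step
        breaks.insert(i, (lower + breaks[i]) / 2)
        counts[i] /= 2
        counts.insert(i, counts[i])

breaks, counts, step = init_layer1(30, 85, 11)
update(25, breaks, counts, step)   # below range: new leading interval
update(90, breaks, counts, step)   # above range: new trailing interval
```

With the slide's parameters (range 30–85, 11 intervals) the step comes out to 5, and out-of-range values grow the histogram at either end, which is how the example's breaks vector ends up spanning 25 to 95.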

17 Incremental Discretization Example
 Breaks vector for our sample after training:
   25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 82.5, 85, 90, 95
 Counts vector for our sample after training:
   1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2.5, 3.5, 2, 1, 0

18 Second Layer
 The second layer is invoked whenever necessary:
   By user intervention
   By changes in the intervals of the first layer
 Input:
   Breaks and counters from layer 1
   The type of histogram to be generated

19 Second Layer
 The objective is to create a smaller number of intervals based upon the layer 1 intervals.
 For equal-width histograms:
   Computes the number of intervals based upon the range observed in layer 1
   Traverses the vector of breaks once, adding the counters of consecutive intervals
 For equal-frequency histograms:
   Computes the exact number of data points wanted in each interval
   Traverses the counters, adding counts for consecutive intervals
   Closes each layer 2 interval when the target frequency is reached
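The equal-frequency pass could be sketched like this (my own simplification; it assumes the `breaks`/`counts` vectors produced by layer 1):

```python
def equal_frequency_layer2(breaks, counts, k):
    # Target number of data points per layer-2 interval.
    target = sum(counts) / k
    out_breaks, acc = [], 0.0
    for b, c in zip(breaks, counts):
        acc += c                   # accumulate consecutive layer-1 counters
        if acc >= target and len(out_breaks) < k - 1:
            out_breaks.append(b)   # close a layer-2 interval at this break
            acc = 0.0
    return out_breaks              # k-1 inner boundaries define k intervals
```

For example, merging four equally filled layer-1 intervals into two layer-2 intervals places the single inner boundary at the middle break.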

20 Application of PID for Data Mining
 Add a data structure to both layer 1 and layer 2.
 A matrix:
   Columns: intervals
   Rows: classes
 Naïve Bayesian classification can then be done easily.
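Such a matrix can be sketched as a nested counter (the names here are my own), from which the conditional probabilities needed for naïve Bayes fall out directly:

```python
from collections import defaultdict

# class -> interval index -> frequency count
matrix = defaultdict(lambda: defaultdict(float))

def observe(cls, interval):
    # Record one training observation in the (class, interval) cell.
    matrix[cls][interval] += 1

def p_interval_given_class(interval, cls):
    # P(value falls in `interval` | class) from the row's frequency counts.
    total = sum(matrix[cls].values())
    return matrix[cls][interval] / total if total else 0.0

observe("Off", 0)
observe("Off", 1)
observe("High", 12)
```

Each incoming training example increments one cell, so the matrix stays incremental in the same single-pass spirit as the histogram itself.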

21 Example Matrix – Temperature Attribute

Class  25 30 35 40 45 50 55 60 65 70 75 80 82.5 85 90 95
High    0  0  0  0  0  0  0  0  0  0  0  0   1   1  1  0
Med     0  0  0  0  1  0  1  0  0  0  1  0   2   1  0  0
Low     0  0  0  0  0  0  0  0  1  0  0  2   1   0  0  0
Off     1  1  0  0  0  0  0  0  0  0  0  0   0   0  0  0

22 Dealing with Concept Drift
 What happens when the training is no longer valid (for example, in winter)?
 Assume the sensors are still on in winter but the sprinklers are not.

23 Dealing with Concept Drift: Fuzzy Histograms
 Fuzzy histograms are used for visual content representation.
 A given attribute value can be a member of more than one interval, with varying degrees of membership.
 The degree of membership is determined by a membership function.
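A common choice of membership function is triangular (the shape and names below are illustrative, not taken from the slides):

```python
def triangular(x, left, center, right):
    # Degree of membership rises linearly from 0 at `left` to 1 at `center`,
    # then falls back to 0 at `right`.
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)
```

With overlapping triangles, a soil-moisture reading near an interval boundary belongs partly to both neighboring intervals rather than being forced into exactly one bin.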

24 Fuzzy Histograms with PID
 Use a membership function to build layer 2 intervals based upon a determinant in layer 1.
 Sprinkler example:
   Soil moisture is potentially a member of more than one interval, one of which is a high value.
   During winter, ensure that all values of moisture fall into the highest end of the range.

25 References
 [1] Hand, D., Mannila, H. and Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press.
 [2] Sturges, H. (1926). The choice of a class-interval. J. Amer. Statist. Assoc., 21, 65–66.
 [3] Scott, D.W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610.
 [4] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453–476.
 [5] Zhang, J., Liu, H. and Wang, P.P. (2006). Some current issues of streaming data mining. Information Sciences, 176(14), 1949–1951.
 [6] Hulten, G., Spencer, L. and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, CA. ACM Press, New York, NY, 97–106.
 [7] Wang, H., Fan, W., Yu, P.S. and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), Washington, D.C. ACM Press, New York, NY, 226–235.
 [8] Natwichai, J. and Li, X. (2004). Knowledge maintenance on data streams with concept drifting. In: Zhang, J., He, J. and Fu, Y. (eds.), 705–710, Shanghai, China.
 [9] Gama, J. and Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06), Dijon, France. ACM Press, New York, NY, 662–667.
 [10] Doulamis, A. and Doulamis, N. (2001). Fuzzy histograms for efficient visual content representation: application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME '01), p. 227. IEEE Press.
 [11] Gaber, M.M., Zaslavsky, A. and Krishnaswamy, S. (2005). Mining data streams: a review. SIGMOD Record, 34(2), 18–26.

26 Questions?

