Data Discretization Unification
Ruoming Jin, Yuri Breitbart, Chibuike Muoh
Kent State University, Kent, USA


1 Data Discretization Unification. Ruoming Jin, Yuri Breitbart, Chibuike Muoh. Kent State University, Kent, USA

2 Outline
- Motivation
- Problem Statement
- Prior Art & Our Contributions
- Goodness Function Definition
- Unification
- Parameterized Discretization
- Discretization Algorithm & Its Performance

3 Motivation
Patients Table
Age | Success | Failure | Total
18  | 10      | 1       | 19
25  | 5       | 2       | 7
41  | 100     | 5       | 105
... | ...     | ...     | ...
51  | 250     | 10      | 260
52  | 360     | 5       | 365
53  | 249     | 10      | 259
... | ...     | ...     | ...

4 Motivation Possible implications of the table:
- If a person is between 18 and 25, the probability of procedure success is 'much higher' than if the person is between 45 and 55.
- Is that a good rule, or is this one better: if a person is between 18 and 30, the probability of procedure success is 'much higher' than if the person is between 46 and 61?
What is the best interval?

5 Motivation
- Without data discretization, some rules would be difficult to establish.
- Several existing data mining systems cannot handle continuous variables without discretization.
- Data discretization significantly improves the quality of the discovered knowledge.
- New methods of discretization are needed for tables with rare events.
- Data discretization significantly improves the performance of data mining algorithms; some studies report a ten-fold increase in performance.
However: any discretization process generally leads to a loss of information. Minimizing this loss is the mark of a good discretization method.

6 Problem Statement Given an input table:
Intervals  | Class 1 | Class 2 | ... | Class J | Row Sum
S_1        | r_11    | r_12    | ... | r_1J    | N_1
S_2        | r_21    | r_22    | ... | r_2J    | N_2
...        | ...     | ...     | ... | ...     | ...
S_I        | r_I1    | r_I2    | ... | r_IJ    | N_I
Column Sum | M_1     | M_2     | ... | M_J     | N (Total)

7 Problem Statement Obtain an output table:
Intervals  | Class 1 | Class 2 | ... | Class J | Row Sum
S_1        | C_11    | C_12    | ... | C_1J    | N_1
S_2        | C_21    | C_22    | ... | C_2J    | N_2
...        | ...     | ...     | ... | ...     | ...
S_I'       | C_I'1   | C_I'2   | ... | C_I'J   | N_I'
Column Sum | M_1     | M_2     | ... | M_J     | N (Total)
where each S_i is the union of consecutive intervals of the input table. The quality of a discretization is measured by cost(model) = cost(data/model) + penalty(model).
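As an illustration of this setup (not taken from the slides), the sketch below stores each base interval as a vector of class counts and merges consecutive intervals by summing those vectors; all names and the example counts are hypothetical.

```python
# Minimal sketch: contingency-table rows as lists of class counts.
# Names (merge_intervals, rows, cuts) are illustrative, not from the slides.

def merge_intervals(rows, cuts):
    """Merge consecutive base intervals into the output table.

    rows : list of [r_i1, ..., r_iJ] count vectors, one per base interval
    cuts : sorted boundary indices; block k covers rows[cuts[k]:cuts[k+1]]
    Returns the merged rows [C_i1, ..., C_iJ] of the output table.
    """
    merged = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        block = [sum(col) for col in zip(*rows[lo:hi])]  # column-wise sum
        merged.append(block)
    return merged

# Example: 6 base intervals, 2 classes (success, failure), merged into 3 blocks.
rows = [[10, 1], [5, 2], [100, 5], [250, 10], [360, 5], [249, 10]]
print(merge_intervals(rows, [0, 2, 4, 6]))
# -> [[15, 3], [350, 15], [609, 15]]
```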

8 Prior Art
Unsupervised discretization (no class information is provided):
- Equal-width
- Equal-frequency
Supervised discretization (class information is provided with each attribute value):
- MDLP
- Methods based on Pearson's X² or Wilks' G² statistic
Dougherty and Kohavi (1995) compare unsupervised methods, the supervised method of Holte (1993), and the entropy-based methods of Fayyad and Irani (1993), and conclude that supervised methods give fewer classification errors than unsupervised ones, and that entropy-based supervised methods are better than other supervised methods.

9 Prior Art
Several recent (2003-2006) papers introduced new discretization algorithms: Yang and Webb; Kurgan and Cios (CAIM); Boulle (Khiops).
- CAIM attempts to minimize the number of discretization intervals and, at the same time, to minimize the information loss.
- Khiops uses Pearson's X² statistic to select which consecutive intervals to merge so as to minimize the value of X².
- Yang and Webb studied discretization for naïve Bayesian classifiers. They report that their method generates fewer classification errors than the alternative discretization methods that have appeared in the literature.

10 Our Results
- There is a strong connection between discretization methods based on statistics and those based on entropy.
- There is a parametric function such that any prior discretization method is derivable from it by choosing at most two parameters.
- There is an optimal dynamic programming method, derived from our discretization approach, that outperforms prior discretization methods in most of the experiments we conducted.

11 Goodness Function Definition (Preliminaries)
Intervals  | Class 1 | Class 2 | ... | Class J | Row Sum
S_1        | C_11    | C_12    | ... | C_1J    | N_1
S_2        | C_21    | C_22    | ... | C_2J    | N_2
...        | ...     | ...     | ... | ...     | ...
S_I'       | C_I'1   | C_I'2   | ... | C_I'J   | N_I'
Column Sum | M_1     | M_2     | ... | M_J     | N (Total)

12 Goodness Function Definition (Preliminaries)
- Entropy of the i-th row of the contingency table: H(S_i) = -sum_{j=1..J} (C_ij/N_i) log(C_ij/N_i); binary encoding of a row requires H(S_i) binary characters.
- Total entropy of all intervals: H(S_1, S_2, ..., S_I') = (1/N) sum_i N_i H(S_i); binary encoding of the set of rows requires H(S_1, S_2, ..., S_I') binary characters.
- Entropy of the contingency table: binary encoding of the table requires S_L binary characters, with N * H(S_1, S_2, ..., S_I') = S_L.
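A small sketch of these quantities (natural logarithm assumed, since the slides do not state the base; function names and the toy counts are illustrative):

```python
# Sketch of the entropy quantities above (natural log assumed).
import math

def row_entropy(counts):
    """H(S_i) = -sum_j (C_ij/N_i) * log(C_ij/N_i), skipping zero counts."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def table_encoding_cost(rows):
    """S_L = sum_i N_i * H(S_i); equivalently N * H(S_1, ..., S_I')."""
    return sum(sum(r) * row_entropy(r) for r in rows)

rows = [[15, 3], [350, 15], [609, 15]]
N = sum(sum(r) for r in rows)
S_L = table_encoding_cost(rows)
print(S_L, S_L / N)  # total encoding cost and the average entropy H(S_1,...,S_I')
```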

13 Goodness Function Definition
Cost(Model, T) = Cost(T/Model) + Penalty(Model) (Mannila et al.)
- Cost(T/Model) is the complexity of encoding the table in the given model.
- Penalty(Model) reflects the complexity of the resulting table.
GF_Model(T) = Cost(Model, T_0) - Cost(Model, T)

14 Goodness Function Definition
Models to be considered:
- MDLP (information theory)
- Statistical model selection
- Confidence level of rows independence
- Gini index

15 Goodness Function Examples
- Entropy (MDLP):
- Statistical Akaike (AIC):
- Statistical Bayesian (BIC):

16 Goodness Function Examples (Con't)
Confidence level based goodness functions:
- Pearson X² statistic: X² = sum_i sum_j (C_ij - E_ij)² / E_ij, where E_ij = N_i * M_j / N.
- Wilks' G² statistic: G² = 2 * sum_i sum_j C_ij log(C_ij / E_ij).
- The table's degrees of freedom is df = (I'-1)(J-1).
It is known in statistics that asymptotically both Pearson's X² and Wilks' G² statistics have a chi-square distribution with df degrees of freedom.
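A minimal sketch of both statistics for an output contingency table, under the usual independence expectations E_ij = N_i * M_j / N (names and the example counts are illustrative):

```python
# Sketch: Pearson's X^2 and Wilks' G^2 for a contingency table given as
# a list of rows of class counts.
import math

def chi_square_stats(rows):
    I, J = len(rows), len(rows[0])
    N_i = [sum(r) for r in rows]                       # row sums
    M_j = [sum(r[j] for r in rows) for j in range(J)]  # column sums
    N = sum(N_i)
    X2 = G2 = 0.0
    for i in range(I):
        for j in range(J):
            E = N_i[i] * M_j[j] / N
            X2 += (rows[i][j] - E) ** 2 / E
            if rows[i][j] > 0:
                G2 += 2.0 * rows[i][j] * math.log(rows[i][j] / E)
    df = (I - 1) * (J - 1)
    return X2, G2, df

print(chi_square_stats([[15, 3], [350, 15], [609, 15]]))
```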

17 Unification
The following relationship between G² and the goodness functions for MDLP, AIC, and BIC holds:
G²/2 = N * H(S_1 ∪ ... ∪ S_I') - S_L
Thus, the goodness functions for MDLP, AIC, and BIC can be rewritten in terms of G² (see slide 21).
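This identity is easy to check numerically (natural logarithms assumed; the counts are the same toy example used above):

```python
# Quick numeric check of G^2/2 = N*H(S_1 union ... union S_I') - S_L.
import math

rows = [[15, 3], [350, 15], [609, 15]]
J = len(rows[0])
N_i = [sum(r) for r in rows]
M_j = [sum(r[j] for r in rows) for j in range(J)]
N = sum(N_i)

H = lambda counts, n: -sum(c / n * math.log(c / n) for c in counts if c > 0)
S_L = sum(N_i[i] * H(rows[i], N_i[i]) for i in range(len(rows)))
G2_half = sum(rows[i][j] * math.log(rows[i][j] * N / (N_i[i] * M_j[j]))
              for i in range(len(rows)) for j in range(J) if rows[i][j] > 0)
print(abs(G2_half - (N * H(M_j, N) - S_L)) < 1e-9)  # True
```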

18 Unification
A normal deviate is the difference between the variable and its mean, divided by the standard deviation.
Consider t, a chi-square distributed random variable with df degrees of freedom, and let u(t) be the normal deviate such that Pr[χ²_df ≤ t] = Φ(u(t)), where Φ is the standard normal distribution function.
The following theorem holds (see Wallace 1959, 1960): for all t > df and all df > 0.37, with w(t) = [t - df - df*log(t/df)]^(1/2),
0 < w(t) ≤ u(t) ≤ w(t) + 0.6 * df^(-1/2).
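A quick numeric illustration of this bound (scipy is assumed available; the choice of df and the t values is arbitrary):

```python
# Numeric illustration of Wallace's bound as stated above; requires scipy.
import math
from scipy.stats import chi2, norm

def u(t, df):
    """Normal deviate of a chi-square variable: Phi(u(t)) = P(chi2_df <= t)."""
    return norm.ppf(chi2.cdf(t, df))

def w(t, df):
    return math.sqrt(t - df - df * math.log(t / df))

df = 10
for t in [12.0, 20.0, 50.0]:            # all t > df
    lo, mid, hi = w(t, df), u(t, df), w(t, df) + 0.6 / math.sqrt(df)
    print(lo <= mid <= hi)              # expected True for each t
```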

19 Unification
From this theorem it follows that if df goes to infinity and w(t) >> 0, then u(t)/w(t) → 1. Hence
w²(t) ≈ u²(t) ≈ t - df - df*log(t/df),
so GF_G²(T) = u²(G²) = G² - df*(1 + log(G²/df)), and the goodness function GF_X²(T) is asymptotically the same.

20 Unification
G² estimate:
- If G² > df, then G² < 2N*log J. This follows from the upper bound log J on H(S_1 ∪ ... ∪ S_I') and the lower bound 0 on the entropy of any single row of the contingency table.
- Recall that u²(G²) = G² - df*(1 + log(G²/df)).
- Thus, the penalty of u²(G²) is between O(df) and O(df*log N):
  - If G² ~ c*df and c > 1, then the penalty is O(df).
  - If G² ~ c*df and N/df ~ N/(I'J) = c, then the penalty is also O(df).
  - If G² ~ c*df, N → ∞ and N/(I'J) → ∞, then the penalty is O(df*log N).

21 Unification
- GF_MDLP = G² - (O(df) or O(df*log N), depending on the ratio N/I')
- GF_AIC = G² - df
- GF_BIC = G² - df*(log N)/2
- GF_G² = G² - df*(either a constant or log N, depending on the ratio between N and I', J)
In general, GF = G² - df*f(N, I', J).
To unify the Gini function as one of the cost functions, we resort to a parametric approach to the goodness of discretization.

22 Gini Based Goodness Function
Let S_i be a row of a contingency table. The Gini index of S_i is defined as gini(S_i) = 1 - sum_{j=1..J} (C_ij/N_i)².
Cost(Gini, T) =
Gini Goodness Function:
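The cost and goodness formulas on this slide did not survive the transcript; the sketch below computes the standard per-row Gini index and, purely as an assumption, aggregates it like S_L above (sum_i N_i * gini(S_i)):

```python
# Sketch of a Gini-based cost. The per-row index is the standard definition;
# the aggregation mirrors S_L and is an assumption, not taken from the slides.

def gini(counts):
    """gini(S_i) = 1 - sum_j (C_ij / N_i)^2"""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_cost(rows):
    return sum(sum(r) * gini(r) for r in rows)

rows = [[15, 3], [350, 15], [609, 15]]
print([round(gini(r), 4) for r in rows], round(gini_cost(rows), 2))
```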

23 Parametrized Discretization
- Parametrized entropy
- Entropy of the row
- Gini of the row
- Parametrized data cost

24 Parametrized Discretization
- Parametrized cost of T_0
- Parametrized goodness function

25 Parameters for Known Goodness Functions

26 Parametrized Dynamic Programming Algorithm

27 Dynamic Programming Algorithm
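The algorithm slides themselves did not survive the transcript, so the following is only a rough sketch of how a goodness function of the form GF = G² - df*f(N, I', J) could be optimized by dynamic programming: the data term sum_i N_i*H(S_i) is additive over merged intervals, so it can be minimized exactly for each candidate number of intervals k, after which the best k is chosen. All names are illustrative, and the BIC-like penalty df*(log N)/2 from slide 21 is just one possible choice of f.

```python
# Hedged sketch: choose an optimal merging of consecutive base intervals by
# dynamic programming over (number of blocks k, prefix of rows), then pick
# the k that maximizes a goodness function of the form G^2 - df*f(N, I', J).
import math

def row_entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def block_cost(rows, lo, hi):
    merged = [sum(r[j] for r in rows[lo:hi]) for j in range(len(rows[0]))]
    return sum(merged) * row_entropy(merged)   # N_block * H(block)

def discretize(rows):
    n, J = len(rows), len(rows[0])
    N = sum(sum(r) for r in rows)
    cols = [sum(r[j] for r in rows) for j in range(J)]
    NH_union = N * row_entropy(cols)           # N * H(S_1 union ... union S_I')
    cost = [[0.0] * (n + 1) for _ in range(n)]
    for lo in range(n):
        for hi in range(lo + 1, n + 1):
            cost[lo][hi] = block_cost(rows, lo, hi)
    INF = float("inf")
    # dp[k][i] = min sum of block costs when the first i rows form k blocks
    dp = [[INF] * (n + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for k in range(1, n + 1):
        for i in range(k, n + 1):
            dp[k][i] = min(dp[k - 1][j] + cost[j][i] for j in range(k - 1, i))
    best_k, best_gf = 1, -INF
    for k in range(1, n + 1):
        G2 = 2.0 * (NH_union - dp[k][n])       # via G^2/2 = N*H(union) - S_L
        df = (k - 1) * (J - 1)
        gf = G2 - df * math.log(N) / 2.0       # BIC-like penalty (one choice)
        if gf > best_gf:
            best_k, best_gf = k, gf
    return best_k, best_gf

rows = [[10, 1], [5, 2], [100, 5], [250, 10], [360, 5], [249, 10]]
print(discretize(rows))
```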

28 Experiments

29

30

31 Classification errors for Glass dataset (Naïve Bayesian)

32 Iris+C4.5

33 Experiments: C4.5 Validation

34 Experiments: Naïve Bayesian Validation

35 Conclusions
- We considered several seemingly different approaches to discretization and demonstrated that they can be unified through a notion of generalized entropy.
- Each of the methods discussed in the literature can be derived from the generalized entropy by selecting at most two parameters.
- For a given pair of parameters, the dynamic programming algorithm selects an optimal discretization (in terms of the discretization goodness function).

36 What Remains To Be Done
- How can we find, analytically, a relationship between the goodness function of a model and the number of classification errors?
- What is the algorithm for selecting the best parameters for a given dataset?

