Managing uncertainty and quality in the classification process. SETN 2002. Maria Halkidi, Michalis Vazirgiannis. Email: {mhalk, mvazirg}@aueb.gr. Dept of Informatics, Athens University of Economics & Business. WWW: http://www.db-net.aueb.gr
Outline
Introduction & Motivation
Framework that manages uncertainty in classification
Exploiting knowledge based on Information Measures
Experimental Study
Summarization & Further Work
Maria Halkidi, Michalis Vazirgiannis, AUEB 02/01/2019
Introduction
Classification is one of the main tasks in data mining: it assigns a data item to one of a predefined set of classes. The goal of the classification process is to induce a model that can be used to classify future data items whose classification is unknown. Classification is based on a well-defined set of classes and a training set of pre-classified examples.
Motivation
Classification results may hide "useful" knowledge about the data set that the majority of data mining methods ignore. These methods assume that the initial classes do not overlap and that all data values are treated equally in the classification process.
Our Contribution
The contributions of the proposed framework are summarized as follows:
Maintenance of classification belief all the way through the classification process. A value set can be assigned to more than one class, each with a different belief.
Decision support tools for decisions related to: the relative importance of classes in a data set (e.g., “young vs. old customers”), the relative importance of classes across data sets, and the information content of different data sets.
Quality assessment of a classification model. This procedure is useful for evaluating models and selecting the one that best fits the data under consideration.
Steps of the classification approach
[Figure: flow diagram with the boxes Data Set; Definition of initial classes (classes definition, membership functions: CS); Mapping to the fuzzy domain, classification (CS, d.o.b.s); CVS (axes Ai, lj, tk); Queries & decision support; Quality assessment.]
Classification Space
The term Classification Space (CS) denotes the specifications for mapping database values to the fuzzy domain. For each attribute Ai we define the corresponding classification set LAi = {ct | ct is a classification tuple}, where ct = (li, [v1, v2], fi): li is a lexical category, [v1, v2] is the corresponding value interval, and fi is the assigned transformation function. The value domains may overlap.
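A classification tuple can be pictured as a small record. The sketch below is only an illustration under stated assumptions: the class name `ClassificationTuple`, the helper `classify`, and the three linear shapes `decreasing`, `triangle` and `increasing` are hypothetical stand-ins, since the slides name the functions but do not define them. The salary intervals are the ones listed for client_salary in the sales schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ClassificationTuple:
    """One entry ct = (l, [v1, v2], f) of a classification set."""
    category: str                                      # lexical category l
    interval: Tuple[float, float]                      # value interval [v1, v2]
    transform: Callable[[float, float, float], float]  # f(x, v1, v2) -> belief

    def belief(self, x: float) -> float:
        v1, v2 = self.interval
        if not v1 <= x <= v2:
            return 0.0  # value lies outside this category's domain
        return self.transform(x, v1, v2)


# Assumed linear shapes; the original functions are not given in the slides.
def decreasing(x: float, v1: float, v2: float) -> float:
    return (v2 - x) / (v2 - v1)


def increasing(x: float, v1: float, v2: float) -> float:
    return (x - v1) / (v2 - v1)


def triangle(x: float, v1: float, v2: float) -> float:
    mid = (v1 + v2) / 2
    return 1 - abs(x - mid) / (mid - v1)


# Classification set for the salary attribute; the intervals overlap.
L_salary: List[ClassificationTuple] = [
    ClassificationTuple("low", (1500, 3000), decreasing),
    ClassificationTuple("medium", (2500, 5500), triangle),
    ClassificationTuple("high", (4000, 10000), increasing),
]


def classify(x: float, classification_set: List[ClassificationTuple]) -> Dict[str, float]:
    """Return the degree of belief of x for every lexical category."""
    return {ct.category: ct.belief(x) for ct in classification_set}
```

With these assumed shapes, a salary of 2750 falls in the overlap of "low" and "medium" and receives a non-zero belief for both, which is the behaviour the framework relies on.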
Hypertrapezoidal Membership Functions
The effect of the crispness factor σ on one-dimensional data sets: σ = 1, non-fuzzy; σ = 0.5, trapezoidal; σ = 0, triangular.
[Figure: three panels, one per value of σ, each drawn over the points λ1, λ2, λ3.]
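The three cases above can be reproduced with a simplified one-dimensional sketch. It is not the full hypertrapezoidal definition, only an illustration between two neighbouring points: the function name `left_class_belief` and the choice of a linear ramp occupying the central fraction (1 − σ) of the interval are assumptions made here.

```python
def left_class_belief(x: float, lam1: float, lam2: float, sigma: float) -> float:
    """Belief that x belongs to the class at lam1 rather than the class at lam2.

    sigma is the crispness factor in [0, 1]; lam1 < lam2 is assumed.
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    # Relative position of x between the two points, clipped to [0, 1].
    t = min(max((x - lam1) / (lam2 - lam1), 0.0), 1.0)
    lo, hi = sigma / 2, 1 - sigma / 2  # the ramp runs from lo to hi
    if t <= lo:
        return 1.0
    if t >= hi:
        return 0.0
    return (hi - t) / (hi - lo)


# sigma = 1: a crisp step at the midpoint (non-fuzzy)
# sigma = 0.5: flat top, then a ramp (trapezoidal)
# sigma = 0: a ramp over the whole interval (triangular)
for sigma in (1.0, 0.5, 0.0):
    print(sigma, [round(left_class_belief(x, 0, 10, sigma), 2) for x in (0, 2, 4, 6, 8, 10)])
```

In this sketch the belief for the right-hand class is simply one minus the returned value, so the two beliefs always sum to 1 between the two points.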
The Classification Space (attributes, lexical categories, value domains and mapping functions) for the sales schema.

client_salary   low          medium     high
Min             1500         2500       4000
Max             3000         5500       10000
Function        decreasing   triangle   increasing

client_age: young, old. Values: 18 30 50 40 60 80. Function: increasing.
price: very cheap, cheap, moderate, expensive. Values: 1 10 35 70 15 150.
date_of_p: beg. of month, mid. of month, end. of month. Values: 8 20.
time_of_p: morning, noon, afternoon, evening. Values: 9 11 5 12 2 6.
Classification Value Space
The result of transforming the data set values to the fuzzy domains using the CS is a 3D structure, the Classification Value Space (CVS).
[Figure: the cube CVS(S) built from data set S, with axes Attributes (Ai), Classification Categories (lj) and Tuples (tk).]
The front face of this structure stores the original data set, while each of the other cells C[Ai, lj, tk], where j, k > 1, stores the d.o.b. μlj(S.tk.Ai).
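One way to picture this structure in code is a mapping from (attribute, category) to a list of degrees of belief, one per tuple, kept next to the original values. This is a minimal sketch: the function `build_cvs`, the helper `clamp01` and the two toy membership functions for "age" are hypothetical and not taken from the slides.

```python
from typing import Callable, Dict, List, Tuple


def clamp01(v: float) -> float:
    """Clip a value to the interval [0, 1]."""
    return max(0.0, min(1.0, v))


def build_cvs(
    rows: List[Dict[str, float]],
    space: Dict[str, Dict[str, Callable[[float], float]]],
) -> Tuple[Dict[str, List[float]], Dict[Tuple[str, str], List[float]]]:
    """Return (front face, beliefs) for a data set and a classification space.

    front[attr][k] is the original value of attribute attr in tuple k.
    beliefs[(attr, cat)][k] is the d.o.b. of that value for category cat.
    """
    front = {attr: [row[attr] for row in rows] for attr in space}
    beliefs = {
        (attr, cat): [mu(row[attr]) for row in rows]
        for attr, categories in space.items()
        for cat, mu in categories.items()
    }
    return front, beliefs


# Toy classification space with overlapping categories (assumed shapes).
space = {
    "age": {
        "young": lambda a: clamp01((40 - a) / 10),
        "old": lambda a: clamp01((a - 30) / 10),
    }
}
rows = [{"age": 25}, {"age": 35}, {"age": 60}]
front, beliefs = build_cvs(rows, space)
print(front["age"])               # original values, the front face
print(beliefs[("age", "young")])  # d.o.b. per tuple for "young"
print(beliefs[("age", "old")])    # d.o.b. per tuple for "old"
```

Keeping the beliefs for every category, rather than only the winning one, is what lets later steps query the data set without losing classification uncertainty.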
Information Measures for Decision Support
Information Measures in CVS: Category Energy metric
The category energy expresses the importance of an attribute category: the overall belief that the data set includes objects of the category li.
[Figure: the CVS slice for the attribute salary, with categories low, medium and high over the tuples tk.]
It answers questions such as: Which category is better supported by the data set? What is the degree of belief that our data set contains objects of category li?
Information Measures in CVS: Attribute Energy metric
The attribute energy expresses the information content of the data set regarding attribute Ai.
[Figure: the CVS slice for the attribute Salary, with categories low, medium and high.]
It answers the question: What is the amount of information regarding the considered categories for the attribute Ai?
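The slides do not reproduce the formulas for the two energy metrics, so the following is only a sketch under loud assumptions: the category energy is taken as the plain sum of the degrees of belief over all tuples, and the attribute energy as the sum of its category energies. The function names and the toy beliefs are hypothetical.

```python
from typing import Dict, List


def category_energy(beliefs: List[float]) -> float:
    """Assumed form: total degree of belief for one category over all tuples."""
    return sum(beliefs)


def attribute_energy(per_category: Dict[str, List[float]]) -> float:
    """Assumed form: sum of the category energies of one attribute."""
    return sum(category_energy(b) for b in per_category.values())


# Toy degrees of belief for four tuples (illustrative values only).
salary_beliefs = {
    "low": [1.0, 0.25, 0.0, 0.0],
    "medium": [0.0, 0.75, 0.5, 0.0],
    "high": [0.0, 0.0, 0.5, 1.0],
}

energies = {cat: category_energy(b) for cat, b in salary_beliefs.items()}
best_supported = max(energies, key=energies.get)
print(energies)                          # per-category energies
print(best_supported)                    # category best supported by the data
print(attribute_energy(salary_beliefs))  # energy of the whole attribute
```

Under these assumptions, comparing category energies answers "which category is better supported", and comparing attribute energies compares the information content of attributes or data sets.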
Classification Scheme Quality Assessment
Quality Criteria
Main question: How successful is a classification model? How well do the defined classes fit the data?
Criteria for a successful classification model: high values of class/attribute energy; minimum entropy in the defined classes.
Quality Assessment: Uncertainty of a class
Unc_Clcj evaluates the uncertainty within a class cj. It is defined by an equation in which N is the number of tuples in the data set. When the membership values of the data to the classes are all equal, i.e., μij = 1/nc, Unc_Clcj takes its highest value, log2(nc).
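As a point of reference for the limiting case above, the standard Shannon entropy of a membership vector over nc classes also peaks at log2(nc) when every membership equals 1/nc. The sketch below uses that standard entropy, which is an assumption made here and not necessarily the slide's own definition of Unc_Cl; the function name `membership_entropy` is hypothetical.

```python
import math
from typing import List


def membership_entropy(beliefs: List[float]) -> float:
    """Shannon entropy (base 2) of one tuple's memberships across the classes.

    Zero memberships contribute nothing to the sum.
    """
    return -sum(m * math.log2(m) for m in beliefs if m > 0)


nc = 4
uniform = [1 / nc] * nc        # equal membership in every class
skewed = [0.7, 0.1, 0.1, 0.1]  # mostly one class
crisp = [1.0, 0.0, 0.0, 0.0]   # exactly one class

print(membership_entropy(uniform), math.log2(nc))  # the two values coincide
print(membership_entropy(skewed))                  # lower than log2(nc)
print(membership_entropy(crisp))                   # no uncertainty
```

The more a tuple's belief concentrates on one class, the lower the entropy, which matches the criterion of minimum entropy in the defined classes.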
Quality Assessment: Overall belief of a class
The overall belief that a data set supports a class is given by an equation in which N is the number of tuples in the data set.
Quality Assessment: Information coefficient of a class
The information coefficient is an index of the quality of the class under consideration. One of its components reflects the significance of a class in the data set, i.e., the amount of information included in the specific class. The other is an indication of the class uncertainty: it evaluates the deviation of the class uncertainty from the case in which all membership values to a class are equal (i.e., the case of no clustering tendency or improper definition of classes).
Quality Assessment: Information coefficient of a classification scheme
Info_Coef can be used as a measure for finding the model that best fits a data set, taking into account the uncertainty included in its values.
Experimental Study
Relative importance of classes in a data set
R = {Salary, Age}; HMF: σ = 0.5

Salary   Low 170       Medium 278   High 419.7
Age      Young 25.66   Old 51.6
Relative importance of classes in a data set

(a) Salary            (b) Age
Elow     304.2736     Eyoung  398.75
Emedium  343.2594     Eold    540.96
Ehigh    348.0000
Esalary  995.5320     Eage    939.71

Ehigh > Elow: the data set supports high salaries with more confidence than low salaries.
Eold > Eyoung: we are more confident that the data set contains old employees than young ones.
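The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the table; the variable names are hypothetical. It shows that each attribute energy is close to the sum of its category energies (exact for age, within about 0.001 for salary), which suggests, but does not confirm, that the attribute energy is additive over categories.

```python
# Energies as reported in the table above.
salary = {"low": 304.2736, "medium": 343.2594, "high": 348.0000}
age = {"young": 398.75, "old": 540.96}
reported = {"salary": 995.5320, "age": 939.71}

salary_total = sum(salary.values())
age_total = sum(age.values())

# Difference between the reported attribute energy and the category sum.
print(abs(salary_total - reported["salary"]))
print(abs(age_total - reported["age"]))

# Category with the highest energy for each attribute.
best_salary = max(salary, key=salary.get)
best_age = max(age, key=age.get)
print(best_salary, best_age)
```

The ranking step mirrors the reading given on the slide: the category with the largest energy is the one the data set supports with the most confidence.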
Selecting the optimal classification scheme
[Figure: (a) a data set classified in four clusters; (b) the graph of Info_Coef versus the number of clusters for a synthetic two-dimensional data set.]
Selecting the optimal classification scheme
[Figure: (a) a data set classified in three classes; (b) the graph of Info_Coef versus the number of clusters for a two-dimensional data set, “salary and age”.]
Conclusion
Maintenance of classification belief all the way through the classification process.
Information measures enabling decisions related to: i. the relative importance of classes in a data set (e.g., “young vs. old customers”), ii. the information content of data sets.
Quality assessment of classification models, so as to find how well a model fits the underlying data set.
Further Work
Evaluation of the classification models throughout the life cycle of a data set, as insertions, updates and deletions occur.
Application of the proposed framework to study whether a model based on a specific data set A fits a different data set with a schema similar to that of A.
Study of different mapping functions and their effect on the proposed classification scheme as regards uncertainty representation.
THANK YOU FOR YOUR ATTENTION!