Managing uncertainty and quality in the classification process

1 Managing uncertainty and quality in the classification process
SETN 2002
Maria Halkidi, Michalis Vazirgiannis
{mhalk,
Dept of Informatics, Athens University of Economics & Business
WWW:

2 Outline
- Introduction & Motivation
- Framework for managing uncertainty in classification
- Exploiting knowledge based on Information Measures
- Experimental Study
- Summary & Further Work
Maria Halkidi, Michalis Vazirgiannis, AUEB 02/01/2019

3 Introduction
Classification is one of the main tasks in data mining: it assigns a data item to one of a predefined set of classes.
The goal of the classification process is to induce a model that can classify future data items whose class is unknown.
Classification is based on:
- A well-defined set of classes
- A training set of pre-classified examples

4 Motivation
Classification results may hide "useful" knowledge about the data set that the majority of data mining methods ignore. These methods assume that:
- The initial classes do not overlap.
- All data values are treated equally in the classification process.

5 Our Contribution
The contributions of the proposed framework are summarized as follows:
- Maintenance of classification belief all the way through the classification process. A value set can be assigned to more than one class, each with a different belief.
- Decision support tools for decisions related to:
  - the relative importance of classes in a data set (e.g., "young vs. old customers"),
  - the relative importance of classes across data sets, and
  - the information content of different data sets.
- Quality assessment of a classification model. This procedure is useful for evaluating models and selecting the one that best fits the data under consideration.

6 Steps of the classification approach.
[Figure: flow diagram with the stages Data Set; Definition of initial classes (class definitions, membership functions: CS); Mapping to the fuzzy domain / Classification (CS, d.o.b.s), producing the CVS with axes Ai, lj, tk; Queries & Decision support; Quality assessment.]

7 Classification Space
The term Classification Space (CS) denotes the specifications for mapping database values to the fuzzy domain.
For each attribute Ai we define the corresponding classification set LAi = {ct | ct is a classification tuple}, where ct = (li, [v1, v2], fi): li is a lexical category, [v1, v2] is the corresponding value interval, and fi is the assigned transformation function.
The value domains may overlap.
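A classification set can be sketched in Python as a list of (category, interval, function) tuples. This is only an illustration: the names `ClassificationTuple`, `decreasing`, `triangle` and `increasing` are hypothetical, the linear shapes are assumptions (the slides name the functions but do not define them), and the intervals are the client_salary ones listed in the sales schema slide.

```python
from dataclasses import dataclass
from typing import Callable


def decreasing(x, v1, v2):
    # Assumed shape: belief 1 at v1, falling linearly to 0 at v2.
    return min(1.0, max(0.0, (v2 - x) / (v2 - v1)))


def increasing(x, v1, v2):
    # Assumed shape: belief 0 at v1, rising linearly to 1 at v2.
    return min(1.0, max(0.0, (x - v1) / (v2 - v1)))


def triangle(x, v1, v2):
    # Assumed shape: belief 0 at both ends, peaking at 1 in the middle.
    mid = (v1 + v2) / 2
    if x <= v1 or x >= v2:
        return 0.0
    return (x - v1) / (mid - v1) if x <= mid else (v2 - x) / (v2 - mid)


@dataclass(frozen=True)
class ClassificationTuple:
    category: str                                 # lexical category li
    v1: float                                     # lower end of the value interval
    v2: float                                     # upper end of the value interval
    f: Callable[[float, float, float], float]     # transformation function fi

    def belief(self, x):
        return self.f(x, self.v1, self.v2)


# Classification set for client_salary; note the overlapping intervals.
salary_set = [
    ClassificationTuple("low", 1500, 3000, decreasing),
    ClassificationTuple("medium", 2500, 5500, triangle),
    ClassificationTuple("high", 4000, 10000, increasing),
]


def classify(x, classification_set):
    # Map one database value to a belief per lexical category.
    return {ct.category: ct.belief(x) for ct in classification_set}


beliefs = classify(2750, salary_set)
```

Because the low and medium intervals overlap, a salary of 2750 receives a non-zero belief in both categories, which is the situation the framework is designed to keep track of.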

8 Hypertrapezoidal Membership Functions
The effect of the crispness factor σ on one-dimensional data sets:
- σ = 1: non-fuzzy
- σ = 0.5: trapezoidal
- σ = 0: triangular
[Figure: one membership-function plot per σ value, with axis marks at the λ points.]
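The three cases can be reproduced with a small one-dimensional sketch. It assumes that the λ marks are class prototype points and that σ shrinks the fuzzy transition between two neighbouring prototypes linearly, from the full gap (σ = 0) down to nothing (σ = 1). The function name `hmf_1d` and this parameterization are illustrative assumptions, not the definition used in the talk.

```python
from bisect import bisect_right


def hmf_1d(x, prototypes, sigma):
    """Membership of x in each class, given sorted 1-D prototype points."""
    mu = [0.0] * len(prototypes)
    if x <= prototypes[0]:
        mu[0] = 1.0
        return mu
    if x >= prototypes[-1]:
        mu[-1] = 1.0
        return mu
    i = bisect_right(prototypes, x) - 1      # prototypes[i] <= x < prototypes[i + 1]
    left, right = prototypes[i], prototypes[i + 1]
    mid = (left + right) / 2
    half = (1.0 - sigma) * (right - left) / 2  # half-width of the fuzzy transition
    if half == 0:
        # sigma = 1: crisp boundary at the midpoint between prototypes.
        t = 0.0 if x < mid else 1.0
    else:
        # Linear ramp across the transition zone, flat (0 or 1) outside it.
        t = min(1.0, max(0.0, (x - (mid - half)) / (2 * half)))
    mu[i], mu[i + 1] = 1.0 - t, t
    return mu


prototypes = [0.0, 10.0, 20.0]
triangular = hmf_1d(4.0, prototypes, 0.0)    # ramp spans the whole gap
trapezoidal = hmf_1d(4.0, prototypes, 0.5)   # ramp spans half the gap
crisp = hmf_1d(4.0, prototypes, 1.0)         # no ramp at all
```

In this sketch the memberships of a point always sum to 1, and only the two classes whose prototypes enclose the point can be non-zero.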

9 The Classification Space (attributes, lexical categories, value domains and mapping functions) for the sales schema
client_salary   low    medium    high
Min             1500   2500      4000
Max             3000   5500      10000
Function        decr   triangle  increasing
client_age: young, old; values 18, 30, 50, 40, 60, 80; function: increasing
price: very cheap, cheap, moderate, expensive; values 1, 10, 35, 70, 15, 150
date_of_p: beg. of month, mid. of month, end. of month; values 8, 20
time_of_p: morning, noon, afternoon, evening; values 9, 11, 5, 12, 2, 6

10 Classification Value Space
The result of transforming the data set values to the fuzzy domain using the CS is a 3D structure, the Classification Value Space (CVS), with axes attributes (Ai), classification categories (lj) and tuples (tk).
[Figure: the data set S mapped to the cube CVS(S).]
The front face of this structure stores the original data set, while each of the other cells C[Ai, lj, tk], where j, k > 1, stores the degree of belief (d.o.b.) μlj(S.tk.Ai).
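One way to picture the structure is a nested mapping indexed by attribute, then category, then tuple position. The sketch below is an assumption about layout only: `build_cvs`, the `"value"` key used for the front face, and the two age membership functions are all hypothetical.

```python
def clip01(v):
    return min(1.0, max(0.0, v))


# Hypothetical classification space: one attribute with two overlapping categories.
classification_space = {
    "age": {
        "young": lambda a: clip01((50 - a) / 20),
        "old": lambda a: clip01((a - 30) / 20),
    }
}


def build_cvs(dataset, space):
    """Return cvs[attribute][category][k] = d.o.b. of tuple k in that category."""
    cvs = {}
    for attr, categories in space.items():
        # Front face: the original values of the attribute, one per tuple.
        cvs[attr] = {"value": [row[attr] for row in dataset]}
        for category, mu in categories.items():
            cvs[attr][category] = [mu(row[attr]) for row in dataset]
    return cvs


dataset = [{"age": 30}, {"age": 40}, {"age": 50}]
cvs = build_cvs(dataset, classification_space)
```

Keeping the original values next to the degrees of belief means later queries can be answered from the CVS alone, without going back to the source table.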

11 Information Measures for Decision Support

12 Information Measures in CVS: Category Energy metric
The Category Energy metric expresses the importance of an attribute category: the overall belief that the data set includes objects of the category li.
[Figure: the CVS slice for the attribute salary, with the categories low, medium and high across the tuples tk.]
It answers questions such as:
- Which category is better supported by the data set?
- What is the degree of belief that our data set contains objects of category li?

13 Information Measures in CVS: Attribute Energy metric
The Attribute Energy metric expresses the information content of the data set regarding attribute Ai.
[Figure: the CVS slice for the attribute Salary, with the categories low, medium and high.]
It answers the question: what is the amount of information regarding the considered categories for the attribute Ai?
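The energy formulas themselves are not in the transcript, so the following is only a sketch under two assumptions: a category's energy is taken as the plain sum of its degrees of belief over all tuples, and an attribute's energy as the sum of its category energies. The second assumption is at least consistent with the experimental slide, where Eage = 939.71 equals Eyoung + Eold = 398.75 + 540.96. The function names and the belief values are hypothetical.

```python
def category_energy(beliefs):
    # Assumption: aggregate a category's degrees of belief by summing over tuples.
    return sum(beliefs)


def attribute_energy(categories):
    # Assumption: an attribute's energy is the sum of its category energies.
    return sum(category_energy(b) for b in categories.values())


# Hypothetical degrees of belief for four tuples.
age = {
    "young": [1.0, 0.5, 0.0, 0.25],
    "old": [0.0, 0.5, 1.0, 0.75],
}

energies = {category: category_energy(b) for category, b in age.items()}
best_supported = max(energies, key=energies.get)
e_age = attribute_energy(age)
```

Comparing the entries of `energies` answers "which category is better supported", while `e_age` summarizes how much classified information the attribute carries overall.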

14 Classification Scheme Quality Assessment

15 Quality Criteria
Main question: How successful is a classification model? How well do the defined classes fit the data?
Criteria for a successful classification model:
- High values of class/attribute energy
- Minimum entropy in the defined classes

16 Quality Assessment: Uncertainty of a class
It evaluates the uncertainty within a class.
[Equation]
where N is the number of tuples in the data set.
When the membership values of the data to the classes are all equal, i.e., μij = 1/nc, where nc is the number of classes, Unc_Cl(cj) attains its highest value, log2(nc).
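The stated maximum can be checked numerically. The sketch below uses the Shannon entropy of a single membership vector, which is a stand-in chosen because it reaches log2(nc) exactly when every membership equals 1/nc; it is not claimed to be the Unc_Cl formula from the slide. The name `membership_entropy` is hypothetical.

```python
import math


def membership_entropy(mu):
    # Shannon entropy (base 2) of one tuple's membership degrees; zero terms are skipped.
    return -sum(m * math.log2(m) for m in mu if m > 0)


nc = 4
uniform = [1 / nc] * nc          # equal membership in every class
crisp = [1.0, 0.0, 0.0, 0.0]     # full membership in a single class

h_uniform = membership_entropy(uniform)
h_crisp = membership_entropy(crisp)
h_mixed = membership_entropy([0.7, 0.3, 0.0, 0.0])
```

Equal memberships give the highest uncertainty, a crisp assignment gives none, and partially overlapping memberships fall in between.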

17 Quality Assessment: Overall belief of a class
The overall belief that a data set supports a class is given by the equation:
[Equation]
where N is the number of tuples in the data set.

18 Quality Assessment: Information coefficient of a class
It is an index of the quality of the class under consideration.
[Equation]
It combines two terms:
- The first expresses the significance of a class in the data set, i.e., the amount of information included in the specific class.
- The second is an indication of the class uncertainty. It evaluates the deviation of the class uncertainty from the case in which all membership values to a class are equal (i.e., the case of no clustering tendency or improper definition of classes).

19 Quality Assessment: Information coefficient of a classification scheme
Info_Coef can be used as a measure for finding the model that best fits a data set, taking into account the uncertainty included in its values.

20 Experimental Study

21 Relative importance of classes in a data set
R = {Salary, Age}
Salary: Low 170, Medium 278, High 419.7
Age: Young 25.66, Old 51.6
HMF: σ = 0.5

22 Relative importance of classes in a data set
(a) Salary: Elow, Emedium, Ehigh, Esalary
(b) Age: Eyoung = 398.75, Eold = 540.96, Eage = 939.71
Ehigh > Elow: the data set supports high salaries with more confidence than low salaries.
Eold > Eyoung: we are more confident that the data set contains old employees than young ones.

23 Selecting the optimal classification scheme.
[Figure: (a) a data set classified in four clusters; (b) the graph of Info_Coef versus the number of clusters for a synthetic two-dimensional data set.]

24 Selecting the optimal classification scheme.
[Figure: (a) a data set classified in three classes; (b) the graph of Info_Coef versus the number of clusters for a two-dimensional data set, "salary and age".]

25 Conclusion
- Maintenance of classification belief all the way through the classification process.
- Information measures enabling decisions related to:
  i. the relative importance of classes in a data set (e.g., "young vs. old customers"),
  ii. the information content of data sets.
- Quality assessment of classification models, so as to find how well a model fits the underlying data set.

26 Further Work
- Evaluation of the classification models throughout the life cycle of a data set, as insertions, updates and deletions occur.
- Application of the proposed framework to study whether a model based on a specific data set A fits a different data set with a schema similar to that of A.
- Study of different mapping functions and their effect on the proposed classification scheme as regards uncertainty representation.

27 THANK YOU FOR YOUR ATTENTION !

