Managing uncertainty and quality in the classification process

Presentation transcript:

Managing uncertainty and quality in the classification process
SETN 2002
Maria Halkidi, Michalis Vazirgiannis
Email: {mhalk, mvazirg}@aueb.gr
Dept. of Informatics, Athens University of Economics & Business
WWW: http://www.db-net.aueb.gr

Outline
Introduction & Motivation
A framework that manages uncertainty in classification
Exploiting knowledge based on information measures
Experimental study
Summary & further work
Maria Halkidi, Michalis Vazirgiannis, AUEB 02/01/2019

Introduction
Classification is one of the main data mining tasks: it assigns a data item to one of a predefined set of classes.
The goal of the classification process is to induce a model that can be used to classify future data items whose classification is unknown.
Classification is based on a well-defined set of classes and a training set of pre-classified examples.

Motivation
Classification results may hide "useful" knowledge about the data set that the majority of data mining methods ignore. These methods assume that:
The initial classes do not overlap.
The data values are treated equally in the classification process.

Our Contribution
The contributions of the proposed framework are summarized as follows:
Maintenance of classification belief all the way through the classification process. A value set can be assigned to more than one class, with a different belief for each.
Decision support tools for decisions related to: (i) the relative importance of classes in a data set (e.g., "young vs. old customers"), (ii) the relative importance of classes across data sets, and (iii) the information content of different data sets.
Quality assessment of a classification model. This procedure is useful for evaluating models and selecting the one that best fits the data under consideration.

Steps of the classification approach
[Figure labels: Data Set; Definition of initial classes (Classes Definition, Membership Functions: CS); Mapping to the fuzzy domain, Classification (CS, d.o.b.s); CVS (axes Ai, lj, tk); Queries & Decision support; Quality assessment.]

Classification Space
The term Classification Space (CS) refers to the specifications for mapping database values to the fuzzy domain.
For each attribute Ai we define the corresponding classification set LAi = {ct | ct is a classification tuple}, where ct = (li, [v1, v2], fi): li is a lexical category, [v1, v2] is the corresponding value interval, and fi is the assigned transformation function.
The value domains may overlap.
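
The classification tuple (li, [v1, v2], fi) can be pictured as a small record plus a lookup. The following is a minimal sketch, not the authors' implementation: the class name `ClassificationTuple`, the helper `degrees_of_belief`, the linear shapes chosen for the decreasing, triangle and increasing functions, and the choice to return 0 outside the interval are all assumptions. The intervals are the client_salary values from the sales schema slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical linear shapes: the slides name these functions but do not define them.
def decreasing(x: float, v1: float, v2: float) -> float:
    return (v2 - x) / (v2 - v1)  # 1 at v1, falling to 0 at v2


def increasing(x: float, v1: float, v2: float) -> float:
    return (x - v1) / (v2 - v1)  # 0 at v1, rising to 1 at v2


def triangle(x: float, v1: float, v2: float) -> float:
    mid, half = (v1 + v2) / 2, (v2 - v1) / 2
    return 1.0 - abs(x - mid) / half  # peak of 1 at the midpoint


@dataclass(frozen=True)
class ClassificationTuple:
    category: str                              # lexical category li
    v1: float                                  # lower end of the value interval
    v2: float                                  # upper end of the value interval
    f: Callable[[float, float, float], float]  # transformation function fi

    def belief(self, x: float) -> float:
        # Assumption: a value outside [v1, v2] has zero belief in this category.
        if not (self.v1 <= x <= self.v2):
            return 0.0
        return self.f(x, self.v1, self.v2)


# Classification set for client_salary; note that the intervals overlap.
SALARY_CS: List[ClassificationTuple] = [
    ClassificationTuple("low", 1500, 3000, decreasing),
    ClassificationTuple("medium", 2500, 5500, triangle),
    ClassificationTuple("high", 4000, 10000, increasing),
]


def degrees_of_belief(x: float, cs: List[ClassificationTuple]) -> Dict[str, float]:
    """Map one database value to a degree of belief per lexical category."""
    return {ct.category: ct.belief(x) for ct in cs}


print(degrees_of_belief(2750, SALARY_CS))
```

Because the intervals overlap, a salary of 2750 receives a non-zero belief in both "low" and "medium", which is the behaviour the framework relies on.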

Hypertrapezoidal Membership Functions
The effect of the crispness factor σ on one-dimensional data sets:
σ = 1: non-fuzzy
σ = 0.5: trapezoidal
σ = 0: triangular
[Figure: membership functions around the points λ1, λ2, λ3 for each value of σ.]
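
The effect of the crispness factor can be illustrated in one dimension. This is a sketch under stated assumptions rather than the definition used in the talk: it assumes sorted points λ1 < λ2 < ..., and that between two neighbouring points the membership stays at 1 for a fraction σ/2 of the gap, drops linearly, and reaches 0 at a fraction σ/2 before the next point. The function names `ramp` and `memberships` are hypothetical.

```python
from typing import List, Sequence


def ramp(t: float, sigma: float) -> float:
    """Membership in the left point at relative position t in [0, 1] of the gap."""
    lo, hi = sigma / 2.0, 1.0 - sigma / 2.0
    if t <= lo:
        return 1.0          # flat top next to the left point
    if t >= hi:
        return 0.0          # fully handed over to the right point
    return (hi - t) / (hi - lo)  # linear transition in between


def memberships(x: float, points: Sequence[float], sigma: float) -> List[float]:
    """Degrees of belief of x in the category of each point (points sorted ascending)."""
    mu = [0.0] * len(points)
    if x <= points[0]:
        mu[0] = 1.0
        return mu
    if x >= points[-1]:
        mu[-1] = 1.0
        return mu
    for k in range(len(points) - 1):
        left, right = points[k], points[k + 1]
        if left <= x <= right:
            t = (x - left) / (right - left)
            mu[k] = ramp(t, sigma)
            mu[k + 1] = 1.0 - mu[k]  # the two neighbours share the belief
            break
    return mu


points = [0.0, 4.0, 8.0]
for sigma in (0.0, 0.5, 1.0):
    print(sigma, [memberships(x, points, sigma) for x in (1.0, 2.0, 3.0)])
```

With σ = 0 the transition spans the whole gap (triangular), with σ = 0.5 it spans the middle half (trapezoidal), and with σ = 1 it collapses to a step at the midpoint (non-fuzzy).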

The Classification Space (attributes, lexical categories, value domains and mapping functions) for the sales schema.
client_salary: low (min 1500, max 3000, decreasing); medium (min 2500, max 5500, triangle); high (min 4000, max 10000, increasing)
client_age: young, old; values 18, 30, 50, 40, 60, 80; function increasing
price: very cheap, cheap, moderate, expensive; values 1, 10, 35, 70, 15, 150
date_of_p: beg. of month, mid. of month, end of month; values 8, 20
time_of_p: morning, noon, afternoon, evening; values 9, 11, 5, 12, 2, 6

Classification Value Space
Transforming the data set values to the fuzzy domain using the CS yields a 3D structure, the Classification Value Space CVS(S), whose axes are the attributes (Ai), the classification categories (lj) and the tuples (tk) of the data set S.
The front face of this structure stores the original data set, while each of the other cells C[Ai, lj, tk], where j, k > 1, stores the degree of belief (d.o.b.) μlj(S.tk.Ai).
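
The 3D structure can be sketched as a nested mapping indexed by attribute, category and tuple position. The layout below is an assumption made for illustration: the function `build_cvs`, the dictionary keys "data" and "dob", and the two age membership functions (with their break points) are hypothetical and are not taken from the talk.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
ClassificationSpace = Dict[str, Dict[str, Callable[[float], float]]]


def clamp01(v: float) -> float:
    """Keep a degree of belief inside [0, 1]."""
    return max(0.0, min(1.0, v))


def build_cvs(data: List[Row], cs: ClassificationSpace) -> dict:
    """Return the 'front face' (original data) plus one d.o.b. per (Ai, lj, tk) cell."""
    return {
        "data": data,
        "dob": {
            attr: {cat: [f(row[attr]) for row in data] for cat, f in cats.items()}
            for attr, cats in cs.items()
        },
    }


# Hypothetical membership functions for a single attribute.
cs: ClassificationSpace = {
    "age": {
        "young": lambda a: clamp01((40 - a) / 20),
        "old": lambda a: clamp01((a - 30) / 20),
    }
}
data = [{"age": 25}, {"age": 35}, {"age": 60}]

cvs = build_cvs(data, cs)
print(cvs["dob"]["age"]["young"])  # d.o.b. of each tuple in category "young"
print(cvs["dob"]["age"]["old"])    # d.o.b. of each tuple in category "old"
```

Keeping the original rows alongside the degrees of belief mirrors the slide's description of a front face that stores the data set itself.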

Information Measures for Decision Support

Information Measures in CVS: Category Energy metric
The category energy expresses the importance of an attribute category: the overall belief that the data set includes objects of the category li.
It answers questions such as: Which category is better supported by the data set? What is the degree of belief that our data set contains objects of category li?
[Figure: salary categories low, medium, high across the tuples tk.]

Information Measures in CVS: Attribute Energy metric
The attribute energy expresses the information content of the data set regarding attribute Ai.
It answers the question: What is the amount of information regarding the considered categories for the attribute Ai?
[Figure: salary categories low, medium, high.]
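
The slides do not preserve the energy formulas, so the following is only a sketch under explicit assumptions: the category energy is taken to be the plain sum of the degrees of belief of all tuples in that category, and the attribute energy the sum of its category energies. The second assumption is consistent with the totals reported in the experimental study; the first is the simplest aggregate and may differ from the authors' definition. The function names and the toy degrees of belief are hypothetical.

```python
from typing import Dict, List


def category_energy(dobs: List[float]) -> float:
    """Assumed form: overall belief in a category = sum of d.o.b. over all tuples."""
    return sum(dobs)


def attribute_energy(categories: Dict[str, List[float]]) -> float:
    """Assumed form: attribute energy = sum of the energies of its categories."""
    return sum(category_energy(d) for d in categories.values())


# Toy degrees of belief for four tuples (illustrative values only).
salary = {
    "low": [1.0, 0.5, 0.0, 0.0],
    "medium": [0.0, 0.5, 1.0, 0.25],
    "high": [0.0, 0.0, 0.0, 0.75],
}

energies = {cat: category_energy(d) for cat, d in salary.items()}
best_supported = max(energies, key=energies.get)

print(energies)                  # per-category energy
print(best_supported)            # category best supported by the data
print(attribute_energy(salary))  # energy of the whole attribute
```

Comparing the per-category values answers the "which category is better supported" question, while the attribute total summarizes how much belief the data set carries for that attribute overall.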

Classification Scheme Quality Assessment

Quality Criteria
Main question: How successful is a classification model? How well do the defined classes fit the data?
Criteria for a successful classification model:
High values of class/attribute energy
Minimum entropy in the defined classes

Quality Assessment: Uncertainty of a class
It evaluates the uncertainty within a class: [equation], where N is the number of tuples in the data set.
When the membership values of the data to the classes are all equal, i.e., μij = 1/nc, Unc_Clcj attains its highest value, log2(nc).
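
The stated maximum can be checked numerically with an entropy-style measure. The formula below is an assumption, not the per-class formula from the talk (which is not preserved here): it averages the Shannon entropy of each tuple's membership values over the N tuples, which reproduces the property that equal memberships μij = 1/nc give log2(nc). The function name `uncertainty` is hypothetical.

```python
import math
from typing import List


def uncertainty(mu: List[List[float]]) -> float:
    """Assumed form: -(1/N) * sum_i sum_j mu[i][j] * log2(mu[i][j]).

    mu[i][j] is the membership of tuple i in class j; zero terms contribute nothing.
    """
    n = len(mu)
    total = 0.0
    for row in mu:
        for m in row:
            if m > 0:
                total -= m * math.log2(m)
    return total / n


nc = 4
uniform = [[1.0 / nc] * nc for _ in range(5)]  # every tuple equally in every class
crisp = [[1.0, 0.0, 0.0, 0.0] for _ in range(5)]  # every tuple fully in one class

print(uncertainty(uniform), math.log2(nc))
print(uncertainty(crisp))
```

Under this assumed form, equal memberships give the largest value and crisp assignments give zero, which matches the "minimum entropy" quality criterion.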

Quality Assessment: Overall belief of a class
The overall belief that a data set supports a class is given by the equation: [equation], where N is the number of tuples in the data set.

Quality Assessment: Information coefficient of a class
It is an index of the quality of the class under consideration. It involves two quantities:
[symbol]: the significance of a class in the data set, i.e., the amount of information included in the specific class.
[symbol]: an indication of the class uncertainty. It evaluates the deviation of the class uncertainty from the case in which all membership values to a class are equal (i.e., the case of no clustering tendency or an improper definition of classes).

Quality Assessment: Information coefficient of a classification scheme
Info_Coef can be used as a measure for finding the model that fits a data set, taking into account the uncertainty included in its values.

Experimental Study

Relative importance of classes in a data set
R = {Salary, Age}
Salary: Low 170; Medium 278; High 419.7
Age: Young 25.66; Old 51.6
HMF: σ = 0.5

Relative importance of classes in a data set
(a) Salary: Elow = 304.2736; Emedium = 343.2594; Ehigh = 348.0000; Esalary = 995.5320
(b) Age: Eyoung = 398.75; Eold = 540.96; Eage = 939.71
Ehigh > Elow: the data set supports high salaries with more confidence than low salaries.
Eold > Eyoung: we are more confident that the data set contains old employees than young ones.
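
The reported figures can be cross-checked with simple arithmetic: each attribute energy is close to the sum of its category energies (exactly for age, and to within about 0.001 for salary). The snippet below uses only the values listed on this slide; the dictionary layout and the helper name `gap` are illustrative.

```python
# Energies as reported on the slide.
reported = {
    "salary": {
        "categories": {"low": 304.2736, "medium": 343.2594, "high": 348.0000},
        "attribute": 995.5320,
    },
    "age": {
        "categories": {"young": 398.75, "old": 540.96},
        "attribute": 939.71,
    },
}


def gap(entry: dict) -> float:
    """Absolute difference between the category sum and the reported attribute energy."""
    return abs(sum(entry["categories"].values()) - entry["attribute"])


for name, entry in reported.items():
    ranked = sorted(entry["categories"], key=entry["categories"].get, reverse=True)
    print(name, "best supported:", ranked[0], "gap:", round(gap(entry), 4))
```

The ranking printed for each attribute matches the slide's reading: high salaries and old employees are the best supported categories.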

Selecting the optimal classification scheme
[Figure: (a) a data set classified in four clusters; (b) the graph of Info_Coef versus the number of clusters for a synthetic two-dimensional data set.]

Selecting the optimal classification scheme
[Figure: (a) a data set classified in three classes; (b) the graph of Info_Coef versus the number of clusters for a two-dimensional data set, "salary and age".]

Conclusion
Maintenance of classification belief all the way through the classification process.
Information measures enabling decisions related to: (i) the relative importance of classes in a data set (e.g., "young vs. old customers"), and (ii) the information content of data sets.
Quality assessment of classification models, so as to find how well a model fits the underlying data set.

Further Work
Evaluation of the classification models throughout the life cycle of a data set, as insertions, updates and deletions occur.
Application of the proposed framework to study whether a model based on a specific data set A fits a different data set with a schema similar to that of A.
Study of different mapping functions and their effect on the proposed classification scheme as regards uncertainty representation.

THANK YOU FOR YOUR ATTENTION!