Dr. Abdul Aziz Associate Dean Faculty of Computer Sciences Riphah International University Islamabad, Pakistan Dr. Nazir A. Zafar.

Presentation transcript:

Dr. Abdul Aziz, Associate Dean, Faculty of Computer Sciences, Riphah International University, Islamabad, Pakistan
Dr. Nazir A. Zafar, Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, Islamabad, Pakistan

2 Reduction in Over-Fitting for Classification without Compromising on Accuracy and Effectiveness

3 Machine Learning
Machine learning covers the following main types of learning:
Classification learning: learn to put instances into pre-defined classes based on other attributes.
Association learning: learn relationships between the attributes.
Clustering: discover classes of instances that belong together.
Regression: learn to predict a numeric quantity instead of a class.

5 Roots of Classification
1. Classification draws on the concepts of three major paradigms: Database technology, Statistics, Machines.
2. Domain knowledge, i.e. the expertise of the end-user.

7 Knowledge Discovery in Databases
1. The KDD process typically generates a model from past records with known target classes (outputs); the model is then used to predict the outputs of future records (new cases).
2. Applications include fraud detection, marketing, investment analysis and insurance.

8 Marketing example The goal is to predict whether a customer will buy a product given gender, country and age. Freitas and Lavington (1998) Data Mining, CEC99.

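9 [Figure: decision tree with a root test node "country?" (branches Germany, England, France), a second test node "age?" (branches <= 25 and > 25), and leaves labelled yes or no]
This is the decision tree induced from the Marketing example data. The first branch node is called the root of the tree.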

10 Tree induction
1. The tree is built by selecting one attribute at a time: the one that ‘best’ separates the classes.
2. The set of examples is then partitioned according to the value of the selected attribute.
3. This is repeated at each branch node until segmentation is complete.
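The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say how the ‘best’ attribute is measured, so information gain (as in ID3) is assumed here, age is treated as already binned into two ranges, and the ten records are hypothetical, since the original Marketing data are not reproduced in this transcript. The names `entropy`, `partition`, `best_attribute`, `induce` and `RECORDS` are made up for the sketch.

```python
import math
from collections import Counter

# Hypothetical records for illustration only; not the original Marketing data.
RECORDS = [
    {"gender": "m", "country": "Germany", "age": "<= 25", "buy": "no"},
    {"gender": "f", "country": "Germany", "age": "> 25", "buy": "no"},
    {"gender": "m", "country": "Germany", "age": "> 25", "buy": "no"},
    {"gender": "f", "country": "England", "age": "<= 25", "buy": "yes"},
    {"gender": "m", "country": "England", "age": "> 25", "buy": "yes"},
    {"gender": "m", "country": "France", "age": "<= 25", "buy": "yes"},
    {"gender": "f", "country": "France", "age": "<= 25", "buy": "yes"},
    {"gender": "m", "country": "France", "age": "> 25", "buy": "no"},
    {"gender": "f", "country": "France", "age": "> 25", "buy": "no"},
    {"gender": "m", "country": "France", "age": "> 25", "buy": "no"},
]


def entropy(rows, target):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def partition(rows, attr):
    """Step 2: group the rows by their value of attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def best_attribute(rows, attrs, target):
    """Step 1: pick the attribute with the highest information gain (assumed criterion)."""
    def gain(attr):
        groups = partition(rows, attr).values()
        remainder = sum(len(g) / len(rows) * entropy(g, target) for g in groups)
        return entropy(rows, target) - remainder
    return max(attrs, key=gain)


def induce(rows, attrs, target):
    """Step 3: recurse at each branch node until a node is pure or attributes run out."""
    classes = Counter(row[target] for row in rows)
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]  # leaf: majority class
    attr = best_attribute(rows, attrs, target)
    remaining = [a for a in attrs if a != attr]
    branches = partition(rows, attr)
    return {attr: {value: induce(group, remaining, target) for value, group in branches.items()}}


tree = induce(RECORDS, ["gender", "country", "age"], "buy")
```

On these made-up records the sketch splits on country first and then on age within France. A real implementation would also need to handle numeric thresholds and stopping rules, which are left out here.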

11 [Figure: the decision tree annotated with class counts]
country? (4Y 6N)
  Germany: no (0Y 3N)
  England: yes (2Y 0N)
  France: age? (2Y 3N)
    <= 25: yes (2Y 0N)
    > 25: no (0Y 3N)
Notice that in this simple example the leaf nodes contain records of one class only. The number of yes and no examples is conserved as you move up and down the tree.

12 Rule derivation
[Figure: the decision tree for the Marketing example, annotated with class counts]
If (country = Germany) then (Buy? = No)
What are the other rules?
Rules can be extracted directly from induction trees.
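One way to see how rules fall out of a tree is to walk every path from the root to a leaf and join the tests along the way. The sketch below assumes a nested-dictionary representation of the tree, which is a choice made for this example and not something the slides specify. The yes/no outcomes on the two age branches are read from the order of the labels on the slide, so treat them as an assumption. `MARKETING_TREE` and `extract_rules` are hypothetical names.

```python
# Assumed representation: {attribute: {test: subtree_or_leaf}}.
# Leaf labels on the age branches are inferred from the slide's label order.
MARKETING_TREE = {
    "country": {
        "= Germany": "No",
        "= England": "Yes",
        "= France": {"age": {"<= 25": "Yes", "> 25": "No"}},
    }
}


def extract_rules(tree, conditions=()):
    """Return one if-then rule per root-to-leaf path of the tree."""
    if not isinstance(tree, dict):
        # Reached a leaf: the accumulated conditions form the rule premise.
        premise = " and ".join(f"({attr} {test})" for attr, test in conditions)
        return [f"If {premise} then (Buy? = {tree})"]
    rules = []
    for attr, branches in tree.items():
        for test, subtree in branches.items():
            rules.extend(extract_rules(subtree, conditions + ((attr, test),)))
    return rules


rules = extract_rules(MARKETING_TREE)
for rule in rules:
    print(rule)
```

Each leaf yields exactly one rule, so a tree with four leaves gives four rules, the first of which is the Germany rule quoted on the slide.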

13 Heart Disease Dataset

14 What is needed?
1. With databases of enormous size, the user needs help to analyse the data more effectively than by simply querying and reporting.
2. Semi-automatic methods to extract useful, unknown (higher-level) information in a concise format will help the user make more sense of their data.

15 The KDD roadmap
1. KDD may be divided into the following stages: [Figure: KDD roadmap stages]
2. Note the iterative nature of the process.

16 Expertise required
1. Any organisation that undertakes a project in KDD will require much expert input to ensure that the results produced are of high quality, valid, interesting/useful/novel/surprising, and comprehensible by the human user.
2. “If patient is pregnant then gender is female” is very accurate, but is neither useful nor surprising.

19 Relative error reduction
Columns: S.No., Data Set, SRS, SDS, ER %
Data sets: 1. Heart Disease (Cleveland), 2. Credit-A, 3. Diabetes (PIMA), 4. Liver disorder (BUPA), 5. Breast cancer Wisconsin, 6. Hepatitis, 7. Ionosphere, 8. Boston housing, 9. Credit (German), 10. Iris, 11. Sonar; final row: Overall average
SRS: Simple Random Sampling
SDS: Systematic Distribution Sampling
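The slide does not define how the ER % column is computed. A common reading of "relative error reduction" is the drop in error rate expressed as a percentage of the baseline error rate, and the sketch below assumes exactly that. The function name and the numbers in the usage line are hypothetical and are not taken from the table.

```python
def relative_error_reduction(baseline_error, new_error):
    """Percentage reduction in error relative to the baseline (assumed definition).

    Both arguments are error rates on the same scale, e.g. percent misclassified.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical figures: a baseline error of 20% reduced to 15%.
example = relative_error_reduction(20.0, 15.0)
print(example)
```

Under this definition a negative result would mean the second sampling method made the error worse than the baseline.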

20 Comparison
Columns: Data Set, SRS (Accuracy, Over-fitting), SIS (Accuracy, Over-fitting)
Data sets: Heart-C, Credit-A, Diabetes, Liver, Cancer, Hepatitis, Ionosphere, Housing, Credit-G, Iris, Sonar; final row: Average
SRS: Simple Random Sampling
SIS: Stratified Induction Sampling

21 Conclusion
In this study, we have shown that partitioning the original data sets into training and test sets using the stratified induction approach reduces over-fitting significantly without compromising accuracy.
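The transcript does not spell out the stratified induction procedure, so the sketch below substitutes an ordinary class-stratified train/test split as a stand-in: each class is shuffled and divided separately, so both partitions keep roughly the original class proportions. Measuring over-fitting as training accuracy minus test accuracy is likewise an assumption. All names (`stratified_split`, `overfitting_gap`, `label_of`) are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.25, seed=0):
    """Split rows into (train, test) while preserving class proportions.

    label_of is a function returning the class label of a row.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def overfitting_gap(train_accuracy, test_accuracy):
    """Assumed over-fitting measure: how much better the model does on training data."""
    return train_accuracy - test_accuracy


# Hypothetical data: 40 'yes' rows and 60 'no' rows.
rows = [(i, "yes") for i in range(40)] + [(i, "no") for i in range(40, 100)]
train, test = stratified_split(rows, label_of=lambda row: row[1])
```

With a simple random split the class mix of the test set can drift from that of the full data set, especially on small data sets; stratifying removes that source of variation.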

22 Supporting Texts
Data Warehousing, Data Mining and OLAP, Alex Berson & Stephen Smith, McGraw-Hill (1997)
Predictive Data Mining, Sholom Weiss & Nitin Indurkhya, Morgan Kaufmann (1998)
Data Mining, Ian Witten & Eibe Frank, Morgan Kaufmann (1999)

23 Useful URLs
1. University of East Anglia School of Computing Sciences, UK
2. UCI ML repository, USA
3. KD Nuggets, USA

24 Questions and Answers Discussion

25 THANK YOU