COMP527: Data Mining
M. Sulaiman Khan (mskhan@liv.ac.uk)
Dept. of Computer Science, University of Liverpool, 2009
General Data Mining Issues
February 03, 2009

COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Today's Topics
Machine Learning?
Input to Data Mining Algorithms
Data types
Missing values
Noisy values
Inconsistent values
Redundant values
Number of values
Over-fitting / Under-fitting
Scalability
Human Interaction
Ethical Data Mining

Machine Learning
What do we mean by 'learning' when applied to machines?
Not just committing to memory (= storage)
Can't require consciousness
Learn facts (data), or processes (algorithms)?
"Things learn when they change their behaviour in a way that makes them perform better" (Witten)
Ties to future performance, not the act itself
But things change behaviour for reasons other than 'learning'
Can a machine have the intent to perform better?

Inputs
The aim of data mining is to learn a model for the data. This could be called a concept of the data, so our outcome will be a concept description.
Eg, the task is to classify emails as spam/not spam. The concept to learn is the concept of 'what is spam?'
Input comes as instances. Eg, the individual emails.
Instances have attributes. Eg, sender, date, recipient, words in text.

Inputs
Use attributes to determine what about an instance means that it should be classified as a particular class. == Learning!
Obvious input structure: table of instances (rows) and attributes (columns)

WEKA's ARFF Format
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@data
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
5.0, 3.6, 1.4, 0.2
...
But what about non-numeric data?
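To make the table-of-instances idea concrete, here is a minimal sketch of reading the all-numeric ARFF text from this slide into Python lists. The function name parse_numeric_arff and the SAMPLE constant are illustrative assumptions, not part of WEKA; the sketch only handles the directives shown on the slide (plus '%' comment lines) and is not a full ARFF reader.

```python
SAMPLE = """\
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@data
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
"""


def parse_numeric_arff(text):
    """Return (relation, attribute_names, rows) for an all-numeric ARFF text."""
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            # Each data line is one instance: comma-separated attribute values.
            rows.append([float(value) for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.strip().lower() != "numeric":
                raise ValueError("this sketch only handles numeric attributes")
            attributes.append(name)
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_numeric_arff(SAMPLE)
print(relation, attributes, rows[0])
```

Each row is one instance and each position in the row lines up with one declared attribute, which is the rows-and-columns structure described on the previous slide.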

Data Types
Nominal: Prespecified, finite number of values
  eg: {cat, fish, dog, squirrel}
  Includes boolean {true, false} and all enumerations.
Ordinal: Orderable, but no concept of distance
  eg: hot > warm > cool > cold
  Domain-specific ordering, but no notion of how much hotter warm is compared to cool.

Data Types
Interval: Ordered, fixed unit
  eg: 1990 < 1995 < 2000 < 2005
  Difference between values makes sense (1995 is 5 years after 1990)
  Sum does not make sense (1990 + 1995 = year 3985??)
Ratio: Ordered, fixed unit, relative to a zero point
  eg: 1m, 2m, 3m, 5m
  Difference makes sense (3m is 1m greater than 2m)
  Sum makes sense (1m + 2m = 3m)

ARFF Data Types
Nominal:
  @attribute name {option1, option2, ... optionN}
Numeric:
  @attribute name numeric -- real values
Other:
  @attribute name string -- text fields
  @attribute name date -- date fields (ISO-8601 format)
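As a rough illustration of how these declarations might be told apart in code, the sketch below classifies a single @attribute line as nominal, numeric, string or date. The function parse_attribute and the attribute names in the usage lines are hypothetical, and only the four forms listed on this slide are recognised.

```python
def parse_attribute(line):
    """Return (name, kind, values) for one '@attribute ...' declaration.

    kind is 'nominal', 'numeric', 'string' or 'date'; values is the list of
    allowed options for nominal attributes and None otherwise.
    """
    _, name, spec = line.strip().split(None, 2)
    spec = spec.strip()
    if spec.startswith("{") and spec.endswith("}"):
        # Nominal: a prespecified, finite set of options inside braces.
        values = [option.strip() for option in spec[1:-1].split(",")]
        return name, "nominal", values
    kind = spec.split()[0].lower()
    if kind in ("numeric", "string", "date"):
        return name, kind, None
    raise ValueError("unrecognised attribute type: " + spec)


# Hypothetical attribute names, for illustration only.
print(parse_attribute("@attribute outlook {sunny, overcast, rainy}"))
print(parse_attribute("@attribute temperature numeric"))
```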

Data Issues: Missing Values
The following issues will come up over and over again, but different algorithms have different requirements.
What happens if we don't know the value for a particular attribute in an instance? For example, the data was never stored, was lost, or could not be represented.
Maybe that data was important!
ARFF records missing values with a ? in the table.
How should we process missing values?

Possible 'solutions' for dealing with missing values:
Ignore the instance completely (eg class missing in training data set).
  Not a very useful solution if it is in the test data to be classified!
Fill in values by hand.
  Could be very slow, and likely to be impossible.
Global 'missingValue' constant.
  Possible for enumerations, but what about numeric data?
Replace with attribute mean.
Replace with class's attribute mean.
Train new classifier to predict missing value!
Just leave as missing and require algorithm to apply appropriate technique.
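Two of the options above, replacing with the attribute mean and replacing with the class's attribute mean, can be sketched as follows. This is a minimal illustration under stated assumptions: instances are Python lists, a missing value (ARFF's ?) is represented as None, and the function names and toy data are made up for the example.

```python
from collections import defaultdict


def impute_mean(rows, col):
    """Replace None in column `col` with the mean of the known values."""
    known = [row[col] for row in rows if row[col] is not None]
    mean = sum(known) / len(known)  # assumes at least one known value
    filled = []
    for row in rows:
        row = list(row)  # copy so the input is left untouched
        if row[col] is None:
            row[col] = mean
        filled.append(row)
    return filled


def impute_class_mean(rows, col, class_col):
    """Replace None in column `col` with the mean over the same class."""
    totals = defaultdict(lambda: [0.0, 0])  # class -> [sum, count]
    for row in rows:
        if row[col] is not None:
            totals[row[class_col]][0] += row[col]
            totals[row[class_col]][1] += 1
    filled = []
    for row in rows:
        row = list(row)
        if row[col] is None:
            # Assumes every class has at least one known value.
            total, count = totals[row[class_col]]
            row[col] = total / count
        filled.append(row)
    return filled


# Toy data: [number_of_links, class]; None marks a missing value.
rows = [[1.0, "spam"], [3.0, "spam"], [None, "spam"], [10.0, "ham"], [None, "ham"]]
print(impute_mean(rows, 0))
print(impute_class_mean(rows, 0, 1))
```

The class-conditional version needs the class label, so it only applies to training data; for unlabelled test instances the plain attribute mean (or one of the other options) would have to be used.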

Noisy Values
By 'noisy data' we mean random errors scattered in the data. For example, due to inaccurate recording or data corruption.
Some noise will be very obvious:
  data has incorrect type (string in numeric attribute)
  data does not match enumeration (maybe in yes/no field)
  data is very dissimilar to all other entries (10 in an attr otherwise 0..1)
Some incorrect values won't be obvious at all. Eg typing 0.52 at data entry instead of 0.25.

Noisy Values
Some possible solutions:
  Manual inspection and removal
  Use clustering on the data to find instances or attributes that lie outside the main body (outliers) and remove them
  Use regression to determine function, then remove those that lie far from the predicted value
  Ignore all values that occur below a certain frequency threshold
  Apply smoothing function over known-to-be-noisy data
If noise is removed, can apply missing value techniques on it. If it is not removed, it may adversely affect the accuracy of the model.
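The regression option above might look like the following sketch: fit a straight line by ordinary least squares, then drop the points whose residual exceeds a threshold. The function names, the toy points, and the threshold of 10 are assumptions chosen for illustration; a real data set would need a threshold picked with care, and a single extreme point can itself distort the fitted line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    return slope, mean_y - slope * mean_x


def remove_far_points(points, max_residual):
    """Keep only the points within max_residual of the fitted line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    slope, intercept = fit_line(xs, ys)
    return [
        (x, y)
        for x, y in points
        if abs(y - (slope * x + intercept)) <= max_residual
    ]


# Toy data following y = 2x, with one corrupted entry at x = 3.
points = [(0, 0), (1, 2), (2, 4), (3, 30), (4, 8), (5, 10)]
print(remove_far_points(points, max_residual=10))
```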

Inconsistent Values
Some values may be recorded in different ways. For example 'coke', 'coca cola', 'coca-cola', 'Coca Cola' etc.
In this case, the data should be normalised to a single form. Can be treated as a special case of noise.
Some values may be recorded inaccurately on purpose!
  Email address: r.d.nospam.sanderson@...
  Spike in early census data for births on 11/11/1911. Had to put in some value, so defaulted to 1s everywhere. Oops! (Possibly urban legend?)
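Normalising to a single form can be sketched as a two-step process: first remove differences in case and punctuation, then map known aliases onto one canonical spelling. The ALIASES table below is a hypothetical, hand-built example; in practice it would come from domain knowledge.

```python
import re

# Hypothetical alias table: cleaned-up spelling -> canonical form.
ALIASES = {"coke": "coca cola"}


def canonical_key(value):
    """Lower-case the value and collapse punctuation/whitespace to one space."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def normalise(value):
    """Map a raw value onto its single canonical form."""
    key = canonical_key(value)
    return ALIASES.get(key, key)


raw_values = ["coke", "coca cola", "coca-cola", "Coca Cola"]
print({normalise(value) for value in raw_values})
```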

Redundant Values
Just because the base data includes an attribute doesn't make it worth giving to the data mining task.
For example, denormalise a typical commercial database and you might have:
  ProductId, ProductName, ProductPrice, SupplierId, SupplierAddress...
SupplierAddress is dependent on SupplierId (remember SQL normalisation rules?) so they will always appear together.
A 100% confident, 100% support association rule is not very interesting!
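One simple way to spot this kind of redundancy is to check whether one column functionally determines another in the data at hand. The sketch below does that for rows stored as dictionaries; the function name and the sample rows are invented for illustration, and a dependency that holds in a sample is only evidence, not proof, that it holds in the schema.

```python
def functionally_determines(rows, a, b):
    """True if every value of column `a` always appears with one value of `b`."""
    seen = {}
    for row in rows:
        # setdefault records the first b seen for this a, then returns it.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


# Invented rows from a denormalised product table.
rows = [
    {"ProductId": 1, "ProductPrice": 5, "SupplierId": "S1", "SupplierAddress": "1 High St"},
    {"ProductId": 2, "ProductPrice": 5, "SupplierId": "S2", "SupplierAddress": "9 Low Rd"},
    {"ProductId": 3, "ProductPrice": 7, "SupplierId": "S1", "SupplierAddress": "1 High St"},
]
print(functionally_determines(rows, "SupplierId", "SupplierAddress"))
print(functionally_determines(rows, "ProductPrice", "SupplierId"))
```

When the check succeeds, the dependent column (here SupplierAddress) is a candidate for removal before mining.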

Number of Attributes
Is there any harm in putting in redundant values? Yes for association rule mining, and ... yes for other data mining tasks too.
Can treat text as thousands of numeric attributes: term/frequency from our inverted indexes.
But not all of those terms are useful for determining (for example) if an email is spam. 'the' does not contribute to spam detection.
The number of attributes in the table will affect the time it takes the data mining process to run. It is often the case that we want to run it many times, so getting rid of unnecessary attributes is important.

Number of Attributes/Values
Called 'dimensionality reduction'. We'll look at techniques for this later in the course, but some simplistic versions:
  Apply upper and lower thresholds of frequency
  Noise removal functions
  Remove redundant attributes
  Remove attributes below a threshold of contribution to classification (eg if attribute is evenly distributed, adds no knowledge)
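The first of these, upper and lower frequency thresholds, can be sketched for text attributes as follows: count in how many documents each term occurs, then keep only terms whose document frequency falls between the two thresholds. The function name, the toy documents, and the threshold values are assumptions for illustration.

```python
from collections import Counter


def select_terms(documents, min_df, max_df):
    """Keep terms whose document-frequency fraction lies in [min_df, max_df]."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    total = len(documents)
    return sorted(
        term for term, count in doc_freq.items()
        if min_df <= count / total <= max_df
    )


documents = [
    "the cheap pills",
    "the meeting agenda",
    "the cheap watches",
    "the agenda",
]
# 'the' occurs everywhere (too common); 'pills' occurs once (too rare).
print(select_terms(documents, min_df=0.5, max_df=0.9))
```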

Over-Fitting / Under-Fitting
Learning a concept must stop at the appropriate time.
For example, could express the concept of 'Is Spam?' as a list of spam emails. Any email identical to those is spam.
Accuracy: 0% on new data, 100% on training data. Oops!
This is called Over-Fitting. The concept has been tailored too closely to the training data.
Story: US Military trained a neural network to distinguish tanks vs rocks. It would shoot the US tanks they trained it on very consistently and never shot any rocks ... or enemy tanks. [probably fiction, but amusing]
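The 'list of spam emails' concept from this slide is easy to write down, which makes the problem visible. In this sketch the classifier simply memorises the training spam; the email texts are invented, and the new examples are all spam that was not seen during training.

```python
def train_memoriser(training):
    """'Learn' spam by storing the exact text of every training spam email."""
    known_spam = {text for text, label in training if label == "spam"}
    return lambda text: "spam" if text in known_spam else "not spam"


def accuracy(classify, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(classify(text) == label for text, label in examples)
    return correct / len(examples)


training = [
    ("win money now", "spam"),
    ("cheap pills", "spam"),
    ("lunch at noon", "not spam"),
]
new_spam = [("win cash now", "spam"), ("cheap watches", "spam")]

classify = train_memoriser(training)
print(accuracy(classify, training))  # perfect on the data it memorised
print(accuracy(classify, new_spam))  # fails on spam it has not seen
```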

Over-Fitting / Under-Fitting
Extreme case of over-fitting: Algorithm tries to learn a set of rules to determine class.
  Rule1: attr1=val1/1 and attr2=val2/1 and attr3=val3/1 = class1
  Rule2: attr1=val1/2 and attr2=val2/2 and attr3=val3/2 = class2
Urgh. One rule for each instance is useless.
Need to prevent the learning from becoming too specific to the training set, but also don't want it to be too broad. Complicated!

Over-Fitting / Under-Fitting
Extreme case of under-fitting: Always pick the most frequent class, ignore the data completely.
Eg: if one class makes up 99% of the data, then a 'classifier' that always picks this class will be correct 99% of the time!
But probably the aim of the exercise is to determine the 1%, not the 99%... making it accurate 0% of the time when you need it.
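The arithmetic on this slide can be checked with a small sketch: a 'classifier' that always predicts the most frequent class scores 99% accuracy on a 99:1 data set while finding none of the rare class. The labels 'ok' and 'fraud' and the helper names are made up for the example.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label, ignoring every attribute."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def recall(predicted, actual, target):
    """Fraction of the true `target` instances that were predicted as such."""
    hits = sum(p == target for p, a in zip(predicted, actual) if a == target)
    return hits / sum(a == target for a in actual)


actual = ["ok"] * 99 + ["fraud"]  # 99% majority class, 1% of interest
predicted = [majority_class(actual)] * len(actual)

print(accuracy(predicted, actual))         # high overall accuracy
print(recall(predicted, actual, "fraud"))  # none of the rare class is found
```

This is why accuracy alone can be a misleading measure when the classes are imbalanced.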

Scalability
We may be able to reduce the number of attributes, but most of the time we're not interested in small 'toy' databases, but huge ones.
When there are millions of instances, and thousands of attributes, that's a LOT of data to try to find a model for.
Very important that data mining algorithms scale well.
  Can't keep all data in memory
  Might not be able to keep all results in memory either
  Might have access to distributed processing?
  Might be able to train on a sample of the data?
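For the last option, training on a sample, one standard technique that works when the data cannot be held in memory is reservoir sampling, which draws a uniform random sample of k instances in a single pass over a stream. The slides do not name a sampling method, so this choice is an assumption; the function name and the seed parameter are also illustrative.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from an iterable of unknown length.

    Only k items are held in memory at any time (reservoir sampling).
    """
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


# A generator stands in for a data set too large to load at once.
instances = (i for i in range(1_000_000))
print(reservoir_sample(instances, k=10, seed=0))
```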

Human Interaction
Problem Exists Between Keyboard And Chair.
Data Mining experts are probably not experts in the domain of the data.
  Need to work together to find out what is needed, and formulate queries
  Need to work together to interpret and evaluate results
Visualisation of results may be problematic
Integrating into the normal workflow may be problematic
How to apply the results appropriately may not be clear (eg Barbie + Chocolate?)

Ethical Data Mining
Just because we can doesn't mean we should.
Should we include marital status, gender, race, religion or other attributes about a person in a data mining experiment? Discrimination?
But sometimes those attributes are appropriate and important ... medical diagnosis, for example.
What about attributes that are dependent on 'sensitive' attributes? Neighbourhoods have different average incomes... discriminating against the poor by using location?
Privacy issues? Data Mining across time? Government sponsored data mining?

Further Reading
Witten, Chapters 1, 2
Dunham, Sections 1.3-1.5
Han, Sections 1.9, 11.4