Slide 1: COMP527: Data Mining
Classification: Challenges, Basics
M. Sulaiman Khan (mskhan@liv.ac.uk)
Dept. of Computer Science, University of Liverpool
February 05, 2009
These are the full course notes, but they are not quite complete. You should come to the lectures anyway. Really.

Slide 2: COMP527: Data Mining
Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam

Slide 3: Today's Topics
Classification
Basic algorithms: KNN, Perceptron, Winnow

Slide 4: Classification
Main idea: learn the concept of what it means to be part of a named class of instances.
This is called supervised learning, because the algorithm learns by example from data which is already correctly classified. The class attribute is often called the class label, so the algorithm learns from labelled data.
Two main phases:
Training: learn the classification model from labelled data
Prediction: use the pre-built model to classify new instances

Slide 5: Classification Accuracy
We need to use previously unseen instances to test a classifier. Over-fitting is the main problem: classifiers often learn too specific a model, and testing on data that was used in training would reinforce this problem. We therefore need to split the data set into training and testing sets.
Revised phases for accuracy estimation:
Split the data set into distinct training and testing sets
Build the classifier with the training set
Assess accuracy with the testing set, normally expressed as a percentage
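
A minimal sketch of this evaluation procedure, assuming instances are (attribute_vector, label) pairs and the classifier exposes a predict method (the helper names are illustrative, not from the slides):

    import random

    def train_test_split(instances, test_fraction=0.3, seed=42):
        # Shuffle a copy of the data, then cut it into training and testing sets.
        data = list(instances)
        random.Random(seed).shuffle(data)
        cut = int(len(data) * (1 - test_fraction))
        return data[:cut], data[cut:]

    def accuracy(classifier, test_set):
        # Proportion of test instances whose predicted label matches the true label.
        correct = sum(1 for x, label in test_set if classifier.predict(x) == label)
        return correct / len(test_set)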

Slide 6: Comparing Methods
Accuracy: percentage of instances classified correctly
Speed: computational cost of both learning the model and predicting classes
Robustness: ability to cope with noisy or missing data
Scalability: ability to cope with very large amounts of data
Interpretability: is the model understandable to a human, or otherwise useful?

Slide 7: Classification vs Prediction
Classification predicts a class label from a given finite set.
The label is a nominal attribute, so unordered and enumerable
Sometimes called a categorical attribute
Some algorithms predict a probability for more than one label
Prediction predicts a number instead of a label.
The set of possible outcomes is ordered and infinite
Also often called regression or numeric prediction
Often viewed as a function

Slide 8: Eager vs Lazy Learners
Eager learner: constructs the model when it receives the training data. The model it builds is likely to be very different in structure from the data.
Lazy learner: doesn't construct a model during training, only when classifying new instances. It does only enough work at training time to ensure that instances can be compared later. Sometimes called instance-based learners.
Most classifiers are eager, but there's an important lazy classifier called KNN: K Nearest Neighbour.

Slide 9: But First...
Which group, left or right, for these two flowers?
(Experiment reported in Cognitive Science, 2002)

Slide 10: Resemblance
People classify things by finding other items that are similar and have already been classified. For example: is a new species a bird? Does it have the same attributes as lots of other birds? If so, then it's probably a bird too.
This is a combination of rote memorisation and the notion of 'resembles'. Although kiwis can't fly like most other birds, they resemble birds more than they resemble other types of animals.
So the problem is to find which instances most closely resemble the instance to be classified.

Slide 11: KNN: Distance Measures
Distance (or similarity) between instances is easy to compute if the data is numeric.
Typically we use Euclidean distance: d = √((x1i - x1j)² + (x2i - x2j)² + ...)
Also Manhattan / city-block distance: d = |x1i - x1j| + |x2i - x2j| + ...
However, we should first normalise all of the values to the same scale. Otherwise income will overpower age, for example.
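
A small sketch of these measures in Python, with min-max normalisation to put every attribute on the same 0..1 scale (the helper names are illustrative, not from the slides):

    import math

    def euclidean(a, b):
        # d = square root of the summed squared per-attribute differences
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def manhattan(a, b):
        # d = sum of the absolute per-attribute differences
        return sum(abs(x - y) for x, y in zip(a, b))

    def min_max_normalise(rows):
        # Rescale each attribute (column) to 0..1 so that, e.g., income cannot overpower age.
        lows = [min(col) for col in zip(*rows)]
        highs = [max(col) for col in zip(*rows)]
        return [[(v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, lo, hi in zip(row, lows, highs)]
                for row in rows]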

Slide 12: KNN: Non-Numeric Distance
For nominal attributes, we can only compare whether the value is the same or not. Equivalently, an enumeration can be divided into many boolean attributes.
Some attributes might be converted into values between which distance can be determined by some function, e.g. colour or temperature.
Text can be treated as one attribute per word, with the word's frequency as the value, normalised to 0..1, and preferably with very high-frequency words ignored (e.g. the, a, as, is...).
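
A toy sketch of that text representation (the stop-word set is only an example, not from the slides):

    STOP_WORDS = {"the", "a", "as", "is", "of", "and"}

    def term_frequency_vector(text, vocabulary):
        # One attribute per vocabulary word; the value is the word's relative
        # frequency in the text, ignoring very common stop words.
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        total = len(words) or 1
        return [words.count(term) / total for term in vocabulary]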

Slide 13: KNN: Classification
The classification process is then straightforward:
Find the k closest instances to the test instance
Predict the most common class among those instances (or the mean, for numeric prediction)
What value to use for k? It depends on the dataset size: large databases need a higher k, whereas a high k on a small dataset might cross class boundaries.
Calculate accuracy on the test set for increasing values of k, and use a hill-climbing algorithm to find the best; typically use an odd number to help avoid ties.
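
Putting the pieces together, a minimal k-nearest-neighbour classifier might look like this (a sketch, assuming a distance function such as euclidean above; names are illustrative):

    from collections import Counter

    def knn_classify(train, test_instance, k, distance):
        # train is a list of (attribute_vector, label) pairs.
        # Keep the k training instances closest to the test instance.
        neighbours = sorted(train, key=lambda pair: distance(pair[0], test_instance))[:k]
        # Predict the most common class among those k neighbours.
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

For example, knn_classify(train, x, k=5, distance=euclidean) performs the kind of 5-NN prediction shown on the next slide.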

Slide 14: KNN: Classification
5-NN example: find the 5 instances closest to the black point; 3 are blue and 2 are red, so predict blue.

Slide 15: KNN: Classification
Classification can be very slow because of the need to find the k nearest instances: in a trivial implementation, each prediction could take |D| comparisons. Using indexing it can easily be improved, and it is also easy to parallelise, as each comparison is completely independent of the others.
We can remove instances from the data set that do not help; for example, a tight cluster of 1000 instances of the same class is unnecessary for k < 50.
We can also use advanced data structures to improve the speed of classification, by storing the instance information appropriately.

Slide 16: KNN: kD-Trees
A kD-tree is a binary tree that divides the input space with a plane, then splits each resulting partition recursively. Each split is made parallel to an axis and through an instance.
A typical strategy is to find the point closest to the mean in the current partition and split through it, along a different axis to the previous split (in practice, along the axis with the greatest variance).
To search, descend the tree to the leaf partition that contains the test instance and search only that partition; then, if an edge of that partition is closer than any of the k closest instances found so far, search the parent partitions as well.
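
In practice an existing kd-tree implementation is usually used rather than hand-rolling one; a sketch with SciPy's KDTree (the points and class labels are made-up example data, not from the slides):

    import numpy as np
    from collections import Counter
    from scipy.spatial import KDTree

    X = np.array([[7, 4], [6, 7], [2, 2], [8, 8], [1, 6]], dtype=float)  # training points
    y = ["red", "blue", "red", "blue", "blue"]                           # their class labels

    tree = KDTree(X)                          # build the kd-tree once over the training data
    dists, idx = tree.query([5.0, 5.0], k=3)  # distances and indices of the 3 nearest neighbours
    prediction = Counter(y[i] for i in idx).most_common(1)[0][0]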

Slide 17: KNN: kD-Trees
The first split is at instance (7,4), then again at (6,7), which divides the search space into more easily searchable sections.

Slide 18: KNN: kD-Trees
Then, to classify the star, descend into the section containing both the star and the black instance. But note that the instance in the other section is closer, so we must still check the adjacent area. The shaded area is the black node's sibling and hence cannot contain closer points.

Slide 19: Perceptron and Winnow
Two very simple eager methods: Perceptron and Winnow. They both use the idea of a single neuron that fires when given the right stimuli. (We'll look at this idea again later under Neural Networks.)
First, keep in mind that the input to the perceptron must be a vector of numbers. Second, it can only answer a two-class problem: either the neuron fires (class 1) or it doesn't (class 2).

Slide 20: Perceptron and Winnow
The square boxes are inputs, the w lines are weights, and the circle is the perceptron. The learning problem is to find the correct weights to apply to the attributes.
The bias is a fixed input (1) whose weight is learnt in the same way as the other attributes, so that the perceptron can simply check whether the result is > 0 to decide if it should fire.

Slide 21: Perceptron
For each attribute we have an input node. There is one output node to which all of them connect, with a weight on each connection. We can then multiply each weight by its value and add them all up:
w0a0 + w1a1 + ... + wnan
Set this expression equal to 0 and it is the equation of a hyperplane. So essentially we are learning the hyperplane that separates the two classes, and classification is just checking which side of the plane the instance falls on. But how do we learn the weights?

Slide 22: Perceptron
Remember that instances are a set of numeric attributes (a vector). We can also treat the weights on the connections as a vector, and we only want to classify between two classes. So:

    weightVector = [0, ..., 0]
    while classificationFailed:
        for each training instance I:
            if classify(I) != I.class:
                if I.class == class1:
                    weightVector += I
                else:
                    weightVector -= I

No complicated higher math here!
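
A runnable Python version of this training rule (a sketch, assuming labels are +1 and -1 and each attribute vector already includes the fixed bias input of 1; names are illustrative):

    def perceptron_train(instances, max_epochs=100):
        # instances: list of (attribute_vector, label) pairs with label in {+1, -1}.
        w = [0.0] * len(instances[0][0])
        for _ in range(max_epochs):
            failed = False
            for x, label in instances:
                fires = sum(wi * xi for wi, xi in zip(w, x)) > 0
                predicted = 1 if fires else -1
                if predicted != label:
                    failed = True
                    # Add misclassified positives to the weights, subtract misclassified negatives.
                    w = [wi + label * xi for wi, xi in zip(w, x)]
            if not failed:
                break
        return w

The loop only stops early once every instance is classified correctly, so the max_epochs cap matters when the classes are not linearly separable.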

Slide 23: Winnow
Winnow only updates the weights when it finds a misclassified instance, and uses multiplication rather than addition to do the update. It only works when the attribute values are also binary (1 or 0).

    delta = (user defined)
    while classificationFailed:
        for each instance I:
            if classify(I) != I.class:
                if I.class == class1:
                    for each attribute ai in I:
                        if ai == 1: wi *= delta
                else:
                    for each attribute ai in I:
                        if ai == 1: wi /= delta
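
A runnable sketch of the same idea (the firing threshold is an assumption made for illustration; the slide does not specify one):

    def winnow_train(instances, delta=2.0, threshold=None, max_epochs=100):
        # instances: list of (binary_attribute_vector, label) pairs with label in {1, 0};
        # delta is the user-defined promotion/demotion factor.
        n = len(instances[0][0])
        if threshold is None:
            threshold = n / 2  # assumed default threshold, not from the slides
        w = [1.0] * n
        for _ in range(max_epochs):
            failed = False
            for x, label in instances:
                predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0
                if predicted != label:
                    failed = True
                    for i, ai in enumerate(x):
                        if ai == 1:
                            # Promote active weights if the neuron should have fired, demote otherwise.
                            w[i] = w[i] * delta if label == 1 else w[i] / delta
            if not failed:
                break
        return w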

Slide 24: Further Reading
Witten, Section 3.8 and pp. 124-136
Han, Sections 6.1 and 6.9
Dunham, Sections 4.1-4.3
Berry and Linoff, Chapter 8
Berry and Browne, Chapter 6
Devijver and Kittler, Pattern Recognition: A Statistical Approach, Chapter 3
For KNN and Perceptron: Wikipedia, as always :)