数据挖掘 Introduction to Data Mining
Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities, Spring 2019
Introduction
Last week: What is data mining (数据挖掘)? Why do we do data mining? What types of data? What types of patterns can we find in data?
Important: see the QQ group; the PPTs are on the website.
Course schedule (日程安排)
Lecture 1: Introduction; what is the knowledge discovery process?
Lecture 2: Exploring the data
Lectures 3 and 4: Classification
Lectures 5 and 6: Association analysis
Lecture 7: Clustering
Lecture 8: Anomaly detection and advanced topics
Classification (分类) – part 1
Based on chapters 8 and 9
Introduction
Today, we will discuss a popular data mining task named classification (分类): classifying objects/instances into several classes (categories).
Example 1: predict whether a person will like the movie “Monkey King” (西游记之大闹天宫): the class of persons who like the movie vs. the class of persons who do not.
Example 2: predict whether a written character is the letter A, B or C (classes “Letter A”, “Letter B”, “Letter C”).
Example 3: identify the topic of a news article (classes “Sports”, “International news”, “Entertainment”).
What kind of data?
We will assume that the data is stored in a table, where each column is a dimension (attribute, variable) and each row is a record (instance):

NAME   AGE  INCOME  GENDER  EDUCATION
John   99   1 元    Male    Ph.D.
Lucia  44   20 元   Female  Master
Paul   33   25 元
Daisy  20   50 元           High school
Jack   15   10 元

For example, “Male” is the value of the attribute GENDER for the record John.
What kind of data?
To do classification, we need to select one attribute as the “target attribute” (目标属性): the attribute that we want to predict. In the table above, EDUCATION could be chosen as the target attribute.
Goal of classification
Classification (分类) means predicting the value of the target attribute for new data. The possible values of the target attribute are called “classes”. For example, given a new record (Macy, age 35), we want to predict the value of the target attribute EDUCATION.
Training data (训练数据)
To perform classification, we need training data: a set of records where the value of the target attribute is known. Here, the classes are Ph.D., Master, High school, etc.
Building a classifier (分类器)
Using the training data, we want to build a classifier (分类器): a model that can be used to predict the value of the target attribute based on the values of the other attributes.
Types of classifiers
There exist several types of classifiers:
decision trees (决策树): CART, ID3, C4.5, SLIQ, SPRINT…
neural networks (人工神经网络) / deep learning (深度学习)
SVMs, the Naïve Bayes classifier (朴素贝叶斯分类器), associative classifiers, etc.
We will discuss a few of them to see how they work and how they can be used, and discuss their advantages and limitations.
Building/using a classifier
(1) Build a classifier (model, 分类器) from the training data (训练数据). (2) Apply the classifier to the testing data (测试数据) to make predictions (e.g. Yes, No, …).
What is a good classifier?
A good classifier can perform predictions for new records, and those predictions should be accurate. There are various performance measures:
accuracy = (number of correct predictions) / (number of records)
precision (for a class) = (number of records correctly predicted as that class) / (number of records predicted as that class)
…
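The two measures above can be computed in a few lines. A minimal Python sketch (the function and variable names are mine, not from the lecture):

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred, positive):
    """Among records predicted as `positive`, the fraction that truly are."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    return tp / predicted_pos

print(accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))  # 0.75
```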
Classification vs. regression
In regression (线性回归), the target attribute is a continuous value (e.g. the weight of a person). In classification, the target attribute is a discrete value (e.g. fraud or not fraud?). Classification generally works well for predicting binary or nominal attributes. It may not work as well for ordinal attributes (e.g. small, medium, large) or hierarchical attributes (e.g. human, mammal, animal…).
Using classifiers to understand the data
Some classifiers can be used to understand the data: they indicate the criteria used to distinguish between the different classes, e.g. decision trees (决策树). Other types of classifiers work well but are difficult for humans to interpret, e.g. neural networks (人工神经网络).
Various applications
Examples: determining
whether some human cells are malignant (恶性细胞),
whether credit card transactions are legitimate or fraudulent,
the topic of a news article {sports, entertainment, weather, …},
the political views, age, and gender of a person on a social network (社会网络).
Decision trees (决策树)
Example of a decision tree (决策树)
From the training data (训练数据), a classifier (here, a decision tree) is built using some of the attributes for decision-making: Refund (Yes → leaf NO; No → MarSt), MarSt (Married → leaf NO; Single or Divorced → TaxInc), TaxInc (< 80K → leaf NO; ≥ 80K → leaf YES). Note: several decision trees using different attributes may be created for the same data.
A second example of decision tree
For the same training data, a different tree can be built, using different attributes for decision-making: MarSt first (Married → leaf NO; Single or Divorced → Refund), then Refund (Yes → leaf NO; No → TaxInc), then TaxInc (< 80K → leaf NO; ≥ 80K → leaf YES). Which tree is better? We will discuss this later.
Example: how to use a decision tree for prediction
Given a record from the testing data (测试数据), we start from the root (根节点) of the tree and follow, at each decision node, the branch that matches the record’s attribute value: first Refund (Yes / No), then MarSt (Married / Single, Divorced), then, if needed, TaxInc (< 80K / ≥ 80K). For a record with Refund = No and MarSt = Married, we reach the leaf NO. The prediction made by the decision tree is: “Cheat = NO”.
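The traversal described above can be sketched in Python. The nested-dict encoding of the tree is a hypothetical illustration of the slides’ tree, assuming the continuous TaxInc attribute has been pre-discretized into "<80K" / ">=80K":

```python
def predict(node, record):
    """Follow the branches matching the record's attribute values until a leaf."""
    while isinstance(node, dict):          # inner (decision) nodes are dicts
        node = node["branches"][record[node["attr"]]]
    return node                            # leaf nodes are plain class labels

# Hypothetical encoding of the slides' tree as nested dicts:
taxinc = {"attr": "TaxInc", "branches": {"<80K": "NO", ">=80K": "YES"}}
marst = {"attr": "MarSt",
         "branches": {"Married": "NO", "Single": taxinc, "Divorced": taxinc}}
tree = {"attr": "Refund", "branches": {"Yes": "NO", "No": marst}}

print(predict(tree, {"Refund": "No", "MarSt": "Married", "TaxInc": "<80K"}))  # NO
```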
Vocabulary
ROOT NODE (根节点): the root of the tree (e.g. Refund).
DECISION NODES or INNER NODES (内部节点): nodes where a choice is made based on an attribute value (e.g. MarSt, TaxInc).
LEAF NODES (叶节点): nodes representing classes (e.g. NO, YES).
How are decision trees built?
Hunt’s algorithm
Let Dt be the set of records reaching a node t. Initially, Dt is the set of all records in the database. Procedure:
If all records in Dt belong to the same class yt, then t becomes a leaf with the label yt.
If Dt = ∅, then t becomes a leaf node with the default class yd.
If the records in Dt belong to several classes, then t becomes a decision node: an attribute is chosen to split the records.
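The procedure above can be sketched recursively. This is a minimal illustration, not a production implementation: it picks attributes in a fixed order rather than with an impurity measure, and all names are mine:

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Recursive sketch of Hunt's algorithm.
    `records` is a list of (attributes_dict, class_label) pairs."""
    if not records:                       # Dt is empty: leaf with the default class yd
        return default_class
    classes = {label for _, label in records}
    if len(classes) == 1:                 # all records share one class yt: leaf yt
        return classes.pop()
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    if not attributes:                    # nothing left to split on: majority leaf
        return majority
    attr = attributes[0]                  # naive choice; real algorithms pick the best attribute
    branches = {}
    for value in {feats[attr] for feats, _ in records}:
        subset = [(f, c) for f, c in records if f[attr] == value]
        branches[value] = hunt(subset, attributes[1:], majority)
    return {"attr": attr, "branches": branches}
```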
Example
Training data: the target attribute is “Cheat”.
Example (continued)
The records do not all belong to the same class (the “Cheat” attribute takes both Yes and No values), so we choose the “Refund” attribute to try to separate the records.
If Refund = Yes, all records belong to the same class (Cheat = No); hence, we create a leaf node “Don’t cheat”.
If Refund = No, the records do not all belong to the same class; hence, we create a decision node, choosing the attribute “Marital Status”.
If Refund = No and Marital Status = Married, all records are of the same class, so we create a leaf node “Don’t cheat”.
If Refund = No and Marital Status = Single or Divorced, the records are still not all of the same class, so we create a decision node on “Taxable Income” to try to separate them.
With the branches Taxable Income < 80K and ≥ 80K, all the records reaching each node are of the same class, so we create the corresponding leaf nodes. The tree is complete!
How to choose the attributes for building a decision tree?
The “greedy” approach: we build the tree by always choosing the attribute that best separates the data at the current node. By doing this, we hope to obtain the best tree, but this is not guaranteed.
Challenges:
It is sometimes possible to separate records using many different attributes.
When should we stop growing the tree? Should we use a small tree or a very big tree?
What criterion should we use to split the records (e.g. income > 80K)? It depends on the type of attribute.
For nominal attributes (名义属性)
Multi-way split: one branch for each value, e.g. Car type → {family}, {sport}, {luxury}.
Binary split: only two branches; we must find the best way to partition the values, e.g. Car type → {sport, luxury} vs. {family}, or {family, luxury} vs. {sport}.
For ordinal attributes (顺序属性)
Multi-way split: one branch for each value, e.g. Size → {small}, {medium}, {large}.
Binary split: two branches; the order between values must be respected, e.g. Size → {small, medium} vs. {large}. A split such as {small, large} vs. {medium} violates the order.
For continuous attributes (连续属性)
First approach: discretize the values into several ranges.
Second approach: binary decision, e.g. (Price < 55 $) or (Size < 140 cm). We need to consider multiple possibilities and choose the best one, which can be time-consuming for a computer.
Which attribute should we choose to split the data?
Suppose that we have 20 records, 10 belonging to class C0 and 10 to class C1. Which attribute should we choose to split the data? There are several possibilities; which tree is the best?
Which attribute should we choose to split the data?
The “greedy approach” (贪心的方法): homogeneous nodes (nodes containing a single class) are preferred. We need an impurity measure: a non-homogeneous node has high impurity, while a homogeneous node has low impurity.
How to measure impurity?
Several measures exist. Consider a node t, and let p(i|t) denote the fraction of records at t that belong to class i:
Entropy(t) = −Σᵢ p(i|t) log₂ p(i|t)
GINI(t) = 1 − Σᵢ p(i|t)²
ClassificationError(t) = 1 − maxᵢ p(i|t)
The entropy is used by algorithms such as ID3 and C4.5. The GINI index is used by algorithms such as CART, SLIQ and SPRINT. Note: by convention, 0 log₂ 0 = 0.
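The three impurity measures above are straightforward to compute. A minimal Python sketch (function names are mine):

```python
import math
from collections import Counter

def class_fractions(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def entropy(labels):
    # convention from the slide: 0 * log2(0) = 0, hence the `if p` filter
    return -sum(p * math.log2(p) for p in class_fractions(labels) if p)

def gini(labels):
    return 1 - sum(p * p for p in class_fractions(labels))

def classification_error(labels):
    return 1 - max(class_fractions(labels))

node = ["C1"] * 2 + ["C2"] * 4     # 2 records of class C1, 4 of class C2
print(round(gini(node), 3))        # 0.444
print(round(entropy(node), 2))     # 0.92
print(round(classification_error(node), 3))  # 0.333
```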
Example of calculation
We can see that these three measures vary in a similar way.
Which attribute should we choose to separate the data? (general idea)
Before splitting, we compute the measure for the current node: M0. Then, we compute the measure for each attribute that could be used to split the data: attribute A creates nodes N1 and N2 with measures M1 and M2, combined into M12; attribute B creates nodes N3 and N4 with measures M3 and M4, combined into M34. We choose the attribute that provides the best gain: M0 − M12 vs. M0 − M34.
The “GINI” measure
Measuring a node’s impurity using GINI
Consider a node t, and let p(j|t) be the relative frequency of class j at node t: GINI(t) = 1 − Σⱼ p(j|t)². The maximum value is 1 − 1/(number of classes), reached when the records are equally distributed among the classes. The minimum is 0, reached when all records belong to a single class. A lower value is better (the node is purer, so the attribute is more discriminative).
Examples of GINI calculations for a node t
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Gini(t) = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
P(C1) = 1/6, P(C2) = 5/6 → Gini(t) = 1 − (1/6)² − (5/6)² = 0.278
P(C1) = 2/6, P(C2) = 4/6 → Gini(t) = 1 − (2/6)² − (4/6)² = 0.444
Using GINI to evaluate how well an attribute separates the data
Consider a node t divided into k child nodes using an attribute. Let nᵢ be the number of records in child node i and n the number of records at node t:
GINIsplit = Σᵢ (nᵢ/n) × GINI(i), summing over the k child nodes.
This formula is a weighted average: child nodes containing many records weigh more, and purer children give a lower (better) value.
Example of GINI for a binary attribute
(1) Divide the records into two nodes using an attribute B: B = yes → node N1 (5 records of class C1, 2 of class C2), B = no → node N2 (1 record of class C1, 4 of class C2).
(2) Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408; Gini(N2) = 1 − (1/5)² − (4/5)² = 0.32
(3) How good is splitting on attribute B? GINIsplit = (7/12) × 0.408 + (5/12) × 0.32 ≈ 0.371
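The calculation above can be checked in code. A small sketch, using the class counts implied by the Gini values on the slide (5 C1 / 2 C2 in N1, 1 C1 / 4 C2 in N2); function names are mine:

```python
def gini(labels):
    n = len(labels)
    fractions = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in fractions)

def gini_split(partitions):
    """Weighted average of the child nodes' Gini values (lower is better)."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)

n1 = ["C1"] * 5 + ["C2"] * 2   # child node for B = yes
n2 = ["C1"] * 1 + ["C2"] * 4   # child node for B = no
print(round(gini(n1), 3))              # 0.408
print(round(gini(n2), 3))              # 0.32
print(round(gini_split([n1, n2]), 3))  # 0.371
```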
Another example: multi-way split vs. binary split
We can either create a child node for each value of the attribute (Car type → Family, Sport, Luxury) or do a binary split (only two branches), and compare the resulting GINIsplit values.
Can GINI be used for continuous attributes?
Yes! We can use a binary decision (two choices only), e.g. income > 80K vs. ≤ 80K. How do we choose the split value (80? 70? 60?)? There are many possibilities. We could compute GINI for each possible value, but that would not be very fast.
A better approach
To find the best split value efficiently:
Sort the records by increasing value of the attribute.
Scan the sorted values, updating the class-count table (matrix) incrementally and computing GINI at each candidate split position.
Choose the split value with the smallest GINI.
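The sorted-scan idea above can be sketched as follows. This is a hedged illustration (names and the example data are mine, patterned on the slides’ taxable-income table): sort once, then sweep the candidate midpoints while updating two class-count vectors.

```python
def gini_from_counts(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts)

def best_threshold(values, labels, classes=("Yes", "No")):
    """Return (lowest weighted Gini, midpoint threshold) after one sorted scan."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = [0, 0]                                            # counts below the threshold
    right = [sum(1 for _, l in pairs if l == c) for c in classes]
    best = (float("inf"), None)
    for i in range(n - 1):
        idx = classes.index(pairs[i][1])                     # move record i to the left side
        left[idx] += 1
        right[idx] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                         # no valid cut between equal values
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        w = ((i + 1) / n) * gini_from_counts(left) \
            + ((n - i - 1) / n) * gini_from_counts(right)
        if w < best[0]:
            best = (w, threshold)
    return best

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
g, t = best_threshold(incomes, cheat)
print(t, round(g, 3))  # 97.5 0.3
```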
The entropy measure
The entropy measure for a node t
Entropy(t) = −Σᵢ p(i|t) log₂ p(i|t)
The maximum is log₂(nc), reached when the records are evenly distributed among the nc classes. The minimum is 0, reached when all records belong to the same class. This measure is based on Shannon’s information theory.
Comparison of the entropy and GINI measures for two classes
Let p be the fraction of records in one of the two classes. Both measures are 0 when all records belong to the same class (p = 0 or p = 1), and both are maximal when there are as many records in each class (p = 0.5).
Examples of entropy calculation
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Entropy = −0 log₂ 0 − 1 log₂ 1 = 0
P(C1) = 1/6, P(C2) = 5/6 → Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6 → Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
How good is a split in terms of entropy?
Gain = Entropy(parent) − Σᵢ (nᵢ/n) × Entropy(child i), summing over the k child nodes.
We choose the attribute that provides the largest decrease in entropy (i.e. that maximizes the gain). This measure is used by ID3 and C4.5. Disadvantage: it tends to favor attributes that split the records into many very small but pure nodes.
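The gain formula above can be sketched directly from class counts. A minimal illustration (function names and the example counts are mine):

```python
import math

def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# e.g. a parent with 10 records (5 per class) split into (4, 0) and (1, 5):
print(round(information_gain([5, 5], [[4, 0], [1, 5]]), 2))  # 0.61
```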
A variation of the gain: the gain ratio
GainRatio = Gain / SplitINFO, where SplitINFO = −Σᵢ (nᵢ/n) log₂(nᵢ/n) over the k child nodes.
This formula uses SplitINFO to penalize splits that create too many nodes. It is used by C4.5, and was proposed to address drawbacks of the gain.
When should we stop?
Do not split a node:
when all the records belong to a single class,
when all the records have similar attribute values,
if the gain is non-negative but smaller than a predetermined threshold,
if the number of records is less than a threshold,
…
Properties of decision trees
Why use decision trees?
Small trees are easy for humans to understand.
Building a decision tree is very fast: O(n × d × log d), where n is the number of attributes and d is the number of records.
Classifying new instances is extremely fast: O(w), where w is the depth of the tree.
Accuracy is similar to other classifiers on some simple datasets.
No hypothesis is made on the data distribution.
Why use decision trees? (continued)
They are very good for some tasks: e.g. Kinect uses a “forest” of several trees (a random forest; see programmer.info/news/105-artificial-intelligence/2176-kinects-ai-breakthrough-explained.html).
Decision trees can be quite noise-tolerant.
Decision trees can also avoid the problem of overfitting if appropriate techniques are used.
Example: how a decision tree separates the records of a dataset with two numeric attributes
Until now, we have only discussed trees that use one attribute at a time to split the data. For this data, a decision tree works well because the dataset is linearly separable (线性可分).
Here is an example of data for which basic decision trees cannot provide a perfect solution. A solution is to use an oblique decision tree, in which a node may test a condition on two attributes at once: e.g. a node with the condition x + y < 1, whose branches lead to Class = + and Class = −.
Advanced discussion
It has been shown that some other classifiers, such as feedforward neural networks (人工神经网络), can approximate any continuous function (连续函数) with arbitrary precision when using at least one hidden layer. These are existence proofs:
G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems 2 (4), 1989.
K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (5), 1989.
K.-I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (3), 1989.
Other problems
Some attributes may be used several times in the same branch, and some subtrees may be identical; trees may then become difficult to interpret. One solution is to use more complex conditions to split the data; another is to use rules instead of trees. (Illustrations: Han & Kamber, 2011)
The problem of overfitting (过拟合)
Overfitting
Two types of error:
Training error: the error on the training data.
Generalization error: the error on the testing data.
Overfitting (过拟合) is said to occur when the generalization error is greater than the training error.
Overfitting (example 1)
Consider the following training data, with only two classes.
Overfitting (illustration 1)
Here, overfitting is caused by the number of nodes in the decision tree: the tree becomes too tailored to the training data.
Underfitting (欠拟合)
Underfitting occurs when the model is too small to learn the true structure of the data; in that case, the error rate is high.
Overfitting and underfitting
These problems are related to the complexity of the model (tree). There are also other reasons for overfitting.
Other causes of overfitting
Overfitting may be caused by noise: a data point that is an error in the data, or maybe just an exception.
Other causes of overfitting
Overfitting may also be caused by a lack of records in some areas of the data space. In this case, having more data would influence how new records in those areas are classified.
Is there overfitting?
Several methods exist to estimate the generalization error:
Optimistic approach: assume that the generalization error equals the training error.
Pessimistic approach: add a penalty for each leaf node of the tree; the number of errors at a node t is estimated from the training errors e(t) as e’(t) = e(t) + 0.5.
Using held-out data: part of the data (for example, 1/3) may be kept aside for estimating the error.
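The pessimistic estimate can be sketched in one function. This is a hedged illustration: the 0.5 penalty per leaf follows the formula above, and the function name is mine:

```python
def pessimistic_error_rate(training_errors, num_leaves, num_records, penalty=0.5):
    """Estimate the generalization error rate from the training error by
    adding a fixed penalty for each leaf node of the tree."""
    return (training_errors + penalty * num_leaves) / num_records

# e.g. a tree with 7 leaves making 4 training errors on 24 records:
print(pessimistic_error_rate(4, 7, 24))  # 0.3125
```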
How to avoid overfitting
When constructing the tree (pre-pruning): stop splitting if the number of records in a node is too small, if the gain is too small, …
After constructing the tree (post-pruning): cut branches of the tree; as long as the generalization error decreases, keep trimming. Note: when a branch is cut, the class of the new leaf node is the majority class of the records in that node.
Other problems
Data fragmentation: the number of records may become too small to be significant at some leaves of the tree.
Search strategy: there exist different ways of building trees.
Expressiveness: some types of decision trees may not be able to model the data well.
How to evaluate the performance of a classifier?
Several approaches
Split the data into a training set and a testing set (e.g. 2/3 and 1/3 of the data). Problem: part of the data is not used for training the model. An alternative is k-fold cross-validation, …
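The k-fold idea can be sketched as an index generator: each of the k folds serves as the test set once, and the remaining folds form the training set. A minimal illustration (names and the round-robin fold assignment are mine):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation
    over records 0..n-1."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in k_fold_indices(6, 3):
    print(sorted(test))   # [0, 3] then [1, 4] then [2, 5]
```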
How to compare the performance of classifiers?
Example: a tree M1 has 15% error on 30 records, while a tree M2 has 25% error on 3000 records. Which one is better? There exist methods to estimate a confidence interval for the error or accuracy (see Section 4.6 of the book).
Conclusion
We introduced the topic of classification and the concept of classifiers. We learnt about decision trees, a particular type of classifier that is quite popular because it is simple and interpretable by humans.
References
Han and Kamber (2011), Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers, chapters 8 and 9.
Tan, Steinbach and Kumar (2006), Introduction to Data Mining, Pearson Education, chapter 4.