Classification.

Presentation transcript:

Classification

Classification and Prediction. Classification: predicts categorical class labels (discrete or nominal); constructs a model based on the training set and the values of the classifying attribute (the class label), and uses that model to classify new data. Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.

Classification Problem Given a database D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class. This effectively divides D into equivalence classes. Prediction is similar, but may be viewed as having an infinite number of classes.

Classification Ex: Grading If x >= 90 then grade = A. If 80 <= x < 90 then grade = B. If 70 <= x < 80 then grade = C. If 60 <= x < 70 then grade = D. If x < 60 then grade = F. (The corresponding decision tree tests x against the thresholds 90, 80, 70, and 60 in turn.)
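The grading rules above are just a decision tree written as nested thresholds; a minimal Python sketch (the helper name grade is hypothetical):

def grade(x):
    # Walk the thresholds of the grading decision tree above.
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'

print([grade(s) for s in (95, 83, 61, 42)])  # ['A', 'B', 'D', 'F']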

Classification Ex: Letter Recognition View letters as constructed from 5 components. (Figure: letters A through F built from these components.)

Classification Examples Teachers classify students' grades as A, B, C, D, or F. Identify mushrooms as poisonous or edible. Classify a river as flooding or not. Identify an individual as a certain credit risk. Speech recognition: identify the spoken word from the classified examples (rules). Pattern recognition: identify a pattern from the classified ones. Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

Classification — A Two-Step Process. Model construction: describe a set of predetermined classes. Each tuple is assumed to belong to a predefined class, as determined by the class label attribute; the training set is the set of tuples used for model construction; the model is represented as (1) classification rules, (2) a decision tree, or (3) mathematical formulae. Model usage: classify future or unknown objects. Estimate the accuracy of the model: feed test samples with known class labels into the model and compare the model's answers with the known classes; the accuracy rate is the percentage of test-set samples correctly classified by the model; the test set must be independent of the training set, otherwise over-fitting will occur. If the accuracy is acceptable, use the model to classify data items whose class labels are unknown. Most common techniques use DTs, NNs, or are based on distances or statistical methods.

Process (1): Model Construction. Classification algorithms applied to the Training Data produce a Classifier (Model), e.g., IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Model Usage. The classifier is applied first to Testing Data and then to Unseen Data, e.g., (Jeff, Professor, 4) → Tenured?
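To make the two-step process concrete, here is a hedged sketch using scikit-learn; the tenure data and the numeric encoding of rank are invented for illustration and are not from the slides:

from sklearn.tree import DecisionTreeClassifier

# Toy training data: (rank, years) -> tenured, with rank encoded as
# 0 = assistant professor, 1 = associate professor, 2 = professor.
X_train = [[2, 7], [1, 5], [2, 2], [0, 7], [1, 8], [0, 3]]
y_train = ['yes', 'no', 'yes', 'yes', 'yes', 'no']

model = DecisionTreeClassifier()   # Process (1): model construction
model.fit(X_train, y_train)

print(model.predict([[2, 4]]))     # Process (2): classify unseen data such as (Jeff, Professor, 4)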

Defining Classes Partitioning Based Distance Based

Supervised vs. Unsupervised Learning. Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation; new data are classified based on the training set. Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of observations or measurements, the goal is to establish the classes or clusters present in the data.

Issue (1): Data Preparation. Data cleaning: preprocess data in order to reduce noise and handle missing values. Relevance analysis (feature selection): remove irrelevant or redundant attributes. Data transformation: generalize and/or normalize data. Missing data: ignore it, or replace it with an assumed value.

Issue (2): Evaluating Classification Methods. Predictive accuracy: classification accuracy on test data; confusion matrix; OC curve. Speed and scalability: time to construct the classifier; time to use the classifier. Robustness: handling noise and missing values. Scalability: efficiency on disk-resident databases. Interpretability: the understanding and insight provided by the model. Goodness of rules: decision tree size; compactness of classification rules.
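A small sketch of two of the evaluation tools above, predictive accuracy and a confusion matrix; the test labels are invented for illustration:

from collections import Counter

def evaluate(y_true, y_pred):
    # Accuracy: fraction of test-set samples classified correctly.
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Confusion matrix: counts of (actual, predicted) pairs.
    confusion = Counter(zip(y_true, y_pred))
    return accuracy, confusion

y_true = ['yes', 'yes', 'no', 'no', 'yes']
y_pred = ['yes', 'no', 'no', 'yes', 'yes']
acc, cm = evaluate(y_true, y_pred)
print(acc)                 # 0.6
print(cm[('yes', 'no')])   # 1 (a 'yes' tuple predicted as 'no')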

Height Example Data accuracy

Classification Performance True Positive False Negative False Positive True Negative

Confusion Matrix Example Using height data example with Output1 correct and Output2 actual assignment

Operating Characteristic Curve

Classification Using Decision Trees Partitioning based: divide the search space into rectangular regions; a tuple is placed into a class based on the region within which it falls. DT approaches differ in how the tree is built (DT induction). Internal nodes are associated with an attribute and arcs with values for that attribute. Algorithms: ID3, C4.5, CART

Decision Tree Given: D = {t1, …, tn} where ti=<ti1, …, tih>; the database schema contains {A1, A2, …, Ah}; classes C={C1, …., Cm}. A Decision or Classification Tree is a tree associated with D such that each internal node is labeled with an attribute Ai, each arc is labeled with a predicate that can be applied to the attribute at its parent, and each leaf node is labeled with a class Cj.

DT Induction

DT Splits Area M Gender F Height

Comparing DTs Balanced Deep

Decision Tree Induction Training Dataset

Output: A Decision Tree for “buys_computer”. age? branches to <=30, 31..40, and >40; for age <=30, test student? (no → no, yes → yes); for age 31..40, the leaf is yes; for age >40, test credit rating? (excellent → no, fair → yes). An example from Quinlan’s ID3.

Decision Tree Induction Algorithm. Basic approach (a greedy algorithm): the tree is constructed top-down, recursively, in divide-and-conquer fashion. At the start, all the examples are at the root. Only categorical attributes are handled (continuous-valued attributes are discretized in advance). Examples are partitioned recursively based on selected attributes. Attributes are selected on the basis of (1) a heuristic or (2) a statistical measure (e.g., information gain). Conditions for stopping the partitioning: all samples at a node belong to the same class; there are no remaining attributes for further partitioning (the leaf is labeled by majority voting); there are no samples left.

Decision Tree Induction is often based on Information Theory.

Information

DT Induction When all the marbles in the bowl are mixed up, little information is given. When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given. Use this approach with DT Induction !

Information/Entropy Given probabilities p1, p2, .., ps whose sum is 1, Entropy is defined as H(p1, …, ps) = −Σi pi log2(pi). Entropy measures the amount of randomness, surprise, or uncertainty. The goal in classification is no surprise, i.e., entropy = 0.
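A minimal sketch of this definition in Python:

import math

def entropy(probs):
    # H = -sum(p_i * log2(p_i)); terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                    # 0.0 -> a pure class, no surprise
print(entropy([0.5, 0.5]))               # 1.0 -> maximum uncertainty for two classes
print(f"{entropy([9/14, 5/14]):.3f}")    # 0.940, the I(9, 5) value used below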

Entropy (figure: the binary entropy H(p, 1−p) and log(1/p) plotted against p)

Attribute Selection: Information Gain (ID3/C4.5). S contains si tuples of class Ci for i = {1, …, m}. Three quantities are used (formulas reconstructed below): the information measure gives the information required to classify an arbitrary tuple; the entropy of attribute A with values {a1,a2,…,av}; and the information gained by branching on attribute A.
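The three quantities referenced above, written out (standard ID3 definitions, reconstructed here because the slide's formula images are not included):

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i), \quad p_i = s_i / s

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})

\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)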

Information Gain Example. Class P: buys_computer = “yes”; Class N: buys_computer = “no”. I(p, n) = I(9, 5) = 0.940. Compute the entropy for age: the branch “age <=30” covers 5 of the 14 tuples, with 2 ‘yes’ and 3 ‘no’; the other age branches are computed in the same way (full arithmetic in the worked example below).

Generating Rules from the Decision Tree. Knowledge is represented as IF-THEN rules: one rule is created for each path from the root to a leaf; each attribute-value pair along the path forms a conjunction; the leaf node holds the class prediction. Rules are easier for people to understand. Example: IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”

Avoiding Overfitting. Overfitting: the induced tree may overfit the training data: too many branches, some of which may reflect anomalies due to noise or outliers; accuracy on unseen samples is poor. Two approaches: Prepruning: halt tree construction early — do not split a node if this would cause the goodness measure to fall below a threshold (choosing an appropriate threshold is difficult). Postpruning: remove branches from a “fully grown” tree — this yields a sequence of progressively pruned trees; use a set of data different from the training data to decide which pruned tree is best.

Methods for Determining the Final Tree Size. Split the data into training (2/3) and testing (1/3) sets — used for data sets with a large number of samples. Use cross validation, e.g., 10-fold cross validation: divide the data set into k subsamples, use k−1 subsamples as training data and one subsample as test data (k-fold cross-validation) — for data sets of moderate size (a sketch of the k-fold splitting follows). Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node improves the overall distribution. Use the minimum description length (MDL) principle: stop growing the tree when the encoding is minimized.
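A hedged sketch of the k-fold splitting idea mentioned above, in plain Python (the function name and the choice of shuffling are illustrative assumptions):

import random

def k_fold_indices(n, k=10, seed=0):
    # Shuffle the n sample indices and split them into k roughly equal folds;
    # each fold serves once as the test set, the remaining k-1 as training data.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train_idx, test_idx in k_fold_indices(20, k=5):
    print(len(train_idx), len(test_idx))   # 16 4 on every round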

Information Gain example
YES: 9, NO: 5, Total = 14. P(YES) = 9/14 = 0.64, P(NO) = 5/14 = 0.36.
I(Y,N) = -( 9/14 * log2(9/14) + 5/14 * log2(5/14) ) = 0.94
Age (<=30): YES 2, NO 3; I(Y,N) = -( 2/5 * log2(2/5) + 3/5 * log2(3/5) ) = 0.97
Age (31..40): YES 4, NO 0; I(Y,N) = -( 4/4 * log2(4/4) ) = 0
Age (>40): YES 3, NO 2; I(Y,N) = -( 3/5 * log2(3/5) + 2/5 * log2(2/5) ) = 0.97
E(age) = 5/14 * I(Y,N)age(<=30) + 4/14 * I(Y,N)age(31..40) + 5/14 * I(Y,N)age(>40) = 0.694
Gain = 0.94 – 0.694 = 0.246
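The arithmetic above can be checked with a short script; the class counts come directly from the buys_computer training table:

import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

i_root = info([9, 5])   # I(9, 5) = 0.940
# age partitions: (<=30) 2 yes / 3 no, (31..40) 4 / 0, (>40) 3 / 2
e_age = 5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2])
print(round(i_root, 3), round(e_age, 3), round(i_root - e_age, 3))
# 0.94 0.694 0.247  (the slide's 0.246 subtracts the already-rounded values)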

Example: Error Rate. (Figure: a test set of 84 tuples is passed down the tree; the root splits it 2/3–1/3 into 56 and 28 tuples, which are split further into leaves with error rates E1–E5.) For each leaf, Ei = (number of misclassified tuples at the leaf) / (number of tuples reaching the leaf), e.g., E1 = 2/(6+2). The tree's overall error rate is the weighted sum: Etree = 2/3*(1/7*E1 + 3/7*E2 + 3/7*E3) + 1/3*(3/4*E4 + 1/4*E5)

Example of overfitting. Root node: “Yes”: 11, “No”: 9, predict Yes; Entropy = -(11/20*log2(11/20) + 9/20*log2(9/20)) = 0.993. After splitting: one leaf has “Yes”: 3, “No”: 0 (predict Yes); the other has “Yes”: 8, “No”: 9 (predict No). Average entropy = -17/20*(8/17*log2(8/17) + 9/17*log2(9/17)) = 0.848

Enhancements to the Basic Decision Tree. Allow continuous-valued attributes: dynamically define new discrete attributes that partition the continuous attribute values into discrete intervals. Handle missing attribute values: assign the most common value, or assign a probability to each possible value. Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces (simplifies) fragmentation, repetition, and replication.

DT Issues Choosing Splitting Attributes Ordering of Splitting Attributes Splits Tree Structure Stopping Criteria Training Data Pruning

ID3 Creates the tree using information theory concepts and tries to reduce the expected number of comparisons. ID3 chooses the split attribute with the highest information gain:

ID3 Example (Output1) Starting state entropy: 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384 Gain using gender: Female: 3/9 log(9/3)+6/9 log(9/6)=0.2764 Male: 1/6 (log 6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392 Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152 Gain: 0.4384 – 0.34152 = 0.09688 Gain using height: 0.4384 – (2/15)(0.301) = 0.3983 Choose height as first splitting attribute

CART, C4.5, CHAID. Classification And Regression Tree (CART): maximize diversity(before split) − diversity(after split); diversity is low for a node that contains only a single class, and is measured, e.g., by entropy/information (compute I(2, 3, 9)?). C4.5: post-pruning. CHAID: the tree is “pruned” by halting its construction early (pre-pruning); CHAID is restricted to categorical variables, so continuous variables must be broken into ranges or replaced with classes such as high, medium, low.

C4.5 ID3 favors attributes with a large number of divisions. C4.5 is an improved version of ID3 that adds handling of missing data, continuous data, pruning, and rules, and chooses attributes by the gain ratio (reconstructed below) instead of raw information gain:
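The gain ratio C4.5 uses, written out (standard definition, reconstructed since the slide's formula image is missing):

\mathrm{SplitInfo}_A(S) = -\sum_{j=1}^{v} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|},
\qquad
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(S)}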

CART Creates a binary tree. Uses an entropy-based formula to choose the split point, s, for node t. PL and PR are the probabilities that a tuple in the training set will be on the left or right side of the tree.
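CART implementations very often score binary splits with the Gini index rather than entropy; a generic sketch of that measure (not the exact Phi(s/t) formula these slides use, and the counts below are only loosely based on the gender split in the height example):

def gini(counts):
    # Gini impurity of a node, given per-class counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    # Weighted impurity of a binary split; lower is better.
    n_l, n_r = sum(left_counts), sum(right_counts)
    n = n_l + n_r
    return n_l / n * gini(left_counts) + n_r / n * gini(right_counts)

# 15 tuples split 9 / 6, with three classes on each side.
print(round(gini_split([3, 6, 0], [1, 2, 3]), 3))   # 0.511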

CART Example At the start, there are six choices for split point (right branch on equality): P(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.224 P(1.6) = 0 P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256 P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32 Split at 1.8

Classification in Large Databases Classification—a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods

Scalable Decision Tree Induction in Data Mining SLIQ (EDBT’96 — Mehta et al.) builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT (VLDB’96 — J. Shafer et al.) constructs an attribute list data structure PUBLIC (VLDB’98 — Rastogi & Shim) integrates tree splitting and tree pruning: stop growing the tree earlier RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree builds an AVC-list (attribute, value, class label)

Presentation of Classification Results

Decision Tree in SGI/MineSet 3.0

Tool: Decision Tree C4.5, the “classic” decision-tree tool, developed by J. R. Quinlan http://www.cse.unsw.edu.tw/~quinlan Classification Tree in Excel EC4.5, a more efficient version of C4.5 http://www.-kdd.di.unipi.it/software IND, provides Gini and C4.5 decision trees http://ic.arc.nasa.gov/projects/bayes-group/ind/IND-program.html

Regression Assume data fits a predefined function Determine best values for regression coefficients c0,c1,…,cn. Assume an error: y = c0+c1x1+…+cnxn+e Estimate error using mean squared error for training set:
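A minimal least-squares sketch: fit y = c0 + c1*x1 + … + cn*xn and report the mean squared error on the training set (NumPy assumed; the toy data are invented):

import numpy as np

# Toy data: one predictor, y roughly 2 + 3x plus noise.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Add the intercept column and solve for [c0, c1] by least squares.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((A @ coef - y) ** 2)   # mean squared error on the training set
print(coef, mse)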

Linear Regression Poor Fit

Classification Using Regression Division: Use regression function to divide area into regions. Prediction: Use regression function to predict a class membership function. Input includes desired class.

Division

Prediction

Classification Using Distance Place items in class to which they are “closest”. Must determine distance between an item and a class. Classes represented by Centroid: Central value. Medoid: Representative point. Individual points Algorithm: KNN

K Nearest Neighbor (KNN): The training set includes the class labels. Examine the K items nearest to the item to be classified. The new item is placed in the class that is most common among those K closest items. Cost is O(q) for each tuple to be classified (where q is the size of the training set).
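A short sketch of KNN as described above: Euclidean distance, majority vote among the K closest training items (the tiny height data set is invented):

import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, class_label) pairs.
    neighbors = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)     # majority vote
    return votes.most_common(1)[0][0]

train = [((1.5,), 'short'), ((1.6,), 'short'), ((1.8,), 'medium'),
         ((1.9,), 'tall'), ((2.0,), 'tall')]
print(knn_classify(train, (1.85,), k=3))   # 'tall'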

KNN

KNN Algorithm

Classification Using Neural Networks Typical NN structure for classification: One output node per class Output value is class membership function value Supervised learning For each tuple in training set, propagate it through NN. Adjust weights on edges to improve future classification. Algorithms: Propagation, Backpropagation, Gradient Descent

NN Issues Number of source nodes Number of hidden layers Training data Number of sinks Interconnections Weights Activation Functions Learning Technique When to stop learning

Decision Tree vs. Neural Network

Propagation Tuple Input

NN Propagation Algorithm

Example Propagation

NN Learning Adjust weights to perform better with the associated test data. Supervised: Use feedback from knowledge of correct classification. Unsupervised: No knowledge of correct classification needed.

NN Supervised Learning

Supervised Learning Possible error values assuming output from node i is yi but should be di: Change weights on arcs based on estimated error

NN Backpropagation Propagate changes to the weights backward from the output layer to the input layer. Delta Rule: Δwij = c · xij · (dj – yj). Gradient Descent: a technique for modifying the weights in the graph.
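A sketch of one delta-rule update, Δw_ij = c · x_ij · (d_j − y_j), for a single linear output node; the learning rate c and the toy numbers are illustrative assumptions:

def delta_rule_step(weights, x, d, c=0.1):
    # One weight update toward the desired output d.
    y = sum(w * xi for w, xi in zip(weights, x))          # current output y_j
    return [w + c * xi * (d - y) for w, xi in zip(weights, x)]

w = [0.2, -0.4]
for _ in range(20):                                       # repeated updates drive y toward d
    w = delta_rule_step(w, x=[1.0, 0.5], d=1.0)
print(w)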

Backpropagation

Backpropagation Algorithm

Gradient Descent

Gradient Descent Algorithm

Output Layer Learning

Hidden Layer Learning

Types of NNs Different NN structures used for different problems. Perceptron Self Organizing Feature Map Radial Basis Function Network

Perceptron Perceptron is one of the simplest NNs. No hidden layers.

Perceptron Example Suppose: Summation: S=3x1+2x2-6 Activation: if S>0 then 1 else 0
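The same perceptron written directly in code:

def perceptron(x1, x2):
    s = 3 * x1 + 2 * x2 - 6          # summation S = 3x1 + 2x2 - 6
    return 1 if s > 0 else 0         # step activation

print(perceptron(1, 1))   # S = -1 -> 0
print(perceptron(2, 1))   # S =  2 -> 1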

Self Organizing Feature Map (SOFM) / SOM. Competitive, unsupervised learning. Observe how neurons work in the brain: firing impacts the firing of nearby neurons; neurons far apart inhibit each other; neurons have specific, nonoverlapping tasks. Ex: Kohonen Network

Kohonen Network

Kohonen Network Competitive Layer – viewed as 2D grid Similarity between competitive nodes and input nodes: Input: X = <x1, …, xh> Weights: <w1i, … , whi> Similarity defined based on dot product Competitive node most similar to input “wins” Winning node weights (as well as surrounding node weights) increased.

Radial Basis Function Network The RBF has a Gaussian shape. RBF networks have three layers: an input layer; a hidden layer with Gaussian activation functions; and an output layer with a linear activation function.

Radial Basis Function Network

Classification Using Rules Perform classification using If-Then rules. Classification Rule: r = <a,c> (antecedent, consequent). Rules may be generated from other techniques (DT, NN) or generated directly. Algorithms: Gen, RX, 1R, PRISM
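A hedged sketch of the 1R idea listed above: for each attribute, make one rule per attribute value (predict the majority class for that value) and keep the attribute whose rules make the fewest errors; the toy weather-style data are invented:

from collections import Counter, defaultdict

def one_r(rows, class_idx):
    # rows: list of tuples; every column except class_idx is a candidate attribute.
    best = None
    for a in range(len(rows[0])):
        if a == class_idx:
            continue
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[a]][row[class_idx]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best   # (total errors, attribute index, value -> predicted class)

data = [('sunny', 'hot', 'no'), ('sunny', 'mild', 'no'),
        ('rain', 'mild', 'yes'), ('rain', 'cool', 'yes'), ('overcast', 'hot', 'yes')]
print(one_r(data, class_idx=2))   # attribute 0 wins with zero errors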

Generating Rules from DTs

Generating Rules Example

Generating Rules from NNs

1R Algorithm

1R Example

PRISM Algorithm

PRISM Example

Decision Tree vs. Rules The tree has an implied order in which splitting is performed, and it is created by looking at all classes. Rules have no ordering of predicates, and only one class needs to be considered to generate its rules. What are the advantages and disadvantages of the decision tree approach?