Classifiers in ATLaS (CS240B Class Notes, UCLA)

Data Mining
- Classifiers:
  - Bayesian classifiers
  - Decision trees
- The Apriori Algorithm
- DBSCAN Clustering

The Classification Task
- Input: a training set of tuples, each labelled with one class label
- Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples, for which the class label is missing or unknown
- Some natural applications:
  - credit approval
  - medical diagnosis
  - treatment-effectiveness analysis

Train & Test zThe tuples (observations, samples) are partitioned in training set + test set. zClassification is performed in two steps: 1.training - build the model from training set 2.Testing (for accuracy, etc.)

Classical example: play tennis?
Training set from Quinlan's book (this is the standard 14-tuple dataset; class p/n means play / don't play). A sequence could have been used to generate the RID column:

RID  outlook   temperature  humidity  windy  class
1    sunny     hot          high      false  n
2    sunny     hot          high      true   n
3    overcast  hot          high      false  p
4    rain      mild         high      false  p
5    rain      cool         normal    false  p
6    rain      cool         normal    true   n
7    overcast  cool         normal    true   p
8    sunny     mild         high      false  n
9    sunny     cool         normal    false  p
10   rain      mild         normal    false  p
11   sunny     mild         normal    true   p
12   overcast  mild         high      true   p
13   overcast  hot          normal    false  p
14   rain      mild         high      true   n

Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
- P(C|X) = probability that the sample tuple X = (x1, ..., xk) is of class C
- E.g., P(class = n | outlook = sunny, windy = true, ...)
- Idea: assign to sample X the class label C such that P(C|X) is maximal

Estimating a-posteriori probabilities
- Bayes' theorem: P(C|X) = P(X|C) · P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- Thus the C that maximizes P(C|X) is the C that maximizes P(X|C) · P(C)

Naïve Bayesian Classification
- Naïve assumption: attribute independence, i.e. P(x1, ..., xk | C) = P(x1|C) · ... · P(xk|C)
- For categorical attributes, P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- Computationally, this is a count with grouping (see the SQL sketch below)
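As a concrete illustration of the count with grouping, a minimal SQL sketch for a single attribute of the play-tennis table introduced above:

```sql
-- Per-class counts for one attribute; dividing cnt by the class size
-- yields the estimate of P(outlook = value | class).
SELECT outlook, class, COUNT(*) AS cnt
FROM PlayTennis
GROUP BY outlook, class;
```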

Play-tennis example: estimating P(xi|C)

outlook:
  P(sunny|p)    = 2/9    P(sunny|n)    = 3/5
  P(overcast|p) = 4/9    P(overcast|n) = 0
  P(rain|p)     = 3/9    P(rain|n)     = 2/5
temperature:
  P(hot|p)  = 2/9    P(hot|n)  = 2/5
  P(mild|p) = 4/9    P(mild|n) = 2/5
  P(cool|p) = 3/9    P(cool|n) = 1/5
humidity:
  P(high|p)   = 3/9    P(high|n)   = 4/5
  P(normal|p) = 6/9    P(normal|n) = 1/5
windy:
  P(true|p)  = 3/9    P(true|n)  = 3/5
  P(false|p) = 6/9    P(false|n) = 2/5

P(p) = 9/14, P(n) = 5/14
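Using these estimates, classifying the unseen tuple X = (outlook = sunny, temperature = cool, humidity = high, windy = true) works out as follows:

P(X|p) · P(p) = 2/9 · 3/9 · 3/9 · 3/9 · 9/14 ≈ 0.0053
P(X|n) · P(n) = 3/5 · 1/5 · 4/5 · 3/5 · 5/14 ≈ 0.0206

Since 0.0206 > 0.0053, X is assigned class n.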

Bayesian Classifiers
- The first operation is to verticalize the table.
- The training can then be done by SQL count and grouping sets (but that might require many passes through the data). The results are stored in a table called SUMMARY.
- The testing is a simple SQL query on SUMMARY (see the sketch below).
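These notes do not include the ATLaS code itself, so here is a hedged plain-SQL sketch of the whole pipeline; the table names (PlayTennis, Vertical, NewTuple) are assumptions, and all attribute columns are assumed to be stored as text:

```sql
-- 1. Verticalize: one (RID, attribute, value, class) row per tuple and attribute.
CREATE TABLE Vertical AS
SELECT RID, 'outlook' AS attr, outlook AS value, class FROM PlayTennis
UNION ALL SELECT RID, 'temperature', temperature, class FROM PlayTennis
UNION ALL SELECT RID, 'humidity', humidity, class FROM PlayTennis
UNION ALL SELECT RID, 'windy', windy, class FROM PlayTennis;

-- 2. Training: per-class counts for every (attribute, value) pair.
CREATE TABLE SUMMARY AS
SELECT attr, value, class, COUNT(*) AS cnt
FROM Vertical
GROUP BY attr, value, class;

-- 3. Testing: score a verticalized test tuple stored in NewTuple(attr, value).
--    EXP(SUM(LN(x))) computes the product of the per-attribute frequencies,
--    so score is proportional to P(C) * P(x1|C) * ... * P(xk|C).
--    Caveat: a zero count (e.g. overcast|n) drops out of the join instead of
--    zeroing the product, so real code must handle missing pairs.
SELECT s.class,
       EXP(SUM(LN(s.cnt * 1.0 / c.size))) * MAX(c.size) AS score
FROM SUMMARY s
JOIN NewTuple t ON s.attr = t.attr AND s.value = t.value
JOIN (SELECT class, COUNT(*) AS size FROM PlayTennis GROUP BY class) c
     ON s.class = c.class
GROUP BY s.class
ORDER BY score DESC;
```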

Decision tree obtained with ID3 (Quinlan 86)

outlook?
- sunny → humidity? (high → N, normal → P)
- overcast → P
- rain → windy? (strong → N, weak → P)
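The tree is just a nested conditional, so a hedged SQL rendering over the play-tennis table is possible (the table name PlayTennis is an assumption; the tree's weak/strong wind labels correspond to windy = false/true in the training set):

```sql
-- Classify each tuple by walking the ID3 tree above.
SELECT RID,
       CASE outlook
         WHEN 'overcast' THEN 'p'  -- overcast: always play
         WHEN 'sunny'    THEN CASE humidity WHEN 'high' THEN 'n' ELSE 'p' END
         WHEN 'rain'     THEN CASE windy    WHEN 'true' THEN 'n' ELSE 'p' END
       END AS predicted_class
FROM PlayTennis;
```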

Decision Tree Classifiers
- Computed in a recursive fashion
- There are various ways to split a node and to compute the splitting function (e.g., the gini index below)
- The first operation is to verticalize the table, as sketched above

Classical example (the play-tennis training set shown above)

Initial state: a node column is added to the training set from Quinlan's book (initially every tuple belongs to the root node)

First level: split on outlook (the outlook column will then be deleted)

Gini index
- E.g., two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements
- fp = p/(p+n), fn = n/(p+n)
- gini(S) = 1 - fp^2 - fn^2
- If dataset S is split into S1, S2, S3, then
  gini_split(S1, S2, S3) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n) + gini(S3)·(p3+n3)/(p+n)
- These computations can be easily expressed in ATLaS (a plain-SQL rendering is sketched below)
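Again, the ATLaS code is not reproduced in these notes; a plain-SQL sketch under assumed names shows that the computation is a pair of grouped aggregates. It assumes the verticalized, node-tagged training set lives in a table Node(node, attr, value, class), and that a candidate split on an attribute sends all tuples with the same value to the same child:

```sql
-- gini of each candidate child (one child per (attr, value) pair within a node),
-- weighted by the child's share of the node, then summed into gini_split per attribute.
SELECT node, attr,
       SUM(child_gini * child_cnt / node_cnt) AS gini_split
FROM (
  SELECT node, attr, value,
         1.0 - SUM((cnt / grp) * (cnt / grp)) AS child_gini,
         MAX(grp) AS child_cnt,   -- size of this child (constant per group)
         MAX(tot) AS node_cnt     -- size of the whole node (constant per group)
  FROM (
    SELECT node, attr, value, class,
           COUNT(*) * 1.0 AS cnt,
           SUM(COUNT(*)) OVER (PARTITION BY node, attr, value) * 1.0 AS grp,
           SUM(COUNT(*)) OVER (PARTITION BY node, attr) * 1.0 AS tot
    FROM Node
    GROUP BY node, attr, value, class
  ) AS counted
  GROUP BY node, attr, value
) AS per_child
GROUP BY node, attr
ORDER BY node, gini_split;
```

The attribute with the smallest gini_split is then chosen to split the node.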

Programming in ATLaS
- Table-based programming is powerful and natural for data-intensive applications
- SQL can be awkward, and many extensions are possible
- But even SQL 'as is' is adequate

The ATLaS System
- The system compiles ATLaS programs into C programs, which execute on the Berkeley DB record manager
- The 100-line Apriori program compiles into 2,800 lines of C
- Other data structures (R-trees, in-memory tables) have been added using the same API
- The system is now 54,000 lines of C++ code

ATLaS: Conclusions
- A native extensibility mechanism for SQL, and a simple one; more efficient than Java or PL/SQL
- Effective for data mining applications
- Also for OLAP applications, recursive queries, and temporal database applications
- Complements current mechanisms based on UDFs and Data Blades
- Supports and favors streaming aggregates (SQL's implicit default is blocking)
- A good basis for determining program properties, e.g. (non)monotonic and blocking behavior
- These are lessons that future query languages cannot easily ignore

The Future
- Continuous queries on data streams
- Other extensions and improvements
- Stay tuned