Machine Learning
Sudeshna Sarkar, IIT Kharagpur
Oct 17, 2006

Learning methodologies
- Learning from labelled data (supervised learning), e.g. classification, regression, prediction, function approximation
- Learning from unlabelled data (unsupervised learning), e.g. clustering, visualization, dimensionality reduction
- Learning from sequential data, e.g. speech recognition, DNA data analysis
- Associations
- Reinforcement learning

Unsupervised Learning
- Clustering: grouping similar instances
- Example applications:
  - Clustering items based on similarity
  - Clustering users based on interests
  - Clustering words based on similarity of usage

Reinforcement Learning
- Learning a policy: a sequence of outputs
- No supervised output, but delayed reward
- Credit assignment problem
- Game playing
- Robot in a maze
- Multiple agents, partial observability

Inductive Learning Methods
- Find Similar
- Decision Trees
- Naïve Bayes
- Bayes Nets
- Support Vector Machines (SVMs)
- All support:
  - "Probabilities" - graded membership; comparability across categories
  - Adaptive - over time; across individuals

Find Similar
- Aka relevance feedback (Rocchio)
- Classifier parameters are a weighted combination of the weights in positive and negative examples -- a "centroid"
- New items are classified by their similarity to the class centroid
- Uses all features, with idf weights
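The centroid idea above can be sketched in a few lines; the beta/gamma weights, the cosine similarity, and the zero threshold below are illustrative assumptions rather than the slide's exact formulation.

```python
# Minimal Rocchio-style "find similar" sketch (NumPy; weights and
# threshold are illustrative assumptions).
import numpy as np

def rocchio_centroid(pos, neg, beta=1.0, gamma=0.5):
    """pos, neg: arrays of tf-idf vectors for positive / negative examples."""
    return beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)

def classify(centroid, x, threshold=0.0):
    # cosine similarity of the new item to the class centroid
    sim = x @ centroid / (np.linalg.norm(x) * np.linalg.norm(centroid) + 1e-12)
    return sim > threshold
```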

Decision Trees
- Learn a sequence of tests on features, typically using top-down, greedy search
- Binary (yes/no) or continuous decisions
[Figure: a small tree that splits on f1 and then f7, with leaf probabilities P(class) = 0.6, 0.9 and 0.2]

Naïve Bayes
- Aka the binary independence model
- Maximize: Pr(Class | Features)
- Assume features are conditionally independent - math is easy; surprisingly effective
[Figure: class node C with independent feature nodes x1, x2, x3, …, xn]

Bayes Nets
- Maximize: Pr(Class | Features)
- Does not assume independence of features - dependency modeling
[Figure: class node C with inter-dependent feature nodes x1, x2, x3, …, xn]

Support Vector Machines
- Vapnik (1979)
- Binary classifiers that maximize the margin:
  - Find the hyperplane separating positive and negative examples
  - Optimize for maximum margin, i.e. minimize ||w|| subject to y_i(w·x_i + b) ≥ 1
  - Classify new items by the sign of w·x + b, which is determined by the support vectors
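As a concrete illustration of margin maximization and support vectors, here is a minimal sketch using scikit-learn's SVC; the library choice and the toy data are assumptions, since the slides do not prescribe an implementation.

```python
# Minimal linear SVM sketch (scikit-learn is an assumed dependency).
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable classes.
X = np.array([[0.0, 0.0], [0.5, 0.4], [2.0, 2.1], [2.5, 1.9]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)   # points that define the margin
print("prediction:", clf.predict([[1.8, 2.0]]))   # sign of w.x + b
```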

Support Vector Machines
- Extendable to:
  - Non-separable problems (Cortes & Vapnik, 1995)
  - Non-linear classifiers (Boser et al., 1992)
- Good generalization performance:
  - OCR (Boser et al.)
  - Vision (Poggio et al.)
  - Text classification (Joachims)

Cross-Validation
- Estimate the accuracy of a hypothesis induced by a supervised learning algorithm
- Predict the accuracy of a hypothesis over future unseen instances
- Select the optimal hypothesis from a given set of alternative hypotheses:
  - Pruning decision trees
  - Model selection
  - Feature selection
- Combining multiple classifiers (boosting)

Holdout Method
- Partition the data set D = {(v_1, y_1), …, (v_n, y_n)} into a training set D_t and a validation set D_h = D \ D_t
- acc_h = (1/h) · Σ_{(v_i, y_i) ∈ D_h} δ(I(D_t, v_i), y_i), where h = |D_h|
  - I(D_t, v_i): output of the hypothesis induced by learner I trained on data D_t for instance v_i
  - δ(i, j) = 1 if i = j and 0 otherwise
- Problems:
  - makes insufficient use of the data
  - training and validation set are correlated
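A minimal sketch of the holdout estimate acc_h, assuming a generic `induce` function that maps a training list to a classifier (all names and the 70/30 split are illustrative):

```python
# Minimal holdout-accuracy sketch (pure Python).
import random

def holdout_accuracy(data, induce, frac=0.7, seed=0):
    """data: list of (v, y) pairs; induce: training list -> classifier h(v)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(frac * len(shuffled))
    train, valid = shuffled[:cut], shuffled[cut:]
    h = induce(train)                       # hypothesis I(D_t, .)
    # acc_h: fraction of validation instances with delta(I(D_t, v_i), y_i) = 1
    return sum(1 for v, y in valid if h(v) == y) / len(valid)
```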

Cross-Validation
- k-fold cross-validation splits the data set D into k mutually exclusive subsets D_1, D_2, …, D_k
- Train and test the learning algorithm k times; each time it is trained on D \ D_i and tested on D_i
- acc_cv = (1/n) · Σ_{(v_i, y_i) ∈ D} δ(I(D \ D_i, v_i), y_i)
[Figure: the data set shown as folds D_1 … D_4, with a different fold held out for testing in each round]
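A corresponding sketch of the k-fold estimate acc_cv, under the same assumptions (illustrative names, generic `induce` helper):

```python
# Minimal k-fold cross-validation sketch (pure Python).
def kfold_accuracy(data, induce, k=5):
    """data: list of (v, y) pairs; induce: training list -> classifier h(v)."""
    folds = [data[i::k] for i in range(k)]                   # k disjoint subsets
    correct = 0
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = induce(train)                                    # train on D \ D_i
        correct += sum(1 for v, y in folds[i] if h(v) == y)  # test on D_i
    return correct / len(data)                               # acc_cv
```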

Cross-Validation
- Uses all the data for training and testing
- Complete k-fold cross-validation splits the data set of size m in all (m choose m/k) possible ways of choosing m/k instances out of m
- Leave-n-out cross-validation sets n instances aside for testing and uses the remaining ones for training (leave-one-out is equivalent to m-fold cross-validation, where m is the size of the data set)
  - Leave-one-out is widely used
- In stratified cross-validation, the folds are stratified so that they contain approximately the same proportion of labels as the original data set

Bootstrap
- Samples n instances uniformly from the data set with replacement
- The probability that any given instance is not chosen after n samples is (1 - 1/n)^n ≈ e^{-1} ≈ 0.368
- The bootstrap sample is used for training; the remaining instances are used for testing
- acc_boot = (1/b) · Σ_{i=1}^{b} (0.632 · ε_{0i} + 0.368 · acc_s), where ε_{0i} is the accuracy on the test data of the i-th bootstrap sample, acc_s is the accuracy estimate on the training set, and b is the number of bootstrap samples
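A sketch of the .632 bootstrap estimate, again assuming a generic `induce` helper and illustrative names:

```python
# Minimal .632 bootstrap sketch (pure Python).
import random

def bootstrap632_accuracy(data, induce, b=50, seed=0):
    """data: list of (v, y) pairs; induce: training list -> classifier h(v)."""
    rng = random.Random(seed)
    n, estimates = len(data), []
    for _ in range(b):
        sample = [data[rng.randrange(n)] for _ in range(n)]   # sample with replacement
        held_out = [ex for ex in data if ex not in sample]    # instances never chosen
        h = induce(sample)
        acc_s = sum(1 for v, y in sample if h(v) == y) / len(sample)
        eps_0i = (sum(1 for v, y in held_out if h(v) == y) / len(held_out)
                  if held_out else acc_s)
        estimates.append(0.632 * eps_0i + 0.368 * acc_s)
    return sum(estimates) / b                                 # acc_boot
```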

Wrapper Model
[Figure: the input features feed a feature-subset search; each candidate subset is passed to a feature-subset evaluation step that runs the induction algorithm, and the evaluation result guides the search]

Wrapper Model
- Evaluate the accuracy of the inducer for a given subset of features by means of n-fold cross-validation
- The training data is split into n folds and the induction algorithm is run n times; the accuracy results are averaged to produce the estimated accuracy
- Forward selection: starts with the empty set of features and greedily adds the feature that improves the estimated accuracy the most
- Backward elimination: starts with the set of all features and greedily removes the worst feature
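A sketch of greedy forward selection in the wrapper setting; it reuses the `kfold_accuracy` sketch above, and `induce_on_features` is an assumed helper that trains a classifier restricted to a given feature subset:

```python
# Minimal greedy forward-selection sketch for the wrapper model.
def forward_selection(data, induce_on_features, all_features, k=5):
    selected, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        for f in (f for f in all_features if f not in selected):
            candidate = selected + [f]
            induce = lambda train, c=candidate: induce_on_features(train, c)
            acc = kfold_accuracy(data, induce, k)      # wrapper evaluation
            if acc > best_acc:                         # keep the best addition so far
                best_acc, best_feature, improved = acc, f, True
        if improved:
            selected.append(best_feature)
    return selected, best_acc
```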

Bagging
- For each trial t = 1, 2, …, T create a bootstrap sample of size N
- Generate a classifier C_t from the bootstrap sample
- The final classifier C* outputs the class that receives the majority of votes among the C_t
[Figure: T bootstrap training sets each train a classifier C_1 … C_T; a new instance is classified by taking the majority vote of their yes/no outputs as C*]
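A sketch of bagging by majority vote, assuming a generic `induce` helper that maps a bootstrap sample to a classifier (illustrative names):

```python
# Minimal bagging sketch (pure Python).
import random
from collections import Counter

def bagging(data, induce, T=25, seed=0):
    rng = random.Random(seed)
    n = len(data)
    classifiers = []
    for _ in range(T):
        sample = [data[rng.randrange(n)] for _ in range(n)]   # bootstrap of size N
        classifiers.append(induce(sample))                    # classifier C_t
    def ensemble(x):                                          # the final classifier C*
        votes = Counter(h(x) for h in classifiers)
        return votes.most_common(1)[0][0]                     # majority vote
    return ensemble
```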

Bagging
- Bagging requires "unstable" classifiers, for example decision trees or neural networks
- "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy." (Breiman, 1996)

Naïve Bayes Learner
Assume a target function f: X → V, where each instance x is described by attributes ⟨a_1, a_2, …, a_n⟩.
The most probable value of f(x) is:
  v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, …, a_n)
        = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) · P(v_j)
Naïve Bayes assumption (attributes are conditionally independent given the target value):
  P(a_1, a_2, …, a_n | v_j) = Π_i P(a_i | v_j)
giving the naïve Bayes classifier:
  v_NB = argmax_{v_j ∈ V} P(v_j) · Π_i P(a_i | v_j)

Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
  P(C | X) = probability that the sample tuple X = ⟨x_1, …, x_k⟩ is of class C
- E.g. P(class = N | outlook = sunny, windy = true, …)
- Idea: assign to sample X the class label C such that P(C | X) is maximal

Estimating a-posteriori probabilities
- Bayes theorem: P(C | X) = P(X | C) · P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- The class C that maximizes P(C | X) is the one that maximizes P(X | C) · P(C)
- Problem: computing P(X | C) directly is infeasible!

Naïve Bayesian Classification
- Naïve assumption: attribute independence
  P(x_1, …, x_k | C) = P(x_1 | C) · … · P(x_k | C)
- If the i-th attribute is categorical: P(x_i | C) is estimated as the relative frequency of samples having value x_i for the i-th attribute within class C
- If the i-th attribute is continuous: P(x_i | C) is estimated through a Gaussian density function
- Computationally easy in both cases
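A sketch of a naïve Bayes learner handling both cases, with relative frequencies for categorical attributes and a Gaussian density for continuous ones (pure Python, illustrative names; the 1e-9 floors are a pragmatic guard, not part of the slide):

```python
import math
from collections import Counter, defaultdict

def train_nb(data, continuous):
    """data: list of (attrs, label); continuous: indices of Gaussian attributes."""
    class_counts = Counter(c for _, c in data)
    prior = {c: n / len(data) for c, n in class_counts.items()}
    cat = defaultdict(Counter)     # (attribute index, class) -> value counts
    num = defaultdict(list)        # (attribute index, class) -> numeric values
    for attrs, c in data:
        for i, v in enumerate(attrs):
            if i in continuous:
                num[(i, c)].append(v)
            else:
                cat[(i, c)][v] += 1
    gauss = {}
    for (i, c), vals in num.items():
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / max(len(vals) - 1, 1)
        gauss[(i, c)] = (mu, max(var, 1e-9))
    return prior, cat, gauss, class_counts, continuous

def classify_nb(model, attrs):
    prior, cat, gauss, class_counts, continuous = model
    best, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])              # log P(C)
        for i, v in enumerate(attrs):
            if i in continuous:                 # Gaussian density for P(x_i | C)
                mu, var = gauss[(i, c)]
                score += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            else:                               # relative frequency for P(x_i | C)
                p = cat[(i, c)][v] / class_counts[c]
                score += math.log(p if p > 0 else 1e-9)   # guard unseen values
        if score > best_score:
            best, best_score = c, score
    return best
```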

NB Classifier Example
EnjoySport example: estimating P(x_i | C)

  Outlook:      P(sunny | P)    = 2/9   P(sunny | N)    = 3/5
                P(overcast | P) = 4/9   P(overcast | N) = 0
                P(rain | P)     = 3/9   P(rain | N)     = 2/5
  Temperature:  P(hot | P)      = 2/9   P(hot | N)      = 2/5
                P(mild | P)     = 4/9   P(mild | N)     = 2/5
                P(cool | P)     = 3/9   P(cool | N)     = 1/5
  Humidity:     P(high | P)     = 3/9   P(high | N)     = 4/5
                P(normal | P)   = 6/9   P(normal | N)   = 2/5
  Windy:        P(true | P)     = 3/9   P(true | N)     = 3/5
                P(false | P)    = 6/9   P(false | N)    = 2/5

  Class priors: P(P) = 9/14, P(N) = 5/14

NB Classifier Example (cont'd)
Given a training set, we can compute the probabilities.

NB Classifier Example (cont'd)
Predict whether sport is enjoyed on a day with outlook = sunny, temperature = cool, humidity = high, windy = true. Using the training data we have:
  P(P) · P(sunny | P) · P(cool | P) · P(high | P) · P(true | P) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.0053
  P(N) · P(sunny | N) · P(cool | N) · P(high | N) · P(true | N) = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.0206
So the naïve Bayes classifier predicts class N (sport is not enjoyed).
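A small check of the worked example, using the fractions taken from the slide's table:

```python
# Verify the worked example above with exact fractions.
from fractions import Fraction as F

p_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)   # P(P)·P(sunny|P)·P(cool|P)·P(high|P)·P(true|P)
p_no  = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)   # P(N)·P(sunny|N)·P(cool|N)·P(high|N)·P(true|N)

print(float(p_yes), float(p_no))                  # ~0.0053 vs ~0.0206
print("prediction:", "P" if p_yes > p_no else "N")
```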

The independence hypothesis…
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated
- Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first

The Naïve Bayes Algorithm

Naïve_Bayes_Learn(examples):
  for each target value v_j:
    estimate P(v_j)
    for each value a_i of each attribute a:
      estimate P(a_i | v_j)

Classify_New_Instance(x):
  v_NB = argmax_{v_j ∈ V} P(v_j) · Π_{a_i ∈ x} P(a_i | v_j)

Typical estimate of P(a_i | v_j):
  P(a_i | v_j) = (n_c + m·p) / (n + m)
where n is the number of training examples with v = v_j, n_c is the number of those examples with a = a_i, p is a prior estimate for P(a_i | v_j), and m is the weight given to the prior.
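A tiny sketch of the m-estimate above; the uniform prior p = 1/3 and weight m = 3 in the example call are illustrative choices, not values from the slide:

```python
# m-estimate of P(a_i | v_j), matching the variables described above.
def m_estimate(n_c, n, m, p):
    """n_c: examples of class v_j with value a_i; n: examples of class v_j;
    m: weight given to the prior; p: prior estimate of P(a_i | v_j)."""
    return (n_c + m * p) / (n + m)

# e.g. P(outlook = overcast | N): raw estimate 0/5, smoothed with m = 3, p = 1/3
print(m_estimate(0, 5, 3, 1 / 3))   # 0.125 instead of 0
```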

Bayesian Belief Networks
- The naïve Bayes assumption of conditional independence is too restrictive
- But the problem is intractable without some such assumptions
- A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data
- In a Bayesian net:
  - Node = variable
  - Arc = dependency
  - DAG, with the direction of an arc representing causality

Bayesian Networks: Multi-variables with Dependency
- A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data
- In a Bayesian net:
  - Node = variable; each variable has a finite set of mutually exclusive states
  - Arc = dependency
  - DAG, with the direction of an arc representing causality
  - A variable A with parents B_1, …, B_n has a conditional probability table P(A | B_1, …, B_n)

Bayesian Belief Networks
- Age, Occupation and Income determine whether a customer will buy this product
- Given that the customer buys the product, whether there is interest in insurance is independent of Age, Occupation and Income
- P(Age, Occ, Inc, Buy, Ins) = P(Age) · P(Occ) · P(Inc) · P(Buy | Age, Occ, Inc) · P(Ins | Buy)
- Current state of the art: given the structure and probabilities, existing algorithms can handle inference with categorical values and a limited representation of numerical values
[Figure: Age, Occ and Income are parents of "Buy X"; "Buy X" is the parent of "Interested in Insurance"]
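A sketch of computing the joint probability under exactly this factorization; all numeric CPT entries below are made-up placeholders, only the structure follows the slide:

```python
# Joint probability under the factorization P(Age)P(Occ)P(Inc)P(Buy|Age,Occ,Inc)P(Ins|Buy).
def joint(age, occ, inc, buy, ins, cpt):
    return (cpt["Age"][age] * cpt["Occ"][occ] * cpt["Inc"][inc]
            * cpt["Buy"][(age, occ, inc)][buy]
            * cpt["Ins"][buy][ins])

cpt = {
    "Age": {"old": 0.3, "young": 0.7},
    "Occ": {"eng": 0.4, "other": 0.6},
    "Inc": {"high": 0.2, "low": 0.8},
    "Buy": {("old", "eng", "high"): {True: 0.9, False: 0.1}},   # one CPT row shown
    "Ins": {True: {True: 0.6, False: 0.4}, False: {True: 0.1, False: 0.9}},
}
print(joint("old", "eng", "high", True, True, cpt))   # 0.3*0.4*0.2*0.9*0.6
```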

General Product Rule
For a Bayesian network over variables x_1, …, x_n, the joint distribution factors according to the network structure:
  P(x_1, …, x_n) = Π_{i=1}^{n} P(x_i | Parents(x_i))

Nodes as Functions
- A node in a Bayesian net is a conditional distribution function:
  - input: the parents' state values
  - output: a distribution over its own value
[Figure: node X with parents A and B; the table P(X | A, B) lists low/medium/high probabilities for X under each combination of a/~a and b/~b]

Special Case: Naïve Bayes
P(e_1, e_2, …, e_n, h) = P(h) · P(e_1 | h) · … · P(e_n | h)
[Figure: class node h with children e_1, e_2, …, e_n]

Inference in Bayesian Networks
[Figure: a network whose nodes include Age, Income, House Owner, Living Location, …, Voting Pattern and Newspaper Preference]
- How likely are elderly rich people to buy DallasNews?
  P(paper = DallasNews | Age > 60, Income > 60k)

Bayesian Learning
[Figure: the Burglary-Earthquake-Alarm-Call-Newscast network, alongside a table of data cases over B, E, A, C, N such as (~b, e, a, c, n) and (b, ~e, ~a, ~c, n)]
- Input: fully or partially observable data cases
- Output: the parameters AND also the structure
- Learning methods:
  - EM (Expectation Maximisation): use the current approximation of the parameters to estimate the filled-in data, then use the filled-in data to update the parameters (ML)
  - Gradient ascent training
  - Gibbs sampling (MCMC)