Machine Learning: Foundations Course Number 0368403401 Prof. Nathan Intrator TA: Daniel Gill, Guy Amit.

Course structure
– There will be 4 homework exercises, theoretical as well as programming
– All programming will be done in Matlab
– Course info accessed from the course web site
– The format of the final has not been decided yet
– Office hours: Wednesday 4-5 (contact via email)

Class Notes
– Groups of 2-3 students will be responsible for scribing class notes
– Class notes are to be submitted by next Monday (1 week); corrections and additions from Thursday to the following Monday
– 30% contribution to the grade

Class Notes (cont.)
– Notes will be written in LaTeX and compiled into PDF via MiKTeX (download from the School site)
– Style file can be found on the course web site
– Figures in GIF

Basic Machine Learning Idea
– Receive a collection of observations, each associated with some action label
– Perform some kind of "machine learning" to be able to:
  – Receive a new observation
  – "Process" it and generate an action label based on the previous observations
– Main requirement: good generalization

Learning Approaches
– Store observations in memory and retrieve
  – Simple, little generalization (distance measure?)
– Learn a set of rules and apply to new data
  – Sometimes difficult to find a good model
  – Good generalization
– Estimate a "flexible model" from the data
  – Generalization issues, data size issues

Storage & Retrieval
– Simple, computationally intensive
  – Little generalization
– How can retrieval be performed?
  – Requires a "distance measure" between stored observations and a new observation
– The distance measure can be given or "learned" (clustering)

Learning a Set of Rules
– How to create a "reliable" set of rules from the observed data
  – Tree structures
  – Graphical models
– Complexity of the set of rules vs. generalization

Estimation of a Flexible Model
– What is a "flexible" model?
  – Universal approximator
  – Reliability and generalization, data size issues

Applications
– Control
  – Robot arm
  – Driving and navigating a car
  – Medical applications: diagnosis, monitoring, drug release
– Web retrieval based on user profile
  – Customized ads: Amazon
  – Document retrieval: Google

Related Disciplines
– Machine learning sits at the intersection of: AI, probability & statistics, computational complexity theory, control theory, information theory, philosophy, psychology, neurophysiology, ethology, decision theory, game theory, optimization, biological evolution, statistical mechanics

Why Now?
– Technology ready: algorithms and theory
– Information abundant: flood of (online) data
– Computational power: sophisticated techniques
– Industry and consumer needs

Example 1: Credit Risk Analysis
– Typical customer: a bank
– Database: current clients' data, including
  – Basic profile (income, house ownership, delinquent accounts, etc.)
  – Basic classification
– Goal: predict/decide whether to grant credit

Example 1: Credit Risk Analysis
Rules learned from the data:
IF Other-Delinquent-Accounts > 2 and Number-Delinquent-Billing-Cycles > 1 THEN DENY CREDIT
IF Other-Delinquent-Accounts = 0 and Income > $30k THEN GRANT CREDIT
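
Such learned rules translate directly into code. A minimal sketch in Matlab (the course's stated programming language); the argument names and the fallback branch are illustrative assumptions, not part of the slide:

function decision = credit_decision(otherDelinq, delinqCycles, income)
% Apply the two learned credit rules from the slide (illustrative sketch).
    if otherDelinq > 2 && delinqCycles > 1
        decision = 'DENY CREDIT';
    elseif otherDelinq == 0 && income > 30000
        decision = 'GRANT CREDIT';
    else
        decision = 'NO RULE FIRES';   % assumption: neither rule covers this case
    end
end

% Example usage: credit_decision(0, 0, 45000) returns 'GRANT CREDIT'.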

Example 2: Clustering News
– Data: Reuters news / web data
– Goal: basic category classification
  – Business, sports, politics, etc.
  – Classify into subcategories (unspecified)
– Methodology:
  – Consider "typical words" for each category
  – Classify using a "distance" measure

Example 3: Robot Control
– Goal: control a robot in an unknown environment
– Needs both:
  – To explore (new places and actions)
  – To use acquired knowledge to gain benefits
– The learning task "controls" what it observes!

History of Machine Learning (cont'd)
– 1960's and 70's: Models of human learning
  – High-level symbolic descriptions of knowledge, e.g., logical expressions or graphs/networks, e.g., (Karpinski & Michalski, 1966), (Simon & Lea, 1974)
  – META-DENDRAL (Buchanan, 1978), for example, acquired task-specific expertise (for mass spectrometry) in the context of an expert system
  – Winston's (1975) structural learning system learned logic-based structural descriptions from examples
– 1970's: Genetic algorithms
  – Developed by Holland (1975)
– 1970's - present: Knowledge-intensive learning
  – A tabula rasa approach typically fares poorly. "To acquire new knowledge a system must already possess a great deal of initial knowledge." Lenat's CYC project is a good example.

History of Machine Learning (cont'd)
– 1970's - present: Alternative modes of learning (besides examples)
  – Learning from instruction, e.g., (Mostow, 1983), (Gordon & Subramanian, 1993)
  – Learning by analogy, e.g., (Veloso, 1990)
  – Learning from cases, e.g., (Aha, 1991)
  – Discovery (Lenat, 1977)
  – 1991: The first of a series of workshops on Multistrategy Learning (Michalski)
– 1970's - present: Meta-learning
  – Heuristics for focusing attention, e.g., (Gordon & Subramanian, 1996)
  – Active selection of examples for learning, e.g., (Angluin, 1987), (Gasarch & Smith, 1988), (Gordon, 1991)
  – Learning how to learn, e.g., (Schmidhuber, 1996)

History of Machine Learning (cont'd)
– 1980 – The First Machine Learning Workshop was held at Carnegie-Mellon University in Pittsburgh
– Three consecutive issues of the International Journal of Policy Analysis and Information Systems were specially devoted to machine learning
– A special issue of the SIGART Newsletter reviewed current projects in the field of machine learning
– The Second International Workshop on Machine Learning, in Monticello at the University of Illinois
– The establishment of the Machine Learning journal
– The beginning of annual international conferences on machine learning (ICML)
– The beginning of regular workshops on computational learning theory (COLT)
– 1990's – Explosive growth in the field of data mining, which involves the application of machine learning techniques

A Glimpse into the Future
– Today's status:
  – First-generation algorithms: neural nets, decision trees, etc.
  – Well-formed databases
– Future:
  – Many more problems: networking, control, software
  – Main advantage is flexibility!

Relevant Disciplines
– Artificial intelligence
– Statistics
– Computational learning theory
– Control theory
– Information theory
– Philosophy
– Psychology and neurobiology

Types of Models
– Supervised learning
  – Given access to classified (labeled) data
– Unsupervised learning
  – Given access to data, but no classification
– Control learning
  – Selects actions and observes consequences
  – Maximizes long-term cumulative return

Learning: Complete Information
– Probability distribution D1 over one class ("smiley") and D2 over the other class; the two classes are equally likely
– Compute the probability of "smiley" given a point (x, y): use Bayes' formula
– Let p denote this probability
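
As a concrete illustration, a minimal Matlab sketch of this Bayes computation, assuming (hypothetically) that D1 and D2 are two-dimensional Gaussians; the means, variances, and query point are made up for the example:

% p = Pr[smiley | (x,y)] when both classes are a priori equally likely.
% Hypothetical class-conditional densities: isotropic unit-variance Gaussians.
d1 = @(x, y) exp(-((x-1).^2 + (y-1).^2)/2) / (2*pi);   % density of D1 ("smiley")
d2 = @(x, y) exp(-((x+1).^2 + (y+1).^2)/2) / (2*pi);   % density of D2 (other class)
x = 0.5; y = 0.2;                                      % query point (assumed)
p = 0.5*d1(x,y) / (0.5*d1(x,y) + 0.5*d2(x,y));         % Bayes' formula
fprintf('Pr[smiley | (%.1f, %.1f)] = %.3f\n', x, y, p);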

Predictions and Loss Model: Boolean Error
– Predict a Boolean value
– On each error we lose 1 (no error, no loss)
– Compare the probability p to 1/2 and predict deterministically with the higher value
– Optimal prediction (for this loss)
– Cannot recover the probabilities!

Predictions and Loss Model: Quadratic Loss
– Predict a "real number" q for outcome 1
– Loss (q-p)^2 for outcome 1
– Loss ([1-q]-[1-p])^2 for outcome 0
– Expected loss: (p-q)^2
– Minimized for p = q (optimal prediction); recovers the probabilities
– Needs to know p to compute the loss!

Predictions and Loss Model: Logarithmic Loss
– Predict a "real number" q for outcome 1
– Loss log(1/q) for outcome 1
– Loss log(1/(1-q)) for outcome 0
– Expected loss: -p log q - (1-p) log(1-q)
– Minimized for p = q (optimal prediction); recovers the probabilities
– The loss does not depend on p!
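
A small numerical check (a sketch with an arbitrarily chosen p, not taken from the slides) that the expected logarithmic loss is indeed minimized at q = p:

% Expected logarithmic loss as a function of the prediction q, for a fixed p.
p = 0.3;                          % true probability (assumed for the example)
q = 0.01:0.01:0.99;               % candidate predictions
logLoss = -p.*log(q) - (1-p).*log(1-q);
[~, idx] = min(logLoss);
fprintf('expected log loss is minimized at q = %.2f (true p = %.2f)\n', q(idx), p);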

The Basic PAC Model
– Unknown target function f(x)
– Distribution D over the domain X
– Goal: find h(x) such that h(x) approximates f(x)
– Given H, find h ∈ H that minimizes Pr_D[h(x) ≠ f(x)]

Basic PAC Notions
– S - a sample of m examples (x, f(x)) drawn i.i.d. using D
– True error: ε(h) = Pr_D[h(x) ≠ f(x)]
– Observed error: ε'(h) = (1/m) |{x ∈ S | h(x) ≠ f(x)}|
– Basic question: how close is ε(h) to ε'(h)?
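
To make the question concrete, a small Matlab simulation (the hypothesis's true error of 0.2 and the sample size of 100 are made-up assumptions): it draws a sample and compares the observed error of a fixed hypothesis to its true error.

% Fixed hypothesis h that disagrees with f with true error 0.2 under D.
trueErr = 0.2;  m = 100;
mistakes = rand(m,1) < trueErr;    % indicator of h(x) ~= f(x) for each sampled x
obsErr = mean(mistakes);           % observed error eps'(h) on the sample S
fprintf('true error = %.2f, observed error on m = %d examples = %.2f\n', trueErr, m, obsErr);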

Bayesian Theory
– A prior distribution over H
– Given a sample S, compute a posterior distribution: Pr[h|S] ∝ Pr[S|h] Pr[h]
– Maximum Likelihood (ML): maximize Pr[S|h]
– Maximum A Posteriori (MAP): maximize Pr[h|S]
– Bayesian predictor: Σ_h h(x) Pr[h|S]
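
A minimal sketch of ML, MAP, and the Bayesian predictor over a finite hypothesis class (the coin-bias hypotheses, prior, and sample below are hypothetical, not from the lecture):

% Finite hypothesis class: three candidate biases for Pr[outcome 1].
H = [0.2 0.5 0.8];
prior = [0.5 0.3 0.2];                          % assumed prior over H
S = [1 1 0 1 1];                                % observed sample (hypothetical)
k = sum(S); n = numel(S);
lik = H.^k .* (1-H).^(n-k);                     % Pr[S|h] for each h
post = lik .* prior; post = post / sum(post);   % posterior Pr[h|S]
[~, iML] = max(lik);  [~, iMAP] = max(post);
bayes = sum(H .* post);                         % Bayesian predictor for the next outcome
fprintf('ML h = %.1f, MAP h = %.1f, Bayesian prediction = %.3f\n', H(iML), H(iMAP), bayes);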

Nearest Neighbor Methods
– Classify using nearby examples
– Assumes a "structured space" and a "metric"
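
A minimal k-nearest-neighbor sketch in Matlab (the stored examples, k = 3, and the Euclidean metric are assumptions for illustration):

% Classify a query point by majority vote among its k nearest stored examples.
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];       % stored observations (hypothetical)
y = [1; 1; 1; 2; 2; 2];                   % their labels
query = [4.5 5.2];  k = 3;
d = sqrt(sum((X - query).^2, 2));         % Euclidean "metric" to every stored example
[~, order] = sort(d);
label = mode(y(order(1:k)));              % majority label among the k nearest
fprintf('predicted label: %d\n', label);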

Computational Methods
– How to find an h ∈ H with low observed error
– Heuristic algorithms for specific classes
– In most cases the computational task is provably hard

Separating Hyperplane
– Perceptron: sign(Σ_i x_i w_i)
– Find w_1 ... w_n
– Limited representation
[Figure: inputs x_1 ... x_n with weights w_1 ... w_n feeding into a sign unit]
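
A minimal perceptron training sketch (the two-Gaussian data and 20 epochs are illustrative assumptions; a bias term is included, which the slide's formula omits):

% Perceptron: predict sign(x*w + b); update the weights on mistakes.
rng(0);
X = [randn(50,2)+2; randn(50,2)-2];       % two (nearly) linearly separable clusters (assumed)
y = [ones(50,1); -ones(50,1)];
w = zeros(2,1); b = 0;
for epoch = 1:20
    for i = 1:size(X,1)
        if y(i)*(X(i,:)*w + b) <= 0       % mistake (or on the boundary)
            w = w + y(i)*X(i,:)';         % perceptron update
            b = b + y(i);
        end
    end
end
fprintf('training accuracy: %.2f\n', mean(sign(X*w + b) == y));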

Neural Networks
– Sigmoidal gates: a = Σ_i x_i w_i and output = 1/(1 + e^(-a))
– Back propagation
[Figure: multi-layer network with inputs x_1 ... x_n]
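
A minimal sketch of a single sigmoidal gate (the input and weight values are arbitrary assumptions):

% One sigmoidal gate: weighted sum followed by the logistic squashing function.
x = [0.5; -1.0; 2.0];                     % inputs (arbitrary)
w = [0.3;  0.8; -0.5];                    % weights (arbitrary)
a = w' * x;                               % a = sum_i x_i w_i
output = 1 / (1 + exp(-a));               % output = 1/(1 + e^(-a))
fprintf('gate output = %.3f\n', output);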

Decision Trees
[Figure: example decision tree with internal tests such as x_1 > 5 and x_6 > ... at the nodes]

Decision Trees
– Limited representation
– Efficient algorithms
– Aim: find a small decision tree with low observed error

Decision Trees
– PHASE I: Construct the tree greedily, using a local index function
  – Gini index: G(x) = x(1-x); entropy: H(x) = -x log x - (1-x) log(1-x); ...
– PHASE II: Prune the decision tree while maintaining low observed error
– Good experimental results
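
A minimal sketch of the two local index functions for a binary split (the example positive fractions are arbitrary):

% Local impurity indices for a node whose positive-example fraction is x.
gini    = @(x) x .* (1 - x);                                        % Gini index G(x)
entropy = @(x) -x.*log2(max(x,eps)) - (1-x).*log2(max(1-x,eps));    % binary entropy H(x)
x = [0.1 0.3 0.5];                                                  % example positive fractions
fprintf('Gini:    %s\n', mat2str(gini(x), 3));
fprintf('Entropy: %s\n', mat2str(entropy(x), 3));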

Complexity versus Generalization
– Hypothesis complexity versus observed error
– More complex hypotheses have lower observed error, but might have higher true error

Basic Criteria for Model Selection
– Minimum Description Length: ε'(h) + |code length of h|
– Structural Risk Minimization: ε'(h) + sqrt(log|H| / m)
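
As an illustration, a sketch that scores two hypothesis classes by the Structural Risk Minimization criterion (the observed errors, class sizes, and sample size are made-up numbers):

% SRM score: observed error plus a complexity penalty sqrt(log|H| / m).
m = 500;                                   % sample size (assumed)
obsErr = [0.10, 0.06];                     % observed errors of the best h in each class (assumed)
Hsize  = [2^10, 2^40];                     % sizes of the two hypothesis classes (assumed)
score = obsErr + sqrt(log(Hsize) ./ m);
[~, best] = min(score);
fprintf('SRM scores: %.3f vs %.3f -> choose class %d\n', score(1), score(2), best);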

Genetic Programming
– A search method over programs (example: decision trees)
– Local mutation operations: change a node in a tree
– Cross-over operations: replace a subtree by another tree
– Keeps the "best" candidates: trees with low observed error

General PAC Methodology
– Minimize the observed error
– Search for a small-size classifier
– Hand-tailored search methods for specific classes

Weak Learning
– Small class of predicates H
– Weak learning: assume that for any distribution D, there is some predicate h ∈ H that predicts better than 1/2 + γ
– Weak learning ⇒ strong learning (via boosting)

Boosting Algorithms
– Functions: weighted majority of the predicates
– Methodology: change the distribution to target "hard" examples
– The weight of an example is exponential in the number of incorrect classifications
– Extremely good experimental results and efficient algorithms
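
A minimal sketch of the reweighting idea only (the sample size, number of rounds, weight multiplier, and simulated weak-hypothesis mistakes are all assumptions): after each round, examples the round's weak hypothesis got wrong have their weight multiplied up, so an example's weight grows exponentially with its number of incorrect classifications.

% Boosting-style distribution update over m examples.
rng(1);
m = 10; T = 5; beta = 2;                   % beta > 1: weight multiplier for mistakes (assumed)
w = ones(1, m) / m;                        % start from the uniform distribution
for t = 1:T
    mistakes = rand(1, m) < 0.3;           % simulated: which examples round t's weak hypothesis misclassifies
    w(mistakes) = w(mistakes) * beta;      % "hard" examples gain weight exponentially
    w = w / sum(w);                        % renormalize to a distribution
end
disp('final distribution over examples:'); disp(w);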

Support Vector Machine
[Figure: data mapped from the original n-dimensional space into an m-dimensional feature space]

Support Vector Machine
– Project the data to a high-dimensional space
– Use a hyperplane in the LARGE space
– Choose a hyperplane with a large MARGIN

Other Models: Membership Queries
[Figure: the learner submits a query point x and receives its label f(x)]

Fourier Transform
– f(x) = Σ_z a_z χ_z(x), where χ_z(x) = (-1)^⟨x,z⟩
– Many simple classes are well approximated using the large coefficients
– Efficient algorithms for finding the large coefficients

Reinforcement Learning
– Control problems
– Changing the parameters changes the behavior
– Search for optimal policies

Clustering: Unsupervised learning

Unsupervised learning: Clustering

Basic Concepts in Probability
– For a single hypothesis h:
  – Given an observed error, bound the true error
– Markov inequality
– Chebyshev inequality
– Chernoff inequality
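
As an illustration of how such bounds are used, a small simulation (the true error 0.2, sample size 200, deviation 0.05, and number of trials are assumptions): it estimates how often the observed error deviates from the true error by more than 0.05 and compares with the Chernoff-Hoeffding bound 2·exp(-2mε²).

% Deviation of observed error from true error, versus the Chernoff-Hoeffding bound.
rng(2);
p = 0.2; m = 200; eps0 = 0.05; trials = 10000;
obsErr = mean(rand(m, trials) < p, 1);             % observed error in each simulated trial
empirical = mean(abs(obsErr - p) > eps0);          % empirical deviation probability
bound = 2 * exp(-2 * m * eps0^2);                  % Chernoff-Hoeffding bound
fprintf('empirical Pr[|obs - true| > %.2f] = %.3f, bound = %.3f\n', eps0, empirical, bound);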

Basic Concepts in Probability
– Switching from h_1 to h_2:
  – Given the observed errors, predict whether h_2 is better
– Compare total error rates
– More refined: consider only the cases where h_1(x) ≠ h_2(x)

Course structure
– Store observations in memory and retrieve
  – Simple, little generalization (distance measure?)
– Learn a set of rules and apply to new data
  – Sometimes difficult to find a good model
  – Good generalization
– Estimate a "flexible model" from the data
  – Generalization issues, data size issues