ONLINE LEARNING CS446 -FALL ‘15 Administration Registration Hw2 is out  Please start working on it as soon as possible  Come to sections with questions On Thursday we will have two lectures:  Usual one, 12:30-1:45  An additional one, 5pm-6:15pm; 1320 DCL Questions? 1

ONLINE LEARNING CS446 -FALL ‘15 Projects Project proposals are due on Friday 10/16/15. Within a week we will give you approval to continue with your project, along with comments and/or a request to modify/augment/do a different project. There will also be a mechanism for peer comments. We allow team projects – a team can be up to 3 people. Please start thinking and working on the project now. Your proposal is limited to 1-2 pages, but needs to include references and, ideally, some of the ideas you have developed in the direction of the project (maybe even some preliminary results). Any project that has a significant Machine Learning component is good. You can do experimental work, theoretical work, a combination of both, or a critical survey of results in some specialized topic. The work has to include some reading. Even if you do not do a survey, you must read (at least) two related papers or book chapters and relate your work to them. Originality is not mandatory but is encouraged. Try to make it interesting! 2

ONLINE LEARNING CS446 -FALL ‘15 Examples KDD Cup 2013:  "Author-Paper Identification": given an author and a small set of papers, we are asked to identify which papers are really written by the author.  "Author disambiguation": given a list of authors, we are asked to de-duplicate it or cluster them, so that strings that refer to the same author end up in the same cluster.  Adapt an NLP program to a new domain. Work on making learned hypotheses (e.g., linear threshold functions) more comprehensible (medical domain example). Develop a (multi-modal) People Identifier. Compare regularization methods: e.g., Winnow vs. L1 regularization. Large-scale clustering of documents + naming the clusters. Deep Networks: convert a state-of-the-art NLP program to an efficient deep network architecture. Try to prove something. 3

ONLINE LEARNING CS446 -FALL ‘15 A Guide Learning Algorithms  Search: (Stochastic) Gradient Descent with LMS  Decision Trees & Rules Importance of hypothesis space (representation) How are we doing?  Simplest: Quantification in terms of cumulative # of mistakes  More later Perceptron  How to deal better with large feature spaces & sparsity?  Winnow  Variations of Perceptron  Dealing with overfitting  Closing the loop: Back to Gradient Descent  Dual Representations & Kernels Beyond Binary Classification?  Multi-class classification and Structured Prediction More general way to quantify learning performance (PAC)  New Algorithms (SVM, Boosting) 4 Today: Take a more general perspective and think more about learning, learning protocols, quantifying performance, etc. This will motivate some of the ideas we will see next.

ONLINE LEARNING CS446 -FALL ‘15 Quantifying Performance We want to be able to say something rigorous about the performance of our learning algorithm. We will concentrate on discussing the number of examples one needs to see before we can say that our learned hypothesis is good. 5

ONLINE LEARNING CS446 -FALL ‘15 There is a hidden (monotone) conjunction the learner (you) is to learn How many examples are needed to learn it ? How ?  Protocol I: The learner proposes instances as queries to the teacher  Protocol II: The teacher (who knows f) provides training examples  Protocol III: Some random source (e.g., Nature) provides training examples; the Teacher (Nature) provides the labels (f(x)) Learning Conjunctions 6

ONLINE LEARNING CS446 -FALL ‘15 Learning Conjunctions Protocol I: The learner proposes instances as queries to the teacher. Since we know we are after a monotone conjunction: Is x_100 in? Query the instance with x_100 = 0 and all other variables set to 1; f(x) = 0 (conclusion: Yes, x_100 is in). Is x_99 in? The analogous query gives f(x) = 1 (conclusion: No). Is x_1 in? f(x) = 1 (conclusion: No). A straightforward algorithm requires n = 100 queries and will produce the hidden conjunction exactly. What happens here if the conjunction is not known to be monotone? If we know of a positive example, the same algorithm works. 7
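A minimal sketch of this query strategy (an illustration, not code from the course; `teacher` is a hypothetical callable that evaluates the hidden monotone conjunction):

```python
def learn_monotone_conjunction(n, teacher):
    """Protocol I: membership queries against a hidden monotone conjunction.

    `teacher(x)` is assumed to return f(x) in {0, 1} for an instance x given
    as a list of n bits. One query per variable suffices.
    """
    relevant = []
    for i in range(n):
        x = [1] * n
        x[i] = 0                      # turn off only variable i
        if teacher(x) == 0:           # f drops to 0 => x_i is in the conjunction
            relevant.append(i)
    return relevant                   # indices of the variables in the hidden conjunction

# Example: hidden conjunction f(x) = x_2 AND x_5 over n = 10 variables
hidden = {2, 5}
teacher = lambda x: int(all(x[i] == 1 for i in hidden))
print(learn_monotone_conjunction(10, teacher))   # -> [2, 5]
```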

ONLINE LEARNING CS446 -FALL ‘15 Learning Conjunctions Protocol II: The teacher (who knows f) provides training examples. (We learned a superset of the good variables.) To show you that all these variables are required…  need x_2  need x_3  …..  need x_100 A straightforward algorithm requires k = 6 examples to produce the hidden conjunction (exactly). Modeling teaching is tricky. 12

ONLINE LEARNING CS446 -FALL ‘15 Learning Conjunctions Protocol III: Some random source (e.g., Nature) provides training examples Teacher (Nature) provides the labels (f(x))  13

ONLINE LEARNING CS446 -FALL ‘15 Learning Conjunctions Protocol III: Some random source (e.g., Nature) provides training examples  Teacher (Nature) provides the labels (f(x)) Algorithm: Elimination  Start with the set of all literals as candidates  Eliminate a literal that is not active (0) in a positive example 14
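A minimal sketch of the Elimination algorithm under Protocol III, assuming examples arrive as (x, label) pairs with x a tuple of n bits (the function and variable names are illustrative):

```python
def eliminate(examples, n):
    """Learn a monotone conjunction from labeled examples by elimination.

    Start with all n variables as candidates; on every positive example,
    drop any candidate variable that is inactive (0) in it.
    Negative examples are ignored by this algorithm.
    """
    candidates = set(range(n))
    for x, label in examples:
        if label == 1:
            candidates = {i for i in candidates if x[i] == 1}
    return candidates   # hypothesis: the conjunction of the surviving variables

# Toy run: hidden conjunction x_0 AND x_3 over n = 5 variables
data = [((1, 0, 1, 1, 0), 1), ((1, 1, 0, 1, 1), 1), ((0, 1, 1, 0, 1), 0)]
print(sorted(eliminate(data, 5)))    # -> [0, 3]
```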

ONLINE LEARNING CS446 -FALL ‘15 Learning Conjunctions Protocol III: Some random source (e.g., Nature) provides training examples  Teacher (Nature) provides the labels (f(x)) Algorithm: Elimination  Start with the set of all literals as candidates  Eliminate a literal that is not active (0) in a positive example  (From the negative examples we learned nothing – elimination only uses positive examples.)  Final hypothesis: the conjunction of all surviving literals. Is that good ? Performance ? # of examples ? 20

ONLINE LEARNING CS446 -FALL ‘15 Learning Conjunctions Protocol III: Some random source (e.g., Nature) provides training examples  Teacher (Nature) provides the labels (f(x)) Algorithm: Elimination  Final hypothesis: with the given data, we only learned an “approximation” to the true concept. Is it good ? Performance ? # of examples ? 21

ONLINE LEARNING CS446 -FALL ‘15 Two Directions Can continue to analyze the probabilistic intuition:  Never saw x_1 = 0 in positive examples; maybe we’ll never see it?  And if we do, it will be with small probability, so the concepts we learn may be pretty good  Good: in terms of performance on future data  PAC framework Mistake-driven learning algorithms  Update your hypothesis only when you make mistakes  Good: in terms of how many mistakes you make before you stop, happy with your hypothesis  Note: not all on-line algorithms are mistake driven, so the performance measure could be different. 22

ONLINE LEARNING CS446 -FALL ‘15 On-Line Learning Two new learning algorithms (learn a linear function over the feature space)  Perceptron (+ many variations)  Winnow  General Gradient Descent view Issues:  Importance of Representation  Complexity of Learning  Idea of Kernel Based Methods  More about features 23

ONLINE LEARNING CS446 -FALL ‘15 Motivation Consider a learning problem in a very high dimensional space And assume that the function space is very sparse (every function of interest depends on a small number of attributes.) Can we develop an algorithm that depends only weakly on the space dimensionality and mostly on the number of relevant attributes ? How should we represent the hypothesis? Middle Eastern deserts are known for their sweetness 24

ONLINE LEARNING CS446 -FALL ‘15 On-Line Learning Of general interest; simple and intuitive model; Robot in an assembly line, language learning,… Important in the case of very large data sets, when the data cannot fit memory – Streaming data Evaluation: We will try to make the smallest number of mistakes in the long run.  What is the relation to the “real” goal?  Generate a hypothesis that does well on previously unseen data 25

ONLINE LEARNING CS446 -FALL ‘15 Not the most general setting for on-line learning. Not the most general metric (Regret: cumulative loss; Competitive analysis) On-Line Learning Model:  Instance space: X (dimensionality – n)  Target: f: X → {0,1}, f ∈ C, a concept class (parameterized by n) Protocol:  learner is given x ∈ X  learner predicts h(x), and is then given f(x) (feedback) Performance: the learner makes a mistake when h(x) ≠ f(x)  M_A(f,S): the number of mistakes algorithm A makes on a sequence S of examples, for the target function f. A is a mistake-bound algorithm for the concept class C if M_A(C) is polynomial in n, the complexity parameter of the target concept. On Line Model 26
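For reference, the mistake-bound quantities mentioned above can be written out as follows (a standard formalization; the notation is assumed, not taken from the transcript):

```latex
% Mistakes of algorithm A on a sequence S of examples labeled by target f:
M_A(f, S) \;=\; \big|\{\, t \;:\; h_t(x_t) \neq f(x_t) \,\}\big|
% Worst case over targets in C and over example sequences:
M_A(C) \;=\; \max_{f \in C,\; S} \; M_A(f, S)
% A is a mistake-bound algorithm for C if M_A(C) is polynomial in n.
```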

ONLINE LEARNING CS446 -FALL ‘15 On-Line/Mistake Bound Learning We could ask: how many mistakes to get to ε-δ (PAC) behavior?  Instead, we look for exact learning (easier to analyze). No notion of distribution; a worst case model. Memory: get example, update hypothesis, get rid of it (??) Drawbacks:  Too simple  Global behavior: not clear when the mistakes will be made Advantages:  Simple  Many issues arise already in this setting  Generic conversion to other learning models  “Equivalent” to PAC for “natural” problems (?) 29

ONLINE LEARNING CS446 -FALL ‘15 Is it clear that we can bound the number of mistakes ? Let C be a finite concept class. Learn f ∈ C. CON:  In the ith stage of the algorithm:  C_i = all concepts in C consistent with all i-1 previously seen examples  Choose randomly f ∈ C_i and use it to predict the next example  Clearly, C_{i+1} ⊆ C_i and, if a mistake is made on the ith example, then |C_{i+1}| < |C_i|, so progress is made. The CON algorithm makes at most |C|-1 mistakes. Can we do better ? Generic Mistake Bound Algorithms 30
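A toy sketch of CON over an explicitly enumerated finite concept class (here C is the set of all monotone conjunctions over a few variables, chosen only so that the class is small enough to enumerate; all names are illustrative):

```python
import itertools
import random

n = 4
# C: all monotone conjunctions over n variables, each represented by the set of variables it includes
C = [frozenset(s) for r in range(n + 1) for s in itertools.combinations(range(n), r)]
predict = lambda concept, x: int(all(x[i] == 1 for i in concept))

def con(stream, C):
    """CON: keep the version space C_i of consistent concepts; predict with any member."""
    consistent, mistakes = list(C), 0
    for x, y in stream:
        h = random.choice(consistent)          # any consistent concept will do
        if predict(h, x) != y:
            mistakes += 1
        consistent = [c for c in consistent if predict(c, x) == y]   # C_{i+1} is a subset of C_i
    return mistakes

target = frozenset({0, 2})
stream = [(x, predict(target, x)) for x in itertools.product([0, 1], repeat=n)]
print(con(stream, C), "<=", len(C) - 1)        # mistakes never exceed |C| - 1
```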

ONLINE LEARNING CS446 -FALL ‘15 Let C be a concept class. Learn f ∈ C. Halving: In the ith stage of the algorithm:  C_i = all concepts in C consistent with all i-1 previously seen examples. Given an example x_i, consider the value f_j(x_i) for all f_j ∈ C_i and predict by majority: predict 1 if |{f_j ∈ C_i : f_j(x_i) = 1}| ≥ |{f_j ∈ C_i : f_j(x_i) = 0}|. Clearly C_{i+1} ⊆ C_i, and if a mistake is made on the ith example, then |C_{i+1}| ≤ ½|C_i|. The Halving algorithm makes at most log(|C|) mistakes. The Halving Algorithm 33
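A matching sketch of Halving over the same kind of enumerated class, predicting with the majority vote of the surviving concepts:

```python
import itertools
import math

n = 4
C = [frozenset(s) for r in range(n + 1) for s in itertools.combinations(range(n), r)]
predict = lambda concept, x: int(all(x[i] == 1 for i in concept))

def halving(stream, C):
    """Predict with the majority vote of the current version space C_i."""
    consistent, mistakes = list(C), 0
    for x, y in stream:
        votes = sum(predict(c, x) for c in consistent)
        y_hat = int(votes >= len(consistent) - votes)     # predict 1 on ties
        if y_hat != y:
            mistakes += 1                                  # each mistake halves C_i
        consistent = [c for c in consistent if predict(c, x) == y]
    return mistakes

target = frozenset({0, 2})
stream = [(x, predict(target, x)) for x in itertools.product([0, 1], repeat=n)]
print(halving(stream, C), "<=", math.log2(len(C)))        # at most log2|C| = 4 mistakes
```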

ONLINE LEARNING CS446 -FALL ‘15 The Halving Algorithm Hard to compute. In some cases Halving is optimal (C – the class of all Boolean functions). In general, to be optimal, instead of guessing in accordance with the majority of the valid concepts, we should guess according to the concept group that gives the least number of expected mistakes (even harder to compute). 34

ONLINE LEARNING CS446 -FALL ‘15 There is a hidden conjunction the learner is to learn. The number of (monotone) conjunctions is |C| = 2^n, so log(|C|) = n. The elimination algorithm makes n mistakes  Learn from positive examples; eliminate literals that are not active in them. k-conjunctions:  Assume that only k<<n attributes occur in the conjunction  The number of k-conjunctions is roughly C(n,k)·2^k (choose the k attributes, each possibly negated), so log(|C|) = O(k log n)  Can we learn efficiently with this number of mistakes ? Learning Conjunctions What’s possible? 35 Can mistakes be bounded in the non-finite case? Can this bound be achieved?
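A quick sanity check of these counts (the 2^k factor assumes each chosen attribute may appear positive or negated; drop it for the monotone case):

```python
from math import comb, log2

n, k = 100, 5
print(log2(2 ** n))                      # monotone conjunctions: log2|C| = n = 100
num_k_conj = comb(n, k) * 2 ** k         # choose k of n attributes, each possibly negated
print(num_k_conj, round(log2(num_k_conj), 1))   # ~2.4e9 concepts, log2|C| ~ 31.2 = O(k log n)
```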

ONLINE LEARNING CS446 -FALL ‘15 Representation Assume that you want to learn conjunctions. Should your hypothesis space be the class of conjunctions?  Theorem: Given a sample on n attributes that is consistent with a conjunctive concept, it is NP-hard to find a pure conjunctive hypothesis that is both consistent with the sample and has the minimum number of attributes.  [David Haussler, AIJ’88: “Quantifying Inductive Bias: AI Learning Algorithms and Valiant's Learning Framework”] The same holds for disjunctions. Intuition: Reduction to the minimum set cover problem.  Given a collection of sets that cover X, define a set of examples so that learning the best (dis/con)junction implies a minimal cover. Consequently, we cannot learn the concept efficiently as a (dis/con)junction. But, we will see that we can do that if we are willing to learn the concept as a Linear Threshold function. In a more expressive class, the search for a good hypothesis sometimes becomes combinatorially easier. 36 Importance of Representation

ONLINE LEARNING CS446 -FALL ‘15 Linear Functions 37 f(x) = 1 if w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ θ, and 0 otherwise. Disjunctions: y = x_1 ∨ x_3 ∨ x_5 can be written as y = (1·x_1 + 1·x_3 + 1·x_5 ≥ 1). At least m of n: y = at least 2 of {x_1, x_3, x_5} can be written as y = (1·x_1 + 1·x_3 + 1·x_5 ≥ 2). Exclusive-OR: y = (x_1 ∧ ¬x_2) ∨ (¬x_1 ∧ x_2) is not a linear function. Non-trivial DNF: y = (x_1 ∧ x_2) ∨ (x_3 ∧ x_4) is not a linear function.
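A small check of the first two rows, verifying that a fixed weight/threshold pair reproduces the Boolean function on all inputs (an illustration, not from the slides):

```python
from itertools import product

def ltu(w, theta, x):
    """Linear threshold unit: 1 iff w . x >= theta."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

w = (1, 0, 1, 0, 1)   # weights on x_1..x_5 (only x_1, x_3, x_5 matter)
for x in product([0, 1], repeat=5):
    disj = int(x[0] or x[2] or x[4])                 # x_1 v x_3 v x_5
    at_least_2 = int(x[0] + x[2] + x[4] >= 2)        # at least 2 of {x_1, x_3, x_5}
    assert ltu(w, 1, x) == disj and ltu(w, 2, x) == at_least_2
print("disjunction = threshold 1, at-least-2-of-3 = threshold 2: OK")
```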

ONLINE LEARNING CS446 -FALL ‘15 38 [Figure: a linear separator in the instance space, defined by w · x = θ (positive side w · x ≥ θ, negative side w · x < θ).]

ONLINE LEARNING CS446 -FALL ‘15 Footnote About the Threshold 39 On the previous slide, the Perceptron has no threshold, but we don’t lose generality: w · x ≥ θ can be rewritten as (w, −θ) · (x, 1) ≥ 0, i.e., add a constant feature x_{n+1} = 1 and fold the threshold into the weight vector.

ONLINE LEARNING CS446 -FALL ‘15 Perceptron learning rule On-line, mistake-driven algorithm. Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule. (Perceptron == Linear Threshold Unit) Perceptron

ONLINE LEARNING CS446 -FALL ‘15 Perceptron learning rule We learn f: X → {-1,+1} represented as f = sgn(w · x), where X = {0,1}^n or X = R^n and w ∈ R^n. Given labeled examples: {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} Perceptron 41 1. Initialize w = 0 ∈ R^n 2. Cycle through all examples: a. Predict the label of instance x to be y’ = sgn(w · x) b. If y’ ≠ y, update the weight vector: w = w + r y x (r – a constant, the learning rate). Otherwise, if y’ = y, leave the weights unchanged.
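A minimal sketch of this rule in plain Python, with labels in {-1, +1} and no explicit threshold (consistent with the footnote above); treating sgn(0) as -1 is a choice the slides do not pin down:

```python
def perceptron(examples, n, r=1.0, epochs=10):
    """Perceptron learning rule: mistake-driven updates w <- w + r*y*x."""
    w = [0.0] * n                                    # 1. initialize w = 0
    for _ in range(epochs):                          # 2. cycle through all examples
        for x, y in examples:
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if y_hat != y:                           # mistake: update the weights
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
    return w

# Toy separable data: y = +1 iff x_1 + x_2 >= 1 (third coordinate is a constant bias feature)
data = [((1, 1, 1), 1), ((1, 0, 1), 1), ((0, 1, 1), 1), ((0, 0, 1), -1)]
w = perceptron(data, n=3)
print(w)   # a weight vector consistent with the four examples
```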

ONLINE LEARNING CS446 -FALL ‘15 The Perceptron rule  If y = +1: x should be above the decision boundary  Raise the decision boundary’s slope: w_{i+1} := w_i + x  If y = -1: x should be below the decision boundary  Lower the decision boundary’s slope: w_{i+1} := w_i – x 42 [Figure: two panels comparing the target, the previous model, and the new model after an update.]

ONLINE LEARNING CS446 -FALL ‘15 Perceptron in action 43–44 [Figure (animation, from Bishop 2006): the current decision boundary wx = 0 and weight vector w; a positive item x to be classified is added to w as a vector, yielding a new weight vector and a new decision boundary.]

ONLINE LEARNING CS446 -FALL ‘15 Perceptron learning rule If x is Boolean, only the weights of active features are updated. Why is this important? Perceptron 45 1. Initialize w = 0 ∈ R^n 2. Cycle through all examples: a. Predict the label of instance x to be y’ = sgn(w · x) b. If y’ ≠ y, update the weight vector to w = w + r y x (r – a constant, the learning rate). Otherwise, if y’ = y, leave the weights unchanged.
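This matters for sparse, high-dimensional Boolean data: an update touches only the features that are active in x, so the cost per mistake scales with the number of active features rather than with n. A sketch using a dictionary-based sparse weight vector (illustrative names, not from the course):

```python
from collections import defaultdict

def sparse_perceptron_step(w, active, y, r=1.0):
    """One mistake-driven step with x given as the set of its active (=1) features.

    w is a dict feature -> weight; only |active| entries are read or written.
    """
    score = sum(w[f] for f in active)
    y_hat = 1 if score > 0 else -1
    if y_hat != y:
        for f in active:                 # update only the active features
            w[f] += r * y
    return y_hat

w = defaultdict(float)
print(sparse_perceptron_step(w, active={"w=learning", "w=online"}, y=1))   # -1: mistake, weights updated
print(sparse_perceptron_step(w, active={"w=learning", "w=online"}, y=1))   # +1 now
```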

ONLINE LEARNING CS446 -FALL ‘15 Perceptron Learnability Obviously can’t learn what it can’t represent (???)  Only linearly separable functions Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations  Parity functions can’t be learned (XOR)  In vision, if patterns are represented with local features, can’t represent symmetry, connectivity Research on Neural Networks stopped for years Rosenblatt himself (1959) asked, “What pattern recognition problems can be transformed so as to become linearly separable?” Perceptron 46

ONLINE LEARNING CS446 -FALL ‘15 47 (x_1 ∧ x_2) ∨ (x_3 ∧ x_4) is not linearly separable over x_1, …, x_4; but if we introduce new features y_1 = x_1 ∧ x_2 and y_2 = x_3 ∧ x_4, it becomes the linearly separable function y_1 ∨ y_2.
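A quick check of that claim, reusing the threshold-unit idea from the Linear Functions slide (the features y_1, y_2 are the ones named above):

```python
from itertools import product

target = lambda x: int((x[0] and x[1]) or (x[2] and x[3]))   # (x1 ^ x2) v (x3 ^ x4)

# Over the new features y1 = x1^x2, y2 = x3^x4 the target is just a disjunction,
# which a single threshold unit represents: y1 + y2 >= 1.
for x in product([0, 1], repeat=4):
    y1, y2 = int(x[0] and x[1]), int(x[2] and x[3])
    assert int(y1 + y2 >= 1) == target(x)
print("separable in the (y1, y2) feature space: OK")
```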

ONLINE LEARNING CS446 -FALL ‘15 Perceptron Convergence Perceptron Convergence Theorem: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge.  How long would it take to converge ? Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.  How to provide robustness, more expressivity ? Perceptron 48
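As for how long convergence takes: the standard answer is Novikoff's mistake bound, stated here for reference only (it is not derived in this section):

```latex
% If \|x_t\| \le R for all examples and some unit vector w^* separates the data
% with margin y_t (w^* \cdot x_t) \ge \gamma > 0 for all t, then the Perceptron
% makes at most
\#\text{mistakes} \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```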