Machine Learning Tutorial – UIST 2002

Machine Learning and Neural Networks
Professor Tony Martinez
Computer Science Department, Brigham Young University
http://axon.cs.byu.edu/~martinez

Tutorial Overview
- Introduction and Motivation
- Neural Network Model Descriptions: Perceptron, Backpropagation
- Issues: Overfitting, Applications
- Other Models: Decision Trees, Nearest Neighbor/IBL, Genetic Algorithms, Rule Induction, Ensembles

More Information
- You can download this presentation from: ftp://axon.cs.byu.edu/pub/papers/NNML.ppt
- An excellent introductory text to Machine Learning: Machine Learning, Tom M. Mitchell, McGraw Hill, 1997

What is Inductive Learning?
- Gather a set of input-output examples from some application (the Training Set), e.g. speech recognition, financial forecasting
- Train the learning model (neural network, etc.) on the training set until it solves it well
- The goal is to generalize on novel data not yet seen
- Gather a further set of input-output examples from the same application (the Test Set)
- Use the learning system on actual data

Motivation
- Costs and errors in programming
- Our inability to program "subjective" problems
- General, easy-to-use mechanism for a large set of applications
- Improvement in application accuracy - empirical

Example Application - Heart Attack Diagnosis
- The patient has a set of symptoms: age, type of pain, heart rate, blood pressure, temperature, etc.
- Given these symptoms in an Emergency Room setting, a doctor must diagnose whether a heart attack has occurred
- How do you train a machine learning model to solve this problem using the inductive learning approach?
  - Consistent approach
  - Knowledge of the ML approach is not critical
  - Need to select a reasonable set of input features

Examples and Discussion
- Loan underwriting
  - Which input features (data)?
  - Divide into Training Set and Test Set
  - Choose a learning model
  - Train the model on the Training Set
  - Predict accuracy with the Test Set
- How to generalize better? (see the sketch after this list)
  - Different input features
  - Different learning model
- Issues: intuition vs. prejudice, social response
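A minimal, hypothetical sketch of the train/test workflow listed above, written with scikit-learn (a modern toolkit, not part of the original 2002 tutorial) and the Iris data set shown on the next slide; the model choice, split size, and random seed are illustrative.

```python
# Sketch only: split data, train a model, estimate accuracy on the Test Set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # input features and class labels

# Divide into Training Set and Test Set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Choose a learning model and train it on the Training Set
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Predict accuracy with the Test Set (an estimate of generalization)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```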

UC Irvine Machine Learning Data Base - Iris Data Set
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica

Voting Records Data Base
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n
republican,n,y,n,y,y,y,n,n,n,n,y,?,y,y,?,?
democrat,n,y,y,n,n,n,y,y,y,n,n,n,y,n,?,?
democrat,y,y,y,n,n,y,y,y,?,y,y,?,n,n,y,?
republican,n,y,n,y,y,y,n,n,n,n,n,y,?,?,n,?
republican,n,y,n,y,y,y,n,n,n,y,n,y,y,?,n,?
democrat,y,n,y,n,n,y,n,y,?,y,y,y,?,n,n,y
democrat,y,?,y,n,n,n,y,y,y,n,n,n,y,n,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,?,y,y,n,n

Machine Learning Sketch History
- Neural Networks - connectionist - biological plausibility
  - Late 50's, early 60's: Rosenblatt, Perceptron
  - Minsky & Papert 1969 - the lull, symbolic expansion
  - Late 80's - Backpropagation, Hopfield, etc. - the explosion
- Machine Learning - Artificial Intelligence - symbolic - psychological plausibility
  - Samuel (1959) - checkers evaluation strategies
  - 1970's and on - ID3, Instance Based Learning, rule induction, ...
  - Currently - symbolic and connectionist lumped under ML
- Genetic Algorithms - 1970's
  - Originally lumped in connectionist
  - Now an exploding area - Evolutionary Algorithms

Inductive Learning - Supervised
- Assume a set T of examples of the form (x, y), where x is a vector of features/attributes and y is a scalar or vector output
- By examining the examples, postulate a hypothesis H(x) => y for arbitrary x
- Spectrum of supervised algorithms
- Unsupervised Learning
- Reinforcement Learning

Other Machine Learning Areas
- Case Based Reasoning
- Analogical Reasoning
- Speed-up Learning
- Inductive learning is the most studied and successful to date
- Data Mining
- COLT - Computational Learning Theory

Perceptron Node - Threshold Logic Unit
[Diagram: inputs x1 ... xn with weights w1 ... wn feeding a single threshold node that outputs Z]

Learning Algorithm
[Diagram: two-input node with weights w1 = .4 and w2 = -.2, threshold .1, output Z]
Training set:
  x1   x2   T
  .8   .3   1
  .4   .1   0

First Training Instance
Inputs: x1 = .8, x2 = .3; weights: w1 = .4, w2 = -.2
Net = .8*.4 + .3*(-.2) = .26, so Z = 1 (matches the target T = 1)

Second Training Instance
Inputs: x1 = .4, x2 = .1; weights: w1 = .4, w2 = -.2
Net = .4*.4 + .1*(-.2) = .14, so Z = 1 (but T = 0, so the weights are adjusted)
Δwi = (T - Z) * C * xi

Delta Rule Learning
Δwij = C (Tj - Zj) xi
- Create a network with n input and m output nodes
- Each iteration through the training set is an epoch
- Continue training until the error is less than some epsilon
- Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
- As can be seen from the node activation function, the decision surface is an n-dimensional hyperplane
(A minimal code sketch of this learning loop is given below.)
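A minimal sketch of delta-rule perceptron training, following Δwij = C (Tj - Zj) xi above. The learning rate C, threshold, stopping epsilon, and the OR training data are illustrative choices, not values from the slides.

```python
import numpy as np

def train_perceptron(X, T, C=0.1, threshold=0.0, epsilon=0.01, max_epochs=100):
    """X: (num_examples, n) inputs; T: (num_examples, m) 0/1 targets."""
    n, m = X.shape[1], T.shape[1]
    W = np.zeros((n, m))                            # n input nodes fully connected to m output nodes
    for epoch in range(max_epochs):                 # each pass through the training set is an epoch
        for x, t in zip(X, T):
            z = (x @ W > threshold).astype(float)   # threshold logic unit output
            W += C * np.outer(x, t - z)             # delta rule update
        Z = (X @ W > threshold).astype(float)
        if np.mean(Z != T) < epsilon:               # stop once the error drops below epsilon
            break
    return W

# Example: the linearly separable OR function (threshold fixed at 0;
# a bias weight could be added as an extra always-1 input)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [1]], dtype=float)
W = train_perceptron(X, T)
```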

Linear Separability
[Figure: two classes of points separated by a hyperplane - image not included in the transcript]

Linear Separability and Generalization
When is data noise vs. a legitimate exception?

Limited Functionality of Hyperplane

Gradient Descent Learning - Error Landscape
[Plot: TSS (Total Sum Squared Error) as a function of the weight values]

Deriving a Gradient Descent Learning Algorithm
- Goal: decrease overall error (or other objective function) each time a weight is changed
- Total Sum Squared error: E = Σi (Ti - Zi)²
- Seek a weight-changing algorithm such that ∂E/∂wij · Δwij is negative (each weight change decreases the error)
- If such a formula can be found, then we have a gradient descent learning algorithm
- The Perceptron/Delta rule is a gradient descent learning algorithm
- Linearly-separable problems have no local minima
(A short worked derivation is given below.)
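The claim that the Delta rule is gradient descent can be checked with the standard derivation below (not reproduced on the slide). A linear output Zj = Σi wij xi is assumed when taking the derivative, and the 1/2 factor simply folds a constant 2 into the learning rate C.

```latex
\begin{aligned}
E &= \tfrac{1}{2}\sum_j (T_j - Z_j)^2, \qquad Z_j = \sum_i w_{ij}\, x_i \\
\frac{\partial E}{\partial w_{ij}} &= -(T_j - Z_j)\, x_i \\
\Delta w_{ij} &= -C\,\frac{\partial E}{\partial w_{ij}} = C\,(T_j - Z_j)\, x_i
\end{aligned}
```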

Multi-layer Perceptron
- Can compute arbitrary mappings
- Assumes a non-linear activation function
- Training algorithms less obvious
- Backpropagation learning algorithm not exploited until the 1980's
- First of many powerful multi-layer learning algorithms

Responsibility Problem
[Figure: the network outputs 1 where 0 was wanted - which internal weights are responsible?]

Multi-Layer Generalization

Backpropagation
- Multi-layer supervised learner
- Gradient descent weight updates
- Sigmoid activation function (smoothed threshold logic)
- Backpropagation requires a differentiable activation function

Multi-layer Perceptron Topology
[Diagram: Input Layer -> Hidden Layer(s) -> Output Layer]

Backpropagation Learning Algorithm
Until convergence (low error or other criteria) do:
- Present a training pattern
- Calculate the error of the output nodes (based on T - Z)
- Calculate the error of the hidden nodes (based on the error of the output nodes, which is propagated back to the hidden nodes)
- Continue propagating error back until the input layer is reached
- Update all weights based on the standard delta rule with the appropriate error term δ:
  Δwij = C δj Zi

Activation Function and its Derivative
- Node activation function f(net) is typically the sigmoid
- The derivative of the activation function is a critical part of the algorithm
[Plots: the sigmoid rising from 0 to 1 (value .5 at Net = 0) and its derivative peaking at .25 at Net = 0, both over Net in [-5, 5]]
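For reference, the standard sigmoid and its derivative (consistent with the plotted values of .5 and .25 at Net = 0) are:

```latex
f(\mathit{net}) = \frac{1}{1 + e^{-\mathit{net}}}, \qquad
f'(\mathit{net}) = f(\mathit{net})\,\bigl(1 - f(\mathit{net})\bigr)
```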

Backpropagation Learning Equations
[Diagram: node i feeds node j, which feeds node k in the layer above]
Δwij = C δj Zi
δj = (Tj - Zj) f'(netj)            for output nodes j
δj = (Σk δk wjk) f'(netj)          for hidden nodes j
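A minimal sketch of these equations for one hidden layer. The network size, learning rate C, bias handling, number of epochs, and the XOR training data are illustrative choices, not values from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def add_bias(A):
    # append a constant-1 column so each layer gets a bias weight
    return np.hstack([A, np.ones((A.shape[0], 1))])

# XOR: not linearly separable, so a hidden layer is required
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

C = 0.5                                     # learning rate
Xb = add_bias(X)
W1 = rng.normal(scale=0.5, size=(3, 4))     # input(+bias) -> 4 hidden nodes
W2 = rng.normal(scale=0.5, size=(5, 1))     # hidden(+bias) -> 1 output node

for epoch in range(20000):
    # forward pass
    H = sigmoid(Xb @ W1)
    Hb = add_bias(H)
    Z = sigmoid(Hb @ W2)
    # backward pass: delta_j = error * f'(net_j), with f'(net) = f(net)(1 - f(net))
    delta_out = (T - Z) * Z * (1 - Z)                   # output nodes: (T - Z) f'(net)
    delta_hid = (delta_out @ W2[:-1].T) * H * (1 - H)   # hidden nodes: (sum_k delta_k w_jk) f'(net)
    # weight updates: delta_w_ij = C * delta_j * Z_i (summed over the batch)
    W2 += C * Hb.T @ delta_out
    W1 += C * Xb.T @ delta_hid

print(np.round(Z, 2))   # outputs should approach [[0], [1], [1], [0]]
```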

Backpropagation Summary
- Excellent empirical results
- Scaling - the pleasant surprise: local minima become very rare as problem and network complexity increase
- Most common neural network approach
- User-defined parameters (number of hidden nodes, layers, learning rate, etc.) make it more difficult to use
- Many variants
  - Adaptive parameters, ontogenic (growing and pruning) learning algorithms
  - Higher-order gradient descent (Newton, Conjugate Gradient, etc.)
  - Recurrent networks

Inductive Bias
- The approach used to decide how to generalize novel cases
- Occam's Razor: the simplest hypothesis which fits the data is usually the best - still many remaining options
Training examples:
  A  B  C  -> Z
  A  B' C  -> Z
  A  B  C' -> Z
  A  B' C' -> Z
  A' B' C' -> Z'
Now you receive the new input A' B C. What is your output?

Overfitting
Noise vs. exceptions revisited

The Overfit Problem
[Plot: TSS vs. epochs for the Training Set and the Validation/Test Set - training error keeps falling while held-out error eventually rises]
- Newer powerful models can have very complex decision surfaces which can converge well on most training sets by learning noisy and irrelevant aspects of the training set in order to minimize error (memorization in the limit)
- This makes them susceptible to overfit if not carefully considered

Avoiding Overfit
- Inductive bias - simplest accurate model
- More training data (vs. overtraining - one-epoch limit)
- Validation set (requires a separate test set); see the sketch after this list
- Backpropagation - tends to build from a simple model (0 weights) to just-large-enough weights (validation set)
- Stopping criteria with any constructive model (accuracy increase vs. statistical significance) - noise vs. exceptions
- Specific techniques: weight decay, pruning, jitter, regularization
- Ensembles
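A minimal sketch of validation-set early stopping, one of the overfit-avoidance techniques above. The helpers `train_one_epoch` and `error_on` are hypothetical stand-ins for any iterative learner (e.g. the backpropagation sketch earlier), and `patience` is an illustrative choice; `train_one_epoch` is assumed to return an updated copy of the model.

```python
def train_with_early_stopping(model, train_set, val_set,
                              train_one_epoch, error_on,
                              max_epochs=1000, patience=20):
    best_error = float("inf")
    best_model = model
    epochs_since_best = 0
    for epoch in range(max_epochs):
        model = train_one_epoch(model, train_set)   # one pass over the training set
        val_error = error_on(model, val_set)        # monitor error on the held-out validation set
        if val_error < best_error:
            best_error, best_model = val_error, model
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:       # stop once validation error stops improving
                break
    return best_model                               # weights at the validation-error minimum
```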

Ensembles
- Many different ensemble approaches: Stacking, Gating/Mixture of Experts, Bagging, Boosting, Wagging, Mimicking, Combinations
- Multiple diverse models are trained on the same problem and then their outputs are combined
- The specific overfit of each learning model is averaged out
- If models are diverse (uncorrelated errors), then even if the individual models are weak generalizers, the ensemble can be very accurate
[Diagram: models M1, M2, M3, ..., Mn feeding a combining technique]
(A minimal bagging sketch is given below.)
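A minimal sketch of bagging, one of the ensemble approaches listed above: each model is trained on a bootstrap resample of the training set and the outputs are combined by majority vote. Here `base_learner` is a hypothetical train(X, y) -> model function whose models expose a scikit-learn-style `predict`; X and y are assumed to be NumPy arrays, and the ensemble size is illustrative.

```python
import numpy as np
from collections import Counter

def bag(base_learner, X, y, n_models=25, rng=np.random.default_rng(0)):
    models = []
    n = len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        models.append(base_learner(X[idx], y[idx]))
    return models

def predict_vote(models, x):
    votes = [m.predict([x])[0] for m in models]     # each model votes
    return Counter(votes).most_common(1)[0][0]      # majority vote combines the outputs
```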

Application Issues
- Choose relevant features
- Normalize features
- Models can learn to ignore irrelevant features, but will have to fight the curse of dimensionality
- The more data (training examples), the better
- Slower training is acceptable for complex and production applications if it improves accuracy ("the week phenomenon")
- Execution is normally fast regardless of training time

Decision Trees - ID3/C4.5
- Top-down induction of decision trees
- Highly used and successful
- Attribute features - discrete nominal (mutually exclusive); real-valued features are discretized
- Search for the smallest tree is too complex (NP-hard)
- C4.5 uses the common symbolic ML philosophy of a greedy iterative approach

Decision Tree Learning
[Figure: mapping by hyper-rectangles in the A1-A2 attribute space]

ID3 Learning Approach
- C is the current set of examples
- A test on attribute A partitions C into {C1, C2, ..., Cw}, where w is the number of values of A
[Diagram: C split on Attribute: Color into Red -> C1, Green -> C2, Purple -> C3]

Decision Tree Learning Algorithm
- Start with the Training Set as C and test how each attribute partitions C
- Choose the best A for the root
  - The goodness measure is based on how well attribute A divides C into different output classes
  - A perfect attribute would divide C into partitions that contain only one output class each
  - A poor (irrelevant) attribute would leave each partition with the same ratio of classes as in C
  - 20-questions analogy: good questions quickly minimize the possibilities
- Continue recursively until sets are unambiguously classified or a stopping criterion is reached

ID3 Example and Discussion

  Temperature   P   N         Humidity   P   N
  Hot           2   2         High       3   4
  Mild          4   2         Normal     6   1
  Cool          3   1
  Gain: .029                  Gain: .151

- 14 examples. Uses Information Gain
- Attributes which best discriminate between classes are chosen
- If the same class ratios are found in a partitioned set, then the gain is 0
(A worked gain computation is given below.)
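The gains on the slide can be reproduced directly; the overall class counts (9 P, 5 N over the 14 examples) come from Quinlan's classic "play tennis" data used in Mitchell's text.

```python
from math import log2

def entropy(p, n):
    total = p + n
    result = 0.0
    for c in (p, n):
        if c:
            result -= (c / total) * log2(c / total)
    return result

def info_gain(partitions, total_p=9, total_n=5):
    """partitions: list of (P, N) counts, one pair per attribute value."""
    total = total_p + total_n
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(total_p, total_n) - remainder

print(round(info_gain([(2, 2), (4, 2), (3, 1)]), 3))  # Temperature -> 0.029
print(round(info_gain([(3, 4), (6, 1)]), 3))          # Humidity -> 0.152 (the slide's .151 reflects rounding)
```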

ID3 - Conclusions
- Good empirical results
- Comparable application robustness and accuracy to neural networks, with faster learning (though NNs are more natural with continuous features, both input and output)
- Most used and best known of current symbolic systems - widely used to aid in creating rules for expert systems

Nearest Neighbor Learners
- Broad spectrum: basic k-NN, Instance Based Learning, Case Based Reasoning, Analogical Reasoning
- Simply store all, or some representative subset, of the examples in the training set
- Generalize on the fly rather than using a pre-acquired hypothesis - faster learning, slower execution, information retained, memory intensive

Nearest Neighbor Algorithms
[Figure not included in the transcript]

Nearest Neighbor Variations
- How many examples to store
- How do stored examples vote (distance weighted, etc.); see the sketch after this list
- Can we choose a smaller set of near-optimal examples (prototypes/exemplars)?
  - Storage reduction
  - Faster execution
  - Noise robustness
- Distance metrics - non-Euclidean
- Irrelevant features - feature weighting
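A minimal sketch of a distance-weighted k-NN classifier, one of the voting variations above. Euclidean distance and k = 3 are illustrative choices; X_train is assumed to be a NumPy array and the labels hashable.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)        # distance to every stored example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest examples
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)   # closer neighbors get larger vote weight
    return max(votes, key=votes.get)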

Evolutionary Computation/Algorithms - Genetic Algorithms
- Simulate "natural" evolution of structures via selection and reproduction, based on performance (fitness)
- Type of heuristic search - discovery, not inductive in isolation
- Genetic operators - recombination (crossover) and mutation are most common

Crossover example (single point, after the fourth gene; code sketch below):
  Parent 1: 1 1 0 2 3 1 0 2 2 1   (Fitness = 10)
  Parent 2: 2 2 0 1 1 3 1 1 0 0   (Fitness = 12)
  Child:    2 2 0 1 3 1 0 2 2 1   (Fitness = calculated or f(parents))
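A minimal sketch of the two most common operators named above; the default random crossover point, mutation rate, and gene alphabet are illustrative choices.

```python
import random

def crossover(parent1, parent2, point=None):
    """Single-point recombination: prefix of one parent, suffix of the other."""
    if point is None:
        point = random.randrange(1, len(parent1))
    return parent2[:point] + parent1[point:]

def mutate(chromosome, rate=0.05, alphabet=(0, 1, 2, 3)):
    """Each gene is replaced by a random value with a small probability."""
    return [random.choice(alphabet) if random.random() < rate else gene
            for gene in chromosome]

# Reproduces the child shown on the slide (crossover after the fourth gene)
p1 = [1, 1, 0, 2, 3, 1, 0, 2, 2, 1]   # fitness 10
p2 = [2, 2, 0, 1, 1, 3, 1, 1, 0, 0]   # fitness 12
print(crossover(p1, p2, point=4))     # -> [2, 2, 0, 1, 3, 1, 0, 2, 2, 1]
```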

Evolutionary Algorithms
- Start with an initialized population P(t) - random, domain knowledge, etc.
- Population usually made up of possible parameter settings for a complex problem
- Typically have a fixed population size (like beam search)
- Selection
  - Parent_Selection P(t) - promising parents used to create new children
  - Survive P(t) - pruning of unpromising candidates
- Evaluate P(t) - calculate fitness of population members; ranges from simple metrics to complex simulations

Evolutionary Algorithm

Procedure EA
  t = 0;
  Initialize Population P(t);
  Evaluate P(t);
  Until Done {   /* Sufficiently "good" individuals discovered */
    t = t + 1;
    Parent_Selection P(t);
    Recombine P(t);
    Mutate P(t);
    Survive P(t);
  }
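A runnable Python counterpart to the procedure above. The fitness function (maximize the number of 1 bits, "one-max"), population size, tournament selection, and mutation rate are illustrative choices, not part of the slide.

```python
import random

def fitness(ind):                       # Evaluate: count of 1 bits
    return sum(ind)

def tournament(pop, k=2):               # Parent_Selection: fitter of k random individuals
    return max(random.sample(pop, k), key=fitness)

def recombine(p1, p2):                  # Recombine: single-point crossover
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:]

def mutate(ind, rate=0.02):             # Mutate: flip each bit with small probability
    return [1 - g if random.random() < rate else g for g in ind]

def ea(pop_size=30, length=20, generations=100):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for t in range(generations):
        children = [mutate(recombine(tournament(pop), tournament(pop)))
                    for _ in range(pop_size)]
        # Survive: keep the best pop_size of parents + children (fixed population size)
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        if fitness(pop[0]) == length:   # Done: a sufficiently "good" individual found
            break
    return pop[0]

print(ea())
```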

EA Example
- Goal: discover a new automotive engine to maximize performance, reliability, and mileage while minimizing emissions
- Features: CID (cubic inch displacement), fuel system, # of valves, # of cylinders, presence of turbo-charging
- Assume a test unit which tests possible engines and returns an integer measure of goodness
- Start with a population of random engines

Genetic Operators
- Crossover variations - multi-point, uniform probability, averaging, etc.
- Mutation - random changes in features, adaptive, different for each feature, etc.
- Others - many schemes mimicking natural genetics: dominance, selective mating, inversion, reordering, speciation, knowledge-based, etc.
- Reproduction - terminology - selection based on fitness - keep the best around - supported in the algorithms
- Critical to maintain a balance of diversity and quality in the population

Evolutionary Algorithms
- There exist mathematical proofs that evolutionary techniques are efficient search strategies
- There are a number of different evolutionary strategies
  - Genetic Algorithms
  - Evolutionary Programming
  - Evolution Strategies
  - Genetic Programming
- Strategies differ in representations, selection, operators, evaluation, etc.
- Most were independently discovered, initially for function optimization (EP, ES)
- Strategies continue to "evolve"

Genetic Algorithm Comments
- Much current work and many extensions
- Numerous application attempts
- Can plug into many algorithms requiring search; has a built-in heuristic; could augment with domain heuristics
- "Lazy Man's Solution" to any tough parameter search

Rule Induction
- Creates a set of symbolic rules to solve a classification problem
- Sequential Covering Algorithms (a minimal sketch follows this list)
  - Until no good and significant rules can be created:
    - Create all first-order rules Ax -> Classy (a single attribute test implying a class)
    - Score each rule based on goodness (accuracy) and significance using the current training set
    - Iteratively (greedily) expand the best rules to n+1 attributes, score the new rules, and prune weak rules to keep the total candidate list at a fixed size (beam search)
    - Pick the one best rule and remove all instances from the training set that the rule covers
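A minimal sketch of the sequential covering loop described above, simplified to single-condition rules (no beam search over longer conjunctions). Here `examples` is a list of (feature_dict, class_label) pairs, and the minimum-accuracy threshold is an illustrative choice.

```python
def rule_accuracy(condition, cls, examples):
    attr, val = condition
    covered = [(f, c) for f, c in examples if f.get(attr) == val]
    if not covered:
        return 0.0, covered
    return sum(c == cls for _, c in covered) / len(covered), covered

def sequential_covering(examples, min_accuracy=0.8):
    rules = []
    remaining = list(examples)
    while remaining:
        # Create all first-order rules Ax -> Classy and score them on the current training set
        candidates = [((attr, val), cls)
                      for f, cls in remaining
                      for attr, val in f.items()]
        best = max(candidates,
                   key=lambda r: rule_accuracy(r[0], r[1], remaining)[0])
        acc, covered = rule_accuracy(best[0], best[1], remaining)
        if acc < min_accuracy:                                   # stop when no good rule remains
            break
        rules.append(best)
        remaining = [e for e in remaining if e not in covered]   # remove covered instances
    return rules
```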

Rule Induction Variants
- Ordered rule lists (decision lists) - naturally support multiple output classes
    A=Green and B=Tall -> Class 1
    A=Red and C=Fast -> Class 2
    Else Class 1
- Placing new rules at the beginning or end of the list
- Unordered rule lists for each output class (must handle multiple matches)
- Rule induction can handle noise by no longer creating new rules when the gain is negligible or not statistically significant

Conclusion
- Many new algorithms and approaches are being proposed
- Application areas are rapidly increasing
- The amount of available data and information is growing
- Users desire more adaptive and user-specific computer interaction
- This need for specific and adaptable user interaction will make machine learning a more important tool in user interface research and applications