Machine Learning

Example: Image Classification (Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge", IJCV, 2015)

Example: Games

Example: Language Translation

Example: Tumor Subtypes

Example: Skin Cancer Diagnosis (Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature, 2017)

Unsupervised Learning: finding the structure in data. Examples: clustering, dimension reduction.

Unsupervised Learning: Clustering. How many clusters? Where should the borders between clusters be set? A distance measure must be selected. Examples of methods: k-means clustering, hierarchical clustering (a k-means sketch follows).
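As an illustration, a minimal k-means sketch using scikit-learn; the synthetic 2-D data and the choice of k = 2 are assumptions for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic 2-D clusters centered at (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

# The number of clusters must be chosen up front; the fit minimizes
# squared Euclidean distance to the cluster centers
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # estimated centers
print(kmeans.labels_[:5])       # cluster assignments for the first 5 points
```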

Unsupervised Learning: Dimension Reduction. Examples of methods: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NMF), Multi-Dimensional Scaling (MDS). A PCA sketch follows.
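A minimal PCA sketch with scikit-learn (the random data is a placeholder), projecting onto the two directions of largest variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # project onto the top-2 components
print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured per component
```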

Linear Regression – One Independent Variable
Relationship: $y = w_1 x_1 + w_0 + \epsilon$
Data: $(y_j, x_{1j})$ for $j = 1, \ldots, n$
Loss function (sum of squared errors): $L = \sum_j \epsilon_j^2 = \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2$
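A short sketch of evaluating this loss in code, assuming synthetic data with true parameters w1 = 2 and w0 = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)  # true w1 = 2, w0 = 1, plus noise

def sse_loss(w1, w0):
    """Sum of squared errors for the line y = w1*x + w0."""
    residuals = y - (w1 * x + w0)
    return np.sum(residuals ** 2)

print(sse_loss(2.0, 1.0))  # near-optimal parameters: small loss
print(sse_loss(0.0, 0.0))  # poor parameters: large loss
```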

Linear Regression – Error Landscape (figures: the sum of squared errors as a function of the slope, and as a surface over slope and intercept)

Linear Regression – One Independent Variable
Minimizing the loss function: $\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = 0$ and $\frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = 0$

Linear Regression – One Independent Variable
Minimizing the loss function $L$ (sum of squared errors):
$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2 = 0$
$\frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2 = 0$
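Setting both derivatives to zero yields a small linear system (the normal equations); a sketch of solving it numerically with NumPy, reusing the synthetic data above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)

# Design matrix: one column for x, one column of ones for the intercept w0
A = np.column_stack([x, np.ones_like(x)])
(w1, w0), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w1, w0)  # close to the true values 2 and 1
```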

Model Capacity: Overfitting and Underfitting (figures: fits of increasing model capacity to the same data)

Model Capacity: Overfitting and Underfitting (figure: error on the training set vs. degree of polynomial)

Model Capacity: Overfitting and Underfitting
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (John von Neumann)

Training and Testing (figure: data set split into a training set and a test set)

Data Snooping. Do not use the test data for any purpose during training.
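A minimal sketch of enforcing this split with scikit-learn (the array shapes and the 25% test fraction are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, 200)

# Hold out 25% of the data; touch X_test/y_test only for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```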

Training and Testing (figures: training error and testing error vs. degree of polynomial)

Regularization
Linear regression: $\frac{\partial}{\partial w_i} \sum_j \left( y_j - \sum_i w_i f_i(\mathbf{x}_j) \right)^2 = 0$
Regularized (L2) linear regression: $\frac{\partial}{\partial w_i} \left[ \sum_j \left( y_j - \sum_i w_i f_i(\mathbf{x}_j) \right)^2 + \lambda \sum_i w_i^2 \right] = 0$
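A sketch of L2-regularized (ridge) regression with scikit-learn, where the regularization strength λ corresponds to the alpha parameter; the data and the degree-9 polynomial basis are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (30, 1))
y = np.sin(3 * x).ravel() + rng.normal(0, 0.1, 30)

# Degree-9 polynomial features f_i(x)
X_poly = PolynomialFeatures(degree=9).fit_transform(x)

for alpha in (10.0, 100.0, 1000.0):
    model = Ridge(alpha=alpha).fit(X_poly, y)
    print(alpha, np.round(model.coef_, 2))  # larger alpha shrinks the coefficients
```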

Linear Regression – Regularization (figures: coefficients of the degree-9 polynomial fit for regularization strengths 10, 100, 1000)

Supervised Learning: Classification

Logistic Regression
Linear regression: $y = w_1 x_1 + w_0 + \epsilon$
Logistic regression: $y = \sigma(w_1 x_1 + w_0 + \epsilon)$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$
(figure: the sigmoid for $w_1 = 1$ and $w_1 = 10$)
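A short NumPy sketch of the sigmoid and the resulting model (the parameter values are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_model(x, w1, w0):
    """P(y = 1 | x) under the logistic regression model."""
    return sigmoid(w1 * x + w0)

x = np.linspace(-5, 5, 11)
print(logistic_model(x, w1=1.0, w0=0.0))   # gentle transition
print(logistic_model(x, w1=10.0, w0=0.0))  # sharp, step-like transition
```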

Sum of Square Errors as Loss Function (figures: the error surface over $w_0$ and $w_1$)

Logistic Regression – Loss Function
$L(\mathbf{w}) = \log \left( \prod_{i=1}^{n} \sigma(\mathbf{x}_i)^{y_i} \left( 1 - \sigma(\mathbf{x}_i) \right)^{1 - y_i} \right) = \sum_{i=1}^{n} \left[ y_i \log \sigma(\mathbf{x}_i) + (1 - y_i) \log \left( 1 - \sigma(\mathbf{x}_i) \right) \right]$
where $\sigma(t) = \frac{1}{1 + e^{-t}}$
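This log-likelihood (its negative is the cross-entropy loss) in code, a minimal sketch with hand-picked predicted probabilities:

```python
import numpy as np

def log_likelihood(y, p):
    """Log-likelihood of binary labels y under predicted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 0, 1, 1])
p_good = np.array([0.1, 0.2, 0.8, 0.9])  # confident and mostly correct
p_bad = np.array([0.5, 0.5, 0.5, 0.5])   # uninformative
print(log_likelihood(y, p_good))  # closer to 0 (better)
print(log_likelihood(y, p_bad))   # more negative (worse)
```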

Logistic Regression – Error Landscape (figures: the error surface over $w_0$ and $w_1$)

Training: Gradient Descent (figures: successive gradient descent steps on the loss surface)

Training: Gradient Descent. We want to use a large learning rate when we are far from the minimum and decrease it as we get closer.

Training: Gradient Descent. If the gradient is small in an extended region, gradient descent becomes very slow.

Training: Gradient Descent. Gradient descent can get stuck in local minima. To improve the behavior for shallow local minima, gradient descent can be modified to take the average of the gradient over the last few steps, similar to momentum with friction; see the sketch below.
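A minimal sketch of this running-average idea, which is essentially classical momentum; the 1-D loss and all hyperparameters are illustrative:

```python
import numpy as np

def grad(w):
    # Gradient of an illustrative 1-D loss with shallow ripples:
    # L(w) = w**4 - 4*w**2 + 0.1*sin(5*w)
    return 4 * w**3 - 8 * w + 0.5 * np.cos(5 * w)

w, velocity = 2.0, 0.0
lr, beta = 0.01, 0.9  # learning rate; beta controls memory of past gradients
for _ in range(200):
    velocity = beta * velocity + (1 - beta) * grad(w)
    w -= lr * velocity  # step along the averaged gradient
print(w)  # settles near one of the two deep minima (about ±1.41)
```

Averaging smooths out small, rapidly changing gradient components, so the update can coast through shallow ripples instead of stopping in them.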

Linear Regression – Error Landscape (figures: loss surfaces for the sum of squared errors and the sum of absolute errors)

Gradient Descent (figures: descent trajectories on the error surface)

Linear Regression – Gradient Descent (figures: successive gradient descent steps on the linear regression loss); a sketch of the update loop follows.
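A minimal NumPy sketch of the full update loop for the one-variable model, using the mean squared error (the sum scaled by 1/n) so the learning rate is easier to pick; the learning rate and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)  # true w1 = 2, w0 = 1

w1, w0, lr = 0.0, 0.0, 0.02
for _ in range(3000):
    residuals = y - (w1 * x + w0)
    # Gradients of the mean squared error with respect to w1 and w0
    grad_w1 = -2 * np.mean(residuals * x)
    grad_w0 = -2 * np.mean(residuals)
    w1 -= lr * grad_w1
    w0 -= lr * grad_w0
print(w1, w0)  # approaches the least-squares solution
```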

Gradient Descent – Learning Rate (figures: steps with a learning rate that is too small vs. too large)

Gradient Descent – Learning Rate Decay (figures: constant learning rate vs. decaying learning rate)
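A sketch of one simple decay schedule (inverse-time decay; the form and constants are illustrative):

```python
def decayed_lr(initial_lr, step, decay_rate=0.01):
    """Inverse-time decay: the learning rate shrinks as training proceeds."""
    return initial_lr / (1.0 + decay_rate * step)

for step in (0, 100, 1000):
    print(step, decayed_lr(0.1, step))  # 0.1, 0.05, ~0.009
```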

Gradient Descent – Unequal Gradients (figures: constant learning rate vs. decaying learning rate vs. partially remembering previous gradients)

Gradient Descent (figures: gradient descent on the sum of squared errors vs. the sum of absolute errors)

Outliers (figures: the effect of outliers under the sum of squared errors vs. the sum of absolute errors)

Variable Variance

Evaluation of Binary Classification Models

Confusion matrix:
           | Predicted 0     | Predicted 1
Actual 0   | True Negative   | False Positive
Actual 1   | False Negative  | True Positive

True Positive Rate / Sensitivity / Recall = TP/(TP+FN): fraction of label 1 predicted to be label 1
False Positive Rate = FP/(FP+TN): fraction of label 0 predicted to be label 1
Accuracy = (TP+TN)/total: fraction of correct predictions
Precision = TP/(TP+FP): fraction of correct among positive predictions
False Discovery Rate = 1 - Precision
Specificity = TN/(TN+FP): fraction of correct predictions among label 0
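These definitions in code, a minimal sketch from raw confusion-matrix counts (the counts are illustrative):

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity/recall (TPR)": tp / (tp + fn),
        "false positive rate": fp / (fp + tn),
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "false discovery rate": 1 - tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }

print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```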

Evaluation of Binary Classification Models (figures: score distributions for label 0 and label 1, showing the resulting true positives and false positives)

Evaluation of Binary Classification Models: Receiver Operating Characteristic (ROC) (figures: ROC curves plotting true positive rate against false positive rate)
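A sketch of computing an ROC curve with scikit-learn, assuming a classifier that outputs a score per sample (the labels and scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

# Sweeping the decision threshold traces out the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve
```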

Neural Networks (figure: a single unit computing $f\left( \sum_i w_i x_i + b \right)$ from inputs $x_1, \ldots, x_n$ with weights $w_1, \ldots, w_n$ and bias $b$; and a network with input, hidden, and output layers)
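A minimal forward-pass sketch of one such unit and of a one-hidden-layer network; the sigmoid activation and the random weights are placeholders:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.normal(size=3)  # input vector (x1, x2, x3)

# A single unit: f(sum_i w_i * x_i + b)
w, b = rng.normal(size=3), 0.1
print(sigmoid(w @ x + b))

# One hidden layer (4 units) feeding a single output unit
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
hidden = sigmoid(W1 @ x + b1)
output = sigmoid(W2 @ hidden + b2)
print(output)
```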

Generative Adversarial Networks (Nguyen et al., "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space", https://arxiv.org/abs/1612.00005)

Deep Dream (figures: Google DeepDream applied to The Garden of Earthly Delights, alongside the original painting by Hieronymus Bosch)

Artistic Style (L.A. Gatys, A.S. Ecker, M. Bethge, "A Neural Algorithm of Artistic Style", https://arxiv.org/pdf/1508.06576v1.pdf)

Image Captioning – Combining CNNs and RNNs (Karpathy, Andrej and Fei-Fei, Li, "Deep Visual-Semantic Alignments for Generating Image Descriptions", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 3128-3137)

Training and Testing (figure: data set split into a training set and a test set)

Validation: Choosing Hyperparameters (figure: data set split into training, validation, and test sets)
Examples of hyperparameters: learning rate schedule, regularization parameter, number of nearest neighbors. A validation sketch follows.
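A sketch of choosing one such hyperparameter (the ridge regularization strength) on a validation set; the data, the candidate grid, and the split fractions are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, 300)

# Test set is held out entirely; the validation set is used only to pick alpha
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

best_alpha, best_err = None, np.inf
for alpha in (0.001, 0.01, 0.1, 1.0, 10.0, 100.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    err = np.mean((model.predict(X_val) - y_val) ** 2)
    if err < best_err:
        best_alpha, best_err = alpha, err

final = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
print(best_alpha, np.mean((final.predict(X_test) - y_test) ** 2))  # test error, reported once
```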

Curse of Dimensionality: as the number of dimensions increases, the volume grows and the data become sparse. Biomedical data typically have few samples and many measurements.

No Free Lunch (Wolpert, David (1996), Neural Computation, pp. 1341-1390)

Can we trust the predictions of classifiers? (Ribeiro, Singh, and Guestrin, "Why Should I Trust You? Explaining the Predictions of Any Classifier", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016)

Adversarial Fooling Examples (figure: an original correctly classified image plus a small perturbation is classified as an ostrich; Szegedy et al., "Intriguing properties of neural networks", https://arxiv.org/abs/1312.6199)

Machine Learning