
Signal Processing and Networking for Big Data Applications. Lecture 17: Deep Learning Methodology and Applications. Zhu Han, University of Houston. Thanks to Xunsheng Du and Kevin Tsai for preparing the slides.

Outline
Methodology: Overview; Performance Metrics; Default Baseline Models; Determining Whether to Gather More Data
Applications: Large Scale Deep Learning; Computer Vision; Speech Recognition; Natural Language Processing; Others

Overview Successfully applying deep learning techniques requires more than just a good knowledge of what algorithms exist and the principles that explain how they work. Practitioners need to decide whether to gather more data, increase or decrease model capacity, add or remove regularizing features, improve the optimization of a model, improve approximate inference in a model, or debug the software implementation of the model.

Overview Practical Design Process:
1. Determine your goals: what error metric to use, and the target value for this error metric.
2. Establish a working end-to-end pipeline as soon as possible.
3. Instrument the system well to determine bottlenecks in performance.
4. Repeatedly make incremental changes such as gathering new data, adjusting hyperparameters, or changing algorithms, based on specific findings from your instrumentation.

Outline
Methodology: Overview; Performance Metrics; Default Baseline Models; Determining Whether to Gather More Data
Applications: Large Scale Deep Learning; Computer Vision; Speech Recognition; Natural Language Processing; Others

Performance metrics Determining your goals, in terms of which error metric to use, is a necessary first step because your error metric will guide all of your future actions. Keep in mind that for most applications it is impossible to achieve zero error: your input features may not contain complete information about the output variable, or the system may be intrinsically stochastic. In other words, the empirical distribution ≠ the true data-generating distribution.

Performance metrics The amount of training data to gather should trade off the cost of data collection against the expected error reduction. If the goal is only academic evaluation, we can build on previous models and focus on the effectiveness of the algorithm rather than on the amount of data. If the model is to be brought to market, we should also consider safety, cost effectiveness, and how appealing the product is to consumers.

Performance metrics Sometimes we need more advanced metrics. For example, an e-mail spam detection system can make two kinds of mistakes: incorrectly classifying a legitimate message as spam, and incorrectly allowing a spam message to appear in the inbox. The two kinds of mistake are not equally harmful, so we need a metric that accounts for their different costs.

Performance metrics Sometimes we wish to train a binary classifier that is intended to detect a rare event. For example, we might design a medical test for a rare disease. Suppose that only one in every million people has this disease. We can easily achieve 99.9999% accuracy on the detection task by simply hard-coding the classifier to always report that the disease is absent.

Performance metrics Accuracy alone is neither sufficient nor appropriate in this scenario. One way to address the problem is to measure precision and recall instead.

Performance metrics Precision is the fraction of detections reported by the model that were correct, while recall is the fraction of true events that were detected. A detector that says no one has the disease would achieve perfect precision, but zero recall. A detector that says everyone has the disease would achieve perfect recall, but precision equal to the percentage of people who have the disease (0.0001% in our example of a disease that only one person in a million has).

Performance metrics When using precision and recall, it is common to plot a PR curve, with precision on the y-axis and recall on the x-axis. In many cases, we wish to summarize the performance of the classifier with a single number rather than a curve. To do so, we can convert precision p and recall r into an F-score given by F = 2pr / (p + r).
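To make the rare-disease example concrete, here is a minimal Python sketch (not from the slides) that computes accuracy, precision, recall, and the F-score for the hard-coded "always absent" detector; the toy labels and predictions are hypothetical.

```python
import numpy as np

# Minimal sketch: precision, recall, and F-score for a rare-event detector.
def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# A classifier that always predicts "disease absent" gets high accuracy but zero recall.
y_true = np.array([0] * 999 + [1])          # one positive case per thousand (illustrative)
y_pred = np.zeros_like(y_true)              # hard-coded "always absent"
print(np.mean(y_pred == y_true))            # accuracy: 0.999
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```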

Performance metrics In short, to choose the right performance metric for your model, you must first know what goal you need to reach.

Outline
Methodology: Overview; Performance Metrics; Default Baseline Models; Determining Whether to Gather More Data
Applications: Large Scale Deep Learning; Computer Vision; Speech Recognition; Natural Language Processing; Others

Default baseline models If your problem has a chance of being solved by just choosing a few linear weights correctly, you may want to begin with a simple statistical model like logistic regression. If you know that your problem falls into an “AI-complete” category like object recognition, speech recognition, machine translation, and so on, then you are likely to do well by beginning with an appropriate deep learning model.

Default baseline models First, choose the general category of model based on the structure of your data. A reasonable choice of optimization algorithm is SGD with momentum with a decaying learning rate (popular decay schemes that perform better or worse on different problems include decaying linearly until reaching a fixed minimum learning rate, decaying exponentially, or decreasing the learning rate by a factor of 2-10 each time validation error plateaus).
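As an illustration of this baseline, here is a hedged sketch of SGD with momentum plus a schedule that halves the learning rate when the monitored loss plateaus. PyTorch, the toy data, and the tiny model are my assumptions, not part of the slides; in practice the scheduler would be driven by validation loss.

```python
import torch

# Hypothetical toy data and model purely for illustration.
X, y = torch.randn(512, 100), torch.randint(0, 10, (512,))
model = torch.nn.Sequential(torch.nn.Linear(100, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
loss_fn = torch.nn.CrossEntropyLoss()

# Baseline optimizer suggested above: SGD with momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# One popular decay scheme: halve the learning rate when the monitored loss plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # in practice, pass the *validation* loss here
```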

Default baseline models Another very reasonable alternative is Adam. Batch normalization can have a dramatic effect on optimization performance, especially for convolutional networks and networks with sigmoidal nonlinearities. While it is reasonable to omit batch normalization from the very first baseline, it should be introduced quickly if optimization appears to be problematic.
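A minimal sketch, again assuming PyTorch, of how these two suggestions might look in practice: batch normalization after each convolutional layer, and Adam as the alternative optimizer. The architecture is illustrative only.

```python
import torch
import torch.nn as nn

# Illustrative convolutional baseline with batch normalization after each conv layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalizes activations; often stabilizes optimization
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),
)

# Adam as the alternative optimizer mentioned above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```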

Default baseline models If your task is similar to another task that has been studied extensively, you will probably do well by first copying the model and algorithm that is already known to perform best on the previously studied task.

Outline
Methodology: Overview; Performance Metrics; Default Baseline Models; Determining Whether to Gather More Data
Applications: Large Scale Deep Learning; Computer Vision; Speech Recognition; Natural Language Processing; Others

Determining whether to gather more data Many machine learning novices are tempted to make improvements by trying out many different algorithms. However, it is often much better to gather more data than to improve the learning algorithm.

Determining whether to gather more data How does one decide whether to gather more data? First, determine whether the performance on the training set is acceptable. (1) If performance on the training set is poor, the learning algorithm is not using the training data that is already available, so there is no reason to gather more data. (2) Instead, try increasing the size of the model by adding more layers or adding more hidden units to each layer. Also, try improving the learning algorithm, for example by tuning the learning rate and other hyperparameters.

Determining whether to gather more data How does one decide whether to gather more data? First, determine whether the performance on the training set is acceptable. (3) If large models and carefully tuned optimization algorithms do not work well, then the problem might be the quality of the training data. The data may be too noisy or may not include the right inputs needed to predict the desired outputs.

Determining whether to gather more data How does one decide whether to gather more data? If the performance on the training set is acceptable, then measure the performance on a test set (1) If the performance on the test set is also acceptable, then there is nothing left to be done.

Determining whether to gather more data How does one decide whether to gather more data? If the performance on the training set is acceptable, then measure the performance on a test set (2) If test set performance is much worse than training set performance, then gathering more data is one of the most effective solutions.

Determining whether to gather more data The key considerations are the cost and feasibility of gathering more data. At large internet companies with millions or billions of users, it is feasible to gather large datasets, and the expense of doing so can be considerably less than that of the other alternatives, so the answer is almost always to gather more training data. In other contexts, such as medical applications, it may be costly or infeasible to gather more data. A simple alternative to gathering more data is to reduce the size of the model or to improve regularization, for example by adjusting hyperparameters such as weight decay coefficients or by adding regularization strategies such as dropout.
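The decision procedure in the preceding slides can be summarized as a short sketch; the function, error values, and thresholds below are hypothetical placeholders, not numbers from the slides.

```python
# Hedged sketch of the data-gathering decision procedure described above.
def next_step(train_error, test_error, target_error):
    if train_error > target_error:
        # Training performance is poor: more data will not help yet.
        return "increase model capacity / tune the optimizer / check data quality"
    if test_error <= target_error:
        return "done: performance is acceptable"
    # Large train/test gap: gathering more data is one of the most effective fixes.
    return "gather more data (or shrink the model / add regularization if infeasible)"

print(next_step(train_error=0.02, test_error=0.15, target_error=0.05))
```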

Outline
Methodology: Overview; Performance Metrics; Default Baseline Models; Determining Whether to Gather More Data
Applications: Large Scale Deep Learning; Computer Vision; Speech Recognition; Natural Language Processing; Others

CPU implementation Careful use of fixed-point arithmetic can be faster than floating-point arithmetic: Vanhoucke et al. (2011) obtained a 3X speedup over a strong floating-point baseline. However, newer CPUs have different performance characteristics, so floating-point implementations might be faster on newer devices.

GPU implementations GPUs were originally designed for video game rendering: performing many matrix multiplications and divisions in parallel to convert 3-D coordinates into 2-D screen coordinates. They are designed for a high degree of parallelism and high memory bandwidth, at the cost of lower clock speeds and less branching capability relative to CPUs (the exact trade-off depends on the CPU and GPU models).

GPU implementations Neural networks involve large buffers of parameters, activation values, and gradient values. These buffers can be large enough to fall outside the cache of a conventional CPU, whereas GPUs offer high memory bandwidth.

GPU implementations Steinkrau et al. (2005) implemented a two-layer fully connected neural network on a GPU and reported a 3X speedup over their CPU-based baseline. Chellapilla et al. (2006) demonstrated that the same technique could be used to accelerate supervised convolutional networks.

GPU implementations Writable memory locations are not cached, so writes go directly into GPU memory, and reads and writes are fast when they are coalesced: several threads each read or write the values they need simultaneously. Threads are divided into small groups called warps, and each thread in the same warp executes the same instruction.

GPU implementations Software libraries with GPU support: Pylearn2, Theano, cuda-convnet, TensorFlow, Torch.

Large scale distributed implementations Training across many machines instead of a single machine. Data parallelism: each input example is processed on a separate machine. Model parallelism: multiple machines work on a single data point. Academic deep learning researchers typically cannot afford such systems.

Model compression Time and memory cost matter at deployment: train a model once, then deploy it to be used by billions of users. Replace the original, expensive model with a smaller model that requires less memory and runtime, removing unnecessary units or parameters before deploying to users.
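One simple instance of removing unnecessary parameters is magnitude pruning, sketched below; the weight matrix and sparsity level are illustrative, and this is only one of many compression techniques rather than the specific method the slide prescribes.

```python
import numpy as np

# Minimal sketch of magnitude pruning: zero out the smallest-magnitude weights
# before deployment. The weight matrix and sparsity level are illustrative.
def prune_by_magnitude(weights, sparsity=0.9):
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

w = np.random.randn(256, 256)
w_pruned = prune_by_magnitude(w, sparsity=0.9)   # roughly 90% of entries become zero
```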

Specialized hardware implementations General purpose processing units (CPUs and GPUs) typically use 32 or 64 bits of precision. It is possible to use less precision, although in the 1990s the required hardware cost too much. Nowadays, 8 and 16 bits of precision are suggested for low-precision implementations.
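As a rough illustration of the low-precision idea, here is a minimal sketch of symmetric 8-bit linear quantization of a weight tensor; the scheme and tensor are assumptions for illustration, not the exact method used by any particular system.

```python
import numpy as np

# Minimal sketch of symmetric 8-bit quantization of a weight tensor.
def quantize_int8(w):
    scale = np.max(np.abs(w)) / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
print(np.max(np.abs(dequantize(q, scale) - w)))  # small quantization error
```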

Computer vision

Computer vision Computer vision seeks to replicate human visual abilities and is one of the most active research areas for deep learning applications, such as object recognition and detection.

Preprocessing Mixing images that lie in [0, 1] with images that lie in [0, 255] will usually result in failure, because the network has no way to know that a value of 1 in a [0, 1]-scaled image corresponds to 255 in a [0, 255]-scaled one. Dataset augmentation can reduce generalization error. More generally, preprocessing removes variability in the input that is easy for a human designer to describe and that is known to be irrelevant to the task.
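A minimal sketch of the two points above, assuming NumPy and a uint8 input image: rescale images onto a common [0, 1] range before mixing datasets, and apply a simple flip augmentation. The shapes and dtypes are illustrative.

```python
import numpy as np

# Put images on a common [0, 1] scale and apply a simple flip augmentation.
def to_unit_range(image):
    image = image.astype(np.float32)
    if image.max() > 1.0:          # assume a [0, 255]-scaled image
        image = image / 255.0
    return image

def augment(image, rng):
    if rng.random() < 0.5:         # random horizontal flip (H x W x C layout)
        image = image[:, ::-1, :]
    return image

rng = np.random.default_rng(0)
img_255 = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
img = augment(to_unit_range(img_255), rng)
```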

Contrast normalization Contrast is the magnitude of the difference between the bright and the dark pixels in an image. Global contrast normalization (GCN) prevents images from having varying amounts of contrast: subtract the mean from each image, then rescale the standard deviation across its pixels to some constant. Images with low contrast often contain little information.
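A minimal NumPy sketch of GCN as just described, with a small epsilon added here (an assumption, not from the slides) to avoid dividing by zero on nearly constant images:

```python
import numpy as np

# Global contrast normalization: subtract the image mean and rescale the pixel
# standard deviation to a constant s.
def global_contrast_normalize(image, s=1.0, eps=1e-8):
    image = image.astype(np.float32)
    image = image - image.mean()
    return s * image / max(image.std(), eps)

img = np.random.rand(32, 32, 3) * 255
img_gcn = global_contrast_normalize(img)
print(img_gcn.mean(), img_gcn.std())   # approximately 0 and s
```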

Contrast normalization Local contrast normalization (LCN): contrast is normalized across each small window (e.g., a 9x9 window is shifted across the image and the normalization uses the statistics of that window, rather than the mean over all pixels as in GCN).

Speech recognition

Speech recognition Traditional systems combined hidden Markov models (HMMs) and Gaussian mixture models (GMMs). GMMs model the association between acoustic features and phonemes, while HMMs model the sequence of phonemes: the HMM generates a sequence of phonemes and discrete sub-phonemic states, and a GMM transforms each discrete symbol into a brief segment of audio waveform.

Natural Language processing

Natural Language processing We usually choose to regard natural language as a sequence of words, rather than a sequence of individual characters or bytes. Word-based language models must operate on an extremely high-dimensional and sparse discrete space, because the vocabulary is very large and each word is a distinct symbol carrying its own meaning (e.g., positive, negative, happy, angry, etc.).

N-grams A language model defines a probability distribution over sequences of tokens in a natural language. An n-gram is a fixed-length sequence of n tokens (e.g., the bigrams of "I am smart" are (I, am) and (am, smart)). Models based on n-grams define the conditional probability of the n-th token given the preceding n-1 tokens.

N-grams The maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. We usually train both an n-gram model and an (n-1)-gram model simultaneously. Back-off methods look up lower-order n-grams if the frequency of the context is too small to use the higher-order model; more formally, they estimate the distribution over x_t by using contexts x_{t-n+k}, ..., x_{t-1}, for increasing k, until a sufficiently reliable estimate is found. Class-based language models improve the statistical efficiency of n-gram models.
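A minimal sketch of maximum likelihood bigram (n = 2) estimation by counting, on a tiny corpus that is purely illustrative:

```python
from collections import Counter

# Maximum likelihood bigram estimation by counting on a toy corpus.
corpus = "i am smart . i am happy .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])

def p_bigram(next_word, prev_word):
    # P(next_word | prev_word) = count(prev_word, next_word) / count(prev_word)
    return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]

print(p_bigram("am", "i"))      # 1.0 in this toy corpus
print(p_bigram("smart", "am"))  # 0.5
```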

Neural Language models (NLMs) Recognize that two words are similar without losing the ability to encode each word as distinct from the other (e.g., if "cat" is similar to "dog", a sentence containing "cat" can inform predictions for sentences containing "dog"). Word embedding: view the raw symbols as points in a space of dimension equal to the vocabulary size (which differs from dataset to dataset), and embed them in a lower-dimensional continuous feature space.
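A tiny sketch of the intuition, using hand-made 3-dimensional vectors (real embeddings are learned from data and have many more dimensions): cosine similarity places "cat" near "dog", so information about one can transfer to the other.

```python
import numpy as np

# Toy, hand-made embeddings purely for illustration.
emb = {
    "cat": np.array([0.90, 0.80, 0.10]),
    "dog": np.array([0.85, 0.75, 0.20]),
    "car": np.array([0.10, 0.20, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))  # high: similar words lie close together
print(cosine(emb["cat"], emb["car"]))  # low
```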

High-dimensional outputs Importance sampling: rather than computing the gradient contribution of every incorrect word in the vocabulary, sample only a subset of "negative" words, which reduces the computational cost of the large output softmax. Bag of words: a sparse vector that indicates the presence or absence of each vocabulary word in the document.
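A minimal sketch of a binary bag-of-words vector; the vocabulary and document are illustrative.

```python
# Binary bag-of-words representation over a small illustrative vocabulary.
vocabulary = ["cat", "dog", "happy", "angry", "car"]
document = "the dog is happy and the dog is fast"

tokens = set(document.split())
bow = [1 if word in tokens else 0 for word in vocabulary]
print(bow)   # [0, 1, 1, 0, 0] -- presence/absence of each vocabulary word
```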

Other applications Recommender system Exploration vs. exploitation Knowledge representation, reasoning and question answering