Parallel Systems to Compute


Parallel Systems to Compute Deep Neural Networks
Carlos Ordonez

Authorities in the field
Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, ReLU, Boltzmann machine
Y. LeCun (NYU, Facebook, USA): first deep net that recognized digits, learning rate, backpropagation
A. Ng (Stanford, USA): multicore, parallel deep nets
M. Jordan (UC Berkeley, USA): LDA, clustering
J. Dean (Google, USA): parallel processing
Z. Ghahramani (Cambridge, UK): linear Gaussian models
Y. Li (Alibaba, China): computer vision

Acknowledgments
E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld, UC Berkeley)
J. Dean, Google (inspiring talk)
G. Hinton and Z. Ghahramani: early contact with ML
M. Stonebraker, MIT (large arrays)
V. Baladandayuthapani (Bayesian stats)
My PhD student: Sikder Tahsin Al-Amin
My colleagues at UH (50% of them are working on deep learning)

Success of deep nets in AI problems
Signal: speech recognition (voice)
Image: computer vision (digits, image classification)
Language: beyond IR, natural language

Learning performance

Popular libraries
PyTorch (Facebook, USA)
TensorFlow (Google, USA): C++, distributed memory
Keras
Caffe (UC Berkeley, USA)

Deep neural net
Input: data set
Output: weights or probabilities
Neuron activation f(): sigmoid, tanh, ReLU
Weights + biases
Loss function: quadratic error in regression; classification error in classification
Optional: filters (convolution is the most common)
Deep nets can be stacked

Classification of NNs
Shallow: 1 or 2 layers
Deep: 3-10, 10-100, 100-1000 layers
Convolutional or recurrent

Basic neuron model

Foundation: logistic regression

Computation
Input: data set
Iterations
f() evaluation
Loss (fitness) function
Forward propagation
Backward propagation
Convolution (filters)
Dropping neurons

Data set
A matrix: n vectors of d dimensions (not features!)
Each vector xi is perhaps labeled
Feature engineering (variable creation)
Automated feature creation (in contrast to manual feature creation)
Domain knowledge absolutely necessary
Benchmark data sets: MNIST (the digit data used with LeNet), CIFAR

Classical activation functions f(): sigmoid and tanh
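
As a rough illustration of the two classical activation functions named on this slide, the sketch below uses only the Python standard library. The function names and the numerically stable branch in sigmoid() are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-z), branched to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z: float) -> float:
    """Hyperbolic tangent; equals 2*sigmoid(2z) - 1."""
    return math.tanh(z)

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid, needed later by backpropagation."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z: float) -> float:
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t
```

Both functions saturate for large |z|, where their derivatives approach zero. That is one common motivation for the ReLU family covered later in the deck.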

Forward propagation
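
A minimal sketch of forward propagation through fully connected layers, assuming each layer computes f(Wx + b) with W stored as a list of rows. The helper names dense() and forward() are hypothetical and plain Python lists stand in for the matrices a real library would use.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(x, W, b, f):
    """One fully connected layer: apply f to each component of W x + b."""
    return [f(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def forward(x, layers):
    """Propagate input x through a list of (W, b, f) layers, in order."""
    a = x
    for W, b, f in layers:
        a = dense(a, W, b, f)
    return a

# Two inputs -> two hidden units -> one output.
layers = [
    ([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0], identity),
    ([[1.0, -1.0]], [0.0], relu),
]
output = forward([1.0, 1.0], layers)
```

Each layer is one matrix-vector product followed by an element-wise f(), which is why matrix multiplication dominates the cost.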

Backward propagation
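
A minimal sketch of backward propagation for a tiny network, assuming one tanh hidden layer, a linear output unit and the quadratic loss L = 0.5*(y - t)^2. The architecture and all names here are assumptions made for illustration; the slide itself does not fix them.

```python
import math

def forward(x, W1, b1, w2, b2):
    """Hidden layer h = tanh(W1 x + b1), scalar output y = w2 . h + b2."""
    h = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    y = sum(w * hj for w, hj in zip(w2, h)) + b2
    return h, y

def backward(x, t, W1, b1, w2, b2):
    """Gradients of L = 0.5*(y - t)^2 with respect to every parameter."""
    h, y = forward(x, W1, b1, w2, b2)
    dy = y - t                                   # dL/dy
    g_w2 = [dy * hj for hj in h]                 # dL/dw2
    g_b2 = dy                                    # dL/db2
    # Chain rule through tanh: d tanh(z)/dz = 1 - tanh(z)^2 = 1 - h^2.
    dz = [dy * w * (1.0 - hj * hj) for w, hj in zip(w2, h)]
    g_W1 = [[dzi * xj for xj in x] for dzi in dz]  # dL/dW1
    g_b1 = dz                                      # dL/db1
    return g_W1, g_b1, g_w2, g_b2
```

A common sanity check is to compare one analytic gradient entry against a central finite difference of the loss.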

Typical convolution
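
As an illustration of what a convolutional filter computes, here is a minimal sketch of a 2-D "valid" convolution in plain Python. It follows the convention common in deep learning libraries of not flipping the kernel (strictly a cross-correlation); the function name and the example filter are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]   # simple diagonal-difference filter
feature_map = conv2d_valid(image, kernel)
```

The cost is proportional to (output size) x (filter size), so it grows with the filter dimensions.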

Aspects that impact computation time
Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR)
Big data
f() is non-linear (linear algebra optimizations not feasible)
Large # of matrix multiplications
Large # of iterations needed to achieve overfit
Connectivity: dense vs sparsely connected layers, but dynamic
Convolution: depends on filter size

Big data aspects
Signal: large time series databases with words
Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved
Language: 1000s of documents

Transforming sigmoid into ReLU

Modern activation functions f(): ReLU and variations
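
A short sketch of ReLU and two widely used variations, leaky ReLU and ELU. The slide does not list which variations it means, so these two are an assumption, as are the default alpha values.

```python
import math

def relu(z: float) -> float:
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z

def elu(z: float, alpha: float = 1.0) -> float:
    """Exponential linear unit: smooth for z < 0, saturating at -alpha."""
    return z if z > 0 else alpha * math.expm1(z)
```

Unlike sigmoid and tanh, these need no exponential for positive inputs and their gradient does not vanish there.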

Layers
Fully/sparsely connected
Filters: convolution, FFT

Fully connected layers

Convolutional layer

Controlling overfit: regression

Controlling overfit: classification

Dropping neurons: randomly 1/2
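
Dropping half of the neurons at random can be sketched as below. This uses the "inverted dropout" formulation, in which surviving activations are rescaled by 1/(1 - p) during training; that rescaling and the function signature are assumptions, not something stated on the slide.

```python
import random

def dropout(activations, p=0.5, rng=None, training=True):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)          # no change at inference time
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# With p = 0.5, about half of the outputs are zero and the rest are doubled.
dropped = dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
```

The rescaling keeps the expected activation unchanged, so the layer can be used without modification at inference time.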

Optimizations and acceleration
Gradient descent
MAC: matrix multiplication
More compact network
Sparsely connected layers (dropping)
Threshold on # of weights that contribute to yi
Early stopping
Weight sharing
Parallel processing
Filters (convolution): FFT to reduce O() of matrix multiplication

Finding optimal weights: acceleration with gradient descent
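
A minimal sketch of plain gradient descent, assuming a fixed learning rate and a caller-supplied gradient function. The quadratic objective in the example is hypothetical and stands in for a network's loss.

```python
def gradient_descent(grad, w0, lr=0.1, iters=100):
    """Repeat w <- w - lr * grad(w) for a fixed number of iterations."""
    w = list(w0)
    for _ in range(iters):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Toy objective: (w0 - 3)^2 + (w1 + 1)^2, minimized at (3, -1).
def toy_grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w_opt = gradient_descent(toy_grad, [0.0, 0.0])
```

In a deep net, grad(w) is what backward propagation computes, usually on a mini-batch rather than the whole data set.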

Examples

Overfit & early stopping: # of iterations
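
Early stopping can be sketched as a patience rule on a validation loss: stop once the loss has not improved for a given number of iterations and keep the weights from the best one. The patience criterion, the function name and the loss values below are illustrative assumptions.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:   # no improvement for `patience` epochs
                break
    return best_epoch, stop_epoch

# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.72, 0.5]
best, stopped = early_stopping(losses, patience=3)
```

A small patience saves iterations but can stop before a later improvement, as the last value in the example shows.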

Floating point bottlenecks
Matrix multiplication
Basic operation MAC: multiply and accumulate, similar to dgemm() in BLAS
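
To make the multiply-and-accumulate (MAC) operation concrete, the sketch below is a naive triple-loop matrix multiplication in plain Python. It is not how an optimized dgemm() works internally (those use blocking, vectorization and threads); it only shows where the n*m*k MACs come from.

```python
def matmul(A, B):
    """C = A B for A (n x k) and B (k x m), stored as lists of rows."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]   # one MAC: multiply, then accumulate
            C[i][j] = acc
    return C

C = matmul([[1.0, 2.0], [3.0, 4.0]],
           [[5.0, 6.0], [7.0, 8.0]])
```

For square matrices this is O(n^3) floating point work, which is why hardware and libraries focus on accelerating exactly this loop.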

Parallel computation
CPU: multiple threads in cores, share L1 or L2 cache
GPU: many cores, attached processor + memory
TPU: purpose-specific
Distributed: multiple CPUs, each CPU with its own RAM
Shared-nothing: not common; network communication in RAM; I/O cost generally ignored
In short, it looks more like a traditional MPI cluster

Parallel data systems: architecture
Shared-nothing, message-passing
P machines (nodes)
Data partitioned before computation: load time
Examples: parallel DBMSs, Hadoop HDFS, MapReduce, Spark

Hardware acceleration
Modifying floating point computations
DRAM, SRAM: basic ALU ops in RAM
LSTM
Non-volatile memory: in-place; reduce precision and # of writes

Modifying floating point computations
Reduce floating point precision
Reduce # of matrix multiplications
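
The effect of reducing floating point precision can be illustrated by rounding a Python float (64-bit) to 32-bit and 16-bit IEEE formats with the standard struct module. The helper names are hypothetical, and this only shows the rounding error, not the speed or memory gain that motivates the technique.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

w = 0.1
err32 = abs(to_float32(w) - w)   # on the order of 1e-9
err16 = abs(to_float16(w) - w)   # on the order of 1e-5
```

Half precision stores a weight in a quarter of the space of a double, at the cost of a much larger rounding error and a much narrower representable range.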

TensorFlow: generalizing operations

TensorFlow: distributed computation

TensorFlow replication: data parallelism
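
The idea behind data parallelism can be sketched without any framework: each replica holds the same weights, computes a gradient on its own shard of the data, and the gradients are averaged before one shared update. The code below simulates this sequentially in plain Python for a hypothetical one-parameter model y = w*x. It does not use the TensorFlow API, and all names are illustrative.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w*x over a batch of (x, y)."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, data, workers, lr):
    """One synchronous update; assumes workers <= len(data)."""
    shards = [data[i::workers] for i in range(workers)]     # partition the data
    grads = [grad_mse(w, shard) for shard in shards]        # one per replica
    # Weight each shard by its size so the result equals the full-batch gradient.
    g = sum(len(s) / len(data) * gi for s, gi in zip(shards, grads))
    return w - lr * g

data = [(float(x), 2.0 * x) for x in range(1, 9)]   # true weight is 2
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, workers=4, lr=0.01)
```

In a real cluster the per-shard gradients are computed concurrently and the averaging step is where network communication occurs.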

Conclusions
Data set and neural net must fit in RAM (single machine or distributed memory)
Raw data preferred since the net learns features
Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums
Many iterations needed to decrease loss
Parallel processing: essential

Future work
Bigger deep nets, beyond RAM
TPUs beyond GPUs
Big data: not images, not language
Interpreting weights and biases via traditional statistics
Bayesian methods
Generative linear models have more solid theory