Parallel Systems to Compute Deep Neural Networks
Carlos Ordonez
Authorities in the field
- Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, relu, Boltzmann machine
- Y. LeCun (NYU, Facebook, USA): first deep net to recognize digits, learning rate, backpropagation
- A. Ng (Stanford, USA): multicore, parallel deep nets
- M. Jordan (UC Berkeley, USA): LDA, clustering
- J. Dean (Google, USA): parallel processing
- Z. Ghahramani (Cambridge, UK): linear Gaussian models
- Y. Li (Alibaba, China): computer vision
Acknowledgments
- E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld, UC Berkeley)
- J. Dean, Google (inspiring talk)
- G. Hinton and Z. Ghahramani: early contact with ML
- M. Stonebraker, MIT (large arrays)
- V. Baladandayuthapani (Bayesian stats)
- My PhD student: Sikder Tahsin Al-Amin
- My colleagues at UH (50% of them are working on deep learning)
Success of deep nets in AI problems
- Signal: speech recognition (voice)
- Image: computer vision (digits, image classification)
- Language: beyond IR, natural language
Learning performance (figure)
Popular libraries
- PyTorch (Facebook, USA)
- TensorFlow (Google, USA): C++, distributed memory
- Keras
- Caffe (UC Berkeley, USA)
Deep neural net
- Input: data set
- Output: weights or probabilities
- Neuron activation f(): sigmoid, tanh, relu
- Weights + biases
- Loss function: quadratic in regression; classification error
- Optional: filters (convolution, most common)
- Deep nets can be stacked (see the sketch below)
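A minimal sketch of the forward pass just described, in numpy; the layer sizes and random weights are illustrative assumptions, not the talk's code:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Propagate one input vector through stacked layers."""
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)  # activation f() applied at each layer
    return a

# Example: d=4 inputs, two hidden layers of 8 neurons, 1 output
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y_hat = forward(rng.normal(size=4), weights, biases)
```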
Classification of NNs
- Shallow: 1 or 2 layers
- Deep: 3-10, 10-100, 100-1000 layers
- Convolutional or recurrent
Basic neuron model
Foundation: logistic regression
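Logistic regression is exactly a single sigmoid neuron: P(y=1|x) = sigmoid(w·x + b). A hedged sketch with made-up weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y=1 | x) = sigmoid(w . x + b), the basic neuron model."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # one input vector (d=3)
w = np.array([0.8, 0.1, -0.4])   # illustrative learned weights
p = predict_proba(x, w, b=0.2)
```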
Computation
- Input: data set
- Iterations
- f() evaluation
- Loss (fitness) function
- Forward propagation
- Backward propagation
- Convolution (filters)
- Dropping neurons
Data set
- A matrix: n vectors of d dimensions (not features!)
- Vector xi, perhaps labeled
- Feature engineering (variable creation)
- Automated feature creation (in contrast to manual feature creation)
- Domain knowledge absolutely necessary
- Benchmark data sets: MNIST (LeNet), CIFAR
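The data set as an n x d matrix, one row per vector xi, with an optional label per vector; the shapes here are assumptions for illustration:

```python
import numpy as np

n, d = 1000, 20
X = np.random.rand(n, d)          # n input vectors, each of d dimensions
y = np.random.randint(0, 2, n)    # optional labels (classification)
x_i = X[0]                        # one vector xi, shape (d,)
```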
Classical activation functions f(): sigmoid and tanh
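Standard definitions of both classical activations, with the derivatives that backpropagation needs (numpy already provides tanh):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2  # tanh'(z) = 1 - tanh(z)^2
```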
Forward propagation
Backward propagation
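A compact sketch tying the last two slides together: one forward pass followed by one backward pass on a single hidden layer, assuming sigmoid activations, quadratic loss, and illustrative shapes and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward propagation: keep activations for the backward pass
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    # Backward propagation: chain rule, output layer first
    d2 = (a2 - y) * a2 * (1 - a2)         # dLoss/dz2, quadratic loss
    d1 = (W2.T @ d2) * a1 * (1 - a1)      # dLoss/dz1
    W2 -= lr * np.outer(d2, a1); b2 -= lr * d2
    W1 -= lr * np.outer(d1, x);  b1 -= lr * d1
    return 0.5 * np.sum((a2 - y) ** 2)    # quadratic loss value

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), np.array([1.0])
W1, b1 = rng.normal(size=(6, 4)), np.zeros(6)
W2, b2 = rng.normal(size=(1, 6)), np.zeros(1)
for _ in range(200):
    loss = step(x, y, W1, b1, W2, b2)     # loss decreases per iteration
```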
Typical convolution (figure)
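A direct (valid-padding) 2D convolution written as a nested multiply-accumulate loop, making explicit that the cost depends on the filter size; the edge filter is a made-up example:

```python
import numpy as np

def conv2d(image, filt):
    """Slide the filter over the image; one MAC sum per output cell."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * filt)
    return out

edge = np.array([[1.0, -1.0]])            # tiny edge-detecting filter
feature_map = conv2d(np.random.rand(8, 8), edge)
```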
Aspects that impact computation time
- Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR)
- Big data
- f() non-linear (linear algebra optimizations not feasible)
- Large # of matrix multiplications
- Large # of iterations needed; too many lead to overfit
- Connectivity: dense vs. sparsely connected layers, but dynamic
- Convolution: depends on filter size
Big data aspects
- Signal: large time series databases with words
- Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved
- Language: 1000s of documents
Transforming sigmoid into relu
Modern activation functions f(): relu and variations
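Standard definitions of relu and one common variation (leaky relu, with an assumed slope alpha):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for z < 0 keeps gradients from vanishing
    return np.where(z > 0.0, z, alpha * z)
```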
Layers
- Fully/sparsely connected
- Filters: convolution, FFT
Fully connected layers (figure)
Convolutional layer (figure)
Controlling overfit: regression
Controlling overfit: classification
Dropping neurons: randomly drop 1/2
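A sketch of inverted dropout: zero each activation with probability 1/2 during training and rescale the survivors, so the expected activation is unchanged:

```python
import numpy as np

def dropout(a, p=0.5, rng=np.random.default_rng()):
    mask = rng.random(a.shape) > p   # keep each neuron with prob 1 - p
    return (a * mask) / (1.0 - p)    # rescale to preserve the expectation

a = np.random.rand(8)     # activations of one layer
a_train = dropout(a)      # roughly half are zeroed
```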
Optimizations and acceleration
- Gradient descent
- MAC: matrix multiplication
- More compact network
- Sparsely connected layers (dropping)
- Threshold on # of weights that contribute to yi
- Early stopping
- Weight sharing
- Parallel processing
- Filters (convolution): FFT to reduce O() of matrix multiplication
Finding optimal weights: acceleration with gradient descent
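A minimal gradient-descent loop on a quadratic (regression) loss, showing the update w <- w - lr * gradient; learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5*||Xw - y||^2 / n
        w -= lr * grad                     # descend along the gradient
    return w

X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])         # synthetic targets
w = gradient_descent(X, y)                 # approaches [1, -2, 0.5]
```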
Examples
Overfit & early stopping: # of iterations
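An early-stopping sketch: the U-shaped validation-loss curve here is simulated (a hypothetical stand-in for a real held-out loss), but the stopping rule is the standard patience test:

```python
def val_loss(t):
    # hypothetical validation loss: falls, bottoms out at t=60, then rises (overfit)
    return (t - 60) ** 2 / 3600 + 0.1

best, bad, patience = float("inf"), 0, 5
for t in range(10_000):
    loss = val_loss(t)
    if loss < best:
        best, bad = loss, 0          # still improving: remember the best
    else:
        bad += 1
        if bad >= patience:          # no improvement for `patience` steps
            break                    # stop before overfitting worsens
print(t, best)                       # stops a few iterations past t = 60
```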
Floating point bottlenecks
- Matrix multiplication
- Basic operation MAC: multiply and accumulate, similar to dgemm() in LAPACK
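Matrix multiplication reduced to its basic MAC operation: the inner statement below is one multiply-accumulate, the same arithmetic dgemm() performs with far better memory locality:

```python
import numpy as np

def matmul_mac(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]   # one multiply-accumulate
    return C

A, B = np.random.rand(4, 3), np.random.rand(3, 5)
assert np.allclose(matmul_mac(A, B), A @ B)    # matches optimized BLAS
```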
Parallel computation
- CPU: multiple threads in cores, sharing L1 or L2 cache
- GPU: many cores, attached processor + memory
- TPU: purpose-specific
- Distributed: multiple CPUs, each CPU with its own RAM
- Shared-nothing: not common; network communication, data in RAM, I/O cost generally ignored
- In short, it looks more like a traditional MPI cluster
Parallel data systems: architecture
- Shared-nothing, message-passing
- P machines (nodes)
- Data partitioned before computation: at load time
- Examples: parallel DBMSs, Hadoop HDFS, MapReduce, Spark
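A sketch of partitioning the n x d matrix by rows across P nodes at load time, the shared-nothing layout assumed above (P and shapes are illustrative):

```python
import numpy as np

n, d, P = 1000, 20, 4
X = np.random.rand(n, d)
partitions = np.array_split(X, P)   # node p stores partitions[p] locally
assert sum(part.shape[0] for part in partitions) == n
```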
Hardware acceleration
- Modifying floating point computations
- DRAM, SRAM: basic ALU ops in RAM
- LSTM
- Non-volatile memory: in-place; reduce precision and # of writes
Modifying floating point computations
- Reduce floating point precision
- Reduce # of matrix multiplications
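A small illustration of reduced precision: the same product in float64 and float16; half precision cuts memory traffic but introduces visible rounding error:

```python
import numpy as np

A64 = np.random.rand(256, 256)
A16 = A64.astype(np.float16)                  # half-precision copy
exact = A64 @ A64
approx = (A16 @ A16).astype(np.float64)
print(np.max(np.abs(exact - approx)))         # small but nonzero error
```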
TensorFlow: generalizing operations
TensorFlow: distributed computation
TensorFlow replication: data parallelism
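A data-parallelism sketch in plain numpy (not TensorFlow's actual API): each of P workers holds a replica of the weights and a shard of the data, computes a local gradient, and a synchronous average updates every replica:

```python
import numpy as np

def local_grad(X, y, w):
    return X.T @ (X @ w - y) / len(y)   # gradient of quadratic loss on a shard

P, n, d = 4, 1000, 10
X = np.random.rand(n, d)
y = X @ np.ones(d)                      # synthetic targets, true weights = 1
w = np.zeros(d)                         # the replicated weight vector
shards = list(zip(np.array_split(X, P), np.array_split(y, P)))
for _ in range(100):
    grads = [local_grad(Xp, yp, w) for Xp, yp in shards]  # one per worker
    w -= 0.1 * np.mean(grads, axis=0)   # synchronous all-reduce average
```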
Conclusions
- Data set and neural net must fit in RAM (single machine or distributed memory)
- Raw data preferred, since the net learns features
- Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums
- Many iterations needed to decrease loss
- Parallel processing: essential
Future work
- Bigger deep nets, beyond RAM
- TPUs beyond GPUs
- Big data: not images, not language
- Interpreting weights and biases via traditional statistics
- Bayesian methods
- Generative linear models have more solid theory