Large-scale Machine Learning

Presentation transcript:

ECE 901 Epilogue: Large-scale Machine Learning and Optimization

ML Pipelines: Input Data → Feature and Model Selection → Training, evaluated on held-out Test Data.

ML Pipelines: Input Data → Feature Selection → Training → Model, evaluated on Test Data. Google's deep nets can take hundreds of hours to train [Dean et al., 2012].

Why Optimization?

Optimization at the heart of ML: the training objective combines a term that measures model fit for data point i (avoids under-fitting) with a term that measures model "complexity" (avoids over-fitting).
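Written out, the objective this slide annotates typically takes the regularized ERM form below (a sketch; the loss ℓ, regularizer R, and weight λ are assumed notation, not necessarily the slide's own symbols):

```latex
\min_{w}\;\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(w;\, x_i, y_i\big)}_{\text{fit for data point } i}
\;+\;
\underbrace{\lambda\, R(w)}_{\text{model ``complexity''}}
```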

Solvability ≠ Scalability. Many of the problems we visit can be solved in polynomial time, but that's not enough: e.g., O(#examples^4 · #dimensions^5) is not scalable. We need fast algorithms, ideally O(#examples · #dimensions) time, and algorithms amenable to parallelization: if the serial version runs in O(T) time, we want O(T/P) on P cores.

Performance Trade-off (diagram axes: statistical accuracy, speed, parallelizability) — we want our algorithms to do well on all three.

This course sits at the intersection of Algorithms/Optimization, Statistics, and Systems.

Goals of this course
Learn new algorithmic tools for large-scale ML
Produce a research-quality project report

What we covered
Part 1: ERM and Optimization; convergence properties of SGD and variants; generalization performance and Algorithmic Stability; Neural Nets
Part 2: Multicore/Parallel Optimization; Serializable Machine Learning; Distributed ML; stragglers in distributed computation

What we did not cover
Zeroth-order optimization
Several first-order algorithms: Mirror Descent, proximal methods, ADMM, Accelerated GD, Nesterov's Optimal Method, …
Second-order optimization
Semidefinite/Linear Programming
Graph problems in ML
Sketching / low-dimensional embeddings
Model Selection
Feature Selection
Data Serving
Active Learning
Online Learning
Unsupervised Learning

Recap

Gradients at the core of Optimization
Gradient Descent
Stochastic Gradient
Stochastic Coordinate
Frank-Wolfe
Variance Reduction
Projected Gradient
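A minimal sketch of the workhorse pattern behind these methods — plain mini-batch stochastic gradient in Python/NumPy. The toy least-squares objective, step size, and batch size here are illustrative assumptions, not from the slides:

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.1, epochs=10, batch_size=32, rng=None):
    """Plain mini-batch SGD: w <- w - lr * (stochastic gradient on a small batch)."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    n = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            batch = [data[i] for i in idx]
            w -= lr * grad_fn(w, batch)   # unbiased estimate of the full gradient
    return w

# Example: least-squares gradients on a tiny synthetic problem
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
data = list(zip(X, y))
grad = lambda w, batch: np.mean([2 * (xi @ w - yi) * xi for xi, yi in batch], axis=0)
w_hat = sgd(grad, np.zeros(5), data, lr=0.05)
```

The other methods on the slide mostly change what gets plugged into this loop: which coordinates or examples are sampled, whether a projection follows each step, and how the gradient estimate's variance is reduced.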

Convergence Guarantees TL;DR: Structure Helps

Convexity TL;DR: convex problems can be solved to global optimality

Non-Convexity TL;DR: for general non-convex problems, we can only guarantee convergence of the gradient norm (i.e., to stationary points)

Neural Nets TL;DR: Very expressive, very effective, very hard to analyze

Algorithmic Stability TL;DR: Stability => Generalization
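For reference, one standard way to make "Stability => Generalization" precise is uniform stability (stated here as an assumed textbook definition, not necessarily the course's exact formulation):

```latex
% Algorithm A is \epsilon-uniformly stable if, for all datasets S, S' that
% differ in a single example and for all test points z,
\sup_{z}\; \Big|\, \ell\big(A(S);\, z\big) \;-\; \ell\big(A(S');\, z\big) \,\Big| \;\le\; \epsilon ,
% and \epsilon-uniform stability implies a generalization gap of at most
% \epsilon (in expectation).
```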

Parallel and Distributed ML (e.g., multi-socket NUMA machines) TL;DR: We still don't have a good understanding

Course TL;DR: Stochastic Gradient is almost always almost the answer

Many Open Research Problems

Parallel ML

Open Problems: Asynchronous Algorithms
Asynchronous algorithms are great for shared-memory systems, but there are issues when scaling across nodes, and similar issues arise in the distributed setting (the slide's plot shows speedup vs. #threads).
O.P.: How to provably scale on NUMA?
O.P.: What is the right ML paradigm for the distributed setting?

Open Problems: Asynchronous Algorithms
Assumptions behind the holy grail: sparsity + convexity => linear speedups; only "soft" sparsity is needed, i.e., uncorrelated sampled gradients.
O.P.: Hogwild! on dense problems — maybe we should featurize dense ML problems so that updates are sparse. Is there a fundamental trade-off between sparsity and learning?
O.P.: Hogwild! on non-convex problems.
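A minimal sketch of the Hogwild!-style lock-free update pattern on a toy sparse least-squares problem (illustrative only: Python threads stand in for shared-memory cores, and none of the paper's theoretical conditions are checked):

```python
import threading
import numpy as np

def hogwild_worker(w, X, y, lr, steps, rng):
    """Each worker updates the shared weight vector w with NO locks.
    Updates touch only the nonzero coordinates of the sampled example,
    which is what keeps write collisions (and thus interference) rare."""
    n = X.shape[0]
    for _ in range(steps):
        i = rng.integers(n)
        xi = X[i]
        nz = np.flatnonzero(xi)               # sparse support of this example
        g = 2.0 * (xi[nz] @ w[nz] - y[i]) * xi[nz]
        w[nz] -= lr * g                       # unsynchronized, component-wise write

# Toy sparse data: each row has only a few nonzero features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)) * (rng.random((1000, 50)) < 0.05)
y = rng.normal(size=1000)
w = np.zeros(50)

threads = [threading.Thread(target=hogwild_worker,
                            args=(w, X, y, 0.01, 2000, np.random.default_rng(t)))
           for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```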

Distributed ML

Open Problems: Distributed ML
How fast does distributed SGD converge?
How can we measure speedups?
Communication is expensive; how often do we average?
How do we choose the right model?
What happens with delayed nodes?
Does fault tolerance matter?
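One concrete instance of the "how often do we average?" question is model averaging / local SGD. A sketch under assumed interfaces (grad_fn returns a per-example gradient, shards is one dataset per worker; sequential loops stand in for real parallel workers):

```python
import numpy as np

def local_sgd(grad_fn, w0, shards, lr=0.05, rounds=20, local_steps=10, rng=None):
    """Model averaging ("local SGD"): each worker runs local_steps SGD steps
    on its own shard, then all worker models are averaged.  The averaging
    period controls the communication / accuracy trade-off."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    for _ in range(rounds):
        worker_models = []
        for shard in shards:                      # in a real system: one process per worker
            wk = w.copy()
            for _ in range(local_steps):
                i = rng.integers(len(shard))
                wk -= lr * grad_fn(wk, shard[i])  # local stochastic step
            worker_models.append(wk)
        w = np.mean(worker_models, axis=0)        # communication: average ("all-reduce")
    return w

# Toy usage: two workers, least-squares gradients
rng = np.random.default_rng(1)
shards = [list(zip(rng.normal(size=(100, 5)), rng.normal(size=100))) for _ in range(2)]
grad = lambda w, ex: 2 * (ex[0] @ w - ex[1]) * ex[0]
w_hat = local_sgd(grad, np.zeros(5), shards)
```

Increasing local_steps reduces communication rounds but lets the worker models drift apart, which is exactly the trade-off the slide asks about.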

Open Problems: Distributed ML
Some models are better from a systems perspective: Does the model fit in a single machine? Is the architecture amenable to low communication? Some models are easier to partition.
Can we increase sparsity (less communication) without losing accuracy?

Open Problems: Distributed ML — Strong Scaling

Open Problems: Distributed ML — How to mitigate straggler nodes? (Plot: latency t in seconds, measured on Amazon AWS, for computing f = f1 + f2 + f3 across workers.)

Open Problems: Distributed ML — How to design algorithms robust to delays? (Same latency plot, measured on Amazon AWS, for f = f1 + f2 + f3.)

Open Problems: Distributed ML
Coded computation with low decoding complexity; nonlinear/non-convex functions; expander codes to the rescue? Most of the time "lossy" learning is fine, so maybe just "terminate" slow nodes? (Diagram: workers compute partial results f1, f2, f3, f4 and the master recovers f = f1 + f2 + f3 + f4.)
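A small sketch of the coded-computation idea for the diagrammed setup, using a standard gradient-coding construction (assumed here as an illustration, not necessarily the course's specific scheme): three workers each hold two data parts and send one linear combination of their partial gradients, and the master recovers the full sum from any two responses, i.e., it tolerates one straggler.

```python
import numpy as np

# Encoding: worker 1 -> g1/2 + g2,  worker 2 -> g2 - g3,  worker 3 -> g1/2 + g3.
# Any two of these messages suffice to decode g1 + g2 + g3.

def encode(g1, g2, g3):
    """Messages the three workers would send to the master."""
    return {1: 0.5 * g1 + g2, 2: g2 - g3, 3: 0.5 * g1 + g3}

def decode(msgs):
    """msgs maps worker id -> message; exactly two of {1, 2, 3} are present."""
    if 1 not in msgs:
        return msgs[2] + 2.0 * msgs[3]   # worker 1 straggles
    if 2 not in msgs:
        return msgs[1] + msgs[3]         # worker 2 straggles
    return 2.0 * msgs[1] - msgs[2]       # worker 3 straggles

# Quick check on random partial gradients: decoding matches the full sum
rng = np.random.default_rng(0)
g1, g2, g3 = (rng.normal(size=4) for _ in range(3))
full = g1 + g2 + g3
m = encode(g1, g2, g3)
for straggler in (1, 2, 3):
    received = {k: v for k, v in m.items() if k != straggler}
    assert np.allclose(decode(received), full)
```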

Learning Theory

Open Problems: Stability
Can we test the stability of algorithms in sublinear time?
For which classes of non-convex problems is SGD stable?
Which neural net architectures lead to stable models?

Open Problems: Stability/Robustness
Well-trained models with good test error can exhibit low robustness: prediction(model, data) ≠ prediction(model, data + noise).
Theory question: How robust are models trained by SGD?
Theory question: If we add noise during training, does it robustify the model?
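A minimal sketch of the kind of empirical robustness check this slide suggests; the toy model, noise scale, and "prediction changed" criterion are placeholder assumptions:

```python
import numpy as np

def robustness_gap(predict, X, noise_std=0.1, trials=10, rng=None):
    """Fraction of test points whose prediction changes under small additive
    Gaussian input noise -- a crude proxy for (non-)robustness."""
    rng = rng or np.random.default_rng(0)
    clean = predict(X)
    flips = np.zeros(len(X), dtype=bool)
    for _ in range(trials):
        noisy = predict(X + rng.normal(scale=noise_std, size=X.shape))
        flips |= (noisy != clean)
    return flips.mean()

# Example with a toy linear classifier
w = np.array([1.0, -2.0, 0.5])
predict = lambda X: (X @ w > 0).astype(int)
X_test = np.random.default_rng(1).normal(size=(500, 3))
print(robustness_gap(predict, X_test, noise_std=0.2))
```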

What is the right SW platform?

Machine Learning On Different Frameworks

What is the right HW platform?

Machine Learning On Different Platforms
Q: How do we optimize ML for NUMA architectures?
Q: How do we parallelize ML across mobile devices?
Q: Should we build hardware optimized for ML algorithms? (FPGAs?)
Q: ML on GPUs?

The Driving Question: How can we enable Large-Scale Machine Learning on new technologies? (At the intersection of ML, Algorithms, and Systems.)

You want to be here!

Survey