Carnegie Mellon Machine Learning in the Cloud Yucheng Low Aapo Kyrola Danny Bickson Joey Gonzalez Carlos Guestrin Joe Hellerstein David O’Hallaron

In ML we face BIG problems: 24 hours of video uploaded to YouTube every minute, 13 million Wikipedia pages, 500 million Facebook users, 3.6 billion Flickr photos.

Exponential Parallelism [chart: processor speed (GHz) vs. release date; exponentially increasing sequential performance gives way to constant sequential performance, while parallel performance keeps increasing exponentially].

The Challenges of Parallelism. Wide array of different parallel architectures: GPUs, multicore, clusters, mini clouds, clouds. New algorithm design challenges: race conditions and deadlocks, distributed state. New software implementation challenges: parallel debugging and profiling, hardware-specific APIs.

Our Current Solution: graduate students. ML experts repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, then tune it for a specific parallel platform. A month later the conference paper contains: "We implemented ______ in parallel." Avoid these problems by using high-level abstractions.

MapReduce – Map Phase: embarrassingly parallel, independent computation on each CPU (CPU 1 … CPU 4); no communication needed.

MapReduce – Reduce Phase: fold/aggregation of the mapped results across CPUs.
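To make the map/reduce pattern above concrete, here is a minimal C++ sketch (not Hadoop, and not taken from the talk): the map phase applies an independent function to each element with no communication, and the reduce phase folds the results into a single aggregate.

// Minimal sketch of the map/reduce pattern: independent map, then a fold.
#include <future>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8};

  // Map phase: apply an independent function to each element, no communication.
  auto map_fn = [](int x) { return x * x; };
  std::vector<std::future<int>> mapped;
  for (int x : data)
    mapped.push_back(std::async(std::launch::async, map_fn, x));

  // Reduce phase: fold/aggregate the mapped values into one result.
  int sum = 0;
  for (auto& f : mapped) sum += f.get();
  std::cout << "sum of squares = " << sum << "\n";
}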

MapReduce and ML: excellent for large data-parallel tasks! Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. But is there more to machine learning? What about problems with complex parallel structure?

Iterative Algorithms? We can implement iterative algorithms in MapReduce: each iteration is a pass over the data on CPU 1 … CPU 3 followed by a barrier, so a single slow processor holds up every iteration.

MapAbuse: Iterative MapReduce. Only a subset of the data needs computation in each iteration, yet every pass over the data (separated by barriers) touches all of it.

MapAbuse: Iterative MapReduce. The system is not optimized for iteration: every iteration pays a startup penalty and a disk penalty.

Data-Parallel Algorithms Can Be Inefficient. The limitations of MapReduce can lead to inefficient parallel algorithms: even an optimized in-memory MapReduce BP is slower than asynchronous Splash BP. But distributed Splash BP was built from scratch, and the efficient parallel implementation was painful, painful, painful to achieve.

What about structured problems? Interdependent computation is not map-reducible. Example problem: will I be successful in research? Success depends on the success of others, so we may not be able to safely update neighboring nodes in parallel [e.g., Gibbs sampling].

Parallel Computing and ML: not all algorithms are efficiently data-parallel. Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. Structured iterative parallel (GraphLab): belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, learning graphical models, Lasso, sampling.

Common Properties: 1) sparse data dependencies, 2) local computations, 3) iterative updates. Examples: sparse primal SVM, tensor/matrix factorization, expectation maximization, optimization, sampling, belief propagation.

Gibbs Sampling [graph of variables X1 … X9] exhibits all three properties: 1) sparse data dependencies, 2) local computations, 3) iterative updates.

GraphLab is the Solution. Designed specifically for ML needs: expresses data dependencies and iterative computation. Simplifies the design of parallel programs: abstracts away hardware issues, provides automatic data synchronization. Addresses multiple hardware architectures: multicore, distributed, cloud computing (GPU implementation in progress).

Carnegie Mellon A New Framework for Parallel Machine Learning

The GraphLab Model: Data Graph, Update Functions, Scheduling, Scopes, Shared Data Table.

Part 1: Data Graph. A graph with data associated with every vertex and edge. Example (Gibbs sampling on X1 … X11): the vertex for X3 stores its sample value x3 and sample counts C(X3); the edge between X6 and X9 stores a binary potential Φ(X6, X9).

Update Functions: operations applied to a vertex that transform the data in the scope of that vertex. Gibbs update: read the samples on adjacent vertices, read the edge potentials, and compute a new sample for the current vertex.
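As a concrete illustration of the data graph and a Gibbs-style update function, here is a minimal C++ sketch. The struct names, the adjacency-list layout, and the update signature are hypothetical; this is not the GraphLab API, only the idea of vertex data (sample value and counts), edge data (binary potentials), and an update that reads its neighbors' samples.

// Illustrative sketch only: a data graph with per-vertex and per-edge data,
// and a Gibbs-style update that reads adjacent samples and edge potentials.
// (The struct names and layout are hypothetical, not the GraphLab API.)
#include <random>
#include <utility>
#include <vector>

constexpr int K = 3;                      // number of discrete states

struct VertexData {
  int sample = 0;                         // current sample value x_i
  std::vector<int> counts = std::vector<int>(K, 0);  // sample counts C(X_i)
};

struct EdgeData {
  double potential[K][K];                 // binary potential phi(x_i, x_j)
};

struct Graph {
  std::vector<VertexData> vertices;
  std::vector<std::vector<std::pair<int, EdgeData>>> adj;  // neighbor id + edge data
};

// Gibbs update: resample one vertex given its neighbors' current samples.
void gibbs_update(Graph& g, int v, std::mt19937& rng) {
  std::vector<double> p(K, 1.0);
  for (int k = 0; k < K; ++k)
    for (const auto& [u, edge] : g.adj[v])
      p[k] *= edge.potential[k][g.vertices[u].sample];
  std::discrete_distribution<int> dist(p.begin(), p.end());
  int new_sample = dist(rng);
  g.vertices[v].sample = new_sample;
  g.vertices[v].counts[new_sample] += 1;
}

int main() {
  Graph g;
  g.vertices.resize(2);
  EdgeData e{{{1.0, 0.5, 0.5}, {0.5, 1.0, 0.5}, {0.5, 0.5, 1.0}}};
  g.adj = {{{1, e}}, {{0, e}}};           // two vertices joined by one edge
  std::mt19937 rng(0);
  gibbs_update(g, 0, rng);
  gibbs_update(g, 1, rng);
}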

Part 2: Update Function Schedule. The scheduler maintains a queue of vertices to update (a, h, a, i, b, d, …); CPU 1 and CPU 2 pull tasks from this queue and execute them in parallel.

Static Schedule: the scheduler determines the order of update function evaluations. Synchronous schedule: every vertex is updated simultaneously. Round-robin schedule: every vertex is updated sequentially.

Need for Dynamic Scheduling: some parts of the graph have converged while others are still converging slowly; we want to focus effort where it is needed.

Dynamic Schedule: the task queue (a, h, a, b, b, i, …) changes as execution proceeds; CPUs pull the next vertex to update from the queue.

Dynamic Schedule: update functions can insert new tasks into the schedule. FIFO queue → Wildfire BP [Selvatici et al.]; priority queue → Residual BP [Elidan et al.]; splash schedule → Splash BP [Gonzalez et al.]. Obtain different algorithms simply by changing a flag: --scheduler=fifo, --scheduler=priority, --scheduler=splash.
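As a rough sketch of how dynamic scheduling works, the C++ fragment below keeps a priority queue of (vertex, residual) tasks and lets the "update function" push its neighbors back onto the queue whenever its change is large; swapping the container (plain queue vs. priority queue) changes the schedule. The task struct, the residual rule, and the toy update are assumptions for illustration, not GraphLab's scheduler implementation.

// Sketch of dynamic scheduling: the update inserts new tasks itself.
#include <cmath>
#include <cstdio>
#include <queue>
#include <vector>

struct Task { int vertex; double priority; };
struct ByPriority {
  bool operator()(const Task& a, const Task& b) const {
    return a.priority < b.priority;       // larger residual first
  }
};

int main() {
  std::vector<double> value = {1.0, 0.5, 2.0};
  std::vector<std::vector<int>> neighbors = {{1, 2}, {0}, {0}};

  // A priority queue gives residual-style scheduling; swapping in a plain
  // std::queue would give a FIFO schedule instead.
  std::priority_queue<Task, std::vector<Task>, ByPriority> schedule;
  for (int v = 0; v < static_cast<int>(value.size()); ++v) schedule.push({v, 1.0});

  while (!schedule.empty()) {
    int v = schedule.top().vertex;
    schedule.pop();
    double old_value = value[v];
    value[v] = 0.5 * value[v];            // stand-in for a real update function
    double residual = std::fabs(value[v] - old_value);
    if (residual > 1e-3)                  // reschedule neighbors if the change is large
      for (int u : neighbors[v]) schedule.push({u, residual});
  }
  std::printf("final values: %f %f %f\n", value[0], value[1], value[2]);
}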

Global Information What if we need global information? Sum of all the vertices? Algorithm Parameters? Sufficient Statistics?

Part 3: Shared Data Table (SDT). Holds global constant parameters, e.g. Constant: total # samples; Constant: temperature.

Sync Operation: Sync is a fold/reduce operation over the graph. The Accumulate function performs an aggregation over the vertices; the Apply function makes a final modification to the accumulated data. Example: compute the average of all the vertices (Accumulate: add; Apply: divide by |V|).
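A minimal sketch of the Sync pattern, assuming the average-of-vertices example on this slide; the loop stands in for the Accumulate fold and the final division is the Apply step (illustrative C++, not the GraphLab Sync API).

// Sync sketch: Accumulate folds over vertices, Apply post-processes the result.
#include <iostream>
#include <vector>

int main() {
  std::vector<double> vertex_value = {2.0, 4.0, 6.0, 8.0};

  // Accumulate: aggregation over all vertices (fold/reduce).
  double acc = 0.0;
  for (double v : vertex_value) acc += v;

  // Apply: final modification to the accumulated data.
  double average = acc / static_cast<double>(vertex_value.size());

  std::cout << "average vertex value = " << average << "\n";  // prints 5
}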

Shared Data Table (SDT): global constant parameters and global computation (sync operations). Example entries: Constant: total # samples; Constant: temperature; Sync: sample statistics; Sync: log-likelihood.

Carnegie Mellon Safety and Consistency

Write-Write Race: if adjacent update functions write to the same data simultaneously, the final value depends on which write lands last (the left update writes one value, the right update writes another).
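To see why unsynchronized adjacent writes are dangerous, here is a tiny stand-alone C++ example (it deliberately contains a data race, which is undefined behavior); it is not GraphLab code, just a demonstration that two uncoordinated writers on the same value produce nondeterministic results.

// Tiny illustration of a write-write race: two threads update the same value
// without synchronization, so increments are routinely lost.
#include <iostream>
#include <thread>

int main() {
  int shared_value = 0;                    // imagine this lives on a shared vertex
  auto writer = [&shared_value]() {
    for (int i = 0; i < 100000; ++i) shared_value += 1;   // unsynchronized write
  };

  std::thread left(writer), right(writer);
  left.join();
  right.join();

  // Rarely prints 200000: the two "update functions" raced on the same data.
  std::cout << "final value = " << shared_value << "\n";
}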

Race Conditions + Deadlocks: this is just one of the many possible races, and race-free code is extremely difficult to write. The GraphLab design ensures race-free operation through a user-tunable consistency mechanism.

Part 4: Scope Rules Guaranteed safety for all update functions

Full Consistency: parallel updates are only allowed on vertices at least two vertices apart, which reduces the opportunities for parallelism.

Obtaining More Parallelism Not all update functions will modify the entire scope! Belief Propagation: Only uses edge data Gibbs Sampling: Only needs to read adjacent vertices

Edge Consistency

Obtaining More Parallelism “Map” operations. Feature extraction on vertex data

Vertex Consistency

Theorem (Sequential Consistency): GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions that produces the same result. This is key for proving the correctness of a parallel algorithm.

The GraphLab Model: Data Graph, Update Functions, Scheduling, Scopes, Shared Data Table.

Carnegie Mellon Multicore Experiments

Shared-memory implementation in C++ using Pthreads. Tested on a 16-processor machine: 4x quad-core AMD Opteron, … GB RAM. Applications: belief propagation + parameter learning, Gibbs sampling, CoEM, Lasso, compressed sensing, SVM, PageRank, tensor factorization.

Graphical Model Learning: 3D retinal image denoising. Data graph: 256x64x64 (1M) vertices. Update function: belief propagation. Sync (edge potentials): Accumulate computes inference statistics, Apply takes a gradient step.

Graphical Model Learning. Standard parameter learning (iterated) takes a gradient step only after inference is complete: 2100 sec. With GraphLab (simultaneous), gradient steps are taken while inference is running, i.e. parallel inference + gradient steps: 700 sec. Runtime 3x faster!

Lasso. Data matrix X is n x d, weights are d x 1, observations are n x 1 (the example graph has 5 features and 4 examples). Shooting algorithm [coordinate descent]: updates on weight vertices modify the losses on observation vertices, so the full consistency model is required. Financial prediction dataset from Kogan et al. [2009].
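For reference, a minimal sequential sketch of the shooting algorithm, assuming the objective (1/2)||Xw - y||^2 + lambda*||w||_1 (the exact scaling used in the talk is not specified). GraphLab runs these per-coordinate updates on weight vertices in parallel under the full consistency model; the update rule itself is the standard soft-thresholding step.

// Sequential sketch of the shooting algorithm (coordinate descent) for Lasso.
#include <cmath>
#include <cstdio>
#include <vector>

double soft_threshold(double rho, double lambda) {
  if (rho >  lambda) return rho - lambda;
  if (rho < -lambda) return rho + lambda;
  return 0.0;
}

int main() {
  // Tiny made-up example: n = 4 observations, d = 2 features.
  std::vector<std::vector<double>> X = {{1, 0}, {0, 1}, {1, 1}, {2, 1}};
  std::vector<double> y = {1, 2, 3, 5};
  const double lambda = 0.1;
  std::vector<double> w(2, 0.0);

  for (int sweep = 0; sweep < 100; ++sweep) {
    for (size_t j = 0; j < w.size(); ++j) {
      double rho = 0.0, z = 0.0;
      for (size_t i = 0; i < y.size(); ++i) {
        double pred = 0.0;
        for (size_t k = 0; k < w.size(); ++k) pred += X[i][k] * w[k];
        rho += X[i][j] * (y[i] - pred + w[j] * X[i][j]);  // residual excluding w_j
        z   += X[i][j] * X[i][j];
      }
      w[j] = soft_threshold(rho, lambda) / z;              // shooting update
    }
  }
  std::printf("w = [%f, %f]\n", w[0], w[1]);
}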

Full Consistency [speedup plot on dense and sparse datasets, compared to optimal].

Relaxing Consistency [speedup plot on dense and sparse datasets, compared to optimal]. Why does this work? (We may have an answer soon.)

CoEM (Rosie Jones, 2005): a named entity recognition task, e.g. is "Dog" an animal? Is "Catalina" a place? The graph links noun phrases ("the dog", "Australia", "Catalina Island") to contexts ("ran quickly", "travelled to", "is pleasant"). Graph sizes: small, 0.2M vertices and 20M edges; large, 2M vertices and 200M edges. Hadoop: 95 cores, 7.5 hrs.

CoEM (Rosie Jones, 2005) [speedup plots for the small and large graphs]. Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. 15x faster with 6x fewer CPUs!

Carnegie Mellon GraphLab in the Cloud

Moving towards the cloud… Purchasing and maintaining computers is very expensive, and most computing resources are seldom used (only for deadlines…). In the cloud, you buy time, get access to hundreds or thousands of processors, and only pay for the resources you need.

Addressing cloud computing challenges (challenge → GraphLab solution): distributed memory → optimized data partitioning; limited bandwidth → smart caching, interleaved computation and communication; high latency → data pushing, latency-hiding mechanisms.

GraphLab in the Cloud Experiments: a highly optimized implementation for computer clusters and Amazon EC2; GraphLab automatically configures and distributes itself through EC2 and is very easy to use. Thoroughly evaluated in three case studies: CoEM, probabilistic tensor factorization, and video co-segmentation (inference & learning in a huge graphical model).

Experiment Setup: tested on both regular and HPC nodes, using up to 32 machines. Regular nodes: 8 cores per node, up to 128 cores. HPC nodes: 16 cores per node, up to 256 cores.

CoEM (Rosie Jones, 2005). Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. GraphLab in the Cloud: 32 EC2 machines, 80 secs, i.e. 0.3% of the Hadoop time.

Video Cosegmentation: discover segments that mean the same thing across the video. A lot of data! Complex solutions are infeasible!

Video Cosegmentation. Naïve idea: treat patches independently and use Gaussian EM clustering on image features. E step: predict the membership of each patch given the cluster centers. M step: compute the cluster centers given the memberships of each patch. This does not take the relationships among patches into account!
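A minimal sketch of this naive clustering step, assuming isotropic fixed-variance Gaussians for brevity (the features, variance, and initialization here are made up). The next slide's MRF variant keeps the same M step but replaces the independent E step with belief propagation over unary potentials.

// Naive Gaussian EM clustering of patch features, patches treated independently.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<std::vector<double>> patches = {{0.1}, {0.2}, {0.9}, {1.1}};
  std::vector<std::vector<double>> centers = {{0.0}, {1.0}};   // K = 2 clusters
  const double sigma2 = 0.1;                                   // fixed variance

  for (int iter = 0; iter < 20; ++iter) {
    // E step: soft membership of each patch given the cluster centers.
    std::vector<std::vector<double>> resp(patches.size(),
                                          std::vector<double>(centers.size()));
    for (size_t i = 0; i < patches.size(); ++i) {
      double norm = 0.0;
      for (size_t k = 0; k < centers.size(); ++k) {
        double d2 = 0.0;
        for (size_t f = 0; f < patches[i].size(); ++f) {
          double diff = patches[i][f] - centers[k][f];
          d2 += diff * diff;
        }
        resp[i][k] = std::exp(-d2 / (2.0 * sigma2));
        norm += resp[i][k];
      }
      for (size_t k = 0; k < centers.size(); ++k) resp[i][k] /= norm;
    }
    // M step: recompute the cluster centers from the memberships.
    for (size_t k = 0; k < centers.size(); ++k) {
      double weight = 0.0;
      std::vector<double> sum(centers[k].size(), 0.0);
      for (size_t i = 0; i < patches.size(); ++i) {
        weight += resp[i][k];
        for (size_t f = 0; f < sum.size(); ++f) sum[f] += resp[i][k] * patches[i][f];
      }
      for (size_t f = 0; f < sum.size(); ++f) centers[k][f] = sum[f] / weight;
    }
  }
  std::printf("centers: %f %f\n", centers[0][0], centers[1][0]);
}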

Video Cosegmentation. Better idea: connect the patches using an MRF, setting the edge potentials so that adjacent (spatially and temporally) patches prefer to be in the same cluster. Gaussian EM clustering with a twist: E step: build unary potentials for each patch from the cluster centers and predict the membership of each patch using BP. M step: compute the cluster centers given the memberships of each patch. D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.

Video Co-Segmentation: discover "coherent" segment types across a video (extends Batra et al. '10). 1. Form super-voxels from the video. 2. Run EM & inference in a Markov random field. Huge model: 23 million nodes, 390 million edges.

Cost-Time Tradeoff (video co-segmentation results): more machines cost more but finish faster; a few machines help a lot, after which there are diminishing returns.

Bayesian Tensor Factorization on Netflix (users x movies x time). Vertices store the user factors and movie factors; edges store the user-movie ratings, each tagged with the time of the rating (e.g. at time 2, 3, 6); the time factors are stored in the Shared Data Table.
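The slide describes the data layout; as a simplified, non-Bayesian illustration of the underlying model, the sketch below predicts a rating at (user, movie, time) with the three-way inner product of the factor vectors and fits the user and movie factors with plain gradient steps on the squared error. The dimensions, step size, and use of gradient descent are assumptions; the actual system performs Bayesian inference.

// Simplified tensor factorization layout: user/movie factors on vertices,
// time factors in shared data, rating predicted by sum_d U[d]*V[d]*T[d].
#include <cstdio>
#include <vector>

double predict(const std::vector<double>& u,
               const std::vector<double>& v,
               const std::vector<double>& t) {
  double r = 0.0;
  for (size_t d = 0; d < u.size(); ++d) r += u[d] * v[d] * t[d];
  return r;
}

int main() {
  const int D = 2;                                   // latent dimensions (assumed)
  std::vector<double> user  = {0.5, 1.0};            // vertex data (user factor)
  std::vector<double> movie = {1.0, 0.5};            // vertex data (movie factor)
  std::vector<double> time  = {1.0, 1.0};            // shared data (time factor)
  double rating = 4.0;                               // edge data (observed rating)
  double step = 0.05;

  for (int iter = 0; iter < 100; ++iter) {
    double err = predict(user, movie, time) - rating;
    for (int d = 0; d < D; ++d) {                    // gradient step on squared error
      double gu = err * movie[d] * time[d];
      double gv = err * user[d]  * time[d];
      user[d]  -= step * gu;
      movie[d] -= step * gv;
    }
  }
  std::printf("predicted rating: %f\n", predict(user, movie, time));
}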

Bayesian Tensor Factorization [speedup plots on regular and HPC nodes]. Netflix dataset: 480K users, 18K movies, 27 time periods, 100M ratings. Xiong et al. (2010): 1 core, 2160 s per iteration. Distributed GraphLab (HPC): 256 cores, 6.8 s per iteration.

Carnegie Mellon Parallel GraphLab 1.1 Multicore Available Today GraphLab in the Cloud soon… Documentation… Code… Tutorials…

GraphLab Release 1.1: the GraphLab engine targets multicore, distributed, and GPU backends. Native C++ interface with a parallel Pthread-based implementation; Matlab™ 2010b interface using EMLC; Java and Jython interfaces through JNI; tutorials and demonstration code.

C++, Java and Python. Native C++ interface: highest performance, access to the complete GraphLab feature set; template-heavy code can be difficult for novice C++ programmers. Pure Java API for GraphLab: use plain Java objects for the whole graph, write update functions in Java, and get full GraphLab C++ performance through the Java Native Interface (JNI). Python API for GraphLab (via Jython): vertex and edge data can be any Python type, Python allows very concise update functions, and it works on top of the Java interface for GraphLab.

Matlab Interface: update functions are written in a Matlab subset (embedded Matlab) and compiled to native C++ code that does not depend on Matlab. A generated MEX interface allows the resulting GraphLab program to interface with Matlab easily.

The GraphLab Model: Data Graph, Update Functions, Scheduling, Scopes, Shared Data Table.

GraphLab: a parallel abstraction tailored to machine learning. The framework compactly expresses data/computational dependencies and iterative computation, achieves state-of-the-art parallel performance on a variety of problems, and is easy to use (e.g. data partitioning, optimized communication, and automatic configuration and distribution over EC2 are handled for you, …).

Future Work. Distributed GraphLab: robustness. GPU GraphLab: memory bus bottleneck, warp alignment. State-of-the-art performance for …

Carnegie Mellon Parallel GraphLab 1.1 Multicore Available Today GraphLab in the Cloud soon… Documentation… Code… Tutorials…

Matlab Interface walkthrough of the BP update function: 1. Get the current vertex data. 2. Get the edge data on a particular in-edge. 3. Compute the new belief by multiplying the incoming messages with the unary potential. 4. Update the current vertex data. 5. For each in/out edge pair: compute the new outgoing message, set the new outgoing edge data, and compute the message residual. 6. If the residual is large, schedule the destination vertex.
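Putting the steps above together, here is a hedged C++-style sketch of the same BP update logic (the real update function is written in embedded Matlab and compiled through the interface described here; the variable names and data layout below are hypothetical).

// Sketch of the BP update walked through above, for one vertex with K labels.
#include <cmath>
#include <cstdio>
#include <queue>
#include <vector>

constexpr int K = 2;                 // number of labels
using Msg = std::vector<double>;     // a message or belief over the K labels

void normalize(Msg& m) {
  double s = 0.0;
  for (double v : m) s += v;
  for (double& v : m) v /= s;
}

int main() {
  // One vertex with a unary potential and two neighbors (in/out edge pairs).
  Msg unary = {0.7, 0.3};
  std::vector<Msg> in_msg  = {{0.6, 0.4}, {0.5, 0.5}};   // messages into the vertex
  std::vector<Msg> out_msg = {{0.5, 0.5}, {0.5, 0.5}};   // current outgoing messages
  double edge_potential[K][K] = {{1.0, 0.5}, {0.5, 1.0}};
  std::queue<int> schedule;

  // Compute the belief as the product of incoming messages and the unary potential.
  Msg belief = unary;
  for (const Msg& m : in_msg)
    for (int x = 0; x < K; ++x) belief[x] *= m[x];
  normalize(belief);

  // For each in/out edge pair: compute the new outgoing message (divide the
  // incoming message back out of the belief), set the new edge data, compute
  // the message residual, and schedule the destination if it is large.
  for (size_t e = 0; e < out_msg.size(); ++e) {
    Msg new_msg(K, 0.0);
    for (int xv = 0; xv < K; ++xv)
      for (int xu = 0; xu < K; ++xu)
        new_msg[xv] += edge_potential[xu][xv] * belief[xu] / in_msg[e][xu];
    normalize(new_msg);

    double residual = 0.0;
    for (int x = 0; x < K; ++x) residual += std::fabs(new_msg[x] - out_msg[e][x]);
    out_msg[e] = new_msg;
    if (residual > 1e-3) schedule.push(static_cast<int>(e));  // stand-in for the destination id
  }

  std::printf("belief = [%f, %f], %zu vertices scheduled\n",
              belief[0], belief[1], schedule.size());
}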

Matlab Interface: compile_update_function({‘bpupdate’}, vdata_example, edata_example, …) compiles the update function in bpupdate.m; then [newvdata, newedata, newadj] = bp(vertexdata, edgedata, adjacency_matrix, scheduling_spec) takes the MRF specification as input and returns the resultant graph after running BP.

Matlab to GraphLab Compiler Details. Inputs: update functions (e.g. bpupdate.m) and graph datatype examples (e.g. Vdata.belief = [1,1], Vdata.unary = [2,2]). Pipeline: type-check restrictions and generate EML type descriptors (vdata.belief = [double(0)]; eml.varsize('vdata.belief', [1 Inf]); vdata.unary = [double(0)]; eml.varsize('vdata.unary', [1 Inf]);); Matlab-to-C generation with EMLC; parse the output C code and generate 1. converters between emxArray and mxArray and 2. emxArray serialization and deserialization; generate binary, MEX and m front-end code; Makefile generation. Extensive use of C++ templates and the preprocessor to identify graph datatypes and to wrap the update functions.

Gibbs Sampling: two methods for sequential consistency. Scopes: use the edge scope, graphlab(gibbs, edge, sweep). Scheduling: use graph coloring, graphlab(gibbs, vertex, colored); at each time step the CPUs update vertices of a single color in parallel.
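A small sketch of the colored-schedule idea, assuming a greedy coloring: neighbors receive different colors, so all vertices of one color form an independent set and can be sampled in parallel within a sweep. This is illustrative C++, not the GraphLab scheduler.

// Graph-coloring schedule sketch: color greedily, then sweep color by color.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Small chain graph: 0 - 1 - 2 - 3.
  std::vector<std::vector<int>> nbrs = {{1}, {0, 2}, {1, 3}, {2}};
  int n = static_cast<int>(nbrs.size());

  // Greedy coloring: give each vertex the smallest color unused by its neighbors.
  std::vector<int> color(n, -1);
  for (int v = 0; v < n; ++v) {
    std::vector<bool> used(n, false);
    for (int u : nbrs[v]) if (color[u] >= 0) used[color[u]] = true;
    int c = 0;
    while (used[c]) ++c;
    color[v] = c;
  }
  int num_colors = *std::max_element(color.begin(), color.end()) + 1;

  // One sweep: each color class is an independent set, so its vertices could be
  // sampled concurrently (a sequential loop stands in for the parallel update).
  for (int c = 0; c < num_colors; ++c)
    for (int v = 0; v < n; ++v)
      if (color[v] == c)
        std::printf("sample vertex %d (color %d)\n", v, c);
}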

Gibbs Sampling: protein-protein interaction networks [Elidan et al. 2006], a pairwise MRF with 14K vertices and 100K edges. 10x speedup; the colored schedule outperforms the round-robin schedule because scheduling reduces the locking overhead.

Lasso: L1-regularized linear regression, solved with the shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed. Finance dataset from Kogan et al. [2009].

Full Consistency [speedup plot on dense and sparse datasets, compared to optimal].

Relaxing Consistency [speedup plot on dense and sparse datasets, compared to optimal]. Why does this work? (Open question.)

Did you compare against ___ for Lasso? We had trouble getting the standard Lasso implementations to run on the dataset, but we would love to try them out.

Comparing against Hadoop: Hadoop is the currently available implementation used by Tom Mitchell's group, whom we are assisting; it is what they are using now. The comparison demonstrates that Hadoop (though popular) is not necessarily the best framework for your problem.

DAG Abstraction Computation represented as a Directed Acyclic Graph. Vertices represent “programs”. A program starts when data is received on all incoming edges, and outputs data on its outgoing edges.

Clocked Systolic Array: processors are arranged in a directed graph; all processors compute, then transmit their outgoing messages in sync.

Read-Write Race: an update function reads from its in-edges and writes to its out-edges, so a read-write race occurs if it writes while an adjacent update function is reading.

Compressed Sensing: represent the image as a sparse linear combination of basis functions and minimize the reconstruction error. Interior point method with Gaussian BP as the linear solver; 50% random wavelet projection.

Graphical Model Learning: 3D retinal image denoising. Data graph: 256x64x64. Update function: belief propagation. Shared data: edge potentials; the gradient step is computed with Sync. [Plots: speedup vs. optimal; runtime split between inference and gradient steps.]