Scaling Up
CS194-16 (borrowing heavily from slides by Kay Ousterhout)

Overview
- Why Big Data? (and Big Models)
- Hadoop
- Spark
- Parameter Server and MPI

Big Data
Lots of data:
- Facebook's daily logs: 60 TB
- 1000 Genomes Project: 200 TB
- Google web index: 10+ PB
Lots of questions:
- Computational marketing
- Recommendations and personalization
- Genetic analysis

But How Much Data Do You Need?
The answer of course depends on the question, but for many applications the answer is: as much as you can get. Big Data about people (text, web, social media) follows power-law statistics.
[Plot: log feature frequency vs. log feature rank; most features occur only once or twice.]

How Much Data Do You Need?
The number of features grows in proportion to the amount of data: doubling the dataset size roughly doubles the number of users we observe. Even one or two observations of a user improve predictions for them, so more data (and bigger models!) => more revenue.
[Plot: log feature frequency vs. log feature rank; most features occur only once or twice.]
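To make the power-law claim concrete, here is a small simulation (my own illustration, not from the slides; it assumes numpy and a Zipf exponent near 1):

# Hedged sketch: sample "feature IDs" from a Zipf (power-law)
# distribution and watch the number of distinct features grow with
# dataset size. For exponents near 1, doubling the sample size
# roughly doubles the distinct-feature count.
import numpy as np

rng = np.random.default_rng(0)
for n in [100_000, 200_000, 400_000]:
    sample = rng.zipf(a=1.1, size=n)   # heavy-tailed feature IDs
    print(n, np.unique(sample).size)   # distinct features keep growing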

Hardware for Big Data
Budget hardware:
- Not "gold plated"
- Many low-end servers
- Easy to add capacity
- Cheaper per CPU/disk
Increased complexity in software:
- Fault tolerance
- Virtualization
Image: Steve Jurvetson/Flickr

Problems with Cheap HW
Failures, e.g. (Google numbers):
- 1-5% of hard drives/year
- 0.2% of DIMMs/year
Commodity network (1-10 Gb/s) speeds vs. RAM:
- Much higher latency (100x - 100,000x)
- Lower throughput (100x - 1000x)
Uneven performance:
- Variable network latency
- External loads
- Inconsistent hardware and data skew: outliers are rare, but you'll see them more when you're spreading work across more nodes
How often do things fail? See the back-of-the-envelope calculation below.
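A rough illustration of how often things fail at cluster scale (the drive and DIMM counts per server below are assumptions, not from the slides; the rates are the Google numbers above):

# Back-of-the-envelope failure math for a 1000-server cluster.
n_servers = 1000
drives = n_servers * 4          # assume 4 drives per server
dimms = n_servers * 8           # assume 8 DIMMs per server

drive_failures_per_year = drives * 0.03   # midpoint of 1-5%/year
dimm_failures_per_year = dimms * 0.002    # 0.2%/year
print(drive_failures_per_year)  # 120.0 -> a dead drive every ~3 days
print(dimm_failures_per_year)   # 16.0 bad DIMMs per year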

MapReduce Review from 61C?

MapReduce: Word Count
“I am Sam I am Sam Sam I am Do you like Green eggs and ham? I do not like them I do not like Green eggs and ham Would you like them Here or there? …”
The input arrives as a new dataset in a bunch of pieces: everything doesn't need to be stored on one machine.

Word Count with one Reducer
[Diagram: each piece of the input is counted separately ({I: 3, am: 3, …}, {do: 2, you: 1, …}, {Sam: 1, I: 2, …}, {Would: 1, you: 1, …}); a single reducer gets everything and produces the combined counts {I: 6, am: 5, …}.]

Word Count with Multiple Reducers
[Diagram: the pieces are counted separately as before, but the keys are split across several reducers, each producing part of the result ({I: 6, do: 3, …}, {am: 5, Sam: 4, …}, {you: 2, …}, {Would: 1, …}).]

MapReduce: Word Count
[Diagram: the same word-count pipeline, now labeled with its Map phase (per-piece partial counts) and Reduce phase (merged counts split across reducers).]
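A minimal single-process sketch of the map and reduce steps in the diagrams above (plain Python; the phase names are mine, not a real framework API):

# Minimal word-count map/reduce, run locally for illustration.
from collections import defaultdict

def map_phase(chunk):                # one chunk = one mapper's input
    for word in chunk.split():
        yield (word, 1)

def reduce_phase(pairs):             # all pairs for a key reach one reducer
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["I am Sam I am Sam", "Sam I am Do you like Green eggs and ham?"]
pairs = [kv for c in chunks for kv in map_phase(c)]
print(reduce_phase(pairs))           # {'I': 3, 'am': 3, 'Sam': 3, ...}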

MapReduce: Failures?
[Diagram: when a map or reduce task fails, the framework simply starts a new copy of that task on another machine.]

MapReduce: Slow Tasks
[Diagram: slow (straggler) tasks are handled the same way: start a new copy of the task and use whichever finishes first.]

MapReduce: Distributed Execution
[Diagram: map and reduce tasks reading from and writing to HDFS data. Image: Wikimedia Commons (RobH/Tbayer (WMF))]

MapReduce for Machine Learning
There are batch algorithms for most machine learning tasks: process the entire dataset, compute the gradient, update the model.
Two kinds of parallel ML algorithm:
- Data parallel: distributed data, shared model.
- Model parallel: data and model are both distributed.
Data-parallel batch algorithms can be implemented in one map-reduce step per iteration (see the sketch below). Model-parallel algorithms require two reduce steps per iteration (reduce, then redistribute to each mapper).
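A sketch of the one-map-reduce-step-per-iteration pattern for a data-parallel batch algorithm (illustrative only; the logistic gradient matches the PySpark example later in the lecture, and the step size is an arbitrary choice):

# One iteration of data-parallel batch gradient descent as a single
# map-reduce step: map computes per-partition partial gradients,
# reduce sums them, then the shared model is updated.
import numpy as np

def grad(w, x, y):
    # Per-example logistic-loss gradient (same form as the PySpark example).
    return (1.0 / (1.0 + np.exp(-y * w.dot(x))) - 1.0) * y * x

def iteration(w, partitions, step=0.1):
    partials = [sum(grad(w, x, y) for x, y in part)   # "map" per partition
                for part in partitions]
    return w - step * sum(partials)                   # "reduce" + update

parts = [[(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]]
w = iteration(np.zeros(2), parts)                     # tiny usage example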

Batch Gradient Descent
[Diagram: model w0 -> w1; map over the data partitions to compute partial gradients, then reduce to update the model.]

Gradient Descent on MR
[Diagram: each iteration runs as a full MapReduce job. Image: Wikimedia Commons (RobH/Tbayer (WMF))]

Gradient Descent on MR: Too Much Disk I/O
[Diagram: the same pipeline, highlighting that every iteration reads from and writes to disk. Image: Wikimedia Commons (RobH/Tbayer (WMF))]

Tech trend: cost of memory
[Plot: price per MB vs. year for RAM, disk, and flash; by 2010 RAM costs about 1 cent/MB. Via http://www.jcmit.com/mem2014.htm]

Approaches
- Hadoop
- Spark
- Parameter Server and MPI

Spark
- Persists data in memory; optimized for batch, data-parallel ML algorithms
- An efficient, general-purpose language for cluster processing of big data
- In-memory query processing (Shark)

Practical Challenges with Hadoop
- Very low-level programming model (Jim Gray)
- Very little reuse of MapReduce code between applications
- Laborious programming: design code, build a jar, deploy on the cluster
- Relies heavily on Java reflection to communicate with to-be-defined application code

Practical Advantages of Spark
- High-level programming model: can be used like SQL or like a tuple store (see the word-count one-liner below)
- Interactivity
- Integrated UDFs (User-Defined Functions)
- High-level model (Scala Actors) for distributed programming
- Scala generics instead of reflection: Spark code is generic over [Key, Value] types
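To make "high-level" concrete, word count takes a few lines of PySpark (a sketch: the path is a placeholder, and sc is the SparkContext introduced on the next slides):

# Word count in a few lines of PySpark.
counts = (sc.textFile("hdfs://.../input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))  # e.g. [('I', 6), ('am', 5), ...]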

Spark: Fault Tolerance
- Hadoop: once computed, don't lose it (persist results to disk)
- Spark: remember how to recompute (lineage)

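A small sketch of what "remember how to recompute" means in practice: an RDD carries its lineage of transformations, and a lost partition is rebuilt by replaying that lineage (the path and filter here are placeholders of my own):

# Lineage in action: errors is defined by transformations on lines;
# if a partition of errors is lost, Spark replays these steps for
# just that partition instead of checkpointing results to disk.
lines = sc.textFile("hdfs://.../app.log", 4)
errors = lines.filter(lambda l: l.startswith("ERROR"))
print(errors.toDebugString())  # shows the lineage Spark would replay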

Spark programming model (Python)
sc = pyspark.SparkContext(...)
raw_ratings = sc.textFile("...", 4)
RDD (Resilient Distributed Dataset):
- A distributed array with 4 partitions; elements are lines of input
- Computed on demand: compute = (re)read from input

Spark programming model (Python)
lines = sc.textFile("...", 4)
print lines.count()
[Diagram: count() reads and counts each of the 4 partitions.]

Spark programming model (Python)
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)
print lines.count(), comments.count()
[Diagram: both counts recompute lines from the input; comments is derived from lines partition by partition.]

Spark programming model (Python)
lines = sc.textFile("...", 4)
lines.cache()  # save, don't recompute!
comments = lines.filter(isComment)
print lines.count(), comments.count()
[Diagram: after cache(), the partitions of lines stay in RAM, so comments.count() reuses them instead of rereading the input.]

Other transformations
rdd.filter(lambda x: x % 2 == 0)   # [1, 2, 3] → [2]
rdd.map(lambda x: x * 2)           # [1, 2, 3] → [2, 4, 6]
rdd.flatMap(lambda x: [x, x + 5])  # [1, 2, 3] → [1, 6, 2, 7, 3, 8]

Shuffle transformations
rdd.groupByKey()  # [(1,'a'), (2,'c'), (1,'b')] → [(1,['a','b']), (2,['c'])]
rdd.sortByKey()   # [(1,'a'), (2,'c'), (1,'b')] → [(1,'a'), (1,'b'), (2,'c')]

Getting data out of RDDs
rdd.reduce(lambda a, b: a * b)  # [1, 2, 3] → 6
rdd.take(2)                     # RDD of [1, 2, 3] → [1, 2] (as a list)
rdd.collect()                   # RDD of [1, 2, 3] → [1, 2, 3] (as a list)
rdd.saveAsTextFile(...)

Example: Logistic Regression in PySpark
points = sc.textFile(...).map(parsePoint).cache()
w = numpy.random.ranf(size = D)  # model vector
for i in range(ITERATIONS):
    gradient = points.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
print "Final model: %s" % w

Spark's Machine Learning Toolkit
MLLib algorithms:
- Classification: SVM, Logistic Regression, Decision Trees, Naive Bayes
- Regression: Linear (with L1 or L2 regularization)
- Unsupervised: Alternating Least Squares, K-Means, SVD
- Optimizers: optimization primitives (SGD, L-BFGS)

Example: Logistic Regression with MLLib
from pyspark.mllib.classification import LogisticRegressionWithSGD
trainData = sc.textFile("...").map(parsePoint)
testData = sc.textFile("...").map(...)
model = LogisticRegressionWithSGD.train(trainData)
predictions = model.predict(testData)
Should remind you of scikit-learn.

Spark Driver and Executors
- The driver runs user interaction and acts as master for batch jobs
- The driver hosts the machine learning models
- Executors hold data partitions
- Tasks process data blocks; typically tasks per executor = number of hardware threads

Spark Shuffle
- Used for groupByKey, sort, and join operations
- Map and reduce tasks run on executors
- Data is partitioned by key by each mapper, saved to buckets, then forwarded to the appropriate reducer using a key => reducer mapping function (see the sketch below)
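The key => reducer mapping function is typically just hash partitioning; a minimal sketch (illustrative, not Spark's actual partitioner class):

# Hash partitioning: every mapper routes a key to the same reducer,
# so all values for that key meet on one node.
def reducer_for(key, num_reducers):
    return hash(key) % num_reducers

buckets = {}
for key, value in [("Sam", 1), ("I", 1), ("Sam", 1)]:
    buckets.setdefault(reducer_for(key, 4), []).append((key, value))
print(buckets)  # both ("Sam", 1) pairs land in the same bucket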

Architectural Consequences
- Simple programming: the model is centralized on the driver and broadcast to other nodes
- Models must fit in single-machine memory, i.e. Spark supports data parallelism but not model parallelism
- Heavy load on the driver; model update time grows with the number of nodes
- Cluster performance on most ML tasks is on par with a single-node system with a GPU
- Shuffle performance is similar to Hadoop, but still improving

Other uses for MapReduce/Spark
Non-ML applications, e.g. data processing:
- Select columns
- Map functions over datasets
- Joins
- GroupBy and aggregates
Spark admits 3 usage modes:
- Type queries interactively; use :replay
- Run (uncompiled) scripts
- Compile Scala Spark code, then use it interactively or in batch

Other notable Spark tools
- SQL-like query support (Shark, Spark SQL)
- BlinkDB (approximate statistical queries)
- Graph operations (GraphX)
- Stream processing (Spark Streaming)
- KeystoneML (data pipelines)

5 Min Break

Approaches
- Hadoop
- Spark
- Parameter Server and MPI

Parameter Server
Originally developed at Google (the DistBelief system) for deep learning, designed to address the following limitations of Hadoop/Spark etc.:
- Full support for minibatch model updates (1000s to millions of model updates per pass over the dataset)
- Full support for model parallelism
- Support for any-time updates or queries*
Now used at Yahoo, Google, Baidu, and in several academic projects.

Parameter Server
[Diagram: the model is distributed across server nodes; the data is distributed across client nodes.]

Parameter Server
[Diagram: client nodes request the model data they need to process the current minibatch.]

Parameter Server
[Diagram: servers send the requested model data back to each client.]

Parameter Server
[Diagram: clients send model updates (gradients) back to the appropriate servers.]

Parameter Server
[Diagram: servers aggregate client updates into the next model iteration. The full round trip is sketched below.]
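Pulling the diagram steps together, one round of the protocol looks roughly like this (a sketch under assumed names; the model dict stands in for shards spread across server nodes, and pull/push stand in for the real server RPCs):

# One parameter-server round, in miniature.
import numpy as np

model = {k: np.zeros(3) for k in range(10)}   # "sharded" model entries

def pull(keys):                   # steps 2-3: request and receive weights
    return {k: model[k] for k in keys}

def push(updates, lr=0.1):        # steps 4-5: send updates, servers aggregate
    for k, g in updates.items():
        model[k] -= lr * g

keys = [2, 5, 7]                          # entries this minibatch touches
w = pull(keys)                            # client fetches current weights
grads = {k: np.ones(3) for k in keys}     # placeholder gradients
push(grads)                               # servers fold updates into model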

Parameter Server
- Scales to 10s of parameter servers and 100s of clients
- Handles sparse data easily
- Holds records for most large-model ML tasks

Parameter Server: Minuses
- Need to schedule both client and server clusters
- Multiple server clusters are needed for multi-step ML pipelines
- The design decision to support anytime client updates forces complex synchronization and locking logic onto the server
- Asymmetry between the number of servers and clients often leads to network bottlenecks

Optimizing Parameter Servers
- Balance network bandwidth => nservers = nclients
- Gain 2x network bandwidth by folding clients and servers onto the same nodes

Optimizing Parameter Servers → MPI
Drop client-data-push in favor of server pull: no need for synchronization or locking on the server; use relaxed synchronization instead.
The result is a version of MPI (Message Passing Interface), a protocol used in scientific computing. For cluster computing, MPI needs to be modified to:
- Support pull/push of a subset of the model data
- Allow loose synchronization of clients: some dropped data and timeouts
Good current research topic!

Summary
- Why Big Data (and Big Models)?
- Hadoop
- Spark
- Parameter Server and MPI