Spark System xuejilong@gmai.com
Background
Papers by Matei Zaharia et al.:
- Spark: Cluster Computing with Working Sets (HotCloud 2010, June 2010)
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (NSDI 2012, April 2012; Best Paper Award)
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters ("Streaming Spark") (HotCloud 2012, June 2012)
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale (SOSP 2013, Nov 2013)
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Motivation
MapReduce greatly simplified "big data" analysis on large, unreliable clusters
But it is not well suited to some applications:
- Iterative computation (multiple stages)
- Interactive ad-hoc queries
This led to specialized frameworks, e.g. Pregel for graph processing
Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage, which is slow!
Examples
[Figure: an iterative job reads its input from HDFS and writes back to HDFS between every iteration; interactive queries 1-3 each re-read the same input from HDFS to produce results 1-3]
Slow due to replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing
[Figure: the same workloads, but after one-time processing the iterations and queries share the input in memory]
Goal: design a distributed memory abstraction that is both fault-tolerant and efficient
Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
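A toy, single-machine sketch of this idea (not Spark's implementation; `MiniRDD` and the sample log lines are invented for illustration): a dataset is just its base data plus the lineage of coarse-grained transformations that built it, so a lost result can always be recomputed deterministically.

```scala
// Hypothetical sketch of lineage: an "RDD" is a base collection plus the
// chain of coarse-grained transformations that built it. Nothing here is
// Spark code; it only illustrates the recovery idea.
case class MiniRDD[A](base: Seq[String], lineage: Seq[String] => Seq[A]) {
  def map[B](f: A => B): MiniRDD[B] =
    MiniRDD(base, lineage.andThen(_.map(f)))
  def filter(p: A => Boolean): MiniRDD[A] =
    MiniRDD(base, lineage.andThen(_.filter(p)))
  // "Action": materialize the dataset by replaying the lineage over the base.
  def collect(): Seq[A] = lineage(base)
}

val lines  = MiniRDD[String](Seq("ERROR\tdisk full", "INFO\tok"), identity)
val errors = lines.filter(_.startsWith("ERROR")).map(_.split('\t')(1))

// Nothing is cached, so "losing" the result is harmless: collect()
// deterministically recomputes it from the lineage on every call.
val result = errors.collect()   // Seq("disk full")
```

The point of the restriction to coarse-grained transformations is visible here: the lineage logs one operation per dataset, not one update per record, so it stays tiny no matter how large the data is.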
RDD Recovery
[Figure: the iterative and interactive workloads again, with one-time processing of the input, in-memory sharing, and lost data recovered through RDDs]
Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
They unify many current programming models:
- Data flow models: MapReduce, Dryad, SQL, ...
- Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...
Tradeoff Space
[Figure: granularity of updates (fine to coarse) vs. write throughput (low to high), with memory and network bandwidth marked as limits. Fine-grained, high-throughput systems such as K-V stores, databases, and RAMCloud are best for transactional workloads; coarse-grained HDFS and RDDs are best for batch workloads]
Spark Programming Interface
DryadLINQ-like API in the Scala language
Provides:
- Resilient distributed datasets (RDDs)
- Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
- Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")          // base RDD
    errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
    messages = errors.map(_.split('\t')(2))
    messages.persist()
    messages.filter(_.contains("foo")).count      // action
    messages.filter(_.contains("bar")).count

[Figure: the master sends tasks to workers holding blocks 1-3 of the file; each worker caches its Msgs. partition and returns results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

    messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))]
Fault Recovery Results
[Figure: per-iteration running times, annotated with the point where a failure happens]
Example: PageRank
Links and ranks are repeatedly joined
- The Links RDD is reused across iterations
- Can co-partition them (e.g. hash both on URL) to avoid shuffles
- Problem: generates a lot of intermediate RDDs
[Figure: dataflow in which Links (url, neighbors) is joined with Ranks_i (url, rank) to produce Contribs_i, which are reduced into Ranks_i+1]
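The join/reduce loop on this slide can be sketched on a single machine. This is a toy example with a made-up three-page graph and the standard 0.15/0.85 damping; plain Scala Maps stand in for the co-partitioned Links and Ranks RDDs.

```scala
// Hypothetical three-page link graph; "Links (url, neighbors)" from the slide.
val links: Map[String, Seq[String]] = Map(
  "a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))

// "Ranks_0 (url, rank)": everyone starts with rank 1.0.
var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap

for (_ <- 1 to 10) {
  // join(Links, Ranks_i): each page sends rank/outDegree to its neighbors
  // (the Contribs_i dataset of the slide's dataflow).
  val contribs = links.toSeq.flatMap { case (url, neighbors) =>
    neighbors.map(n => n -> ranks(url) / neighbors.size)
  }
  // reduce: sum contributions per URL and apply the damping formula,
  // producing Ranks_i+1.
  ranks = contribs.groupBy(_._1).map { case (url, cs) =>
    url -> (0.15 + 0.85 * cs.map(_._2).sum)
  }
}
```

In real Spark, hashing both `links` and `ranks` on URL keeps matching keys on the same node, so each iteration's join needs no shuffle; the sketch above has no such layout, it only shows the per-iteration dataflow.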
PageRank Performance
Implementation
Runs on Mesos [NSDI 11] to share clusters with Hadoop
Can read from any Hadoop input source (HDFS, S3, ...)
[Figure: Spark, Hadoop, and MPI running side by side on Mesos-managed nodes]
No changes to Scala language or compiler
Programming Models Implemented on Spark
RDDs can express many existing parallel models:
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)
» Enables apps to efficiently intermix these models
All are based on coarse-grained operations
Behavior with Insufficient RAM
Scalability
[Figures: scalability results for Logistic Regression and K-Means]
Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications
They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
Discretized Streams Fault-Tolerant Streaming Computation at Scale
Motivation
Many important applications need to process large data streams arriving in real time:
- User activity statistics (e.g. Facebook's Puma)
- Spam detection
- Traffic estimation
- Network intrusion detection
Target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency
Challenge
To run at large scale, the system has to be both:
- Fault-tolerant: recover quickly from failures and stragglers
- Cost-efficient: do not require significant hardware beyond that needed for basic processing
Existing streaming systems don't have both properties
Traditional Streaming Systems
"Record-at-a-time" processing model:
- Each node has mutable state
- For each record, update state & send new records
[Figure: input records pushed through nodes 1-3, each holding mutable state]
Fault Tolerance in These Systems
Fault tolerance via replication or upstream backup:
[Figure, left: replication, with nodes 1-3 mirrored by nodes 1'-3' and kept in sync; fast recovery, but 2x hardware cost]
[Figure, right: upstream backup, with a single standby node; only need 1 standby, but slow to recover]
Observation
Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently:
- Divide the job into deterministic tasks
- Rerun failed/slow tasks in parallel on other nodes
Idea: run a streaming computation as a series of very small, deterministic batches
- Same recovery schemes, at a much smaller timescale
- Work to make the batch size as small as possible
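The micro-batch idea above can be sketched in a few lines of plain Scala (toy data and the `countWords` job are invented for illustration): the stream is cut into small immutable batches, each handled by the same deterministic batch job, so any lost result is rebuilt by simply rerunning that job on its input batch.

```scala
// Deterministic per-batch job (the "batch operation" run on each interval).
def countWords(batch: Seq[String]): Map[String, Int] =
  batch.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }

val arriving = Seq("a b", "b c", "a a")   // records arriving over time
val batches  = arriving.grouped(1).toSeq  // cut into tiny immutable batches

// Run the same deterministic job on every batch.
val perBatch = batches.map(countWords)

// Recovery: if perBatch(1) is lost, rerunning countWords on batches(1)
// reproduces exactly the same value, because the job is deterministic.
val rebuilt = countWords(batches(1))
```

This is precisely the property the slide relies on: determinism turns recovery into recomputation, the same scheme MapReduce uses for failed tasks, just at a much smaller timescale.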
Discretized Stream
[Figure: at t = 1 and t = 2, a batch operation pulls an immutable input dataset (stored reliably) and produces an immutable dataset (output or state) stored in memory without replication]
Parallel Recovery
If a node fails or straggles, recompute its dataset partitions in parallel on other nodes
[Figure: a map from an input dataset to an output dataset; lost partitions are rebuilt in parallel from their inputs]
Faster recovery than upstream backup, without the cost of replication
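A toy sketch of the recovery pattern, with threads standing in for other nodes (the partitions and the per-partition task are invented): each lost partition's deterministic task is rerun concurrently, rather than replaying everything serially through one standby.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Reliably stored input, split into partitions.
val partitions = Vector(Seq(1, 2), Seq(3, 4), Seq(5, 6))
def task(p: Seq[Int]): Int = p.sum      // deterministic per-partition task

// Suppose the node holding partitions 0 and 2 fails: rerun both tasks
// in parallel (here, on separate threads) instead of one after another.
val lost = Seq(0, 2)
val recovered = Await.result(
  Future.sequence(lost.map(i => Future(task(partitions(i))))),
  10.seconds)                           // Seq(3, 11)
```

Because every surviving node can pick up some of the lost partitions, recovery time shrinks as the cluster grows, which is what makes this faster than upstream backup without replication's 2x hardware cost.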
Speed up Batch Jobs
Prototype built on Spark: processes 2 GB/s (20M records/s) of data on 50 nodes at sub-second latency
[Figure: max throughput within a given latency bound (1 or 2 s)]
Programming Model
A discretized stream (D-stream) is a sequence of immutable, partitioned datasets
- Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark
Deterministic transformation operators produce new streams
API
LINQ-like language-integrated API in Scala:

    pageViews = readStream("...", "1s")
    ones = pageViews.map(ev => (ev.url, 1))
    counts = ones.runningReduce(_ + _)

[Figure: at t = 1 and t = 2, the pageViews, ones, and counts D-streams connected by map and reduce; each box is an RDD, each inner rectangle a partition]
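The semantics of `runningReduce` on the slide can be imitated with plain Scala collections (the two-interval `pageViews` data is made up; each inner Seq plays the role of one "1s" batch): the counts at time t are the counts at t-1 merged with the t-th batch of (url, 1) pairs.

```scala
// Two batches of page-view URLs, one per "1s" interval.
val pageViews = Seq(Seq("x", "y"), Seq("x"))

// map(ev => (ev.url, 1)) from the slide, applied per batch.
val ones = pageViews.map(_.map(url => url -> 1))

// runningReduce(_ + _): fold each batch into the previous counts state,
// yielding one counts dataset per interval.
val counts = ones.scanLeft(Map.empty[String, Int]) { (acc, batch) =>
  batch.foldLeft(acc) { case (m, (url, n)) =>
    m.updated(url, m.getOrElse(url, 0) + n)
  }
}.tail   // drop the empty seed state
```

Each element of `counts` corresponds to one of the per-interval `counts` RDDs in the figure; because each is derived deterministically from the previous state and one input batch, any of them can be rebuilt by lineage.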
Evaluation: Comparison with Storm
Evaluation: Failure Recovery
Evaluation: Speculative Execution
Conclusion
D-Streams forgo traditional streaming wisdom by batching data in small timesteps
- Enable an efficient, new parallel recovery scheme
- Let users seamlessly intermix streaming, batch, and interactive queries