Spark System xuejilong@gmai.com
Background
Papers by Matei Zaharia et al.:
- Spark: Cluster Computing with Working Sets (HotCloud 2010, June 2010)
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (NSDI 2012, April 2012; Best Paper Award)
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters ("Streaming Spark") (HotCloud 2012, June 2012)
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale (SOSP 2013, Nov 2013)
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Motivation
MapReduce greatly simplified "big data" analysis on large, unreliable clusters
But it is not well suited to some applications:
- Iterative computation (multiple stages)
- Interactive ad-hoc queries
This led to specialized frameworks, e.g. Pregel for graph processing
Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage, which is slow!
Examples
[Figure: an iterative job reads its input from HDFS and writes back to HDFS between every iteration; interactive queries 1-3 each re-read the same input from HDFS to produce results 1-3]
Slow due to replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing
[Figure: the same workloads, but after one-time processing the iterations and queries share the input in memory]
Goal: design a distributed memory abstraction that is both fault-tolerant and efficient
Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
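A toy, single-machine sketch of this idea (not Spark's implementation; `MiniRDD` and the sample log lines are invented for illustration): a dataset is just its base data plus the lineage of coarse-grained transformations that built it, so a lost result can always be recomputed deterministically.

```scala
// Hypothetical sketch of lineage: an "RDD" is a base collection plus the
// chain of coarse-grained transformations that built it. Nothing here is
// Spark code; it only illustrates the recovery idea.
case class MiniRDD[A](base: Seq[String], lineage: Seq[String] => Seq[A]) {
  def map[B](f: A => B): MiniRDD[B] =
    MiniRDD(base, lineage.andThen(_.map(f)))
  def filter(p: A => Boolean): MiniRDD[A] =
    MiniRDD(base, lineage.andThen(_.filter(p)))
  // "Action": materialize the dataset by replaying the lineage over the base.
  def collect(): Seq[A] = lineage(base)
}

val lines  = MiniRDD[String](Seq("ERROR\tdisk full", "INFO\tok"), identity)
val errors = lines.filter(_.startsWith("ERROR")).map(_.split('\t')(1))

// Nothing is cached, so "losing" the result is harmless: collect()
// deterministically recomputes it from the lineage on every call.
val result = errors.collect()   // Seq("disk full")
```

The point of the restriction to coarse-grained transformations is visible here: the lineage logs one operation per dataset, not one update per record, so it stays tiny no matter how large the data is.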
RDD Recovery
[Figure: the iterative and interactive workloads again, with one-time processing of the input, in-memory sharing, and lost data recovered through RDDs]
Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
They unify many current programming models:
- Data flow models: MapReduce, Dryad, SQL, ...
- Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...
Tradeoff Space
[Figure: granularity of updates (fine to coarse) vs. write throughput (low to high), with memory and network bandwidth marked as limits. Fine-grained, high-throughput systems such as K-V stores, databases, and RAMCloud are best for transactional workloads; coarse-grained HDFS and RDDs are best for batch workloads]
Spark Programming Interface
DryadLINQ-like API in the Scala language
Provides:
- Resilient distributed datasets (RDDs)
- Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
- Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")          // base RDD
    errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
    messages = errors.map(_.split('\t')(2))
    messages.persist()
    messages.filter(_.contains("foo")).count      // action
    messages.filter(_.contains("bar")).count

[Figure: the master sends tasks to workers holding blocks 1-3 of the file; each worker caches its Msgs. partition and returns results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

    messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))]
Fault Recovery Results
[Figure: per-iteration running times, annotated with the point where a failure happens]
Example: PageRank
Links and ranks are repeatedly joined
- The Links RDD is reused across iterations
- Can co-partition them (e.g. hash both on URL) to avoid shuffles
- Problem: generates a lot of intermediate RDDs
[Figure: dataflow in which Links (url, neighbors) is joined with Ranks_i (url, rank) to produce Contribs_i, which are reduced into Ranks_i+1]
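The join/reduce loop on this slide can be sketched on a single machine. This is a toy example with a made-up three-page graph and the standard 0.15/0.85 damping; plain Scala Maps stand in for the co-partitioned Links and Ranks RDDs.

```scala
// Hypothetical three-page link graph; "Links (url, neighbors)" from the slide.
val links: Map[String, Seq[String]] = Map(
  "a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))

// "Ranks_0 (url, rank)": everyone starts with rank 1.0.
var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap

for (_ <- 1 to 10) {
  // join(Links, Ranks_i): each page sends rank/outDegree to its neighbors
  // (the Contribs_i dataset of the slide's dataflow).
  val contribs = links.toSeq.flatMap { case (url, neighbors) =>
    neighbors.map(n => n -> ranks(url) / neighbors.size)
  }
  // reduce: sum contributions per URL and apply the damping formula,
  // producing Ranks_i+1.
  ranks = contribs.groupBy(_._1).map { case (url, cs) =>
    url -> (0.15 + 0.85 * cs.map(_._2).sum)
  }
}
```

In real Spark, hashing both `links` and `ranks` on URL keeps matching keys on the same node, so each iteration's join needs no shuffle; the sketch above has no such layout, it only shows the per-iteration dataflow.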
PageRank Performance
Implementation
Runs on Mesos [NSDI 11] to share clusters with Hadoop
Can read from any Hadoop input source (HDFS, S3, ...)
[Figure: Spark, Hadoop, and MPI running side by side on Mesos-managed nodes]
No changes to Scala language or compiler
Programming Models Implemented on Spark
RDDs can express many existing parallel models:
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)
» Enables apps to efficiently intermix these models
All are based on coarse-grained operations
Behavior with Insufficient RAM
Scalability
[Figures: scalability results for Logistic Regression and K-Means]
Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications
They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
Discretized Streams Fault-Tolerant Streaming Computation at Scale
Motivation
Many important applications need to process large data streams arriving in real time:
- User activity statistics (e.g. Facebook's Puma)
- Spam detection
- Traffic estimation
- Network intrusion detection
Target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency
Challenge
To run at large scale, the system has to be both:
- Fault-tolerant: recover quickly from failures and stragglers
- Cost-efficient: do not require significant hardware beyond that needed for basic processing
Existing streaming systems don't have both properties
Traditional Streaming Systems
"Record-at-a-time" processing model:
- Each node has mutable state
- For each record, update state & send new records
[Figure: input records pushed through nodes 1-3, each holding mutable state]
Fault Tolerance in These Systems
Fault tolerance via replication or upstream backup:
[Figure, left: replication, with nodes 1-3 mirrored by nodes 1'-3' and kept in sync; fast recovery, but 2x hardware cost]
[Figure, right: upstream backup, with a single standby node; only need 1 standby, but slow to recover]
Observation
Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently:
- Divide the job into deterministic tasks
- Rerun failed/slow tasks in parallel on other nodes
Idea: run a streaming computation as a series of very small, deterministic batches
- Same recovery schemes, at a much smaller timescale
- Work to make the batch size as small as possible
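The micro-batch idea above can be sketched in a few lines of plain Scala (toy data and the `countWords` job are invented for illustration): the stream is cut into small immutable batches, each handled by the same deterministic batch job, so any lost result is rebuilt by simply rerunning that job on its input batch.

```scala
// Deterministic per-batch job (the "batch operation" run on each interval).
def countWords(batch: Seq[String]): Map[String, Int] =
  batch.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }

val arriving = Seq("a b", "b c", "a a")   // records arriving over time
val batches  = arriving.grouped(1).toSeq  // cut into tiny immutable batches

// Run the same deterministic job on every batch.
val perBatch = batches.map(countWords)

// Recovery: if perBatch(1) is lost, rerunning countWords on batches(1)
// reproduces exactly the same value, because the job is deterministic.
val rebuilt = countWords(batches(1))
```

This is precisely the property the slide relies on: determinism turns recovery into recomputation, the same scheme MapReduce uses for failed tasks, just at a much smaller timescale.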
Discretized Stream
[Figure: at t = 1 and t = 2, a batch operation pulls an immutable input dataset (stored reliably) and produces an immutable dataset (output or state) stored in memory without replication]
Parallel Recovery
If a node fails or straggles, recompute its dataset partitions in parallel on other nodes
[Figure: a map from an input dataset to an output dataset; lost partitions are rebuilt in parallel from their inputs]
Faster recovery than upstream backup, without the cost of replication
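A toy sketch of the recovery pattern, with threads standing in for other nodes (the partitions and the per-partition task are invented): each lost partition's deterministic task is rerun concurrently, rather than replaying everything serially through one standby.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Reliably stored input, split into partitions.
val partitions = Vector(Seq(1, 2), Seq(3, 4), Seq(5, 6))
def task(p: Seq[Int]): Int = p.sum      // deterministic per-partition task

// Suppose the node holding partitions 0 and 2 fails: rerun both tasks
// in parallel (here, on separate threads) instead of one after another.
val lost = Seq(0, 2)
val recovered = Await.result(
  Future.sequence(lost.map(i => Future(task(partitions(i))))),
  10.seconds)                           // Seq(3, 11)
```

Because every surviving node can pick up some of the lost partitions, recovery time shrinks as the cluster grows, which is what makes this faster than upstream backup without replication's 2x hardware cost.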
Speed up Batch Jobs
Prototype built on Spark: processes 2 GB/s (20M records/s) of data on 50 nodes at sub-second latency
[Figure: max throughput within a given latency bound (1 or 2 s)]
Programming Model
A discretized stream (D-stream) is a sequence of immutable, partitioned datasets
- Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark
Deterministic transformation operators produce new streams
API
LINQ-like language-integrated API in Scala:

    pageViews = readStream("...", "1s")
    ones = pageViews.map(ev => (ev.url, 1))
    counts = ones.runningReduce(_ + _)

[Figure: at t = 1 and t = 2, the pageViews, ones, and counts D-streams connected by map and reduce; each box is an RDD, each inner rectangle a partition]
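The semantics of `runningReduce` on the slide can be imitated with plain Scala collections (the two-interval `pageViews` data is made up; each inner Seq plays the role of one "1s" batch): the counts at time t are the counts at t-1 merged with the t-th batch of (url, 1) pairs.

```scala
// Two batches of page-view URLs, one per "1s" interval.
val pageViews = Seq(Seq("x", "y"), Seq("x"))

// map(ev => (ev.url, 1)) from the slide, applied per batch.
val ones = pageViews.map(_.map(url => url -> 1))

// runningReduce(_ + _): fold each batch into the previous counts state,
// yielding one counts dataset per interval.
val counts = ones.scanLeft(Map.empty[String, Int]) { (acc, batch) =>
  batch.foldLeft(acc) { case (m, (url, n)) =>
    m.updated(url, m.getOrElse(url, 0) + n)
  }
}.tail   // drop the empty seed state
```

Each element of `counts` corresponds to one of the per-interval `counts` RDDs in the figure; because each is derived deterministically from the previous state and one input batch, any of them can be rebuilt by lineage.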
Evaluation: Comparison with Storm
Evaluation: Failure Recovery
Evaluation: Speculative Execution
Conclusion
D-Streams forgo traditional streaming wisdom by batching data in small timesteps
- Enable an efficient, new parallel recovery scheme
- Let users seamlessly intermix streaming, batch, and interactive queries