Discretized Streams: Fault-Tolerant Streaming Computation at Scale [SOSP '13]
Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion Stoica
UC Berkeley
Presented by Ling Ding
Some slides borrowed from the authors

Motivation
- Many big-data applications need to process large data streams in near-real time
  - Require tens to hundreds of nodes
  - Require second-scale latencies
- Examples: website monitoring, fraud detection, ad monetization

Challenge
- Stream processing systems must recover from failures and stragglers quickly and efficiently
  - More important for streaming systems than batch systems
- Traditional streaming systems don’t achieve these properties simultaneously

Outline
- Limitations of Traditional Streaming Systems
- Discretized Stream Processing
- Unification with Batch and Interactive Processing

Traditional Streaming Systems
Continuous operator model
- Each node runs an operator with in-memory mutable state
- For each input record, state is updated and new records are sent out
- Mutable state is lost if node fails
- Various techniques exist to make state fault-tolerant
[Diagram: input records, nodes 1 to 3, mutable state]

Fault-tolerance in Traditional Systems
Node Replication [e.g. Borealis, Flux]
- A separate set of “hot failover” nodes processes the same data streams
- Synchronization protocols ensure exact ordering of records in both sets
- On failure, the system switches over to the failover nodes
- Fast recovery, but 2x hardware cost
[Diagram: input, hot failover nodes, sync protocol]

Fault-tolerance in Traditional Systems
Upstream Backup [e.g. TimeStream, Storm]
- Each node maintains a backup of the records it has forwarded since the last checkpoint
- A “cold failover” node is maintained
- On failure, upstream nodes replay the backup records serially to the failover node to recreate the state
- Only need 1 standby, but slow recovery
[Diagram: input, backup, replay, cold failover node]

Slow Nodes in Traditional Systems
- Neither approach (Node Replication or Upstream Backup) handles stragglers
[Diagram: Node Replication and Upstream Backup pipelines with their inputs]

Goal
- Scales to hundreds of nodes
- Achieves second-scale latency
- Tolerates node failures and stragglers
- Sub-second fault and straggler recovery
- Minimal overhead beyond base processing

Why is it hard?
- Stateful continuous operators tightly integrate “computation” with “mutable state”
- This makes it harder to define clear boundaries where computation and state can be moved around
[Diagram: input records -> stateful continuous operator with mutable state -> output]

Dissociate computation from state
- Make state immutable and break computation into small, deterministic, stateless tasks
- This defines clear boundaries where state and computation can be moved around independently
[Diagram: a chain of stateless tasks over input 1, input 2 and input 3, passing state 1 and state 2 between them]
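The idea of a stateless, deterministic task over immutable state can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: `count_task` and the sample inputs are hypothetical names chosen for this example.

```python
def count_task(prev_state, batch):
    """Stateless, deterministic task: derive a new state from the old one."""
    new_state = dict(prev_state)      # copy; prev_state is never mutated
    for key in batch:
        new_state[key] = new_state.get(key, 0) + 1
    return new_state

state0 = {}
state1 = count_task(state0, ["a", "b", "a"])   # state 1 from input 1
state2 = count_task(state1, ["b", "c"])        # state 2 from state 1 + input 2

# Re-running the task on the same inputs rebuilds the same state, so a lost
# state can be recomputed on any node that has the inputs.
rebuilt = count_task(state0, ["a", "b", "a"])
print(state1, state2, rebuilt == state1)
```

Because the task never touches its inputs, the state and the computation can be placed or re-run independently of each other.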

Batch Processing Systems!

Batch Processing Systems
Batch processing systems like MapReduce divide
- Data into small partitions
- Jobs into small, deterministic, stateless map / reduce tasks
[Diagram: immutable input dataset -> stateless map tasks (M) -> immutable map outputs -> stateless reduce tasks (R) -> immutable output dataset]

Parallel Recovery
- Failed tasks are re-executed on the other nodes in parallel
[Diagram: immutable input dataset -> stateless map tasks (M) -> stateless reduce tasks (R) -> immutable output dataset, with the failed tasks spread over the remaining nodes]

Discretized Stream Processing

- Run a streaming computation as a series of small, deterministic batch jobs
- Store intermediate state data in cluster memory
- Try to make batch sizes as small as possible to get second-scale latencies
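As a rough sketch of this idea (not Spark code), the stream can be cut into fixed-length intervals and one small batch job run per interval. The helper names `discretize` and `run_stream`, the 1-second interval and the sample records are assumptions made for illustration.

```python
from collections import defaultdict

def discretize(records, interval=1.0):
    """Group (timestamp, value) records into consecutive fixed-length intervals."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // interval)].append(value)
    # Intervals without any record are simply skipped in this sketch.
    return [batches[i] for i in sorted(batches)]

def run_stream(records, batch_job, interval=1.0):
    """Run one small, deterministic batch job per interval."""
    return [batch_job(batch) for batch in discretize(records, interval)]

records = [(0.1, 3), (0.7, 4), (1.2, 5), (2.5, 6), (2.9, 1)]
sums = run_stream(records, sum)
print(sums)
```

A smaller `interval` lowers latency but means more, smaller jobs, which is why job-launch overhead matters in a real system.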

[Diagram: the input stream is cut into intervals (time = 0 - 1, time = 1 - 2, …); batch operations on each interval's input produce the output / state stream]
- Input: replicated dataset stored in memory
- Output or State: non-replicated dataset stored in memory

Example: Counting page views
- Discretized Stream (DStream) is a sequence of immutable, partitioned datasets
- Can be created from live data streams or by applying bulk, parallel transformations on other DStreams

    views = readStream("...", "1 sec")              // creating a DStream
    ones = views.map(ev => (ev.url, 1))             // transformation
    counts = ones.runningReduce((x, y) => x + y)

[Diagram: DStreams views, ones and counts over intervals t: 0 - 1 and t: 1 - 2, linked by map and reduce]
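The page-view example can be mimicked in plain Python to show what the map and running reduce steps do per interval. This is a sketch under assumptions: each interval is modelled as a list of URLs, and `map_to_ones` and `running_reduce` are hypothetical stand-ins for the slide's `map` and `runningReduce`.

```python
from collections import Counter

def map_to_ones(views):
    """map step: turn each page view into a (url, 1) pair."""
    return [(url, 1) for url in views]

def running_reduce(prev_counts, ones):
    """reduce step: fold this interval's pairs into a copy of the previous counts."""
    counts = Counter(prev_counts)
    for url, one in ones:
        counts[url] += one
    return counts

view_batches = [["/a", "/b", "/a"], ["/b"], ["/c", "/a"]]   # one list per interval
counts = Counter()
history = []
for views in view_batches:
    counts = running_reduce(counts, map_to_ones(views))
    history.append(dict(counts))
print(history)
```

Each entry of `history` corresponds to one dataset in the `counts` DStream; earlier entries are never modified.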

Fine-grained Lineage
- Datasets track fine-grained operation lineage
- Datasets are periodically checkpointed asynchronously to prevent long lineages
[Diagram: views, ones and counts over t: 0 - 1, t: 1 - 2 and t: 2 - 3, linked by map and reduce]

Parallel Fault Recovery
- Lineage is used to recompute partitions lost due to failures
- Datasets on different time steps are recomputed in parallel
- Partitions within a dataset are also recomputed in parallel
[Diagram: views, ones and counts over t: 0 - 1, t: 1 - 2 and t: 2 - 3, linked by map and reduce]
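A minimal sketch of lineage-based recomputation follows. It shows only the lineage walk, not the parallel scheduling across nodes, and the `Dataset` class here is a hypothetical toy, not Spark's RDD API.

```python
class Dataset:
    """Immutable dataset that remembers how it was derived (its lineage)."""
    def __init__(self, name, fn, parents=(), data=None):
        self.name, self.fn, self.parents = name, fn, parents
        self.data = data              # cached contents; None means lost or not yet computed

    def get(self, log):
        if self.data is None:         # rebuild from parents by re-running fn
            log.append(self.name)
            self.data = self.fn(*[p.get(log) for p in self.parents])
        return self.data

def reduce_counts(pairs):
    out = {}
    for url, n in pairs:
        out[url] = out.get(url, 0) + n
    return out

views = Dataset("views", None, data=["/a", "/b", "/a"])     # replicated input
ones = Dataset("ones", lambda v: [(u, 1) for u in v], (views,))
counts = Dataset("counts", reduce_counts, (ones,))

log = []
first = counts.get(log)               # normal run: computes ones, then counts
ones.data = counts.data = None        # simulate losing the in-memory copies
log2 = []
recovered = counts.get(log2)          # recomputed from the surviving input
print(recovered, log2)
```

Only the datasets that were actually lost are recomputed; the replicated input is read as-is.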

Comparison to Upstream Backup
- Faster recovery than upstream backup, without the 2x cost of node replication
- Discretized Stream Processing: parallelism within a batch and parallelism across time intervals
- Upstream Backup: stream replayed serially
[Diagram: views, ones and counts over t: 0 - 1, t: 1 - 2 and t: 2 - 3, next to an upstream-backup node holding state]

How much faster than Upstream Backup?
- Recovery time = time taken to recompute and catch up
  - Depends on available resources in the cluster
  - Lower system load before failure allows faster recovery
- Parallel recovery with 10 nodes is faster than with 5 nodes
- Parallel recovery with 5 nodes is faster than upstream backup

Parallel Straggler Recovery
Straggler mitigation techniques
- Detect slow tasks (e.g. 2x slower than other tasks)
- Speculatively launch more copies of the tasks in parallel on other machines
- Masks the impact of slow nodes on the progress of the system
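The detection rule and the effect of a speculative copy can be sketched as follows. The use of the median as the reference for "other tasks", the function names and the timings are assumptions for illustration; the slides only give the 2x rule of thumb.

```python
from statistics import median

def find_stragglers(running_secs, factor=2.0):
    """Return ids of tasks that have run more than `factor` times the median."""
    typical = median(running_secs.values())
    return sorted(t for t, secs in running_secs.items() if secs > factor * typical)

def completion_time(original_secs, backup_start, backup_secs):
    """With a speculative copy, the task is done when the first copy finishes."""
    return min(original_secs, backup_start + backup_secs)

running = {"t1": 1.0, "t2": 1.1, "t3": 0.9, "t4": 3.5}
slow = find_stragglers(running)

# Hypothetical timings: the original copy would take 10 s, while a backup
# launched at 2 s on another machine finishes in 1 s.
done_at = completion_time(10.0, 2.0, 1.0)
print(slow, done_at)
```

Speculation is safe here only because tasks are deterministic and write immutable outputs, so either copy's result can be used.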

Evaluation

Spark Streaming
- Implemented using Spark processing engine*
  - Spark allows datasets to be stored in memory, and automatically recovers them using lineage
- Modifications required to reduce job-launching overheads from seconds to milliseconds
[*Resilient Distributed Datasets - NSDI, 2012]

How fast is Spark Streaming?
- Can process 60M records/second on 100 nodes at 1 second latency
- Tested with core EC2 instances and 100 streams of text
- Workloads: count the sentences having a keyword; WordCount over a 30 sec sliding window

How does it compare to others?
Throughput comparable to other commercial stream processing systems

System            Throughput per core [records / sec]
Spark Streaming   160k
Oracle CEP        125k
Esper             100k
StreamBase        30k
Storm

[Refer to the paper for citations]

How fast can it recover from faults?
- Recovery time improves with more frequent checkpointing and more nodes
[Plot: Word Count over 30 sec window, with the failure marked]

How fast can it recover from stragglers?
Speculative execution of slow tasks masks the effect of stragglers

Unification with Batch and Interactive Processing

Unification with Batch and Interactive Processing
- Discretized Streams create a single programming and execution model for running streaming, batch and interactive jobs
- Combine live data streams with historic data
    liveCounts.join(historicCounts).map(...)
- Interactively query live streams
    liveCounts.slice("21:00", "21:05").count()

App combining live + historic data
- Mobile Millennium Project: real-time estimation of traffic transit times using live and past GPS observations
- Markov chain Monte Carlo simulations on GPS observations
- Very CPU intensive
- Scales linearly with cluster size

Takeaways
- Large scale streaming systems must handle faults and stragglers
- Discretized Streams model streaming computation as a series of batch jobs
  - Uses simple techniques to exploit parallelism in streams
  - Scales to 100 nodes with 1 second latency
  - Recovers from failures and stragglers very fast
- Spark Streaming is open source - spark-project.org
  - Used in production by ~ 10 organizations!

Structured Streaming Spark Summit 2016

Streaming in Apache Spark
- Spark Streaming changed how people write streaming apps
  - Functional, concise and expressive
  - Fault-tolerant state management
  - Unified stack with batch processing
- More than 50% of users consider it the most important part of Apache Spark
[Diagram: SQL, Streaming, MLlib and GraphX on top of Spark Core]

Streaming apps are growing more complex

Streaming computations don’t run in isolation
Need to interact with batch data, interactive analysis, machine learning, etc.

Use case: IoT Device Monitoring
- IoT events: event stream from Kafka
- ETL into long term storage
  - Prevent data loss
  - Prevent duplicates
- Status monitoring
  - Handle late data
  - Aggregate on windows on event time
- Interactively debug issues
  - Consistency
- Anomaly detection
  - Learn models offline
  - Use online + continuous learning
Continuous Applications: not just streaming any more

Pain points with DStreams
- Processing with event-time, dealing with late data
  - DStream API exposes batch time, hard to incorporate event-time
- Interoperate streaming with batch AND interactive
  - RDD/DStream has similar API, but still requires translation
- Reasoning about end-to-end guarantees
  - Requires carefully constructing sinks that handle failures correctly
  - Data consistency in the storage while being updated

Structured Streaming

The simplest way to perform streaming analytics is not having to reason about streaming at all

New Model
- Input: data from source as an append-only table
- Trigger: how frequently to check input for new data
- Query: operations on input
  - usual map/filter/reduce
  - new window, session ops
[Diagram: timeline with trigger every 1 sec; at times 1, 2 and 3 the input holds data up to 1, 2 and 3, and the query runs on it]

New Model
- Result: final operated table updated every trigger interval
- Output: what part of result to write to data sink after every trigger
  - Complete output: write full result table every time
  - Delta output: write only the rows that changed in result from previous batch
  - Append output: write only new rows
*Not all output modes are feasible with all queries
[Diagram: timeline with trigger every 1 sec; at times 1, 2 and 3 the query turns the input (data up to 1, 2, 3) into the result, and the output for data up to 1, 2 and 3 is written as complete or delta output]
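The three output modes can be compared on a toy result table. This is a simplified sketch, not Spark's implementation: the result table is modelled as a dict from key to value, deleted rows are ignored, and the function names are hypothetical.

```python
def complete_output(prev, cur):
    """Complete: write the full result table every time."""
    return dict(cur)

def delta_output(prev, cur):
    """Delta: write only rows that are new or whose value changed."""
    return {k: v for k, v in cur.items() if k not in prev or prev[k] != v}

def append_output(prev, cur):
    """Append: write only rows whose key was not in the previous result."""
    return {k: v for k, v in cur.items() if k not in prev}

prev_result = {"a": 1, "b": 2}           # result after the previous trigger
cur_result = {"a": 1, "b": 3, "c": 1}    # result after the current trigger

print(complete_output(prev_result, cur_result))
print(delta_output(prev_result, cur_result))
print(append_output(prev_result, cur_result))
```

In this toy run the row for "b" changed, so append output alone would leave the sink stale; that is one way to see why not every mode fits every query.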

API - Dataset/DataFrame
- Static, bounded data
- Streaming, unbounded data
Single API!

Batch ETL with DataFrames

Read from JSON file:
    input = spark.read
      .format("json")
      .load("source-path")

Select some devices:
    result = input
      .select("device", "signal")
      .where("signal > 15")

Write to Parquet file:
    result.write
      .format("parquet")
      .save("dest-path")

Streaming ETL with DataFrames

Read from JSON file stream (replace load() with stream()):
    input = spark.read
      .format("json")
      .stream("source-path")

Select some devices (code does not change):
    result = input
      .select("device", "signal")
      .where("signal > 15")

Write to Parquet file stream (replace save() with startStream()):
    result.write
      .format("parquet")
      .startStream("dest-path")

Streaming ETL with DataFrames
- read…stream() creates a streaming DataFrame, does not start any of the computation
- write…startStream() defines where & how to output the data and starts the processing

Streaming ETL with DataFrames
[Diagram: at triggers 1, 2 and 3 the input read by stream("source-path") grows; the select / where query keeps the result as an append-only table; in append mode startStream("dest-path") writes the new rows in the result of 2, then the new rows in the result of 3]

Continuous Aggregations
- Continuously compute average signal across all devices
    input.avg("signal")
- Continuously compute average signal of each type of device
    input.groupBy("device-type")
      .avg("signal")

Continuous Windowed Aggregations
- Continuously compute average signal of each type of device in last 10 minutes using event-time
    input.groupBy(
        $"device-type",
        window($"event-time-col", "10 min"))
      .avg("signal")
- Simplifies event-time stream processing (not possible in DStreams)
- Works on both streaming and batch jobs
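What grouping by device type and an event-time window computes can be sketched in plain Python. This is an illustration under assumptions: non-overlapping 10-minute windows, event times as integer seconds, and a hypothetical `windowed_avg` helper rather than Spark's `window` function.

```python
from collections import defaultdict

def windowed_avg(events, window_secs=600):
    """Average signal per (device_type, window_start), bucketed by event time."""
    sums = defaultdict(lambda: [0.0, 0])
    for device_type, event_time, signal in events:
        window_start = event_time - event_time % window_secs
        acc = sums[(device_type, window_start)]
        acc[0] += signal
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

events = [
    ("sensor", 30, 10.0),
    ("sensor", 650, 30.0),
    ("sensor", 590, 20.0),    # arrives late, still counted in window [0, 600)
    ("camera", 100, 5.0),
]
averages = windowed_avg(events)
print(averages)
```

Because rows are bucketed by their own timestamp rather than by arrival time, the late record updates the earlier window instead of polluting the current one.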

Joining streams with static data

    kafkaDataset = spark.read
      .kafka("iot-updates")
      .stream()

    staticDataset = ctxt.read
      .jdbc("jdbc://", "iot-device-info")

    joinedDataset = kafkaDataset.join(
      staticDataset, "device-type")

- Join streaming data from Kafka with static data via JDBC to enrich the streaming data …
- … without having to think that you are joining streaming data
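Conceptually, a stream-static join enriches every incoming batch against the same lookup table. The sketch below shows this in plain Python; `join_batch`, the vendor field and the sample events are hypothetical and stand in for the Kafka and JDBC sources on the slide.

```python
def join_batch(batch, static_by_key, key="device-type"):
    """Inner-join one batch of event dicts with a static lookup table."""
    return [{**event, **static_by_key[event[key]]}
            for event in batch if event[key] in static_by_key]

# Static side: device info keyed by device type (made-up values).
static_info = {"thermostat": {"vendor": "acme"}, "camera": {"vendor": "initech"}}

# Streaming side: one list of events per trigger.
batches = [
    [{"device-type": "thermostat", "signal": 17}],
    [{"device-type": "camera", "signal": 9}, {"device-type": "unknown", "signal": 1}],
]
joined = [join_batch(b, static_info) for b in batches]
print(joined)
```

The same `join_batch` call would work on a single bounded dataset, which is the sense in which the join does not need to know it is streaming.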

Output Modes
- Defines what is output every time there is a trigger
- Different output modes make sense for different queries

Append mode with non-aggregation queries:
    input.select("device", "signal")
      .write
      .outputMode("append")
      .format("parquet")
      .startStream("dest-path")

Complete mode with aggregation queries:
    input.agg(count("*"))
      .write
      .outputMode("complete")
      .format("parquet")
      .startStream("dest-path")

Query Management

    query = result.write
      .format("parquet")
      .outputMode("append")
      .startStream("dest-path")

    query.stop()
    query.awaitTermination()
    query.exception()
    query.sourceStatuses()
    query.sinkStatus()

- query: a handle to the running streaming computation for managing it
  - Stop it, wait for it to terminate
  - Get status
  - Get error, if terminated
- Multiple queries can be active at the same time
- Each query has a unique name for keeping track

Query Execution
- Logically: Dataset operations on table (i.e. as easy to understand as batch)
- Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously)
[Diagram: DataFrame -> Logical Plan -> Catalyst optimizer -> continuous, incremental execution]

Structured Streaming
- High-level streaming API built on Datasets/DataFrames
  - Event time, windowing, sessions, sources & sinks
  - End-to-end exactly once semantics
- Unifies streaming, interactive and batch queries
  - Aggregate data in a stream, then serve using JDBC
  - Add, remove, change queries at runtime
  - Build and apply ML models

What can you do with this that’s hard with other engines?
- True unification
  - Same code + same super-optimized engine for everything
- Flexible API tightly integrated with the engine
  - Choose your own tool - Dataset/DataFrame/SQL
  - Greater debuggability and performance
- Benefits of Spark
  - in-memory computing, elastic scaling, fault-tolerance, straggler mitigation, …
