
1 CSCI5570 Large Scale Data Processing Systems
Distributed Stream Processing Systems
James Cheng, CSE, CUHK
Slide acknowledgment: modified based on the slides from Nathan Marz, Mahender Immadi, Thirupathi Guduru and Karthick Ramasamy

2 Storm @Twitter
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy
Twitter, Inc., *University of Wisconsin – Madison
SIGMOD 2014

3 Twitter Storm
Storm is currently one of the most popular stream processing systems
Features:
Efficient at-least-once message processing guarantee
Flexible message dispatching schemes

4 Storm Core Concepts
Tuple, Stream, Spout, Bolt, Topology, Task

5 Tuple and Stream
Tuple: the data unit (or message primitive); contains named fields (e.g., word and count)
Stream: an unbounded sequence of tuples

6 Spout
Source of data streams: wraps a streaming data source and emits tuples
Examples: Twitter Streaming API, Kafka
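As an illustration (not part of the original slides), a minimal spout that wraps a toy in-memory sentence source might look like the sketch below. Package names follow current Apache Storm (org.apache.storm); releases from the paper's era used backtype.storm, and exact signatures vary slightly across versions. In practice the spout would wrap Kafka or the Twitter Streaming API instead of a hard-coded array.

    import java.util.Map;
    import java.util.Random;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    // Toy spout: emits random sentences as single-field tuples.
    public class RandomSentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private Random random;
        private static final String[] SENTENCES = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            // Called repeatedly by Storm; emit one tuple per call.
            collector.emit(new Values(SENTENCES[random.nextInt(SENTENCES.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }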

7 Bolt
Abstraction of a processing element: consumes tuples and may emit new tuples
Examples: filter, aggregation, join
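For concreteness, a filter bolt that keeps only long words could be sketched as follows (illustrative only; BaseBasicBolt acknowledges input tuples automatically, and the field name "word" is just an example):

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Consumes "word" tuples and re-emits only those longer than three characters.
    public class LongWordFilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            if (word.length() > 3) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }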

8 Topology
The job definition: a DAG consisting of spouts, bolts, and the edges between them

9 Task
Each spout and bolt runs as multiple instances in parallel
Each instance is called a task

10 Stream Grouping When a tuple is emitted, which processing element does it go to?

11 Stream Grouping
Shuffle grouping: send each tuple to a randomly chosen consumer processing element
Fields grouping: hash (mod) on one or several fields of the tuple
All grouping: replicate every tuple to all consumer processing elements
Global grouping: send all tuples to a single processing element
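In the Java API the consumer chooses its grouping when it is wired into the topology. A hedged fragment (builder is a TopologyBuilder as on the following slides; the component ids, parallelism values, and the MetricsBolt/ReportBolt classes are purely illustrative):

    builder.setBolt("split", new SplitSentence(), 8)
           .shuffleGrouping("sentences");                 // shuffle: random consumer task
    builder.setBolt("count", new WordCount(), 12)
           .fieldsGrouping("split", new Fields("word"));  // fields: hash on the "word" field
    builder.setBolt("metrics", new MetricsBolt(), 2)
           .allGrouping("count");                         // all: replicate to every task
    builder.setBolt("report", new ReportBolt())
           .globalGrouping("count");                      // global: one task gets everything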

12 Storm Word Count Topology (Job)
[Topology diagram: Twitter Spout -> Split Sentence (shuffle grouping) -> Word Count (fields grouping) -> Report (global grouping)]

13 Streaming Word Count TopologyBuilder is used to construct topologies in Java
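The code screenshots on this and the following slides are not in the transcript. The fragments below reconstruct a plausible word-count topology using the standard Storm Java API; component ids, class names, and parallelism values are illustrative, and package names follow org.apache.storm (older releases used backtype.storm). First, create the builder:

    import org.apache.storm.topology.TopologyBuilder;

    TopologyBuilder builder = new TopologyBuilder();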

14 Streaming Word Count Define a spout in the topology with parallelism of 5 tasks
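Presumably something along these lines, where "sentences" is the component id and 5 is the parallelism hint (number of tasks):

    builder.setSpout("sentences", new RandomSentenceSpout(), 5);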

15 Streaming Word Count
Split sentences into words with parallelism of 8 tasks
The consumer decides what data it receives and how it gets grouped
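A plausible fragment; note that it is the consumer ("split") that declares how it subscribes to the "sentences" stream:

    builder.setBolt("split", new SplitSentence(), 8)
           .shuffleGrouping("sentences");   // tuples are distributed randomly over the 8 tasks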

16 Streaming Word Count Create a word count stream
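A plausible fragment (the parallelism of 12 is just an example). Fields grouping on "word" sends all occurrences of the same word to the same task, so each task can keep a consistent local count:

    builder.setBolt("count", new WordCount(), 12)
           .fieldsGrouping("split", new Fields("word"));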

17 Streaming Word Count
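This slide presumably showed the SplitSentence bolt. A hedged sketch using the low-level bolt API, with emits anchored to the input tuple and the input explicitly acked (this connects to the reliability discussion later):

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class SplitSentence extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            for (String word : tuple.getStringByField("sentence").split(" ")) {
                collector.emit(tuple, new Values(word));  // anchored to the input tuple
            }
            collector.ack(tuple);                         // mark the input as processed
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }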

18 Streaming Word Count
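And this slide presumably showed the WordCount bolt; a sketch that keeps per-task counts in memory (BaseBasicBolt acks the input automatically):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class WordCount extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            int count = counts.getOrDefault(word, 0) + 1;   // per-task state; fields grouping keeps it consistent
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }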

19 Streaming Word Count Submitting topology to a cluster
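A plausible fragment (the topology name and worker count are examples):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;

    Config conf = new Config();
    conf.setNumWorkers(4);   // number of worker processes to spread the executors over
    StormSubmitter.submitTopology("word-count", conf, builder.createTopology());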

20 Streaming Word Count Running topology in local mode
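A plausible fragment; LocalCluster simulates a Storm cluster inside the current JVM, which is convenient for development and testing (details vary by Storm version):

    import org.apache.storm.LocalCluster;
    import org.apache.storm.utils.Utils;

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf, builder.createTopology());
    Utils.sleep(10000);                 // let the topology run for a while
    cluster.killTopology("word-count");
    cluster.shutdown();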

21 System Overview
Nimbus (master): distributes and coordinates the execution of the topology; monitors failures
Supervisor (slave): spawns workers, which execute the spouts or bolts and keep listening for tuples
Zookeeper: coordination management
[Diagram: Storm framework with Nimbus, Zookeeper, and the Supervisors]

22 Nimbus and Zookeeper
Nimbus: similar to the JobTracker in Hadoop
The user describes the topology as a Thrift object and sends the object to Nimbus, so any programming language can be used to create a Storm topology (e.g., Summingbird)
The user also uploads the user code to Nimbus
Nimbus uses a combination of local disk and Zookeeper to store state about the topology: user code is stored on local disk, and the topology Thrift objects are stored in Zookeeper
Supervisors periodically tell Nimbus which topologies they are running and what capacity they have to run more
Nimbus does the match-making between pending topologies and Supervisors
(Apache Thrift is a software framework for scalable cross-language service development; it combines a software stack with a code generation engine to build services that work across C++, Java, Python, PHP, Ruby, Erlang, and many other languages.)

23 Nimbus and Zookeeper
Zookeeper: coordination between Nimbus and the Supervisors
Nimbus and the Supervisors are stateless; all their state is kept in Zookeeper or on local disk, which is key to Storm's resilience
If the Nimbus service fails, workers still continue to make forward progress, and Supervisors restart the workers if they fail
But while Nimbus is down, users cannot submit new topologies, and if a running topology experiences machine failures, its tasks cannot be reassigned to different machines until Nimbus is revived

24 Supervisor
The Supervisor runs on each Storm node
It receives assignments from Nimbus and spawns workers based on those assignments
It also monitors the health of the workers and respawns them if necessary
Supervisor architecture: the Supervisor spawns three threads
The main thread reads the Storm configuration, initializes the Supervisor's global map, creates a persistent local state in the file system, and schedules recurring timer events
There are three types of events (next slides)

25 Supervisor
The heartbeat event:
scheduled to run (e.g., every 15 sec) in the context of the main thread
reports to Nimbus that the Supervisor is alive
The synchronize supervisor event:
executed (e.g., every 10 sec) in the event manager thread
responsible for managing changes in the existing assignments
if the changes include the addition of new topologies, it schedules a synchronize process event

26 Supervisor
The synchronize process event:
runs (e.g., every 3 sec) in the context of the process event manager thread
responsible for managing the worker processes that run fragments of the topology on the same node as the Supervisor
reads worker heartbeats from the local state and classifies each worker as valid, timed out, not started, or disallowed
"timed out": the worker did not provide a heartbeat within the specified time frame and is assumed to be dead
"not started": the worker is yet to be started, because it belongs to a newly submitted topology or to an existing topology whose worker is being moved to this Supervisor
"disallowed": the worker should not be running, either because its topology has been killed or because its worker has been moved to another node

27 Workers and Executors
Each worker process runs several executors inside a JVM
Executors are threads within the worker process
Each executor can run several tasks
A task is an instance of a spout or a bolt
A task is strictly bound to an executor (no dynamic reassignment, e.g., for load balancing, at the moment)

28 Workers and Executors To route incoming and outgoing tuples, each worker process has two dedicated threads: a worker receive thread and a worker send thread

29 Workers and Executors Each executor also consists of two threads: the user logic thread and the executor send thread

30 Workers and Executors Worker receive thread: examines the destination task id of an incoming tuple and queues the incoming tuple to the appropriate in queue associated with its executor

31 Workers and Executors User logic thread: takes incoming tuples from the in queue, examines the destination task id, and then runs the actual task (a spout or bolt instance) for the tuple, and generates output tuple(s). These outgoing tuples are then placed in an out queue that is associated with this executor.

32 Workers and Executors Executor send thread: takes the tuples from the out queue and puts them in a global transfer queue. The global transfer queue contains all the outgoing tuples from executors in the worker process

33 Workers and Executors Worker send thread: examines each tuple in the global transfer queue and based on its task destination id, sends it to the next worker downstream. For outgoing tuples that are destined for a different task on the same worker, it writes the tuple directly into the in queue of the destination task.
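The queue-and-thread structure described on the last few slides can be summarized with the simplified sketch below. It illustrates the routing logic only and is not Storm's actual implementation; all class, field, and method names here are assumptions.

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;

    // Simplified view of how a worker process routes tuples between its threads.
    class WorkerRoutingSketch {
        Map<Integer, BlockingQueue<Object>> executorInQueues; // one in queue per local executor, keyed by task id
        BlockingQueue<Object> globalTransferQueue;            // outgoing tuples from all executors of this worker

        // Worker receive thread: queue an incoming tuple to the in queue of the executor owning the destination task.
        void onIncomingTuple(int destTaskId, Object tuple) throws InterruptedException {
            executorInQueues.get(destTaskId).put(tuple);
        }

        // Worker send thread: drain the global transfer queue; local destinations bypass the network.
        void sendLoop() throws InterruptedException {
            while (true) {
                Object tuple = globalTransferQueue.take();
                int destTaskId = destinationOf(tuple);
                BlockingQueue<Object> localQueue = executorInQueues.get(destTaskId);
                if (localQueue != null) {
                    localQueue.put(tuple);                     // destination task lives in the same worker
                } else {
                    sendToDownstreamWorker(destTaskId, tuple); // destination task lives in another worker
                }
            }
        }

        int destinationOf(Object tuple) { return 0; }               // placeholder
        void sendToDownstreamWorker(int taskId, Object tuple) { }   // placeholder
    }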

34 Message Processing Guarantees (Fault Tolerance)
At most once (e.g., S4): messages may be lost; minimum overhead
At least once (e.g., Storm): messages will not be lost, but may be processed more than once; medium overhead
Exactly once (e.g., MillWheel): messages will be processed exactly once; maximum overhead

35 Storm At-Least-Once Guarantee
Each tuple emitted from a spout will be processed at least once, which is sufficient for idempotent operations
Idempotent operation: processing the same input more than once does not change the output, i.e., f(f(x)) = f(x), where x is the input and f is the operation
Examples: filter, maximum, minimum (it is OK to apply the operation to the same input more than once)
The implementation can be very efficient
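A trivial illustration: keeping a running maximum is idempotent, so a duplicate delivery of the same tuple leaves the state unchanged.

    int runningMax = 10;
    int x = 12;
    runningMax = Math.max(runningMax, x);   // first delivery: state becomes 12
    runningMax = Math.max(runningMax, x);   // duplicate delivery: still 12, i.e. f(f(s, x), x) = f(s, x)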

36 Storm At-Least-Once Guarantee
Implementation: XORing a set of values in which every value appears twice gives 0, e.g., 1 ^ 2 ^ ... ^ (N-1) ^ N ^ N ^ (N-1) ^ ... ^ 2 ^ 1 = 0
Each tuple is either created or consumed, and its ID is XORed in at both of these two points, so the accumulated value returns to 0 once every tuple has been processed

37 Storm At-Least-Once Guarantee
Add an extra acker bolt (ACKerBolt)
It XORs each source tuple ID from the spouts with the IDs of the new tuples generated while processing that source tuple
Changes of the XOR value in the example: 001 -> 001 ^ 001 ^ 002 ^ 003 = 001 -> 001 ^ 002 ^ 003 = 000
The acker sends an ACK to Spout1 when the value becomes 0
Spout1 resends tuple1 when no ACK has been received for a long time
[Diagram: Spout1 emits tuple1 to SplitterBolt, which emits tuple2 and tuple3 to WordCountBolt; the ACKerBolt tracks Tuple1:[Spout1, 000] -> [Spout1, 001] -> [Spout1, 001] -> 0]
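The acker's bookkeeping amounts to one XOR accumulator per spout tuple. The sketch below illustrates the idea only; it is not Storm's actual acker code, and the class and method names are assumptions.

    import java.util.HashMap;
    import java.util.Map;

    // Simplified acker: tracks one XOR accumulator per spout tuple.
    class AckerSketch {
        private final Map<Long, Long> pending = new HashMap<>(); // spout tuple id -> XOR of outstanding tuple ids

        // Called with a tuple id both when the tuple is created (anchored) and when it is acked,
        // so each id is XORed in exactly twice over the life of the tuple tree.
        void update(long spoutTupleId, long tupleId) {
            long v = pending.getOrDefault(spoutTupleId, 0L) ^ tupleId;
            if (v == 0L) {
                pending.remove(spoutTupleId);
                notifySpoutAck(spoutTupleId);   // the tuple tree is fully processed: ack back to the spout
            } else {
                pending.put(spoutTupleId, v);
            }
        }

        void notifySpoutAck(long spoutTupleId) { /* send an ack message to the spout task */ }
    }

In the slide's example, the accumulator goes 000 -> 001 when tuple1 is emitted, stays at 001 after the splitter acks tuple1 and anchors tuple2 and tuple3 (001 ^ 001 ^ 002 ^ 003), and drops to 000 once the word-count bolt acks tuple2 and tuple3, at which point Spout1 is acked.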

38 Storm runs on hundreds of servers (spread across multiple datacenters) at Twitter
Several hundred topologies run on these clusters, some on more than a few hundred nodes
Many terabytes of data flow through the Storm clusters every day, generating several billion output tuples

39 Storm topologies are used by a number of groups inside Twitter, including revenue, user services, search, and content discovery
They are used for simple things like filtering and aggregating the content of various streams at Twitter (e.g., computing counts)
They are also used for more complex things like running simple machine learning algorithms (e.g., clustering) on streaming data

40 Storm is resilient to failures: it continues to work even when Nimbus is down (the workers continue making forward progress)
A machine can be taken down for maintenance without affecting running topologies
The 99th-percentile latency for processing a tuple is close to 1 ms
Cluster availability is 99.9% over a long period of time

41 Guaranteeing Message Processing
Tuple tree: a spout tuple together with all the tuples transitively emitted by the bolts while processing it

42 Guaranteeing Message Processing
A spout tuple is not fully processed until all tuples in the tree have been completed If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

43 Guaranteeing Message Processing
Reliability API: "anchoring" creates a new edge in the tuple tree
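In the bolt API, anchoring happens at emit time inside execute; a minimal fragment (collector is the bolt's OutputCollector and input is the tuple being processed):

    // Anchored emit: the new tuple becomes a child of `input` in the tuple tree,
    // so a downstream failure causes the original spout tuple to be replayed.
    collector.emit(input, new Values(word));

    // Unanchored emit: no edge is created; downstream failures will not trigger a replay.
    collector.emit(new Values(word));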

44 Guaranteeing Message Processing
Acking a tuple marks a single node in the tree as complete
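With the low-level bolt API, completion (or failure) is signalled explicitly; a fragment (BaseBasicBolt does this automatically):

    collector.ack(input);    // marks this node of the tuple tree as complete
    collector.fail(input);   // or: fail the tree immediately so the spout tuple is replayed without waiting for the timeout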

45 Guaranteeing Message Processing
Storm tracks tuple trees for you in an extremely efficient way and provides the at-least-once guarantee

46 Transactional Topologies
How do you do idempotent counting with an at-least-once delivery guarantee? Won't you overcount?
Transactional topologies solve this problem and provide an exactly-once guarantee

47 Transactional Topologies
An exactly-once guarantee for each individual tuple is expensive
Instead, process small batches of tuples (Batch 1, Batch 2, Batch 3, ...)

48 Transactional Topologies
If a batch fails, replay the whole batch
Once a batch is completed, commit the batch
Bolts can optionally be "committers"

49 Transactional Topologies
Commits are strongly ordered: Commit 1, Commit 2, Commit 3, Commit 4, ...
If there is a failure during a commit, the whole batch plus its commit is retried
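Exactly-once counting works because the strong commit ordering lets state updates be made idempotent: the committing bolt stores the batch (transaction) id together with each value and skips the update if that batch has already been applied. The sketch below shows that idea only; it is not the Storm transactional-topology API, and the in-memory store stands in for whatever database the topology writes to.

    import java.util.HashMap;
    import java.util.Map;

    // Idempotent commit: a replayed batch (same transaction id) has no additional effect.
    class ExactlyOnceCounterSketch {
        static class Entry { long lastTxId = -1; long count = 0; }

        private final Map<String, Entry> store = new HashMap<>();

        void commit(long txId, String key, long batchCount) {
            Entry e = store.computeIfAbsent(key, k -> new Entry());
            if (e.lastTxId == txId) {
                return;                // this batch was already committed for this key (replay after a commit failure)
            }
            e.count += batchCount;     // apply the batch's partial count exactly once
            e.lastTxId = txId;         // commits are strongly ordered, so storing the last id is enough
        }
    }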

