Apache Storm: Design And Usage

1 Apache Storm: Design And Usage
Roman Boiko UBS->TripAdvisor

2 Please, questions at the end
Thank you!

3 Disclaimer
No one tool or framework is a panacea! Use any ONLY if you are 100% sure its usage pattern fits your needs

4 Hint
Never start an app by selecting a framework! Write domain objects and business logic first – and only then think about infrastructure and frameworks!

5 What is Storm? “the Hadoop of realtime” – Nathan Marz

6 What is Storm? Processing large data streams in realtime
Fault tolerant and scalable
Reliable – “at least once” or “exactly once” processing modes
An alternative to a standard network of queues and workers

7 Author Previously proven by writing Cascalog
Author of the upcoming book “Big Data” (Manning)
Excellent blog post on Storm history:

8 Why compared to Hadoop Easy to explain through a familiar product
The author knows Hadoop very well
Storm vs Hadoop:
Hadoop: MapReduce jobs will eventually finish. Storm: a topology runs, passing messages, forever (until stopped).
Hadoop: JobTracker. Storm: the master node is a daemon called "Nimbus", responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk. This means you can kill -9 Nimbus or the Supervisors and they’ll start back up like nothing happened. This design leads to Storm clusters being incredibly stable – topologies have run for months without requiring any maintenance.
Since topology definitions are just Thrift structs, and Nimbus is a Thrift service, you can create and submit topologies using any programming language.
Everything in Storm runs in parallel in a distributed way. Spouts and bolts execute as many threads across the cluster, and they pass messages to each other in a distributed way. Messages never pass through any sort of central router, and there are no intermediate queues: a tuple is passed directly from the thread that created it to the threads that need to consume it.
Storm guarantees that every message flowing through a topology will be processed, even if a machine goes down and the messages it was processing get dropped. How Storm accomplishes this without any intermediate queuing is key to how it works and what makes it so fast.

9 Use Cases
Stream processing – ETL, trade STP
Continuous computation – trending Twitter topics, rule-based matching
Distributed RPC
An alternative to MapReduce? Doesn’t sound like a good idea
Stream processing: Storm can be used to process a stream of new data and update databases in realtime (ETL).
Continuous computation: Storm can run a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers.
Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages; when it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.

13 Neighbours Apache Zookeeper – used to manage the Storm cluster
Apache Kafka – the most commonly used source of events/data
Hadoop YARN – makes HDFS a source and destination of data
AWS – an automated deployment tool exists
The automated deploy for Storm on AWS massively accelerated Storm's development, as it made it easy to test clusters of all different sizes and configurations and enabled much, much faster iteration.

14 Competitors
Apache Spark
Erlang/OTP workers, Akka, …
Apache Hadoop (when you can change the logic and use a batch approach)
Yahoo!/Apache S4 (push vs pull)

15 How we did this in a bank
Sets of workers connected with message queues
A lot of infrastructure code vs a bit of business logic
Custom setup on every machine
Painful maintenance of brokers
Painful releases

16 Why workers + message queues is hard
Fault tolerance is harder to achieve – you have to control both workers and brokers
Message brokers must be scaled alongside workers
When a topology is redeployed, old messages left in the queues have to be cleaned up
Speed: instead of sending messages directly between spouts and bolts, the messages go through a 3rd party
Why not use intermediate MQ brokers for messaging, to provide guarantees of data processing?
They would be a huge, complex moving part that has to be scaled alongside Storm.
They create uncomfortable situations, such as what to do when a topology is redeployed: there might still be intermediate messages on the brokers that are no longer compatible with the new version of the topology, so those messages would have to be cleaned up or ignored somehow.
They make fault-tolerance harder: you have to figure out what to do not just when Storm workers go down, but also when individual brokers go down.
They're slow: instead of sending messages directly between spouts and bolts, the messages go through a 3rd party, and on top of that they need to be persisted to disk.

20 The complexity Storm hides
Guaranteed message processing
Robust process management: the Supervisor kills the worker -> no orphan tasks (unlike in Hadoop)
The Nimbus daemon and supervisor daemons are fail-fast
Fault detection and automatic reassignment
Efficient message passing – no 3rd party
Message serialization
Local mode for testing, distributed mode for real usage – very simple
Guaranteed message processing: Storm guarantees that each tuple coming off a spout will be fully processed by the topology. To do this, Storm tracks the tree of messages that a tuple triggers. If a tuple fails to be fully processed, Storm will replay it from the Spout. Storm incorporates some clever tricks to track the tree of messages in an efficient way.
Robust process management: one of Storm’s main tasks is managing processes around the cluster. When a new worker is assigned to a supervisor, that worker should be started as quickly as possible; when it is no longer assigned to that supervisor, it should be killed and cleaned up. An example of a system that does this poorly is Hadoop: when Hadoop launches a task, the burden to exit is on the task itself. Unfortunately, tasks sometimes fail to exit and become orphan ("zombie") processes, sucking up memory and resources from other tasks. In Storm, the burden of killing a worker process is on the supervisor that launched it, so orphaned tasks simply cannot happen, no matter how much you stress the machine or how many errors there are. Accomplishing this is tricky because Storm needs to track not just the worker processes it launches, but also subprocesses launched by the workers (a subprocess is launched when a bolt is written in another language).
The Nimbus daemon and supervisor daemons are stateless and fail-fast: if they die, running topologies aren’t affected, and the daemons just start back up like nothing happened. This is again in contrast to how Hadoop works, where the JobTracker dying terminates all jobs, even if they had been running for days. Storm is "process fault-tolerant" – a daemon can be killed with -9 and restarted without any impact on running topologies.
Fault detection and automatic reassignment: tasks in a running topology heartbeat to Nimbus to indicate that they are running smoothly. Nimbus monitors heartbeats and will reassign tasks that have timed out. Additionally, all the tasks throughout the cluster that were sending messages to the failed tasks quickly reconnect to the new location of those tasks.
Efficient message passing: no intermediate queuing is used for message passing between tasks. Instead, messages are passed directly between tasks using ZeroMQ. This is simpler and far more efficient than using intermediate queuing. ZeroMQ is a clever "super-socket" library that employs a number of tricks for maximizing message throughput; for example, it will detect if the network is busy and automatically batch messages to the destination.
Message serialization: another important part of message passing between processes is serializing and deserializing messages efficiently. Storm automates this for you: by default, you can use any primitive type, strings, or binary records within tuples. If you want to use another type, you just implement a simple interface to tell Storm how to serialize it, and whenever Storm encounters that type it will automatically use that serializer.
Local mode and distributed mode: Storm has a "local mode" where it simulates a Storm cluster completely in-process. This lets you iterate on your topologies quickly and write unit tests for them. You can run the same code in local mode as you run on the cluster.
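As a rough illustration of the local vs distributed modes described above: a minimal submission sketch, assuming a post-incubation Storm release (package names were backtype.storm in older versions) and a hypothetical topology name "demo"; the spout/bolt wiring is elided here.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.Utils;

public class DemoSubmit {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... setSpout / setBolt calls would go here (see the topology sketch later) ...
        Config conf = new Config();

        if (args.length > 0 && args[0].equals("cluster")) {
            // Distributed mode: the jar is uploaded to Nimbus, which distributes it to supervisors.
            StormSubmitter.submitTopology("demo", conf, builder.createTopology());
        } else {
            // Local mode: an in-process simulated cluster, handy for iterating and unit tests.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("demo", conf, builder.createTopology());
            Utils.sleep(10_000);           // let the topology run for ten seconds
            cluster.killTopology("demo");
            cluster.shutdown();
        }
    }
}
```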

21 Storm Implementation APIs – Java (to make integrations easier)
Core – Clojure
Inter-worker communication: ZeroMQ -> Netty
Topologies – Thrift data structures
Also: Kryo serialization framework, LMAX Disruptor – local queues
By keeping Storm's APIs 100% Java, Storm was ensured a very large pool of potential users. Doing the implementation in Clojure let the author be a lot more productive and get the project working sooner.
Storm was planned from the beginning to be usable from non-JVM languages: topologies are defined as Thrift data structures and submitted through a Thrift API, and a protocol exists so that spouts and bolts can be implemented in any language. Making Storm accessible from other languages makes the project accessible to more people and much easier to migrate to – existing realtime processing doesn't have to be rewritten in Java, it can be ported to run on Storm's multi-language API.
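A minimal sketch of the Kryo angle: the domain class Trade and its serializer are invented for illustration, and Config.registerSerialization is the documented hook, but the exact Kryo Serializer signatures vary by Kryo/Storm version, so treat this as an assumption to verify.

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.storm.Config;

// Hypothetical domain type flowing through tuples.
class Trade {
    String symbol;
    double price;
}

// Hand-written Kryo serializer for Trade.
class TradeSerializer extends Serializer<Trade> {
    @Override
    public void write(Kryo kryo, Output output, Trade t) {
        output.writeString(t.symbol);
        output.writeDouble(t.price);
    }

    @Override
    public Trade read(Kryo kryo, Input input, Class<Trade> type) {
        Trade t = new Trade();
        t.symbol = input.readString();
        t.price = input.readDouble();
        return t;
    }
}

// Registration: Storm will use this serializer whenever a Trade appears in a tuple.
class SerializationConfig {
    static Config build() {
        Config conf = new Config();
        conf.registerSerialization(Trade.class, TradeSerializer.class);
        return conf;
    }
}
```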

25 Storm author on Clojure
“By doing the implementation in Clojure, I was able to be a lot more productive and get the project working sooner.”

26 Storm author on Clojure
“Clojure is the best language I've ever used, by far. I use it because it makes me vastly more productive by allowing me to easily use techniques like immutability and functional programming. Its dynamic nature by being Lisp-based ensures that I can always mold Clojure as necessary to formulate the best possible abstractions. Storm would not be any different if I didn't use Clojure, it just would have been far more painful to build.”

27 Storm Terminology Stream – unbounded sequence of tuples. Distributed abstraction, can be produced and processed in parallel

28 Storm Terminology Spout - produces brand new streams (stream source)
Read from logs
Read from a queue
Call an API to get data
State spout?
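A hedged sketch of what a spout looks like in the Java API: LineSpout and its stubbed data source are invented for illustration; BaseRichSpout and SpoutOutputCollector are the standard base pieces, but signatures (e.g. the Map type in open) differ slightly between Storm versions.

```java
import java.util.Map;
import java.util.UUID;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical spout that emits lines pulled from some external source.
public class LineSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;       // keep the collector; emit from nextTuple()
    }

    @Override
    public void nextTuple() {
        String line = fetchNextLine();    // read from a log, queue, or API (stubbed below)
        if (line == null) {
            return;                       // nothing to emit right now
        }
        String msgId = UUID.randomUUID().toString();
        // Emitting with a message id opts this tuple into the reliability mechanism:
        // ack(msgId)/fail(msgId) is called back once the tuple tree completes or fails.
        collector.emit(new Values(line), msgId);
    }

    @Override
    public void ack(Object msgId) { /* drop our copy of the message */ }

    @Override
    public void fail(Object msgId) { /* re-queue the message for replay */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }

    private String fetchNextLine() {
        return null;                      // placeholder for the real source
    }
}
```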

29 Storm Terminology Bolt - takes streams as input and produces streams as output (like a Worker)
Transformations
Filters
Joins

30 Storm Terminology Topology - the top-level abstraction, a network of spouts and bolts
Can be transactional to guarantee “exactly once” processing
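Roughly how a topology is wired together in the Java API. The component ids ("lines", "split", "count") and the WordSplitBolt/WordCountBolt classes are hypothetical; TopologyBuilder with setSpout/setBolt is the standard entry point.

```java
import org.apache.storm.Config;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // A spout feeding raw lines, running as 2 parallel tasks.
        builder.setSpout("lines", new LineSpout(), 2);

        // A bolt splitting lines into words; shuffle grouping spreads load evenly.
        builder.setBolt("split", new WordSplitBolt(), 4)
               .shuffleGrouping("lines");

        // A counting bolt; fields grouping sends the same word to the same task.
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        StormTopology topology = builder.createTopology();
        // submit via LocalCluster or StormSubmitter as in the earlier sketch
    }
}
```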

31 Spouts And Bolts

32 Storm Terminology Task – instance of bolt or spout

33 Storm Terminology Cluster - consists of:
Nimbus – the JobTracker analogue (also hosts the UI for monitoring)
A set of Zookeeper nodes
A set of Supervisors (TaskTracker analogues)
A number of Workers
A number of Ackers

34 Tuple Tree
Root of the tree – the tuple produced by the Spout

Spout S1 --T1--> Bolt B1 --T2,T3--> Bolt B2 ----> ...

        T1
       /  \
      T2   T3

35 … Not exactly a “Tree”

      T1
     /  \
    T2   T3
     \  /
      T4
      |
      T5

“Tuple Tree” is a historical term – it is really a DAG (Directed Acyclic Graph)

36 Tuple Processing

Spout S1 --T1--> Bolt B1 --T2,T3--> Bolt B2 ----> ...

In B1:
Create T2, T3 as the results of processing T1
Anchor T2, T3 to T1 (to say they are part of T1's "tuple tree")
Ack T1 to say that we don't need to process T1 anymore
T1 is held in memory until it is ack'ed (so if a task is not ack'ing, it will eventually run out of memory)
Anchoring tells the framework that if further processing of T2 or T3 fails, T1 will be replayed from S1, because T2 and T3 are in the same tuple tree
Failing the input tuple T1 explicitly triggers immediate replay of T1 from S1, without waiting for the tuple timeout
Anchoring can be avoided when we want to break the "transaction" and not replay T1 if T2 or T3 fail later on
We can anchor to more than one tuple, as in aggregations and joins – then it is not a tree but a DAG ("Tuple Tree" is a historical name)
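A hedged sketch of those steps inside a bolt's execute() method. WordSplitBolt is the hypothetical bolt from the earlier topology sketch; anchored emit, ack and fail are the standard OutputCollector calls, but verify the exact signatures against your Storm version.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: splits an incoming line (T1) into word tuples (T2, T3, ...).
public class WordSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {               // input == T1
        try {
            for (String word : input.getString(0).split("\\s+")) {
                // Anchored emit: the new tuple joins T1's tuple tree,
                // so a downstream failure causes T1 to be replayed from the spout.
                collector.emit(input, new Values(word));
                // An unanchored emit would be: collector.emit(new Values(word));
            }
            collector.ack(input);                    // done with T1; stop tracking it here
        } catch (Exception e) {
            collector.fail(input);                   // explicit fail: replay T1 immediately
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```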

37 Grouping
Stream Grouping defines the way the incoming stream is partitioned among a bolt's tasks:
Shuffle
All
Global
Fields
None
Direct
LocalOrShuffle – prefers tasks in the same worker process when there are any, otherwise behaves like Shuffle
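A minimal sketch of declaring a few of these groupings on a BoltDeclarer; the component ids and bolt classes (WordSplitBolt, WordCountBolt, ReportBolt, SignalBolt, the "signals" component) are the hypothetical ones from the sketches above.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
    static void wire(TopologyBuilder builder) {
        // Shuffle: tuples are distributed randomly and evenly across tasks.
        builder.setBolt("split", new WordSplitBolt(), 4)
               .shuffleGrouping("lines");

        // Fields: tuples with the same value of "word" always go to the same task.
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        // Global: the whole stream goes to a single task (the one with the lowest id).
        builder.setBolt("report", new ReportBolt(), 1)
               .globalGrouping("count");

        // All: every task gets a copy of every tuple (useful for broadcasting signals).
        builder.setBolt("signal-aware", new SignalBolt(), 4)
               .allGrouping("signals");

        // LocalOrShuffle: prefer tasks in the same worker process, otherwise shuffle.
        builder.setBolt("local-split", new WordSplitBolt(), 4)
               .localOrShuffleGrouping("lines");
    }
}
```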

38 Guaranteed Message Processing
The Acker holds a map: SpoutTupleId -> [SpoutTaskId, AckVal]
AckVal is the XOR of the ids of all tuples created and acked in the tree
When AckVal becomes 0, the Acker sends an ack to spout task SpoutTaskId for SpoutTupleId
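This is not Storm's actual acker code, just an illustrative model of the XOR bookkeeping: each tuple id is XORed in when the tuple is created and XORed out when it is acked, so the running value returns to zero exactly when every tuple in the tree has been acked (in real Storm a bolt combines these XORs into a single message per ack, so the acker does constant work regardless of tree size).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the acker's XOR trick (illustration only, not Storm internals).
public class AckerModel {
    // spout tuple id -> running XOR of created/acked tuple ids in its tree
    private final Map<Long, Long> ackVals = new HashMap<>();

    // A tuple was emitted (anchored to the given spout tuple): XOR its id in.
    public void tupleCreated(long spoutTupleId, long tupleId) {
        ackVals.merge(spoutTupleId, tupleId, (a, b) -> a ^ b);
    }

    // A tuple was acked by a bolt: XOR its id out (x ^ x == 0 cancels the pair).
    // Returns true when the whole tree is complete and the spout tuple can be acked.
    public boolean tupleAcked(long spoutTupleId, long tupleId) {
        long v = ackVals.merge(spoutTupleId, tupleId, (a, b) -> a ^ b);
        if (v == 0) {
            ackVals.remove(spoutTupleId);
            return true;   // ack the spout tuple
        }
        return false;
    }
}
```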

39 Message Processing Contract
No Ackers – At Most Once
A number of Ackers – At Least Once
Transactional Topologies + Trident API – Exactly Once

40 Tuning Reliability vs Performance
Set the number of acker tasks to 0 – the tuple tree will not be tracked, and spout tuples are acknowledged immediately
Don't anchor emitted tuples – downstream failures won't trigger a replay of the spout tuple
Tune the number of ackers for your volume of tuples, so the reliability mechanism doesn't become a bottleneck
No ackers – at most once; some ackers – at least once; exactly once – transactional topologies and the Trident API
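Roughly how those knobs are set through Config: setNumAckers, setMessageTimeoutSecs and setMaxSpoutPending are standard Config helpers, but the values below are arbitrary examples, not recommendations.

```java
import org.apache.storm.Config;

public class TuningConfig {
    static Config forThroughput() {
        Config conf = new Config();
        conf.setNumAckers(0);              // no tuple tracking: at-most-once, fastest
        return conf;
    }

    static Config forReliability() {
        Config conf = new Config();
        conf.setNumAckers(4);              // scale ackers with tuple volume
        conf.setMessageTimeoutSecs(30);    // how long before an un-acked tuple is replayed
        conf.setMaxSpoutPending(1000);     // cap in-flight spout tuples to limit replays/memory
        return conf;
    }
}
```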

41 Failure cases A tuple isn't acked because the task died
In this case the spout tuple ids at the root of the trees for the failed tuple will time out and be replayed

42 Failure cases Acker task dies
In this case all the spout tuples the acker was tracking will time out and be replayed

43 Failure cases Spout task dies
In this case the source that the spout talks to is responsible for replaying the messages For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.

44 Failure cases Nimbus dies Topologies keep running
You can’t rebalance tasks
You can’t redeploy topologies
Either fix the Nimbus node, or start a new Nimbus and resubmit the topologies

45 Later Features Multi-Tenancy Isolation Scheduler Metrics API
Trident API (micro-batching)
"Multi-Tenancy" - at Twitter people wanted to run their apps on a shared cluster
"Isolation Scheduler" - to prevent people from requesting too many resources on the shared cluster
"Metrics API" - lets people monitor anything in their topologies
Trident - a micro-batching API on top of Storm that provides exactly-once processing semantics; it enabled Storm to be applied to a lot of new use cases.

46 Questions?

47 Thank you!

