Published by Calvin Parsons, modified over 6 years ago
1
Storm: Original slides by Nathan Marz @ Twitter and Shyam Rajendran @ Nutanix
2
Storm was developed by BackType, which was acquired by Twitter.
There are lots of tools for batch data processing: Hadoop, Pig, HBase, Hive, … None of them are realtime systems, which are becoming a real requirement for businesses.
3
History: Hadoop? Built for parallel batch processing, not realtime; realtime on Hadoop requires hacks. Map/Reduce is built to leverage data locality on HDFS to distribute computational jobs, and works on big data. Storm! Stream-processes data in realtime with very low latency, and generates big data! 2011: created by Nathan Marz, lead engineer on analytics products.
4
Storm provides realtime computation: it is scalable, guarantees no data loss, is extremely robust and fault-tolerant, and is programming-language agnostic.
5
Concepts – Streams and Spouts
Stream: an unbounded sequence of tuples (Storm's data model), i.e. <key, value(s)> pairs, e.g. <“UIUC”, 5>. Spouts: sources of streams, e.g. the Twitter firehose API (a stream of tweets) or some crawler.
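The stream/spout idea can be sketched in plain Python (this is a conceptual illustration, not the Storm API; `tweet_spout` and `SAMPLE_TWEETS` are made-up names):

```python
# Conceptual sketch: a stream is an unbounded sequence of tuples,
# and a spout is whatever produces it. A real spout would read an
# external source (e.g. the Twitter firehose) forever; here we use
# a small fixed sample so the example terminates.

SAMPLE_TWEETS = ["go UIUC", "storm at UIUC", "hello world"]

def tweet_spout(source):
    """Yield <key, value> tuples, one per word in each tweet."""
    for text in source:
        for word in text.split():
            yield (word, 1)

stream = list(tweet_spout(SAMPLE_TWEETS))
# stream[0] == ("go", 1); the stream contains one tuple per word
```

In real Storm the spout never returns; downstream bolts consume the tuples as they are emitted.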
6
Concepts – Bolts
Bolts process (one or more) input streams and produce new streams. Typical functions: filter, join, apply/transform, etc.
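Bolts can likewise be sketched as functions over streams (again a conceptual sketch, not the Storm API; `filter_bolt` and `transform_bolt` are illustrative names):

```python
# Conceptual sketch: a bolt consumes an input stream of tuples and
# emits a new stream. Two bolts are chained here as plain generators.

def filter_bolt(stream, predicate):
    """Filter bolt: emit only tuples whose value passes the predicate."""
    for key, value in stream:
        if predicate(value):
            yield (key, value)

def transform_bolt(stream, fn):
    """Transform bolt: apply fn to each tuple's value."""
    for key, value in stream:
        yield (key, fn(value))

counts = [("UIUC", 5), ("MIT", 0), ("CMU", 3)]
out = list(transform_bolt(filter_bolt(counts, lambda v: v > 0),
                          lambda v: v * 10))
# out == [("UIUC", 50), ("CMU", 30)]
```

Chaining bolts like this is exactly what a topology wires up, except that in Storm each bolt runs as many parallel tasks across the cluster.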
7
Concepts – Topology & Grouping
Topology: a graph of computation (it can have cycles), i.e. a network of spouts and bolts. Spouts and bolts execute as many tasks across the cluster. Grouping: how are tuples sent between the components / tasks?
8
Concepts – Grouping
Shuffle Grouping: distribute the stream “randomly” across the bolt’s tasks. Fields Grouping: group a stream by a subset of its fields, so tuples with equal field values go to the same task. All Grouping: all tasks of the bolt receive all input tuples; useful for joins. Global Grouping: the entire stream goes to the task with the lowest id.
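The routing decision each grouping makes can be sketched as a function from a tuple to the set of destination task ids (illustrative only, not Storm internals; `NUM_TASKS` and the function names are assumptions):

```python
# Sketch of how groupings route a tuple to a bolt's tasks.
import random

NUM_TASKS = 4  # assume the bolt runs 4 parallel tasks

def shuffle_grouping(tuple_):
    """One random task: balances load, no key affinity."""
    return [random.randrange(NUM_TASKS)]

def fields_grouping(tuple_, field_index=0):
    """Hash the chosen field mod NUM_TASKS: equal keys always
    land on the same task (deterministic within a process)."""
    return [hash(tuple_[field_index]) % NUM_TASKS]

def all_grouping(tuple_):
    """Every task receives the tuple (useful for joins)."""
    return list(range(NUM_TASKS))

def global_grouping(tuple_):
    """Entire stream goes to the lowest-id task."""
    return [0]
```

Fields grouping is the one that enables stateful per-key operations (e.g. word counts), since all tuples for a given key reach the same task.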
9
Cluster
10
Guaranteed Message Processing
When is a message “fully processed”? A spout tuple is “fully processed” when its tuple tree has been exhausted and every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout.
11
Fault Tolerance APIs: Emit(tuple, output), Ack(tuple), Fail(tuple)
Emit(tuple, output): emits an output tuple, perhaps anchored on an input tuple (the first argument). Ack(tuple): acknowledge that you (the bolt) finished processing a tuple. Fail(tuple): immediately fail the spout tuple at the root of the tuple tree, e.g. if there is an exception from the database. You must remember to ack or fail each tuple: each tuple consumes memory, and failing to do so results in memory leaks.
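The ack/fail discipline looks roughly like this in a bolt (a hedged sketch: `FakeCollector`, `execute`, and `db_write` are stand-ins, not Storm's actual OutputCollector API):

```python
# Sketch of the ack/fail pattern: every input tuple is either acked
# on success or failed on error, so the tracking state never leaks.

class FakeCollector:
    """Stand-in for the collector a bolt emits/acks through."""
    def __init__(self):
        self.acked, self.failed, self.emitted = [], [], []
    def emit(self, anchor, out):      # emit anchored on the input tuple
        self.emitted.append((anchor, out))
    def ack(self, t):
        self.acked.append(t)
    def fail(self, t):
        self.failed.append(t)

def execute(tuple_, collector, db_write):
    try:
        db_write(tuple_)                          # e.g. a database call
        collector.emit(tuple_, tuple_[0].upper()) # anchored output
        collector.ack(tuple_)                     # success: ack
    except Exception:
        collector.fail(tuple_)                    # error: replay from spout

c = FakeCollector()
execute(("uiuc", 5), c, db_write=lambda t: None)  # succeeds -> acked
execute(("bad", 1), c, db_write=lambda t: 1 / 0)  # raises -> failed
# c.acked == [("uiuc", 5)], c.failed == [("bad", 1)]
```

Failing fast on an exception lets the spout replay the tuple immediately instead of waiting for the timeout.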
12
Failure Handling
A tuple isn’t acked because the task died: the spout tuple ids at the root of the trees for the failed tuple will time out and be replayed. Acker task dies: all the spout tuples the acker was tracking will time out and be replayed. Spout task dies: the source the spout talks to is responsible for replaying the messages; for example, queues like Kestrel and RabbitMQ place all pending messages back on the queue when a client disconnects.
13
Storm’s Genius: the major breakthrough is the tracking algorithm.
Storm uses mod hashing to map a spout tuple id to an acker task. Each acker task stores a map from spout tuple id to a pair of values: the first is the id of the task that created the spout tuple; the second is a 64-bit number, the “ack val”, which is the XOR of all tuple ids that have been created or acked in the tree. The tuple tree is complete exactly when the ack val is 0.
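The ack-val trick can be demonstrated in a few lines (assumed names, not Storm's code): every tuple id is XORed in exactly twice, once at creation and once at ack, so the value returns to 0 if and only if the whole tree has been processed.

```python
# Sketch of the acker's XOR bookkeeping. With random 64-bit ids, a
# spurious zero (tree looks done but isn't) needs an XOR collision,
# which has probability ~2^-64 and is ignored in practice.
import os

class Acker:
    def __init__(self):
        self.pending = {}  # spout tuple id -> (spout task id, ack val)

    def update(self, spout_id, task_id, xor_ids):
        """XOR the given tuple ids into the ack val for this spout tuple.
        Returns True when the ack val hits 0 (tree fully processed)."""
        _, val = self.pending.get(spout_id, (task_id, 0))
        for tid in xor_ids:
            val ^= tid
        if val == 0:
            self.pending.pop(spout_id, None)
            return True
        self.pending[spout_id] = (task_id, val)
        return False

acker = Acker()
root = int.from_bytes(os.urandom(8), "big")    # random 64-bit ids
child = int.from_bytes(os.urandom(8), "big")

acker.update(1, 7, [root])          # spout emits the root tuple
acker.update(1, 7, [root, child])   # bolt acks root, creates child
done = acker.update(1, 7, [child])  # bolt acks child -> val == 0
# done == True and acker.pending == {}
```

Note the memory cost: the acker stores only ~20 bytes per spout tuple regardless of the tree's size, which is what makes the scheme scale.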