
1 Storm
Original Slides by Nathan Marz @ Twitter, Shyam Rajendran @ Nutanix

2 Storm Developed by BackType, which was acquired by Twitter
There are lots of tools for batch data processing: Hadoop, Pig, HBase, Hive, … None of them are realtime systems, and realtime processing is becoming a real requirement for businesses

3 History Hadoop? Built for parallel batch processing, not realtime: realtime support requires hacks. Map/Reduce is built to leverage data locality on HDFS to distribute computational jobs, and works on big data. Storm! Stream-processes data in realtime with low latency, and generates big data! Open-sourced in 2011 by Nathan Marz, then lead engineer on analytics products

4 Storm Storm provides realtime computation Scalable
Guarantees no data loss Extremely robust and fault-tolerant Programming-language agnostic

5 Concepts – Streams and Spouts
Stream Unbounded sequence of tuples (Storm's data model), each a <key, value(s)> pair, e.g. <“UIUC”, 5> Spouts Sources of streams, e.g. the Twitter firehose API (a stream of tweets) or some crawler

6 Concepts – Bolts Bolts
Process one or more input streams and produce new streams Functions Filter, Join, Apply/Transform, etc.
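Conceptually, a bolt is just a function over a stream of tuples. The sketch below is a hypothetical pure-Java simulation (not the Storm API): the tuple type, method names, and threshold parameter are illustrative assumptions, showing a Filter plus Transform step like the ones listed above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not the Storm API: a bolt viewed as a function
// from an input stream of tuples to an output stream. A "tuple" here is
// a <word, count> pair; the bolt filters out low counts (Filter) and
// upper-cases the key (Transform).
public class FilterBolt {
    record Tuple(String key, int value) {}

    // Filter: keep tuples with value >= threshold; Transform: upper-case key.
    static List<Tuple> execute(List<Tuple> input, int threshold) {
        return input.stream()
                .filter(t -> t.value() >= threshold)
                .map(t -> new Tuple(t.key().toUpperCase(), t.value()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Tuple> in = List.of(new Tuple("uiuc", 5), new Tuple("mit", 1));
        System.out.println(execute(in, 2)); // only the <UIUC, 5> tuple survives
    }
}
```

In real Storm the equivalent logic lives in a bolt's execute method and runs once per incoming tuple rather than over a whole list.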

7 Concepts – Topology & Grouping
Topology Graph of computation – can have cycles Network of spouts and bolts Spouts and bolts each execute as many parallel tasks across the cluster Grouping How are tuples sent between the components’ tasks?

8 Concepts – Grouping Shuffle Grouping
Distribute tuples “randomly” across the bolt’s tasks Fields Grouping Group a stream by a subset of its fields All Grouping All tasks of the bolt receive all input tuples; useful for joins Global Grouping The entire stream goes to the task with the lowest id
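A fields grouping can be sketched as mod hashing on the chosen field: tuples with equal field values always reach the same task. This is an assumption-level simulation, not Storm's internal routing code; the method name and task count are illustrative.

```java
// Sketch (assumed, not Storm's exact implementation): fields grouping
// routes a tuple by hashing the grouped field(s) modulo the number of
// bolt tasks, so equal field values deterministically land on one task.
public class FieldsGrouping {
    static int chooseTask(String fieldValue, int numTasks) {
        // Math.floorMod keeps the result non-negative even if hashCode() is negative.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same key always maps to the same task index.
        System.out.println(chooseTask("UIUC", tasks) == chooseTask("UIUC", tasks)); // true
    }
}
```

This determinism is what makes fields grouping suitable for per-key aggregation: all counts for one word flow to a single task.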

9 Cluster

10 Guaranteed Message Processing
When is a message “fully processed”? A spout tuple is “fully processed” when its tuple tree has been exhausted and every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout.

11 Fault Tolerance APIs Emit(tuple, output)
Emits an output tuple, optionally anchored on an input tuple (the first argument) Ack(tuple) Acknowledges that the bolt has finished processing a tuple Fail(tuple) Immediately fails the spout tuple at the root of the tuple tree, e.g. when an exception is thrown by a database call A bolt must remember to ack or fail each tuple: each pending tuple consumes memory, so failing to do so results in memory leaks
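The memory-leak warning above can be illustrated with a small simulation (a sketch, not the Storm API; class and method names are invented): pending tuples are held in memory until acked or failed, and a tuple that gets neither lingers forever.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of the ack/fail discipline: each emitted tuple
// is tracked in memory until the bolt acks or fails it. Forgetting to do
// either leaves the entry behind -- the memory leak the slide warns about.
public class PendingTuples {
    private final Map<Long, String> pending = new HashMap<>();

    void emit(long tupleId, String payload) { pending.put(tupleId, payload); }
    void ack(long tupleId)  { pending.remove(tupleId); }  // processed: free it
    void fail(long tupleId) { pending.remove(tupleId); }  // freed now, replayed upstream

    int pendingCount() { return pending.size(); }

    public static void main(String[] args) {
        PendingTuples p = new PendingTuples();
        p.emit(1L, "a");
        p.emit(2L, "b");
        p.ack(1L);                        // tuple 1 is released
        // tuple 2 is never acked or failed, so it stays in memory
        System.out.println(p.pendingCount()); // 1
    }
}
```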

12 Failure Handling A tuple isn't acked because the task died:
Spout tuple ids at the root of the trees for the failed tuple will time out and be replayed. Acker task dies: All the spout tuples the acker was tracking will time out and be replayed. Spout task dies: The source that the spout talks to is responsible for replaying the messages. For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.

13 Storm’s Genius Major breakthrough: the tracking algorithm
Storm uses mod hashing to map a spout tuple id to an acker task. Each acker task stores a map from a spout tuple id to a pair of values: The task id that created the spout tuple The “ack val”, a 64-bit number: the XOR of all tuple ids that have been created or acked in the tree The tuple tree is complete when the ack val reaches 0
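The XOR trick above can be sketched in a few lines (the tuple ids here are made up; real Storm uses random 64-bit ids). Each tuple id is XORed into the ack val once when the tuple is created and once when it is acked; since x ^ x == 0, the ack val returns to 0 exactly when every created tuple has been acked.

```java
// Sketch of the acker's XOR bookkeeping for one tuple tree.
public class AckVal {
    private long ackVal = 0L;

    void created(long tupleId) { ackVal ^= tupleId; } // tuple emitted/anchored
    void acked(long tupleId)   { ackVal ^= tupleId; } // tuple fully processed
    boolean treeComplete()     { return ackVal == 0L; }

    public static void main(String[] args) {
        AckVal acker = new AckVal();
        acker.created(0xCAFEL);  // spout tuple enters the tree
        acker.created(0xBEEFL);  // child tuple anchored on it
        acker.acked(0xCAFEL);
        System.out.println(acker.treeComplete()); // false: 0xBEEF still outstanding
        acker.acked(0xBEEFL);
        System.out.println(acker.treeComplete()); // true: ack val is back to 0
    }
}
```

The elegance is in the constant memory cost: no matter how large the tuple tree grows, the acker tracks it with a single 64-bit value.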

