Part III BigData Analysis Tools (Storm) Yuan Xue
Introduction
Limitation of Hadoop (MapReduce): at its heart it is a batch-oriented big data solution, which leaves gaps in ad-hoc and real-time data processing at massive scale. Hence the need for a dedicated real-time analytics solution.
“There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing” -- Nathan Marz
Solutions: Dremel (Google BigQuery) supports ad-hoc analytics. Storm (Twitter’s real-time computation engine) addresses real-time data analytics.
Storm was originally developed by BackType and now runs under Twitter’s name, following Twitter’s acquisition of BackType.
Storm Architecture
The Storm architecture closely resembles the Hadoop architecture. There are two types of nodes: a master node and worker nodes.
The master node runs Nimbus, which copies the code to the cluster nodes and assigns tasks to the workers; its role is similar to that of the JobTracker in Hadoop.
The worker nodes run the Supervisor, which starts and stops worker processes; its role is similar to that of the TaskTrackers in Hadoop.
Coordination and all state between Nimbus and the Supervisors are managed by ZooKeeper.
[Figure: Storm architecture diagram]
Storm Concepts
Stream: an unbounded sequence of tuples.
Spout: a node that produces data to be processed by other nodes. It can read data from HTTP streams, databases, files, message queues, etc.
Bolt: a node that can both receive and produce data in the Storm cluster. Bolts execute functions, filters, aggregations, joins, and database access.
Topology: an object that configures what the Storm cluster will look like: which spouts and bolts it has and how they are chained together. Similar to a MapReduce job.
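The spout, bolt, and topology concepts above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not Storm's API: every name here (`sentence_spout`, `split_bolt`, `CountBolt`, `run_topology`) is hypothetical, the spout is finite rather than unbounded, and everything runs in one process instead of on a cluster.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

# All names below are hypothetical; they mimic the concepts, not Storm's API.

def sentence_spout() -> Iterator[Tuple]:
    """Spout: the source of a stream. A real spout is unbounded; this one is finite."""
    for sentence in ["the cow jumped", "the moon"]:
        yield (sentence,)

def split_bolt(tup: Tuple) -> Iterator[Tuple]:
    """Bolt: consumes one tuple and emits one tuple per word."""
    (sentence,) = tup
    for word in sentence.split():
        yield (word,)

class CountBolt:
    """Stateful bolt: keeps a running count per word and emits (word, count)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def execute(self, tup: Tuple) -> Iterator[Tuple]:
        (word,) = tup
        self.counts[word] += 1
        yield (word, self.counts[word])

def apply_bolt(stream: Iterable[Tuple], bolt: Callable) -> Iterator[Tuple]:
    """Feed every tuple of a stream into a bolt and forward what it emits."""
    for tup in stream:
        yield from bolt(tup)

def run_topology(spout: Callable, bolts: List[Callable]) -> List[Tuple]:
    """Topology: wires one spout to a linear chain of bolts."""
    stream = spout()
    for bolt in bolts:
        stream = apply_bolt(stream, bolt)
    return list(stream)

counter = CountBolt()
emitted = run_topology(sentence_spout, [split_bolt, counter.execute])
print(emitted)
```

The chain is linear for brevity; a real topology is a graph in which one spout or bolt can feed several downstream bolts.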
Stream Grouping
Question: when a tuple is emitted, which task does it go to?
Shuffle grouping: pick a random task.
Fields grouping: hashing on a subset of the tuple fields, so tuples with equal values for those fields go to the same task.
All grouping: send to all tasks.
Global grouping: pick the task with the lowest id.
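The four groupings can be viewed as functions that map a tuple to the list of task ids that should receive it. The sketch below is plain Python, not Storm's implementation: the function names are hypothetical, tasks are assumed to be numbered from 0, and a CRC32 hash modulo the task count stands in for whatever hashing scheme Storm uses for fields grouping.

```python
import random
import zlib
from typing import List, Sequence, Tuple

# Hypothetical helpers; tasks are assumed to be numbered 0 .. num_tasks - 1.

def shuffle_grouping(tup: Tuple, num_tasks: int) -> List[int]:
    """Pick one task at random, spreading load evenly on average."""
    return [random.randrange(num_tasks)]

def fields_grouping(tup: Tuple, num_tasks: int, field_indexes: Sequence[int]) -> List[int]:
    """Hash the selected fields so equal values always map to the same task.

    Assumption: CRC32 modulo num_tasks is used here only because it is
    stable across runs; it is not taken from Storm's source.
    """
    key = repr(tuple(tup[i] for i in field_indexes)).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]

def all_grouping(tup: Tuple, num_tasks: int) -> List[int]:
    """Replicate the tuple to every task."""
    return list(range(num_tasks))

def global_grouping(tup: Tuple, num_tasks: int) -> List[int]:
    """Send every tuple to the single task with the lowest id."""
    return [0]

# Both ("the", 1) and ("the", 7) share field 0, so they land on the same task.
first = fields_grouping(("the", 1), 4, [0])
second = fields_grouping(("the", 7), 4, [0])
print(first == second)
```

Fields grouping is what makes a per-word counter correct when the counting bolt runs as several tasks: every occurrence of a given word reaches the same task, so no count is split across tasks.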
Example Code starter/blob/master/src/jvm/storm/starter/WordCountTopology.java
The Lambda architecture
The Lambda architecture – Detailed View
Merge Realtime View into Batch View
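A minimal sketch of the merge step, assuming both views hold additive counts keyed by the same identifier: the batch view covers all data up to the last batch run, and the realtime view covers only data that arrived since then. The function name `merge_views` and the sample page-view numbers are hypothetical.

```python
from typing import Dict

def merge_views(batch_view: Dict[str, int], realtime_view: Dict[str, int]) -> Dict[str, int]:
    """Answer a query by adding realtime increments on top of batch totals.

    Assumes the two views cover disjoint time ranges, so counts can
    simply be summed without double counting.
    """
    merged = dict(batch_view)  # copy so the batch view is left untouched
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Hypothetical page-view counts per URL.
batch_view = {"a.com": 100, "b.com": 40}
realtime_view = {"a.com": 3, "c.com": 1}
merged = merge_views(batch_view, realtime_view)
print(merged)
```

Summation only works for additive metrics; other aggregates (for example unique visitors) would need a merge function suited to their structure.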
Reference principles-for-architecting