Part III BigData Analysis Tools (Storm) Yuan Xue

Part III BigData Analysis Tools (Storm) Yuan Xue (yuan.xue@vanderbilt.edu)

Introduction  Limitation of Hadoop (MapReduce)  Batch-oriented big data solution at its heart  Gaps in ad-hoc and real-time data processing at massive scale  The need for a dedicated real-time analytics solution  “There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing” -- Nathan Marz  Solution  Dremel (Google BigQuery) to support ad-hoc analytics  Storm (Twitter’s real-time computation) engine to provide solution in the real-time data analytics world.  Storm -- originally developed by BackType and running now under Twitter’s name, after BackType has been acquired by them.

Storm Architecture  Storm architecture very much resembles to Hadoop architecture  Two types of nodes: a master node and the worker nodes.  The master node runs Nimbus that is copying the code to the cluster nodes and assigns tasks to the workers – it has a similar role as JobTracker in Hadoop.  The worker nodes run the Supervisor which starts and stops worker processes – its role is similar to TaskTrackers in Hadoop.  The coordination and all states between Nimbus and Supervisors are managed by Zookepeer, so the architecture looks as follows:

Storm Concepts  Streams  Unbounded sequence of tuples  Spout  nodes that produce data to be processed by other nodes. It can read data from HTTP streams, databases, files, message queues, etc  Bolt  Bolts can both receive and produce data in the Storm cluster.  Execute: functions, filters, aggregation, joins, database access  Topology  Object that configures how the Storm cluster will look like: what Sprouts and Bolts it has and how they are chained together  Similar to a MR job

Stream Grouping  Question: When a tuple is emitted, which task does it go to?  Shuffle grouping: pick a random task  Fields grouping: consistent hashing on a subset of tuple fields  All grouping: send to all tasks  Global grouping: pick task with lowest id

Example Code  https://github.com/nathanmarz/storm- starter/blob/master/src/jvm/storm/starter/WordCountTopology.java

The Lambda architecture http://www.manning.com/marz/BDmeapch1.pdf http://lambda-architecture.net/

The Lambda architecture – Detailed View http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting

Merge Realtime View into Batch View
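One way to picture the merge is a query that combines both views at read time. The sketch below assumes an additive metric (for example, page-view counts per URL), in-memory maps standing in for the two views, and a realtime view that holds only data that arrived after the last batch run; the class name, method name, and keys are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Query-time merge of a batch view and a realtime view (illustrative only).
final class ViewMergeSketch {

    // Assumption: the metric is additive, so the merged answer is the batch
    // result plus the increment accumulated in the realtime view.
    static long query(String key,
                      Map<String, Long> batchView,
                      Map<String, Long> realtimeView) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("url-a", 100L);   // precomputed by the batch layer

        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("url-a", 7L);  // arrived since the last batch run
        realtimeView.put("url-b", 2L);  // not yet seen by the batch layer

        System.out.println(query("url-a", batchView, realtimeView));
        System.out.println(query("url-b", batchView, realtimeView));
    }
}
```

Metrics that are not additive, such as distinct counts, would need a different merge function than simple addition.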

Reference  http://bighadoop.wordpress.com/tag/storm/ http://bighadoop.wordpress.com/tag/storm/  http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture- principles-for-architecting http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture- principles-for-architecting  http://www.manning.com/marz/BDmeapch1.pdf http://www.manning.com/marz/BDmeapch1.pdf  https://github.com/nathanmarz/storm/wiki/Tutorial https://github.com/nathanmarz/storm/wiki/Tutorial
