Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.

Agenda Apache Storm

Traditional Data Processing
[Diagram labels: !!!ALL!!! the data; Batch Pre-Computation (aka MapReduce); Index; Query]

Traditional Data Processing
Slow... and views are out of date
[Timeline labels: Absorbed into batch views; Not absorbed; Now; Time]

Compensating for the real-time stuff
Need some kind of stream processing system to supplement our batch views
Applications can then merge the batch and the real-time views together!
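
As a rough illustration of merging the two views, here is a minimal sketch that assumes both views are plain per-key counts. The function name `merge_views` and the page-view numbers are hypothetical and do not come from the slides.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine precomputed batch counts with counts from the stream layer."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds the counts key by key
    return dict(merged)


# Hypothetical data: the batch view covers everything up to the last batch
# run, the real-time view covers only what has arrived since then.
batch_view = {"/home": 1000, "/about": 40}
realtime_view = {"/home": 12, "/pricing": 3}

merged = merge_views(batch_view, realtime_view)
print(merged)
```

A query then reads the merged result, so it sees recent data without waiting for the next batch run.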

How do we do that?

Enter: Storm
Open-source project originally built at BackType and open-sourced by Twitter
Now a top-level Apache project
Enables distributed, fault-tolerant, real-time, guaranteed computation

A History Lesson on Twitter Metrics Twitter Firehose

A History Lesson on Metrics Twitter Firehose

Problems!
Scaling is painful
Fault-tolerance is practically non-existent
Coding for it is awful

Wanted to Address
Guaranteed data processing
Horizontal Scalability
Fault-tolerance
No intermediate message brokers
Higher level abstraction than message passing
“Just works”

Storm Delivers
Guaranteed data processing
Horizontal Scalability
Fault-tolerance
No intermediate message brokers
Higher level abstraction than message passing
“Just works”

Use Cases
Stream Processing
Distributed RPC
Continuous Computation

Storm Architecture
[Diagram labels: Nimbus; ZooKeeper; Supervisor]

Glossary
Streams
– Constant pump of data as Tuples
Spouts
– Source of streams
Bolts
– Process input streams and produce new streams
– Functions, Filters, Aggregation, Joins, Talk to databases, etc.
Topologies
– Network of spouts and bolts
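
The glossary terms can be mimicked in a single process with Python generators. This is only a toy model under stated assumptions: it does not use the Storm API, and `sentence_spout`, `split_bolt`, `filter_bolt`, and `run_topology` are hypothetical names chosen for illustration.

```python
def sentence_spout():
    """Spout: the source of a stream (finite here so the example terminates)."""
    for sentence in ["the cow jumped", "over the moon"]:
        yield (sentence,)  # each item is a tuple with one field


def split_bolt(stream):
    """Bolt used as a function: one input tuple produces many output tuples."""
    for (sentence,) in stream:
        for word in sentence.split(" "):
            yield (word,)


def filter_bolt(stream, stop_words):
    """Bolt used as a filter: drops tuples whose word is a stop word."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)


def run_topology():
    """Topology: the network formed by wiring the spout into the bolts."""
    return list(filter_bolt(split_bolt(sentence_spout()), {"the"}))


words = run_topology()
print(words)
```

In a real cluster each spout and bolt would run as many parallel tasks on different machines; the chaining above only shows how streams flow from one component to the next.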

Tasks and Topologies

Grouping
When a Tuple is emitted from a Spout or Bolt, where does it go?
Shuffle Grouping
– Pick a random task
Fields Grouping
– Consistent hashing on a subset of tuple fields
All Grouping
– Send to all tasks
Global Grouping
– Pick task with lowest ID
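
One way to picture the four groupings is as functions that map a tuple to the list of task indexes that should receive it. This sketch is not Storm's implementation: the function names are hypothetical, tasks are assumed to be numbered from 0, and the fields grouping uses a CRC32 hash modulo the task count as a stand-in for whatever hashing Storm applies.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field_indexes):
    """Pick a task from a hash of the selected fields.

    zlib.crc32 is used because Python's built-in hash() of strings varies
    between processes, which would break routing across workers.
    """
    key = "|".join(str(tup[i]) for i in field_indexes)
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(tup, num_tasks):
    """Send the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send the tuple to the task with the lowest ID (assumed to be 0)."""
    return [0]


same_task = fields_grouping(("storm", 1), 4, [0]) == fields_grouping(("storm", 2), 4, [0])
print(same_task)
```

The property that matters for a fields grouping is that tuples with equal values in the chosen fields always reach the same task, which is what stateful bolts such as a word counter rely on.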

Topology
[Diagram: example topology with edges labeled by grouping: shuffle; [“url”]; shuffle; [“id1”, “id2”]; all]

Guaranteed Message Processing
A tuple has not been fully processed until all tuples in the “tuple tree” have been completed
If the tree is not completed within a timeout, the tuple is replayed
Programmers need to use the API to ‘ack’ a tuple as completed
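
The bookkeeping behind tuple trees can be sketched in a few lines. The version below assumes the XOR technique associated with Storm's acker: every tuple ID is XORed into a per-tree value once when the tuple is created and once when it is acked, so the value returns to zero exactly when every tuple has been acked. The class `TupleTreeTracker`, its method names, and the 30-second timeout are hypothetical and are not the Storm API.

```python
import random


def new_id():
    """Random non-zero 64-bit tuple ID."""
    return random.getrandbits(64) or 1


class TupleTreeTracker:
    def __init__(self, timeout_secs=30.0):
        self.timeout_secs = timeout_secs
        self._trees = {}  # root_id -> [xor_of_pending_ids, started_at]

    def start_tree(self, root_id, now):
        """Register the tuple emitted by the spout; returns its tuple ID."""
        tuple_id = new_id()
        self._trees[root_id] = [tuple_id, now]
        return tuple_id

    def ack(self, root_id, acked_id, child_ids=()):
        """Ack one tuple and register the children it produced.

        Returns True when the whole tree has been completed.
        """
        entry = self._trees[root_id]
        entry[0] ^= acked_id
        for child in child_ids:
            entry[0] ^= child
        if entry[0] == 0:
            del self._trees[root_id]
            return True
        return False

    def roots_to_replay(self, now):
        """Roots whose trees did not complete within the timeout."""
        return [root for root, (_, started) in self._trees.items()
                if now - started > self.timeout_secs]


tracker = TupleTreeTracker(timeout_secs=30.0)
sentence = tracker.start_tree("msg-1", now=0.0)
tracker.start_tree("msg-2", now=0.0)  # never acked in this example

w1, w2 = new_id(), new_id()
split_done = tracker.ack("msg-1", sentence, child_ids=[w1, w2])
first_done = tracker.ack("msg-1", w1)
tree_done = tracker.ack("msg-1", w2)
replay = tracker.roots_to_replay(now=31.0)
print(tree_done, replay)
```

Here `msg-1` completes once both word tuples are acked, while `msg-2` is still pending after the timeout and would be handed back to its spout for replay.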

Stream Processing Example
Word Count
    TopologyBuilder builder = new TopologyBuilder();
    // Arguments: component id, component instance, parallelism.
    builder.setSpout(1, new SentenceSpout(true), 5);
    builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
    builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
    Map conf = new HashMap();
    conf.put(Config.TOPOLOGY_WORKERS, 5);
    StormSubmitter.submitTopology("word-count", conf, builder.createTopology());

    public static class SplitSentence extends ShellBolt implements IRichBolt {
        // Delegate the processing to the Python script below.
        public SplitSentence() {
            super("python", "splitsentence.py");
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }
splitsentence.py:
    #!/usr/bin/python
    import storm
    class SplitSentenceBolt(storm.BasicBolt):
        def process(self, tup):
            # Emit one tuple per space-separated word in the sentence.
            words = tup.values[0].split(" ")
            for word in words:
                storm.emit([word])
    SplitSentenceBolt().run()

    public static class WordCount implements IBasicBolt {
        // Running count per word, local to this task.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        public void prepare(Map conf, TopologyContext context) {}
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) {
                count = 0;
            }
            ++count;
            counts.put(word, count);
            // Emit the updated count each time the word is seen.
            collector.emit(new Values(word, count));
        }
        public void cleanup() {}
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }
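
Because each WordCount task keeps its own `counts` map, the topology routes tuples to it with a fields grouping on “word”. The following sketch simulates that in plain Python to show why: every occurrence of a word lands on the same task, so the running counts stay correct. `WordCountTask` and `run_word_count` are hypothetical names, and the CRC32-modulo routing is an assumption standing in for Storm's fields grouping.

```python
import zlib


class WordCountTask:
    """One task of the counting bolt, holding its own private counts."""

    def __init__(self):
        self.counts = {}

    def execute(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        return (word, self.counts[word])  # the emitted (word, count) tuple


def run_word_count(words, num_tasks):
    tasks = [WordCountTask() for _ in range(num_tasks)]
    emitted = []
    for word in words:
        # Fields grouping on "word": the same word always picks the same task.
        task = tasks[zlib.crc32(word.encode("utf-8")) % num_tasks]
        emitted.append(task.execute(word))
    return emitted, tasks


emitted, tasks = run_word_count(["the", "cow", "the", "moon", "the"], num_tasks=3)
print(emitted)
```

With a shuffle grouping instead, occurrences of “the” could be spread over several tasks and no single task would hold the full count.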

Local Mode!
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout(1, new SentenceSpout(true), 5);
    builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
    builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
    Map conf = new HashMap();
    conf.put(Config.TOPOLOGY_WORKERS, 5);
    // Run the topology in-process instead of submitting it to a cluster.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf, builder.createTopology());
    Thread.sleep(10000);
    cluster.shutdown();

Command Line Interface
Starting a topology
    storm jar mycode.jar twitter.storm.MyTopology demo
Stopping a topology
    storm kill demo

Distributed RPC

DRPC Example
Reach
Reach is the number of unique people exposed to a specific URL on Twitter
[Diagram labels: URL; Tweeter; Follower; Distinct Follower; Count; Reach]

Reach Topology
[Diagram components: Spout; GetTweeters; GetFollowers; Distinct; CountAggregator. Grouping labels: shuffle; [“follower-id”]; global]
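
The reach computation itself is small when written as a single function. This sketch assumes two in-memory lookups, `tweeters_by_url` and `followers_by_user`, which are hypothetical stand-ins for the databases a real topology would query; the sample users are made up.

```python
def reach(url, tweeters_by_url, followers_by_user):
    """Number of distinct followers of everyone who tweeted the URL."""
    distinct_followers = set()
    for tweeter in tweeters_by_url.get(url, []):           # GetTweeters
        followers = followers_by_user.get(tweeter, [])     # GetFollowers
        distinct_followers.update(followers)               # Distinct
    return len(distinct_followers)                         # Count


# Hypothetical data: "dave" follows both tweeters and is counted once.
tweeters_by_url = {"http://example.com/a": ["alice", "bob"]}
followers_by_user = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}

result = reach("http://example.com/a", tweeters_by_url, followers_by_user)
print(result)
```

The topology spreads the same steps over many tasks because a popular URL can involve a very large number of follower lookups.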

Storm Review
Distributed code and configurations
Robust process management
Monitors topologies and reassigns failed tasks
Provides reliability by tracking tuple trees
Routing and partitioning of streams
Serialization
Fine-grained performance stats of topologies

APACHE SPARK

Concern!
Say I have an application that involves many iterations...
– Graph Algorithms
– K-Means Clustering
– Six Degrees of Bieber Fever
What's wrong with Hadoop MapReduce?
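
To make the concern concrete, here is a small one-dimensional k-means sketch that counts how many full passes it makes over the same dataset. The likely point of the slide's question, stated here as an assumption, is that with Hadoop MapReduce each such pass is typically a separate job that rereads its input from disk. The function `kmeans_1d` and the sample points are hypothetical.

```python
def kmeans_1d(points, centroids, iterations):
    """Plain k-means on numbers; returns (centroids, full passes over points)."""
    passes = 0
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:  # one full pass over the dataset per iteration
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        passes += 1
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, passes


final_centroids, passes = kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0], iterations=3)
print(final_centroids, passes)
```

Keeping the dataset in memory between passes is the kind of optimization an iterative workload like this would benefit from.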
