Heron: a stream data processing engine (Shreya)

Real Time Streaming by Storm
- Storm was the main platform for providing real-time analytics at Twitter.
- A Storm topology is a directed graph of spouts and bolts.
- Spouts: sources of input data/tuples (e.g., a stream of tweets).
- Bolts: abstractions representing computation on the stream, such as real-time active user counts (RTAC).
- Spouts and bolts are run as tasks.
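
For concreteness, here is a minimal sketch of how such a topology might be wired up with the Storm Java API (which Heron keeps, as noted later); TweetSpout, RtacBolt, and the "user_id" field are hypothetical names invented for the example, not taken from the paper:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class RtacTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Spout: emits the stream of tweets (TweetSpout is a hypothetical spout class).
            builder.setSpout("tweets", new TweetSpout(), 2);

            // Bolt: computes real-time active user counts, grouped by user id
            // (RtacBolt and the "user_id" field are hypothetical).
            builder.setBolt("rtac", new RtacBolt(), 4)
                   .fieldsGrouping("tweets", new Fields("user_id"));

            StormSubmitter.submitTopology("rtac-topology", new Config(), builder.createTopology());
        }
    }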

Storm Worker
- Multiple tasks are grouped into an executor; multiple executors are grouped into a worker.
- Each worker runs as a JVM process (the JVM, Java's virtual machine, runs the program); worker processes are scheduled by the OS.
- Each executor is mapped to two threads, scheduled by the JVM using preemptive, priority-based scheduling.
- Each executor also has its own scheduler for its tasks.
- The result is several layers of scheduling, which is a complex scheduling problem.
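
A sketch of how the task / executor / worker hierarchy is expressed through Storm's configuration knobs; the numbers and the TweetSpout/RtacBolt classes are illustrative only, reusing the hypothetical components from the previous sketch:

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismExample {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("tweets", new TweetSpout(), 2);    // 2 executors for the spout

            builder.setBolt("rtac", new RtacBolt(), 4)          // parallelism hint = 4 executors (threads)
                   .setNumTasks(8)                              // 8 tasks multiplexed onto those 4 executors
                   .shuffleGrouping("tweets");

            Config conf = new Config();
            conf.setNumWorkers(2);  // all executors are packed into 2 worker JVM processes
            // (submission as in the previous sketch)
        }
    }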

Limitations of Storm
- As the scale and diversity of the data increased, Storm's limitations became apparent.
- The many layers of scheduling make the system much more complex and its performance more difficult to predict.
- Since each worker can run a mix of tasks, spouts and bolts from different sources can end up depending on the same resources.
- There is no resource isolation.

Limitations of Storm
- Debugging is difficult: when a worker is restarted, an erroneous task may be rescheduled alongside a different set of tasks, making it hard to find. Killing an entire worker process also hurts the other tasks running in it.
- The resource allocation model can lead to overprovisioning, and this gets worse as more types of components are packed into each worker.
- Example: 3 spouts (5 GB each) and 1 bolt (10 GB) scheduled on two workers. One worker needs 10 + 5 = 15 GB and the other only 10 GB, but every worker is provisioned for the maximum, so 2 × 15 = 30 GB is allocated when only 25 GB is needed.
- Because topologies share the cluster's resources, hurting one topology can hurt others.
- A cleaner mapping from the logical units of computation to physical processes is crucial for debuggability, especially when responding to pager alerts for a failing topology that is critical to the underlying business model.

Limitations of Storm
- As the number of bolts/tasks per worker increases, each worker tends to be connected to every other worker, and there are not enough ports on each worker for this communication: not scalable.
- Because multiple components of the topology are bundled into one OS process, debugging is difficult.
- A new engine was needed that could provide better scalability, better sharing of cluster resources, and better debuggability.

Storm Nimbus Limitations
- Nimbus schedules and monitors topologies and distributes JARs; it also manages counters for several topologies. All of these tasks make Nimbus the bottleneck.
- It does not support resource isolation at a granular level: workers of different topologies on the same machine can interfere with each other. The workaround of running an entire topology on dedicated machines leads to wasted resources.
- It uses ZooKeeper to manage heartbeats from the workers, and this becomes the bottleneck for large numbers of topologies.
- When Nimbus fails, users cannot modify topologies, and topology failures cannot be automatically detected.

Efficiency
- Reduced performance was often caused by:
  - Suboptimal replays: a tuple failure anywhere in the tuple tree leads to a replay of the whole tree.
  - Long garbage collection cycles: high RAM usage leads to long GC cycles, resulting in high latencies and high tuple failure rates.
  - Queue contention: there is contention for the transfer queues.
- Storm dealt with these issues by overprovisioning.

Heron
- Storm's issues were fundamental enough that a new engine, Heron, was the answer.
- Heron runs topologies (directed acyclic graphs of spouts and bolts).
- The programmer specifies the number of tasks for each spout and bolt, and how data is partitioned across the spouts.
- Tuple processing semantics:
  - At most once: no tuple is processed more than once.
  - At least once: each tuple is guaranteed to be processed at least once.
- The API is compatible with Storm, so migration is easy.
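
As a rough illustration of the two semantics, a minimal sketch using the Storm-compatible configuration API (which Heron accepts); the acker count is the usual knob for switching between them, and the class name and values here are made up for the example:

    import org.apache.storm.Config;

    public class DeliverySemanticsExample {
        public static Config atMostOnce() {
            Config conf = new Config();
            conf.setNumAckers(0);            // no ackers: failed tuples are never replayed (at most once)
            return conf;
        }

        public static Config atLeastOnce() {
            Config conf = new Config();
            conf.setNumAckers(1);            // enable acking so failed tuples are replayed
            conf.setMessageTimeoutSecs(30);  // replay tuples that are not acked within 30 seconds
            return conf;
        }
    }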

Architecture
- Topologies are deployed to the Aurora scheduler using the Heron command-line tool.
- Aurora is a generic service scheduler that runs as a framework on top of Mesos.
- The scheduler is an abstraction and can be replaced by a different one.
- Each topology is an Aurora job consisting of several containers.

Architecture
- The first container runs the Topology Master.
- The remaining containers each run a Stream Manager, a Metrics Manager, and a number of Heron Instances (each a JVM).
- Containers are scheduled by Aurora.
- Processes communicate with each other using protocol buffers.

Topology Master
- Responsible for managing the topology and serving as the point of contact for discovering its status.
- A topology can have only one TM.
- Provides an endpoint for topology metrics.

Stream Manager
- Manages the routing of tuples efficiently; Heron Instances connect to their local SM to send and receive tuples.
- There are O(k^2) connections, where k is the number of containers/SMs.
- Stage-by-stage backpressure: propagate backpressure stage by stage upstream until it reaches the spouts.
- Spout backpressure: clamp the upstream components (the spouts) when Heron Instances are slowing down, dynamically adjusting the rate at which data flows (if upstream produces too fast, large queues build up downstream).
- TCP backpressure: rely on the TCP windowing mechanism. This works because the send/receive rate of a socket buffer matches the production/consumption rate, so it is a good signal; however, a slowdown propagates to the other stages as well, so getting out of backpressure is a really slow process.
- Buffers carry a high and a low water mark: backpressure is triggered when a queue crosses the high water mark and released only once it drops below the low water mark, which prevents rapid oscillation into and out of backpressure.
- Sending the start and stop backpressure messages adds some overhead.
- The design allows traceability: you can find out which component triggered the backpressure.
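
To make the high/low water mark behaviour concrete, a small hypothetical sketch (not Heron's actual code) of a buffer that raises backpressure at the high water mark and releases it only after draining below the low water mark:

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Hypothetical illustration of high/low water mark backpressure; not Heron's real implementation.
    public class WatermarkBuffer<T> {
        private final Queue<T> queue = new ArrayDeque<>();
        private final int highWaterMark;
        private final int lowWaterMark;
        private boolean backpressure = false;

        public WatermarkBuffer(int highWaterMark, int lowWaterMark) {
            this.highWaterMark = highWaterMark;
            this.lowWaterMark = lowWaterMark;
        }

        // Called by the producer (e.g., an upstream component via its SM).
        public synchronized void offer(T item) {
            queue.add(item);
            if (!backpressure && queue.size() >= highWaterMark) {
                backpressure = true;   // signal SMs to clamp their local spouts
            }
        }

        // Called by the consumer (e.g., the local Heron Instance draining tuples).
        public synchronized T poll() {
            T item = queue.poll();
            if (backpressure && queue.size() <= lowWaterMark) {
                backpressure = false;  // resume spouts only after draining below the low water mark
            }
            return item;
        }

        public synchronized boolean underBackpressure() {
            return backpressure;
        }
    }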

Heron Instance
- Each Heron Instance is a JVM process.
- Single-threaded approach: maintain a TCP connection to the local SM, wait for tuples, and invoke the user logic code on them; output is buffered until a threshold is met and then delivered to the SM.
- Problem: the user code can block, for example on reads/writes or by calling the sleep system call.
- Two-threaded approach: a Gateway thread and a Task Execution thread.

Heron Instance
- Gateway thread: controls communication and data movement in and out; maintains the TCP connections to the local SM and the Metrics Manager.
- Task Execution thread: receives tuples and runs the user code; sends its output to the Gateway thread.
- Garbage collection can be triggered before the Gateway thread sends out buffered tuples, so the queue sizes are checked periodically.
- When the data-in queue exceeds its bound, the backpressure mechanism is triggered.
- The TM writes the physical plan to ZooKeeper so that state can be rediscovered in case of failure.
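
A rough, hypothetical sketch of the two-thread layout (again, not Heron's source): bounded queues sit between the Gateway thread and the Task Execution thread, and a full data-in queue is where backpressure would kick in:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical two-thread Heron Instance layout: a gateway thread moves tuples
    // between the local SM and bounded in/out queues, and a task thread runs user code.
    public class TwoThreadInstance {
        private final BlockingQueue<String> dataIn  = new ArrayBlockingQueue<>(1024);
        private final BlockingQueue<String> dataOut = new ArrayBlockingQueue<>(1024);

        public void start() {
            Thread gateway = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    // In reality: read a tuple from the local SM socket; here we fabricate one.
                    String incoming = "tuple";
                    if (!dataIn.offer(incoming)) {
                        // data-in queue full: this is where backpressure would be signalled to the SM
                    }
                    String outgoing = dataOut.poll();
                    if (outgoing != null) {
                        // In reality: write the tuple to the local SM socket.
                    }
                }
            }, "gateway");

            Thread taskExecution = new Thread(() -> {
                try {
                    while (true) {
                        String tuple = dataIn.take();        // wait for work from the gateway
                        String result = tuple.toUpperCase(); // stand-in for the user's spout/bolt logic
                        dataOut.put(result);                 // hand the output back to the gateway
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "task-execution");

            gateway.start();
            taskExecution.start();
        }
    }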

Start-up Sequence
- The topology is submitted to Heron.
- The scheduler allocates the necessary resources and schedules containers on several machines.
- The TM comes up on the first container and makes itself discoverable using ZooKeeper.
- The SM on each container consults ZooKeeper to discover the TM.
- Once all the connections are set up, the TM runs the assignment algorithm to assign components to containers, and execution begins.
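
As an illustration of the discovery step, a minimal sketch using Apache Curator, a common ZooKeeper client; the znode path, topology name, and addresses are invented for the example, and this is not Heron's actual code:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    import org.apache.zookeeper.CreateMode;

    // Hypothetical sketch of TM discovery via ZooKeeper; paths and addresses are made up.
    public class TMasterDiscovery {
        private static final String TMASTER_PATH = "/heron/topologies/word-count/tmaster";

        // Run by the Topology Master: advertise its host:port as an ephemeral znode,
        // so the entry disappears automatically if the TM process dies.
        public static void advertise(CuratorFramework zk, String hostPort) throws Exception {
            zk.create()
              .creatingParentsIfNeeded()
              .withMode(CreateMode.EPHEMERAL)
              .forPath(TMASTER_PATH, hostPort.getBytes());
        }

        // Run by each Stream Manager: look up the TM's location before connecting.
        public static String discover(CuratorFramework zk) throws Exception {
            return new String(zk.getData().forPath(TMASTER_PATH));
        }

        public static void main(String[] args) throws Exception {
            CuratorFramework zk = CuratorFrameworkFactory.newClient(
                    "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
            zk.start();
            advertise(zk, "container-0:39001");
            System.out.println("TM is at " + discover(zk));
        }
    }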

Failure Scenarios
- Failures include the death of processes, the failure of containers, and the failure of machines.
- If the TM process dies, its container restarts it and it recovers its state from ZooKeeper; the standby TM becomes the master, and the restarted master becomes the standby.
- If an SM dies, it rediscovers the TM upon restart.
- When a Heron Instance dies, it is restarted and gets a copy of the physical plan from its SM.

Heron Tracker / Heron UI
- The Tracker saves metadata and collects information on topologies, and provides an API for building further tools.
- The UI provides a visual representation of topologies: you can view the directed acyclic graph and see statistics on specific components.
- Heron Viz automatically creates a dashboard that provides health and resource monitoring.
- Heron has resulted in a 3x reduction in hardware at Twitter.

Evaluation
- Tested with the Word Count topology and the RTAC topology, considering both at-least-once and at-most-once semantics.
- Run on machines with 12 cores, 72 GB of RAM, and 500 GB of disk.
- Assumed no out-of-memory crashes and no repetitive GC cycles.

Word Count Topology
- The spouts can produce tuples very quickly, so the topology stresses the system.
- Storm's convoluted queue structure led to more overhead; Heron demonstrates better throughput and efficiency.
- The results are similar with acknowledgements disabled and for the RTAC topologies.
- As the number of containers grows, the SMs become a bottleneck: they need to maintain buffers for each connection to the other SMs, and it takes more time to consume data from more buffers. Tuples also live in these buffers for a longer time given a constant input rate (there is only one spout instance per container).

Summary
- Resource provisioning is abstracted from the cluster manager, allowing isolation.
- A single Heron Instance runs only one spout or bolt, allowing better debugging.
- Step-by-step backpressure makes it clear which component is failing.
- Heron is efficient because resources are allocated at the component level.
- A per-topology TM allows each topology to be managed independently, so the failure of one does not interfere with another.