Low Latency Streaming: Heron on InfiniBand
Supun Kamburugamuve (supun@apache.org), Geoffrey Fox, Martin Swany
Karthik Ramasamy (@karthikz), Co-founder of Streamlio

Information Age: Real-time is key

Real Time Connected World
- Internet of Things: 30 B connected devices by 2020
- Connected Vehicles: data transferred per vehicle per month, 4 MB -> 5 GB
- Health Care: 153 Exabytes (2013) -> 2,314 Exabytes (2020)
- Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1]; Siri/Cortana/Google Now
- Machine Data: 40% of the digital universe by 2020
- Augmented/Virtual Reality: > $150B by 2020 [2]; Oculus/HoloLens/Magic Leap
[1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oABw

Value of Data (Courtesy Michael Franklin, BIRTE, 2015)

Introducing Heron: Design Goals
- Consistent performance at scale
- Easy to debug and tune
- Fast/efficient, general-purpose streaming engine
- Storm API compatible
- Latency/throughput configurability
- Flexible deployment modes
- Achieving low latency for financial & IoT applications

Heron in Production @ Twitter
- Completely replaced Storm 3 years ago
- 3x reduction in cores and memory
- Significantly reduced operational overhead
- 10x reduction in production incidents

Heron Use Cases: real-time ETL, real-time BI, spam detection, real-time trends, real-time ML, real-time ops

Open Sourcing
- Open sourced May 2016
- https://github.com/twitter/heron and http://heron.io
- Apache 2.0 License
- Contributions from Microsoft, Mesosphere, Google, Wanda Group, WeChat, Fitbit, and growing

Heron Core Concepts
- Topology: a directed acyclic graph; vertices = computation, edges = streams of data tuples
- Spouts: sources of data tuples for the topology (examples: Kafka, Kestrel, MySQL, Postgres)
- Bolts: process incoming tuples and emit outgoing tuples (examples: filtering, aggregation, join, any function)

Sample Heron Topology (diagram: Spout 1 and Spout 2 feeding a graph of bolts)

Topology Architecture (diagram: a Topology Master syncs the logical plan, physical plan and execution state through a ZK cluster; each container runs a Stream Manager, metrics collection and instances I1 to I4, and receives the physical plan)

Stream Manager: Design Goals
- Core logic in one centralized place
- Super efficient
- Pluggable:
  - Transport (TCP sockets, Unix sockets, shared memory)
  - Interlanguage data format (Protobufs, Cap'n Proto, etc.)
  - Protocol (HTTP, gRPC, custom, etc.)
- Multilanguage instances (C++, Java, Python)

Stream Manager: Current Implementation
- Implements at-most-once, at-least-once and exactly-once delivery
- Written in C++ for efficiency
- Transport using TCP sockets
- Protobuf data format
- Custom protocol (very similar to gRPC)
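A minimal sketch of what the pluggable transport seam described above could look like inside a C++ stream manager. The Transport and StreamManager names here are hypothetical illustrations, not Heron's actual classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pluggable transport seam. Concrete implementations could wrap
// TCP sockets, Unix sockets, shared memory, or an RDMA-capable fabric.
class Transport {
 public:
  virtual ~Transport() = default;
  virtual bool Connect(const std::string& endpoint) = 0;
  // Send a serialized (e.g. Protobuf-encoded) message.
  virtual bool Send(const uint8_t* data, size_t length) = 0;
  // Poll for a complete incoming message; returns false if none is ready.
  virtual bool Poll(std::vector<uint8_t>* message) = 0;
};

// The stream manager holds the chosen transport behind the interface, so
// swapping TCP for InfiniBand does not touch the tuple-routing logic.
class StreamManager {
 public:
  explicit StreamManager(std::unique_ptr<Transport> transport)
      : transport_(std::move(transport)) {}
  void Forward(const uint8_t* tuple_bytes, size_t length) {
    transport_->Send(tuple_bytes, length);
  }
 private:
  std::unique_ptr<Transport> transport_;
};
```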

High Performance Clusters
- Computationally expensive, massively parallel applications
- Tightly synchronized parallel operations; efficient low-latency communication is critical
- Even a small amount of OS noise can affect performance at this scale
- Message Passing Interface (MPI): less than 20 µs latency for P2P communications
- 1,020 compute nodes, 21,824 cores, 43,648 GB of RAM

Putting into Context: Performance of MPI
- Example: AllReduce operation. Each parallel worker has an integer array of equal size; the global sum of the values at each array index is calculated (diagram: two processes P1 and P2, each with an array of 4 integers, combined via Reduce + Broadcast, i.e. AllReduce)
- OpenMPI 3.0.0, 16 nodes, 384 parallel processes, TCP and InfiniBand
- 15 to 20x speedup (a minimal AllReduce sketch follows)
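A minimal sketch of the AllReduce example described above, using the standard MPI C API from C++; the array size and local values are illustrative.

```cpp
#include <mpi.h>
#include <cstdio>

// Each process holds an integer array of equal size; MPI_Allreduce computes
// the element-wise global sum and leaves the result on every rank.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Illustrative local values; in the slide's example each process has 4 integers.
  int local[4] = {rank + 1, rank + 2, rank + 3, rank + 4};
  int global[4] = {0, 0, 0, 0};

  MPI_Allreduce(local, global, 4, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0) {
    std::printf("global sums: %d %d %d %d\n",
                global[0], global[1], global[2], global[3]);
  }
  MPI_Finalize();
  return 0;
}
```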

System Performance Factors
- Measure: throughput, latency, power consumption, scalability, speedup
- Architecture: efficient use of resources, algorithmic enhancements, special hardware

High Performance Interconnects
- InfiniBand: widely used interconnect, dominant in HPC clusters; others include Intel Omni-Path and Cray Aries
- Wide range of messaging capabilities: reliable/unreliable, Send/Receive (channel semantics), RDMA read/write (memory semantics), atomic operations

InfiniBand generations:
  SDR (2001-2004): 10 Gb/s, 5 µs latency
  DDR (2005-2007): 20 Gb/s, 2.5 µs latency
  QDR (2008-2010): 40 Gb/s, 1.3 µs latency
  FDR (2011-2013): 56 Gb/s, 0.7 µs latency
  EDR (2014): 100 Gb/s, 0.5 µs latency

TCP Applications
- Server: socket, bind, listen, accept; Client: socket, connect
- Execution loop on both sides: epoll, read(buf), write(buf)
- Each read/write copies bytes between kernel and user space
- All of these are system calls (a minimal server-side sketch follows)
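A minimal sketch of the server-side pattern the slide describes (socket/bind/listen, then an epoll loop of accept/read/write); error handling is omitted and the port number is illustrative.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  // socket / bind / listen: all system calls.
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(9000);  // illustrative port
  bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listener, 16);

  int epfd = epoll_create1(0);
  epoll_event ev{};
  ev.events = EPOLLIN;
  ev.data.fd = listener;
  epoll_ctl(epfd, EPOLL_CTL_ADD, listener, &ev);

  char buf[4096];
  while (true) {
    epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, -1);  // system call
    for (int i = 0; i < n; ++i) {
      int fd = events[i].data.fd;
      if (fd == listener) {  // new connection
        int conn = accept(listener, nullptr, nullptr);
        epoll_event cev{};
        cev.events = EPOLLIN;
        cev.data.fd = conn;
        epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
      } else {
        // read/write copy bytes between kernel and user space.
        ssize_t got = read(fd, buf, sizeof(buf));
        if (got <= 0) { close(fd); continue; }
        write(fd, buf, static_cast<size_t>(got));  // echo back
      }
    }
  }
}
```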

TCP Applications: Overheads
- Shared resources, including file descriptors
- Data locality is not preserved
- Many system calls
- Buffer copying between user space and kernel
- (chart: CPU usage breakdown of a web server, Lighttpd, serving a 64-byte file on Linux 3.10)
- Solutions based on the Data Plane Development Kit (DPDK) improve TCP stack performance, e.g. a fast user-level TCP stack on DPDK: https://dpdksummit.com/Archive/pdf/2016Asia/DPDK-ChinaAsiaPacificSummit2016-Park-FastUser.pdf

InfiniBand Transport Modes
- Wide range of capabilities
- Datagram modes can send/receive to any queue pair (similar to UDP); unreliable modes can lose data
- (table: the operations Send, Receive, RDMA Write, RDMA Read and Atomic against the Unreliable Datagram, Unreliable Connection, Reliable Connection and Reliable Datagram transports; Reliable Connection supports all of the listed operations)
- We chose Send/Receive with Reliable Connection

InfiniBand Applications: Programming Model
- Initialization (on both the local and remote side): pin the memory used by the network, establish a receive and transmit queue pair with completion queues, and post the receive buffers to the receive queue
- Execution loop: poll the completion queue for request completion and post send buffers to the transmit queue
- All of these are user-level functions, in contrast to the system calls on the TCP path (a verbs-level sketch follows)
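A minimal sketch of the steps above using the ibverbs C API from C++: register (pin) the buffers, post a receive, post a send, and poll the completion queue. Queue pair creation and connection setup are omitted, buffer sizes are illustrative, and this is a generic verbs sketch rather than Heron's actual code, which goes through libfabric (see below).

```cpp
#include <infiniband/verbs.h>
#include <cstdint>
#include <vector>

// Assumes pd (protection domain), qp (a connected RC queue pair) and cq
// (completion queue) were created during connection setup, which is omitted.
void post_buffers(ibv_pd* pd, ibv_qp* qp, ibv_cq* cq) {
  // Pin (register) the memory the NIC will read and write.
  std::vector<char> recv_buf(4096), send_buf(4096);
  ibv_mr* recv_mr = ibv_reg_mr(pd, recv_buf.data(), recv_buf.size(),
                               IBV_ACCESS_LOCAL_WRITE);
  ibv_mr* send_mr = ibv_reg_mr(pd, send_buf.data(), send_buf.size(), 0);

  // Post a receive buffer before the peer sends.
  ibv_sge recv_sge{};
  recv_sge.addr = reinterpret_cast<uint64_t>(recv_buf.data());
  recv_sge.length = static_cast<uint32_t>(recv_buf.size());
  recv_sge.lkey = recv_mr->lkey;
  ibv_recv_wr recv_wr{}, *bad_recv = nullptr;
  recv_wr.wr_id = 1;
  recv_wr.sg_list = &recv_sge;
  recv_wr.num_sge = 1;
  ibv_post_recv(qp, &recv_wr, &bad_recv);

  // Post a send of the (already filled) send buffer.
  ibv_sge send_sge{};
  send_sge.addr = reinterpret_cast<uint64_t>(send_buf.data());
  send_sge.length = static_cast<uint32_t>(send_buf.size());
  send_sge.lkey = send_mr->lkey;
  ibv_send_wr send_wr{}, *bad_send = nullptr;
  send_wr.wr_id = 2;
  send_wr.sg_list = &send_sge;
  send_wr.num_sge = 1;
  send_wr.opcode = IBV_WR_SEND;
  send_wr.send_flags = IBV_SEND_SIGNALED;
  ibv_post_send(qp, &send_wr, &bad_send);

  // Execution loop: poll the completion queue (user-level, no system call).
  ibv_wc wc;
  while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin */ }
}
```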

InfiniBand Applications: Practical Constraints
- Receive buffers need to be posted before a send can complete, so application-level (credit-based) flow control is required (a credit-counting sketch follows)
- The buffers are of fixed size
- The buffer memory needs to be registered before it can be used
- Completion order has to be handled
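A minimal sketch of the credit-based flow control idea: the sender only transmits while it holds credits, and each credit corresponds to a fixed-size receive buffer the peer has posted. The class and field names are hypothetical, not Heron's.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical credit accounting for one peer. One credit == one fixed-size
// receive buffer the peer has posted; sending without a credit could overrun
// the peer's receive queue.
class CreditFlowControl {
 public:
  explicit CreditFlowControl(uint32_t initial_credits)
      : credits_(initial_credits) {}

  // Called when the application wants to send a message.
  bool TrySend(std::vector<uint8_t> message) {
    if (credits_ == 0) {            // no posted buffer at the peer: queue it
      pending_.push(std::move(message));
      return false;
    }
    --credits_;                     // consume one posted receive buffer
    // ... post the send to the transmit queue here ...
    return true;
  }

  // Called when the peer signals that it re-posted receive buffers
  // (credits are typically piggybacked on messages flowing the other way).
  void AddCredits(uint32_t granted) {
    credits_ += granted;
    while (credits_ > 0 && !pending_.empty()) {
      --credits_;
      // ... post pending_.front() to the transmit queue here ...
      pending_.pop();
    }
  }

 private:
  uint32_t credits_;
  std::queue<std::vector<uint8_t>> pending_;
};
```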

Heron InfiniBand Integration
- Libfabric is used for programming the interconnect
- (diagrams: Heron architecture and the Heron interconnect integration)
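A minimal sketch of how an application can ask libfabric for a connection-oriented endpoint (the libfabric analogue of InfiniBand Send/Receive over a Reliable Connection). The API version, provider selection and error handling are illustrative, and this is not Heron's actual integration code.

```cpp
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <cstdio>

int main() {
  // Describe what we need: reliable, connection-oriented messaging.
  fi_info* hints = fi_allocinfo();
  hints->ep_attr->type = FI_EP_MSG;   // connection-oriented endpoint
  hints->caps = FI_MSG;               // send/receive messaging

  fi_info* info = nullptr;
  if (fi_getinfo(FI_VERSION(1, 5), nullptr, nullptr, 0, hints, &info)) {
    std::fprintf(stderr, "no matching fabric provider found\n");
    return 1;
  }

  // Open the fabric, a domain on it, and an endpoint; connection
  // management (fi_connect/fi_accept) and queue setup are omitted here.
  fid_fabric* fabric = nullptr;
  fid_domain* domain = nullptr;
  fid_ep* ep = nullptr;
  fi_fabric(info->fabric_attr, &fabric, nullptr);
  fi_domain(fabric, info, &domain, nullptr);
  fi_endpoint(domain, info, &ep, nullptr);

  std::printf("selected provider: %s\n", info->fabric_attr->prov_name);

  fi_freeinfo(info);
  fi_freeinfo(hints);
  return 0;
}
```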

InfiniBand Integration
- Bootstrapping the communication: an out-of-band protocol is needed to establish the connection; TCP/IP is used to transfer the bootstrap information
- Buffer management: a fixed buffer pool for both sending and receiving; messages are broken up manually if the buffer size is smaller than the message size (see the fragmentation sketch below)
- Flow control: application-level flow control using a standard credit-based approach
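A minimal sketch of the manual fragmentation described above: a message larger than the fixed buffer size is split into chunks that each fit one registered buffer, with a small header so the receiver can reassemble them. The header layout and function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-fragment header so the receiver can reassemble a message
// that was split across several fixed-size registered buffers.
struct FragmentHeader {
  uint64_t message_id;   // identifies the original message
  uint32_t fragment_no;  // position of this fragment
  uint32_t total_frags;  // how many fragments to expect
};

// Split `message` into fragments whose header + payload fit into buffers of
// `buffer_size` bytes.
std::vector<std::vector<uint8_t>> Fragment(const std::vector<uint8_t>& message,
                                           uint64_t message_id,
                                           size_t buffer_size) {
  const size_t payload = buffer_size - sizeof(FragmentHeader);
  const uint32_t total =
      static_cast<uint32_t>((message.size() + payload - 1) / payload);

  std::vector<std::vector<uint8_t>> fragments;
  for (uint32_t i = 0; i < total; ++i) {
    const size_t offset = static_cast<size_t>(i) * payload;
    const size_t len = std::min(payload, message.size() - offset);

    std::vector<uint8_t> frag(sizeof(FragmentHeader) + len);
    FragmentHeader header{message_id, i, total};
    std::memcpy(frag.data(), &header, sizeof(header));
    std::memcpy(frag.data() + sizeof(header), message.data() + offset, len);
    fragments.push_back(std::move(frag));
  }
  return fragments;
}
```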

Experiment Topologies
- Topology A: a long topology with 8 stages
- Topology B: a shallow topology with 2 stages
- Haswell cluster: Intel Xeon E5-2670 at 2.30 GHz, 24 cores (2 sockets x 12 cores each), 128 GB of main memory, 56 Gbps InfiniBand and a 1 Gbps dedicated Ethernet connection

Latency
Latency of Topology B with 32 parallel bolt instances and a varying number of spouts and message sizes. (a) and (b) are with 16 spouts; (c) and (d) are with 128 KB and 128-byte messages. The results are on the Haswell cluster with IB.

Latency
Latency of Topology A with 1 spout and 7 bolt instances arranged in a chain, with varying parallelism and message sizes. (a) and (b) are with parallelism 2; (c) and (d) are with 128 KB and 128-byte messages. The results are on the Haswell cluster with IB.

Throughput
Throughput of Topology B with 32 bolt instances and varying message sizes and numbers of spout instances. The message size varies from 16 KB to 512 KB, and the number of spouts from 8 to 32. The experiments are conducted on the Haswell cluster.

Message Serialization Overhead (charts for Topology B: total time to finish messages vs. total time to serialize messages)

Future Improvements
- Get latency below a quarter of a millisecond
- Avoid message serialization costs
- Streaming message pass-through at the stream managers: a message can be forwarded while it is still being received
- Shared-memory data transfer between instances and the stream manager: the stream manager can use the shared memory directly for communication

Curious to Learn More?
- Paper: http://dsc.soic.indiana.edu/publications/Heron_Infiniband.pdf
- UCC 2017: http://www.depts.ttu.edu/cac/conferences/ucc2017/


Thank You