Performance measurement with ZeroMQ and FairMQ

Performance measurement with ZeroMQ and FairMQ Mohammad Al-Turany 20/02/15 CWG13 Meeting

ZeroMQ performance test suite
ZeroMQ delivers some tools to measure the bandwidth and latency of the network. The following executables are built by default and located in the perf subdirectory: local_lat, remote_lat, local_thr, remote_thr.
20/02/15 CWG13 Meeting

ØMQ performance test suite: Latency Test
The latency test consists of local_lat and remote_lat. These are to be placed on the two boxes between which you wish to measure the latency. We did not perform this test up to now!
$ local_lat tcp://eth0:5555 1 100000
$ remote_lat tcp://192.168.0.111:5555 1 100000
message size: 1 [B]
roundtrip count: 100000
average latency: 30.915 [us]
The latency reported is the one-way latency.
20/02/15 CWG13 Meeting
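The one-way figure is the measured roundtrip time divided by twice the roundtrip count. A rough libzmq sketch of that measurement (not the actual perf-tool source; the REQ/REP socket choice and the error-free happy path are assumptions, the endpoint and counts are the ones from the example above):

    #include <zmq.h>
    #include <chrono>
    #include <cstdio>

    int main() {
        const int roundtrips = 100000;                 // roundtrip count from the example
        char buf[1] = {0};                             // 1-byte message, as above

        void* ctx  = zmq_ctx_new();
        void* sock = zmq_socket(ctx, ZMQ_REQ);         // peer runs a ZMQ_REP echo loop
        zmq_connect(sock, "tcp://192.168.0.111:5555");

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < roundtrips; ++i) {
            zmq_send(sock, buf, sizeof(buf), 0);       // ping
            zmq_recv(sock, buf, sizeof(buf), 0);       // pong, echoed back by the peer
        }
        auto stop = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(stop - start).count();
        printf("average latency: %.3f [us]\n", us / (2.0 * roundtrips));  // one-way

        zmq_close(sock);
        zmq_ctx_term(ctx);
    }

The peer side would bind a ZMQ_REP socket on port 5555 and simply echo every message back.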

ØMQ performance test suite: Throughput Test
The throughput test consists of local_thr and remote_thr. These are to be placed on the two boxes between which you wish to measure the throughput.
$ local_thr tcp://eth0:5555 1 100000
$ remote_thr tcp://192.168.0.111:5555 1 100000
message size: 1 [B]
message count: 1000000
mean throughput: 5554568 [msg/s]
mean throughput: 44.437 [Mb/s]
20/02/15 CWG13 Meeting
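The two reported throughput figures are related by a simple unit conversion: throughput [Mb/s] = throughput [msg/s] × message size [B] × 8 / 10^6, so 5,554,568 msg/s × 1 B × 8 / 10^6 ≈ 44.4 Mb/s, matching the 44.437 Mb/s above.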

Running the ZeroMQ performance test on the DAQ test cluster 20/02/15 CWG13 Meeting

Performance test with FairMQ, FLP → EPN (aidrefma02, aidrefma01): Push-Pull pattern, message size = 10 MByte, throughput = 2.6 GByte/s. 20/02/15 CWG13 Meeting

Performance test with FairMQ, FLP → EPN (aidrefma02, aidrefma01): Push-Pull pattern, message size = 10 MByte, throughput = 3.7 GByte/s. 20/02/15 CWG13 Meeting

Performance test with FairMQ, FLP → EPN (aidrefma03, aidrefma01): Push-Pull pattern, message size = 10 MByte, throughput = 4.8 GByte/s. 20/02/15 CWG13 Meeting
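FairMQ devices wrap ZeroMQ sockets, so the push-pull transfer above boils down to a ZMQ_PUSH socket on the sender and a ZMQ_PULL socket on the receiver. A rough libzmq sketch of the sending side (not the actual FairMQ device code; the endpoint and message count are placeholders, only the 10 MByte message size is taken from the test):

    #include <zmq.h>
    #include <cstring>

    int main() {
        const size_t msg_size  = 10u * 1000 * 1000;   // 10 MByte messages, as in the test
        const int    msg_count = 1000;                // placeholder

        void* ctx  = zmq_ctx_new();
        void* push = zmq_socket(ctx, ZMQ_PUSH);       // sender side of the push-pull pair
        zmq_bind(push, "tcp://*:5555");               // endpoint is an assumption

        for (int i = 0; i < msg_count; ++i) {
            zmq_msg_t msg;
            zmq_msg_init_size(&msg, msg_size);        // allocate the payload
            memset(zmq_msg_data(&msg), 0, msg_size);  // dummy content
            zmq_msg_send(&msg, push, 0);              // ownership passes to ZeroMQ
        }

        zmq_close(push);
        zmq_ctx_term(ctx);
    }

The receiver would create a ZMQ_PULL socket, zmq_connect to the sender, and divide the bytes received by the elapsed time to obtain GByte/s.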

Is a node that uses 3 (4) cores to receive data via Ethernet or IP over InfiniBand at a rate of more than 4 GByte/s still usable for reconstruction? 20/02/15 CWG13 Meeting

STREAM: Sustainable Memory Bandwidth in High Performance Computers
A simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. It is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector-style applications. http://www.cs.virginia.edu/stream/
20/02/15 CWG13 Meeting
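The STREAM kernels themselves are just simple array loops. A stripped-down sketch of the Triad kernel (illustrative only: the real benchmark uses statically allocated arrays, OpenMP threads and repeats each kernel, and the scalar value here is arbitrary):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t N = 200000000;                   // 200M elements, as on the settings slide
        const double q = 3.0;                         // scalar (arbitrary here)
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);

        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < N; ++i)
            a[i] = b[i] + q * c[i];                   // TRIAD: 24 bytes and 2 FLOPs per element
        auto t1 = std::chrono::steady_clock::now();

        double s = std::chrono::duration<double>(t1 - t0).count();
        printf("Triad: %.1f MB/s\n", 24.0 * N / s / 1.0e6);
    }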

Stream Settings
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 200000000 (elements), Offset = 0 (elements)
Memory per array = 1525.9 MiB (= 1.5 GiB).
Total memory required = 4577.6 MiB (= 4.5 GiB).
Each kernel will be executed 10 times. The *best* time for each kernel (excluding the first iteration) will be used to compute the reported bandwidth.
Number of Threads requested = 12
Number of Threads counted = 12
20/02/15 CWG13 Meeting
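These sizes are consistent: 200,000,000 elements × 8 B = 1,600,000,000 B ≈ 1525.9 MiB per array, and three arrays (a, b, c) give ≈ 4577.6 MiB in total, far larger than any cache level on this machine.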

STREAM is intended to measure the bandwidth from main memory 20/02/15 CWG13 Meeting

Performance and bandwidth test with FairMQ, FLP → EPN (aidrefma02)
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores (32 threads), 64 GB RAM
Function    Best Rate MB/s   Avg time    Min time    Max time
Copy:       15258.3          0.017153    0.010486    0.025462
Scale:      15019.2          0.017180    0.010653    0.025397
Add:        16883.6          0.021488    0.014215    0.036001
Triad:      16831.6          0.021190    0.014259    0.035066
--------------------------------------------------------------
name     kernel                   bytes/iter   FLOPS/iter
COPY:    a(i) = b(i)              16           0
SCALE:   a(i) = q*b(i)            16           1
SUM:     a(i) = b(i) + c(i)       24           1
TRIAD:   a(i) = b(i) + q*c(i)     24           2
--------------------------------------------------------------
CWG13 Meeting 20/02/15

Performance and bandwidth test with FairMQ, FLP → EPN (aidrefma01 FLP, aidrefma02 EPN)
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores (32 threads), 64 GB RAM
8 MB messages, 4.7 GByte/s; STREAM results with the transfer running:
Function    Best Rate MB/s
Copy:       12782.6
Scale:      12319.0
Add:        14210.4
Triad:      14317.3
Reduction vs. the standalone run: -16 %, -18 %, -15 %
CWG13 Meeting 20/02/15
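The reductions follow directly from comparing the two STREAM tables: Copy 12782.6/15258.3 ≈ 0.84 (about -16 %), Scale 12319.0/15019.2 ≈ 0.82 (-18 %), Add 14210.4/16883.6 ≈ 0.84 (-16 %), Triad 14317.3/16831.6 ≈ 0.85 (-15 %).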

Performance and bandwidth test with FairMQ, FLP → EPN (aidrefma01 FLP, aidrefma02 EPN)
CPU time in seconds needed to simulate 1000 events, 10 protons, in FairRoot example 3; 12 Geant processes running alongside the transfer
4 MB messages: 4.5 GByte/s; 8 MB messages: 4.7 GByte/s
Mean CPU time per process: without MQ 57.3 s, with 4 MB messages 62.1 s, with 8 MB messages 61.2 s (overhead: 5 % / 4 %)
Individual process times (s): 54 61 68 58 64 66 62 56 57 55 63 67 60 65
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores (32 threads), 64 GB RAM
CWG13 Meeting 20/02/15

Performance and bandwidth test with FairMQ, FLP → EPN (aidrefma01 FLP, aidrefma02 EPN)
CPU time in seconds needed to simulate 1000 events, 100 protons, in FairRoot example 3; 12 Geant processes running alongside the transfer
8 MB messages at 4.7 GByte/s, 2.8 TByte total data transferred
Mean CPU time per process: without MQ 570.2 s, with 8 MB messages 605.6 s (6 % overhead)
Individual process times (s): 565 605 573 615 570 598 603 602 563 601 619 576 616 574 606 567 609 577 595
CERN DAQ Lab system: 40 G Ethernet, dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores (32 threads), 64 GB RAM
CWG13 Meeting 20/02/15
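As a cross-check on the quoted numbers: (605.6 − 570.2) / 570.2 ≈ 6.2 %, consistent with the 6 % overhead, and 2.8 TByte at 4.7 GByte/s corresponds to roughly 600 s of transfer, which is about the duration of the simulation run.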

Backup and Discussion 20/02/15 CWG13 Meeting

Run on STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration) will be used to compute the reported bandwidth.
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21173 microseconds. (= 21173 clock ticks)
Increase the size of the arrays if this shows that you are not getting at least 20 clock ticks per test.
WARNING -- The above is only a rough guideline. For best results, please be sure you know the precision of your system timer.
Function    Best Rate MB/s   Avg time    Min time    Max time
Copy:       15258.3          0.017153    0.010486    0.025462
Scale:      15019.2          0.017180    0.010653    0.025397
Add:        16883.6          0.021488    0.014215    0.036001
Triad:      16831.6          0.021190    0.014259    0.035066
20/02/15 CWG13 Meeting
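As a consistency check using only the numbers above: STREAM reports the bytes moved per kernel iteration divided by the best time, so for Copy 16 B/element × 10,000,000 elements / 0.010486 s ≈ 15,258 MB/s, which is the figure in the table.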