Tuning DiFX2 for performance
Adam Deller (ASTRON)
6th DiFX workshop, CSIRO ATNF, Sydney, Australia



Adam Deller, 6th DiFX workshop, CSIRO ATNF

Outline
- I/O bottlenecks and solutions
  - Communication with the real world (reading raw data, writing visibilities)
  - Interprocess communication
- Keeping out of memory trouble
- Minimizing CPU load in various corners of parameter space
For more information and pictures:

Getting data into DiFX
[Architecture diagram: the FxManager (master node) sends timerange/destination instructions to DataStream nodes 1..N and Core nodes 1..M; each DataStream holds source data in a large, segmented ring buffer and sends baseband data to the Cores, which correlate it in a processing buffer and return visibilities to the manager's visibility buffer.]

Getting data into DiFX
- How to test? neutered_difx, with a small number of channels
- Fundamental limit: native transfer speed (disk read, network pipe)
  - If this is the problem, buy a RAID or get InfiniBand, etc.
- Potential troublemaker: CPU utilisation on the datastream node (competition)
  - Can come from Tsys estimation
- Tweaking: datastream databuffer

Datastream databuffer
- Key parameters: dataBufferFactor, nDataSegments, subintNS ("subint" length)
- Only real potential problem I/O-wise: buffer too short (dataBufferFactor)
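For reference, these knobs live in the vex2difx .v2d file: dataBufferFactor and nDataSegments are global parameters, while subintNS sits in a SETUP block. A sketch of the syntax (the setup name and all values here are illustrative, not recommendations):

```
# Global .v2d parameters (illustrative values)
dataBufferFactor = 64    # ring-buffer length in units of the send size
nDataSegments = 16       # number of segments the ring buffer is divided into

SETUP normalSetup
{
  subintNS = 20000000    # subintegration length in nanoseconds (20 ms)
}
```

Check the vex2difx documentation for the defaults in your DiFX version before changing these.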

Getting visibilities out of DiFX
[Architecture diagram as before, now showing the FxManager writing the visibility buffer to disk.]

Getting visibilities out of DiFX
- FxManager writes the visibilities to disk
- This is very rarely a problem unless you have a dying disk, or very large and/or frequent visibility dumps
- Testing: neutered_difx + fake data source (ensures good input speeds)
- Tweaking: none
  - If you want to write out visibilities faster, put a fast disk (probably a RAID) on the manager node!

The Datastream
[Architecture diagram as before.]

The Datastream
- Generally not a problem
- Tweaking: dataBufferFactor; ensure a reasonable size (avoids latency issues)
- The default (32) is generally OK, but could usually be bigger without problems (increase nDataSegments also)

The Core
[Architecture diagram as before.]
- Tweaking: subintNS
- Output visibility size (nChan / nBaselines)


The Core
- In terms of reducing data transmission, increasing subintNS is the only real knob to turn
- Unimportant for continuum, single-phase-centre work; it is only at very high spectral resolution and/or with multiple phase centres that this is relevant
- In those cases, bigger is better, but be careful about memory (later)

The FxManager
[Architecture diagram as before.]
- The most common trouble point! Must aggregate data from all Core nodes, which can lead to high data rates


The FxManager
- To calculate the rate into the FxManager, work out the rate for one Core node and scale
- Tweaking: maximise subintNS! Or (although this is usually not possible) reduce the visibility size (via nChan or the number of phase centres)
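The scaling argument above can be sketched numerically. A minimal back-of-envelope estimate, assuming each Core returns one complete visibility dump per subintegration and 8-byte complex visibilities (the array sizes in the example are made up for illustration, not DiFX defaults):

```python
def fxmanager_rate_mbps(n_baselines, n_chan, n_phase_centres, n_pol_products,
                        subint_ns, n_cores, bytes_per_vis=8):
    """Rough aggregate data rate into the FxManager, in Mbit/s.

    Assumes each Core sends one full visibility dump per subint, so the
    per-Core rate is (visibility size / subint duration); the manager
    sees that rate multiplied by the number of Core nodes.
    """
    vis_bytes = (n_baselines * n_chan * n_phase_centres
                 * n_pol_products * bytes_per_vis)
    subint_s = subint_ns * 1e-9
    return vis_bytes * 8 / subint_s * n_cores / 1e6

# Example: 45 baselines (10 stations), 4096 channels, 1 phase centre,
# 4 polarisation products, 20 ms subints, 10 Core nodes.
rate = fxmanager_rate_mbps(45, 4096, 1, 4, 20_000_000, 10)
```

Note that doubling subintNS halves the rate into the manager, which is why the slide says to maximise it.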

Adam Deller 6th DiFX workshop, CSIRO ATNF the Datastream Just don’t make the combination of dataBufferFactor and subintNS too big (can also control via “sendSize”)


The Core
- Usually the biggest problem, memory-wise
- Never used to be a problem, but multi-field-centre jobs hit hard
- Bigger subint means more memory (storing datastream baseband)
- More threads means more memory, at the pre-average spectral resolution
- Buffering more FFTs costs more (times the number of threads, too!)

The Core
- Tweaking: subintNS, nThreads (threads file), numBufferedFFTs
- And be aware of: nFFTChans (for multiple phase centres / high spectral resolution), number of phase centres
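The memory scalings above can be combined into a crude model, purely for intuition (the real DiFX allocation is more involved; every term here is an assumption based on the slide, not the actual code):

```python
def core_memory_bytes(subint_bytes, n_datastreams, n_threads,
                      num_buffered_ffts, n_fft_chans, bytes_per_value=8):
    """Back-of-envelope Core memory model, following the slide:
    - baseband storage grows with subint size (times datastreams),
    - each thread buffers numBufferedFFTs FFT results per station at
      the pre-average spectral resolution (nFFTChans complex values).
    Illustrative only; not the exact mpifxcorr allocation.
    """
    baseband = subint_bytes * n_datastreams
    fft_buffers = (n_threads * num_buffered_ffts * n_datastreams
                   * n_fft_chans * bytes_per_value)
    return baseband + fft_buffers
```

The point of the model: subintNS, nThreads, numBufferedFFTs and nFFTChans all multiply into the footprint, so raising several at once (as multi-phase-centre jobs tend to require) is what "hits hard".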


The FxManager
- Tweaking: visBufferLength
- Multiplies the size of a single visibility (nChan, nBaselines, nPhaseCentres)
- Generally not a problem
- Note: visBufferLength should not be too short, especially if you have many (esp. heterogeneous) Core nodes, as the subints can come in out of order
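If it does need changing, visBufferLength is a global parameter in the .v2d file (the value below is illustrative, not a recommendation):

```
# Global .v2d parameter (illustrative value)
visBufferLength = 80   # number of visibility-buffer slots held by the FxManager
```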

The Datastream
- Loading of the Datastream is usually pretty light
- But the Datastream often runs on old hardware (e.g. Mk5 units) with limited CPU capacity
- A couple of options can cause problematically high loads:
  - Tsys extraction (.v2d: tcalFreq = xx)
  - Interlaced VDIF formats (used with multi-thread VDIF data, e.g. the phased EVLA)
- More efficient implementations are coming; for now, buy a faster CPU if needed!
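The tcalFreq knob mentioned above is set per antenna in the .v2d file; a sketch of the syntax (the antenna name and value are illustrative, check the vex2difx documentation for your setup):

```
ANTENNA BR
{
  # Switching frequency (Hz) of the noise cal, enabling Tsys extraction;
  # this is the option that can load down an old datastream node
  tcalFreq = 80
}
```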

The Core
- Many considerations here, including parameters usually fixed by the science:
  - Number of phase centres
  - Spectral resolution (nChan/nFFTChans)
- Plus several on array management:
  - strideLength
  - numBufferedFFTs
  - xmacLength
- And then a few others as well:
  - nThreads
  - Fringe rotation order


The Core
- Number of phase centres
- For each phase centre: phase rotation, and separate accumulation from thread buffer to main buffer
- That costs CPU (proportional to the number of baselines and the number of phase centres), but also ensures that results don't fit in cache (more later)

The Core
- Spectral resolution
- More channels means a bigger FFT, and that costs CPU
- Doesn't typically follow an N log N law like it should; bigger gets worse fast beyond ~1024 channels due to cache performance
- Really big (>= 8192 channels/subband) gets very expensive
- Worst thing: it typically comes in combination with multiple phase centres! (required to avoid bandwidth smearing)

The Core
- Array management #1: strideLength (auto setting usually best)
- [Figure: fringe-rotation phase ramp from -180° to 180° across one FFT of data; sin/cos is evaluated for the first strideLength samples, and for every strideLength'th sample after that.]

The Core
- Array management #2: numBufferedFFTs (auto = 10 usually OK)
- Mitigates the cache-miss problem by a factor of 10
- [Figure: the visibility buffer (slots Mode 1 ... Mode N) is too big for cache, but one slot fits; precompute numBufferedFFTs FFT results, one station at a time.]

The Core
- Array management #3: xmacLength (auto setting of 128 usually fine; further subdivides the XMAC step)
- [Figure: same cache diagram as the previous slide.]

The Core
- nThreads
- Usually, set nThreads = n(CPU cores) - 1
- Occasionally it can be advantageous to use fewer threads (avoiding swap memory / cache contention)
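The threads file referenced earlier pairs with the machines file, with one entry per Core node. A sketch of its layout, from memory of the vex2difx/genmachines output (verify against the files your own run generates before editing by hand):

```
NUMBER OF CORES:    3
7
7
15
```

Here the illustrative cluster has two 8-core nodes and one 16-core node, each leaving one core free per the rule above.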

The Core
- Fringe rotation order
- Default is 1, and this is almost always fine
- 2nd order is only ever needed for very high fringe rates with very long FFTs (space VLBI of masers?)
- BUT: 0th order could often be used, and almost never is: it can be about 25% faster
- [Figure: fringe-rotation phase vs. time across the 1st and 2nd FFTs. When the fringe rate is too high, 0th order fails; at low fringe rate, the 0th-order (constant phase per FFT) approximation can be acceptable.]
- .v2d: fringeRotOrder = [0, 1, 2]
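As the last bullet indicates, this is set in a SETUP block of the .v2d file; a sketch (setup name illustrative):

```
SETUP lowFringeRate
{
  # 0 = constant phase per FFT: ~25% faster, acceptable at low fringe rates
  fringeRotOrder = 0
}
```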

The FxManager
- CPU load at the FxManager is typically light; it only does low-cadence accumulation and scaling of visibilities
- A very short subintNS can potentially lead to problems (although network issues are more likely)

Questions?