Experiments with Pure Parallelism Hank Childs, Dave Pugmire, Sean Ahern, Brad Whitlock, Mark Howison, Prabhat, Gunther Weber, & Wes Bethel April 13, 2010.

The landscape: how tools process data
[Diagram] Pieces of data (P0–P9, on disk) are produced by a parallel simulation code; each processor of a parallel visualization program then reads, processes, and renders its own pieces.
This technique is called "pure parallelism".

Pure parallelism
Pure parallelism is data-level parallelism, but…
– Multi-resolution can be data-level parallelism
– Out-of-core can be data-level parallelism
Pure parallelism: "brute force" … processing full-resolution data using data-level parallelism
Pros:
– Easy to implement
Cons:
– Requires large I/O capabilities
– Requires a large amount of primary memory
– → requires big machines

Research Questions
Is it possible/feasible to run production-quality visual data analysis software on large machines and on large data sets?
– Are the tools we use right now ready for tomorrow's data?
What obstacles/bottlenecks do we encounter with massive data?

Experiment methodology
Preprocess step: generate a large data set
– Read it
– Contour
– Render (1024x1024)
Synthetic data:
– Wanted to look at tomorrow's data; it is not available yet
– Synthetic data should be a reasonable surrogate for real data.
Visualization of 1 trillion cells, visualized with VisIt on Franklin using 16,000 cores.

Experiment methodology, continued
Only used pure parallelism
– This experiment was about testing the limits of pure parallelism
– Purposely did not use in situ, multi-resolution, out-of-core, or data subsetting
Pure parallelism is what the production visualization tools use right now (*).

Volume rendering
Ran into problems with volume rendering.
– See Dave's talk.
Problem eventually fixed, but not in time for the study
– Runs on these big machines are opportunistic, and it's hard to get a second chance
– Approximately five seconds per render
Contouring exercises much of the infrastructure (read, process, render)
Visualization of 2 trillion cells, visualized with VisIt on JaguarPF using 32,000 cores.

Experiment methodology, continued
Three basic variations:
– Vary over supercomputing environment
– Vary over data generation
– Vary over I/O pattern

Varying over supercomputer environment
Goals:
– Ensure results aren't tied to a single machine.
– Understand differences across architectures.
Experiment details:
– 1 trillion cells per 16,000 cores
– 10*NCores "brick-of-float" files, gzipped
– Upsampled data
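To make the read pattern concrete, here is a minimal sketch of how one MPI rank might read its share of gzipped "brick-of-float" files (10 per core, as in the experiment). The file naming scheme and brick dimensions are hypothetical, not taken from the study.

// Sketch of the assumed per-rank read pattern: 10 gzipped brick-of-float files
// per core. File names and brick size below are hypothetical.
#include <mpi.h>
#include <zlib.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int filesPerRank = 10;
    const size_t floatsPerBrick = 64 * 64 * 64;   // hypothetical brick size
    std::vector<float> brick(floatsPerBrick);

    for (int i = 0; i < filesPerRank; ++i)
    {
        char name[256];
        std::snprintf(name, sizeof(name), "brick_%06d_%02d.gz", rank, i);
        gzFile gz = gzopen(name, "rb");
        if (!gz)
            continue;
        // gzread transparently decompresses the brick-of-float payload.
        gzread(gz, brick.data(), (unsigned)(brick.size() * sizeof(float)));
        gzclose(gz);
        // ... hand the brick to the contouring pipeline ...
    }

    MPI_Finalize();
    return 0;
}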

[Per-machine results chart; annotations from the slide:]
– 7-10 network links failed and had to be statically re-routed
– BG/L has an 850MHz clock speed
– Lustre striping of 2 versus Lustre striping of 4

Varying over data generation pattern
Concern: does upsampling produce unrepresentatively smooth surfaces?
Alternative: replication
Visualization of 1 trillion cells, visualized with VisIt on Franklin using 16,000 cores.
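A small illustration of the difference between the two generation patterns: upsampling interpolates a coarse source field and so yields smoother data, while replication tiles the source and preserves its original feature density. The source field, sizes, and interpolation scheme below are illustrative only, not the ones used in the study.

// Contrast the two data-generation patterns: upsample (interpolate) vs. replicate (tile).
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const size_t srcN = 8, dstN = 64;           // illustrative sizes
    std::vector<double> src(srcN);
    for (size_t i = 0; i < srcN; ++i)
        src[i] = std::sin(0.5 * i);             // hypothetical coarse source field

    std::vector<double> upsampled(dstN), replicated(dstN);
    for (size_t i = 0; i < dstN; ++i)
    {
        // Upsampling: linear interpolation -> smooth, possibly unrepresentative surfaces.
        double x = double(i) * (srcN - 1) / (dstN - 1);
        size_t i0 = (size_t)x;
        size_t i1 = (i0 + 1 < srcN) ? i0 + 1 : i0;
        double t = x - i0;
        upsampled[i] = (1.0 - t) * src[i0] + t * src[i1];

        // Replication: tile the source -> same feature density as the original data.
        replicated[i] = src[i % srcN];
    }

    std::printf("upsampled[10]=%f replicated[10]=%f\n", upsampled[10], replicated[10]);
    return 0;
}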

Results from data generation test
Test on Franklin, using 16,000 cores with unzipped data
– Contouring time is the same, because the case where a triangle is generated is rare.
– Rendering time is different, because the replicated pattern has more geometry.

Varying over I/O pattern
Previous tests: uncoordinated I/O, doing 10 "fread"s per core.
Can collective communication help?
Franklin I/O maximum: 12GB/s
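One form that coordinated I/O could take is a collective MPI-IO read, sketched below. The shared file name and per-rank chunk size are hypothetical, and this is not necessarily the exact pattern the study tested; it only shows how every rank's request can be issued through one collective call that the MPI-IO layer is free to aggregate.

// Sketch: collective MPI-IO read instead of independent fread()s per core.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int floatsPerRank = 1 << 20;            // hypothetical chunk size
    std::vector<float> buf(floatsPerRank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "all_bricks.dat",   // hypothetical shared file
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    // Every rank participates in one collective call; the MPI-IO layer can
    // aggregate the requests rather than hitting the file system independently.
    MPI_Offset offset = (MPI_Offset)rank * floatsPerRank * sizeof(float);
    MPI_File_read_at_all(fh, offset, buf.data(), floatsPerRank,
                         MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}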

Pitfalls at scale
Volume rendering (see Dave's talk)
Startup time
– Loading plugins overwhelmed the file system
– Took ~5 minutes
– Solution #1: read plugin information on MPI task 0 and broadcast it (90% speedup)
– Solution #2: static linking
– Still need to demonstrate at scale
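A minimal sketch of Solution #1, assuming the plugin metadata can be serialized into a byte buffer: only MPI task 0 touches the file system, and everyone else receives the result via MPI_Bcast. The metadata file name is hypothetical.

// Sketch of "read once on task 0, broadcast to all" for plugin metadata.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<char> metadata;
    long nbytes = 0;

    if (rank == 0)
    {
        // Only one task reads, instead of all 16,000+ hammering the file system.
        if (FILE* f = std::fopen("plugin_info.xml", "rb"))   // hypothetical file
        {
            std::fseek(f, 0, SEEK_END);
            nbytes = std::ftell(f);
            std::fseek(f, 0, SEEK_SET);
            metadata.resize(nbytes);
            std::fread(metadata.data(), 1, nbytes, f);
            std::fclose(f);
        }
    }

    // Ship the size, then the contents, to every other task.
    MPI_Bcast(&nbytes, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        metadata.resize(nbytes);
    MPI_Bcast(metadata.data(), (int)nbytes, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}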

Pitfalls at scale #2: all-to-one communication
Each MPI task needs to report high-level information:
– Was there an error in execution for that task?
– Data extents? Spatial extents?
Previous implementation:
– Every MPI task sends a direct message to MPI task 0.
New implementation (Miller, LLNL):
– Tree communication
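A sketch of what the tree-structured replacement can look like: MPI's built-in reductions are implemented internally as trees, so combining the per-task error flag and extents takes O(log N) communication steps instead of N direct messages to task 0. The specific fields and values below are illustrative, not the VisIt implementation.

// Sketch: tree-based reductions instead of every task messaging task 0 directly.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Per-task status the slide mentions: error flag and extents.
    int localError = 0;                          // 1 if this task failed
    double localMin[3] = {0.0, 0.0, 0.0};        // illustrative spatial extents
    double localMax[3] = {1.0, 1.0, 1.0};

    int anyError = 0;
    double globalMin[3], globalMax[3];

    MPI_Reduce(&localError, &anyError, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(localMin, globalMin, 3, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(localMax, globalMax, 3, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("error=%d extent x: [%g, %g]\n",
                    anyError, globalMin[0], globalMax[0]);

    MPI_Finalize();
    return 0;
}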

Pitfalls at scale #3: reproducible results
Repeated debugging runs at scale are critical to resolving issues like these.

This study continued after the initial effort as a way to validate new machines.

Should more tools have been used?
– Could have performed this study with VisIt, ParaView, EnSight, etc.
– A successful test with VisIt validates pure parallelism.
– Of course, I/O is a big problem… but ParaView, EnSight, etc. are doing the same "fread"s.

Trends in I/O
Pure parallelism is almost always >50% I/O and sometimes 98% I/O
Amount of data to visualize is typically O(total memory)
[Chart: FLOPs, memory, and I/O for a terascale machine versus a "petascale machine"]
Two big factors:
① how much data you have to read
② how fast you can read it
→ Relative I/O (the ratio of total memory to I/O rate) is key

Anecdotal evidence: relative I/O really is getting slower.

Time to write main memory to disk:

Machine name        Main memory   I/O rate   Time to write memory to disk
ASC Purple          49.0 TB       140 GB/s    5.8 min
BGL-init            32.0 TB        24 GB/s   22.2 min
BGL-cur             69.0 TB        30 GB/s   38.3 min
Petascale machine   ?             ?          >40 min
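A quick sanity check of the table's last column (time = main memory / I/O rate, using 1 TB = 1000 GB, which matches the slide's figures):

// Recompute "time to write memory to disk" from the table's memory sizes and I/O rates.
#include <cstdio>

int main()
{
    struct Machine { const char* name; double memoryTB; double ioGBps; };
    const Machine machines[] = {
        {"ASC Purple", 49.0, 140.0},
        {"BGL-init",   32.0,  24.0},
        {"BGL-cur",    69.0,  30.0},
    };

    for (const Machine& m : machines)
    {
        double seconds = m.memoryTB * 1000.0 / m.ioGBps;   // TB -> GB, divided by GB/s
        std::printf("%-10s %5.1f min\n", m.name, seconds / 60.0);
    }
    return 0;   // prints ~5.8, ~22.2, ~38.3 minutes, as in the table
}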

Why is relative I/O getting slower?
"I/O doesn't pay the bills"
– And I/O is becoming a dominant cost in the overall supercomputer procurement.
Simulation codes aren't as exposed.
– And they will be more exposed with proposed future architectures.
We need to de-emphasize I/O in our visualization and analysis techniques.

Conclusions
Pure parallelism works, but it is only as good as the underlying I/O infrastructure
– The I/O future looks grim
– Positive indicator for in situ processing
Full results are available in the Computer Graphics & Applications special issue on Ultrascale Visualization.

Backup slides

Is the I/O we saw on the hero runs indicative of what we will see on future machines?
Two failure points:
– The number of I/O nodes will be dropping, especially with increasing numbers of cores per node, making the network to the I/O nodes the probable bottleneck
– Jaguar: 32K cores / 2T cells took 729s. If BW disk is 5X faster, that is still 145s.
– LLNL has a bunch of disk, and we couldn't get below two minutes because of contention
– Even if BW gets enough disk, disk is very expensive and future machines will likely not get enough.
– This is especially true if a FLASH solution takes hold.
[Diagram: CPUs, network to I/O nodes, OSTs]