Mark Mathis Texas A&M University Darren Kerbyson and Adolfy Hoisie

Presentation transcript:

A Performance Model of non-Deterministic Particle Transport on Large-Scale Systems Mark Mathis Texas A&M University Darren Kerbyson and Adolfy Hoisie Performance and Architectures Laboratory (PAL) Los Alamos National Laboratory Presented by: Kei Davis

Performance & Architectures Lab. Performance analysis portfolio at Los Alamos: benchmarking near-to-market advanced systems; large-scale simulation (Parsims); design of advanced systems; application-centric modeling. Developed models of many applications: deterministic transport (Sweep3D, Tycho); hydro code (SAGE); ocean simulation (POP); MCNP (described here). Models are being used in many ways: predicting performance prior to availability; comparing large-scale systems (e.g. ASCI Q vs. the Earth Simulator); in the procurement of ASCI Purple (expected to be a 100T system in 2004/5); during the installation of ASCI Q (just completed, 20T Alpha system).

Why Model Performance? Performance analysis is necessary to evaluate the impact of architectural evolution and innovation. Application modeling provides insight into the achievable performance on current systems, and allows exploration of expected performance improvements possible on future systems.

Need to have an expectation. Complex machines and software: single processors, interactions within nodes, interactions between nodes (communication networks), I/O. Large cost for development, deployment and maintenance. Need to know in advance what performance will be. [Figure: system life cycle. Stages: design, implementation, procurement, installation, maintenance. Annotations: lots of system choices; some measurement possible (small scale); what should we buy? (ASCI Purple); verification of ASCI Q performance; update of SW and/or HW.]

MCNP (Monte-Carlo N-Particle). General-purpose code that can be used for neutron, photon, electron, or coupled transport. It simulates individual particles (histories) and records aspects (tallies) of their average behavior. Sequentially, MCNP simulates the requested number of histories on a given input geometry and reports the requested output tallies. In parallel, MCNP copies the entire input geometry from a master to two or more slaves, and each slave simulates a different set of particles. In each iteration, or cycle, the master merges tallies from all slaves during a rendezvous. The complexity of the problem is constrained by the memory available at a single node, because the input geometry is copied to each PE. Hence, parallelism is used to solve the problem faster, rather than to solve a more complex problem in the same amount of time (strong scaling).
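The master/slave cycle described above can be sketched in a few lines of Python. This is only a toy illustration, not MCNP code: the names `split_histories`, `run_slave`, and `run_cycle` are hypothetical, the "physics" is a made-up exponential sample per history, and the slaves run one after another in a single process rather than in parallel.

```python
import random

def split_histories(nph, n_slaves):
    """Divide nph histories as evenly as possible among n_slaves."""
    base, extra = divmod(nph, n_slaves)
    return [base + (1 if i < extra else 0) for i in range(n_slaves)]

def run_slave(n_histories, seed):
    """Toy stand-in for a slave: tally one made-up quantity per history."""
    rng = random.Random(seed)  # each slave gets its own random stream
    total = sum(rng.expovariate(1.0) for _ in range(n_histories))
    return {"histories": n_histories, "sum": total}

def run_cycle(nph, p):
    """One cycle: the master hands out work, then merges the tallies."""
    shares = split_histories(nph, p - 1)  # P PEs = 1 master + (P-1) slaves
    tallies = [run_slave(n, seed) for seed, n in enumerate(shares)]
    merged = {
        "histories": sum(t["histories"] for t in tallies),
        "sum": sum(t["sum"] for t in tallies),
    }
    merged["mean"] = merged["sum"] / merged["histories"]
    return shares, merged

shares, merged = run_cycle(nph=1000, p=5)
print(shares, merged["histories"])
```

Because every slave would hold the full geometry, adding PEs in this scheme only shrinks each slave's share of histories; it does not make room for a larger problem.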

Example Experiment: Criticality. A “critical” system is one where exactly one of the neutrons produced in a fission reaction continues a chain reaction. Such a system has a neutron multiplication of one, or keff = 1. In a “subcritical” system, keff < 1, and the chain reaction will die away. If keff > 1, the system is “supercritical”. Such a system will produce large amounts of radiation and persistent radioactive contamination. MCNP can be used to simulate the neutron interactions for a given input geometry and calculate keff. An example input geometry consists of an insulating cylinder with rods of various types arranged in the middle.
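As a small worked illustration of the three regimes, the sketch below treats keff as the ratio of neutron counts in two successive generations and classifies the result. The function names and the tolerance `tol` are assumptions for illustration (a finite estimate is never exactly 1); this is not how MCNP itself estimates keff.

```python
def k_eff(neutrons_prev, neutrons_next):
    """Multiplication factor: next-generation count over previous-generation count."""
    return neutrons_next / neutrons_prev

def classify(k, tol=1e-3):
    """Label a system by its multiplication factor (tol is a hypothetical tolerance)."""
    if abs(k - 1.0) <= tol:
        return "critical"
    return "subcritical" if k < 1.0 else "supercritical"

for prev, nxt in [(1000, 950), (1000, 1000), (1000, 1100)]:
    k = k_eff(prev, nxt)
    print(prev, nxt, k, classify(k))
```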

Example Input Geometry. [Figure: vertical and horizontal cross-sections of the example input geometry.]

Parallel Activity in MCNP. [Figure: timeline of one MCNP cycle for the master and slaves 1 to P-1, divided into a scatter phase, a work phase, and a gather phase, with stages numbered 1 to 8.] To develop a model, an understanding of the key processing operations and their scaling behavior is required. The parallel activity for one cycle of MCNP is shown in the figure. An analytical model is obtained from this type of analysis.

Analysis of Parallel Activity

Stage  Source  Action  Quantity           Description
1      Master  bcast   P*8                particle range to be computed by slaves
2                      229240             update current history
3      Slave   work    Thist*Nph/(P-1)    Thist times the number of particle histories
4              pt2pt   5512               task common
5                      320                tally data
6                      204920             task array 1
7                      48*Nph/(P-1)       task array 2
8                      32                 timing data

Only the main activities are shown. Stages 1 and 2 correspond to the scatter phase, stage 3 is the work phase, and stages 4-8 correspond to the gather phase.
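The per-stage quantities can be evaluated for a concrete P and Nph with a short Python sketch. Several things here are assumptions rather than statements from the slide: the message quantities are read as bytes, the per-slave share Nph/(P-1) is rounded up to a whole number of histories, and the function name `stage_quantities` is hypothetical.

```python
import math

def stage_quantities(p, nph, t_hist):
    """Return the quantity for each stage of one cycle.

    Stages 1, 2 and 4-8 are message sizes (assumed to be bytes);
    stage 3 is the work time in the same unit as t_hist.
    """
    per_slave = math.ceil(nph / (p - 1))  # assumption: share rounded up
    return {
        1: p * 8,               # particle range to be computed by slaves
        2: 229240,              # update current history
        3: t_hist * per_slave,  # work phase
        4: 5512,                # task common
        5: 320,                 # tally data
        6: 204920,              # task array 1
        7: 48 * per_slave,      # task array 2
        8: 32,                  # timing data
    }

q = stage_quantities(p=5, nph=1000, t_hist=798e-6)
gather_per_slave = sum(q[s] for s in (4, 5, 6, 7, 8))
print(q[3], gather_per_slave)
```

Only stages 3 and 7 depend on Nph/(P-1); the other gather messages are fixed in size, so they become relatively more important as P grows for a fixed Nph.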

Performance Model (Overview). Performance is described by analytical expressions. Top level: elements represent the main processing stages. For example: [equation not preserved in the transcript]. Parameters in the model enable scalability studies, e.g. P (number of PEs) and Nph (number of histories).

System Model. The system model encapsulates key system characteristics, including communication (e.g. latency and bandwidth) and computational performance (e.g. processor speed). For example, point-to-point communication can be modeled as a piecewise-linear curve in the message size S (in bytes):

$$
T_{pt2pt}(S) \approx
\begin{cases}
5\,\mu\text{s}, & 0 \le S \le 32 \\
5\,\mu\text{s} + 15\,\text{ns/byte} \cdot S, & 64 \le S \le 1024 \\
10\,\mu\text{s} + 3.4\,\text{ns/byte} \cdot S, & S > 1024
\end{cases}
$$
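The piecewise-linear curve can be written directly as a function. The sketch below assumes the latencies are in microseconds (consistent with the QsNet latencies of a few microseconds quoted elsewhere in the talk) and the per-byte costs in nanoseconds. The slide leaves sizes between 32 and 64 bytes unspecified; treating them as part of the flat region is an assumption, as is the name `t_pt2pt_us`.

```python
def t_pt2pt_us(size_bytes):
    """Approximate point-to-point time in microseconds for a message of size_bytes."""
    if size_bytes < 0:
        raise ValueError("message size must be non-negative")
    if size_bytes < 64:
        # Slide covers 0..32 bytes; 33..63 is unspecified and assumed flat here.
        return 5.0
    if size_bytes <= 1024:
        return 5.0 + 0.015 * size_bytes   # 15 ns/byte = 0.015 us/byte
    return 10.0 + 0.0034 * size_bytes     # 3.4 ns/byte = 0.0034 us/byte

for s in (16, 1024, 2048):
    print(s, t_pt2pt_us(s))
```

Note that the example curve is not continuous at 1024 bytes: the per-byte cost drops while the fixed latency rises, which is a typical signature of a protocol switch for larger messages.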

Single-Processor Performance. The single-processor performance can be modeled or measured. A measured value has several advantages: it avoids the need to model compiler optimizations (which are complex!) and eliminates the need to model the memory hierarchy. It also has disadvantages: it requires preliminary benchmarking experiments (and access to the system), and values are needed for all systems in a comparison.
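A measured per-history time can be obtained with a simple timing loop. The sketch below is a generic benchmarking pattern, not the procedure used by the authors: `measure_t_hist` and `toy_history` are hypothetical names, and the toy workload merely stands in for simulating one particle history.

```python
import time

def measure_t_hist(simulate_history, n_histories=10_000):
    """Average wall-clock seconds per history over n_histories calls."""
    start = time.perf_counter()
    for i in range(n_histories):
        simulate_history(i)
    return (time.perf_counter() - start) / n_histories

def toy_history(i):
    """Placeholder workload standing in for one particle history."""
    x = 0.0
    for k in range(50):
        x += (i + k) ** 0.5
    return x

t_hist = measure_t_hist(toy_history, n_histories=2000)
print(f"measured time per history: {t_hist * 1e6:.2f} us")
```

The measured value folds compiler and memory-hierarchy effects into a single number, which is why it has to be re-measured on every system being compared.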

Experimental Test-bed. Compaq AlphaServer ES40: 32 nodes, each with 4 PEs; 833MHz EV68 Alpha processors; 64K L1, 8MB L2; 8GB memory per node. Quadrics QsNet interconnect: fat-tree topology; low latency (typically 6µs), high bandwidth (~300MB/s).

MCNP Model Parameters

Type         Parameter     Values
System       Lc(S), Bc(S)  5.05µs, 0.0MB/s (S < 64)
                           5.47µs, 78MB/s (64 < S < 512)
                           10.3µs, 294MB/s (S > 512)
             Tpack(S)      0.12ns (S < 32K)
                           0.16ns (32K < S < 4M)
                           0.67ns (S > 4M)
Application  Nph           100, 500, 1000, 5000, 10000, 50000, 100000
             Thist         798µs

S in bytes.
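The measured communication parameters can be held in a small lookup table and combined into a message time. The sketch below assumes the common latency-plus-size-over-bandwidth form, T = Lc(S) + S / Bc(S), which the slide does not state explicitly. It also assumes 1 MB = 10^6 bytes, treats a bandwidth of 0.0MB/s as "no size-dependent term", and assigns the boundary sizes S = 64 and S = 512 (not covered by the strict inequalities) to the larger range. The names `COMM_PARAMS` and `comm_time_us` are hypothetical.

```python
# (exclusive upper bound on S in bytes, latency in us, bandwidth in MB/s)
COMM_PARAMS = [
    (64, 5.05, 0.0),
    (512, 5.47, 78.0),
    (float("inf"), 10.3, 294.0),
]

def comm_time_us(size_bytes):
    """Message time in microseconds, assuming T = Lc(S) + S / Bc(S)."""
    for upper, latency_us, bandwidth_mb_s in COMM_PARAMS:
        if size_bytes < upper:
            if bandwidth_mb_s == 0.0:
                return latency_us  # no size-dependent term in this range
            # 1 MB/s equals 1 byte/us when 1 MB = 10**6 bytes
            return latency_us + size_bytes / bandwidth_mb_s
    raise ValueError("size out of range")

# Fixed-size messages from the parallel activity analysis (sizes assumed in bytes)
for size in (32, 320, 5512, 204920):
    print(size, round(comm_time_us(size), 3))
```

Under these assumptions the 204920-byte message dominates the fixed-size messages at roughly 0.7 ms, comparable to the 798µs cost of a single history.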

Model Validation. The model predicts well: the predicted time is often within 10% of the measured time. Accuracy generally decreases as the number of PEs grows. Some work remains to increase model accuracy.
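The "within 10%" criterion corresponds to a relative-error check like the one sketched below. The function names are hypothetical, and the timing pairs are made-up placeholders chosen only to exercise the check; they are not measurements or predictions from the talk.

```python
def relative_error(predicted, measured):
    """Absolute prediction error as a fraction of the measured time."""
    return abs(predicted - measured) / measured

def within_tolerance(predicted, measured, tol=0.10):
    """True if the prediction is within tol (default 10%) of the measurement."""
    return relative_error(predicted, measured) <= tol

# Placeholder (predicted, measured) pairs in seconds; NOT data from the talk.
samples = [(92.0, 100.0), (85.0, 100.0), (10.3, 10.0)]
for predicted, measured in samples:
    err = relative_error(predicted, measured)
    print(f"{predicted:6.1f} vs {measured:6.1f}: {err:.1%}",
          "ok" if within_tolerance(predicted, measured) else "outside 10%")
```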

Exploring Performance. Once validated, the model can be used to predict performance, e.g. for new scenarios on the current architecture: What if we processed a larger problem? What if we used more processors? [Figure: two plot panels titled "Strong Scaling" and "Weak Scaling".]
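The two what-if questions map onto strong and weak scaling. The sketch below evaluates only the work-phase term, Thist times the per-slave number of histories, using the Thist value of 798µs from the parameter slide. It deliberately ignores the scatter and gather phases, so it is not the full model; rounding the per-slave share up and the function names are assumptions.

```python
import math

T_HIST = 798e-6  # seconds per history (parameter slide)

def work_time(nph, p, t_hist=T_HIST):
    """Work-phase time in seconds for nph histories on p PEs (p-1 slaves)."""
    return t_hist * math.ceil(nph / (p - 1))  # assumption: share rounded up

def strong_scaling(nph, pes):
    """Fixed total problem size, increasing PE count."""
    return {p: work_time(nph, p) for p in pes}

def weak_scaling(nph_per_slave, pes):
    """Fixed work per slave, so the total problem grows with the PE count."""
    return {p: work_time(nph_per_slave * (p - 1), p) for p in pes}

pes = [2, 5, 17]
strong = strong_scaling(10000, pes)
weak = weak_scaling(1000, pes)
print(strong)
print(weak)
```

In this work-only view, strong scaling shrinks the time roughly in proportion to the number of slaves while weak scaling keeps it flat; any deviation from that in the full model comes from the communication stages.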

Exploring Performance (2). The model can explore performance on possible future architectures: What if the network was faster? What if the processors were faster? What if message packing was faster? It can also predict the performance of possible code modifications: What if the “gather” phase was re-implemented using reductions? [Figure: two plot panels titled "Strong Scaling" and "Weak Scaling".]

Conclusions. Developed an analytical performance model for MCNP. Validated the model on a large-scale AlphaServer testbed: the predicted time is often within 10% of the measured time. Used the model to explore a number of scenarios: studied strong and weak scaling modes for small and large inputs, and predicted performance for improved systems and code. Showed that most performance gain will come from increased processor speed. Illustrated the benefits of developing a performance model of an application. Part of an on-going effort to model large-scale systems: http://www.c3.lanl.gov/par_arch