Dr. Gengbin Zheng and Ehsan Totoni, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign, April 18, 2011.


A function-level simulator for parallel applications on petascale machines. An emulator runs applications at full scale. A simulator predicts performance by simulating message passing.

The simulation framework was improved in these respects: even larger and faster emulation and simulation; more accurate prediction of sequential performance; applying BigSim to predict the Blue Waters machine.

Using existing (possibly different) clusters for emulation. Applied memory reduction techniques. For example, for NAMD we achieved the following: emulating a 10M-atom benchmark on 256K cores using only 4K cores of Ranger (2 GB/core); emulating a 100M-atom benchmark on 1.2M cores using only 48K cores of Jaguar (1.3 GB/core).

Predicting the performance of parallel applications can be challenging. Two parts are hard: accurate prediction of sequential execution blocks, and network performance. This paper explores techniques that use statistical models for sequential execution blocks. Reference: Gengbin Zheng, Gagan Gupta, Eric Bohm, Isaac Dooley, and Laxmikant V. Kale, "Simulating Large Scale Parallel Applications using Statistical Models for Sequential Execution Blocks", in the Proceedings of the 16th International Conference on Parallel and Distributed Systems (ICPADS 2010).


CPU scaling: take the execution on one machine to predict another; not accurate. Hardware simulator: cycle-accurate timing; however, it is slow. Performance counters: very hard, and platform specific. Is there an efficient and accurate prediction method?

Identify parameters that characterize the execution time of an SEB (sequential execution block): $x_1, x_2, \ldots, x_n$. With machine learning, build a statistical model of the execution time: $T = f(x_1, x_2, \ldots, x_n)$. Run SEBs in a realistic scenario and collect data points: $x_{11}, x_{21}, \ldots, x_{n1} \Rightarrow T_1$; $x_{12}, x_{22}, \ldots, x_{n2} \Rightarrow T_2$; …
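The model-building step can be sketched in a few lines. This is a minimal illustration, assuming that f is linear in the parameters, which the slides do not state; the function names `fit_linear_model` and `predict` and the data points are hypothetical, and the real work used machine learning software rather than hand-written code.

```python
def fit_linear_model(samples, times):
    """Least-squares fit of T ~ b0 + b1*x1 + ... + bn*xn via the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in samples]
    k = len(rows[0])
    # Augmented matrix for (A^T A) b = A^T T.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, times))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("parameters are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coeffs, x):
    """Evaluate the fitted model for one parameter vector x."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], x))


# Hypothetical data points (x1, x2) => T, generated from T = 2 + 3*x1 + 0.5*x2.
samples = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
times = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in samples]
coeffs = fit_linear_model(samples, times)
```

Once the coefficients are fitted from recorded runs, `predict(coeffs, x)` gives an estimated SEB time for parameter values that were never run.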


Slicing (C2): record and replay any processor; instrument exact parameters; requires a binary-compatible architecture. Miniaturization (C1): scaled-down execution (smaller dataset, smaller number of processors); not all applications can scale down.

Machine learning techniques: linear regression, least median squared linear regression, and SVMreg. Use machine learning software such as Weka. As an example: T(N) = N N
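To show why a least-median-of-squares fit can be preferable to ordinary least squares when a few measurements are disturbed, here is a minimal sketch for a single parameter N. It tries the line through every pair of points and keeps the one with the smallest median squared residual; this exhaustive search is an illustrative simplification, not the Weka implementation, and the function name `lms_line` and the data are hypothetical.

```python
from itertools import combinations
from statistics import median


def lms_line(points):
    """Return (intercept, slope) of the candidate line with the smallest
    median squared residual, trying the line through every pair of points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as T = a + b*N
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        score = median((y - (intercept + slope * x)) ** 2 for x, y in points)
        if best is None or score < best[0]:
            best = (score, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct N")
    return best[1], best[2]


# Hypothetical timings following T = 1 + 2*N, plus one outlier at N = 6.
points = [(1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0), (5, 11.0), (6, 40.0)]
intercept, slope = lms_line(points)
```

Because the score is a median rather than a sum, the single outlier does not pull the fitted line away from the other five points.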

NAS BT benchmark, BT.D.1024: emulate on 64 cores of Abe, recording 8 target cores, then replay on Ranger to predict. Use Abe to predict Ranger.

NAMD STMV 1M-atom benchmark: 4096-core prediction using 512 cores of BG/P.

Figure legend: Native, Prediction.

Blue Waters will be here soon: more than 300K POWER7 cores and the PERCS network. The network is unique and not similar to any other: there is no theory or legacy for it, and applications and runtimes are tuned for other networks. NSF acceptance requires sustained petaFLOPS for real applications. Tuning applications typically takes months to years after a machine comes online.

Two-level, fully connected network. Supernodes with 32 nodes. Links: 24 GB/s LLocal (LL) link, 5 GB/s LRemote (LR) link, 10 GB/s D link, and 192 GB/s to the IBM Hub chip.

A detailed packet-level network model was developed for Blue Waters. It was validated against MERCURY and IH-Drawer: Ping Pong is within 1.1% of MERCURY, and Alltoall is within 10% of the drawer. It includes optimizations for that scale and different outputs (link statistics, projections…). Some example studies (E. Totoni et al., submitted to SC11): topology-aware mapping, system noise, and collective optimization.

Apply different mappings in simulation and evaluate them using the output tools (link utilization…).

Example: mapping of a 3D stencil. Default mapping: rank-ordered mapping of MPI tasks to the cores of nodes. Drawer mapping: 3D cuboids on nodes and drawers. Supernode mapping: 3D cuboids on supernodes as well; an 80% decrease in communication and 20% in overall time.
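The effect of the mapping on off-node traffic can be illustrated with a small counting sketch. It assumes a non-periodic 7-point stencil on an 8×8×8 task grid with 8 cores per node, all chosen for illustration only; the function names are hypothetical and the numbers are not those of the study.

```python
from itertools import product


def off_node_pairs(dims, node_of):
    """Count neighbouring task pairs of a 7-point stencil placed on different nodes."""
    count = 0
    for x, y, z in product(*(range(d) for d in dims)):
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            nx, ny, nz = x + dx, y + dy, z + dz
            if nx < dims[0] and ny < dims[1] and nz < dims[2]:
                if node_of((x, y, z)) != node_of((nx, ny, nz)):
                    count += 1
    return count


def rank_ordered(dims, cores_per_node):
    """Default mapping: consecutive ranks fill one node after another."""
    def node_of(coord):
        x, y, z = coord
        rank = (x * dims[1] + y) * dims[2] + z
        return rank // cores_per_node
    return node_of


def cuboid(block):
    """Cuboid mapping: each node holds a block[0] x block[1] x block[2] sub-grid."""
    def node_of(coord):
        return tuple(v // b for v, b in zip(coord, block))
    return node_of


dims = (8, 8, 8)  # hypothetical task grid
default_cost = off_node_pairs(dims, rank_ordered(dims, 8))
cuboid_cost = off_node_pairs(dims, cuboid((2, 2, 2)))
```

In this toy setting the rank-ordered mapping puts a full line of tasks on each node, so every neighbour pair in two of the three dimensions crosses nodes, while the 2×2×2 cuboids keep more neighbours together. The same idea applies one level up when cuboids are placed on drawers and supernodes.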

BigSim's noise injection support. Captured noise: the noise of a node with the same architecture is captured and replicated on all of the target machine's nodes with random offsets (not synchronized). Artificial noise: periodic noise patterns for each core of a node, with periodicity, amplitude, and offset (phase), replicated on all nodes with random offsets. In both cases the length of computations is increased accordingly.
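How a periodic noise pattern lengthens a computation can be sketched as follows. This is an illustrative model, not BigSim's implementation: it assumes that each noise burst blocks the core for `amplitude` time units once per `period`, that times are integer microseconds, and that the function names and values are hypothetical.

```python
import random


def finish_time(start, work, period, amplitude, offset):
    """Return the time at which `work` units of computation started at `start`
    complete, when noise occupies [offset + k*period, offset + k*period + amplitude)."""
    if not 0 <= amplitude < period:
        raise ValueError("amplitude must satisfy 0 <= amplitude < period")
    t, remaining = start, work
    while remaining > 0:
        phase = (t - offset) % period
        if phase < amplitude:
            # Inside a noise burst: the core makes no progress until it ends.
            t += amplitude - phase
            phase = amplitude
        # Compute until the work is done or the next burst begins.
        step = min(period - phase, remaining)
        t += step
        remaining -= step
    return t


def node_offsets(num_nodes, period, seed=0):
    """Draw an unsynchronized random phase offset for each node."""
    rng = random.Random(seed)
    return [rng.randrange(period) for _ in range(num_nodes)]


# Hypothetical pattern: a 100 us burst every 1000 us, applied to 1 ms of work.
aligned = finish_time(0, 1000, 1000, 100, 0)
shifted = finish_time(0, 1000, 1000, 100, 500)
```

With these values the same 1 ms of work finishes at 1200 µs when it starts on a burst and at 1100 µs when the burst is shifted by half a period, which is why unsynchronized offsets across nodes matter for collectives.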

Applications: NAMD, 10-million-atom system on 256K cores; MILC, su3_rmd on 4K cores; K-Neighbor with Allreduce, with 1 ms of sequential computation and 2.4 ms total iteration time; K-Neighbor without Allreduce. Each is inspected with different noise frequencies and amplitudes.

NAMD is tolerant, but jumps around certain frequencies. MILC is sensitive even in small jobs (4K cores), which are affected almost as much as large jobs.

Simple models are not enough. At high frequencies there is a chance that the small MPI calls for collectives are affected, delaying execution significantly. Combining noise patterns:

Optimized Alltoall for large messages within a supernode (SN). BigSim's link utilization output gives the idea: the outgoing links of a node are not used effectively at all times during Alltoall, and the usage of the different links is shifted.

New algorithm: each core sends to a different node, so all 31 outgoing links are used. Link utilization now looks better.
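The link-usage argument can be made concrete with a counting sketch. It does not reproduce the actual Alltoall schedule from the study; it only compares how many outgoing links of one node are busy in a step under two hypothetical destination rules. The 32 nodes per supernode come from the slides, while the 32 cores per node, the rank layout, and both rules are assumptions.

```python
NODES = 32  # nodes per supernode (from the slides)
CORES = 32  # cores per node: an assumption for this sketch


def pairwise_links(node, step):
    """Destination nodes when rank r sends to rank (r + step) mod P,
    with ranks laid out consecutively on nodes (assumed layout)."""
    total = NODES * CORES
    dests = {((node * CORES + c + step) % total) // CORES for c in range(CORES)}
    dests.discard(node)  # on-node messages use no outgoing link
    return dests


def spread_links(node, step):
    """Destination nodes when each core of a node targets a different node
    (hypothetical rule: core c uses node offset 1 + (c + step) mod 31)."""
    return {(node + 1 + (c + step) % (NODES - 1)) % NODES for c in range(CORES)}


# Most links of node 0 used in any single step of the rank-shifted exchange.
pairwise_max = max(len(pairwise_links(0, s)) for s in range(1, NODES * CORES))
# Fewest links of node 0 used in any step when cores target different nodes.
spread_min = min(len(spread_links(0, s)) for s in range(NODES - 1))
```

Under these assumptions the rank-shifted exchange drives at most two outgoing links of a node at a time, whereas spreading the cores over different destination nodes keeps all 31 links busy in every step.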

Up to 5 times faster than the existing Pairwise-Exchange scheme, and much closer to the theoretical peak bandwidth of the links.