Cross-Platform Performance Prediction Using Partial Execution
Leo T. Yang, Xiaosong Ma*, Frank Mueller
Department of Computer Science
Center for High Performance Simulations (CHiPS)
North Carolina State University
(* Joint Faculty with Oak Ridge National Laboratory)
Supercomputing 2005

Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work

Cross-Platform Performance Prediction
- Users face a wide selection of machines
- Need cross-platform performance prediction to
  - Choose a platform to use / purchase
  - Estimate resource usage
  - Estimate job wall time
- Machines and applications both grow larger and more complex
  - Modeling- and simulation-based approaches become harder and more expensive
  - Performance data is not reused in performance prediction

Observation-Based Performance Prediction
- Observe cross-platform behavior
  - Treat applications and platforms as black boxes
  - Avoid case-by-case model building
  - Cover the entire application: computation, communication, I/O
  - Convenient with third-party libraries
- Performance translation
  - Observation: existence of a "reference platform"
  - Goal: a cross-platform meta-predictor
  - Approach: based on relative performance
- [Figure: a known run time on one platform (e.g., T = 20 hrs) is translated into an unknown run time (T = ?) on another]

Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work

Main Idea: Utilizing Partial Execution
- Observation: the majority of scientific applications are iteration-based
  - Highly repetitive behavior
  - Phases map to timesteps
- Execute small partial executions
  - Low-cost "test drives"
  - Simple API (indicate the number of timesteps, k)
  - Quit after k timesteps
- [Figure: a full run (Full-1) and a partial run (Partial-1) on the reference system, and a partial run (Partial-2) on the target system; the observed relative performance (e.g., 0.6) yields the predicted full run (Full-2) on the target]

Application Model
- Execution of parallel simulations is modeled as the regular expression I(C*[W])*F
  - I: one-time initialization phase
  - C: computation phase
  - W: optional I/O phase
  - F: one-time finalization phase
  - Different phases likely have different cross-platform relative performance
- Major challenges
  - Avoid the impact of initially unstable performance
  - Predict the correct mixture of C and W phases
(A loop skeleton matching this model is sketched below.)
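To make the regular expression concrete, the following is a minimal, hypothetical loop skeleton (illustrative only, not code from the paper) whose execution matches I(C*[W])*F. Here the C-to-W mixture is fixed by an io_interval constant, whereas real codes often choose it at run time, which is exactly why the predictor must infer it.

```c
#include <stdio.h>

/* Hypothetical phase routines standing in for a real simulation's phases. */
static void initialize(void)        { printf("I: one-time initialization\n"); }
static void compute_timestep(int t) { printf("C: computation, step %d\n", t); }
static void write_output(int t)     { printf("W: periodic I/O, step %d\n", t); }
static void finalize(void)          { printf("F: one-time finalization\n"); }

int main(void)
{
    const int nsteps = 12;      /* total timesteps                      */
    const int io_interval = 4;  /* a W phase follows every 4th C phase  */

    initialize();                              /* I                     */
    for (int t = 0; t < nsteps; t++) {
        compute_timestep(t);                   /* C                     */
        if ((t + 1) % io_interval == 0)
            write_output(t);                   /* [W], optional         */
    }
    finalize();                                /* F                     */
    return 0;
}
```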

Partial Execution
- Terminate applications prematurely
- API
  - init_timestep(): optional, useful with a large setup phase
  - begin_timestep()
  - end_timestep(maxsteps)
  - The "begin" and "end" calls bracket the C or CW phase
  - Execution is terminated after maxsteps timesteps
- Easy-to-use interface
  - Only 2-3 lines of code inserted into the source code
(An instrumented loop is sketched below.)
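A minimal sketch of how the two to three inserted lines might look in the loop skeleton above. The call names come from the slide; the signatures and the stub implementations below are assumptions made only to keep the example self-contained and runnable.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the partial-execution library: the names are
 * taken from the slide, the implementations are guesses for illustration. */
static int steps_done = 0;
static void init_timestep(void)  { /* optional: marks the end of a large setup phase */ }
static void begin_timestep(void) { /* e.g., record the start time of this timestep   */ }
static void end_timestep(int maxsteps)
{
    if (++steps_done >= maxsteps) {
        printf("partial execution: terminating after %d timesteps\n", steps_done);
        exit(0);                       /* terminate the application prematurely */
    }
}

/* Application stubs standing in for the real simulation code. */
static void initialize(void)        { printf("initialize\n"); }
static void compute_timestep(int t) { printf("compute step %d\n", t); }
static void finalize(void)          { printf("finalize\n"); }

int main(void)
{
    initialize();
    init_timestep();                   /* inserted line 1 (optional)              */
    for (int t = 0; t < 100000; t++) {
        begin_timestep();              /* inserted line 2: start of C/CW phase    */
        compute_timestep(t);
        end_timestep(10);              /* inserted line 3: quit after 10 steps    */
    }
    finalize();                        /* not reached during a partial execution  */
    return 0;
}
```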

Base Prediction Model
- Given a reference platform and a target platform
  - Perform one or more partial executions
  - Compute the average per-timestep execution time on both platforms
  - Compute the relative performance
  - Compute the overall execution time estimate for the target platform
- Prediction accuracy is reported as the predicted-to-actual ratio
(A sketch of this computation follows below.)
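A sketch of the base model as I read it from the slide (not code from the paper): the relative performance is the ratio of the average per-timestep times observed during the partial executions, and the target's full-run time is estimated by scaling a known full-run time on the reference platform by that ratio. Variable names and the concrete numbers are hypothetical.

```c
#include <stdio.h>

/* Estimate the target platform's full-run time from partial executions
 * on both platforms (assumed formula based on the slide).               */
static double predict_target_time(double avg_step_ref,     /* avg timestep time, reference (s) */
                                   double avg_step_target,  /* avg timestep time, target (s)    */
                                   double full_time_ref)    /* known full-run time, reference   */
{
    double relative_perf = avg_step_target / avg_step_ref;  /* e.g., 0.6 */
    return full_time_ref * relative_perf;
}

int main(void)
{
    /* Hypothetical numbers: target timesteps take 0.6x the reference time and
     * the reference full run took 20 hours -> predicted target run is 12 hours. */
    double predicted = predict_target_time(10.0, 6.0, 20.0 * 3600.0);
    printf("predicted target run time: %.1f hours\n", predicted / 3600.0);
    return 0;
}
```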

Refined Prediction Model
- Problem 1: initial performance fluctuations
  - Variance due to cache warm-up, etc.
  - May span dozens of timesteps
- Problem 2: periodic I/O phases
  - I/O frequency is often configurable and determined at run time
- Unified solution
  - Monitor per-timestep performance variance at runtime to identify anomalies and repeated patterns
  - Filter out early, unstable timestep measurements: consider only later results once performance stabilizes, and fold early timestep overheads into the initialization cost
  - Compute sliding-window averages of per-timestep overheads, using multiples of the observed pattern length as the window size
(A sliding-window sketch follows below.)
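A simplified sketch of the sliding-window idea (my interpretation of the slide, not the paper's actual algorithm): discard the unstable warm-up timesteps, then average per-timestep times over a window whose length is a multiple of the observed I/O pattern period, so the C/W mixture inside the window is representative of the steady state.

```c
#include <stdio.h>

/* Average per-timestep time over the most recent sliding window, where the
 * window length is a multiple of the observed pattern period and the first
 * `warmup` unstable timesteps are ignored (folded into initialization cost). */
static double windowed_avg(const double *step_time, int nsteps,
                           int warmup, int period, int multiple)
{
    int window = period * multiple;
    if (nsteps - warmup < window)        /* not enough stable samples yet */
        return -1.0;

    double sum = 0.0;
    for (int i = nsteps - window; i < nsteps; i++)   /* most recent window */
        sum += step_time[i];
    return sum / window;
}

int main(void)
{
    /* Hypothetical trace: one slow warm-up step, then 1.0 s compute steps
     * with a more expensive 3.0 s I/O step every 5th timestep (period = 5). */
    double t[16];
    for (int i = 0; i < 16; i++)
        t[i] = (i == 0) ? 5.0 : ((i % 5 == 4) ? 3.0 : 1.0);

    printf("stable avg per-timestep time: %.2f s\n",
           windowed_avg(t, 16, /*warmup=*/1, /*period=*/5, /*multiple=*/2));
    return 0;
}
```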

Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work

Proof-of-Concept Experiments
- Questions
  - Is the relative performance observed during a very short early period indicative of the overall relative performance?
  - Can we reuse partial-execution data to predict executions with different configurations?
- Experiment settings
  - Large-scale codes: two ASCI Purple benchmarks (sphot and sPPM), a fusion code (Gyro), and a rocket simulation (GENx)
  - Full runs take more than 5 hours
  - 10 supercomputers at SDSC, NCSA, ORNL, LLNL, UIUC, NCSU, and NERSC
  - 7 architectures (SP3, SP4, Altix, Cray X1, and three clusters: G5, Xeon, Itanium)

Base Model Accuracy (sphot)
- High accuracy with a very short partial execution

Refined Model (sPPM, Ram -> Henry2)
- Issues
  - Ram: variance in initial timesteps
  - Henry2: I/O in 1 out of every 10 timesteps
- Smarter algorithms
  - Initialization filter
  - Sliding window
  - Together these handle the anomaly and the periodic I/O
- [Figure: normalized per-timestep performance]

Application with Variable Problem Size
- GENx rocket simulation (CSAR, UIUC), Turing -> Frost
- Limited accuracy with variable timesteps

Reusing Partial Execution Data
- Scientists often repeat runs with different configurations
  - Number of processors
  - Input size and data content
  - Computation tasks
- Results from the Gyro fusion simulation on 5 platforms
- [Figure: average errors of 12.1% and 5.6%]

Presentation Roadmap
- Introduction
- Model and approach
- Performance results
- Conclusion and future work

Conclusion
- Empirical performance prediction works!
  - Real-world production codes
  - Multiple parallel platforms
  - Highly accurate predictions
  - Limitations with variable problem sizes and input-size/processor scaling
- Observation-based prediction
  - Simple
  - Portable
  - Low cost (a few timesteps)
- [Figure: example full-run times of 20, 10, 2, and 1 hours across platforms]

Related Work
- Parallel program performance prediction
  - Application-specific analytical models
  - Compiler/instrumentation tools
  - Simulation-based predictions
- Cross-platform performance studies
  - Mostly examine multiple platforms individually
- Grid job schedulers
  - Do not offer cross-platform performance translation

Ongoing and Future Work
- Evaluate with AMR applications
- Automated partial execution
  - Automatic computation phase identification
  - Binary rewriting to avoid source code modification
- Extend to non-dedicated systems
  - For job schedulers