A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications

Presentation transcript:

A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications
Boyana Norris, Argonne National Laboratory
Van Bui, Lois Curfman McInnes, Li Li, Argonne National Laboratory
Oscar Hernandez, Barbara Chapman, University of Houston
Kevin Huck, University of Oregon

Outline
• Motivation
• Performance/Power Models
• Component Infrastructure
• Experiments
• Conclusions and Future Work
• Acknowledgements

CBHPC, Karlsruhe, Germany, October 17, 2008

Component-Based Software Engineering
• Component: a functional unit with well-defined interfaces and dependencies
• Components interact through ports
• Benefits: software reuse, management of complex software, code generation, available “services”
• Drawbacks: more restrictive software engineering, need for a runtime framework

Motivation
• Component-based software engineering (CBSE) is increasingly used in HPC
• Power is increasing in importance
• Simpler processes are needed for performance/power measurement and analysis
  - Performance tools can be applied at the component abstraction layer
  - Opportunities for automation

Power vs. Energy
• Power: the rate at which a system performs work
  Power = Work / ΔTime
• Energy: the total work over a period of time
  Energy = Power * ΔTime
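The two definitions can be checked numerically. The following is a minimal sketch, assuming power is sampled at a fixed interval; the function names and the sample values are illustrative and do not come from the slides.

```python
def average_power(work_joules: float, elapsed_s: float) -> float:
    """Power = Work / ΔTime, in watts."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return work_joules / elapsed_s


def energy_from_samples(power_watts: list[float], dt_s: float) -> float:
    """Energy = sum of Power * ΔTime over fixed-interval samples, in joules."""
    return sum(p * dt_s for p in power_watts)


if __name__ == "__main__":
    samples = [10.0] * 60  # hypothetical: 10 W sampled once per second
    joules = energy_from_samples(samples, dt_s=1.0)
    print("energy (J):", joules)
    print("average power (W):", average_power(joules, elapsed_s=60.0))
```

Keeping the two quantities in separate functions makes the distinction explicit: power describes an instant (or an average), while energy accumulates over the whole run.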

Power Trends
Figure source: Cameron, K. W., Ge, R., and Feng, X. High-Performance, Power-Aware Distributed Computing for Scientific Applications. Computer 38, 11 (Nov. 2005).

Power Reduction Techniques
• Circuit and logic level
• Low-power interconnect
• Low-power memories and memory hierarchy
• Low-power processor architecture adaptations
• Dynamic voltage scaling
• Resource hibernation
• Compiler-level power management
• Application-level power management

Goals and Approach
• Provide a component-based system
  - Facilitates performance/power measurement and analysis
  - Computes high-level performance metrics
  - Integrates existing tools into a uniform interface
  - End goal: static and dynamic optimizations based on offline/online analyses

System Diagram
Diagram labels: Analysis Infrastructure; Control Infrastructure; Interactive Analysis and Model Building; Machine Learning; Performance/Power Databases (persistent & runtime); Instrumented Component Application Runs; Control System (parameter changes and component substitution); CQoS-Enabled Component Application (Component A, Component B, Component C); Substitution Set; Substitution Assertion Database.

Performance Model I
• FLP Inefficiency – PD: problem-size-dependent variant
• FLP Inefficiency – PI: problem-size-independent variant

Metric                  Formula
Global Stalls           Stall_cycles / total_cycles
% FLP Stalls            FLP_stalls / stall_cycles
FLP Inefficiency – PD   FLP_OPS * stalls / cycles
FLP Inefficiency – PI   (FLP_OPS / retired_inst) * stall / cycle
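The four metrics can be computed directly from raw counter totals. This is a minimal sketch, not the authors' implementation: the `Counters` class and its field names are hypothetical, and it assumes that "stalls/cycles" in both inefficiency formulas means stall cycles divided by total cycles.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical container for hardware counter totals from one run."""
    total_cycles: int
    stall_cycles: int
    flp_stalls: int
    flp_ops: int
    retired_inst: int


def model_one_metrics(c: Counters) -> dict[str, float]:
    """Compute the Performance Model I metrics from counter totals."""
    # Assumption: "stalls/cycles" is the overall stall fraction.
    stall_fraction = c.stall_cycles / c.total_cycles
    return {
        "global_stalls": stall_fraction,
        "flp_stalls_fraction": c.flp_stalls / c.stall_cycles,
        # Problem-size dependent: scales with the raw operation count.
        "flp_inefficiency_pd": c.flp_ops * stall_fraction,
        # Problem-size independent: normalized by retired instructions.
        "flp_inefficiency_pi": (c.flp_ops / c.retired_inst) * stall_fraction,
    }


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    run = Counters(total_cycles=1000, stall_cycles=400,
                   flp_stalls=100, flp_ops=200, retired_inst=800)
    for name, value in model_one_metrics(run).items():
        print(f"{name}: {value:.3f}")
```

The PI variant divides by retired instructions, which is why it can be compared across runs with different problem sizes, while the PD variant grows with the amount of floating-point work.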

Performance Model II
• Core logic stalls = L1D_register_stalls + branch_misprediction + instruction_miss
    + stack_engine_stalls + floating_point_stalls
    + pipeline_inter_register_dependency + processor_frontend_flush
• Memory stalls = L1_hits * L1_latency + L2_hits * L2_latency + L3_hits * L3_latency
    + local_mem_access * local_mem_latency
    + remote_mem_access * remote_mem_latency
    + TLB_miss * TLB_miss_penalty
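Both formulas are plain sums, so they translate into a few lines of code. The sketch below uses the term names from the slide as dictionary keys; the dictionary layout and any latency values supplied by a caller are assumptions, since the slides do not say how the counters are stored or which latencies were used.

```python
# Event names taken from the core logic stall formula.
CORE_STALL_EVENTS = (
    "L1D_register_stalls",
    "branch_misprediction",
    "instruction_miss",
    "stack_engine_stalls",
    "floating_point_stalls",
    "pipeline_inter_register_dependency",
    "processor_frontend_flush",
)

# (count name, latency name) pairs taken from the memory stall formula.
MEMORY_TERMS = (
    ("L1_hits", "L1_latency"),
    ("L2_hits", "L2_latency"),
    ("L3_hits", "L3_latency"),
    ("local_mem_access", "local_mem_latency"),
    ("remote_mem_access", "remote_mem_latency"),
    ("TLB_miss", "TLB_miss_penalty"),
)


def core_logic_stalls(counts: dict[str, int]) -> int:
    """Sum the stall cycles attributed to each core logic event."""
    return sum(counts[event] for event in CORE_STALL_EVENTS)


def memory_stalls(counts: dict[str, int], latencies: dict[str, int]) -> int:
    """Weight each memory event count by its latency in cycles and sum."""
    return sum(counts[count] * latencies[latency]
               for count, latency in MEMORY_TERMS)
```

Keeping the term lists as data makes it straightforward to adapt the model to a processor with a different set of stall events or memory levels.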

Power Model
• Based on on-die components
• Leverages performance hardware counters
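The slide does not give the model's equations. As an illustration only, one common way to build a counter-based, per-component power estimate is to scale each on-die component's maximum power by how often it is accessed and add a constant idle term. The sketch below assumes that formulation; the component names, wattages, and counter values are hypothetical and are not the authors' model.

```python
def component_power(access_rate: float, max_power_w: float,
                    idle_power_w: float = 0.0) -> float:
    """Estimate one component's power from its access rate (0..1)."""
    rate = min(max(access_rate, 0.0), 1.0)  # clamp to a valid utilization
    return rate * max_power_w + idle_power_w


def total_power(components: dict[str, tuple[float, float]],
                event_counts: dict[str, int], cycles: int) -> float:
    """Sum per-component estimates.

    components maps a component name to (max_power_w, idle_power_w);
    event_counts maps the same name to the counter value used as its
    access count over the measured interval.
    """
    return sum(
        component_power(event_counts[name] / cycles, max_w, idle_w)
        for name, (max_w, idle_w) in components.items()
    )


if __name__ == "__main__":
    # Hypothetical components and counter values, for illustration only.
    parts = {"fpu": (4.0, 0.5), "l1_cache": (2.0, 0.25)}
    counts = {"fpu": 250, "l1_cache": 500}
    print(total_power(parts, counts, cycles=1000))
```

In practice the per-component weights would have to be calibrated against measured power for the target chip, which is where the die layout becomes relevant.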

Die Photo for SiCortex

Performance Measurement and Analysis System
• Components
  - TAU: performance measurement
  - Performance Database Component(s)
  - PerfExplorer: performance and power analysis
Diagram labels: Component App; TAU Component; Database Components; PerfExplorer Component; Runtime Optimization; Compiler feedback; User/tool analysis.

PerfExplorer Component
• Loads a Python analysis script
• Performance and power analysis
• Data mining, inference rules, comparing different experimental runs
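To make "inference rules" and "comparing different experimental runs" concrete, here is a small standalone sketch. It deliberately does not use the PerfExplorer scripting API, which the slides do not show; the metric names, the threshold, and both helper functions are hypothetical.

```python
from typing import Callable

Metrics = dict[str, float]
Rule = tuple[Callable[[Metrics], bool], str]


def relative_change(baseline: Metrics, candidate: Metrics) -> Metrics:
    """Relative change of each metric present in both runs."""
    return {
        name: (candidate[name] - baseline[name]) / baseline[name]
        for name in baseline.keys() & candidate.keys()
        if baseline[name] != 0
    }


def apply_rules(metrics: Metrics, rules: list[Rule]) -> list[str]:
    """Return the message of every rule whose predicate holds."""
    return [message for predicate, message in rules if predicate(metrics)]


# Hypothetical rule: flag runs that spend most cycles stalled.
RULES: list[Rule] = [
    (lambda m: m.get("global_stalls", 0.0) > 0.5,
     "More than half of all cycles are stalls"),
]

if __name__ == "__main__":
    run_a = {"global_stalls": 0.6, "time_s": 120.0}  # made-up values
    run_b = {"global_stalls": 0.3, "time_s": 90.0}   # made-up values
    print(apply_rules(run_a, RULES))
    print(relative_change(run_a, run_b))
```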

Study I: Performance-Power Trade-offs
• Experiment: effect of compiler optimization levels on performance and power
• Experimental details
  - Machine: SGI Altix 300
  - MPI processes: 16
  - Compiler: OpenUH
  - Code: GenIDLEST
  - Optimization levels: -O0, -O1, -O2, -O3
  - Performance tools: TAU, PerfExplorer, and PAPI

Linux/ccNUMA

Results
• Aggressive optimizations lead to higher power
• IPC correlates with power dissipation
• Aggressive optimizations lead to lower energy
• Operation count correlates with energy consumption
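Higher power and lower energy are compatible because energy is power multiplied by run time: an optimized build can draw more power while finishing soon enough to use less energy overall. The sketch below illustrates this with invented numbers; they are not measurements from the study.

```python
def energy_joules(avg_power_w: float, runtime_s: float) -> float:
    """Energy = Power * ΔTime."""
    return avg_power_w * runtime_s


# Hypothetical (average power in W, run time in s) per optimization level.
RUNS = {
    "-O0": (100.0, 200.0),
    "-O3": (120.0, 100.0),
}

if __name__ == "__main__":
    for level, (power_w, runtime_s) in RUNS.items():
        joules = energy_joules(power_w, runtime_s)
        print(f"{level}: {power_w} W for {runtime_s} s = {joules} J")
```

In this made-up case the -O3 build draws 20% more power but runs in half the time, so its total energy is lower.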

Performance/Power Study With PETSc Codes
• PETSc: Portable, Extensible Toolkit for Scientific Computation
• Experimental details
  - Machine: SGI Altix 3600
  - Compiler: GCC
  - MPI processes: 32
  - Application: 2-D simulation of cavity flow
    Krylov subspace linear solvers: FGMRES, GMRES, BiCGStab (BCGS)
    Preconditioner: Block Jacobi
    Problem size: 16x16 per processor (weak scaling)
  - Performance tools: TAU, PerfExplorer, PAPI
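PETSc applications typically select the Krylov solver and preconditioner at run time through the `-ksp_type` and `-pc_type` options, so a solver comparison like this one can be scripted as a sweep over command lines. The sketch below only builds the commands; the executable name `./cavity_flow` and the use of `mpiexec` are assumptions, and the slides do not show how the runs were actually launched.

```python
def solver_commands(executable: str, nprocs: int = 32,
                    solvers: tuple[str, ...] = ("fgmres", "gmres", "bcgs"),
                    preconditioner: str = "bjacobi") -> list[list[str]]:
    """Build one launch command (as an argv list) per Krylov solver."""
    return [
        ["mpiexec", "-n", str(nprocs), executable,
         "-ksp_type", solver, "-pc_type", preconditioner]
        for solver in solvers
    ]


if __name__ == "__main__":
    # "./cavity_flow" is a hypothetical executable name.
    for command in solver_commands("./cavity_flow"):
        print(" ".join(command))
```

Each argv list can be passed to `subprocess.run` once the application is built and instrumented.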

Inefficiency
• Bottlenecks in the methods used to solve the linear system
• Bottleneck also in the preconditioner

Results
• FGMRES performs well initially
  - Not very power efficient
• BCGS gives the best performance and power efficiency

Conclusions
• Modern systems offer little or no hardware or software support for detailed power measurement and analysis
• More integrated toolsets are needed that support both performance and power measurement, analysis, and optimization
• Combining tools with component-based software engineering can improve the efficiency and effectiveness of the tuning process

Future Directions
• Integration of components into a framework
• Dynamic selection of algorithms and parameters based on offline/online analyses
• Compiler-based performance and power cost modeling
• Continued performance and power analysis of PETSc-based codes
• Extension of the performance and power models to more modern architectures

References
• Jarp, S. A methodology for using the Itanium-2 performance counters for bottleneck analysis. Tech. rep., HP Labs, August.
• Bircher, W. L., and John, L. K. Complete System Power Estimation: A Trickle-Down Approach Based on Performance Events. International Symposium on Performance Analysis of Systems & Software.
• Isci, C., and Martonosi, M. Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (December 2003).
• Huck, K., Hernandez, O., Bui, V., Chandrasekaran, S., Chapman, B., Malony, A. D., McInnes, L. C., and Norris, B. Capturing Performance Knowledge for Automated Analysis. Supercomputing.

Acknowledgments
• Professors/Advisors: Boyana Norris, Lois Curfman McInnes, Barbara Chapman, Allen Malony, Danesh Tafti
• Students: Oscar Hernandez, Kevin Huck, Sunita Chandrasekaran, Li Li
• SiCortex: Lawrence Stuart and Dan Jackson
• MCS Division, Argonne National Laboratory
• NSF, DOE, NCSA, NASA