Autotuning Large Computational Chemistry Codes

PERI Principal Investigators:
– David H. Bailey (lead), Lawrence Berkeley National Laboratory
– Jack Dongarra and Shirley Moore, University of Tennessee at Knoxville

Other Lead Investigators:
– Samuel Williams, Lawrence Berkeley National Laboratory
– Mark Gordon and Theresa Windus, Ames Laboratory
– Joseph Kenny, Sandia National Laboratory
– Allen Malony and Sameer Shende, University of Oregon

Ab initio Chemistry Codes and Applications

Codes: GAMESS, NWChem, MPQC
– Community codes: >100,000 users

DOE Combustion Energy Frontier Research Center (CEFRC)
– Emily Carter, Princeton University
– Target application and kernel: large-scale simulations of the large hydrocarbons and sulfur-containing hydrocarbons that are components of diesel fuel
– Linear-scaling multireference configuration interaction (MRCI) module

Applications
– Solar energy cell design
– Combustion efficiency
– Materials science
– Nanoscience and nanoelectronics

Broader impact: results applicable to other ab initio chemistry codes

Motivation for Autotuning

Large-scale, complex architectures
– Performance tuning requires expertise and is time-intensive
– Hand-tuned codes are difficult to maintain
– The GPU performance optimization space is discontinuous

Real-world test case for the PERI autotuning tools
– Feedback from applications helps improve the tools
– PAPI, TAU, HPCToolkit, CHiLL, Orio, ROSE, GCO, Active Harmony
– Also using Open|SpeedShop and PerfExpert

PERI Autotuning Workflow

(Workflow diagram.) The developer's original code goes through code triage (HPCToolkit, TAU, PAPI) and a code outliner (ROSE compiler) to produce outlined code; transformation recipes drive transformation and code generation (CHiLL, LoopTool, POET, Orio) to produce optimized code variants; a search engine (Active Harmony, GCO) runs the variants in the execution environment on representative input and uses the performance feedback to converge on the optimized code.

Project Status/Milestones

Q1, Q2, Q3 milestones largely achieved
– Integration of the MRCI code into GAMESS; analysis
– Profiling of MPQC integral kernels; autotuning
– Setup of the PerfDMF database
– DAG scheduler not yet implemented

Q4 milestones (current work)
– Evaluation of integral code autotuning
– Parallelization and autotuning of the MRCI code
– Identification of additional kernels for autotuning

Gprof Profile for MPQC Integral Computation

GNU gprof flat profile, top functions in decreasing order of self time:
sc::Int2eV3::blockbuildprim_1()
sc::Int2eV3::compute_erep()
sc::EAVLMMap<>::find()
do_sparse_transform2_3new()
do_sparse_transform2_1new()
sc::Int2eV3::shiftam_34()
do_sparse_transform2_2new()
sc::Int2eV3::build_not_using_gcs()
sc::Int2eV3::shiftam_12()
sc::Int2eV3::int_have_stored_integral()
…

TAU Analysis of Threaded MPQC

Optimized TAU instrumentation using sampling and selective instrumentation
Identified blockbuildprim and compute_erep as the significant routines

Collected PAPI Data

Flops/cycle = 0.24 (i.e., CPI = 4.2)
L1 cache miss rate = 0.45%
L2 cache miss rate = 5.6%
TLB miss rate = 0.017%
Branch misprediction rate = 3.7%
Cycles stalled = 261 M (21% of total cycles)

Question: is the application memory-bound or CPU-bound?
– T(n) is between O(n²) and O(n⁴)
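
For reference, a minimal sketch of how counters like these are gathered with PAPI's low-level C API (event names and availability vary by machine; the kernel call site is a placeholder):

    #include <papi.h>
    #include <stdio.h>

    /* Minimal PAPI measurement harness.  PAPI_TOT_CYC and PAPI_TOT_INS are
       standard preset events, but availability varies by platform. */
    int main(void)
    {
        long long counts[2];
        int evset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;                       /* PAPI unavailable */
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_CYC);
        PAPI_add_event(evset, PAPI_TOT_INS);

        PAPI_start(evset);
        /* ... run the kernel under study here ... */
        PAPI_stop(evset, counts);

        printf("CPI = %.2f\n", (double)counts[0] / (double)counts[1]);
        return 0;
    }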

A Stand-Alone Kernel

    /* Stand-alone version of the hot MPQC kernel, with the loop-variable
       declarations and closing braces restored.  half_ooze is a member of
       the enclosing class in MPQC; it is declared extern here so the
       kernel compiles on its own. */
    extern double half_ooze;

    void blockbuildprim_1(double* A2, double* B, int amin, int amax,
                          int am34, int size34)
    {
        for (int am12 = amin; am12 <= amax; am12++) {
            for (int i12 = 2; i12 <= am12; i12++) {
                for (int k12 = 0; k12 <= am12 - i12; k12++) {
                    double *A = &A2[am34 + 1];
                    double d = half_ooze;
                    int k = 0;
                    for (int i34 = 1; i34 <= am34; i34++) {
                        for (int k34 = 0; k34 <= am34 - i34; k34++) {
                            A[k] += d * B[k];  /* short inner trip counts */
                            k++;
                        }
                        d += half_ooze;
                    }
                    A2 += size34;
                }
            }
        }
    }

Observed problem: lack of ILP in the short, triangular inner loops.

Improvement

We implemented 7 specializations manually
– CHiLL required a rewrite of the code in order to work

(Table: old CPI vs. new CPI for each value of the am34 variable.)
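
To illustrate what such a specialization looks like, here is a hypothetical variant of the stand-alone kernel for am34 == 1 (a sketch, not the actual MPQC code): with am34 fixed, the i34/k34 loops collapse to a single multiply-add, removing the inner-loop overhead.

    extern double half_ooze;   /* as in the original kernel */

    /* Hypothetical specialization for am34 == 1: &A2[am34 + 1] becomes
       &A2[2], and the inner loops reduce to one multiply-add. */
    void blockbuildprim_1_am34_1(double* A2, double* B, int amin, int amax,
                                 int size34)
    {
        for (int am12 = amin; am12 <= amax; am12++)
            for (int i12 = 2; i12 <= am12; i12++)
                for (int k12 = 0; k12 <= am12 - i12; k12++) {
                    A2[2] += half_ooze * B[0];
                    A2 += size34;
                }
    }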

Further MPQC Integral Computation Autotuning

Autotuning parameters set up by the code developers (10 parameters in total, spanning many possible combinations)
– Swapping the order of the general contraction loops
– Redundant primitives or not
– Generated code or not
– Compiler optimization of low-level routines

Wrote GCO scripts to perform an exhaustive search (a generic sketch of such a driver follows)
30% performance improvement over the default settings
Need to try more molecules
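
A minimal sketch of what such an exhaustive search amounts to, in plain C (illustrative only; this is not GCO's actual script syntax, and the parameter names, make variables, and time.txt protocol are all assumptions):

    #include <stdio.h>
    #include <stdlib.h>

    /* Rebuild and time the kernel for every combination of two
       hypothetical tuning parameters; keep the fastest variant. */
    int main(void)
    {
        const int unroll[]  = { 1, 2, 4, 8 };  /* hypothetical parameter */
        const int swap_gc[] = { 0, 1 };        /* hypothetical parameter */
        double best = 1e30;
        int best_u = 0, best_s = 0;
        char cmd[256];

        for (int u = 0; u < 4; u++) {
            for (int s = 0; s < 2; s++) {
                snprintf(cmd, sizeof cmd,
                         "make UNROLL=%d SWAP_GC=%d && ./kernel > time.txt",
                         unroll[u], swap_gc[s]);
                if (system(cmd) != 0)
                    continue;                  /* skip variants that fail */
                FILE *f = fopen("time.txt", "r");
                double t;
                if (f && fscanf(f, "%lf", &t) == 1 && t < best) {
                    best = t;
                    best_u = unroll[u];
                    best_s = swap_gc[s];
                }
                if (f) fclose(f);
            }
        }
        printf("best: UNROLL=%d SWAP_GC=%d time=%.3f s\n",
               best_u, best_s, best);
        return 0;
    }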

GAMESS+TigerCI Integration

TigerCI Optimization and Parallelization

Integrated the TigerCI code into GAMESS and analyzed its performance

Significant single-core performance optimizations have been made
– Replacement of loops over BLAS-1 operations by single BLAS-2 operations (sketched below)

Bottlenecks in the serial code have been identified
– The Cholesky decomposition step and the transformation of the Cholesky matrix from the atomic to the molecular basis
– Observation that a loop transformation could accelerate a key part of the code by a factor of three
– Plan: perform these transformations using automatic tools (CHiLL, Orio)

Preliminary work to parallelize the code
– BLAS-2 and BLAS-3 operations replaced by multithreaded implementations
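
A minimal sketch of the BLAS-1 to BLAS-2 rewrite pattern, assuming a column-major matrix X whose n columns are the vectors being accumulated into y (the CBLAS interface and all names here are illustrative, not TigerCI's actual code):

    #include <cblas.h>

    /* BLAS-1 version: n separate daxpy calls, each streaming y through the
       memory hierarchy once. */
    void accumulate_daxpy(int m, int n, const double *X,
                          const double *alpha, double *y)
    {
        for (int j = 0; j < n; j++)
            cblas_daxpy(m, alpha[j], X + (size_t)j * m, 1, y, 1);
    }

    /* BLAS-2 version: the same computation expressed as one matrix-vector
       product, y := 1.0*X*alpha + 1.0*y, letting the BLAS library block
       for cache and keep y resident. */
    void accumulate_dgemv(int m, int n, const double *X,
                          const double *alpha, double *y)
    {
        cblas_dgemv(CblasColMajor, CblasNoTrans, m, n,
                    1.0, X, m, alpha, 1, 1.0, y, 1);
    }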

TAU Analysis of GAMESS+TigerCI

Performance data were added to a PerfDMF profile database
Data were collected from experiments on the alkanes C2H6, C3H8, C4H10, C5H12, C6H14, C8H18, and C9H20
Preliminary analysis compared all trials with respect to input complexity

Runtime Breakdown of Significant Events

The two most significant routines, __wrap__gfortran_matmul_r8 and EXT_3_4_SEG_LOOPS_VEC_LMO_RES_2, exhibit poor scaling with respect to input complexity
If these routines are amenable to parallelization, dividing the computation among multiple cores could significantly improve performance (see the sketch below)
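
A minimal sketch of that strategy with OpenMP (illustrative C; the actual routine is the Fortran matmul inside GAMESS+TigerCI, and names and bounds here are assumptions):

    /* Compile with -fopenmp.  Rows of C are independent, so the outer
       loop can be divided statically among cores. */
    void matmul_omp(int n, const double *A, const double *B, double *C)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A[i * n + k] * B[k * n + j];
                C[i * n + j] = s;
            }
        }
    }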

Runtime Scaling

Note the inflection point at C6H14: beyond this level of input complexity, the runtime increases rapidly.

Continuing Work

Currently focusing on collecting more significant profile data from GAMESS+TigerCI
– PAPI hardware counter data
– Callpath profiling
– Sampling

Collecting data in the profile database for extensive analysis across multiple trials
Comparison of parallelization strategies for TigerCI

Exploring Further GPU Optimizations of GAMESS Modules

Current GPU implementations of kernels yield 4-17x speedups over GAMESS on the CPU

Model and predict optimal GPU performance
– Hardware counter data from the PAPI GPU component
– TAU
– PerfExpert and MACPO from TACC

Additional optimizations for the Fermi architecture
– Resource usage: registers and memory, optimal use of special function units (SFUs), optimal partitioning of shared memory/L1 cache
– Increase the compute-to-memory-access ratio: unroll and jam (sketched below)
– Combinations of optimizations

Discontinuous optimization space!
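
A minimal sketch of unroll and jam, shown in plain C for brevity (the GAMESS kernels in question are CUDA, and the names and bounds here are illustrative): the outer row loop is unrolled by two and the copies are jammed into one inner loop, so each loaded element of B feeds two multiply-adds, raising the compute-to-memory-access ratio.

    /* Unroll-and-jam sketch.  Assumes even n for brevity. */
    void matmul_uaj(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i += 2) {
            for (int j = 0; j < n; j++) {
                double s0 = 0.0, s1 = 0.0;
                for (int k = 0; k < n; k++) {
                    double b = B[k * n + j];   /* loaded once, used twice */
                    s0 += A[i * n + k] * b;
                    s1 += A[(i + 1) * n + k] * b;
                }
                C[i * n + j] = s0;
                C[(i + 1) * n + j] = s1;
            }
        }
    }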