
Case Studies in Optimizing High Performance Computing Software
Jan Westerholm
High Performance Computing, Department of Information Technologies
Faculty of Technology / Åbo Akademi University

FINHPC / Åbo Akademi: Objectives
Sub-project in FINHPC, three-year duration.
Objective: improve code that individuals and research groups have written and are running on CSC machines
– faster code, in many cases with exactly the same numerical results as before
– ability to run bigger problems
Work approach: apply well-known techniques from computer science.
Faster programs may mean better-quality results, and better throughput for everybody.

FINHPC / Åbo Akademi: Limitations
We will use:
– parallelization techniques
– code optimization: cache utilization (particularly the L2 cache), microprocessor pipeline continuity, data blocking (grid scan order)
– introduction of new data structures
– replacement of very simple algorithms, e.g. sorting (quicksort instead of bubble sort)
– open source libraries
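The simplest of these wins, swapping a hand-written quadratic sort for a library sort, can be sketched as follows. This is a generic illustration, not the project's actual code; the function names and the use of `std::sort` (introsort, O(n log n)) are assumptions.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// O(n^2) bubble sort, typical of quick hand-written setup code in simulations.
void bubble_sort(std::vector<double>& v) {
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        for (std::size_t j = 0; j + 1 < v.size() - i; ++j)
            if (v[j] > v[j + 1]) std::swap(v[j], v[j + 1]);
}

// Drop-in replacement: the standard library sort runs in O(n log n)
// and produces exactly the same sorted result.
void fast_sort(std::vector<double>& v) {
    std::sort(v.begin(), v.end());
}
```

Since both routines yield identical output, this is the kind of change that speeds code up without altering numerical results.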

FINHPC / Åbo Akademi: Limitations
We will not:
– introduce better physics, chemistry, etc.
– replace the chosen basic numerical technique
– replace individual algorithms unless they are clearly modularized (e.g. matrix inversion as a library routine)

Three case studies
– Lattice Boltzmann fluid simulation: 3DQ19
– Protein covariance analysis: Covana
– Fusion reactor simulation: Elmfire

3DQ19: Lattice Boltzmann fluid mechanics
Jyväskylä University / Jussi Timonen, Keijo Mattila; ÅA / Anders Gustafsson
Physical background:
– the phase-space distribution is simulated in time
– Boltzmann's equation: a drift term and a collision term
– physical quantities = moments of the distribution

3DQ19: Program profiling
Flat profile (gprof; the numeric columns, i.e. % time, cumulative seconds, self seconds, calls and ms/call, were lost in transcription). Functions in order of time consumed:
  everything2to1()
  everything1to2()
  relaxation_BGK()
  shmem_msgs_available
  send_west()
  send_east()
  recv_message
  sock_msg_avail_on_fd
  per_bound_xslice()
  init_fluid()
  local_profile_y()
  socket_msgs_available
  calc_mass()
  net_recv
  allocation()
  main

3DQ19: Optimizations
Parallelization: well done already!
Code optimization:
– blocking: grid scan order
– anti-dependency: make blocks of code independent
– deep fluid: mark the grid points that have no solids as neighbours

3DQ19: Blocking
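The blocking idea, visiting the grid in cache-sized sub-blocks instead of plain scan order, can be sketched as below. This is an illustration only: the grid size N, block edge B, flat indexing scheme, and the trivial stand-in update are assumptions, not the actual 3DQ19 kernel.

```cpp
#include <cstddef>
#include <vector>

constexpr int N = 64;  // grid points per dimension (assumed size)
constexpr int B = 8;   // block edge, tuned so a B^3 working set fits in L2 cache

// Update a 3-D grid block by block: the three outer loops walk the blocks,
// the three inner loops stay inside one block, so each block's data remains
// cache-resident across the inner iterations.
void update_blocked(std::vector<float>& f) {
    for (int bz = 0; bz < N; bz += B)
        for (int by = 0; by < N; by += B)
            for (int bx = 0; bx < N; bx += B)
                for (int z = bz; z < bz + B; ++z)
                    for (int y = by; y < by + B; ++y)
                        for (int x = bx; x < bx + B; ++x) {
                            std::size_t i = (std::size_t(z) * N + y) * N + x;
                            f[i] *= 0.5f;  // stand-in for the real relaxation step
                        }
}
```

Every cell is still visited exactly once; only the visiting order changes, which is why the numerical results stay identical while cache misses drop.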

3DQ19: Results on three parallel systems (times in seconds)

                    Athlon 1800   IBMSC   AMD64
everything1to2():         18.80   19.48   10.06
everything2to1():         19.34   18.78   10.52
send_west():               8.40    0.68    1.96
send_east():               8.31    1.17    3.14
Total time (s):           55.15   40.28   25.76
Time gained (s):          27.48   14.13   14.76
Speed-up (%):               33%     26%     36%

2nd case study: Covana, protein covariance analysis
Institute of Medical Technology, University of Tampere / Mauno Vihinen, Bairong Chen; ÅA / André Norrgård
Biological background:
– physico-chemical groups of amino acids
– protein function from structure
– pair and triple correlations between amino acids
– web server for covariance analysis

Covana: Protein covariance analysis
Protein sequences: calculate correlations between columns of amino acids.
Typical size: sequences (rows), amino acids in a sequence (columns); the actual counts were lost in transcription. Example alignment entry:
>Q9XW32_CAEEL/9-307
IDVTKPTFLLTFYSIHGTFALVFNILGIFLIMK-NPKIVKMYKGFMINMQ-ILSLLADAQ TTLLMQPVYILPIIGGYTNGLLWQVFR----LSSHIQMAMF---LLLLY LQ VASIVCAIVTKYHVVSNIGKLSDRSI-LFWIF---VIVYHGCAFVITGFFSVS-CLARQ- -EEENLIK------T-KFPNAISVFTLEN--VAIYDLQVN---KWMMITTILFAFMLTSS IVISFY--FSVRLLKTLPSKRNTISARSFRGHQIAVTSLM-AQAT-VPFLVL---IIP-- IGTIVYLFVHVLP------NAQ-----EISNIMMAV--YSFHASLST---FVMIISTPQY

Covana: Code optimization
– Effective data structures: dynamic memory allocation
– Effective generic algorithms: sorting
– Avoid recalculations
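The "avoid recalculations" point can be illustrated for this problem: when correlating every pair of alignment columns, compute each column's amino-acid counts once up front, instead of recounting them inside the O(columns²) pair loop. This is a hypothetical sketch; Covana's real statistics and data structures are more involved.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// One count slot per letter A..Z.
using Counts = std::array<int, 26>;

// Precompute amino-acid counts for every alignment column in a single pass
// over the sequences. Any later pair or triple correlation can then reuse
// these counts instead of rescanning all sequences per column pair.
std::vector<Counts> column_counts(const std::vector<std::string>& seqs) {
    std::vector<Counts> counts(seqs[0].size(), Counts{});  // zero-initialized
    for (const std::string& s : seqs)
        for (std::size_t c = 0; c < s.size(); ++c)
            if (s[c] >= 'A' && s[c] <= 'Z')   // skip gap characters like '-'
                ++counts[c][s[c] - 'A'];
    return counts;
}
```

With R sequences and C columns, the counting work drops from O(R·C²) spread across the pair loop to a single O(R·C) pass.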

Covana: Run time

Covana: Results
– Runtime: final version 2.0 s (112 times faster; the original runtime figure was lost in transcription)
– Memory usage: original 3250 MB, final version 37 MB (88 times less)
– Disk space usage: original 277 MB, final version 21 MB (13 times less)

3rd case study: ELMFIRE, tokamak fusion reactor simulation
Jukka Heikkinen, Salomon Janhunen, Timo Kiviniemi / Advanced Energy Systems / HUT; ÅA / Artur Signell
Physical background:
– particle simulation with averaged gyrokinetic Larmor orbits
– turbulence and plasma modes

Elmfire: Tokamak fusion reactor simulation
Goal 1: computer platform independence
– replacing proprietary library routines for random number generation with open source routines
– replacing proprietary library routines for the distributed solution of sparse linear systems with open source library routines
Goal 2: scalability
– Elmfire ran on at most 8 processors
– new data structures for sparse matrices were invented, which make element updates efficient
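One way a sparse-matrix structure can make element updates cheap is to key entries by (row, column) in a hash map, since classic compressed formats such as CSR make inserting a new nonzero expensive. The sketch below is a minimal illustration of that idea under those assumptions; it is not the actual Elmfire data structure.

```cpp
#include <cstdint>
#include <unordered_map>

// Sparse matrix keyed on (row, col): both adding to an element and reading
// one are expected O(1), so scattering many small particle contributions
// into the matrix stays cheap as the processor count grows.
struct SparseMatrix {
    std::unordered_map<std::uint64_t, double> elems;

    // Pack a (row, col) pair into one 64-bit hash key.
    static std::uint64_t key(std::uint32_t r, std::uint32_t c) {
        return (std::uint64_t(r) << 32) | c;
    }
    // Accumulate a contribution into element (r, c), creating it if absent.
    void add(std::uint32_t r, std::uint32_t c, double v) { elems[key(r, c)] += v; }
    // Read element (r, c); untouched elements are implicitly zero.
    double get(std::uint32_t r, std::uint32_t c) const {
        auto it = elems.find(key(r, c));
        return it == elems.end() ? 0.0 : it->second;
    }
};
```

Once assembly is finished, such a structure can be converted to a compressed format for the solver, keeping fast updates during assembly and fast matrix-vector products during the solve.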

Elmfire

Conclusions
Software can be improved!
– modern microprocessor architecture is taken into account: cache utilization, pipeline
– use of well-established computer science methods

Conclusions
In 1 case out of 3, a clear impact on run time was made.
In 2 cases out of 3, previously intractable results can now be obtained.
Are these three cases representative of code running on CSC machines?
– the next two cases are under study!

What have we learnt?
Computer scientists with minimal prior knowledge of e.g. the physical sciences can contribute to HPC.
Are supercomputers needed to the extent they are used today at CSC?
Interprocess communication is often a bottleneck.
– parallel computing with 1000 processors may become routine in the future for certain types of problems
Who should do the coding?
– should code for production use (intensive cycles of use, maintainability) be outsourced?

Co-workers:
Mats Aspnäs, Ph.D.
Anders Gustafsson, M.Sc.
Artur Signell, M.Sc.
André Norrgård
THANK YOU!