Lixia Liu, Zhiyuan Li, Purdue University, USA. PPoPP 2010, January 2010

- Multicore architecture
  - Multiple cores per chip
  - Modest on-chip caches
  - Memory bandwidth issue
    - Increasing gap between CPU speed and off-chip memory bandwidth
    - Increasing bandwidth consumption by aggressive hardware prefetching
- Software
  - Many optimizations increase the memory bandwidth requirement
    - Parallelization, software prefetching, ILP
  - Some optimizations reduce the memory bandwidth requirement
    - Array contraction, index compression
  - Loop transformations to improve data locality
    - Loop tiling, loop fusion, and others
    - Restricted by data/control dependences

- Loop tiling is used to increase data locality
- Example program: PDE iterative solver

The base implementation (a sketch of the update sweep follows below):

    do t = 1, itmax
       update(a, n, f)
       ! Compute residual and convergence test
       error = residual(a, n)
       if (error .le. tol) then
          exit
       end if
    end do

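The slides do not show the body of update(); the following is a minimal sketch (our own, not from the paper) of the kind of sweep it might stand for, assuming a 2-D Jacobi-style 5-point stencil. The temporary array anew and the handling of the source term f are assumptions.

    subroutine update(a, n, f)
      integer :: n, i, j
      real*8  :: a(n, n), f(n, n), anew(n, n)
      ! One Jacobi sweep over the interior points: each new value is the
      ! average of the four neighbors from the previous sweep, adjusted by f.
      do j = 2, n - 1
         do i = 2, n - 1
            anew(i, j) = 0.25d0 * (a(i-1, j) + a(i+1, j) &
                                 + a(i, j-1) + a(i, j+1) - f(i, j))
         end do
      end do
      a(2:n-1, 2:n-1) = anew(2:n-1, 2:n-1)
    end subroutine update
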
- Tiling is skewed to satisfy data dependences (see the sketch below)
- After tiling, parallelism only exists within a tile due to data dependences between tiles

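To make the skewing concrete, here is an illustrative sketch (our own, not from the slides) for a 1-D in-place sweep a(i) = 0.5*(a(i-1) + a(i+1)) over m time steps. Point (t, i) reads a(i+1) produced at time step t-1, so rectangular (t, i) tiles would violate that dependence; after skewing the space index by the time step (j = i + t), all dependence distances become non-negative and the skewed space can be tiled. The tile size ts is an assumed parameter.

    do jj = 3, (n - 1) + m, ts                        ! tile bands over the skewed index j = i + t
       do t = 1, m
          do j = max(jj, 2 + t), min(jj + ts - 1, n - 1 + t)
             i = j - t
             a(i) = 0.5d0 * (a(i-1) + a(i+1))         ! in-place stencil update
          end do
       end do
    end do

Values still flow from one tile band into the next, which is why, as the slide notes, parallelism remains confined to within a tile.
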
The tiled version with speculative execution:

    do t = 1, itmax/M + 1
       ! Save the old result into a buffer as a checkpoint
       oldbuf(1:n, 1:n) = a(1:n, 1:n)
       ! Execute a chunk of M iterations after tiling
       update_tile(a, n, f, M)
       ! Compute residual and perform convergence test
       error = residual(a, n)
       if (error .le. tol) then
          call recovery(oldbuf, a, n, f)
          exit
       end if
    end do

Questions:
1. How to select the chunk size?
2. Is the recovery overhead necessary?

- Mitigate the memory bandwidth problem
- Apply data locality optimizations to challenging cases
- Relax restrictions imposed by data/control dependences

- Basic idea: allow the use of old neighboring values in the computation while still converging (see the sketch below)
- Originally proposed to reduce communication cost and synchronization overhead
- Convergence rate of asynchronous algorithms [1]
  - May slow down the convergence rate
- Our contribution is to use the asynchronous model to improve parallelism and locality simultaneously
  - Relax dependences
  - Monotone exit condition

[1] Frommer, A. and Szyld, D. B. Asynchronous two-stage iterative methods. Numer. Math. 69(2), Dec. 1994.

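As an illustration of the relaxation (our own sketch, assuming a 2-D Jacobi-style kernel parallelized with OpenMP), the barrier between sweeps is dropped: each thread keeps updating its share of rows and reads whatever neighbor values currently sit in the shared array, which may be one or more sweeps old.

    !$omp parallel private(t, i, j)
    do t = 1, chunk
       !$omp do schedule(static)
       do j = 2, n - 1
          do i = 2, n - 1
             ! In-place update: neighbor values may come from the current sweep,
             ! the previous sweep, or an even older one.
             a(i, j) = 0.25d0 * (a(i-1, j) + a(i+1, j) &
                               + a(i, j-1) + a(i, j+1) - f(i, j))
          end do
       end do
       !$omp end do nowait        ! no barrier: threads proceed asynchronously
    end do
    !$omp end parallel
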
The tiled version without recovery:

    do t = 1, itmax/M + 1
       ! Execute a chunk of M iterations after tiling
       update_tile(a, n, f, M)
       ! Compute residual and convergence test
       error = residual(a, n)
       if (error .le. tol) then
          exit
       end if
    end do

- Achieve parallelism across the grid, not just within a tile
- Apply loop tiling to improve data locality
  - Requires partitioning the time steps into chunks
- Eliminate the recovery overhead

- Chunk size: the number of iterations executed speculatively in the tiled code
- Ideally, we would predict the exact number of iterations needed to converge
  - However, it is unknown until convergence happens
- Too large a chunk: we pay overshooting overhead
- Too small a chunk: poor data reuse and poor data locality

- Poor solutions
  - Use a constant chunk size (randomly picked)
  - Estimate it from the theoretical convergence rate
- A better solution: adaptive chunk size
  - Use the latest convergence progress to predict how many more iterations are required to converge (see the sketch below)
  - r_i: residual error of the i-th round of tiled code

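The prediction formula itself did not survive in this transcript; the following is one plausible instantiation (our own sketch) assuming the residual decays roughly geometrically: estimate the per-iteration reduction factor from the last two rounds, then solve for the number of iterations still needed to reach the tolerance. The variable names (r_prev, r_cur, m_cur, min_chunk) are ours.

    ! r_prev, r_cur : residual after the previous round and after the current round
    ! m_cur         : chunk size just executed
    ! min_chunk     : lower bound on the chunk size (1 or 8 in the experiments below)
    rate = (r_cur / r_prev) ** (1.0d0 / m_cur)             ! per-iteration reduction factor
    if (rate .lt. 1.0d0) then
       next_chunk = ceiling(log(tol / r_cur) / log(rate))  ! iterations predicted to reach tol
    else
       next_chunk = m_cur                                  ! no measurable progress; keep the old size
    end if
    next_chunk = max(next_chunk, min_chunk)
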
- Platforms for experiments
  - Intel Q6600, AMD 8350 Barcelona, Intel E5530 Nehalem
- Evaluated numerical methods: Jacobi, Gauss-Seidel (GS), SOR
- Performance results
  - Synchronous model vs. asynchronous model with the best chunk size
  - Original code vs. loop tiling
  - Impact of the chunk size
  - Adaptive chunk selection vs. the ideal chunk size

Peak bandwidth of our platforms:

    Machine | Model               | L1            | L2            | L3           | BW (GB/s) | SBW (GB/s)
    A       | AMD 8350, 4x4 cores | 64KB private  | 512KB private | 4x2MB shared |           |
    B       | Q6600, 1x4 cores    | 32KB private  | 2x4MB shared  | N/A          |           |
    C       | E5530, 2x4 cores    | 256KB private | 1MB private   | 2x8MB shared |           |

Machine A (16 cores), Jacobi: parallel vs. tiled vs. tiled-norec vs. async-base vs. async-tiled.
- Performance varies with the chunk size!
- The async-tiled version is the best!

Machine B (4 cores), Jacobi: parallel vs. tiled vs. tiled-norec vs. async-base vs. async-tiled.
- Poor performance without tiling (async-base and parallel)!

Machine C (8 cores), Jacobi: parallel vs. tiled vs. tiled-norec vs. async-base vs. async-tiled.

GS and SOR on machines A, B, and C: parallel vs. tiled vs. tiled-norec vs. async-base vs. async-tiled.
- The asynchronous tiled version performs better than the synchronous tiled version (even without the recovery cost)
- The asynchronous baseline suffers more on machine B due to less available memory bandwidth

- adaptive-1: lower bound of the chunk size is 1
- adaptive-8: lower bound of the chunk size is 8

- Showed how to benefit from the asynchronous model by relaxing data and control dependences
  - Improves parallelism and data locality (via loop tiling) at the same time
- An adaptive method to determine the chunk size
  - Needed because the iteration count is usually unknown in practice
- Good performance gains on three well-known numerical kernels across three different multicore systems

Thank you!