This module was created with support from the NSF CDER Early Adopter Program. Module developed Fall 2014 by Apan Qasem. Parallel Performance: Analysis and Evaluation. Lecture TBD, Course TBD, Term TBD.

Review: performance evaluation of parallel programs

Speedup
Sequential speedup:   S_seq = Exec_orig / Exec_new
Parallel speedup:     S_par = Exec_seq / Exec_par = Exec_1 / Exec_N
Linear speedup:       S_par = N
Superlinear speedup:  S_par > N
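
A tiny worked example (not part of the original module) of the parallel-speedup definition in C; the core count and timings in main are invented for illustration.

  #include <stdio.h>

  /* Parallel speedup as defined above: S_par = Exec_1 / Exec_N */
  static double parallel_speedup(double exec_1, double exec_n) {
      return exec_1 / exec_n;
  }

  int main(void) {
      int    n      = 8;      /* number of cores (hypothetical) */
      double exec_1 = 120.0;  /* runtime on 1 core, in seconds (made up) */
      double exec_n = 18.0;   /* runtime on n cores, in seconds (made up) */

      double s = parallel_speedup(exec_1, exec_n);
      printf("S_par = %.2f on %d cores (%s)\n",
             s, n, s > n ? "superlinear" : "sublinear");
      return 0;
  }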

Amdahl's Law for Parallel Programs
Speedup is bounded by the amount of parallelism available in the program. If the fraction of code that runs in parallel is p, then the maximum speedup that can be obtained with N processors is:

  ExTime_par = (ExTime_seq * p / N) + (ExTime_seq * (1 - p))
             = ExTime_seq * ((1 - p) + p/N)
  Speedup    = ExTime_seq / ExTime_par
             = 1 / ((1 - p) + p/N)
             = N / (N(1 - p) + p)
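
The bound above is easy to tabulate; a minimal sketch (not from the slides), assuming for illustration that 90% of the program runs in parallel.

  #include <stdio.h>

  /* Amdahl bound from the slide: Speedup = 1 / ((1 - p) + p/N) */
  static double amdahl_speedup(double p, int n) {
      return 1.0 / ((1.0 - p) + p / n);
  }

  int main(void) {
      double p = 0.90;   /* assumed parallel fraction (illustrative) */
      for (int n = 1; n <= 1024; n *= 4)
          printf("N = %4d  ->  max speedup = %6.2f\n", n, amdahl_speedup(p, n));
      /* As N grows, the bound approaches 1 / (1 - p) = 10. */
      return 0;
  }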

Scalability
[Figure: maximum theoretical speedup as a function of the number of processors]

A program scales if it continues to provide speedups as we add more processing cores. Does Amdahl's Law hold for large values of N for a particular program? The ability of a parallel program's performance to scale is the result of a number of interrelated factors; the algorithm itself may have inherent limits to scalability.

Strong and Weak Scaling
Strong scaling: adding more cores allows us to solve the same problem faster (e.g., fold the same protein faster).
Weak scaling: adding more cores allows us to solve a larger problem (e.g., fold a bigger protein).

The Road to High Performance
[Chart, "Celebrating 20 years": HPC performance growth from gigaflops through teraflops to petaflops]

The Road to High Performance
[Same 20-year performance chart, annotated where multicores arrive]

Lost Performance
[The 20-year performance chart, annotated to show lost performance]

Need More Than Performance
[The 20-year chart with power consumption added; GPUs arrive; no power data prior to 2003]

Communication Costs
Algorithms have two costs:
1. Arithmetic (FLOPS)
2. Communication: moving data between levels of a memory hierarchy (sequential case) or between processors over a network (parallel case)
[Figure: one CPU with cache and DRAM (sequential case) vs. several CPUs, each with its own DRAM, connected by a network (parallel case)]
Slide source: Jim Demmel, UC Berkeley

Avoiding Communication
Running time of an algorithm is the sum of 3 terms:
  # flops * time_per_flop
  # words moved / bandwidth    (communication)
  # messages * latency         (communication)
Goal: organize code to avoid communication
  Between all memory hierarchy levels: L1, L2, DRAM, network
  Not just hiding communication (overlapping it with arithmetic gives at most a ~2x speedup); avoiding it makes arbitrary speedups possible
Annual improvements:
  Time_per_flop: 59%
  Network: bandwidth 26%, latency 15%
  DRAM:    bandwidth 23%, latency 5%
Slide source: Jim Demmel, UC Berkeley
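
To make the three-term model concrete, here is a small sketch (not part of the original slide) that plugs invented machine parameters into the formula; the point is that the communication terms, not the flops, can dominate the total.

  #include <stdio.h>

  /* Running time = #flops * time_per_flop
   *              + #words moved / bandwidth
   *              + #messages * latency        */
  static double model_time(double flops, double time_per_flop,
                           double words, double bandwidth,
                           double messages, double latency) {
      return flops * time_per_flop + words / bandwidth + messages * latency;
  }

  int main(void) {
      /* All machine parameters below are invented for illustration. */
      double flops_only = 1e9 * 1e-9;       /* 10^9 flops at 1 ns each     = 1.0 s */
      double total = model_time(1e9, 1e-9,
                                1e9, 1e9,   /* 10^9 words at 10^9 words/s  = 1.0 s */
                                1e5, 1e-5); /* 10^5 messages at 10 us each = 1.0 s */
      printf("flops alone: %.1f s, with communication: %.1f s\n", flops_only, total);
      return 0;
  }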

Power Consumption in HPC Applications
[Figure: power measurements from NCOMMAS weather modeling applications on AMD Barcelona]

Techniques for Improving Parallel Performance
Data locality
Thread affinity
Energy

Memory Hierarchy: Single Processor
[Figure: on-chip components (register file, instruction and data caches, ITLB, DTLB, control, datapath), backed by a second-level cache (L2), main memory (DRAM), and secondary memory (disk)]
                   RegFile   L1 caches   L2     DRAM    Disk
  Speed (cycles):  ½         1's         10's   100's   10,000's
  Size (bytes):    100's     10K's       M's    G's     T's
  Cost per byte:   highest <---------------------------> lowest
Nothing gained without locality

Types of Locality
Temporal locality (locality in time): if a memory location is referenced, it is likely to be referenced again soon. => Keep the most recently accessed data items closer to the processor.
Spatial locality (locality in space): if a memory location is referenced, locations with nearby addresses are likely to be referenced soon. => Move blocks consisting of contiguous words closer to the processor.
demo
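
A standard illustration (not from the module) of both kinds of locality in C: summing a matrix in row order reuses the accumulator (temporal locality) and walks contiguous addresses (spatial locality), while the column-order version strides across memory and misses far more often.

  #include <stdio.h>

  #define N 1024
  static double a[N][N];

  /* Good spatial locality: C stores rows contiguously, so the inner loop
   * touches consecutive addresses; the accumulator s has temporal locality. */
  static double sum_row_order(void) {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];
      return s;
  }

  /* Poor spatial locality: consecutive accesses are N doubles apart. */
  static double sum_column_order(void) {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];
      return s;
  }

  int main(void) {
      printf("%f %f\n", sum_row_order(), sum_column_order());
      return 0;
  }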

Shared-caches on Multicores
[Figures: cache organization of Blue Gene/L, Tilera64, and Intel Core 2 Duo]

Data Parallelism
[Figure: data set D divided into chunks of size D/p across p threads]

Data Parallelism
[Figure: threads spawn, each operates on a D/p chunk of the data set D, then synchronize]
Typically, the same task is performed on different parts of the data.

Shared-cache and Data Parallelization
[Figure: data set D split among k threads; each chunk should satisfy D/k ≤ cache capacity to preserve intra-core locality]

Tiled Data Access
[Figure: 3D iteration space over i, j, k; one shading per individual thread]
  "unit" sweep:  parallelization over i, j, k; no blocking
  "plane" sweep: parallelization over k; no blocking
  "beam" sweep:  blocking of i and j; parallelization over ii and jj

Data Locality and Thread Granularity
[Figure: i-j-k iteration space; reuse over time through multiple sweeps over the working set]
Reducing thread granularity gives each thread a smaller working set and improves intra-core locality.

Exploiting Locality With Tiling

Before:
  // parallel region
  thread_construct()
  ...
  // repeated access
  for j = 1, M
    ... a[i][j] b[i][j] ...

After:
  for j = 1, M, T
    // parallel region
    thread_construct()
    ...
    // repeated access
    for jj = j, j + T
      ... a[i][jj] b[i][jj] ...

Exploiting Locality With Tiling

Before:
  // parallel region
  for i = 1, N
    ...
    // repeated access
    for j = 1, M
      ... a[i][j] b[i][j] ...

After:
  for j = 1, M, T
    // parallel region
    for i = 1, N
      ...
      // repeated access
      for jj = j, j + T
        ... a[i][jj] b[i][jj] ...

demo
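
One way the tiled pattern above might look in C with OpenMP; the array names, sizes, tile size T, and the loop body are placeholders standing in for the slide's repeated accesses to a and b. Compile with something like cc -fopenmp.

  #include <stdio.h>

  #define N 512
  #define M 4096
  #define T 64      /* tile size: chosen so one band of columns fits in cache */

  static double a[N][M], b[N][M];

  static void compute_tiled(void) {
      for (int j = 0; j < M; j += T) {       /* tiling loop over columns   */
          #pragma omp parallel for           /* parallel region per tile   */
          for (int i = 0; i < N; i++) {
              /* every thread works within the same T-column band, so the
               * shared cache only needs to hold that band at a time       */
              for (int jj = j; jj < j + T && jj < M; jj++)
                  a[i][jj] += b[i][jj];
          }
      }
  }

  int main(void) {
      compute_tiled();
      printf("%f\n", a[0][0]);
      return 0;
  }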

Locality with Distribution

Before:
  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) b(i,j) ...

After (distributed):
  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) ...

  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... b(i,j) ...

Distribution reduces thread granularity and improves intra-core locality.

Locality with Fusion

Before (separate parallel regions):
  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) ...

  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... b(i,j) ...

After (fused):
  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) b(i,j) ...
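
A sketch of the same fusion idea in C with OpenMP (not the module's code; the loop bodies are invented): two parallel sweeps over a and b become one, so each a element is still in cache when b needs it, and one thread spawn/join is eliminated.

  #include <stdio.h>

  #define N 1024
  #define M 1024
  static double a[N][M], b[N][M];

  /* Before fusion: two parallel regions, each sweeping the full arrays. */
  static void unfused(void) {
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              a[i][j] = 2.0 * a[i][j];

      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              b[i][j] = b[i][j] + a[i][j];   /* a has long since left the cache */
  }

  /* After fusion: one parallel region; each a[i][j] is reused immediately. */
  static void fused(void) {
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++) {
              a[i][j] = 2.0 * a[i][j];
              b[i][j] = b[i][j] + a[i][j];
          }
  }

  int main(void) {
      unfused();
      fused();
      printf("%f\n", b[0][0]);
      return 0;
  }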

Combined Tiling and Fusion

Before (separate parallel regions):
  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) ...

  // parallel region
  thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... b(i,j) ...

After (tiled and fused):
  for i = 1, M, T
    // parallel region
    thread_construct()
    for ii = i, i + T - 1
      ... = a(ii,j)
      ... = b(ii,j)

Pipelined Parallelism
Pipelined parallelism can be used to parallelize applications that exhibit producer-consumer behavior. It has gained importance because of the low synchronization cost between cores on CMPs, and it is being used to parallelize programs that were previously considered sequential. It arises in many different contexts:
  Optimization problems
  Image processing
  Compression
  PDE solvers
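
As a concrete (hypothetical) illustration of the producer-consumer pattern behind pipelined parallelism, a minimal bounded-buffer sketch in C with Pthreads; the buffer size plays the role of the synchronization window discussed on the next slides. Compile with cc -pthread.

  #include <pthread.h>
  #include <stdio.h>

  #define WINDOW 8          /* synchronization window: max items in flight */
  #define ITEMS  64

  static int buf[WINDOW];
  static int head = 0, tail = 0, count = 0;
  static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
  static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

  static void *producer(void *arg) {
      (void)arg;
      for (int i = 0; i < ITEMS; i++) {
          pthread_mutex_lock(&lock);
          while (count == WINDOW)              /* window full: wait for consumer */
              pthread_cond_wait(&not_full, &lock);
          buf[tail] = i;                       /* "produce" item i */
          tail = (tail + 1) % WINDOW;
          count++;
          pthread_cond_signal(&not_empty);
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  static void *consumer(void *arg) {
      (void)arg;
      long sum = 0;
      for (int i = 0; i < ITEMS; i++) {
          pthread_mutex_lock(&lock);
          while (count == 0)                   /* window empty: wait for producer */
              pthread_cond_wait(&not_empty, &lock);
          sum += buf[head];                    /* "consume" the next item */
          head = (head + 1) % WINDOW;
          count--;
          pthread_cond_signal(&not_full);
          pthread_mutex_unlock(&lock);
      }
      printf("consumed sum = %ld\n", sum);
      return NULL;
  }

  int main(void) {
      pthread_t p, c;
      pthread_create(&p, NULL, producer, NULL);
      pthread_create(&c, NULL, consumer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      return 0;
  }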

Pipelined Parallelism
[Figure: producer (P) and consumer (C) threads communicating through a shared data set, separated by a synchronization window]
Applies to any streaming application, e.g., Netflix.

Ideal Synchronization Window
[Figure: producer (P) and consumer (C) operating on nearby portions of the shared data set, preserving inter-core data locality]

Synchronization Window Bounds
[Figure: three choices of synchronization window, labeled "Bad", "Not as bad", and "Better?"]

Thread Affinity
Binding a thread to a particular core.
Soft affinity: affinity suggested by the programmer/software; may or may not be honored by the OS.
Hard affinity: affinity requested through system software/the runtime system; honored by the OS.

Thread Affinity and Performance
Temporal locality: a thread running on the same core throughout its lifetime will be able to exploit the cache.
Resource usage: shared caches, TLBs, prefetch units, ...

Thread Affinity and Resource Usage
Key idea: if threads i and j have favorable resource usage, bind them to the same "cohort"; if they have unfavorable resource usage, bind them to different cohorts. A cohort is a group of cores that share resources.
demo

Load Balancing
[Figure: per-thread workload; "This one dominates!"]

Thread Affinity Tools
GNU + OpenMP: environment variable GOMP_CPU_AFFINITY
Pthreads: pthread_setaffinity_np()
Linux API: sched_setaffinity()
Command-line tools: taskset
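
A minimal sketch of the Linux API listed above: pin the calling thread to one core with sched_setaffinity (the choice of core 0 is arbitrary).

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void) {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(0, &mask);                      /* allow only core 0 */

      /* pid 0 means "the calling thread" */
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }
      printf("pinned; now running on CPU %d\n", sched_getcpu());
      return 0;
  }

The same effect is available without code changes, e.g., taskset -c 0-3 ./a.out restricts a process to cores 0-3, and GOMP_CPU_AFFINITY="0 2 4 6" pins OpenMP threads under GCC.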

Power Consumption
Improved power consumption does not always coincide with improved performance; in fact, for many applications it is the opposite.
P = C * V^2 * f
Need to account for power explicitly.

Optimizations for Power
Techniques are similar but the objectives are different:
  Fuse code to get a better mix of instructions
  Distribute code to separate out FP-intensive tasks
Affinity can be used to reduce overall system power consumption:
  Bind hot-cold task pairs to the same cohort
  Distribute hot-hot tasks across multiple cohorts
Techniques with hardware support:
  DVFS: slow down a subset of cores