Increasing and Detecting Memory Address Congruence Sam Larsen Emmett Witchel Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute of Technology

The Congruence Property

int a[M], b[n];
for (i=0; i<n; i++) {
  a[b[i]*8] = 0;
}

[Figure: array a laid out at byte addresses 0, 4, 8, ...; the stores at i=0, i=1, i=2 each land at the start of a 32-byte line]

Congruent with offset of 0

The Congruence Property

int a[M];
for (i=0; i<n; i++) {
  a[16*i+2] = 0;
}

Congruent with offset of 8

The Congruence Property

int a[M];
for (i=0; i<n; i++) {
  a[15*i+3] = 0;
}

Not congruent (32-byte line): the 60-byte stride is not a multiple of the line size, so the offset within the line changes from iteration to iteration

Outline

Uses of congruence information
Congruence detection algorithm
Congruence-increasing transformations
Results
Related work

SIMD Compilation [PLDI ’00]

Multimedia extensions offer wide mem ops
 – Motorola’s AltiVec
 – Intel’s MMX/SSE
Automatic SIMD parallelization
 – Multiple mem ops → single wide mem op
128-bit lds/strs must be 128-bit aligned
 – SSE: 6-9 cycle penalty for unaligned accesses
 – AltiVec: All wide mem ops have to be aligned
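
A hand-written illustration of why the alignment guarantee matters: when a reference is known to be congruent to a 16-byte boundary, the aligned SSE store can be used instead of the penalized unaligned form. This is a minimal sketch using standard SSE intrinsics, assuming the pointer a is 16-byte aligned; the function is illustrative, not the compiler's generated code.

    #include <xmmintrin.h>

    /* Zero n floats starting at a.  The aligned store _mm_store_ps is only
     * legal when &a[i] is 16-byte aligned (assumed here); otherwise the
     * unaligned _mm_storeu_ps must be used, paying the penalty above. */
    void zero_floats(float *a, int n)
    {
        __m128 z = _mm_setzero_ps();
        int i;
        for (i = 0; i + 4 <= n; i += 4)
            _mm_store_ps(&a[i], z);   /* 128-bit aligned store */
        for (; i < n; i++)
            a[i] = 0.0f;              /* scalar cleanup */
    }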

Energy Savings [Micro ’01]

Skip tag checks in a set-associative cache
Add special loads/stores to ISA
 – First mem op memoizes the cache way
 – Second mem op uses this to skip the check
Compiler analysis determines when data occupy the same line
 – Need congruence information

Banked Memory Architectures

Offset specifies the memory bank
 – Place data close to computation
 – Access banks in parallel

[Figure: four clusters, each pairing a register file with a memory bank; the banks hold offsets 0, 4, 8, and 12]
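
For a word-interleaved layout like the one pictured (an assumption: the figure only shows banks holding offsets 0, 4, 8, and 12), the target bank follows directly from the low address bits, which is what makes congruence information usable for bank disambiguation:

    /* Map a byte address to one of four word-interleaved banks
     * (assumed layout: consecutive 4-byte words rotate across banks 0-3). */
    static inline int bank_of(unsigned long addr)
    {
        return (int)((addr >> 2) & 0x3);   /* address bits [3:2] pick the bank */
    }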

Congruence Recognition

Iterative dataflow analysis
 – Low-level IR
Lattice elements of the form an+b
 – For pointers, memory locations accessed
If a = cache line size then b = offset
 – 32n+8 → accesses offset 8 in a 32-byte line

Dataflow Lattice

8-byte cache line

[Figure: the lattice of congruence classes: n+0; 2n+0, 2n+1; 4n+0, 4n+1, 4n+2, 4n+3; 8n+0 through 8n+7; and a bottom element ⊥. The elements ⊥, 8n+0, 4n+2, 2n+0, and n+0 are highlighted.]

Transfer Functions

Meet:      a = gcd(a1, a2, |b1 - b2|)        b = b1 % a
Add:       a = gcd(a1, a2)                   b = (b1 + b2) % a
Subtract:  a = gcd(a1, a2)                   b = (b1 - b2) % a
Multiply:  a = gcd(a1*a2, a1*b2, a2*b1, C)   b = (b1*b2) % a
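
A minimal sketch of how these transfer functions might be coded, assuming lattice values are stored as a pair (a, b), constants c are modeled as C·n + c as in the example at the end of the deck (so a is always at least 1), and C is the cache line size, taken here to be 32. The struct and names are illustrative, not the paper's implementation.

    #include <stdlib.h>    /* labs */

    /* Lattice value a*n + b: every value the expression can take is
     * congruent to b modulo a. */
    enum { C = 32 };                        /* assumed cache line size */
    typedef struct { long a, b; } Cong;

    static long gcd(long x, long y) {
        while (y != 0) { long t = x % y; x = y; y = t; }
        return x;
    }

    /* Meet: combine two facts reaching the same program point. */
    static Cong meet(Cong x, Cong y) {
        long a = gcd(gcd(x.a, y.a), labs(x.b - y.b));
        return (Cong){ a, x.b % a };
    }

    static Cong add(Cong x, Cong y) {
        long a = gcd(x.a, y.a);
        return (Cong){ a, (x.b + y.b) % a };
    }

    static Cong sub(Cong x, Cong y) {
        long a = gcd(x.a, y.a);
        return (Cong){ a, ((x.b - y.b) % a + a) % a };
    }

    static Cong mul(Cong x, Cong y) {
        long a = gcd(gcd(x.a * y.a, x.a * y.b), gcd(y.a * x.b, (long)C));
        return (Cong){ a, (x.b * y.b) % a };
    }

For instance, mul applied to 8n+7 and 32n+4 gives a = gcd(256, 32, 224, 32) = 32 and b = 28, i.e. 32n+28, matching the example slides at the end of the deck.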

The Bad News

Most mem ops are not congruent
 – 32-byte cache line

Congruence Conventions (Padding)

Allocate arrays/structs on a line boundary
 – Congruent accesses to arrays for a given index
 – Congruent accesses to struct fields
Requires that we:
 – Allocate stack frames on cache line boundary
 – Modify malloc to return aligned data
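
One way these conventions could be realized in source or in the compiler's output, sketched with GCC/Clang's alignment attribute and POSIX's posix_memalign; the 32-byte line size and the names are assumptions.

    #include <stdlib.h>

    /* Statically allocated data: ask for line alignment
     * (GCC/Clang attribute syntax; 32-byte line size assumed). */
    __attribute__((aligned(32))) int a[100];

    /* Heap data: return line-aligned storage in place of plain malloc. */
    int *alloc_ints(size_t n)
    {
        void *p = NULL;
        if (posix_memalign(&p, 32, n * sizeof(int)) != 0)
            return NULL;
        return p;
    }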

Unrolling

Unrolling creates congruent references

int a[100];
for (i=0; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}

[Figure: a[i+0] touches a[0], a[8], a[16], ...; a[i+1] touches a[1], a[9], a[17], ...]

Congruence with Parameters

void init(int* a) {
  for (i=0; i<n; i+=8) {
    a[i+0] = 0;
    a[i+1] = 0;
    ...
    a[i+7] = 0;
  }
}

void main() {
  int a[100];
  init(&a[2]);
  init(&a[3]);
}

Pre-loop

Add a pre-loop to enforce congruence

for (i=0; i<n; i++) {
  if ((int)&a[i] % 32 == 0)
    break;
  a[i] = 0;
}
for (; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}

Pre-loop

Add a pre-loop to enforce congruence
Mem ops congruent in the unrolled body
Pre-loop has few iterations
 – Most dynamic mem ops are congruent
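
Putting the pieces together, a sketch of the transformed loop: the pre-loop peels iterations until the access reaches a line boundary, the unrolled body then issues eight stores with known offsets, and a scalar remainder loop (added here for completeness; not shown on the slide) handles a trip count that is not a multiple of 8.

    #include <stdint.h>

    /* Sketch of the transformed loop for a 32-byte line. */
    void zero_ints(int *a, int n)
    {
        int i = 0;

        /* Pre-loop: peel iterations until &a[i] is line-aligned. */
        for (; i < n; i++) {
            if ((uintptr_t)&a[i] % 32 == 0)
                break;
            a[i] = 0;
        }

        /* Unrolled body: all eight stores have known line offsets. */
        for (; i + 8 <= n; i += 8) {
            a[i+0] = 0; a[i+1] = 0; a[i+2] = 0; a[i+3] = 0;
            a[i+4] = 0; a[i+5] = 0; a[i+6] = 0; a[i+7] = 0;
        }

        /* Remainder (not shown on the slide): finish leftover iterations. */
        for (; i < n; i++)
            a[i] = 0;
    }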

Finding the Break Condition

Can we choose arbitrarily?

void init(int *x) {
  int i;
  for (i=0; i<100; i+=2) {
    if ((int)&x[i] % 32 == 0)
      break;
    x[i] = 0;
  }
  ...
}

int main() {
  int x[200];
  init(&x[1]);
}

[Table: i and &x[i] % 32; the condition is never satisfied]

NO!

Finding the Break Condition

void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0 && (int)&y[i] % 32 == 0)
      break;
    x[i] = y[i];
  }
  ...
}

int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}

[Tables: i, &x[i] % 32, and &y[i] % 32 for the first call copy(&x[0], &y[0]) and the second call copy(&x[0], &y[1])]

Finding the Break Condition

void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0 && (int)&y[i] % 32 == 4)
      break;
    x[i] = y[i];
  }
  ...
}

int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}

[Tables: i, &x[i] % 32, and &y[i] % 32 for the first call and the second call]

Finding the Break Condition

void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0)
      break;
    x[i] = y[i];
  }
  ...
}

int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}

[Tables: i, &x[i] % 32, and &y[i] % 32 for the first call and the second call]

Finding the Break Condition

Use profiling to observe runtime addresses
Find best break condition for the profile
Exhaustive search:
 – Consider all possible break conditions
 – Compute iterations in unrolled loop
 – Multiply by # of mem ops with known offset
 – Break condition with highest value is the best
Results vary little with profile data set
 – Insignificant on all but one benchmark
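
A rough sketch of the scoring step of the exhaustive search, assuming the profiler has already recorded, for each candidate break condition, how many dynamic iterations would run in the unrolled body and how many memory references then have a provable offset; the data layout is illustrative.

    /* One candidate break condition, with statistics gathered from the
     * address profile (names and layout are illustrative). */
    struct candidate {
        long iters_in_body;      /* dynamic iterations spent in the unrolled loop */
        int  known_offset_refs;  /* mem ops in the body with a provable line offset */
    };

    /* Exhaustive search: score = iterations in the unrolled body times the
     * number of mem ops whose offset becomes known; keep the highest score. */
    int best_candidate(const struct candidate *c, int ncand)
    {
        int best = 0;
        long best_score = -1;
        for (int i = 0; i < ncand; i++) {
            long score = c[i].iters_in_body * c[i].known_offset_refs;
            if (score > best_score) { best_score = score; best = i; }
        }
        return best;
    }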

Congruence Results (SPECfp95)

[Chart: per-benchmark bars for Original, Congruent, and Detected]

Congruence Results (MediaBench)

[Chart: per-benchmark bars for Original, Congruent, and Detected]

Execution Time Overhead

           unrolling   + pre-loop
applu       -6.27%      -5.28%
apsi         0.93%       1.13%
fpppp        0.00%
hydro2d      0.99%       0.39%
mgrid        0.72%
su2cor      -0.32%       0.11%
swim        -0.96%      -0.17%
tomcatv     -0.18%       0.65%
turb3d      -0.80%       1.72%
wave5        3.75%       4.58%

DCache Energy Savings [Micro ’01]

Related Work

Fisher and Ellis – Bulldog Compiler
 – Memory bank disambiguation
 – Loop unrolling
Barua et al. – Raw Compiler
 – Modulo unrolling
Davidson et al. – Mem Access Coalescing
 – Loop unrolling
 – Alignment checks at runtime

Conclusions

Increased number of congruent refs by 5x
Analysis detected 95%
Results are good
 – MediaBench – 65% congruent, 60% detected
 – SpecFP95 – 84% congruent, 82% detected
Many uses of congruence information
 – Wide accesses in multimedia extensions
 – Energy savings by tag check elimination
 – Bank disambiguation in clustered architectures

Increasing and Detecting Memory Address Congruence Sam Larsen Emmett Witchel Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute of Technology

Example

int a[100];
for (i=0; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  ...
  a[i+7] = 0;
}

Low-level IR for the a[i+7] store:
  i = 0
  r0 = i+7
  r1 = r0*4
  r2 = r1+a
  *r2 = 0
  i = i+8
  i < n

Example

First pass over the loop body (values propagated through the IR above):
  i:  32n+0
  r0: 32n+0 + 32n+7 = 32n+7
  r1: 32n+7 * 32n+4 = 32n+28
  r2: 32n+28 + 32n+0 = 32n+28
  i:  32n+0 + 32n+8 = 32n+8

Example

Second pass, after the meet at the loop header:
  i:  32n+0 ∧ 32n+8 = 8n+0
  r0: 8n+0 + 32n+7 = 8n+7
  r1: 8n+7 * 32n+4 = 32n+28
  r2: 32n+28 + 32n+0 = 32n+28
  i:  8n+0 + 32n+8 = 8n+0

  *r2: offset is 28

Multimedia Compilation

PowerMAC G4 with AltiVec
Commercial vectorizing compiler
 – Alignment pragmas

[Table: for float, int, short, and char data, the vector length, speedup with unaligned accesses, speedup with aligned accesses, and the improvement]