Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen and Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp
Overview
- Problem statement
- New paradigm for parallelism: SLP
- SLP extraction algorithm
- Results
- SLP vs. ILP and vector parallelism
- Conclusions
- Future work
Multimedia Extensions
- Additions to all major ISAs
- SIMD operations
Using Multimedia Extensions
- Library calls and inline assembly
  - Difficult to program
  - Not portable
- Different extensions to the same ISA
  - MMX and SSE
  - SSE vs. 3DNow!
- Need automatic compilation
Vector Compilation
- Pros:
  - Successful for vector computers
  - Large body of research
- Cons:
  - Involved transformations
  - Targets loop nests
Superword Level Parallelism (SLP)
- Small amount of parallelism
  - Typically 2 to 8-way
- Exists within basic blocks
- Uncovered with a simple analysis
- Independent isomorphic operations
  - New paradigm
1. Independent ALU Ops

R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

[R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]
2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

[R G B] = [R G B] + X[i:i+2]
3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]
3. Vectorizable Loops

Unrolled by 4:
for (i=0; i<100; i+=4) {
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]
}

Packed:
for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
4. Partially Vectorizable Loops

for (i=0; i<16; i+=1) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
}
4. Partially Vectorizable Loops

Unrolled by 2:
for (i=0; i<16; i+=2) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)
}

Packed:
for (i=0; i<16; i+=2) {
  [L0 L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)
}
Exploiting SLP with SIMD Execution
- Benefit:
  - Multiple ALU ops become one SIMD op
  - Multiple ld/st ops become one wide memory op
- Cost:
  - Packing and unpacking
  - Reshuffling within a register
Packing/Unpacking Costs
- Packing source operands: A = f() and B = g() must be packed into [A B]
- Unpacking destination operands: E = C / 5 and F = D * 7 require unpacking [C D]

A = f()
B = g()
C = A + 2        [C D] = [A B] + [2 3]
D = B + 3
E = C / 5
F = D * 7
Optimizing Program Performance
- To achieve the best speedup:
  - Maximize parallelization
  - Minimize packing/unpacking
- Many packing possibilities
  - Worst case: n ops yield n! configurations
  - Different cost/benefit for each choice
Observation 1: Packing Costs Can Be Amortized
- Use packed result operands:

  A = B + C
  D = E + F
  G = A - H
  I = D - J

- Share packed source operands:

  A = B + C
  D = E + F
  G = B + H
  I = E + J
Observation 2: Adjacent Memory is Key
- Large potential performance gains
  - Eliminate ld/st instructions
  - Reduce memory bandwidth
- Few packing possibilities
  - Only one ordering exploits pre-packing
SLP Extraction Algorithm
Identify adjacent memory references:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
SLP Extraction Algorithm
Follow def-use chains:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[H J] = [C D] - [A B]
SLP Extraction Algorithm
Follow use-def chains:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[C D] = [E F] * [3 5]
[H J] = [C D] - [A B]
SLP Compiler Results
- SLP compiler implemented in SUIF
- Tested on two benchmark suites
  - SPEC95fp
  - Multimedia kernels
- Performance measured three ways:
  - SLP availability
  - Compared to vector parallelism
  - Speedup on AltiVec
SLP Availability
SLP vs. Vector Parallelism
Speedup on AltiVec [chart; maximum speedup 6.7x]
SLP vs. Vector Parallelism
- Extracted with a simple analysis
  - SLP is fine grain: basic blocks
- Superset of vector parallelism
  - Unrolling transforms VP to SLP
  - Handles partially vectorizable loops
SLP vs. Vector Parallelism [diagram: loop iterations unrolled into a single basic block]
SLP vs. ILP
- Subset of instruction level parallelism
- SIMD hardware is simpler
  - Lacks heavily ported register files
- SIMD instructions are more compact
  - Reduces instruction fetch bandwidth
SLP and ILP
- SLP & ILP can be exploited together
  - Many architectures can already do this
- SLP & ILP may compete
  - Occurs when parallelism is scarce
  - Unroll the loop more times when ILP is due to loop level parallelism
Conclusions
- Multimedia architectures abundant
  - Need automatic compilation
- SLP is the right paradigm
  - 20% non-vectorizable in SPEC95fp
- SLP extraction successful
  - Simple, local analysis
  - Provides speedups from 1.24x to 6.70x
- Found SLP in general-purpose codes
Future Work
- SLP analysis beyond basic blocks
  - Packing maintained across blocks
  - Loop invariant packing
  - Fill unused slots with speculative ops
- SLP architectures
  - Emphasis on SIMD
  - Better packing/unpacking