UPC MICRO35 Istanbul Nov. 2002 Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor Enric Gibert 1 Jesús Sánchez.

Presentation transcript:

UPC MICRO35 Istanbul Nov. 2002
Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor
Enric Gibert (1), Jesús Sánchez (1,2), Antonio González (1,2)
(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona

Motivation
Capacity vs. communication-bound
Clustered microarchitectures
– Simpler + faster
– Power consumption
– Communications not homogeneous
Clustering: embedded/DSP domain

Clustered Microarchitectures
[Diagram: CLUSTER 1 to CLUSTER 4, each with a register file and FUs, connected by register-to-register communication buses; memory buses connect the clusters to a shared L1 cache, backed by an L2 cache.]
GOAL: distribute the memory hierarchy!!!

Contributions
Distribution of data cache:
– Interleaved cache clustered VLIW processor
Hardware enhancement:
– Attraction Buffers
Effective instruction scheduling techniques:
– Modulo scheduling
– Loop unrolling + smart assignment of latencies + padding

Talk Outline
MultiVLIW
Interleaved-cache clustered VLIW processor
Instruction scheduling algorithms and techniques
Hardware enhancement: Attraction Buffers
Simulation framework
Results
Conclusions

MultiVLIW
[Diagram: CLUSTER 1 to CLUSTER 4, each with a register file, functional units and its own cache module; the clusters are connected by register-to-register communication buses and the cache modules are backed by an L2 cache. Each cache block holds TAG+STATE+DATA.]
Cache-Coherence Protocol!!!

Interleaved Cache
[Diagram: CLUSTER 1 to CLUSTER 4, each with a register file, functional units and its own cache module; the clusters are connected by register-to-register communication buses and the cache modules are backed by an L2 cache.]
A cache block (TAG, W0 to W7) is split into one subblock per cache module, each subblock carrying the tag:
– CLUSTER 1: TAG, W0, W4 (subblock 1)
– CLUSTER 2: TAG, W1, W5
– CLUSTER 3: TAG, W2, W6
– CLUSTER 4: TAG, W3, W7
Access types: local hit, remote hit, local miss, remote miss
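
The word-to-cluster mapping on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, clusters are numbered from 0 (so CLUSTER 1 on the slide is index 0), and the 4-cluster, 4-byte interleaving and latency values are taken from the Evaluation Framework slide.

```python
N_CLUSTERS = 4          # number of clusters (Evaluation Framework slide)
INTERLEAVE_BYTES = 4    # interleaving factor (Evaluation Framework slide)

# Latencies in cycles for the interleaved cache (Evaluation Framework slide).
LATENCY = {"local hit": 1, "remote hit": 5, "local miss": 10, "remote miss": 15}


def home_cluster(addr):
    """Return the 0-based cluster whose cache module can hold this byte address."""
    return (addr // INTERLEAVE_BYTES) % N_CLUSTERS


def classify(addr, issuing_cluster, is_hit):
    """Classify an access as local/remote and hit/miss."""
    where = "local" if home_cluster(addr) == issuing_cluster else "remote"
    return f"{where} {'hit' if is_hit else 'miss'}"


def access_latency(addr, issuing_cluster, is_hit):
    """Latency in cycles of one access, using the table above."""
    return LATENCY[classify(addr, issuing_cluster, is_hit)]
```

For example, with a 4-byte element array starting at address 0, a[3] sits at byte 12 and maps to index 3 (CLUSTER 4), which matches the W3/W7 subblock in the diagram.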

BASE Scheduling Algorithm
[Flowchart:]
START
Sort nodes
Next node
Select possible clusters. How many?
– 0: II=II+1
– >0: Best profit in output edges. How many?
  – 1: Schedule it
  – >1: Least loaded, then schedule it
Schedule it: successful (continue with the next node) or not successful (II=II+1)

Scheduling Algorithm
For word-interleaved cache clustered processors
Scheduling steps:
1. Loop unrolling
2. Assignment of latencies to memory instructions
   – latencies: stall time + compute time
3. Order instructions (DDG nodes)
4. Cluster assignment and scheduling

STEP 1: Loop Unrolling
Data placement: CLUSTER 1 cache module holds a[0], a[4]; CLUSTER 2 holds a[1], a[5]; CLUSTER 3 holds a[2], a[6]; CLUSTER 4 holds a[3], a[7].

Original loop (ld r3, a[i]: 25% local accesses):
for (i=0; i<MAX; i++) {
  ld r3, a[i]
  r4 = OP(r3)
  st r4, b[i]
}

Unrolled loop (ld r31, a[i]; ld r32, a[i+1]; ld r33, a[i+2]; ld r34, a[i+3]: 100% local accesses):
for (i=0; i<MAX; i+=4) {
  ld r31, a[i]    (stride 16 bytes)
  ld r32, a[i+1]  (stride 16 bytes)
  ld r33, a[i+2]  (stride 16 bytes)
  ld r34, a[i+3]  (stride 16 bytes)
  ...
}

Selective unrolling:
– No unrolling
– Unroll xN
– OUF unrolling: Optimum Unrolling Factor (OUF), strides multiple of NxI
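
The slide does not spell out how the OUF is computed. One plausible reading, sketched below as an assumption, is the smallest unrolling factor that makes the stride of every unrolled copy a multiple of NxI (N clusters, interleaving factor I). The function names are hypothetical and the array is assumed to start at address 0.

```python
from math import gcd

N_CLUSTERS = 4        # N
INTERLEAVE_BYTES = 4  # I


def optimum_unrolling_factor(stride_bytes, n=N_CLUSTERS, i=INTERLEAVE_BYTES):
    """Smallest U such that U * stride_bytes is a multiple of N x I (assumed definition)."""
    span = n * i
    return span // gcd(span, stride_bytes)


def clusters_touched(base, stride_bytes, unroll, copy, iterations=64):
    """Home clusters (0-based) visited by one unrolled copy of a strided load."""
    touched = set()
    for it in range(iterations):
        addr = base + (it * unroll + copy) * stride_bytes
        touched.add((addr // INTERLEAVE_BYTES) % N_CLUSTERS)
    return touched
```

For the loop on the slide (4-byte elements), this gives a factor of 4: without unrolling the single load visits all four clusters, while each of the four unrolled copies always lands in the same cluster and can be scheduled there.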

STEP 2: Latency Assignment
[DDG: n1 load, n2 load, n3 add, n4 store, n5 sub (REC1, distance=1); n6 load, n7 div, n8 add (REC2, distance=1). Edges: memory dependences and register-flow deps.]
Memory latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles
[Tables: STEP 1 lists Load, Latency change, II, stall and B; STEP 2 lists II, stall and B. Candidate changes for n1 and n2: To LM, To RH, To LH.]
Latency labels and MII values shown as the assignment proceeds:
– L=1, L=8, L=1, L=15: MII=33, MII=22
– L=15, L=10, L=15: MII=28, MII=22
– L=15, L=5, L=15: MII=23, MII=22
– L=5, L=1: MII=9, MII=10
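
The MII values on this slide are consistent with the usual recurrence bound used in modulo scheduling: the sum of the latencies around a recurrence divided by its total dependence distance, rounded up. The sketch below is a reading of the slide, not the authors' code. It assumes that REC1 contains n1 to n5 and that the add, store and sub take 1 cycle each; under those assumptions it reproduces the values 33, 28, 23 and 9.

```python
from math import ceil

# Memory latencies in cycles, from the slide.
LH, RH, LM, RM = 1, 5, 10, 15


def rec_mii(latencies, distance=1):
    """Recurrence-constrained minimum II: ceil(sum of latencies / distance)."""
    return ceil(sum(latencies) / distance)


def rec1_mii(n1_latency, n2_latency):
    """MII of REC1 for given latencies of the loads n1 and n2.

    ASSUMPTION: n3 (add), n4 (store) and n5 (sub) each take 1 cycle.
    """
    return rec_mii([n1_latency, n2_latency, 1, 1, 1], distance=1)
```

Lowering one load from remote miss to local miss, then to remote hit, shrinks the recurrence by 5 cycles each time (33, 28, 23), which is the trade-off the step explores: a shorter assigned latency lowers the II but risks more stall time if the access turns out to be slower.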

STEPS 3 and 4
Step 3: Order instructions
Step 4: Cluster assignment and scheduling

Scheduling Restrictions
[Diagram: CLUSTER 1 (cache module holding a[0], a[4]), CLUSTER 2, CLUSTER 3 and CLUSTER 4 (cache module holding a[3], a[7]), connected by memory buses to the NEXT MEMORY LEVEL.]
[Schedule table: a store to a[0] at cycle i and a load from a[0] at cycle i+3, issued in different clusters.]
NON-DETERMINISTIC BUS LATENCY!!!

STEPS 3 and 4
Step 3: Order instructions
Step 4: Cluster assignment and scheduling
– Non-memory instructions: same as BASE
  Minimize register communications + maximize workload
– Memory instructions:
  Memory instructions in same chain: same cluster
  IPBC (Interleaved Preferred Build Chains)
  – Average preferred cluster of the chain
  – Padding: meaningful preferred cluster information (NxI boundary)
    » Stack frames
    » Dynamically allocated data
  IBC (Interleaved Build Chains)
  – Minimize register communications of 1st instr. of chain

Memory Dependent Chains
[DDG: n1 load, n2 load, n3 add, n4 store, n5 sub (recurrence, distance=1); n6 load, n7 div, n8 add (recurrence, distance=1). Edges: memory dependences and register-flow deps. Annotations: Preferred = 1, Preferred = 2; latencies L=1, L=8, L=1, L=5, L=1.]
Memory latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles
[Table for the memory instructions n1, n2, n4, n6:]
– IPBC: cluster 1, cluster 2
– IBC: same as n4, minimize register communications
order={n5, n4, n3, n2, n1, n8, n7, n6}
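
A rough sketch of how memory dependent chains and an IPBC-style preferred cluster could be derived is given below. It is illustrative only: the memory dependence edges and the per-instruction preferred clusters in the example are hypothetical (the slide's graph is not fully recoverable), and the slide's "average preferred cluster" is approximated here by a majority vote over the chain.

```python
from collections import Counter


def build_chains(mem_nodes, mem_deps):
    """Group memory instructions connected by memory dependences (union-find)."""
    parent = {n: n for n in mem_nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in mem_deps:
        parent[find(a)] = find(b)

    chains = {}
    for n in mem_nodes:
        chains.setdefault(find(n), []).append(n)
    return sorted(chains.values())


def chain_preferred_cluster(chain, preferred):
    """Most common preferred cluster in the chain (ties go to the lowest cluster).

    ASSUMPTION: majority vote stands in for the slide's 'average'.
    """
    votes = Counter(preferred[n] for n in chain)
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best)


# Hypothetical example: n1, n2 and n4 are linked by memory dependences, n6 is alone.
chains = build_chains(["n1", "n2", "n4", "n6"], [("n1", "n4"), ("n2", "n4")])
preferred = {"n1": 1, "n2": 1, "n4": 2, "n6": 2}
assignment = [chain_preferred_cluster(c, preferred) for c in chains]
```

All instructions of a chain are then scheduled in the cluster chosen for the chain, which keeps dependent memory accesses ordered within a single cluster.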

Attraction Buffers
Cost-effective mechanism to increase local accesses
[Diagram: CLUSTER 1 cache module holds a[0], a[4]; CLUSTER 2 holds a[1], a[5]; CLUSTER 3 holds a[2], a[6]; CLUSTER 4 holds a[3], a[7]. A cluster other than CLUSTER 4 has an Attraction Buffer (ABuffer) that ends up holding a[3], a[7].]
Example: ld r3, a[3]; ld r3, a[7]; ... (stride 16 bytes)
– Without the Attraction Buffer: local accesses = 0%
– With the Attraction Buffer: local accesses = 50%
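
The 0% and 50% figures can be reproduced with a small simulation. This is a sketch under stated assumptions, not the hardware design: a remote access is assumed to copy the whole remote subblock (the two words of the 32-byte block held by the home cluster) into the requesting cluster's buffer, the buffer is modeled as unbounded, only loads are considered, and the array is assumed to start at address 0. The function name is hypothetical.

```python
N_CLUSTERS = 4
INTERLEAVE_BYTES = 4
BLOCK_BYTES = 32  # cache block size from the Evaluation Framework slide


def local_fraction(addresses, cluster, use_abuffer):
    """Fraction of loads served locally by `cluster` (0-based index)."""
    abuffer = set()  # subblocks attracted so far, keyed by (block number, home cluster)
    local = 0
    for addr in addresses:
        home = (addr // INTERLEAVE_BYTES) % N_CLUSTERS
        subblock = (addr // BLOCK_BYTES, home)
        if home == cluster or subblock in abuffer:
            local += 1                 # served by the local module or the buffer
        elif use_abuffer:
            abuffer.add(subblock)      # remote access: attract the whole subblock
    return local / len(addresses)


# Loads of a[3], a[7], a[11], ... (4-byte elements, stride 16 bytes).
stream = [4 * (3 + 4 * k) for k in range(16)]
without_buffer = local_fraction(stream, cluster=0, use_abuffer=False)
with_buffer = local_fraction(stream, cluster=0, use_abuffer=True)
```

Every second load (a[7], a[15], ...) finds its word already attracted by the previous remote access to the same subblock, which is where the 50% comes from.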

Evaluation Framework
IMPACT C compiler
Mediabench benchmark suite (benchmark: profile input, execution input)
– epicdec: test_image, titanic
– epicenc: test_image, titanic
– g721dec: clinton, S_16_44
– g721enc: clinton, S_16_44
– gsmdec: clinton, S_16_44
– gsmenc: clinton, S_16_44
– jpegdec: testimg, monalisa
– jpegenc: testimg, monalisa
– mpeg2dec: mei16v2, tek6
– pegwitdec: pegwit, techrep
– pegwitenc: pgptest, techrep
– pgpdec: pgptext, techrep
– pgpenc: pgptest, techrep
– rasta: ex5_c1

Evaluation Framework
Configurations compared: Unified cache, MultiVLIW, Interleaved cache
Common to all three:
– # clusters: 4
– Functional units: 1 FP / cluster + 1 integer / cluster + 1 memory / cluster
– Register buses: 4 buses running at ½ the core freq.
– Cache configuration: 8KB, 2-way set-associative, 32 byte blocks; L2 always hits
Cache latencies:
– Unified cache: Hit=5, Miss=14
– MultiVLIW: Hit=1, Miss=10
– Interleaved cache: Local Hit=1, Remote Hit=5, Local Miss=10, Remote Miss=15
Algorithm:
– Unified cache: BASE
– MultiVLIW: IBC
– Interleaved cache: IPBC + IBC
Interleaving factor:
– Unified cache: -
– MultiVLIW: -
– Interleaved cache: 4 bytes

Local Accesses
[Chart. Legend: OUF=Optimum UF, P=Padding, NC=No Chains]

Why Remote Accesses?
Double precision accesses (mpeg2dec)
Unclear preferred cluster information
Indirect accesses (e.g. a[b[i]]) (jpegdec, jpegenc, pegwitdec, pegwitenc)
Different alignment (epicenc, jpegdec, jpegenc)
Strides not multiple of NxI (selective unrolling, …)
Memory dependent chains (epicdec, pgpdec, pgpenc, rasta)
Example loop:
for (k=0; k<MAX; k++) {
  for (i=k; i<MAX; i++)
    load a[i]
}
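
The example loop shows why a static cluster assignment can miss: the inner loop starts at i=k, so the alignment of a[i] relative to the clusters shifts with every outer iteration. The sketch below is an illustration under assumptions: the inner loop is taken to be unrolled by 4, unrolled copy c is assumed to be scheduled in cluster c (0-based), and elements are one interleaving unit wide.

```python
def local_fraction_triangular(max_n, n_clusters=4):
    """Fraction of local accesses for: for k in 0..MAX-1: for i in k..MAX-1: load a[i]."""
    local = total = 0
    for k in range(max_n):
        for i in range(k, max_n):
            copy = (i - k) % n_clusters  # which unrolled copy executes this access
            assigned = copy              # ASSUMPTION: copy c is scheduled in cluster c
            home = i % n_clusters        # cluster whose cache module holds a[i]
            local += (home == assigned)
            total += 1
    return local / total
```

An access is local only when k is a multiple of the number of clusters, so the local fraction stays far below 100% even though the stride itself is a multiple of NxI after unrolling.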

Stall Time

Cycle Count Results

Conclusions
Interleaved cache clustered VLIW processor
Effective instruction scheduling techniques
– Smart assignment of latencies
– Loop unrolling + padding (27% local hits)
Source of remote accesses and stall time
Attraction Buffers (stall time reduced by up to 34%)
Cycle count results:
– MultiVLIW (7% slowdown but simpler hardware)
– Unified cache (11% speedup)

Questions?

Question: Latency Assignment
MII(REC1)=20, MII(DDG)=10
[Table with columns Node, II, stall, B(ratio), B(subtract); rows as extracted: n n n35154 n45154 n5100MAX10]

Question: Padding
#include <stdlib.h>

/* MAX is assumed to be defined elsewhere. */
void foo(int *array, int *accum) {
  int i;
  *accum = 0;
  for (i=0; i<MAX; i++)
    *accum += array[i];
}

int main(void) {
  int *a, value;
  a = malloc(MAX*sizeof(int));
  foo(a, &value);
  return 0;
}

Data placement:
– CLUSTER 1: a[0], a[4], ...
– CLUSTER 2: accum, a[1], a[5], ...
– CLUSTER 3: a[2], a[6], ...
– CLUSTER 4: a[3], a[7], ...

Question: Coherence
Memory Dependent Chains
– Modified data: present in only one Attraction Buffer
– Data present in multiple Attraction Buffers: replicated in read-only manner
– Local scheduling technique: at end of loop, flush Attraction Buffers contents
[Diagram: a[2] is held in the ABuffers of CLUSTER 1, CLUSTER 2 and CLUSTER 4; the ABuffer of CLUSTER 3 holds no copy.]