ICS’02 UPC An Interleaved Cache Clustered VLIW Processor E. Gibert, J. Sánchez * and A. González * Dept. d’Arquitectura de Computadors Universitat Politècnica.

Presentation transcript:

ICS’02 UPC An Interleaved Cache Clustered VLIW Processor E. Gibert, J. Sánchez * and A. González * Dept. d’Arquitectura de Computadors Universitat Politècnica de Catalunya (UPC) * Also at Intel Barcelona Research Center June 2002

Motivation
• Capacity-bound vs. communication-bound
• Solution: clustered microarchitectures
  – Partition some hardware resources
  – Simpler + faster
  – Power consumption
  – Communications not homogeneous
• Goal: clustering the memory hierarchy in statically scheduled processors

Talk Outline
• State-of-the-art: MultiVLIW
• Interleaved Cache Clustered VLIW
• Scheduling Algorithms
• Enhancement: Attraction Buffers
• Experimental Framework
• Results
• Conclusions

State-of-the-art: MultiVLIW
• Sánchez and González [MICRO’00]
[Figure: Cluster 1, Cluster 2, ..., Cluster n, each with a Reg. File, F.U. and L1 data cache. Other labels: register-to-register buses, coherency network, next memory level.]

Basic Interleaved Cache Clustered VLIW Processor
[Figure: four clusters, each with a Reg. File, FUs and a cache module. The cache block (TAG, W0 W1 W2 W3 W4 W5 W6 W7) is split across the modules: CLUSTER 1 holds TAG, W0, W4; CLUSTER 2 holds TAG, W1, W5; CLUSTER 3 holds TAG, W2, W6; CLUSTER 4 holds TAG, W3, W7. Other labels: subblock 1, memory buses, register-to-register buses, NEXT MEMORY LEVEL.]

Modulo Scheduling
• Extract ILP from loops → overlap execution of iterations
[Figure: LOOP L with operations A, B, C; successive iterations (A, B, C; A’, B’, C’; A’’, B’’, C’’) overlap in time. Labels: II, SC, Kernel.]
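The overlap of iterations can be sketched numerically. This is a minimal sketch in Python, assuming a hypothetical three-operation loop body (A, B, C) with one operation per cycle and an initiation interval II of 1 cycle; the function name, the offsets and the iteration count are illustrative and not taken from the talk.

```python
def modulo_schedule(ops, offsets, ii, iterations):
    """Return {cycle: [labels]} when a new iteration starts every `ii` cycles.

    ops        -- operation names of one loop iteration, e.g. ["A", "B", "C"]
    offsets    -- start cycle of each operation within its own iteration
    ii         -- initiation interval (cycles between consecutive iterations)
    iterations -- number of iterations to lay out
    """
    timeline = {}
    for it in range(iterations):
        for op, off in zip(ops, offsets):
            cycle = it * ii + off
            timeline.setdefault(cycle, []).append(f"{op}{it}")
    return timeline

# Hypothetical loop body: A, B, C issued in consecutive cycles, II = 1.
timeline = modulo_schedule(["A", "B", "C"], [0, 1, 2], ii=1, iterations=4)
for cycle in sorted(timeline):
    print(cycle, " ".join(timeline[cycle]))
```

In this toy case, cycles 2 and 3 each hold one operation from three different iterations, which is the steady-state pattern the slide labels as the kernel.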

Base Scheduling Algorithm
• Used for Unified Cache
[Flowchart boxes: START; Sort nodes; Next node; Select possible clusters; How Many?; Best profit in output edges; How Many?; Least loaded; Schedule it; II=II+1. Branch labels: 0, 1, >0, >1.]
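One possible reading of the cluster-selection part of this flowchart is sketched below in Python. It assumes that zero candidate clusters leads to II=II+1, that a single candidate is scheduled directly, and that ties on "best profit in output edges" are broken by the least loaded cluster. The function name, its inputs and the way profit is supplied are hypothetical; the talk does not define them.

```python
def choose_cluster(candidates, profit, load):
    """Pick a cluster for one node, or return None if no cluster can take it.

    candidates -- clusters that can accept the node (hypothetical input)
    profit     -- {cluster: profit in output edges}; the metric is assumed given
    load       -- {cluster: number of operations already placed there}
    """
    if not candidates:
        return None  # the caller would retry the loop with II = II + 1
    if len(candidates) == 1:
        return candidates[0]
    best = max(profit[c] for c in candidates)
    tied = [c for c in candidates if profit[c] == best]
    if len(tied) == 1:
        return tied[0]
    return min(tied, key=lambda c: load[c])  # least loaded breaks the tie

# Two clusters tie on profit, so the less loaded one (cluster 1) is chosen.
print(choose_cluster([0, 1], {0: 2, 1: 2}, {0: 4, 1: 1}))
```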

Interleaved Cache Scheduling Algorithm
• Unroll loop to maximize instructions with a stride multiple of NxI → access ONE cache module
• Assign latencies to memory instructions
• Assign memory instructions to clusters:
  – IPBC (Interleaved Pre-Build Chains) → minimize stall time
  – IBC (Interleaved Build Chains) → minimize compute time
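The stride condition can be checked with a few lines of Python. This is a minimal sketch, assuming the module index is simply (address / I) mod N, with N = 4 clusters and I = 4 bytes as in the talk's example; the function names are illustrative.

```python
N_CLUSTERS = 4   # N: number of clusters (value used in the talk's example)
INTERLEAVE = 4   # I: interleaving factor in bytes (value used in the talk's example)

def cache_module(address, n=N_CLUSTERS, i=INTERLEAVE):
    """0-based index of the cache module holding `address` (assumed mapping)."""
    return (address // i) % n

def modules_touched(base, stride, count):
    """Set of modules accessed by one instruction that walks memory with `stride`."""
    return {cache_module(base + k * stride) for k in range(count)}

print(modules_touched(base=0, stride=4, count=8))    # stride is not a multiple of N*I
print(modules_touched(base=0, stride=16, count=8))   # stride equals N*I
```

Under this mapping, an instruction whose stride is a multiple of NxI stays on a single module, so it can be assigned to the cluster that owns that module.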

Memory Dependent Instructions
• IPBC → preferred info is used vs. IBC → minimize register comms.
[Figure: store, load and add instructions grouped into memory dependent chain 1 and memory dependent chain 2, with the annotations Preferred=1 and Preferred=2.]

Enhancement: Attraction Buffers
[Figure: a CACHE MODULE (ADDRESS, TAG, W2, W6, word select, hit, data) holding local data, next to an ATTRACTION BUFFER (ABuffer; ADDRESS, TAG, W, hit, data), with local logic.]

An Example
Original loop:
  for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
  }
Unroll x4 (N = 4 clusters, I = 4 bytes; 16-byte strides, an NxI multiple):
  for (i=0; i<MAX; i+=4) {
    ld r31, a[i]      (stride 16)
    ld r32, a[i+1]
    ld r33, a[i+2]
    ld r34, a[i+3]
    r41 = OP(r31)
    r42 = OP(r32)
    r43 = OP(r33)
    r44 = OP(r34)
    st r41, b[i]
    st r42, b[i+1]
    st r43, b[i+2]
    st r44, b[i+3]
  }
[Figure: array a[0], a[1], a[2], a[3], ... spread over CLUSTER 1 to CLUSTER 4. CLUSTER 4 is shown with its local module and its ABuffer, the entries a[3], a[7] and a[0], a[4], and the instruction ld r31, a[0].]
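The effect of unrolling on this example can be sketched in Python. The sketch assumes 4-byte array elements, an array starting at byte address 0, and a plain modulo mapping of words to clusters; the helper names are illustrative and not taken from the talk.

```python
N, I, ELEM = 4, 4, 4   # clusters, interleaving factor (bytes), element size (bytes, assumed)

def module_of(index, base=0):
    """1-based cluster whose module holds a[index]; `base` is the array's byte address."""
    return ((base + index * ELEM) // I) % N + 1

def preferred_clusters(unroll, max_index):
    """For each load in the unrolled body, the set of clusters it touches."""
    return {
        f"a[i+{k}]": {module_of(i + k) for i in range(0, max_index, unroll)}
        for k in range(unroll)
    }

print(preferred_clusters(unroll=1, max_index=16))   # the single load visits every cluster
print(preferred_clusters(unroll=4, max_index=16))   # each load stays on one cluster
```

With these assumptions, the single load of the original loop touches all four modules, while each of the four loads in the unrolled loop always touches the same module, which gives it a well-defined preferred cluster.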

Enhancement: Attraction Buffers
• Why remote accesses? Why Attraction Buffers?
  – Double precision accesses → low benefit
  – Indirect accesses: a[b[i]] → low benefit
  – “Unclear” preferred cluster → big benefit
      for (i=0; i<MAX; i++)
        for (k=i; k<i+MAX; k+=4)
          ld a[k], ld a[k+1], ld a[k+2], ld a[k+3]
  – Memory dependent chains → big benefit
  – IBC: preferred cluster info is not used → big benefit
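The "unclear preferred cluster" loop can be modelled with a toy simulation in Python. The sketch assumes that load j is always scheduled on cluster j, that every access hits in some cache module, that a remote access copies the whole remote subblock into the requesting cluster's buffer, and that buffers are unbounded. None of these simplifications comes from the talk; the 32-byte block size and the 4-byte interleaving factor are taken from the experimental configuration.

```python
N, I, BLOCK = 4, 4, 32   # clusters, interleaving factor and cache block size in bytes

def simulate(max_n, use_buffers):
    """Count access kinds for the nested loop; load j always runs on cluster j."""
    buffers = [set() for _ in range(N)]          # one unbounded buffer per cluster (assumed)
    counts = {"local": 0, "abuffer": 0, "remote": 0}
    for i in range(max_n):
        for k in range(i, i + max_n, 4):
            for j in range(N):                   # ld a[k+j], scheduled on cluster j
                addr = (k + j) * I               # 4-byte elements, array at address 0
                home = (addr // I) % N           # module that owns the word
                subblock = (addr // BLOCK, home)
                if home == j:
                    counts["local"] += 1
                elif use_buffers and subblock in buffers[j]:
                    counts["abuffer"] += 1
                else:
                    counts["remote"] += 1
                    if use_buffers:
                        buffers[j].add(subblock)  # attract the subblock
    return counts

print(simulate(16, use_buffers=False))
print(simulate(16, use_buffers=True))
```

In this model the home module of each load shifts with the outer index i, so no static assignment makes all accesses local; the buffers turn repeated remote accesses to the same subblock into buffer hits.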

Experimental Framework
• IMPACT C compiler
• Modulo scheduling on hyperblock loops
  – BASE for a Unified Cache
  – IPBC and IBC for an Interleaved Cache
  – IPBC and IBC for the MultiVLIW
  – The same unrolling factor has been used for all architecture configurations!
• Mediabench benchmark suite

Experimental Framework
Number of clusters: 4
Functional units: 1 FP / cluster + 1 int / cluster + 1 mem / cluster
Cache configuration: 8KB, 32-byte lines, 2-way set associative, 1 cycle latency
Reg-to-reg communication buses: 4 buses that run at ½ the core frequency
Memory buses: 4 buses that run at ½ (or ¼) the core frequency
Next memory level: 4 ports, 5 cycle latency, always hit
Interleaving factor (Interleaved Cache): 4 bytes
Latencies: 1-10 (Unified Cache + MultiVLIW); 1-(5/6) (Interleaved Cache)

Results (I)
• IPBC vs. IBC → similar cycle count results
• MultiVLIW vs. Interleaved → similar results, BUT… lower complexity!

Results (II)
• Memory dependent chains
  – Interleaved cache → workload imbalance + remote accesses
  – MultiVLIW → workload imbalance
  – Working on techniques to overcome scheduling restrictions

Results (III)
• Local hits are increased by 15%
• Stall time reduced by 30%

Conclusions
• Scheduling Algorithms
  – Good latency assignment process (stall time accounts for 9% of execution time)
  – Coherence kept through memory dependent chains (5% cycle count degradation)
• Attraction Buffers
  – Effective to increase local hits (15% average) + reduce stall time (30% average)
  – Reduce remote hits to previously accessed subblocks (70% average)
• Cycle count results
  – Similar to Unified Cache and MultiVLIW

Questions