UPC – CGO’03, San Francisco, March 2003
Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache
Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)
(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona

Motivation
- Capacity-bound vs. communication-bound designs
- Clustered microarchitectures:
  – Simpler and faster structures
  – Lower power consumption
  – Communications are not homogeneous
- Clustering also fits the embedded/DSP domain

Clustered Microarchitectures
[Figure: three 4-cluster VLIW organizations. Each cluster has its own register file and functional units, and clusters exchange values over register-to-register communication buses. In the first organization the clusters share a centralized L1 cache, reached through memory buses and backed by an L2 cache; in the other two the L1 cache is distributed into per-cluster cache modules backed by the shared L2.]

Contributions
- Distribution of the data cache
  – Architecture design + data mapping: word-interleaved scheme [ICS’02]
  – Appropriate scheduling techniques [MICRO’02]
  – Memory coherence
- Scheduling techniques for memory coherence
  – Local, software-based techniques
  – Applied to the word-interleaved cache: complex configuration (with Attraction Buffers – refer to the paper) and simple configuration (without Attraction Buffers)
  – Applicable to any other cache configuration

Talk Outline
- Architecture and Scheduling Algorithms
- Memory Coherence Problem
- Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
- Evaluation
- Conclusions

Word-Interleaved Distribution
[Figure: four clusters, each with a register file, functional units and a local cache module, connected by register-to-register communication buses and backed by a shared L2 cache. A cache block holding words W0–W7 is split into one subblock per cluster: cluster 1 keeps W0 and W4, cluster 2 keeps W1 and W5, cluster 3 keeps W2 and W6, and cluster 4 keeps W3 and W7. An access is a local hit, remote hit, local miss or remote miss depending on whether the requested word maps to the requesting cluster and whether it is in cache.]
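A minimal sketch of this address-to-cluster mapping (an illustration under assumptions, not code from the paper): with 4 clusters and a 4-byte interleaving factor, consecutive words of a block rotate over the clusters, so a[0] and a[4] land in cluster 1, a[1] and a[5] in cluster 2, and so on. The array base address and the helper name are hypothetical; the evaluated processor actually uses an interleaving factor of 2 or 4 bytes depending on the benchmark.

    /* home cluster of an address under word interleaving (illustrative only) */
    #include <stdio.h>
    #include <stdint.h>

    #define NUM_CLUSTERS 4
    #define INTERLEAVE   4               /* bytes per interleaving unit (one word) */

    static int home_cluster(uintptr_t addr)
    {
        /* consecutive words of a cache block rotate over the clusters */
        return (int)((addr / INTERLEAVE) % NUM_CLUSTERS);
    }

    int main(void)
    {
        uintptr_t base = 0x1000;         /* hypothetical base address of a[] */
        for (int i = 0; i < 8; i++)      /* a[i] are 4-byte elements */
            printf("a[%d] -> cluster %d\n", i, home_cluster(base + 4u * i) + 1);
        return 0;
    }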

Scheduling Techniques
[Figure: a[0] and a[4] map to cluster 1’s cache module, a[1] and a[5] to cluster 2’s, a[2] and a[6] to cluster 3’s, a[3] and a[7] to cluster 4’s.]
Original loop (modulo scheduled):
    for (i=0; i<MAX; i++) {
      ld r3, a[i]
      r4 = OP(r3)
      st r4, b[i]
    }
After unrolling by the number of clusters, each load in the loop body accesses a fixed cluster with a 16-byte stride:
    for (i=0; i<MAX; i+=4) {
      ld r31, a[i]      (stride 16 bytes)
      ld r32, a[i+1]    (stride 16 bytes)
      ld r33, a[i+2]    (stride 16 bytes)
      ld r34, a[i+3]    (stride 16 bytes)
      ...
    }
Techniques used: modulo scheduling, loop unrolling, assignment of latencies, padding + profiling.

Cluster Assignment
- Non-memory instructions: minimize register communications and maximize workload balance
- Memory instructions – two heuristics (see the sketch below):
  – PrefClus heuristic: preferred cluster = most accessed cluster (profiling + padding)
  – MinComs heuristic: minimize register communications and maximize workload balance, with a post-pass phase to increase local accesses
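A minimal sketch of the PrefClus choice (an illustration under assumptions, not the paper’s compiler code): the preferred cluster of a memory instruction is the cluster whose cache module its profiled accesses touch most often. The profile counts below are hypothetical.

    /* PrefClus: pick the most accessed cluster for one memory instruction */
    #include <stdio.h>

    #define NUM_CLUSTERS 4

    static int preferred_cluster(const int counts[NUM_CLUSTERS])
    {
        int best = 0;
        for (int c = 1; c < NUM_CLUSTERS; c++)   /* most accessed cluster wins */
            if (counts[c] > counts[best])
                best = c;
        return best;
    }

    int main(void)
    {
        int profile[NUM_CLUSTERS] = { 120, 10, 8, 6 };  /* hypothetical counts */
        printf("preferred cluster: %d\n", preferred_cluster(profile) + 1);
        return 0;
    }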

Talk Outline
- Architecture and Scheduling Algorithms
- Memory Coherence Problem
- Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
- Evaluation
- Conclusions

Memory Coherence Problem
[Figure: a[0] and a[4] reside in cluster 1’s cache module and a[3] and a[7] in cluster 4’s; both modules reach the next memory level through the memory buses.]

             CLUSTER 1        CLUSTER 2   CLUSTER 3   CLUSTER 4
  cycle i    ---              ---         ---         store to a[0]
  ...
  cycle i+4  load from a[0]   ---         ---         ---

The store executes in cluster 4 while a[0] resides in cluster 1, so its update must travel over the memory buses; remote accesses, misses, replacements and other traffic make that bus latency non-deterministic, so the load in cluster 1 may read a[0] before the update arrives.

Talk Outline
- Architecture and Scheduling Algorithms
- Memory Coherence Problem
- Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
- Evaluation
- Conclusions

Solutions Outline
- Local scheduling solutions, applied at loop granularity:
  – Memory Dependent Chains (MDC)
  – Data Dependence Graph Transformations (DDGT): store replication and load-store synchronization
- Software-based solutions, applicable to other configurations:
  – Replicated distributed cache
  – MultiVLIW [MICRO’00], ...

Memory Dependent Chains
- Sets of aliased instructions form Memory Dependent Chains (MDC)
- Instructions in the same set are assigned to the same cluster (a sketch follows below)
- Restrictions on cluster assignment:
  – PrefClus: use the average preferred cluster of the chain
  – MinComs: minimize communications when scheduling the first node of the chain
[Figure: example DDG with nodes n1 load, n2 load, n3 add, n4 store, n6 load, n7 div and n8 add, connected by register-flow (RF), memory-flow (MF) and memory-anti (MA) edges; the memory instructions linked by MF/MA edges form one chain.]
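A minimal sketch of how such chains could be built (an assumption for illustration, not the paper’s algorithm): memory instructions linked by memory dependences are united into chains with a union-find structure, and all instructions with the same representative are later assigned to the same cluster. The node numbers and edges below are illustrative, not the exact example from the slide.

    /* group memory instructions into Memory Dependent Chains via union-find */
    #include <stdio.h>

    #define MAX_INSNS 256
    static int chain[MAX_INSNS];         /* union-find parent per instruction */

    static int find(int i)               /* root of i's chain, with path compression */
    {
        while (chain[i] != i)
            i = chain[i] = chain[chain[i]];
        return i;
    }

    static void unite(int a, int b)      /* merge the chains of a and b */
    {
        chain[find(a)] = find(b);
    }

    int main(void)
    {
        for (int i = 0; i < MAX_INSNS; i++)
            chain[i] = i;
        /* hypothetical memory dependence edges: n1 -> n4 (MF), n4 -> n6 (MA) */
        unite(1, 4);
        unite(4, 6);
        printf("n1 and n6 in the same MDC? %s\n", find(1) == find(6) ? "yes" : "no");
        return 0;
    }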

Memory Dependent Chains
[Figure: the previous example revisited. The store to a[0] and the load from a[0] belong to the same memory dependent chain, so both are assigned to the same cluster (cluster 1 in the figure, where a[0] resides); the local cache module then enforces their ordering, and the non-deterministic memory-bus latency no longer affects correctness.]

DDGT: Store Replication
- Overcomes MEM_FLOW (MF) and MEM_OUT (MO) dependences
[Figure: the DDG transformation replaces each store involved in such a dependence by a local instance plus one remote instance per remaining cluster (store A becomes store A, A’, A’’, A’’’); the MF edge to load B, or the MO edge to store B (itself replicated the same way), is then carried by the appropriate instance. A sketch of the transformation follows below.]
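A minimal sketch of the replication step (an illustration under assumptions, not the paper’s compiler code): the store node is cloned once per cluster, and the clone placed in the cluster that owns the accessed data is marked as the local instance. The data structure and the home-cluster value are hypothetical.

    /* clone a store into one instance per cluster; mark the local instance */
    #include <stdio.h>

    #define NUM_CLUSTERS 4

    typedef struct {
        const char *op;                  /* e.g. "store A" */
        int cluster;                     /* cluster this instance runs on */
        int is_local;                    /* 1 for the data's home cluster */
    } StoreInstance;

    static void replicate_store(const char *op, int home_cluster,
                                StoreInstance out[NUM_CLUSTERS])
    {
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            out[c].op = op;
            out[c].cluster = c;
            out[c].is_local = (c == home_cluster);
        }
    }

    int main(void)
    {
        StoreInstance copies[NUM_CLUSTERS];
        replicate_store("store A", 0, copies);  /* assume the data lives in cluster 1 */
        for (int c = 0; c < NUM_CLUSTERS; c++)
            printf("cluster %d: %s (%s instance)\n", c + 1, copies[c].op,
                   copies[c].is_local ? "local" : "remote");
        return 0;
    }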

DDGT: Store Replication
[Figure: the coherence example with store replication applied. Each cluster executes an instance of the store to a[0] (cycles i to i+3); the instance in cluster 1, where a[0] resides, is the local instance and the others are remote instances. The load from a[0] in cluster 1 at cycle i+4 then reads the value through its local module.]
Drawback: the number of register communications increases.

DDGT: Load-Store Synchronization
- Overcomes MEM_ANTI (MA) dependences
[Figure: the MA edge from load A to store B is replaced by an explicit SYNC edge, so the store cannot be scheduled ahead of the load it may alias with. A sketch of the rewrite follows below.]
- Special cases:
  – The store is already REG_FLOW dependent on the load
  – Impossible recurrences: handled by inserting a fake consumer (see figure)
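A minimal sketch of the edge rewrite (an assumption for illustration, not the paper’s implementation): each memory-anti edge from a load to a store becomes an explicit SYNC edge unless the store already depends on the load through a register-flow edge. The edge encoding and the example graph are illustrative.

    /* turn MEM_ANTI edges into SYNC edges in a tiny DDG edge list */
    #include <stdio.h>

    enum edge_kind { RF, MF, MA, MO, SYNC };

    typedef struct { int src, dst; enum edge_kind kind; } Edge;

    static int rf_dependent(const Edge *e, int n, int load, int store)
    {
        for (int i = 0; i < n; i++)      /* only direct RF edges, for brevity */
            if (e[i].kind == RF && e[i].src == load && e[i].dst == store)
                return 1;
        return 0;
    }

    static void synchronize(Edge *e, int n)
    {
        for (int i = 0; i < n; i++)
            if (e[i].kind == MA && !rf_dependent(e, n, e[i].src, e[i].dst))
                e[i].kind = SYNC;        /* the store now waits for the load */
    }

    int main(void)
    {
        /* illustrative DDG: load A (0) -MA-> store B (1), load A (0) -RF-> add (2) */
        Edge ddg[] = { {0, 1, MA}, {0, 2, RF} };
        synchronize(ddg, 2);
        printf("edge load A -> store B is now %s\n",
               ddg[0].kind == SYNC ? "SYNC" : "unchanged");
        return 0;
    }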

U P C CGO’03 San Francisco March 2003 CCCC BA MRT II res =2 C1C2C3C4 MDC Solution: Case Study  Impact on compute time –May increase the II res load A load A store C store C load B load B C BA MRT II res =2 C1C2C3C4 MA MF B C A MRT II res =3 C1C2C3C4  Impact on stall time –May increase remote accesses Extra stall cycles = 3 cycles / iteration always accesses data in cluster 1 always accesses data in cluster 2 Latency LH = 1 cycle Latency RH = 5 cycles add RF cycle 1 cycle 3

DDGT Solution: Case Study
- Impact on compute time: more instructions raise II_res
  – Store replication, fake consumers (few) and register communications
  [Figure: example DDG with load A, store B (MF and MA edges) and a set X of memory instructions; after store replication the MRT grows from II_res = 2 to II_res = 3.]
- Impact on stall time: small
  – New dependences may decrease the slack of some memory instructions

Talk Outline
- Architecture and Scheduling Algorithms
- Memory Coherence Problem
- Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
- Evaluation
- Conclusions

Evaluation Framework
- IMPACT C compiler: compile + optimize + memory disambiguation
- Mediabench benchmark suite (profile input / execution input):

  Benchmark   Profile     Execution
  epicdec     test_image  titanic
  g721dec     clinton     S_16_44
  g721enc     clinton     S_16_44
  gsmdec      clinton     S_16_44
  gsmenc      clinton     S_16_44
  jpegdec     testimg     monalisa
  jpegenc     testimg     monalisa
  mpeg2dec    mei16v2     tek6
  pegwitdec   pegwit      techrep
  pegwitenc   pgptest     techrep
  pgpdec      pgptext     techrep
  pgpenc      pgptest     techrep
  rasta       ex5_c1

Evaluation Framework
Word-interleaved cache clustered VLIW processor:
- # clusters: 4
- Functional units: 1 FP + 1 integer + 1 memory unit per cluster
- Register buses: 4 buses running at ½ the core frequency
- Memory buses: 4 buses running at ½ the core frequency
- Cache configuration: 8 KB, 2-way set-associative, 32-byte blocks; L2 always hits
- Cache latencies: local hit = 1, remote hit = 5, local miss = 10, remote miss = 15 cycles
- Scheduling algorithm: PrefClus and MinComs
- Interleaving factor: 2 or 4 bytes, depending on the benchmark
- Baseline: same architecture, but complete freedom when assigning instructions to clusters

Local vs. Remote Accesses

Execution Time

Other Configurations
- Configuration 1: register buses – 2 buses, latency 4; memory buses – 4 buses, latency 2
  – More pressure on the register buses
  – MDC outperforms DDGT in all cases, since MDC requires fewer register communications
- Configuration 2: register buses – 4 buses, latency 2; memory buses – 2 buses, latency 4
  – More pressure on the memory buses
  – DDGT outperforms the best MDC in several cases: epicdec 17%, pgpdec 20%, pgpenc 9%, rasta 7%, ...

Talk Outline
- Architecture and Scheduling Algorithms
- Memory Coherence Problem
- Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
- Evaluation
- Conclusions

Conclusions
- Memory coherence problem
  – Two software-based solutions: MDC and DDGT
  – Applied to a word-interleaved cache clustered VLIW processor
- MDC vs. DDGT
  – Results depend on the architecture configuration: MDC outperforms DDGT in most cases, and DDGT is better by up to 20% in a specific configuration
  – Sets of memory dependent instructions are small
  – DDGT gives more freedom in cluster assignment: it increases local accesses by 15%, which reduces stall time

Questions?