Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures. David López, Josep Llosa, Mateo Valero and Eduard Ayguadé. MICRO-31, 1-XII-98.

Presentation transcript:

1-XII-98 Micro-31 1 Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures. David López, Josep Llosa, Mateo Valero and Eduard Ayguadé. Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya

1-XII-98 Micro-31 2 Goals Modify resources to exploit ILP in –VLIW architectures –numerical code –innermost loops A study of performance and cost Technological projection

1-XII-98 Micro-31 3 Outline Replication and Widening Performance –maximum ILP achievable –effects of spill code Design considerations Performance under a technological limit Conclusions

1-XII-98 Micro-31 4 Basic architecture 1 bus between the register file (RF) and the first-level cache; 1 general-purpose floating-point functional unit (FPU). (Figure: FPU, Register File, Bus.)

1-XII-98 Micro-31 5 Basic architecture 2 operations can be issued per cycle: –1 memory –1 FPU (Figure: FPU, Register File, Bus; VLIW word with memory, FPU and others fields.)

1-XII-98 Micro-31 6 Replication 2 buses + 2 FPU. 4 operations can be issued per cycle: –2 memory (independent) –2 FPU (independent) (Figure: duplicated FPUs and buses; VLIW word with memory, FPU and others fields.)

1-XII-98 Micro-31 7 An alternative: widening The bus, the FPU and the RF are widened. 4 operations can be issued per cycle: –2 memory (in consecutive memory addresses) –2 FPU (the same operation) less versatile (Figure: widened Bus, FPU and Register File; VLIW word with memory, FPU and others fields.)

1-XII-98 Micro-31 8 Software pipelined loops Loop performance is limited by –recurrences –resources Software pipelining overlaps the execution of several consecutive iterations. With perfect scheduling, at least one resource is occupied 100% of the time (unless the loop is recurrence-bound).
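To make the overlap concrete, here is a minimal C sketch of a software-pipelined loop (a hypothetical example, not from the slides): after a short prologue, each trip of the steady-state kernel executes one stage of three different iterations.

```c
/* A minimal software-pipelining sketch, assuming a hypothetical loop
 * body split into three stages (load, multiply, store). After the
 * prologue, the steady-state kernel keeps three iterations in flight. */
#include <stddef.h>

void scale_pipelined(float *c, const float *a, float k, size_t n) {
    if (n < 2) {                      /* too short to pipeline: fall back */
        for (size_t i = 0; i < n; i++) c[i] = k * a[i];
        return;
    }
    float x  = a[0];                  /* prologue: stage 1 of iteration 0 */
    float y  = k * x;                 /* stage 2 of iteration 0 */
    float x1 = a[1];                  /* stage 1 of iteration 1 */
    for (size_t i = 2; i < n; i++) {  /* kernel: 3 iterations overlapped */
        c[i - 2] = y;                 /* stage 3 of iteration i-2 */
        y  = k * x1;                  /* stage 2 of iteration i-1 */
        x1 = a[i];                    /* stage 1 of iteration i */
    }
    c[n - 2] = y;                     /* epilogue: drain the pipeline */
    c[n - 1] = k * x1;
}
```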

1-XII-98 Micro-31 9 How does widening work? 3 memory operations: loads A and B, and store C. A and B have stride 1 with themselves in the next iteration. 3 floating-point operations; + has a recurrence with itself in the next iteration. Let’s assume a latency of 2 cycles. (Figure: dependence graph with nodes A, B, two multiplies, +, C and D; recurrence distance 1.)
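A source loop consistent with this graph might look as follows (my reconstruction under stated assumptions; the slides do not give the code). D is taken as a loop-invariant multiplier and + as a running accumulation.

```c
/* Hypothetical loop matching the slide's dependence graph: loads of A
 * and B feed one multiply, a second multiply scales by the loop-invariant
 * d ("D"), the + carries a distance-1 self-recurrence, and the running
 * value is stored to C. */
#include <stddef.h>

void ddg_loop(float *c, const float *a, const float *b, float d, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float t = (a[i] * b[i]) * d;  /* two FP multiplies */
        acc = acc + t;                /* + depends on itself, distance 1 */
        c[i] = acc;                   /* store C */
    }
}
```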

1-XII-98 Micro-31 10 Execution of several iterations (Figure: schedule of Load A, Load B, *, *, +, Store C across consecutive iterations.) 1 bus + 1 FPU: 3 cycles/iteration. 2 buses + 2 FPU: 1.5 cycles/iteration, but 2 cycles are required due to the recurrence, so 2 cycles/iteration.
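In modulo-scheduling terms this is the standard lower bound on the initiation interval II (a standard formulation; the numbers are the slide's):

```latex
\[
  II \;\ge\; \max(\mathit{ResMII},\,\mathit{RecMII}), \qquad
  \mathit{ResMII} = \frac{3\ \text{memory ops}}{\#\text{buses}}, \quad
  \mathit{RecMII} = \frac{\text{latency}(+)}{\text{distance}} = \frac{2}{1} = 2 .
\]
\[
  \text{1 bus: } II \ge \max(3,\,2) = 3; \qquad
  \text{2 buses: } II \ge \max(1.5,\,2) = 2 \ \text{cycles/iteration}.
\]
```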

1-XII-98 Micro-31 11 “Compactable” operations (width 2) (Table, with reasons per López et al. ICS97: Load A and Load B compact: no dependency and stride 1. The two multiplies compact: no dependency. + does not: dependency. Store C does not: no dependency, but no stride 1.) Result: 2 cycles/iteration.
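In source terms, width-2 compaction amounts to fusing the stride-1 pairs into double-width operations. A sketch using GCC/Clang vector extensions (an illustrative stand-in for the hardware mechanism, reusing the hypothetical loop above):

```c
/* Width-2 compaction sketch (GCC/Clang vector extensions), assuming the
 * arrays are suitably aligned and n is even for brevity. The stride-1
 * loads of A and B and the independent multiplies compact into wide
 * operations; the + cannot, because of its distance-1 recurrence. */
#include <stddef.h>

typedef float v2f __attribute__((vector_size(8)));  /* two packed floats */

void ddg_loop_w2(float *c, const float *a, const float *b, float d, size_t n) {
    const v2f vd = {d, d};                 /* broadcast the invariant D */
    float acc = 0.0f;
    for (size_t i = 0; i + 1 < n; i += 2) {
        v2f va = *(const v2f *)&a[i];      /* wide load: A[i], A[i+1] */
        v2f vb = *(const v2f *)&b[i];      /* wide load: B[i], B[i+1] */
        v2f t  = va * vb * vd;             /* wide multiplies */
        acc = acc + t[0];                  /* serialized accumulation */
        c[i] = acc;
        acc = acc + t[1];
        c[i + 1] = acc;
    }
}
```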

1-XII-98 Micro-31 12 Limits on ILP Baseline configuration (1w1): 1 bus and 2 FPU. Configurations XwY: –X: degree of replication: X buses, 2*X FPU –Y: degree of widening (width of the resources) Characteristics of the architecture: –store is served in 1 cycle; division (19 cycles) and SQRT (27 cycles) are not pipelined –the rest are fully pipelined with a latency of 4 cycles

1-XII-98 Micro-31 13 Performance: Replication Workload: 1180 loops that account for 78% of the execution time of the Perfect Club benchmarks.

1-XII-98 Micro-31 14 Performance: Replication vs Widening

1-XII-98 Micro-31 15 Performance: Replication vs Widening vs Combined

1-XII-98 Micro-31 16 Scheduling and register assignment Loops have been software pipelined using HRMS (Llosa et al. MICRO-28), a register-pressure-sensitive heuristic. Register allocation has been performed using the wands-only strategy with end-fit and adjacency ordering (Rau et al. PLDI-92). When a loop requires more registers than are available, spill code is added.

1-XII-98 Micro-31 17 Register pressure Reducing the cycles required per iteration can increase the register requirements. Widening is also applied to the register file: –more storage capacity (and less register pressure) –not cheating! If there are no compactable operations, we do not benefit from this additional capacity.

1-XII-98 Micro-31 18 Effects of adding spill code Baseline: 1w1 with a 256-register RF.

1-XII-98 Micro-31 19 Area cost Cost of the FPU: widening and replication have the same cost. The area of the RF grows as the square of the number of ports.
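A first-order model of that observation (my sketch; the constant is illustrative, not the paper's):

```c
/* First-order RF area model: each extra port adds one wordline and one
 * bitline per cell, so cell area grows roughly as (ports + k)^2, while
 * widening and adding registers only scale storage linearly.
 * k models the fixed wires per cell and is an arbitrary constant. */
double rf_area(int num_regs, int width_bits, int ports) {
    const double k = 2.0;                        /* assumption */
    double cell = (ports + k) * (ports + k);     /* quadratic in ports */
    return num_regs * width_bits * cell;         /* linear in storage */
}
```

Under this model, replication (more ports) roughly quadruples cell area when the port count doubles, whereas widening merely doubles total area, which is the slide's point.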

1-XII-98 Micro-31 20 Register file access time Based on the CACTI model (Wilton & Jouppi, J. of Solid-State Circuits ’96) for cache memory. Normalised to configuration 1w1 with a 32-register RF and a technology of λ = 0.05. Widening the RF is cheaper than adding ports. Increasing the number of registers is cheaper than adding ports.

1-XII-98 Micro-31 21 Effect of the RF size Configuration 1w1.

1-XII-98 Micro-31 22 Effect of the studied techniques

1-XII-98 Micro-31 23 Cost of widening and replication Area: –replication: quadratic increase –widening: linear increase Cycle time: –the cycle-time increase from replication is greater than from widening –the RF can be partitioned into several copies, reducing the access time but increasing the area

1-XII-98 Micro-31 24 Performance/cost trade-off Configurations XwY(Z:n) where: –X is the replication degree –Y is the widening degree –Z is the RF size (32, 64, 128 or 256) –n is the number of blocks in which the RF has been partitioned
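A hypothetical decoding of this notation in code (the names are mine, not the paper's):

```c
/* XwY(Z:n): X-way replication, Y-wide resources, Z registers in n blocks. */
struct config {
    int x;  /* replication degree: x buses, 2*x FPUs */
    int y;  /* widening degree: width of bus, FPUs and registers */
    int z;  /* register-file size: 32, 64, 128 or 256 */
    int n;  /* number of blocks the RF is partitioned into */
};

/* Peak issue: x memory ops and 2*x FPU ops per cycle, each y wide. */
static inline int peak_mem_elems(const struct config *c) { return c->x * c->y; }
static inline int peak_fp_elems(const struct config *c)  { return 2 * c->x * c->y; }
```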

1-XII-98 Micro-31 25 Configurations that can be implemented We use the SIA (Semiconductor Industry Association) predictions. FPU + RF area cost must be smaller than 20% of the total chip area available. (Table: SIA roadmap per technology generation: feature size (µm), chip size (mm²), transistors per chip (×10⁶).)

1-XII-98 Micro-31 26 Implementable configurations (λ = 0.25 µm)

1-XII-98 Micro-31 27 Implementable configurations (λ = 0.18 µm)

1-XII-98 Micro-31 28 Implementable configurations (λ = 0.13 µm)

1-XII-98 Micro-31 29 Implementable configurations (λ = 0.10 µm)

1-XII-98 Micro-31 30 Implementable configurations (λ = 0.07 µm)

1-XII-98 Micro-31 31 FPU latency We compare configurations adapting the latency of the FPU to the processor cycle time. A configuration with a relative cycle time Tc belongs to the z-cycles model, where z = ⌈4/Tc⌉.
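A direct transcription of that formula (hypothetical helper name):

```c
#include <math.h>

/* z-cycles model: FPU latency re-expressed in processor cycles for a
 * configuration whose relative cycle time is tc (1.0 = baseline). */
int z_cycles_model(double tc) {
    return (int)ceil(4.0 / tc);     /* z = ceil(4 / Tc) */
}
/* e.g. tc = 1.0 gives z = 4; a slower clock, tc = 1.5, gives z = 3. */
```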

1-XII-98 Micro-31 32 Effect of the RF size The same configuration, changing the RF size. A large RF needs less spill code, but pays a large penalty in access time.

1-XII-98 Micro-31 33 Effect of the studied techniques (Plots: only replication; only widening.)

1-XII-98 Micro-31 34 XwY where X*Y = 8 The same RF size and peak performance. Combining small degrees of replication and widening results in the best performance.

1-XII-98 Micro-31 35 Top five configurations (i) The five configurations that achieve the best performance for λ = 0.18 µm are shown. Blue ones: those with the best performance/cost. In all the technology generations, the best ones use widening. (Plot: λ = 0.18 µm.)

1-XII-98 Micro-31 36 Top five configurations (ii) (Plots: λ = 0.13 µm and λ = 0.10 µm.)

1-XII-98 Micro-31 37 Conclusions Study of two techniques to extract ILP: replication and widening. Study of aggressive configurations in optimal conditions: –replication achieves the best performance –widening costs less Study of the cost of both techniques.

1-XII-98 Micro-31 38 Conclusions Applying small degrees of replication and widening results in the best performance under a technology limit: –widening has more storage capacity, hence less spill code –replication has larger area requirements, so some configurations become unimplementable –RF access time is shorter using replication than using widening

1-XII-98 Micro-31 39 RF area cost (Figure: register-file bit cell with read/write data lines and read/write select lines.)

1-XII-98 Micro-31 40 Unrolling the loop (Figure: the original dependence graph and the graph after unrolling by 2, with duplicated nodes A0/A1, B0/B1, the multiplies, C0/C1 and D; the + keeps its distance-1 recurrence.)
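In source form, unrolling the reconstructed loop by two produces the "0" and "1" copies shown in the graph (same assumptions as the earlier sketch; n taken as even for brevity):

```c
/* Unroll-by-2 of the reconstructed loop: iteration i yields the "0"
 * copies (A0, B0, *0, C0) and iteration i+1 the "1" copies; the +
 * chain remains a serialized distance-1 recurrence. */
#include <stddef.h>

void ddg_loop_unrolled(float *c, const float *a, const float *b,
                       float d, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i + 1 < n; i += 2) {
        float t0 = (a[i]     * b[i])     * d;   /* A0, B0, *0, *0 */
        float t1 = (a[i + 1] * b[i + 1]) * d;   /* A1, B1, *1, *1 */
        acc += t0;  c[i]     = acc;             /* +, store C0 */
        acc += t1;  c[i + 1] = acc;             /* +, store C1 */
    }
}
```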

1-XII-98 Micro-31 41 Compacting (Figure: the unrolled graph with compactable pairs merged: a wide load A0,1, a wide load B0,1 and compacted multiplies; the + chain and the stores C0 and C1 remain separate.)

1-XII-98 Micro-31 42 Execution of a compacted loop (Figure: schedule of the compacted graph on the widened architecture: Bus, FPU, Register File.)

1-XII-98 Micro-31 43 Limits A loop is bounded by recurrences and resources. Assume the basic architecture (1 bus and 1 FPU) with a latency of 2 cycles. (Figure: the unrolled dependence graph.)

1-XII-98 Micro-31 44 Limits: resources and recurrences (Figure: the unrolled dependence graph annotated with the resource and recurrence bounds.)

1-XII-98 Micro-31 45 Reducing the resource limits (Figure: the unrolled dependence graph after compaction relaxes the resource bound.)

1-XII-98 Micro-31 46 Effect of replication and widening 1w1: 3 cycles/it; 2w1: 2 cycles/it; 1w2: 2 cycles/it.

1-XII-98 Micro-31 47 Taxonomy of loops (Figure: loops classified by recurrences and compactability: compactable, non-compactable, don’t care.)

1-XII-98 Micro-31 48 Top five configurations (Plot: λ = 0.25 µm.)
