CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit (presented by Kynan Fraser)

DISCLAIMER This presentation is based on a paper by Zhi Alex Ye, Andreas Moshovos, Scott Hauck and Prithviraj Banerjee, with the same title as these slides. All proposals, implementation, testing, results, figures and tables are the work of those authors. The slides themselves were produced by me for educational purposes.

Outline
● Background
● Introduction
● Chimaera architecture
● Compiler support
● Related work (not covered)
● Evaluation
  - Methodology
  - Modelling RFUOP latency
  - RFUOP analysis
  - Working set of RFUOPs
  - Performance measurements
● Summary

Background
● Customized hardware vs. flexibility
● Benefits vs. risks
● Is a reconfigurable solution the answer?
● Target domain: multimedia platforms

Introduction
● Tightly coupled RFU (Reconfigurable Functional Unit)
● Implements application-specific operations
● Up to 9 inputs, 1 output per operation
● Fairly simple compiler
● CHIMAERA = reconfigurable hardware + compiler

Introduction – Potential Advantages
● Reduce execution time of dependent instructions
  - tmp = R2 - R3; R5 = tmp + R1
● Reduce dynamic branch count
  - if (a > 88) a = b + 3
● Exploit sub-word parallelism
  - a = a + 3; b = c << 2 (a, b, c halfwords)
● Reduce resource contention
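To make the first three patterns concrete, here is a minimal C sketch built directly from the slide's own snippets. The function names are hypothetical; each function body is a group of instructions that Chimaera's compiler could, in principle, collapse into one multi-input, single-output RFUOP.

```c
/* Hypothetical illustration of the optimization patterns above. Each
   function body is a candidate group for a single RFUOP. */

/* 1. Dependent instructions: two dependent ALU ops collapse into one
      3-input, 1-output RFUOP computing r5 = (r2 - r3) + r1. */
int collapse_dependent(int r1, int r2, int r3) {
    int tmp = r2 - r3;   /* instruction 1 */
    return tmp + r1;     /* instruction 2, depends on instruction 1 */
}

/* 2. Branch elimination: the compare and the conditional add can be
      evaluated as one predicated RFUOP, removing a dynamic branch. */
int remove_branch(int a, int b) {
    if (a > 88)          /* this branch disappears inside the RFUOP */
        a = b + 3;
    return a;
}

/* 3. Sub-word parallelism: two independent halfword operations can be
      computed side by side in the reconfigurable array. */
void subword(short *a, short *b, short c) {
    *a = (short)(*a + 3);
    *b = (short)(c << 2);
}
```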

Chimaera Architecture
● Reconfigurable Array: programmable logic blocks
● Shadow Register File
● Execution Control Unit
● Configuration Control and Caching Unit

Chimaera Architecture
● Each RFUOP has a unique ID
● In-order commit
● Single-issue RFUOP scheduler
● Worst case: 23 transistor levels through one logic block

Compiler Support
● Automatically maps groups of instructions to RFUOPs (sketched below)
● Analyses data-flow graphs (DFGs)
● Schedules across branches
● Identifies sub-word parallelism (disabled here because correctness could not be guaranteed)
● Later: how many instructions can actually be mapped to RFUOPs
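The mapping step can be illustrated with a simplified sketch. This is not the paper's actual algorithm; it is a hypothetical greedy pass (all names invented here) that grows a candidate RFUOP along a basic block while the group still fits the 9-input, single-output constraint.

```c
/* Simplified sketch of growing a candidate RFUOP along a basic block:
   keep absorbing instructions while the group has at most 9 distinct
   external inputs. Names and structure are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_SRCS 3

typedef struct {
    int dest;             /* destination register */
    int src[MAX_SRCS];    /* source registers, -1 = unused */
} Insn;

/* Count registers read by insns[0..n-1] that are not produced inside
   the group: these are the RFUOP's external inputs. */
static int external_inputs(const Insn *insns, int n) {
    bool produced[64] = { false }, counted[64] = { false };
    int inputs = 0;
    for (int i = 0; i < n; i++) {
        for (int s = 0; s < MAX_SRCS; s++) {
            int r = insns[i].src[s];
            if (r >= 0 && !produced[r] && !counted[r]) {
                counted[r] = true;
                inputs++;
            }
        }
        produced[insns[i].dest] = true;
    }
    return inputs;
}

/* Grow the group from the start of the block while it fits the 9-input
   constraint. A real mapper would also check that only the final value
   is live-out and that every op is implementable in the array. */
int grow_rfuop(const Insn *insns, int n) {
    int len = 0;
    while (len < n && external_inputs(insns, len + 1) <= 9)
        len++;
    return len;
}

int main(void) {
    /* tmp = r2 - r3; r5 = tmp + r1  ->  one 3-input RFUOP */
    Insn block[] = {
        { 4, { 2, 3, -1 } },   /* r4 = r2 op r3 */
        { 5, { 4, 1, -1 } },   /* r5 = r4 op r1 */
    };
    printf("mappable prefix: %d instructions\n", grow_rfuop(block, 2));
    return 0;
}
```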

Related Work
● Not covered in this presentation
● See Section 4 of the paper for details

Configuration (simulator configuration table; not reproduced in this transcript)

Evaluation – Methodology
● Execution-driven timing simulation
● Built on top of SimpleScalar
● ISA: an extension of MIPS
● RFUOPs appear as NOPs under the base MIPS ISA
● Uses the configuration from the previous slide
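The slide does not say how RFUOPs are encoded so that they read as NOPs; one plausible illustration (purely an assumption, not the paper's encoding) is to hide the RFUOP id in an instruction that writes to $zero, whose result a plain MIPS core silently discards.

```c
/* Hedged sketch of the "RFUOPs look like NOPs" idea. The actual
   Chimaera encoding is not given on the slide; this ADDIU-to-$zero
   form is just one way such an encoding could work. */
#include <stdint.h>

#define OP_ADDIU 0x9u   /* MIPS ADDIU opcode (no overflow trap) */

static uint32_t encode_rfuop(uint16_t rfuop_id) {
    /* addiu $zero, $zero, rfuop_id : result written to $0, dropped */
    uint32_t op = OP_ADDIU, rs = 0, rt = 0;
    return (op << 26) | (rs << 21) | (rt << 16) | rfuop_id;
}

static int is_rfuop(uint32_t insn) {
    /* A Chimaera-aware decoder recognizes the pattern and dispatches
       the low 16 bits (the RFUOP id) to the RFU instead. */
    return (insn >> 16) == (OP_ADDIU << 10);  /* opcode=ADDIU, rs=rt=$0 */
}
```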

Evaluation – Modelling RFUOP Latency
● First row of the latency table: modelled on the critical path of the original instruction sequence
● Second row: modelled on transistor levels and their delays
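As a rough illustration of the second model: if one processor cycle accommodates some fixed budget of transistor levels, the RFUOP latency is the ceiling of the path length over that budget. The budget below is an assumed number, not one taken from the slides.

```c
/* Hedged sketch of the transistor-level latency model. The per-cycle
   budget is an assumption for illustration only. */
#define LEVELS_PER_CYCLE 24   /* assumed transistor levels per cycle */

static int rfuop_latency_cycles(int levels) {
    /* ceiling division: levels on the RFUOP's longest path -> cycles */
    return (levels + LEVELS_PER_CYCLE - 1) / LEVELS_PER_CYCLE;
}
/* Example: one logic block's worst case of 23 levels -> 1 cycle. */
```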

Evaluation – RFUOP Analysis
● Total number of RFUOPs per benchmark
● Frequency of instruction types mapped to RFUOPs

Evaluation – RFUOP Analysis
● How many instructions does each RFUOP replace?
  - dest = src1 op src2 op src3
  - 3- or 4-input, 1-output operations are the most common
● Critical path length of the instructions replaced

Evaluation – Working Set of RFUOPs
● A larger working set means more stalls to load configurations
● Caching the 4 most recently used configurations yields almost no misses (see the sketch below)
● 16 rows are sufficient
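The caching behaviour the slide describes acts like a small move-to-front (MRU) list. A minimal C sketch, with illustrative names and a fixed capacity of 4 configurations:

```c
/* Minimal sketch of an MRU configuration cache: keep the N most
   recently used RFUOP configurations loaded; a miss means stalling
   while the configuration is fetched into the array. Structure and
   names are illustrative, not from the paper. */
#include <string.h>

#define CACHED_CONFIGS 4   /* "4 MRU: almost no misses" */

typedef struct {
    int ids[CACHED_CONFIGS];   /* slot 0 = most recently used */
    int count;
} ConfigCache;

/* Returns 1 on a hit, 0 on a miss (which would stall the pipeline
   while the configuration is loaded). */
int lookup_config(ConfigCache *c, int rfuop_id) {
    for (int i = 0; i < c->count; i++) {
        if (c->ids[i] == rfuop_id) {        /* hit: move to front */
            memmove(&c->ids[1], &c->ids[0], i * sizeof(int));
            c->ids[0] = rfuop_id;
            return 1;
        }
    }
    /* miss: evict the least recently used slot, insert at front */
    if (c->count < CACHED_CONFIGS) c->count++;
    memmove(&c->ids[1], &c->ids[0], (c->count - 1) * sizeof(int));
    c->ids[0] = rfuop_id;
    return 0;
}
```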

Evaluation – Performance Measurements

● Performance under original instruction-timing latencies (4-issue)
● A latency of 2 cycles or better still gives a speedup of 11% or more
● A 3-cycle latency is not worthwhile (speedup on only one benchmark)
● The N model improves performance overall, due to the reduced branch count and reduced resource contention

Evaluation – Performance Measurements
● Performance under transistor timing (4-issue)
● Improvements of 21% even under the most conservative transistor timing
● Performance under the optimistic models is close to the 1-cycle model (the upper bound)

Evaluation – Performance Measurements

● Performance with 8-issue
● Improvements only under the C, 1, 2 and N timing models
● Improvements relative to 4-issue are small
● Reason: issue is limited to one RFUOP per cycle

Evaluation – Performance Measurements

● Strong correlation between performance improvement and the number of branches replaced by RFUOPs
● Benchmarks with the lowest branch reduction show the lowest speedup
● Even under pessimistic assumptions, Chimaera still provides improvements

Summary
● The CHIMAERA architecture
● A C compiler that generates RFUOPs
● Maps sequences of instructions into a single instruction (9-input/1-output)
● Can eliminate control-flow instructions
● Can exploit sub-word parallelism (not evaluated here)
● On average, 22% of all instructions map to RFUOPs
● A variety of computations are mapped
● Studied under a variety of configurations and timing models

Summary
● 4-issue: average 21% speedup under pessimistic transistor timing
● 8-issue: average 11% speedup under pessimistic transistor timing
● 4-issue: average 28% speedup under reasonable transistor timing
● 8-issue: average 25% speedup under reasonable transistor timing