
Architecture and Design Automation for Application-Specific Processors Philip Brisk, Assistant Professor, Dept. of Computer Science and Engineering, University of California, Riverside IEEE 9th International Conference on ASIC (ASICON), Xiamen, China, October 26, 2011

Acknowledgment The vast majority of slides in this presentation are taken from the Ph.D. Thesis of my friend and collaborator, Dr. Theo Kluter (Ph.D., EPFL, 2010)

Five Stage RISC Pipeline [Pipeline diagram: Fetch (I$) → Decode (RF read) → Execute → Memory (D$) → Write-back (RF write)]

Application-Specific Custom Unit (ASCU) for Instruction Set Extensions (ISEs) [Pipeline diagram: the five-stage pipeline from the previous slide, with the ASCU added in the Execute stage]

Automatic ISE Identification [Tool-flow diagram: Applications → Compiler → Assembly code with ISEs; the compiler also drives HW Synthesis, which produces the ASCU in the pipeline's Execute stage]

Overview
– Architecture
– Compilation and Synthesis
– Conclusion

Overview
– Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
– Compilation and Synthesis
– Conclusion

Example: Luminance Conversion in JPEG Compression
– 19 cycles in software
– 17-bit values
– Fixed-point
(A C sketch of the computation follows below.)
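For reference, a minimal C sketch of the computation, assuming the standard ITU-R BT.601 luminance coefficients (0.299, 0.587, 0.114); the slide's 17-bit fixed-point format is approximated here with a 16-bit fractional scale, so the exact bit widths may differ from the original design:

```c
#include <stdint.h>

/* RGB -> Y (luminance) in fixed point:
 *   Y = 0.299*R + 0.587*G + 0.114*B
 * Coefficients are scaled by 2^16 and the sum is rounded
 * before shifting back down. */
static inline uint8_t rgb_to_y(uint8_t r, uint8_t g, uint8_t b)
{
    const uint32_t CR = 19595;  /* round(0.299 * 65536) */
    const uint32_t CG = 38470;  /* round(0.587 * 65536) */
    const uint32_t CB =  7471;  /* round(0.114 * 65536) */
    return (uint8_t)((CR * r + CG * g + CB * b + 32768) >> 16);
}
```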

Custom Hardware Implementation
– One single-ported memory: 4 – 5 cycles (3 loads, 1 arithmetic, 1 store); Speedup: 3.8x – 4.8x
– Separate R, G, B, and Y memories: 1 cycle for everything; Speedup: 19x

Custom ISE Logic
Architectural Limitations:
– RF has 2 read ports
– RF has 1 write port
– Data must be loaded from memory into the RF
– RF I/O bandwidth
Performance: 7 cycles (3 loads, 2 ASCU, 1 store); Speedup: 3.1x
(A sketch of the two-instruction split follows below.)
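To see the port constraint concretely, here is a hypothetical split of the luminance ISE into two 2-input custom instructions, each respecting 2 RF reads and 1 RF write per operation; the instruction names and the exact partitioning are illustrative, not taken from the talk:

```c
#include <stdint.h>

/* Hypothetical 2-input/1-output custom instructions, emulated in C.
 * Together they compute the fixed-point luminance from the earlier
 * sketch without ever exceeding 2 RF reads / 1 RF write per op. */
static uint32_t ise_lum_rg(uint32_t r, uint32_t g)
{
    return 19595u * r + 38470u * g;               /* 0.299R + 0.587G */
}

static uint32_t ise_lum_b(uint32_t partial, uint32_t b)
{
    return (partial + 7471u * b + 32768u) >> 16;  /* + 0.114B, round */
}

/* Issue sequence as on the slide: 3 loads, 2 ASCU ops, 1 store. */
uint32_t luminance(uint32_t r, uint32_t g, uint32_t b)
{
    return ise_lum_b(ise_lum_rg(r, g), b);
}
```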

Overview
– Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
– Compilation and Synthesis
– Conclusion

I/O Bandwidth Constraint
– AES algorithm: a single round has 4 stages
– Best ISE: 22 inputs, 22 outputs [Verma, Brisk, and Ienne, CASES 2007 & TCAD 2010]
– RF I/O constraints cause a noticeable slowdown

Pipeline Forwarding [Jayaseelan et al., DAC 2006] (1 output)
I/O Bandwidth limitations:
– Input bandwidth depends on the number of pipeline stages
– Does not increase output bandwidth
– Complicates instruction scheduling

Register File Clustering [Karuri et al., ICCAD 2007] (4 inputs, 1 output)
I/O Bandwidth limitations:
– Input bandwidth depends on the number of clusters
– Does not increase output bandwidth
– Compiler must eliminate inter-cluster copies (an NP-Hard problem); more clusters => more copies

Shadow Registers [Cong et al., FPGA 2005] (1 output)
I/O Bandwidth:
– No limitation on input bandwidth
– Does not increase output bandwidth
– Increases ISA bitwidth

Overview
– Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
– Compilation and Synthesis
– Conclusion

Architecturally Visible Storage (AVS) [Biswas et al., DATE 2006, TCAD 2007]
– DMA transfers data between memory and the AVS
– Coherence problem between the AVS and the D$

Example: IDCT (from JPEG)

The Coherence Problem: the D$ and the AVS can each hold a copy of the same data, so a write through one is not visible through the other unless the two are explicitly kept coherent.

Overview
– Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
    – Coherent and Speculative DMA
    – Virtual Ways
    – Way Stealing
– Compilation and Synthesis
– Conclusion

Coherent DMA

Speculative DMA
– Coherent DMA loads and evicts the array from the AVS during each iteration
– Speculative DMA leaves the array resident in the AVS until it is overwritten in AVS memory by other data, or until the same data is read or written through the D$ (see the sketch below)
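The slide's policy can be paraphrased in code. The following toy model is my own illustration of the two behaviors, with invented helper names and a simplified single-region AVS; it is not Kluter's actual hardware protocol:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define REGION_WORDS 64
static uint32_t main_memory[1024];                 /* backing store    */
static uint32_t avs[REGION_WORDS];                 /* local AVS memory */
static struct { uint32_t base; bool valid; } region;

static void avs_load(uint32_t base) {              /* memory -> AVS    */
    memcpy(avs, &main_memory[base], sizeof avs);
    region.base = base;
    region.valid = true;
}

static void avs_writeback(void) {                  /* AVS -> memory    */
    if (region.valid)
        memcpy(&main_memory[region.base], avs, sizeof avs);
    region.valid = false;
}

/* Coherent DMA: transfer around every ISE execution (load, run, evict). */
static void coherent_dma_run_ise(uint32_t base) {
    avs_load(base);
    /* ... ISE reads/writes avs[] here ... */
    avs_writeback();
}

/* Speculative DMA: leave the region resident after the ISE runs, and
 * write it back only when a D$ access touches overlapping addresses
 * (or when other data needs the AVS). */
static void on_dcache_access(uint32_t addr) {
    if (region.valid &&
        addr >= region.base && addr < region.base + REGION_WORDS)
        avs_writeback();
}
```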

Virtual Ways

AVS vs. Traditional Cache

AVS and Cache Ways are Similar if AVS Memory has 1-input, 1-output

Way Stealing
– No AVS memories (reduced area)
– No coherence protocol

Coherent AVS Summary
Speculative DMA
– Requires a coherence protocol
  – Lots of bus traffic
  – Good solution for coherent multiprocessor systems
– No limit on AVS memory organization
– Uses standard cache IPs
Virtual Ways
– Requires a non-traditional cache controller
– No limit on AVS memory organization
Way Stealing
– Requires a non-traditional cache
– Number of ways limits the number of AVS memories
– All AVS memories have 1 input and 1 output
– Keeps AVS memories within the cache

Overview
– Architecture
– Compilation and Synthesis
  – ISE Identification Algorithms
– Conclusion

SW and HW Costs

Convex and Non-Convex Cuts: a cut is convex if no path between two of its nodes passes through a node outside the cut; non-convex cuts cannot be issued atomically as a single instruction.

Integrating AVS Memories

Single Cycle ISE Identification Problem
Legality Constraints:
– Convex cut
– Contains no forbidden nodes
– Number of inputs/outputs matches architectural constraints (e.g., 2 RF inputs, 1 RF output)
Objective:
– Find the legal cut that maximizes speedup
(A sketch of the legality check follows below.)
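To make the constraints concrete, here is a small self-contained C sketch (my illustration, not from the talk) that checks convexity and the I/O constraint for a candidate cut of a data-flow graph, assuming the graph is a DAG stored as an adjacency matrix:

```c
#include <stdbool.h>

#define MAX_N 64

int  n;                      /* number of DFG nodes            */
bool edge[MAX_N][MAX_N];     /* edge[u][v]: value flows u -> v */

/* DFS over the edge relation, forward or reverse. */
static void dfs(int u, bool reach[], bool reverse)
{
    for (int v = 0; v < n; v++) {
        bool e = reverse ? edge[v][u] : edge[u][v];
        if (e && !reach[v]) { reach[v] = true; dfs(v, reach, reverse); }
    }
}

/* A cut is convex iff no path leaves the cut and re-enters it,
 * i.e., no outside node is both a descendant and an ancestor
 * of cut nodes. */
static bool is_convex(const bool in_cut[])
{
    bool desc[MAX_N] = {false}, anc[MAX_N] = {false};
    for (int u = 0; u < n; u++)
        if (in_cut[u]) { dfs(u, desc, false); dfs(u, anc, true); }
    for (int w = 0; w < n; w++)
        if (!in_cut[w] && desc[w] && anc[w]) return false;
    return true;
}

/* Inputs: distinct external producers feeding the cut.
 * Outputs: cut nodes whose value is used outside the cut. */
static bool io_ok(const bool in_cut[], int max_in, int max_out)
{
    int in = 0, out = 0;
    for (int u = 0; u < n; u++) {
        bool feeds = false, used = false;
        for (int v = 0; v < n; v++) {
            if (edge[u][v] &&  in_cut[v] && !in_cut[u]) feeds = true;
            if (edge[u][v] && !in_cut[v] &&  in_cut[u]) used  = true;
        }
        if (feeds) in++;
        if (used)  out++;
    }
    return in <= max_in && out <= max_out;
}

static bool is_legal(const bool in_cut[], const bool forbid[],
                     int max_in, int max_out)
{
    for (int u = 0; u < n; u++)
        if (in_cut[u] && forbid[u]) return false;
    return is_convex(in_cut) && io_ok(in_cut, max_in, max_out);
}
```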

Algorithms for ISE Identification
Optimal (exponential worst-case runtime):
– Branch-and-bound search
– Integer Linear Program formulation
Iterative improvement:
– Evolutionary algorithms
– Simulated annealing
Polynomial-time heuristics

Branch-and-Bound Search Example [Atasu et al., DAC 2003]

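The following skeleton, written as a continuation of the legality-check listing above (same file), gives the flavor of a branch-and-bound enumeration in the spirit of Atasu et al., DAC 2003; it is a simplified sketch, not the published algorithm:

```c
#include <string.h>

int  sw_cost[MAX_N];         /* estimated cycles saved per node; >= 0 */
bool forbidden[MAX_N];

static int  best_gain;
static bool best_cut[MAX_N];

/* Branch on node i: first exclude it, then (if allowed) include it. */
static void search(bool in_cut[], int i, int gain,
                   int max_in, int max_out)
{
    if (i == n) {                          /* leaf: evaluate the cut   */
        if (gain > best_gain &&
            is_legal(in_cut, forbidden, max_in, max_out)) {
            best_gain = gain;
            memcpy(best_cut, in_cut, sizeof best_cut);
        }
        return;
    }
    /* Bound: even taking every remaining node cannot beat the best. */
    int optimistic = gain;
    for (int j = i; j < n; j++)
        optimistic += sw_cost[j];
    if (optimistic <= best_gain)
        return;

    in_cut[i] = false;                     /* branch 1: exclude node i */
    search(in_cut, i + 1, gain, max_in, max_out);
    if (!forbidden[i]) {                   /* branch 2: include node i */
        in_cut[i] = true;
        search(in_cut, i + 1, gain + sw_cost[i], max_in, max_out);
        in_cut[i] = false;
    }
}
```

The published algorithm prunes far more aggressively, for example by rejecting partial cuts as soon as the port constraints can no longer be met rather than only at the leaves, which is what keeps the exponential search practical on real basic blocks.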

ISE Identification Algorithms
Conferences: [Kastner et al., ICCAD 2001] [Brisk et al., CASES 2002] [Sun et al., ICCAD 2002] [Lee et al., ICCAD 2002] [Atasu et al., DAC 2003] [Goodwin and Petkov, DATE 2003] [Peymandoust et al., ASAP 2003] [Clark et al., MICRO 2003] [Sun et al., ICCAD 2003] [Lee et al., ISLPED 2003] [Cong et al., FPGA 2004] [Biswas et al., DAC 2004] [Yu and Mitra, DAC 2004] [Borin et al., ESTIMedia 2004] [Kastens et al., LCTES 2004] [Yu and Mitra, CASES 2004] [Pozzi and Ienne, CASES 2005] [Biswas et al., DAC 2005] [Atasu et al., CODES-ISSS 2005] [Sun et al., VLSI Design 2005] [Biswas et al., DATE 2006] [Galuzzi et al., CODES-ISSS 2006] [Sun et al., VLSI Design 2006] [Wong et al., HiPEAC 2007] [Verma et al., CASES 2007] [Pothineni et al., CDES 2007] [Pothineni et al., VLSI Design 2007] [Bonzini and Pozzi, DATE 2007] [Atasu et al., DATE 2007] [Noori et al., DATE 2007] [Galuzzi et al., SAMOS 2007] [Galuzzi et al., ARC 2007] [Bonzini and Pozzi, ASAP 2007] [Wolinski and Kuchcinski, ASAP 2007] [Yu and Mitra, FPL 2007] [Bennet et al., LCTES 2007] [Verma et al., ASPDAC 2008] [Wolinski and Kuchcinski, DATE 2008] [Galuzzi and Bertels, ARC 2008] [Atasu et al., ASAP 2008] [Galuzzi and Bertels, ReConFig 2008] [Pothineni et al., VLSI Design 2008] [Galuzzi et al., DATE 2009] [Martin et al., ASAP 2009] [Martin et al., SAMOS 2009] [Kamal et al., ASAP 2010] [Pothineni et al., VLSI Design 2010] [Ahn et al., ASPDAC 2011] [Xiao and Casseau, GLS-VLSI 2011] [Xiao and Casseau, ASAP 2011] [Ahn et al., CODES-ISSS 2011]
Journals: [Atasu et al., IJPP 2003] [Clark et al., IJPP 2003] [Sun et al., TCAD 2004] [Clark et al., TCOMP 2005] [Pozzi et al., TCAD 2006] [Biswas et al., TVLSI 2006] [Sun et al., TCAD 2006] [Sun et al., TVLSI 2006] [Biswas et al., TCAD 2007] [Chen et al., TCAD 2007] [Sun et al., TCAD 2007] [Lee et al., TODAES 2007] [Bonzini and Pozzi, TVLSI 2008] [Zhao et al., IEICE Trans. Fund. 2008] [Atasu et al., TCAD 2008] [Murray et al., TECS 2009] [Verma et al., TCAD 2010] [Galuzzi and Bertels, TRETS 2011]

Overview
– Architecture
– Compilation and Synthesis
– Conclusion
  – Summary and Future Research Directions

Conclusion
ASIP Architecture:
– Supplying data bandwidth to the ASCU
– Ensuring coherence when using local memories
ISE Identification:
– Problem formulation is well understood
– Extensions are needed to support memory operations
– Many effective algorithms exist

Future ASIP Research Directions
Parallel and Multi-core ASIPs:
– Balance ISE speedup across many threads
– ISE identification for parallel models of computation
  – Concurrent state machines
  – Synchronous Data Flow / Kahn Process Networks
  – MapReduce
Identify ASCUs for Current AND Future Applications:
– Some ISEs are not known at design time
– Must build generality or programmability into the ASCU
Application-Specific GPUs:
– Identify vectorized and threaded ISEs
– One ASCU shared by hundreds of near-identical threads running concurrently