MIAOW: An Open Source RTL Implementation of a GPGPU

Slides:



Advertisements
Similar presentations
FPGA (Field Programmable Gate Array)
Advertisements

ECOE 560 Design Methodologies and Tools for Software/Hardware Systems Spring 2004 Serdar Taşıran.
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Exploring Memory Consistency for Massively Threaded Throughput- Oriented Processors Blake Hechtman Daniel J. Sorin 0.
Lecture 6: Multicore Systems
Supporting x86-64 Address Translation for 100s of GPU Lanes Jason Power, Mark D. Hill, David A. Wood UW-Madison Computer Sciences 2/19/2014.
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
Department of Computer Science iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf Karthikeyan Sankaralingam Vertical.
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
System Simulation Of 1000-cores Heterogeneous SoCs Shivani Raghav Embedded System Laboratory (ESL) Ecole Polytechnique Federale de Lausanne (EPFL)
GPU System Architecture Alan Gray EPCC The University of Edinburgh.
Using Hardware Vulnerability Factors to Enhance AVF Analysis Vilas Sridharan RAS Architecture and Strategy AMD, Inc. International Symposium on Computer.
GPGPU Introduction Alan Gray EPCC The University of Edinburgh.
Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.
Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.
Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.
What Great Research ?s Can RAMP Help Answer? What Are RAMP’s Grand Challenges ?
2015/6/21\course\cpeg F\Topic-1.ppt1 CPEG 421/621 - Fall 2010 Topics I Fundamentals.
Modern trends in computer architecture and semiconductor scaling are leading towards the design of chips with more and more processor cores. Highly concurrent.
Students:Gilad Goldman Lior Kamran Supervisor:Mony Orbach Part A Presentation Network Sniffer.
Climate Machine Update David Donofrio RAMP Retreat 8/20/2008.
Experiences Implementing Tinuso in gem5 Maxwell Walter, Pascal Schleuniger, Andreas Erik Hindborg, Carl Christian Kjærgaard, Nicklas Bo Jensen, Sven Karlsson.
XMT-GPU A PRAM Architecture for Graphics Computation Tom DuBois, Bryant Lee, Yi Wang, Marc Olano and Uzi Vishkin.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
© 2011 Xilinx, Inc. All Rights Reserved Intro to System Generator This material exempt per Department of Commerce license exception TSU.
8/16/2015\course\cpeg323-08F\Topics1b.ppt1 A Review of Processor Design Flow.
Out-of-Order OpenRISC 2 semesters project Semester A: Implementation of OpenRISC on XUPV5 board Final A Presentation By: Vova Menis-Lurie Sonia Gershkovich.
GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British.
Petros OikonomakosBashir M. Al-Hashimi Mark Zwolinski Versatile High-Level Synthesis of Self-Checking Datapaths Using an On-line Testability Metric Electronics.
Out-of-Order OpenRISC 2 semesters project Semester A: Implementation of OpenRISC on XUPV5 board Midterm Presentation By: Vova Menis-Lurie Sonia Gershkovich.
BY: ALI AJORIAN ISFAHAN UNIVERSITY OF TECHNOLOGY 2012 GPU Architecture 1.
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor Mark Gebhart 1,2 Stephen W. Keckler 1,2 Brucek Khailany 2 Ronny Krashinsky.
Revisiting Kirchhoff Migration on GPUs Rice Oil & Gas HPC Workshop
By Arun Bhandari Course: HPC Date: 01/28/12. GPU (Graphics Processing Unit) High performance many core processors Only used to accelerate certain parts.
1 Rapid Estimation of Power Consumption for Hybrid FPGAs Chun Hok Ho 1, Philip Leong 2, Wayne Luk 1, Steve Wilton 3 1 Department of Computing, Imperial.
Design Verification An Overview. Powerful HDL Verification Solutions for the Industry’s Highest Density Devices  What is driving the FPGA Verification.
ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.
J. Christiansen, CERN - EP/MIC
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
1 Latest Generations of Multi Core Processors
An Architecture and Prototype Implementation for TCP/IP Hardware Support Mirko Benz Dresden University of Technology, Germany TERENA 2001.
FPGA-based Fast, Cycle-Accurate Full System Simulators Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil University of Texas at Austin.
Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? Wasim Shaikh Date: 10/29/2015.
 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh
An Improved “Soft” eFPGA Design and Implementation Strategy
1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion November 11, 2015.
Chapter 11 System-Level Verification Issues. The Importance of Verification Verifying at the system level is the last opportunity to find errors before.
Sunpyo Hong, Hyesoon Kim
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
Real-Time System-On-A-Chip Emulation.  Introduction  Describing SOC Designs  System-Level Design Flow  SOC Implemantation Paths-Emulation and.
My Coordinates Office EM G.27 contact time:
Synergy.cs.vt.edu Online Performance Projection for Clusters with Heterogeneous GPUs Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA)
GPGPU Performance and Power Estimation Using Machine Learning Gene Wu – UT Austin Joseph Greathouse – AMD Research Alexander Lyashevsky – AMD Research.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
DAC50, Designer Track, 156-VB543 Parallel Design Methodology for Video Codec LSI with High-level Synthesis and FPGA-based Platform Kazuya YOKOHARI, Koyo.
Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
Programmable Hardware: Hardware or Software?
Ph.D. in Computer Science
Leiming Yu, Fanny Nina-Paravecino, David Kaeli, Qianqian Fang
Implementation of Efficient Check-pointing and Restart on CPU - GPU
ECE 553: TESTING AND TESTABLE DESIGN OF DIGITAL SYSTES
A Review of Processor Design Flow
INTRODUCTION TO COMPLEX PROGRAMMABLE LOGIC
Matlab as a Development Environment for FPGA Design
Combining Simulators and FPGAs “An Out-of-Body Experience”
Computer Architecture: A Science of Tradeoffs
Measuring the Gap between FPGAs and ASICs
CSE 502: Computer Architecture
Presentation transcript:

MIAOW: An Open Source RTL Implementation of a GPGPU Vinay Gangadhar, Raghu Balasubramaniam, Mario Drumond, Ziliang Guo, Jai Menon, Cherin Joseph, Robin Prakash, Sharath Prasad, Pradip Vallathol, Karu Sankaralingam www.miaowgpu.org Vertical Research Group University of Wisconsin - Madison

MIAOW Open Source GPGPU MIAOW - Many-core Integrated Accelerator Of Wisconsin AMD Southern Islands ISA-based GPGPU Transformative for Academic GPU research Contribution to Industry MIAOW as a Research tool – RTL codebase, Verification and Simulation toolchain Support for workloads First publicly available open source GPGPU - MIAOW Embodies its principles that it can be transformative for GPU research in Academia with RTL implementation We also envision that this research tool could be of help to industry in applying new research ideas and exploring the uarch impact

Outline Open Source GPGPU Micro-Architecture Realism Research Flexibility Conclusion

MIAOW has 32 Compute Units (CUs) MIAOW Overview MIAOW has 32 Compute Units (CUs)

Hardware Organization Compute Unit In-order + Vector core Single Issue 40 16-wide vector ALUs LSU – Memory operations Wavefronts -- Hybrid of an in-order + vector core all fed by single instruction supply with MT support

ISA Summary 95 instructions – AMD Southern Islands ISA No Graphics support support The instructions selected are based on application profiling , operations in datapath and practically implementable by a small design team Program control – predication and branch instructions 32 bit data support

MIAOW Design Approach (a) Full ASIC Design (b) Mapped to FPGA (c) Hybrid Design Spectrum of Implementation strategizies and tradeoffs MIAOW follows Shaded portion indicates Register file and SRAM storage FPGA has resource constraints Low Flexibility, High Cost, High Realism Medium Flexibility, Low Cost, Long Design Time, Medium Realism High Flexibility, Low Cost, Short Design Time, Flexible Realism

Outline Open Source GPGPU Micro-Architecture Realism Research Flexibility Conclusion

MIAOW Realism MIAOW Kaveri No graphics and texture support in MIAOW -- MIAOW should be a realistic implementation of a GPU resembling principles and implementation of industry GPU -- Area and Power estimation done to assure that our design decisions reflect that Kaveri

Realism – Software Compatibility Runs unmodified OpenCL programs All OpenCL benchmarks Many Rodinia benchmarks Easily extendable to add any missing instruction from ISA

Realism – FPGA Synthesis Xilinx Virtex 7 based Maps 1 CU Explores feasibility of Design Benchmark prototyping – Ongoing work - ASIC Chip not built yet, but we do have a synthesized version of FPGA Has embedded microblaze soft core processor connected to Compute unit via AXI interface 300K LUTs and 1K Block RAMs

Outline Open Source GPGPU Micro-Architecture Realism Research Flexibility Conclusion

MIAOW enabled findings MIAOW enabled findings MIAOW enabled findings Research Flexibility Direction Research Idea MIAOW enabled findings New Directions Circuit-Failure Prediction (Aged SDMR) Implemented entirely in µarch Works elegantly in GPUs Small area, power overheads Timing Speculation (TS) Quantifies error-rate on GPU Direction Research Idea MIAOW enabled findings Validation of Simulator studies Transient Fault Injection RTL Level Fault Injection More Gray area than CPUs Silent data corruption seen Direction Research Idea MIAOW enabled findings Traditional µarch Thread-block compaction (TBC) Implemented TBC in RTL Significant design complexity Increase in Critical Path length Ultra-threaded Dispatcher modified Micro-architecture impacted Compute Units modified Micro-architectural Gates + Delay elements impacted Physical design perspective to traditional uarch It should be flexible enough to accommodate research studies of various types Compare the implementation to simulator studies Compute Units + Storage modified Delay elements impacted

Research Flexibility Direction Research Idea MIAOW enabled findings Traditional µarch Thread-block compaction (TBC) Implemented TBC in RTL Significant design complexity Increase in Critical Path length New Directions Circuit-Failure Prediction (Aged SDMR) Implemented entirely in µarch Works elegantly in GPUs Small area, power overheads Timing Speculation (TS) Quantifies error-rate on GPU Physical design perspective to traditional uarch It should be flexible enough to accommodate research studies of various types Compare the implementation to simulator studies Validation of Simulator studies Transient Fault Injection RTL Level Fault Injection Silent data corruption seen

Conclusion www.miaowgpu.org MIAOW provides transformative capability for GPU research More community support  First Open Source Silicon GPU Chip Can it help kick-start an Open Source hardware movement? Are Open Source hardware chips feasible? -- It should be flexible enough to accommodate research studies of various types www.miaowgpu.org

Back Up Slides -- It should be flexible enough to accommodate research studies of various types

Area Estimates Total Area: 15 mm2 SRAM based RF: 9mm2

Power Estimates Total Power: 1.1 W

Performance Estimates Compared to NVIDIA Fermi 1-SM GPU CPI close on 3 benchmarks CPI DMin DMax BinS BSort MatT PSum Red SLA Scalar 1 3 Vector 6 5.4 2.1 3.1 5.5 Memory 100 14.1 3.8 4.6 6.0 6.8 Overall 1 100 5.1 1.2 1.7 3.6 4.4 3.0 NVIDIA 1 - 20.5 1.9 2.1 8 4.7 7.5

Verification Methodology Emulator – Multi2sim Heterogeneous Simulator