MEMOCode 2007 Design Contest – MIT Submission N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan.



Resources
- Five insufficiently busy grad students
- Three weeks
  - Nine man-weeks used
- Bluespec expertise
  - Easy parameterization / fast concurrency
- The promise of food

Basic Facts
- Matrix multiply is embarrassingly parallel
  - More multipliers and adders should help
- Matrices are too large to be stored in FPGA memory
- Time was short, so the design needed to be partitioned to make use of all designers
  - Latency-insensitive methodology

Outline
- The Problem
- Partitioning the Computation
- Architectural Overview
- Implementation
- Results
- Things We Wish We Could Do

The Standard N³ Algorithm

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
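The triple loop above can be exercised directly in software; a minimal, self-contained sketch (the fixed N = 2 and the function name `matmul` are chosen here for illustration, not taken from the submission):

```c
#include <assert.h>

#define N 2

/* Standard O(N^3) matrix multiply: c += a * b */
static void matmul(int c[N][N], const int a[N][N], const int b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Note that c is accumulated into, not overwritten, so the caller must zero it first; the same convention appears in the hardware's explicit Zero instruction later in the deck.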

…and blocking is well understood…

Split the i and j loops into a block index (ib, jb) and an offset (io, jo); this reduces memory traffic:

for (int ib = 0; ib < N; ib += K)
    for (int io = 0; io < K; io++)
        for (int jb = 0; jb < N; jb += K)
            for (int jo = 0; jo < K; jo++)
                for (int kb = 0; kb < N; kb += K)
                    for (int k = 0; k < K; k++)
                        c[ib+io][jb+jo] += a[ib+io][kb+k] * b[kb+k][jb+jo];

Then swap the loops so the offset loops sit innermost; the inner three loops form the K×K×K kernel:

for (int ib = 0; ib < N; ib += K)
    for (int jb = 0; jb < N; jb += K)
        for (int kb = 0; kb < N; kb += K)
            // Kernel
            for (int io = 0; io < K; io++)
                for (int jo = 0; jo < K; jo++)
                    for (int k = 0; k < K; k++)
                        c[ib+io][jb+jo] += a[ib+io][kb+k] * b[kb+k][jb+jo];
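The transformation can be checked by confirming that a blocked loop nest produces the same result as the naive one. A small sketch (N = 4, K = 2, and the function names are illustrative choices; K is assumed to divide N):

```c
#include <assert.h>

#define N 4
#define K 2  /* block size; assumed to divide N */

/* Naive reference: c += a * b */
static void matmul_ref(int c[N][N], const int a[N][N], const int b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Blocked version: the inner three loops are the K x K x K kernel */
static void matmul_blocked(int c[N][N], const int a[N][N], const int b[N][N]) {
    for (int ib = 0; ib < N; ib += K)
        for (int jb = 0; jb < N; jb += K)
            for (int kb = 0; kb < N; kb += K)
                for (int io = 0; io < K; io++)
                    for (int jo = 0; jo < K; jo++)
                        for (int k = 0; k < K; k++)
                            c[ib+io][jb+jo] += a[ib+io][kb+k] * b[kb+k][jb+jo];
}
```

Because addition commutes, the blocked nest performs exactly the same multiply-accumulate operations as the naive nest, just in a block-friendly order.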

Outline
- The Problem
- Partitioning the Computation
- Architectural Overview
- Implementation
- Results
- Things We Wish We Could Do

Hardware Facts
- If we accelerate the computation, DRAM access becomes the bottleneck
- The CPU has slow access to DRAM
  - Hardware can directly access DRAM via the PLB (Processor Local Bus)

Hardware Facts
- CPU-to-HW memory bandwidth is bounded at 150 MB/s
  - With software overhead in data orchestration, probably only 50% of this bandwidth can be used
- The memory bus supports 800 MB/s
  - A direct interface can provide up to a 5x improvement over software transfer
- The special hardware need not be complicated, because the memory access patterns are simple
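The "up to 5x" claim follows from simple arithmetic on the slide's bandwidth figures; a sketch (the constant and function names are illustrative):

```c
#include <assert.h>

/* Bandwidth figures from the slide, in MB/s */
#define CPU_TO_HW_BW   150.0  /* CPU-mediated transfer path */
#define SW_EFFICIENCY    0.5  /* orchestration overhead: ~50% usable */
#define MEMORY_BUS_BW  800.0  /* what the memory bus supports */

/* Effective software-driven bandwidth after orchestration overhead */
static double effective_sw_bw(void) { return CPU_TO_HW_BW * SW_EFFICIENCY; }

/* Speedup of direct DRAM access over the CPU-mediated path */
static double direct_speedup(void) { return MEMORY_BUS_BW / CPU_TO_HW_BW; }
```

800 / 150 ≈ 5.3, which the slide rounds down to "up to a 5x improvement"; against the 75 MB/s effectively usable by software, the gap is larger still.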

High-Level Architecture

[Diagram: the CPU and several functional units connected through interconnection logic to the PLB and DRAM]

Architecture

[Diagram: the CPU drives a Controller through a Feeder; a Switch connects the functional units to a PLB Master, which accesses DRAM over the PLB]

Software Example (C = A × B)

[Diagram: the CPU issues Ld A 0, Ld B 0, St C 0, and MAC 0 instructions; blocks of A and B flow from DRAM through the PLB Master and Switch to a functional unit, and the C block flows back]

In reality, the execution of several blocks will be overlapped.

Outline
- The Problem
- Partitioning the Computation
- Architectural Overview
- Implementation
- Results
- Things We Wish We Could Do

Functional Unit – Design
- Instructions:
  - Load operand (memory)
  - Store operand (memory)
  - Zero (C = 0)
  - Multiply-Add-Accumulate (C += A*B)
- Two FSMs (Read/Write and Compute)
  - Allows overlapping of instructions

Functional Unit – Algorithm
- Take the algorithm and unroll the loop P iterations
- Adder tree of P inputs
  - Critical path grows logarithmically
- Can be pipelined
  - Complicated because of parameterization

for (int i = 0; i < K; i++)
    for (int j = 0; j < K; j++)
        for (int k = 0; k < K; k++)
            c[i][j] += a[i][k] * b[k][j];
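The unrolled kernel computes P products per step; summed as a chain, the accumulation needs P−1 sequential adds, while arranging the adds as a tree makes the depth log₂(P). A software sketch of the idea (P = 8 and the function name `mac_step` are illustrative; the real design is parameterized Bluespec, not C):

```c
#include <assert.h>

#define P 8  /* unroll factor; assumed a power of two here */

/* One kernel step: multiply P operand pairs in parallel, then
 * reduce the P products with a log2(P)-depth adder tree (the
 * critical path is 3 add levels here, versus 7 for a chain). */
static int mac_step(const int a[P], const int b[P]) {
    int t[P];
    for (int i = 0; i < P; i++)
        t[i] = a[i] * b[i];        /* P parallel multipliers */
    for (int stride = P / 2; stride > 0; stride /= 2)  /* tree levels */
        for (int i = 0; i < stride; i++)
            t[i] = t[i] + t[i + stride];
    return t[0];
}
```

In hardware each tree level is a candidate pipeline stage, which is where the slide's parameterization difficulty comes in: the number of stages depends on P.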

Functional Unit – Algorithm
- Different algorithm
  - Reorders the multiplies
  - Writes c[j][k] multiple times
- Unroll by P
  - Same number of adders and multipliers
  - Shorter critical path
- Pipelining is easy
  - 2 stages

for (int i = 0; i < K; i++)
    for (int j = 0; j < K; j++)
        for (int k = 0; k < K; k++)
            c[j][k] += a[i][k] * b[j][i];
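As written, the reordered loops accumulate c[j][k] += a[i][k] * b[j][i], i.e. Σᵢ b[j][i]·a[i][k]: the same product with the operand roles exchanged, so feeding the operands swapped yields the original result. A small check of this reading (K = 3 and the function names are illustrative):

```c
#include <assert.h>

#define K 3

/* Original kernel: c += a * b */
static void kernel_orig(int c[K][K], const int a[K][K], const int b[K][K]) {
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            for (int k = 0; k < K; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Reordered kernel from the slide: each c[j][k] is written K times,
 * accumulating sum_i b[j][i] * a[i][k] -- the product with the
 * operand roles exchanged. */
static void kernel_reord(int c[K][K], const int a[K][K], const int b[K][K]) {
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            for (int k = 0; k < K; k++)
                c[j][k] += a[i][k] * b[j][i];
}
```

The payoff for hardware is in the inner dimension: unrolling k by P turns the update into P independent multiply-adds into distinct c[j][k] entries, giving the shorter critical path the slide mentions.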

FU Microarchitecture

Memory Bus Master (PLB)
- 32-bit bus interface
- 16-word burst transfers
  - Amortize bus setup costs
- DRAM may refresh during a transfer
  - Added a burst buffer for rapid recovery

Memory Bus Master (PLB)
- Half of the critical path runs through the bus arbiter
  - Beyond our control
- Substantial retiming needed
  - Register pushing
  - State decoupling
- Need fine-grained control over scheduling

Outline
- The Problem
- Partitioning the Computation
- Architectural Overview
- Implementation
- Results
- Things We Wish We Could Do

Design Parameters
- Architecture: number of functional units
- Functional unit: degree of parallelism, matrix size
- Memory bus (PLB) master: matrix memory layout, matrix size
- Switch: number of functional units
- Algorithm generator: block size

Final Results
- 100 MHz
- 1 functional unit
  - 64×64 sub-blocks
  - 8 complex multiplies
- Lines of code – 10K total
  - Unit testing framework – 1.5K
  - C code – 2K
  - BSV – 5.5K
  - Multiple FU implementations – 1K
  - Additional unused hardware – 1K
- More than 3 GOps/s

Performance

[Plot: execution time (µs) versus matrix size]

Things We Would Have Done with More Time
- We believe we could have obtained 10 billion ops per second
- 32-bit PLB → 64-bit PLB
  - Doubles memory bandwidth; a fairly simple improvement
- Multiple clock domains
  - Implemented, but we had trouble synthesizing it in EDK
- Play with the number of FUs / registers per FU
  - The hardware is parameterized for this
- Explore alternative machine organizations
- Algorithmic exploration

Fin