RAMP BLUE: Double-Floating Point Coprocessor. Mitch Harwell, David Tylman.



What is RAMP? Research Accelerator for Multiple Processors. Using multiple FPGAs on multiple BEE2 boards in a single chassis, RAMP is building a massive, parallel multiprocessor system.

Why RAMP? We have hit a “power wall”: power, and the dissipation of heat through the air, has become increasingly troublesome. Power is now expensive, while transistors are essentially free. We have reached an “ILP wall”, where diminishing returns mean ever more hardware is needed to squeeze the last ILP out of a design. We have also hit a “memory wall”, where memory latencies have become restrictive (200 clock cycles for a DRAM access versus 4 clocks for a multiply). Power wall + ILP wall + memory wall = brick wall. Because traditional uniprocessors will cease to exhibit the performance gains of the last three decades, other means of speeding up computation must be investigated, but the computer architecture community lacks the basic infrastructure tools required to carry out this research. RAMP will accelerate research across all the fields that touch multiple processors: operating systems, compilers, debuggers, programming languages, scientific libraries, and so on.

Design Decisions: The interface was chosen to minimize the time spent transferring data over the FSL bus.
• No acknowledgements or synchronization structures are used.
• The control information needed to drive the FPU is sent over the FSL_Control lines instead of in a fifth data word.
• This works under the assumption that the interface always expects four input words and two output words.
The hardware unit was designed to be as simple as possible.
• None of the units are pipelined, and only one functional unit (add/sub, mult, div, sqrt, comp, fx->fl, fl->fx) runs at a time.
• New values are not processed until the old values have finished calculating.

Software Shenanigans: gcc translates floating-point math operations into function calls. The operands are broken into four 32-bit words and sent one at a time over the FSL bus. With each data word, we also transmit a control bit to specify which operation to perform. We then stall the processor until the answer appears on the FSL bus.
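As a rough illustration of the software side, the following sketch shows how two double-precision operands could be split into four 32-bit words, each paired with one control bit. The opcode numbering, the MSB-first bit order, the high-word-first layout, and the function names are assumptions made for this example; the slides do not specify them.

```python
import struct

# Hypothetical opcode numbering; not taken from the slides.
OPCODES = {"add": 0, "sub": 1, "mul": 2, "div": 3, "sqrt": 4, "cmp": 5}

def double_to_words(x):
    """Split an IEEE 754 double into (high, low) 32-bit words."""
    return struct.unpack(">II", struct.pack(">d", x))

def words_to_double(high, low):
    """Rebuild a double from its high and low 32-bit words."""
    return struct.unpack(">d", struct.pack(">II", high, low))[0]

def frame_operation(op, a, b):
    """Return the four (data_word, control_bit) pairs for one request.

    The four control bits, one per word, spell out the opcode MSB first
    (an assumed convention), so no fifth data word is needed.
    """
    code = OPCODES[op]
    words = double_to_words(a) + double_to_words(b)
    bits = [(code >> shift) & 1 for shift in (3, 2, 1, 0)]
    return list(zip(words, bits))
```

For example, `frame_operation("mul", 1.5, 2.0)` yields four pairs whose control bits read 0, 0, 1, 0. The two result words coming back from the hardware would be reassembled with `words_to_double`.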

Hardware High-jinks

The Current Design: [Figure: diagram labeled MicroBlaze, FSL, and the states idle, read, crunch, write]

What has been accomplished: The software talks to the hardware as expected. The hardware captures the operands, performs the correct operations, and returns correct results. The software returns the hardware results as expected.

Benchmarks: We ran an FFT benchmark twice.
• Once on our DFPU hardware (6 minutes 17 seconds)
• Once with software routines (56 minutes 31 seconds)
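The speedup implied by these two timings can be worked out directly. The short calculation below uses only the figures quoted on the slide; the helper name `to_seconds` is just for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds timing to seconds."""
    return minutes * 60 + seconds

hardware = to_seconds(6, 17)    # DFPU run: 377 s
software = to_seconds(56, 31)   # software routines: 3391 s

speedup = software / hardware
print(f"Speedup: {speedup:.2f}x")  # prints "Speedup: 8.99x"
```

In other words, the hardware run finished roughly nine times faster than the software routines on this benchmark.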

What remains:
• Fully compliant IEEE 754 math units
• Multiple processors sharing one DFPU
• Pipelined design