1
RAMP BLUE: Double-Floating Point Coprocessor. Mitch Harwell, David Tylman
2
What is RAMP? Research Accelerator for Multiple Processors. With multiple FPGAs on multiple BEE2 boards in a single chassis, RAMP is building a massively parallel multiprocessor system.
3
Why RAMP? We have hit a “power wall”: power consumption has become increasingly troublesome, as has dissipating heat through the air. Power has become expensive, while transistors are essentially free. We have reached an “ILP wall”: the law of diminishing returns means ever more hardware is needed to squeeze the last ILP out of a design. We have also hit a “memory wall”: memory latencies have become restrictive (200 clock cycles for a DRAM access versus 4 clocks for a multiply). Power wall + ILP wall + memory wall = brick wall. Because traditional uniprocessors will cease to exhibit the performance gains of the last three decades, it is necessary to investigate other means of speeding up computation, but the computer architecture community lacks the basic infrastructure tools required to carry out this research. RAMP will accelerate research across all the fields that touch multiple processors: operating systems, compilers, debuggers, programming languages, scientific libraries, and so on.
4
Design Decisions The interface was chosen to minimize the time spent transferring data over the FSL bus. No acknowledgements or synchronization structures are used. The control information for the FPU is sent over the FSL_Control lines instead of in a 5th data word. This works under the assumption that the interface always expects four input words and two output words. The hardware unit was designed to be as simple as possible. None of the units are pipelined, and only one functional unit (add/sub, mult, div, sqrt, comp, fx->fl, fl->fx) runs at a time. New values are not processed until the old values have finished calculating.
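The constraints above (four words in, two words out, one operation at a time, no pipelining) can be modeled on a host machine in a few lines of C. This is only a behavioral sketch, not the authors' hardware: the state names are borrowed from the labels on the design slide (idle, read, crunch, write), while the opcode numbering, the rule that the four control bits form an opcode with the first bit as the most significant, and the high-word-first operand order are all assumptions made for illustration. Only four of the listed operations are modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode numbering; the slides do not give the real encoding. */
enum dfpu_op { OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_DIV = 3 };

enum dfpu_state { ST_IDLE, ST_READ, ST_CRUNCH, ST_WRITE };

struct dfpu {
    enum dfpu_state state;
    uint32_t in[4];   /* assumed order: A high, A low, B high, B low */
    uint32_t out[2];  /* assumed order: result high, result low */
    unsigned opcode;  /* assembled from the four control bits */
    int n_in, n_out;
};

static double words_to_double(uint32_t hi, uint32_t lo) {
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static void double_to_words(double d, uint32_t *hi, uint32_t *lo) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    *hi = (uint32_t)(bits >> 32);
    *lo = (uint32_t)bits;
}

void dfpu_reset(struct dfpu *u) {
    memset(u, 0, sizeof *u);
    u->state = ST_IDLE;
}

/* Run the single selected functional unit, then move to the write state. */
static void dfpu_crunch(struct dfpu *u) {
    double a = words_to_double(u->in[0], u->in[1]);
    double b = words_to_double(u->in[2], u->in[3]);
    double r;
    switch (u->opcode) {
    case OP_ADD: r = a + b; break;
    case OP_SUB: r = a - b; break;
    case OP_MUL: r = a * b; break;
    case OP_DIV: r = a / b; break;
    default:     r = 0.0;   break;  /* operations not modeled in this sketch */
    }
    double_to_words(r, &u->out[0], &u->out[1]);
    u->n_out = 0;
    u->state = ST_WRITE;
}

/* Offer one input word plus its control bit. Returns 0 if the unit is busy:
 * new values are not accepted until the old result has been read out. */
int dfpu_push(struct dfpu *u, uint32_t word, unsigned ctrl) {
    if (u->state == ST_CRUNCH || u->state == ST_WRITE)
        return 0;
    if (u->state == ST_IDLE) {
        u->n_in = 0;
        u->opcode = 0;
        u->state = ST_READ;
    }
    assert(u->n_in < 4);
    u->in[u->n_in++] = word;
    u->opcode = (u->opcode << 1) | (ctrl & 1u);
    if (u->n_in == 4) {
        u->state = ST_CRUNCH;
        dfpu_crunch(u);
    }
    return 1;
}

/* Fetch one result word. Returns 0 if no result is available yet. */
int dfpu_pop(struct dfpu *u, uint32_t *word) {
    if (u->state != ST_WRITE)
        return 0;
    *word = u->out[u->n_out++];
    if (u->n_out == 2)
        u->state = ST_IDLE;
    return 1;
}
```

A caller would reset the model with `dfpu_reset`, push four words with `dfpu_push`, and then read two words with `dfpu_pop`. A fifth push before the result is drained is refused, which mirrors the "one operation at a time" decision.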
5
Software Shenanigans gcc translates floating-point math operations into function calls. The operands are broken into four 32-bit words and sent one at a time over the FSL bus. Along with each data word, we transmit a control bit to specify which operation to perform. We stall the processor until the answer appears on the FSL bus.
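The software path described above might look roughly like the following C sketch. Everything named here is hypothetical: `fsl_put` and `fsl_get` are host-side stand-ins for the blocking FSL accesses (they are not the Xilinx macros), `dfpu_add` and `dfpu_mul` stand in for whatever routines gcc's floating-point calls were redirected to, and the opcode values, the high-word-first ordering, and the most-significant-bit-first control encoding are assumptions. The stand-in bus computes the result itself so that the sketch can run without hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcodes; the real control-bit encoding is not in the slides. */
enum { DFPU_OP_ADD = 0, DFPU_OP_MUL = 2 };

static void split_double(double d, uint32_t *hi, uint32_t *lo) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    *hi = (uint32_t)(bits >> 32);
    *lo = (uint32_t)bits;
}

static double join_double(uint32_t hi, uint32_t lo) {
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* ---- Host-side stand-in for the FSL link and the coprocessor ---- */
static uint32_t mock_in[4], mock_out[2];
static unsigned mock_ctrl, mock_count, mock_out_pos;

static void fsl_put(uint32_t word, unsigned ctrl) {
    assert(mock_count < 4);
    mock_in[mock_count++] = word;
    mock_ctrl = (mock_ctrl << 1) | (ctrl & 1u);
    if (mock_count == 4) {  /* all operands arrived: compute the answer */
        double a = join_double(mock_in[0], mock_in[1]);
        double b = join_double(mock_in[2], mock_in[3]);
        double r = (mock_ctrl == DFPU_OP_MUL) ? a * b : a + b;
        split_double(r, &mock_out[0], &mock_out[1]);
        mock_count = 0;
        mock_ctrl = 0;
        mock_out_pos = 0;
    }
}

static uint32_t fsl_get(void) {
    /* On real hardware this read would stall the processor until data arrives. */
    assert(mock_out_pos < 2);
    return mock_out[mock_out_pos++];
}

/* ---- Software side: four words out, two words back ---- */
static double dfpu_call(unsigned opcode, double a, double b) {
    uint32_t words[4];
    split_double(a, &words[0], &words[1]);
    split_double(b, &words[2], &words[3]);
    for (int i = 0; i < 4; i++) {
        /* One control bit rides along with each data word. */
        unsigned ctrl = (opcode >> (3 - i)) & 1u;
        fsl_put(words[i], ctrl);
    }
    uint32_t hi = fsl_get();
    uint32_t lo = fsl_get();
    return join_double(hi, lo);
}

double dfpu_add(double a, double b) { return dfpu_call(DFPU_OP_ADD, a, b); }
double dfpu_mul(double a, double b) { return dfpu_call(DFPU_OP_MUL, a, b); }
```

With this in place, `dfpu_add(1.25, 2.5)` sends four words, reads two back, and reassembles them into a double. Carrying the opcode on the control bits keeps each transaction at four transfers instead of five, which is the trade-off the design slide describes.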
6
Hardware High-jinks
7
The Current Design [Diagram labels: MicroBlaze; FSL; states idle, read, crunch, write]
8
What has been accomplished The software talks to the hardware as is expected. The hardware captures the operands, performs the correct operations, and returns correct results as expected. The software returns the hardware results as expected.
9
Benchmarks We ran an FFT benchmark twice: once on our DFPU hardware (6 minutes 17 seconds) and once with software routines (56 minutes 31 seconds). That is 377 seconds versus 3391 seconds, a speedup of roughly 9x.
11
What remains: fully compliant IEEE 754 math units; multiple processors sharing one DFPU; a pipelined design.