(A) Approved for public release; distribution is unlimited.
Experience and Results Porting HPEC Benchmarks to MONARCH
Lloyd Lewins & Kenneth Prager, Raytheon

(A) Approved for public release; distribution is unlimited.
Experience and Results Porting HPEC Benchmarks to MONARCH
Lloyd Lewins & Kenneth Prager
Raytheon Company, 2000 E. El Segundo Blvd, El Segundo, CA
High Performance Embedded Computing (HPEC) Workshop, 23–25 September 2008

Page 2 (9/23/08): Overview of HPEC Benchmarks
The HPEC benchmarks provide a means to quantitatively evaluate high performance embedded computing (HPEC) systems. They address important operations across a broad range of DoD signal and image processing applications:
– Finite Impulse Response (FIR) Filter
– QR Factorization
– Singular Value Decomposition
– Pattern Matching
– Corner Turn, etc.
Each benchmark includes documentation, uniprocessor C code, MATLAB reference code, and defined problem sizes.

Page 3 (9/23/08): Overview of MONARCH
– 6 RISC processors
– 12 MBytes on-chip DRAM
– 2 DDR2 external memory interfaces (8 GB/s bandwidth)
– Flash port (32 MB)
– 2 Serial RapidIO ports (1.25 GB/s each)
– 16 IFL ports (2.6 GB/s each)
– On-chip ring: 40 GB/s
– Reconfigurable array: FPCA (64 GFLOPS)

Page 4 (9/23/08): Benchmark Selection
Transpose (corner turn)
– 50x5000 and 750x5000
– Transpose to/from external DRAM
Constant False Alarm Rate detection (CFAR)
– 16x64x24, 48x3500x128, 48x1909x64, and 16x9900x16
– Few operations per datum – bandwidth limited
– Larger datasets in external DRAM, smaller in EDRAM
QR Factorization
– 500x100, 180x60, and 150x150
– Givens rotation (more complex)
– Many 2x2 matrix multiplies (but simple)
Note: results for FIR and FFT were previously reported.

Page 5 (9/23/08): MONARCH Mapping Issues
Bandwidth limitations
– External DRAM (DDR2): 4.7 GByte/s peak per port (64-bit 333 MHz DDR, less overhead); only one port populated on the test board
– Implementation issues:
  – EDRAM bank conflict bug – no simultaneous read/write
  – PBuf to Node-Bus arbitration – unloads one word every 3 clocks (cuts the 10.6 GByte/s PIRX bandwidth down to 3.6 GByte/s)
  – DDR2 latency versus MMBT pipeline depth – limits reads to 3.8 GByte/s
Partitioning
Algorithm selection
– "Fast" Givens versus "regular" Givens
Reciprocal/square root
– Synthesized using Newton-Raphson

Page 6 (9/23/08): Corner Turn Benchmark
Hierarchical block transpose
– FPCA handles the 32x8 inner block (uses 16 MEM elements)
– EDRAM contains 32x2528 blocks
– ANBI streams into 32x8 blocks
– MMBT transfers 32x2528 blocks to/from DDR2
Alignment issues
– MMBT/DDR2 interactions require transferring 32 words at a time for peak performance
– Total transpose was therefore 768x5056 (3.5% larger)
Performance issues
– A single FPCA transpose engine limits bandwidth to 1.3 GByte/s
– Eliminating the bank conflict bug and adding a second DDR2 port would allow three transpose engines (3.6 GByte/s), at which point PBuf/Node-Bus arbitration becomes the limit
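The hierarchical scheme above can be sketched in NumPy: a small fixed-size inner block (32x8 on the slide, the FPCA's job) is transposed independently and written to its mirrored position, and the blocks compose into the full transpose. The function name and software framing are illustrative; the real implementation streams blocks through EDRAM, the ANBI, and the MMBT rather than indexing arrays.

```python
import numpy as np

def block_transpose(src, block_rows=32, block_cols=8):
    """Transpose a matrix block by block: transpose each fixed-size inner
    block and write it to the mirrored block position in the output.
    Block shape mirrors the slide's 32x8 inner block (illustrative)."""
    rows, cols = src.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    dst = np.empty((cols, rows), dtype=src.dtype)
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            # Each inner block is transposed independently, then written
            # to its mirrored location in the output array.
            dst[c:c + block_cols, r:r + block_rows] = \
                src[r:r + block_rows, c:c + block_cols].T
    return dst
```

This is also why the 3.5% padding on the slide appears: a hardware transfer engine that moves whole 32-word blocks needs both dimensions rounded up to block multiples.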

Page 7 (9/23/08): Corner Turn Implementation
[Block diagram: data flows between DDR_A, the MMBT, EDRAM, the ANBI, and the FPCA transpose engine.]
FPCA – Field Programmable Computer Array; ANBI – Array Node Bus Interface; MMBT – Memory Block Transfer; DDR_A – Double Data Rate DRAM interface A; EDRAM – Embedded DRAM

Page 8 (9/23/08): Corner Turn Results
[Table: measured performance, predicted performance with a second DDR2 bank available, and predicted performance in the absence of the bank conflict bug.]
Note: these are end-to-end bandwidths; since each word is both read and written, the achieved memory bandwidth is 2x this figure.
Bandwidth is in bytes per second.
DDR2 – Double Data Rate DRAM interface; BCB – Bank Conflict Bug (EDRAM)

Page 9 (9/23/08): Constant False Alarm Rate Benchmark
Multiple CFAR engines implemented in the FPCA
– Limited by the number of EDRAMs available to feed them
Smaller datasets stored in EDRAM
– Six CFAR engines – 14 GFLOPS
Larger datasets stored in DDR2
– Three CFAR engines (because of the bank conflict bug)
– Further limited by DDR2 bandwidth – 6.2 GFLOPS (12.4 GFLOPS with two DDR ports)
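The CFAR kernel itself is simple, which is why the benchmark ends up bandwidth-bound: each output is one compare against a scaled average of nearby reference cells. A minimal 1-D cell-averaging sketch follows; the guard width, reference width, and threshold are illustrative parameters, not the benchmark's.

```python
import numpy as np

def cfar_1d(cells, guard=4, ref=8, threshold=2.0):
    """Cell-averaging CFAR along a 1-D power profile: a cell is declared
    a detection when it exceeds `threshold` times the mean of the `ref`
    reference cells on either side, skipping `guard` cells next to it."""
    n = len(cells)
    hits = np.zeros(n, dtype=bool)
    for i in range(n):
        # Reference windows on each side of the cell under test,
        # clipped at the array edges.
        lo = cells[max(0, i - guard - ref): max(0, i - guard)]
        hi = cells[i + guard + 1: i + guard + 1 + ref]
        refs = np.concatenate([lo, hi])
        if refs.size and cells[i] > threshold * refs.mean():
            hits[i] = True
    return hits
```

Note the arithmetic per cell: a handful of adds and one multiply-compare per output, so the engines on the slide are fed (or starved) entirely by memory bandwidth.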

Page 10 (9/23/08): CFAR Implementation
[Block diagram: datasets flow from DDR_A through the MMBT into EDRAM, and the ANBI streams them into the CFAR engines in the FPCA.]
FPCA – Field Programmable Computer Array; ANBI – Array Node Bus Interface; MMBT – Memory Block Transfer; DDR_A – Double Data Rate DRAM interface A; EDRAM – Embedded DRAM

Page 11 (9/23/08): Constant False Alarm Rate Results
[Table: measured performance, predicted performance with a second DDR2 bank available, and predicted performance in the absence of the bank conflict bug.]
DDR2 – Double Data Rate DRAM interface; BCB – Bank Conflict Bug (EDRAM)

Page 12 (9/23/08): QR Factorization Benchmark
Single QR engine implemented in the FPCA
– Uses a high percentage of FPCA resources
– Multiple streams to/from memory
Performance limited by bandwidth to EDRAM
– The classic "fast Givens" algorithm requires even more streams to/from EDRAM
– The issue is not FLOPS, but bandwidth
Calculating the Givens rotation requires square root and reciprocal
– Implemented in the FPCA using Newton-Raphson
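For reference, classical Givens QR zeroes one subdiagonal entry at a time by rotating a pair of rows, which is exactly the rotation-generation-plus-2x2-multiply structure the slides map onto the FPCA. A dense NumPy sketch of the textbook algorithm (not the streamed MONARCH implementation) might look like:

```python
import numpy as np

def givens_qr(A):
    """QR factorization by classical Givens rotations: for each column j,
    zero A[i, j] (i > j) by rotating rows i-1 and i with a 2x2 rotation.
    Generating (c, s) needs the square root and reciprocal the slides
    synthesize with Newton-Raphson; applying it is a 2x2 multiply."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)          # sqrt(a^2 + b^2)
            if r == 0.0:
                continue
            c, s = a / r, b / r         # reciprocal of r, twice
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]      # 2x2 row update
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T    # accumulate Q
    return Q, R
```

Each rotation touches two full rows, so the work is dominated by streaming rows past the 2x2 multiplier, matching the slide's point that EDRAM bandwidth, not FLOPS, is the limit.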

Page 13 (9/23/08): QR Factorization Implementation
[Block diagram: the ANBI streams data between EDRAM and the FPCA QR engine, which pairs a low-rate, operation-heavy rotation-generation stage with a stream of simple 2x2 multiplies.]
FPCA – Field Programmable Computer Array; ANBI – Array Node Bus Interface; EDRAM – Embedded DRAM

Page 14 (9/23/08): QR Factorization Results
[Table: measured performance and predicted performance in the absence of the bank conflict bug.]
BCB – Bank Conflict Bug (EDRAM)

Page 15 (9/23/08): Reciprocal/Square Root
The FPCA doesn't support division or square root directly. A number of approaches were considered, including CORDIC. Newton-Raphson works surprisingly well, even for floating-point numbers:
– Uses a few small lookup tables
– Integer arithmetic extracts the exponent and mantissa
– Floating-point arithmetic iterates the estimate
– Fully pipelined

Page 16 (9/23/08): Reciprocal Calculation (Math)
Newton-Raphson: to compute 1/y, given an estimate x_i of 1/y, a better estimate x_{i+1} is given by:

  x_{i+1} = x_i (2 − y · x_i)

Split the number into exponent (plus sign) and mantissa. Use a LUT (512 entries, indexed by the 9 exponent-plus-sign bits) to compute the reciprocal of the exponent, and a second LUT (256 entries, indexed by the 8 most significant mantissa bits) to seed the reciprocal of the mantissa. Apply Newton-Raphson twice to refine the mantissa reciprocal (yielding more than 23 bits), and finally multiply the resulting mantissa and exponent reciprocals.
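The LUT-seeded iteration can be modeled in a few lines of floating-point Python mirroring the slide's structure: split y into mantissa and exponent, seed 1/mantissa from a coarse 256-entry table, refine with two Newton-Raphson steps, and fold the exponent back in. The table indexing here is illustrative; the FPCA version operates on the raw IEEE 754 bit fields with integer arithmetic.

```python
import numpy as np

def nr_reciprocal(y, iters=2):
    """Approximate 1/y via Newton-Raphson: x_{i+1} = x_i * (2 - y * x_i).
    Seed from a coarse 256-entry mantissa table (simulated here); the
    exponent contributes exactly by negation, as in the slide's scheme."""
    m, e = np.frexp(y)                       # y = m * 2**e, 0.5 <= |m| < 1
    # Simulated 256-entry LUT: index by the top 8 mantissa bits and seed
    # with the reciprocal of that table cell's midpoint (illustrative).
    idx = int((abs(m) - 0.5) * 512)          # 0..255
    seed_m = 0.5 + (idx + 0.5) / 512.0       # cell midpoint in [0.5, 1)
    x = np.copysign(1.0 / seed_m, m)         # coarse estimate of 1/m
    for _ in range(iters):                   # quadratic convergence:
        x = x * (2.0 - m * x)                # error roughly squares per step
    return np.ldexp(x, -e)                   # 1/y = (1/m) * 2**(-e)
```

Two iterations suffice because the seed is already accurate to about 3 decimal digits, and each step roughly squares the relative error, passing the 23-bit single-precision mantissa after the second step.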

Page 17 (9/23/08): Reciprocal Calculation (Implementation)
[FPCA dataflow graph: the input is split into exponent and mantissa fields, fed through the lookup tables and two Newton-Raphson stages built from ALU, SSC, MEM, and MUL elements, with delay (D) elements balancing the pipeline; the refined mantissa and exponent reciprocals are multiplied to produce the result.]

Page 18 (9/23/08): Comparison to Other Architectures
[Table: corner turn (GBytes/s), CFAR (GFLOP/s), and QR (GFLOP/s) for PPC-G4, Xeon, RAW-16, RAW-64, MONARCH (1 DDR port, with the bank conflict bug), and MONARCH (2 DDR ports, without the bank conflict bug).]
Note: RAW-64 performance is projected.

Page 19 (9/23/08): Conclusion
Several interesting HPEC benchmarks were successfully implemented on MONARCH, and MONARCH's performance is very competitive with other published HPEC results.
The benchmarks are all bandwidth limited:
– Partitioning focuses on optimizing data movement
– Buffer data in EDRAM to ensure sequential DDR accesses
– Select the algorithm that is "bandwidth friendly"
– "It's the data movement, stupid!"
Reciprocal/square root were readily synthesized from existing FPCA resources, demonstrating the flexibility of the FPCA.
Working around the errata of the current chip added challenge!
This work was supported by the NRO.