Leveraging Multicomputer Frameworks for Use in Multi-Core Processors
Yael Steinsaltz, Scott Geaghan, Myra Jean Prelle, Brian Bouzas, Michael Pepe
High Performance Embedded Computing Workshop, September 21, 2006
© 2005 Mercury Computer Systems, Inc.

Outline
- Introduction
- Channelizer Problem
- Preliminary Results
- Summary

Multi-Core Processors
Multi-core processors vary in architecture, from 2-4 identical cores (Intel Xeon, Freescale 8641) to a single manager with several workers on a die (the IBM Cell Broadband Engine™ (BE) processor).
Focusing on the IBM Cell BE processor, and using the standard presented in re.org, we implemented an API, the Multi-Core Framework (MCF).
MCF is applicable across architectures as long as one process acts as a manager; more established APIs would work as well.

Multi-Core Framework
- MCF is based on Mercury's prior implementation of a product named Parallel Acceleration System (PAS).
- Distributed data flows in a manager-worker fashion, enabling concurrent I/O and parallel processing.
- Function-offload model: the user programs both the manager and the workers.
- MCF simplifies development.
- Local store (LS) memory is used efficiently (the MCF kernel consumes < 5%).
- Tasks run on the SPEs without Linux® overhead (thread creation is bypassed).
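The function-offload model above can be sketched in miniature. The real MCF API is C on the Cell BE; the sketch below is illustrative only, with Python threads standing in for SPE workers and every name (`worker_kernel`, `manager`) hypothetical rather than part of MCF.

```python
from concurrent.futures import ThreadPoolExecutor

def worker_kernel(strip):
    # Stand-in for the SPE-side math the user would write
    # (e.g., windowing + FFT on one strip of samples).
    return sum(x * x for x in strip)

def manager(data, n_workers=4):
    # The manager partitions the data set, offloads one strip per
    # worker, and gathers the partial results -- the function-offload
    # pattern in which the user programs both manager and workers.
    strip_len = len(data) // n_workers
    strips = [data[i * strip_len:(i + 1) * strip_len]
              for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_kernel, strips))
    return sum(partials)
```

The shape is the point: one coordinating process, a team of workers running the same user-supplied kernel, and a gather step on the manager side.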

Data Movement
- Multi-buffered strip mining of N-dimensional data sets between a large main memory (XDR) and the small worker memories.
- Provides for overlap and duplication when distributing data, as well as for different partitionings.
- Data re-organization enables easy transfer of data between local stores.
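A minimal sketch of the strip-mining partitioner, assuming a 1-D data set; the function name is illustrative, not the MCF API, and the multi-buffering/DMA machinery is elided.

```python
def strip_mine(data, strip_len, overlap):
    """Partition a 1-D data set into worker-sized strips.

    Consecutive strips share `overlap` samples, mirroring the
    overlap/duplication MCF provides when distributing data from the
    large main memory (XDR) to the small worker local stores.
    """
    step = strip_len - overlap
    return [data[i:i + strip_len]
            for i in range(0, len(data) - overlap, step)]

# Example: 8-sample strips advancing by 4 samples (2:1 overlap).
# Each worker would receive one strip, multi-buffered so the next
# strip streams in while the current one is being processed.
strips = strip_mine(list(range(16)), strip_len=8, overlap=4)
```

With `overlap = 3 * strip_len // 4` the same partitioner yields the 75%-overlap (4:1) framing used by the channelizer later in the deck.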


Objective and Motivation
Objective: Develop a Cell BE based real-time signal acquisition system, composed of frequency channelizers and signal detectors, in a single ~6U slot.
Motivation: Benchmark computational density among PPCs, FPGAs, and the Cell BE for a typical streaming application.

The Channelizer Problem
- FM3TR signal (hopping, multi-waveform, multiband).
- Channelization using a 16K real FFT with 75% overlap of the input (computation is signal independent).
- Simple thresholding to detect the active channels (computation is data dependent).

Channelizer Problem
The signal acquisition system separates a wide radio-frequency band into a set of narrow frequency bands.
Implementation specifications:
- 4:1 overlap buffer: 16K sample buffer -> 8K complex FFT.
- Blackman window (embedded multipliers).
- Log-magnitude.
- Threshold: an adjustable register and comparator determine detections.
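The per-frame pipeline on this slide (window, transform, log-magnitude, threshold) can be sketched as follows. This is a toy sketch: a naive O(N²) DFT stands in for the 16K FFT, the 4:1 overlap buffering is omitted, and the function names are mine, not the implementation's.

```python
import cmath
import math

def blackman(n, N):
    # Standard Blackman window coefficient.
    return (0.42 - 0.5 * math.cos(2 * math.pi * n / (N - 1))
                 + 0.08 * math.cos(4 * math.pi * n / (N - 1)))

def channelize(frame, threshold_db):
    """One frame: window -> DFT -> log-magnitude -> threshold.

    Returns the channel bins whose log-magnitude exceeds the
    (adjustable) detection threshold, as in the slide's
    register-and-comparator stage.
    """
    N = len(frame)
    windowed = [frame[n] * blackman(n, N) for n in range(N)]
    spectrum = [sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))
                for k in range(N // 2)]       # real input: keep N/2 bins
    log_mag = [20 * math.log10(abs(X) + 1e-12) for X in spectrum]
    return [k for k, db in enumerate(log_mag) if db > threshold_db]
```

A pure tone centered on bin k shows up as a detection at (and, because of the window's main lobe, immediately around) bin k, while quiet bins stay below the threshold.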

Data Flow and Work Distribution
[Diagram: a manager thread of execution coordinates teams of workers that perform data-parallel math. Input data flows to the channelizer workers; channelizer output feeds a high-speed alarm (HSA) worker, which produces the HSA output. The remaining processing elements are unused.]
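The two-team flow above can be sketched as a producer-consumer pipeline. A queue-fed thread stands in for the HSA worker and a loop for the channelizer team; all names here are illustrative, not the real system's.

```python
import queue
import threading

def run_pipeline(frames, spectrum_fn, alarm_fn):
    """Channelizer team -> HSA worker, as in the diagram.

    spectrum_fn plays the channelizer workers' data-parallel math;
    alarm_fn plays the high-speed alarm (HSA) worker's detection pass.
    """
    q = queue.Queue()
    alarms = []

    def hsa_worker():
        while True:
            spectrum = q.get()
            if spectrum is None:      # sentinel: channelizer team is done
                return
            alarms.extend(alarm_fn(spectrum))

    t = threading.Thread(target=hsa_worker)
    t.start()
    for frame in frames:              # channelizer "team" (serialized here)
        q.put(spectrum_fn(frame))
    q.put(None)
    t.join()
    return alarms
```

The queue plays the role of the local-store re-organization on the following slide: the two stages run concurrently and hand off one frame's worth of channels at a time.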

Data Flow – Re-org Channels
[Animation sequence (slides 11-20): channels are re-organized step by step between the channelizer team's local stores, XDR main memory, and the HSA team's local store.]



Development Time and Hardware Use
- PPC: 22 PPCs needed for the channelizer and 7 for the HSA; about 2 man-months of development.
- FPGA: one half of a Virtex-II Pro P70 FPGA (a quarter board); about 8 man-months; all the math had to be developed using some Xilinx cores.
- Cell BE: a single processor (half a board); about 4 man-weeks (using the same math and SAL calls as the PPC code).

Data Rates Tested
- The PPC implementation accepted data at 70, 80, and 105 MS/s (and is easily scalable).
- The FPGA implementation met data rates of 70 and 80 MS/s.
- The Cell BE implementation met data rates of 70, 80, and 105 MS/s.
  - Windowing was not implemented on the Cell BE because the local store was insufficient for the weights; adding it would require an extra 2-3 weeks of design modification to the data organization and channels. (Times were measured with a multiply-by-constant to stay true to performance.)
  - The math only began to impact data rates when fewer than 4 SPEs were used for the FFT; adding more SPEs did not yield additional speed.


Summary
- Morphing a library with a similar API onto a new architecture makes porting applications efficient.
- The hardware footprint (6U slots) is comparable to FPGA use.
- The small size of the SPE local store is a significant factor in determining whether an application will port easily or require additional work.
- Mercury is fully cognizant of the architecture and works to reduce code size while benefiting from the large I/O bandwidth and fast processing capability of the Cell BE.