© 2007 SET Associates Corporation
SAR Processing Performance on Cell Processor and Xeon
Mark Backues, SET Corporation
Uttam Majumder, AFRL/RYAS

Summary
• SAR imaging algorithm optimized for both the Cell processor and the Quad-Core Xeon
  – Cell implementation partially modeled after Richard Linderman's work
• Performance of the two processors is similar
• Cell would generally perform better on a lower-complexity problem
  – Illustrated by the bilinear interpolation implementation
• Relative performance can be understood from architectural differences

Back-Projection on Cell and Xeon
• Simple, general-purpose SAR imaging implementation (a scalar sketch of the inner loop follows below)
  – Order n³ for n×n pixel image tiles
  – Per pulse: 4x oversampled range compression FFT
  – Per pulse, per pixel: single-precision range calculation, linear range interpolation, and nearest-neighbor table lookup for the 4π·f₀·R/c phase term
• Optimized on both processor types
  – SIMD intrinsics for 4x parallelism per processing unit
  – Multiple threads
  – Loops unrolled to eliminate instruction-related stalls
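A minimal, scalar C sketch of the per-pulse, per-pixel loop described above, under assumptions not in the slides (hypothetical array names, a precomputed phase table, pixel offsets already expressed relative to the antenna); the measured implementations vectorize, thread, and unroll this loop rather than running it as written.

#include <math.h>

/* Scalar sketch of the back-projection inner loop (hypothetical names).
 *   rpRe/rpIm : 4x-oversampled, range-compressed pulse (real/imag)
 *   phRe/phIm : precomputed table of the 4*pi*f0*R/c phase term
 */
void backproject_pulse(float *imgRe, float *imgIm, int nPix,
                       const float *dx, const float *dy, const float *dz,
                       const float *rpRe, const float *rpIm,
                       float binsPerMeter, float r0,
                       const float *phRe, const float *phIm,
                       float phaseBinsPerMeter, int phaseTableLen)
{
    for (int i = 0; i < nPix; i++) {
        /* single-precision range from antenna to pixel */
        float R = sqrtf(dx[i]*dx[i] + dy[i]*dy[i] + dz[i]*dz[i]);

        /* linear interpolation into the range-compressed profile */
        float bin = (R - r0) * binsPerMeter;
        int   b   = (int)bin;
        float fr  = bin - (float)b;
        float sRe = rpRe[b] + fr * (rpRe[b + 1] - rpRe[b]);
        float sIm = rpIm[b] + fr * (rpIm[b + 1] - rpIm[b]);

        /* nearest-neighbor table lookup for the phase term */
        int p = (int)(R * phaseBinsPerMeter + 0.5f) % phaseTableLen;
        float cRe = phRe[p], cIm = phIm[p];

        /* accumulate the phase-corrected sample into the image */
        imgRe[i] += sRe * cRe - sIm * cIm;
        imgIm[i] += sRe * cIm + sIm * cRe;
    }
}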

Back-Projection Performance
• Performance on one Intel Quad-Core Xeon is 20% faster than on one IBM Cell processor
  – 3.2GHz clock rate for both
  – Range compression not included in the timing analysis: fast compared to the back-projection step, and often performed by a hardware front end
• Cell implementation more difficult
  – Explicit DMAs required
  – Use of select, shift, and shuffle intrinsics required for efficient data movement (contrast with the straightforward SSE arithmetic sketched below)
• Four 3.2GHz Xeon cores equivalent to eight 1.6GHz cores
  – Global memory access not a problem
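To illustrate the contrast, a minimal sketch of the 4-wide SSE arithmetic available per Xeon core for the range calculation (the function and variable names are assumptions, not from the slides); the SPU version computes the same four ranges per instruction but must also marshal operands with spu_shuffle/spu_sel, which is where much of the extra Cell effort goes.

#include <xmmintrin.h>  /* SSE intrinsics */

/* Four pixel-to-antenna ranges per call; dx, dy, dz hold per-pixel offsets
 * from the antenna for four adjacent pixels. */
static inline __m128 range4(__m128 dx, __m128 dy, __m128 dz)
{
    __m128 r2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx),
                                      _mm_mul_ps(dy, dy)),
                           _mm_mul_ps(dz, dz));
    return _mm_sqrt_ps(r2);  /* four single-precision square roots at once */
}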

Back-Projection Range Calculation
• Range calculation accounts for 43% of execution time on the Cell processor, and 33% on the Xeon
• A square root is used in the range calculation, for maximum generality
• The square root can be replaced by a much faster approximation (compared in the sketch below)
  – ||r|| − ||r−s|| ≈ (r·s)/||r|| when ||s|| << ||r||
  – Other approximations are possible
  – The allowable error is application dependent
• The performance on the Cell processor is then closer to the performance on the quad-core Xeon

[Figure: Error for Inner-Product Range Approximation]
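A small, self-contained C comparison of the exact differential range against the inner-product approximation above (the vectors used here are arbitrary illustration values, not data from the study); the point is that the approximation removes one square root per pixel, since ||r|| is shared across the pulse.

#include <math.h>
#include <stdio.h>

/* Exact differential range: ||r|| - ||r - s|| */
static double exact_diff(const double r[3], const double s[3])
{
    double rm = sqrt(r[0]*r[0] + r[1]*r[1] + r[2]*r[2]);
    double d0 = r[0]-s[0], d1 = r[1]-s[1], d2 = r[2]-s[2];
    return rm - sqrt(d0*d0 + d1*d1 + d2*d2);
}

/* Inner-product approximation: (r . s) / ||r||, valid when ||s|| << ||r|| */
static double approx_diff(const double r[3], const double s[3])
{
    double rm = sqrt(r[0]*r[0] + r[1]*r[1] + r[2]*r[2]);
    return (r[0]*s[0] + r[1]*s[1] + r[2]*s[2]) / rm;
}

int main(void)
{
    double r[3] = { 10000.0, 2000.0, 3000.0 };  /* antenna offset, meters */
    double s[3] = { 12.0, -7.0, 0.0 };          /* pixel offset, meters   */
    printf("exact  = %.6f m\n", exact_diff(r, s));
    printf("approx = %.6f m\n", approx_diff(r, s));
    return 0;
}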

Single Versus Double Precision
• Double precision is most important for the range calculation
• PS3 Cell double-precision instructions are very slow
  – 13-cycle latencies with unavoidable 6-cycle stalls
  – Throughput ~6x worse where used
• A double-precision comparison would be much less favorable for the PS3 Cell

Why Performance is Not Predicted by the Peak GFLOPS Figure
• Instruction pipeline differences
  – The Cell processing element has two pipelines, but only one is for arithmetic instructions
  – The Xeon has multiple ports and execution units, and can issue two (2-cycle-throughput) instructions per cycle, with data movement often not requiring additional cycles

[Figure: Cell Processing Unit and Xeon Core pipeline diagrams]

Why Performance is Not Predicted by the Peak GFLOPS Figure (cont.)
• Operation count poorly reflects computational difficulty
  – Transcendental functions are orders of magnitude slower than most other arithmetic operations (a table-lookup replacement is sketched after the listing below)
  – Efficiency of a table lookup depends on instruction set characteristics not reflected by the peak performance figure
  – Shuffling data into registers for efficient SIMD operation can be the slowest part of the process, and is not predicted by operation count

Example Cell processor disassembly and timing analysis:
  1D 2345  rotqbyi  $45,$113,8
  0D 3456  shli     $115,$81,4
  1D       lqd      $94,224($sp)
  0D 4567  shli     $113,$55,4
  1D       stqd     $34,5984($sp)
  0D 5678  shli     $112,$53,4
  1D       lqx      $31,$126,$127
  0D 67    ceqi     $80,$65,0
  1D       lqd      $95,240($sp)
  0D 78    andi     $65,$27,1
  1D 7890  rotqbyi  $15,$105,8
  0D 8901  shli     $105,$45,4
  1D 8901  rotqbyi  $60,$28,8
  0D 90    ceqi     $65,$65,0
  1D 9012  rotqbyi  $52,$19,8
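As one concrete instance of the transcendental-versus-lookup point above, a minimal C sketch of a precomputed phase table (the table size and names are assumptions for illustration; the appropriate size depends on the allowable error, as noted earlier for the range approximation).

#include <math.h>

#define PHASE_TABLE_SIZE 4096   /* assumed size; must be a power of two here */
#define TWO_PI 6.28318530718f

static float cosTab[PHASE_TABLE_SIZE], sinTab[PHASE_TABLE_SIZE];

/* Fill the table once; after this, no sinf/cosf calls remain in the pixel loop. */
static void init_phase_table(void)
{
    for (int i = 0; i < PHASE_TABLE_SIZE; i++) {
        float ang = TWO_PI * (float)i / PHASE_TABLE_SIZE;
        cosTab[i] = cosf(ang);
        sinTab[i] = sinf(ang);
    }
}

/* Nearest-neighbor lookup of cos/sin for a non-negative phase in radians. */
static inline void phase_lookup(float phase, float *re, float *im)
{
    float t  = phase * (PHASE_TABLE_SIZE / TWO_PI);
    int   ix = ((int)(t + 0.5f)) & (PHASE_TABLE_SIZE - 1);  /* wrap via power-of-two mask */
    *re = cosTab[ix];
    *im = sinTab[ix];
}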

Bilinear Interpolation on Cell and Xeon
• Bilinear affine transformation of 256x256-pixel 8-bit images (a scalar sketch follows below)
  – Vector intrinsics used for both implementations
  – Instruction-related stalls eliminated on Cell
  – DMA time still negligible on Cell – no double buffering required
  – Data movement and type conversions required significant optimization on Xeon
  – More difficult programming would be required for Cell to handle images too big to fit in the 256KB memory local to each processing unit
  – Order n² for n×n pixel images
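For reference, a minimal scalar C sketch of the operation being benchmarked (the names and affine-coefficient convention are assumptions; the measured code paths are the vector-intrinsic versions described above).

#include <stdint.h>

/* Sample an 8-bit image of width w at non-integer coordinates (x, y) with
 * bilinear weights; assumes 0 <= x < w-1 and y in range. */
static inline uint8_t bilinear(const uint8_t *src, int w, float x, float y)
{
    int   x0 = (int)x,        y0 = (int)y;
    float fx = x - (float)x0, fy = y - (float)y0;

    const uint8_t *p = src + y0 * w + x0;
    float top = p[0] + fx * (p[1]     - p[0]);
    float bot = p[w] + fx * (p[w + 1] - p[w]);
    return (uint8_t)(top + fy * (bot - top) + 0.5f);
}

/* Affine warp of a 256x256 image: map each output pixel back into the source
 * and sample bilinearly. a..f are the affine coefficients; a real version
 * would clamp or skip pixels that map outside the source. */
void warp_affine(const uint8_t *src, uint8_t *dst,
                 float a, float b, float c, float d, float e, float f)
{
    for (int yo = 0; yo < 256; yo++)
        for (int xo = 0; xo < 256; xo++) {
            float xs = a * xo + b * yo + c;
            float ys = d * xo + e * yo + f;
            dst[yo * 256 + xo] = bilinear(src, 256, xs, ys);
        }
}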

Xeon Memory Bottleneck vs Cell
• Four 3.2GHz Xeon cores have 1.5x the performance of eight 1.6GHz cores
  – Front-side bus is 1600MHz vs 1066MHz
  – Main memory access is the limiting factor, not computation or cache use
• One 3.2GHz IBM Cell processor has 2.4x the performance of one 3.2GHz Intel Quad-Core Xeon
  – Data movement is much more difficult to program, but much more efficient (a DMA sketch follows below)

[Figure: Xeon memory path – core, L2 cache, memory controller, main memory]
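To make the Cell's explicit data movement concrete, a hedged sketch of a blocking SPU-side DMA using the standard MFC calls (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the buffer name, tile size, and alignment are assumptions. A double-buffered version would overlap the next transfer with processing, though the slides note that was not needed for the bilinear-interpolation case.

#include <spu_mfcio.h>

#define TILE_BYTES 16384   /* assumed tile size; 16KB is the largest single DMA */

static volatile unsigned char tile[TILE_BYTES] __attribute__((aligned(128)));

/* Pull one tile from main memory (effective address 'ea') into local store. */
static void fetch_tile(unsigned long long ea)
{
    const unsigned int tag = 0;

    mfc_get(tile, ea, TILE_BYTES, tag, 0, 0);  /* start the transfer                */
    mfc_write_tag_mask(1 << tag);              /* select which tag group to wait on */
    mfc_read_tag_status_all();                 /* block until the DMA completes     */

    /* process tile[] here; on Xeon the equivalent data arrives implicitly
       through the cache hierarchy and front-side bus. */
}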

Current and Future Work
• SSE-optimized polar format
  – Image warped to fixed coordinates
  – Includes wavefront curvature and other corrections
  – Currently about 7x faster than the back-projection implementation, but with limitations
• Intel Nehalem
  – On-board memory controller
  – 'QuickPath' memory interconnect
  – Up to 8 cores per die
• Intel Larrabee
  – 24 x86 cores
  – 4-way multithreading per core
  – SSE
  – 32KB L1 cache, 512KB L2 cache