Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.

• Three versions of the Fast Fourier Transform (FFT) are to be implemented on the Cell BE simulator, and their performance analyzed as the FFT order increases:
  ◦ FFT on the PPE and on a single SPU.
  ◦ Data/task-parallel FFT on multiple SPUs (single-buffer vs. double-buffer performance comparison).
  ◦ Pipelined implementation on multiple SPUs.
• Performance is measured for:
  ◦ the FFT kernel
  ◦ DMA data transfer

PPE  64bit Power architecture with VMX.  In-order, 2-way SMT.  32KB L1, 512KB L2 Cache. SPE  256 KB local store.  In-order, No speculation.  128 registers for all data types. EIB  Four 16B data rings.  Over 100 outstanding requests.

• FFT compute intensity: O(n log n).
• Implementation on the PPU
  ◦ Cache-based memory architecture; no software-controlled memory.
• Implementation on the SPU
  ◦ Software-controlled memory.
  ◦ The limited local store determines the maximum FFT size that can be implemented. (Data structure size = 16 bytes × FFT size, which limits a single buffer to an 8K-point FFT: 128 KB of the 256 KB local store.)

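Table: PPE execution times and SPE execution times.
Columns: (1) N (points); (2) N log N; (3) PPE cycles, measured on PPE; (4) SPE cycles for FFT only, measured on SPE; (5) SPE cycles for DMA, measured on SPE; (6) total cycle time; (7) cycles (thread creation), measured on PPE; (8) difference.
Figure: N vs. cycles.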

• The number of cycles on both the PPU and the SPU scales as N log N.
• Compute time on the PPU is greater than on a single SPU because of cache misses on the PPU. The SPU has no cache and accesses its local store directly.
• DMA is very efficient.
• Thread creation on the SPE is very expensive, so an SPU should be dedicated to a task long enough to recoup its setup time.
• The DIFFERENCE (col 8) appears too large; the exact reason is unknown. Possible reasons:
  ◦ Cycles spent exiting the thread. (Are local store entries invalidated on exit?)
  ◦ A profiling tool problem. (IBM says that the simulator is used for profiling SPEs and not PPEs. Does this mean the intrinsic provided for measuring cycles on the PPE, __mftb, is not accurate?)

• Multiple FFTs run in parallel, one on each SPU, and each SPU works on different data.
• Limitation of local store memory:
  ◦ Single-buffer approach => 8K points
  ◦ Double-buffer approach => 4K points
• Single buffer vs. double buffer.
• Performance as the number of active SPUs is increased.
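The 8K and 4K limits follow from the 16 bytes per point quoted earlier. The sketch below works them out. The 192 KB buffer budget used in the usage note (256 KB local store minus a hypothetical 64 KB reserved for code and stack) is an assumption for illustration, not a figure from the slides, and `max_fft_points` is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Data-structure size per FFT point, as stated on the slides. */
#define BYTES_PER_POINT 16u

/* Largest power-of-two FFT size whose buffers fit in budget_bytes when
   num_buffers copies of the data structure are resident at once.
   Returns 0 if not even a 1-point FFT fits. */
size_t max_fft_points(size_t budget_bytes, size_t num_buffers)
{
    size_t n = 1;

    assert(num_buffers > 0);
    if (budget_bytes < BYTES_PER_POINT * num_buffers)
        return 0;

    /* Keep doubling while the next power of two still fits. */
    while (2 * n * BYTES_PER_POINT * num_buffers <= budget_bytes)
        n *= 2;
    return n;
}
```

With a 192 KB budget, `max_fft_points(192 * 1024, 1)` gives 8192 and `max_fft_points(192 * 1024, 2)` gives 4096, matching the single-buffer and double-buffer limits above.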

Table: single buffer vs. double buffer.
Columns: SPUs; N; then for each of single buffer and double buffer: cycles (thread creation), avg. cycles (for FFT only), avg. cycles (for DMA, approx.).

• More compute power with multiple processors.
  ◦ For the FFT: almost 8 times, if thread creation is not counted.
• Double buffering may not always give a speed advantage (Amdahl's law).
• The algorithm should be analyzed carefully to find out whether it is compute-intensive or memory-intensive with respect to the Cell architecture.
  ◦ Matrix multiplication is memory-intensive, but the FFT becomes memory-intensive only for very large orders, where all FFT samples cannot fit into the Cell local store.
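Why double buffering may give little gain can be seen with a simple timing model. This is a sketch under stated assumptions, not measured behavior: each block costs a fixed number of cycles for DMA and for the FFT, and a DMA transfer overlaps perfectly with the compute of the neighboring block. The function names and the cycle counts in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Single buffer: transfer and compute for each block run back to back. */
uint64_t single_buffer_cycles(uint64_t blocks, uint64_t fft, uint64_t dma)
{
    return blocks * (fft + dma);
}

/* Double buffer: after the first transfer, each block's transfer overlaps
   the previous block's compute, so the slower of the two sets the pace. */
uint64_t double_buffer_cycles(uint64_t blocks, uint64_t fft, uint64_t dma)
{
    uint64_t step = fft > dma ? fft : dma;

    if (blocks == 0)
        return 0;
    /* First transfer is exposed, the last compute is exposed,
       and the steps in between are overlapped. */
    return dma + (blocks - 1) * step + fft;
}
```

For example, with 8 blocks, 24000 FFT cycles and 1000 DMA cycles per block, the model gives 200000 cycles single-buffered and 193000 double-buffered, a gain of only about 3.5%. When compute dominates, hiding the DMA time helps little, and the halved buffer size is the price.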

• Reference
• No. of cycles for a single 4K-point FFT = 24688
• No. of floating-point operations = 4*1024*log2(4*1024) = 49152
• Frequency of system = 3.2 GHz
• No. of SPUs = 8
• GFLOPS = (49152/24688) * 8 * 3.2 ≈ 50.97 GFLOPS
Figure: IBM results vs. my results.
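The arithmetic on this slide can be reproduced with a few lines of C. This is only a sketch: the helper names are made up, the N·log2(N) operation count is the one used on the slide, and the aggregate figure assumes all SPUs sustain the same rate.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 of a power of two. */
unsigned ilog2(uint32_t n)
{
    unsigned k = 0;

    assert(n != 0 && (n & (n - 1)) == 0);
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Operation count used on the slide: N * log2(N). */
uint64_t fft_op_count(uint32_t n)
{
    return (uint64_t)n * ilog2(n);
}

/* Aggregate GFLOPS: operations per cycle on one SPU, times the number
   of SPUs, times the clock in GHz. */
double aggregate_gflops(uint64_t ops, uint64_t cycles, unsigned spus, double ghz)
{
    assert(cycles > 0);
    return (double)ops / (double)cycles * spus * ghz;
}
```

`fft_op_count(4096)` gives 49152, and `aggregate_gflops(49152, 24688, 8, 3.2)` gives roughly 50.97.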

• The Cell architecture and its programming environment are completely new, so unknown problems come up.
• Runtime error: "bus error". Normally caused by an unaligned access; in my case it was caused by DMA accesses larger than 16 KB.
• Profiling is tricky because the simulator supports multiple modes. Assembly intrinsics are required to measure actual cycles, and running in "CYCLE" mode is very slow.
  ◦ It takes 2 days to run an 8K-point FFT.
• The simulator crashes when the mode is changed multiple times.
• Debug support is very complex.

• Use the alphaWorks forum: an excellent forum with quick response times.
• To profile accurately, run the simulation in cycle mode.
• Commands for profiling:
  ◦ __mftb() for the PPE
  ◦ spu_writech(), spu_readch() for the SPE

• Pipelined implementation of the FFT.
• Standalone mode.
• Higher-order FFTs.
• Compiler performance.

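/* Double-buffered SPU loop (pseudocode). Tag x is used for cb1, tag y+10 for cb2.
   The (char *) casts make the local store offsets byte offsets. */
loop(
    /* Start fetching cb1. */
    mfc_get((char *)&cb1 + x*sizeof(cb1)/(FFT_SIZE/1024), argp + x*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0);

    /* Wait for the previous put of cb2 before overwriting it, then start fetching cb2. */
    mfc_write_tag_mask(1 << (y+10));
    mfc_read_tag_status_all();
    mfc_get((char *)&cb2 + y*sizeof(cb1)/(FFT_SIZE/1024), argp + y*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb2)/(FFT_SIZE/1024), y+10, 0, 0);

    /* Wait for cb1 to arrive, then compute on it while cb2 is in flight. */
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);

    /* Wait for cb2 to arrive, then write cb1 back. */
    mfc_write_tag_mask(1 << (y+10));
    mfc_read_tag_status_all();
    mfc_put((char *)&cb1 + x*sizeof(cb1)/(FFT_SIZE/1024), argp + x*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0);

    /* Compute on cb2 while cb1 is being written back. */
    fft_float(FFT_SIZE, cb2.RealIn, cb2.ImagIn, cb2.RealOut, cb2.ImagOut);

    /* Wait for the put of cb1, then write cb2 back. */
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    mfc_put((char *)&cb2 + y*sizeof(cb1)/(FFT_SIZE/1024), argp + y*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb2)/(FFT_SIZE/1024), y+10, 0, 0);
)
/* Drain the final put of cb2. */
mfc_write_tag_mask(1 << (y+10));
mfc_read_tag_status_all();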

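mfc_get(&cb1, argp, sizeof(cb1), x, 0, 0);

will not work when cb1 is larger than 16 KB, the maximum size of a single DMA transfer. It should be recoded as:

for (x = 0; x < FFT_SIZE/1024; x++) {
    /* One 16 KB chunk per iteration; the cast makes the offset a byte offset. */
    mfc_get((char *)&cb1 + x*sizeof(cb1)/(FFT_SIZE/1024), argp + x*sizeof(cb1)/(FFT_SIZE/1024), sizeof(cb1)/(FFT_SIZE/1024), x, 0, 0);
}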
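The same splitting can be written for an arbitrary transfer size. The sketch below only computes the chunk plan, so that it compiles off-target. The helper names are hypothetical, and the total size is assumed to be a multiple of 16 bytes so that the last chunk is still a legal DMA size.

```c
#include <assert.h>
#include <stddef.h>

/* Largest size of a single MFC DMA transfer. */
#define MAX_DMA_BYTES ((size_t)16 * 1024)

/* Number of transfers needed to move total bytes. */
size_t dma_chunk_count(size_t total)
{
    return (total + MAX_DMA_BYTES - 1) / MAX_DMA_BYTES;
}

/* Size in bytes of chunk i (0-based); every chunk is full-sized
   except possibly the last one. */
size_t dma_chunk_size(size_t total, size_t i)
{
    size_t offset = i * MAX_DMA_BYTES;
    size_t remaining;

    assert(offset < total);
    remaining = total - offset;
    return remaining < MAX_DMA_BYTES ? remaining : MAX_DMA_BYTES;
}

/* On the SPU the plan would drive a loop such as (not compiled here):
 *
 *   for (i = 0; i < dma_chunk_count(total); i++)
 *       mfc_get((char *)ls + i * MAX_DMA_BYTES, ea + i * MAX_DMA_BYTES,
 *               dma_chunk_size(total, i), tag, 0, 0);
 */
```

For the 8K-point case, the 128 KB data structure splits into `dma_chunk_count(128 * 1024)` = 8 chunks of 16 KB each, which matches the `FFT_SIZE/1024` iterations of the loop above.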