Lecturer: Simon Winberg Lecture 17 RC Architectures Case Studies Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) Microprocessor-based: Cell Broadband.

Slides:



Advertisements
Similar presentations
Parallel Processing with PlayStation3 Lawrence Kalisz.
Advertisements

Introduction to Programmable Logic John Coughlan RAL Technology Department Electronics Division.
Lecture 6: Multicore Systems
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
Cell Broadband Engine. INF5062, Carsten Griwodz & Pål Halvorsen University of Oslo Cell Broadband Engine Structure SPE PPE MIC EIB.
A reconfigurable system featuring dynamically extensible embedded microprocessor, FPGA, and customizable I/O Borgatti, M. Lertora, F. Foret, B. Cali, L.
Ido Tov & Matan Raveh Parallel Processing ( ) January 2014 Electrical and Computer Engineering DPT. Ben-Gurion University.
Types of Parallel Computers
Embedded Systems: Introduction. Course overview: Syllabus: text, references, grading, etc. Schedule: will be updated regularly; lectures, assignments.
Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.
Configurable System-on-Chip: Xilinx EDK
CS 151 Digital Systems Design Lecture 38 Programmable Logic.
Heterogeneous Computing Dr. Jason D. Bakos. Heterogeneous Computing 2 “Traditional” Parallel/Multi-Processing Large-scale parallel platforms: –Individual.
Lecturer: Simon Winberg Review of EEE4084F  Lecture content covered  Readings, seminars, chapters EEE4084F.
Field Programmable Gate Array (FPGA) Layout An FPGA consists of a large array of Configurable Logic Blocks (CLBs) - typically 1,000 to 8,000 CLBs per chip.
1b.1 Types of Parallel Computers Two principal approaches: Shared memory multiprocessor Distributed memory multicomputer ITCS 4/5145 Parallel Programming,
Emotion Engine A look at the microprocessor at the center of the PlayStation2 gaming console Charles Aldrich.
J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy IBM Systems and Technology Group IBM Journal of Research and Development.
Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.
Lecture 16 RC Architecture Types & FPGA Interns Lecturer: Simon Winberg Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Cell Architecture. Introduction The Cell concept was originally thought up by Sony Computer Entertainment inc. of Japan, for the PlayStation 3 The architecture.
Lecturer: Simon Winberg Lecture 21 Reflections, key steps* and A short comprehensive assignment * Relates to Martinez, Bond and Vai Ch 4. Attribution-ShareAlike.
Evaluation of Multi-core Architectures for Image Processing Algorithms Masters Thesis Presentation by Trupti Patil July 22, 2009.
February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.
Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.
Cell Broadband Engine Architecture Bardia Mahjour ENCM 515 March 2007 Bardia Mahjour ENCM 515 March 2007.
1b.1 Types of Parallel Computers Two principal approaches: Shared memory multiprocessor Distributed memory multicomputer ITCS 4/5145 Parallel Programming,
Paper Review: XiSystem - A Reconfigurable Processor and System
Automated Design of Custom Architecture Tulika Mitra
Designing the WRAMP Dean Armstrong The University of Waikato.
SJSU SPRING 2011 PARALLEL COMPUTING Parallel Computing CS 147: Computer Architecture Instructor: Professor Sin-Min Lee Spring 2011 By: Alice Cotti.
J. Christiansen, CERN - EP/MIC
FPGA (Field Programmable Gate Array): CLBs, Slices, and LUTs Each configurable logic block (CLB) in Spartan-6 FPGAs consists of two slices, arranged side-by-side.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
Lithographic Aerial Image Simulation with FPGA based Hardware Acceleration Jason Cong and Yi Zou UCLA Computer Science Department.
Sept. 2005EE37E Adv. Digital Electronics Lesson 1 CPLDs and FPGAs: Technology and Design Features.
Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.
Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.
“Politehnica” University of Timisoara Course No. 2: Static and Dynamic Configurable Systems (paper by Sanchez, Sipper, Haenni, Beuchat, Stauffer, Uribe)
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Systems Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture 8: Wed.
Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.
Evaluating and Improving an OpenMP-based Circuit Design Tool Tim Beatty, Dr. Ken Kent, Dr. Eric Aubanel Faculty of Computer Science University of New Brunswick.
Sam Sandbote CSE 8383 Advanced Computer Architecture The IBM Cell Architecture Sam Sandbote CSE 8383 Advanced Computer Architecture April 18, 2006.
Processor Architecture
LYU0703 Parallel Distributed Programming on PS3 1 Huang Hiu Fung Wong Chung Hoi Supervised by Prof. Michael R. Lyu Department of Computer.
The Octoplier: A New Software Device Affecting Hardware Group 4 Austin Beam Brittany Dearien Brittany Dearien Warren Irwin Amanda Medlin Amanda Medlin.
Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.
Reconfigurable architectures ESE 566. Outline Static and Dynamic Configurable Systems –Static SPYDER, RENCO –Dynamic FIREFLY, BIOWATCH PipeRench: Reconfigurable.
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Principles of Parallel Programming First Edition by Calvin Lin Lawrence Snyder.
Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.
FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal.
Introduction to Field Programmable Gate Arrays (FPGAs) EDL Spring 2016 Johns Hopkins University Electrical and Computer Engineering March 2, 2016.
Lecture 4: Lecturer: Simon Winberg Temporal and Spatial Computing, Shared Memory, Granularity Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
1/21 Cell Processor Systems Seminar Diana Palsetia (11/21/2006)
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
Lecture 5: Lecturer: Simon Winberg Review of paper: Temporal Partitioning Algorithm for a Coarse-grained Reconfigurable Computing Architecture by Chongyong.
Reconfigurable Computing1 Reconfigurable Computing Part II.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
EEE4084F Digital Systems Lecture 24 RC Platform Case Studies 1/2
Lecture 18 FPGA Interns & Performance Comparison
EEE4084F Digital Systems NOT IN 2017 EXAM Lecture 25
Cell Architecture.
Architecture & Organization 1
EEE4084F Digital Systems Lecture 24: RC Platform Case Studies 1/2
EEE4084F Digital Systems NOT IN 2018 EXAM Lecture 24X
Architecture & Organization 1
Instructor: Dr. Phillip Jones
Lecture 18 X: HDL & VHDL Quick Recap
Instructor: Dr. Phillip Jones
Presentation transcript:

Lecturer: Simon Winberg Lecture 17 RC Architectures Case Studies Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) Microprocessor-based: Cell Broadband Engine Architecture FPGA-based: PAM, VCC, SPLASH …

 Case study of RC computers  IBM Blade & Cell Processor  Programmable Active Memories (PAM)  Virtual Computer Corporation (VCC)  Super Computer Research Center Splash System  Small RC Systems

IBM Blade & The Cell Processor CASE STUDY: Cell (or Meta-) processors Changeable in smaller parts – the ‘Strategic Processing Units’ (SPUs) and their interconnects IBM Blade rack

 Developed by STI alliance, a collaboration of Sony, Sony Computer Entertainment, Toshiba, and IBM.  Why Cell?  Actually “Cell” is a shortening for “Cell Broadband Engine Architecture”  Technically abbreviated as CBEA in full, alternatively “Cell BE”.  The design and first implementation of the Cell:  Performed at STI Design Center in Austin, Texas  Carried out over a 4-year period from March 2001  Budget approx. 400 million USD Information based mainly on Image of the Cell processor

 2005 Feb [1,2]  IBM’s technical disclosures of cell processors quickly led to new platforms & toolsets [2]  Oct 05: Mercury Cell Blade  Nov 05: Open Source SDK & Simulator  Feb 06: IBM Cell Blade  Resources / further reading   (see copy of condensed article: Lect17 - The Cell architecture.pdf)The Cell architecture.pdf [2] [1] IBM press release 7-Feb-2005:

 9 cores  1 x Power Processor  8 x Synergistic Processor Element (SPE)  10 threads (2x PPE threads + 8x SPE threads)  Transistors: 241x10 6  Size: 235 mm 2  Clock: 3.2 GHz  Cell ver. 1: 64-bit arch Power Processor Element L2 Cache (512 Kb) Rambus XRAM ™ Interface Memory Controller SPE IO Controller Rambus FlexIO™ SPE Layout of Cell processor adapted from Element interconnect bus Test&Debug

 Cells: heterogeneous multi-core system architecture  Power cell element for control tasks  Synergistic Processing Elements for data- intensive processing  Each SPE  Synergistic Processor Unit (SPU)  Synergistic Memory Flow Control (MFC)  Data movement and synchronization  Interface to high-performance Element Interconnect Bus (EIB)

Synergistic Processor Unit (SPU) Synergistic Memory Flow Control (MFC) EIB L2 Cache MIC XRAM ™ FLEX™ IO SPU MFC SPE SPU MFC SPE SPU MFC SPE SPU MFC SPE SPU MFC SPE SPU MFC SPE SPU MFC SPE SPU MFC SPE PPU

 Application Binary Interface (ABI) Specifications  Defines: data types, register usage, calling conventions, and object formats to ensure compatibility of code generators and portability of code.  Examples  IBM SPE (Strategic Processor Elements) ABI  Linux Cell ABI

 SPE C/C++ Language Extensions  Defines: standardized data types, compiler directives, and language extensions used to make use of SIMD capabilities in the core

Cell Processor Programming Models Reconfigurable Computing

 Cell Processor change SPEs according to application  Models  Application-specific accelerators  Function offloading  Computation acceleration  Heterogeneous multi-threading

Application Specific Accelerators Example PPE SPE 1 SPE 2 SPE 3SPE 4SPE 5SPE 6SPE 7SPE 8 EIB Hardware Software 3D Visualization Application 3D Graphics Acceleration Texture mapping Data decomp ression Data comparison and classification 3D Scene Generation FLEX™ IO DATA Stores

Function offloading models… PPE SPE Multi-staged pipeline Parallel stage of processing sequence PPE SPE Remember: All the SPEs can access the shared memory directly via the EIB (element interconnect bus) Example: LZH_compress(‘data.dat’) Example: Matrix X,Y Y = quicksort(X) m = Max(X) X = X + 1

Computation Acceleration PPE SPE1 Similar to model for functional offloading, except each SPE can be busy with other forms of related computation, but tasks not necessarily directly dependent (i.e. the main task isn’t always blocked, waiting for the others to complete) SPE3SPE2 Task #1 Task #2 Task #3 Set of specific computation tasks scheduled optimally, each possibly needing multiple SPEs and PPE resources SPE4 Processing resource usage SPE1 configured for tasks of type #1 SPE2 configured for tasks of type #2 SPE3 and SPE4 configured for tasks of type #3

Heterogeneous multi-threading PPE SPE1SPE2SPE3SPE4SPE5SPE6SPE7SPE8 Spawn new threads as needed Thread #3 Thread #3 Thread #5 All SPEs configured to handle general types of tasks required by the application Combination of PPE threads and SPE threads Thread #1 Thread #4 Processing resource usage disabled processing resources PPE configured for thread types #1 and #2 SPE1 configured for threads of type #6 SPE2 configured for threads of type #3 SPE3 and SPE4 for threads of type #5 No threads of type #6 currently exist Certain SPEs configured to speed certain threads, but able to handle other threads also waiting (this thread is blocked)

 Three-step approach for application operation  Step 1 : Staging  Telling the SPEs what they are to do  Applying computation parameters PPE L2 Cache Main Memory SPE todo assigning tasks

 Step 1 : Staging  Each SPE can use a different block of memory  Step 2 : Processing  Each SPE does its assigned task PPE L2 Cache Main Memory SPE Each SPE uses its allocated part of memory

 Step 1 : Staging  Step 2 : Processing  Step 3 : Combination PPE L2 Cache Main Memory SPE Power PC combines results that were left by the SPEs in memory, using its L2 cache to speed it up

 Each blade contains  Two cell processors  IO controller devices  XDRAM memory  IBM Blade center interface

RC Systems A look at platforms architectures

 Programmable Active Memories (PAM)  Produced by Digital Equipment Corp (DEC)  Used Xilinx XC3000 FPGAs  Independent banks of fast static RAM Host CPU FPGA SRAM DRAM Digital Equipment Corp. PAM system (1980s) Image adapted from Hauck and Dehon (2008) Ch3

 Virtual Computer Corporation (VCC)  First commercially commercial RC platform*  Checkerboard layout of  Xilinx XC4010 devices and  I-Cube programmable interconnection devices  SRAM modules on the edges * Hauck and Dehon (2008) … … …………………… … VCC Virtual Computer FPGASRAM FPGA SRAM I-CubeFPGA I-CubeFPGA I-Cube FPGA SRAM FPGA I-Cube FPGASRAM

Summary of the Splash system Developed initially to solve the problem of mapping the human genome and other similar problems. Design follows a reconfigurable linear logic array. The SPLASH aimed to give a Sun computer better than supercomputer performance for a certain types of problems. At the time, the performance of SPLASH was shown to outperform a Cray 2 by a factor of 325. FPGAs were used to build SPLASH, a cross between a specialized hardware board but more flexible like a supercomputer. The SPLASH system consists of software and hardware which plugs into two slots of a Sun workstation. ** * Hauck and Dehon (2008) SRC Splash version 2 Dedicated controller … … … … SRAM FPGA SRAM FPGA SRAM FPGA SRAM FPGA SRAM FPGA SRAM FPGA SRAM Crossbar Dev. by Super Computer Research (SCR) Center ~1990 Well utilized (compared to previous systems). Comprised linear array of FPGAs each with own SRAM * **Adapted from: Waugh, T.C., "Field programmable gate array key to reconfigurable array outperforming supercomputers," Custom Integrated Circuits Conference, 1991., Proceedings of the IEEE 1991, vol., no., pp.6.6/1,6.6/4, May 1991 doi: /CICC Illustration of the SPLASH design (adapted from *)

 Brown University’s PRISM  Single FPGA co-processor in each computer in a cluster  Main CPUs offloading parallelized functions to FPGA  Algotronix  Configurable Array Logic (CAL) – FPGA featuring very simple logic cells (compared to other FPGAs)  Later become XC6200 (when CAL bought by Xilinx) * Hauck and Dehon (2008)

 Cray Research  XD1: 12 processing nodes  6x ADM Opteron processors  6x Reconfigurable nodes built from Xilinx Vertex 4  Each XD1 in own chassis, can connect up to 12 chassis in a cabined (i.e. 144 processing nodes)  SRC  Traditional processor + reconfig. processing unit  Based on Xilinx Virtex FPGAs  Silicon Graphics  RASP (reconfigurable application-specific processor)  Blade-type approach of smaller boards plugging into larger ones Ref: Hauck and Dehon Ch3 (2008)

 Reading  Reconfigurable Computing: A Survey of Systems and Software (ACM Survey) * * Compton & Hauck (2002).“Reconfigurable Computing: A Survey of Systems and Software” In ACM Computing Surveys, Vol. 34, No. 2, June 2002, pp. 171–210. (not specifically examined, but can help you develop insights that help you demonstrate a deeper understanding to problems) -- End of the Cell Processor case study --

 Reading  Hauck, Scott (1998). “The Roles of FPGAs in Reprogrammable Systems” In Proceedings of the IEEE. 86(4) pp  Next lecture:  Amdahl’s Law  Discussion of YODA phase 1

Image sources: IBM Blade rack (slide 3), IBM blade, Checkered flag – Wikipedia open commons NASCAR image – flickr CC2 share alike Disclaimers and copyright/licensing details I have tried to follow the correct practices concerning copyright and licensing of material, particularly image sources that have been used in this presentation. I have put much effort into trying to make this material open access so that it can be of benefit to others in their teaching and learning practice. Any mistakes or omissions with regards to these issues I will correct when notified. To the best of my understanding the material in these slides can be shared according to the Creative Commons “Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)” license, and that is why I selected that license to apply to this presentation (it’s not because I particulate want my slides referenced but more to acknowledge the sources and generosity of others who have provided free material such as the images I have used).