Panel Session: Multicore Meltdown?
James C. Anderson, MIT Lincoln Laboratory
HPEC07, Wednesday, 19 September 2007
This work was sponsored by the Department of the Air Force under Air Force Contract #FA C. Opinions, interpretations, conclusions and recommendations are those of the author, and are not necessarily endorsed by the United States Government. Reference to any specific commercial product, trade name, trademark or manufacturer does not constitute or imply endorsement.

Objective & Schedule
Objective: assess the impact of multicore processors on past and present DoD HPEC systems, and project their importance for the future.
Schedule:
– Overview
– Guest speaker Mr. Kalpesh Sheth
– Introduction of the panelists
– Previously-submitted questions for the panel
– Open forum
– Conclusions & the way ahead

On-chip Performance for Representative COTS Single- & Multi-core Processors (2Q98)
[Chart, labeled "Past": billion 32-bit floating-point operations per sec-watt for computing a 1K-point complex FFT (GFLOPS/W) vs. year. Data points: TI SMJ320C80 ('C80, 1 RISC + 4 DSPs, 0.5µm) and Analog Devices ADSP-21060L SHARC (DSP, 0.5µm).]

On-chip Performance for Representative COTS Single- & Multi-core Processors (3Q07)
[Chart, labeled "Present": GFLOPS/W (1K-point complex FFT) vs. year, with a 3X-in-3-yrs improvement-rate trend line. Data points: TI SMJ320C80 ('C80, 1 RISC + 4 DSPs, 0.5µm); Analog Devices ADSP-21060L SHARC (DSP, 0.5µm); Broadcom BCM1480 (4 RISC cores, MIPS64, in SoC, 130nm); Freescale MPC7447A PowerPC w/ AltiVec (1 RISC + vector processor, 130nm); IBM Cell Broadband Engine (1 RISC + 8 SIMD processor elements, 90nm); P.A. Semi PA6T-1682M (dual 64-bit Power Architecture CPU, 65nm; note: 14X on-chip memory for given throughput vs. Cell).]
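The GFLOPS/W values on these charts can be reproduced from first principles. Below is a minimal sketch, assuming the conventional 5·N·log2(N) operation count for an N-point complex FFT and using the Cell figures quoted on the model slide later in the deck (170 GFLOPS sustained at 100W); the function names are illustrative only.

```python
import math

def fft_flops(n: int) -> float:
    """Conventional operation count for an n-point complex FFT: 5*n*log2(n)."""
    return 5.0 * n * math.log2(n)

def gflops_per_watt(ffts_per_sec: float, n: int, watts: float) -> float:
    """The chart's y-axis: billion 32-bit FLOPs per second per watt."""
    return ffts_per_sec * fft_flops(n) / 1e9 / watts

# One 1K-point complex FFT costs 5 * 1024 * 10 = 51,200 FLOPs.
# Cell Broadband Engine: 170 GFLOPS sustained at 100 W.
cell_ffts_per_sec = 170e9 / fft_flops(1024)
print(gflops_per_watt(cell_ffts_per_sec, 1024, 100.0))  # 1.7 GFLOPS/W
```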

On-chip Performance for Possible Future COTS Single- & Multi-core Processors
[Chart, labeled "Future": GFLOPS/W (1K-point complex FFT) vs. year, extending the 3X-in-3-yrs improvement-rate trend line. Data points: TI SMJ320C80 ('C80, 1 RISC + 4 DSPs, 0.5µm); Analog Devices ADSP-21060L SHARC (DSP, 0.5µm); Freescale MPC7447A PowerPC w/ AltiVec (1 RISC + vector processor, 130nm); IBM Cell Broadband Engine (1 RISC + 8 SIMD processor elements, 90nm); Intel Polaris 80-µP-core non-COTS tech demo (65nm); a possible future COTS multi-core processor (45nm); and 50 GFLOPS/W in 2016 (32nm).]
Data appear consistent with an underlying improvement rate model.

Single-core Microprocessor Family Improvement Rate Model
Derived from ITRS06 projections:
– 0.7X geometry reduction every 3 yrs
– Normalized for constant die size & power
1). Present-generation uni-processor: original µP die (edge length L).
2a). Next-generation uni-processor (re-design in new geometry):
– 2X throughput (2X transistors at similar speed, bytes/OPS unchanged)
– Compatible HW/SW

Single- & Multi-core Microprocessor Family Improvement Rate Model
Derived from ITRS06 projections:
– 0.7X geometry reduction every 3 yrs
– Normalized for constant die size & power
1). Present-generation uni-processor: original µP die (edge length L).
2a). Next-generation uni-processor (re-design in new geometry):
– 2X throughput (2X transistors at similar speed, bytes/OPS unchanged)
– Compatible HW/SW
2b). Next-generation dual-core processor (re-implement the original µP design in the new geometry, two copies at edge length L/2):
– 3X throughput (2X cores at 1.5X speed, but 1/3 fewer bytes/OPS)
– Incompatible HW/SW (2X pins, not a uni-processor)
Note: 3X in 3 yrs improvement rate also projected for COTS ASICs (e.g., Cell) & SRAM-based FPGAs.
"Multicore meltdown" occurs when core-based design methods can no longer provide substantial improvements.
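A minimal sketch of what the two paths imply over several process generations, assuming throughput simply compounds every 3 years at 2X (uni-processor re-design) or 3X (core replication); the names are illustrative:

```python
def projected_throughput(base: float, years: float, factor_per_3yrs: float) -> float:
    """Throughput after `years`, compounding `factor_per_3yrs` every 3 years."""
    return base * factor_per_3yrs ** (years / 3.0)

# Equal baselines; compare the 2a) and 2b) paths over three generations.
for gens in range(4):
    yrs = 3 * gens
    uni = projected_throughput(1.0, yrs, 2.0)    # 2a) uni-processor re-design
    multi = projected_throughput(1.0, yrs, 3.0)  # 2b) core-replication path
    print(f"year {yrs:2d}: uni {uni:4.1f}X, multi {multi:4.1f}X, ratio {multi/uni:.2f}X")
```

After three generations (9 years) the multicore path is ahead by 27X vs. 8X, a 3.4X gap, which is the divergence the later card-level timelines quantify.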

Notional Model Projection Based On IBM's Cell Broadband Engine
90nm geometry baseline:
– 1 mm pitch, 1320-ball BGA (ball grid array)
– 234M transistors on 221 mm² die (L = 17.7 mm, W = 12.5 mm)
– 1333 pads (43x31) possible using 400µm pitch
– 100W @ 3.2 GHz (170 GFLOPS sustained, for 1.7 GFLOPS/W)
– 2592 Kbytes on-chip memory (66 KFLOPS/byte)
22nm geometry (16X transistors & 5.1X speed at constant power & die size):
– 0.25 mm pitch, 21,120-ball FBGA (fine pitch BGA, with 0.15 mm feasible ca. 2010)
– 21,328 pads (172x124) possible using 100µm pitch (typical ca. 2015)
– 139 GFLOPS/W & 334 KFLOPS/byte
Core-based design methods appear feasible for the coming decade, but how useful are the resulting devices given that memory & I/O shortcomings are compounded over time?
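The 22nm numbers follow from constant-power scaling of the 90nm baseline. A hedged sketch of that arithmetic, assuming the slide's 16X transistor and 5.1X speed factors apply uniformly to throughput and on-chip memory:

```python
# 90nm Cell Broadband Engine baseline (figures from the slide above).
base_gflops = 170.0      # sustained, 32-bit 1K-point complex FFT
base_watts = 100.0       # held constant across the projection
base_mem_bytes = 2592e3  # 2592 Kbytes on-chip (decimal K matches the
                         # slide's 66 KFLOPS/byte baseline)

# Assumed 90nm -> 22nm scaling at constant power & die size.
transistor_scale = 16.0  # memory capacity grows with transistor count
speed_scale = 5.1        # per-transistor throughput grows with clock

gflops = base_gflops * transistor_scale * speed_scale  # ~13,900 GFLOPS
mem_bytes = base_mem_bytes * transistor_scale

print(f"{gflops / base_watts:.0f} GFLOPS/W")                # 139 GFLOPS/W
print(f"{gflops * 1e9 / mem_bytes / 1e3:.0f} KFLOPS/byte")  # 334 KFLOPS/byte
```

Note that KFLOPS/byte grows 5.1X because memory scales only with transistor count while throughput also gains the clock factor; this is the "memory shortcoming compounded over time" the slide warns about.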

Timeline for Highest Performance COTS Multiprocessor Card Technologies (3Q06)
[Chart: card-level input & output (million samples per sec, up to 10,000) vs. year. Trend lines: historical Moore's Law slope, 4X in 3 yrs; SRAM-based FPGAs, 3X in 3 yrs; multicore µPs & COTS ASICs, 3X in 3 yrs; uni-processor µP, DSP & RISC (w/ vector processor), 2X in 3 yrs. Regions marked for "back-end" processing (~10 FLOPS/byte) and "front-end" processing (~100 FLOPS/byte).]
Card-level I&O cmplx sample rate sustained for 32-bit flt pt 1K-pt cmplx FFT (1024 MSPS for FFT computed in 1µs with 51.2 GFLOPS) using 6U form factor convection-cooled cards <55W.
Can multicore µPs be used efficiently for anything other than "embarrassingly parallel" front-end applications? Could multicore µPs with high-speed (e.g., optical) on-chip interconnect be made to act more like uni-processors?
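The axis normalization can be checked directly: a 1K-point complex FFT costs about 5·N·log2(N) = 51,200 FLOPs, so computing one every microsecond consumes 1024 complex samples per µs (1024 MSPS) and sustains 51.2 GFLOPS. A short sketch of that arithmetic:

```python
import math

n = 1024            # FFT length, complex samples
fft_time_s = 1e-6   # one FFT computed per microsecond

flops_per_fft = 5 * n * math.log2(n)    # 51,200 FLOPs
sample_rate = n / fft_time_s            # consumes 1.024e9 samples/sec
sustained = flops_per_fft / fft_time_s  # 5.12e10 FLOPS

print(f"{sample_rate / 1e6:.0f} MSPS, {sustained / 1e9:.1f} GFLOPS")
# -> 1024 MSPS, 51.2 GFLOPS
```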

Timeline for Highest Performance COTS Multiprocessor Card Technologies (4Q06)
[Chart: TFLOPS (trillion 32-bit floating-point operations per sec for computing 1K-point complex FFT) vs. year. Multicore µP, COTS ASIC & SRAM-based FPGA cards (e.g., dual Virtex-4 LX200) improving 3X in 3 yrs; uni-processor µP, DSP & RISC (w/ on-chip vector processor) cards (e.g., triple PowerPC MPC7448) improving 2X in 3 yrs; 6X gap by 2016. By 2016, the performance gap is 8 yrs & growing.]
Normalized performance values for 6U (55W) card include I/O to support FFT & high-speed memory for general-purpose processing (10 FLOPS/byte).
Can the uni-processor improvement rate be increased by adding external cache? Are multicore performance & time-to-market advantages lost due to programming difficulties?
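A rough sketch of how gap figures of this order arise under the two compounding rates, assuming the 4Q06 card baselines from the backup slide (28.2 GFLOPS for the triple-MPC7448 card, 51.2 GFLOPS for the dual-LX200 card); the chart's exact 6X and 8-yr values may reflect slightly different baselines:

```python
import math

# 4Q06 card baselines (from the "6U Cards Feasible" backup slide).
uni_2006 = 28.2    # GFLOPS, triple MPC7448 card
multi_2006 = 51.2  # GFLOPS, dual Virtex-4 LX200 card

def project(base: float, years: float, factor_per_3yrs: float) -> float:
    return base * factor_per_3yrs ** (years / 3.0)

uni_2016 = project(uni_2006, 10, 2.0)      # ~284 GFLOPS
multi_2016 = project(multi_2006, 10, 3.0)  # ~1,994 GFLOPS

ratio = multi_2016 / uni_2016  # ~7X throughput gap in 2016
lag = 3 * math.log2(ratio)     # ~8.4 yrs for the 2X-in-3-yrs line to catch up
print(f"gap = {ratio:.1f}X ({lag:.1f} yrs)")
```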

Improvements in COTS Embedded Multiprocessor Card Technologies (4Q06)
[Chart: projections for 2016 based on 2006 data, at 70 W/Liter (55W & 6U form factor). Multicore µP, COTS ASIC & FPGA cards ca. 2006 improving 3X in 3 yrs; uni-processor µP, DSP & RISC (w/ on-chip vector processor) cards ca. 2006 improving 2X in 3 yrs; projected 6X performance gap ca. 2016 due to the different improvement rates.]
Do we need new tools (e.g., floorplanning) to make use of multicore technology for FPGAs? What good are high-throughput multicore chips without high-speed (e.g., optical) off-chip I/O? How can future multicore µPs be supported in an "open systems architecture?"

Objective & Schedule
Objective: assess the impact of multicore processors on past and present DoD HPEC systems, and project their importance for the future.
Schedule:
– Overview
– Guest speaker Mr. Kalpesh Sheth
– Introduction of the panelists
– Previously-submitted questions for the panel
– Open forum
– Conclusions & the way ahead

Panel Session: "Multicore Meltdown?"
Panel members & audience may hold diverse, evolving opinions.
Moderator: Dr. James C. Anderson, MIT Lincoln Laboratory
– Dr. James Held, Intel Corp.
– Mr. Markus Levy, The Multicore Association & The Embedded Microprocessor Benchmark Consortium
– Mr. Greg Rocco, Mercury Computer Systems
– Mr. Kalpesh Sheth, Advanced Processing Group, DRS Technologies
– Dr. Thomas VanCourt, Altera Corp.

Objective & Schedule
Objective: assess the impact of multicore processors on past and present DoD HPEC systems, and project their importance for the future.
Schedule:
– Overview
– Guest speaker Mr. Kalpesh Sheth
– Introduction of the panelists
– Previously-submitted questions for the panel
– Open forum
– Conclusions & the way ahead

Conclusions & The Way Ahead
Embedded processor hardware still improving exponentially:
– Throughput, in FLOPS (floating-point operations per second)
– Memory, in bytes
– Standard form factors with constant SWAP (size, weight and power) still required for upgradeability in DoD HPEC systems to support an "open systems architecture"
At system level, ability to effectively utilize hardware improves slowly, leading to ever-worsening "performance gap":
– Multicore processors, COTS ASICs (e.g., Cell) & FPGAs improving 3X every 3 yrs, but some performance advantages may be lost due to programming difficulties (relatively small memory) & I/O bottlenecks
– Traditional uni-processors easier to program, but improving at slower rate (2X every 3 yrs)
Several methods may help "narrow the gap":
– Improved processor-to-processor & processor-to-memory communication technology (e.g., optical on-chip & off-chip) may allow multi-processors to behave more like uni-processors
– Control logic & 3D interconnect to attach more high-speed memory for a given throughput may improve multicore processor utility (and/or allow uni-processors to improve at a faster rate)
"Multicore meltdown" avoidable for the coming decade, but taking full advantage of performance benefits will be challenging.

Backup Slides

Intel Benchmark Performance
Depending on task, 2 CPU cores perform X as fast as one.

6U Cards Feasible (but not built) using 90nm COTS Devices (4Q06)
µP: Freescale MPC7448 PowerPC @ 1.4 GHz
– 9.4 GFLOPS sustained for 32-bit flt pt 1K cmplx FFT (83.6% of peak)
– 2.3W budget for external 1.5 Gbytes/sec simultaneous I&O
– 2W budget for 1 Gbyte external DDR2 SDRAM (~10 FLOPS/byte)
– 2.5W budget for misc. logic (host bridge, clock, boot flash)
– 1.6W budget for on-board DC-to-DC converter (91% efficient)
– Triple CN (compute node) card: easier component placement vs. quad CN; 28.2 GFLOPS sustained for 55W; 0.51 GFLOPS/W & 37 GFLOPS/L
FPGA: Xilinx Virtex-4 LX200
– 41W (est.) for a legacy card re-implemented using 90nm devices: card includes DC-to-DC converter, 2 FPGAs w/ 225 MHz core & 375 MHz I/O speeds, 528 Mbytes DDR-SRAM, 8 Gbytes/sec I&O, 51.2 GFLOPS sustained for 32-bit flt pt 1K cmplx FFT
– 8W budget for additional 4 Gbytes DDR2 SDRAM (~10 FLOPS/byte)
– Dual CN card: 51.2 GFLOPS sustained for 50W (1W added for increased converter dissipation); 1.0 GFLOPS/W & 67 GFLOPS/L
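A hedged sketch checking the card-level figures above, assuming the 6U convection-cooled form factor's 70 W/L density from the earlier slide (so a 55W card occupies about 0.79 L); the names are illustrative:

```python
CARD_VOLUME_L = 55.0 / 70.0  # 6U card at 70 W/L (earlier slide), ~0.79 L

def card_metrics(gflops: float, watts: float) -> tuple[float, float]:
    """(GFLOPS/W, GFLOPS/L) for a 6U card of the assumed volume."""
    return gflops / watts, gflops / CARD_VOLUME_L

# Triple MPC7448 compute-node card: 3 x 9.4 GFLOPS in a 55W budget.
print(card_metrics(3 * 9.4, 55.0))  # ~(0.51, 36) vs. slide's 0.51 & 37
# Dual Virtex-4 LX200 compute-node card: 51.2 GFLOPS in 50W.
print(card_metrics(51.2, 50.0))     # ~(1.02, 65) vs. slide's 1.0 & 67
```

The small GFLOPS/L differences suggest the slide assumed a slightly smaller card volume than the 70 W/L figure implies; the GFLOPS/W values match exactly.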