MIT Lincoln Laboratory 000523-jca-1 KAM 11/13/2015
Panel Session: Paving the Way for Multicore Open Systems Architectures
James C. Anderson, MIT Lincoln Laboratory


Panel Session: Paving the Way for Multicore Open Systems Architectures
James C. Anderson, MIT Lincoln Laboratory
HPEC08, Wednesday, 24 September 2008

This work was sponsored by the Department of the Air Force under Air Force Contract #FA C. Opinions, interpretations, conclusions and recommendations are those of the author, and are not necessarily endorsed by the United States Government. Reference to any specific commercial product, trade name, trademark or manufacturer does not constitute or imply endorsement.

Objective & Schedule

Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
–Where are we now?
–What needs to be done?

Schedule
–1525: Overview
–1540: Guest speaker: Mr. Markus Levy
–1600: Introduction of the panelists
–1605: Previously submitted questions for the panel
–1635: Open forum
–1655: Conclusions & the way ahead
–1700: Closing remarks & adjourn

Paving the Way for Multicore Open Systems Architectures

[Diagram: design paths for uni-processors and multicore µPs diverged; new tools and methodologies needed.]

But First, A Few Infrastructure Issues

Performance was doubling every 18 months (Moore's Law), but not anymore.

International Technology Roadmap for Semiconductors (ITRS00)

[Chart: on-chip local clock (GHz) vs. year (node): 2007 (65nm), 2010 (45nm), 2013 (32nm), 2016 (22nm); ITRS00 curve.]

~3.5X throughput every 3 yrs predicted for multiple independent cores (~same as 4X every 3 yrs for historical Moore's Law)
–1.4X clock speed every 3 yrs for constant power
–2.5X transistors/chip every 3 yrs (partially driven by economics) for constant chip size (chip size growth ended ~1998)

In 2000, ITRS00 predicted a slightly lower improvement rate vs. historical Moore's Law for the timeframe.
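The quoted rates compound in a simple way: throughput for multiple independent cores scales roughly as clock speed times core count, and core count tracks transistors per chip. A quick sanity-check sketch, using only the figures quoted on the slide:

```python
# Sanity check of the ITRS00 scaling figures quoted above (values from the slide).
clock_per_3yr = 1.4          # clock-speed growth every 3 years at constant power
transistors_per_3yr = 2.5    # transistors/chip every 3 years at constant chip size

# Throughput ~ clock x core count, and core count tracks transistor count:
throughput_per_3yr = clock_per_3yr * transistors_per_3yr
print(f"{throughput_per_3yr:.2f}X every 3 yrs")   # 3.50X, vs 4X for historical Moore's Law

# Equivalent annualized growth rate:
annual = throughput_per_3yr ** (1 / 3)
print(f"{annual:.2f}X per year")                  # 1.52X/yr, vs 4**(1/3) ~ 1.59X/yr
```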

International Technology Roadmap for Semiconductors (ITRS01-02)

[Chart: on-chip local clock (GHz) vs. year (node): 2007 (65nm), 2010 (45nm), 2013 (32nm), 2016 (22nm); ITRS00 and ITRS01-02 curves.]

2.8X throughput every 3 yrs predicted for multiple independent cores
–1.4X clock speed every 3 yrs for constant power (same as ITRS00)
–2X transistors/chip every 3 yrs for constant chip size (less than ITRS00)

ITRS01-02 predicted a substantially lower improvement rate vs. ITRS00, but higher clock speeds.

International Technology Roadmap for Semiconductors (ITRS03-06)

[Chart: on-chip local clock (GHz) vs. year (node): 2007 (65nm), 2010 (45nm), 2013 (32nm), 2016 (22nm); ITRS00, ITRS01-02 and ITRS03-06 curves; 1.4X increase in clock speed vs. ITRS01-02.]

ITRS03-06 predicted the same improvement rate as ITRS01-02, but even higher clock speeds.

2.8X throughput every 3 yrs predicted for multiple independent cores
–1.4X clock speed every 3 yrs for constant power (same as ITRS00-02)
–2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-02)

International Technology Roadmap for Semiconductors (ITRS07)

[Chart: on-chip local clock (GHz) vs. year (node): 2007 (65nm), 2010 (45nm), 2013 (32nm), 2016 (22nm); ITRS00, ITRS01-02, ITRS03-06 and ITRS07 curves; 4.3X reduction in clock speed vs. ITRS03-06.]

ITRS07 predicts lower clock speeds & improvement rate vs. ITRS00-06.

~2.5X throughput every 3 yrs predicted for multiple independent cores
–1.23X clock speed every 3 yrs for constant power (less than ITRS00-06)
–2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-06)

COTS Compute Node (processor, memory & I/O) Performance History & Projections (2Q08)

[Chart: GFLOPS/W (billion 32-bit floating point operations per sec-watt for computing a 1K-point complex FFT) vs. year, with feature-size scale (1000nm, 100nm, 10nm); data points include Intel i860 µP (1µm) and IBM Cell (90nm); trend lines: 4X in 3 yrs for all technologies (historical), 2.5X in 3 yrs for multicore µPs & COTS ASICs, 2.5X in 3 yrs for SRAM-based FPGAs, 2X in 3 yrs for general-purpose uni-processor µPs.]

Projected improvement rates, although smaller than historical values, are still substantial.
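The chart's GFLOPS/W metric is anchored to a specific workload, the 1K-point complex FFT. A small sketch of the arithmetic behind it, assuming the conventional 5·N·log2(N) operation count for a radix-2 complex FFT (the slide does not state which count it uses), with the IBM Cell figures from elsewhere in the deck as the worked example:

```python
import math

# FLOPs per transform, assuming the conventional 5*N*log2(N) count
# for a radix-2 complex FFT (an assumption; the slide does not say).
N = 1024
flops_per_fft = 5 * N * math.log2(N)
print(flops_per_fft)                   # 51200.0 FLOPs per 1K-point complex FFT

# Worked example from the deck: IBM Cell (90nm) at ~2 GFLOPS/W and ~100W,
# i.e. ~200 GFLOPS sustained. Transforms per second at that rate:
gflops = 2.0 * 100
ffts_per_sec = gflops * 1e9 / flops_per_fft
print(f"{ffts_per_sec:.3e} FFTs/sec") # roughly 3.9 million 1K FFTs per second
```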

Notional Cost (cumulative) & Schedule for COTS 90nm Cell Broadband Engine

Cell Broadband Engine: 205 GFLOPS (peak), 100W (est.), ~2 GFLOPS/W

[Chart: development cost (millions of $, cumulative) vs. year:
–IBM, Sony & Toshiba hold architectural discussions
–Austin (Texas) design center opens ($400M joint investment in Cell design)
–Sony exits future Cell development after investing $1.7B
–Technology evaluation systems shipped]

Multicore Open Systems Architecture Example

LEON3
–32-bit SPARC V8 processor developed by Gaisler Research (Aeroflex as of 7/14/08) for the European Space Agency
–Synthesizable VHDL (GNU general public license) & documentation downloadable
–Open source software support (embedded Linux, C/C++ cross-compiler, simulator & symbolic debugger)

0.25µm LEON3FT
–Commercial fault-tolerant implementation of LEON3
–75 MFLOPS/W (150 MIPS for 0.4W)

90nm quad-core LEON3FT
–System emulated with a single SRAM-based FPGA
–133 MFLOPS/W (4x500 MIPS & 4x100 MFLOPS for 3W)
–Each core occupies <1mm² including caches
–MOSIS fabricates 65nm & 90nm die up to 360mm² (IBM process)

How can we improve performance (FLOPS/W), which lags COTS by up to 9 yrs (15X) in this example?
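The "9 yrs (15X)" figure follows from the deck's own improvement rates: at 2.5X every 3 years for multicore parts, a 15X efficiency gap corresponds to roughly nine years of progress. A sketch of that conversion (the 2 GFLOPS/W COTS reference point is taken from the Cell figures earlier in the deck):

```python
import math

# Mapping a performance ratio to "years behind", assuming the 2.5X-every-3-years
# multicore improvement rate quoted earlier in the deck.
gap_ratio = 2.0 / 0.133      # ~2 GFLOPS/W COTS Cell vs. 133 MFLOPS/W quad-core LEON3FT
rate_per_3yr = 2.5

years = 3 * math.log(gap_ratio) / math.log(rate_per_3yr)
print(f"{gap_ratio:.0f}X gap ~= {years:.1f} years")   # ~15X gap ~= ~8.9 years
```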

Notional Cost (cumulative) & Schedule for 90nm LEON3FT Multicore Processor

$3M estimated development cost is mostly staff expense, with schedule determined by foundry.

[Chart: development cost (millions of $) vs. year: 0.25µm LEON3FT, 90nm quad-core LEON3FT, 65nm fab access.]

Objective & Schedule

Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
–Where are we now?
–What needs to be done?

Schedule
–1525: Overview
–1540: Guest speaker: Mr. Markus Levy
–1600: Introduction of the panelists
–1605: Previously submitted questions for the panel
–1635: Open forum
–1655: Conclusions & the way ahead
–1700: Closing remarks & adjourn

Panel Session: Paving the Way for Multicore Open Systems Architectures

Panel members & audience may hold diverse, evolving opinions.

Moderator: Dr. James C. Anderson, MIT Lincoln Laboratory

–Prof. Saman Amarasinghe, MIT Computer Science & Artificial Intelligence Laboratory (CSAIL)
–Mr. Markus Levy, The Multicore Association & The Embedded Microprocessor Benchmark Consortium (EEMBC)
–Dr. Matthew Reilly, Chief Engineer, SiCortex, Inc.
–Mr. John Rooks, Air Force Research Laboratory (AFRL/RITC), Emerging Computing Technology
–Dr. Steve Muir, Chief Technology Officer, Vanu, Inc.

Objective & Schedule

Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
–Where are we now?
–What needs to be done?

Schedule
–1525: Overview
–1540: Guest speaker: Mr. Markus Levy
–1600: Introduction of the panelists
–1605: Previously submitted questions for the panel
–1635: Open forum
–1655: Conclusions & the way ahead
–1700: Closing remarks & adjourn

Conclusions & The Way Ahead

Despite industry slowdown, embedded processors are still improving exponentially (2/3 of historical Moore's Law rate).

Although performance improvements in multicore designs (2.5X every 3 yrs) continue to outpace those of uni-processors (2X every 3 yrs), the "performance gap" is less than previously projected.

New tools and methodologies will be needed to maximize the benefits of using multicore open systems architectures
–Power & packaging issues
–Cost & availability issues
–Training & ease-of-use issues
–Platform independence issues

Although many challenges remain in reducing the performance gap between highly specialized systems vs. multicore open systems architectures, the latter will help insulate users from manufacturer-specific issues.

Success still depends on the ability of foundries to provide smaller geometries & increasing speed for constant power (driven by large-scale COTS product economics).

MIT Lincoln Laboratory jca-17 KAM 11/13/2015 Backup Slides

COTS ASIC: 90nm IBM Cell Broadband Engine (4Q06)

100W, 3.2 GHz; 170 GFLOPS sustained for 32-bit flt pt 1K cmplx FFT (83% of peak)

16 Gbyte memory options (~10 FLOPS/byte)
–COTS Rambus XDR DRAM (Cell is designed to use only this memory): 256 chips, 690W (note: Rambus devices may not be 3D stackable due to 2.7W/chip power consumption)
–Non-COTS solution: Design a bridge chip ASIC (10W est.) to allow use of 128 DDR2 SDRAM devices (32W): 128 chips in 3D stacks to save space (0.25W/chip); operate many memory chips in parallel; buffer to support Rambus speeds; increased latency vs. Rambus

40W budget for external 27 Gbytes/sec simultaneous I&O (using same non-COTS bridge chip to handle I/O with Cell)

Single non-COTS CN (compute node) using DDR2 SDRAM
–170 GFLOPS sustained for 200W (182W est. for CN plus 18W for 91% efficient DC-to-DC converter)
–0.85 GFLOPS/W & 56 GFLOPS/L
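The compute-node totals above are simple power bookkeeping over the slide's own component estimates; a sketch that reproduces them:

```python
# Power and efficiency bookkeeping for the non-COTS Cell compute node,
# using the per-component estimates from the slide.
cell_w = 100      # Cell processor
bridge_w = 10     # non-COTS bridge chip ASIC (est.)
sdram_w = 32      # 128 DDR2 SDRAM chips at 0.25 W each
io_w = 40         # budget for external 27 Gbytes/sec simultaneous I&O

cn_w = cell_w + bridge_w + sdram_w + io_w   # 182 W for the compute node
total_w = cn_w / 0.91                       # 91% efficient DC-to-DC converter
print(f"{total_w:.0f} W total")             # 200 W

gflops_per_w = 170 / total_w                # 170 GFLOPS sustained
print(f"{gflops_per_w:.2f} GFLOPS/W")       # 0.85
```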

COTS Compute Node Performance History & Projections (2Q08)

Compute Node includes FFT (fast Fourier transform) processor, memory (10 FLOPS/byte), simultaneous I&O (1.28 bits/sec per FLOPS) & DC-to-DC converter.

[Chart: GFLOPS/W (billion 32-bit floating point operations per sec-watt for computing a 1K-point complex FFT) vs. year, through 2013 and nodes down to 22nm/16nm; trend lines: 4X in 3 yrs (historical), 2.5X in 3 yrs for SRAM-based FPGAs, COTS ASICs & multicore µPs, 2X in 3 yrs for general-purpose uni-processor µPs. Data points: Intel i860 µP (1000nm), Analog Devices SHARC DSP (600nm), Catalina Research Pathfinder-1 ASIC (350nm), Motorola MPC7400 PowerPC RISC with AltiVec, MPC7410 (180nm), Texas Memory Systems TM-44 Blackbird ASIC (180nm), Xilinx Virtex FPGA (180nm), Virtex II (150nm), Clearspeed CS301 ASIC (130nm), Freescale MPC7447A (130nm), MPC7448 (90nm), Virtex-4 (90nm), IBM Cell (90nm), Intel Polaris (65nm).]

World's Largest Economies: 2000 vs. 2024

U.S. population grows by 1/3 & income shrinks from 5X to <4X world average.

[Chart: GDP* and population, 2000 (6 billion people total) vs. 2024 (8 billion total).]

*Gross domestic product (purchasing power parity)

"Europe's Top 5" are Germany, Great Britain, France, Italy & Spain.

Highest-performance COTS (commercial off-the-shelf) ADCs (analog-to-digital converters), 3Q08

[Chart: resolution vs. sampling rate (MSPS); limits from thermal noise (~0.5 bit/octave) and aperture uncertainty (~1 bit/octave); max processing gain w/ linearization; 2X speed in 6.75 yrs (but up to 0.5 bit processing gain is more than offset by loss of 1 effective bit).]

Historic device-level improvement rates may not be sustainable as technical & economic limits are approached.
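The "processing gain" referred to here is the standard oversampling effect: each octave (2X) of oversampling beyond the signal bandwidth buys about 3 dB of SNR, i.e. ~0.5 effective bit. A minimal sketch of that relationship (the helper function is illustrative, not from the slide):

```python
import math

# Effective-bit gain from oversampling: ~0.5 bit per octave (2X) of
# oversampling, the standard rule behind the slide's processing-gain curve.
def processing_gain_bits(oversampling_ratio):
    """Illustrative helper: effective bits gained for a given oversampling ratio."""
    return 0.5 * math.log2(oversampling_ratio)

print(processing_gain_bits(2))   # 0.5 bit for one octave
print(processing_gain_bits(4))   # 1.0 bit for two octaves
```

This is why the slide's trade is unfavorable: a speed-doubling that costs a full effective bit cannot be recovered by the at-most 0.5 bit of processing gain that octave provides.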

SFDR (spur-free dynamic range) for Highest-performance COTS ADCs, 3Q08

[Chart: SFDR vs. sampling rate; 3dB/octave and 6dB/octave slopes.]

SFDR performance often limits the ability to subsequently achieve "processing gain".

Energy per Effective Quantization Level for Highest-performance COTS ADCs, 3Q08

Recent power decrease (driven by mobile devices market) from smaller geometry & advanced architectures.

Resolution Improvement Timeline for Highest-performance COTS ADCs

[Chart: sampling rate (MSPS, up to 10,000) vs. year, by resolution: 8-bit (ENOB>4.3), 10-bit (ENOB>7.6), 12-bit (ENOB>9.8), 14-bit (ENOB>10.9), 16-bit (ENOB>12.5), 24-bit (ENOB>15.3); non-COTS chip set (12 GSPS).]
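The "(ENOB>…)" annotations distinguish nominal resolution from effective resolution. ENOB is conventionally derived from measured SINAD via the ideal-quantizer SNR formula; a minimal sketch of that standard definition (the 74 dB example value is hypothetical, not from the slide):

```python
# ENOB (effective number of bits) from measured SINAD, the conventional
# definition behind the "(ENOB>9.8)" style annotations: invert the ideal
# quantizer SNR formula, SNR = 6.02*N + 1.76 dB.
def enob(sinad_db):
    return (sinad_db - 1.76) / 6.02

# Hypothetical example: a nominal 14-bit part measuring 74 dB SINAD
# delivers only ~12 effective bits.
print(round(enob(74.0), 1))   # 12.0
```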