Beyond Multi-core: The Dawning of the Era of Tera
Intel 80-core Tera-scale Research Processor
Jim Held, Intel Fellow & Director, Tera-scale Computing Research, Intel Corporation
Agenda
- Tera-scale Computing
  - Motivation
  - Platform Vision
- Teraflops Research Processor
  - Key Ingredients
  - Power Management
  - Performance
  - Programming
  - Key Learnings
- Work in Progress
- Summary
Emerging Applications will demand Tera-scale performance
[Chart: compute demand (GFLOPs, log scale from 0.1 to 10,000) vs. memory bandwidth (GB/s) for emerging workloads]
- Computer vision: video surveillance with 2 webcams; body tracking with 4 cameras
- Ray tracing: Beetle scene and Bar scene, 1M pixels
- Physical simulation: CFD at 150x100x100, 30 fps and at 75x50x50, 10 fps
- Financial analytics: ALM, 6 assets, 7 branches in 1 min; 6 assets, 10 branches in 1 sec
Courtesy of Prof. Ron Fedkiw, Stanford University
A Tera-scale Platform Vision
[Platform block diagram]
- Special-purpose engines
- Integrated I/O devices
- Last-level cache
- Scalable on-die interconnect fabric
- Integrated memory controllers
- High-bandwidth off-die I/O interconnect
- Socket interconnect
Tera-scale Computing Research
- Applications – identify, characterize & optimize
- Programming – empower the mainstream
- System software – scalable services
- Memory hierarchy – feed the compute engine
- On-die interconnect – high bandwidth, low latency
- Cores – power-efficient, general & special function
Teraflops Research Processor
Goals:
- Deliver tera-scale performance
  - Single-precision TFLOP at desktop power
  - Frequency target: 5 GHz
  - Bisection bandwidth on the order of terabits/s
  - Link bandwidth in hundreds of GB/s
- Prototype two key technologies
  - On-die interconnect fabric
  - 3D stacked memory
- Develop a scalable design methodology
  - Tiled design approach
  - Mesochronous clocking
  - Power-aware capability
[Die photo: 21.72 mm x 12.64 mm die with PLL, TAP, and I/O areas; a single tile is 1.5 mm x 2.0 mm]
Key Ingredients
- Special-purpose cores
  - High-performance dual FPMACs
- 2D mesh interconnect
  - High-bandwidth, low-latency crossbar router (39-bit, 40 GB/s links)
  - Phase-tolerant tile-to-tile communication through mesochronous interfaces (MSINT)
- Mesochronous clocking
  - Modular & scalable
  - Lower power
- Workload-aware power management
  - Sleep instructions and packets
  - Chip voltage & frequency control
[Tile block diagram: processing engine (PE) with two FPMACs (multiply, accumulate, normalize), a 6-read/4-write 32-entry register file, 3KB instruction memory (IMEM), 2KB data memory (DMEM), and a router interface block (RIB), linked via MSINTs to the crossbar router]
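The headline performance follows directly from these ingredients: each FPMAC can retire one multiply-accumulate (2 FLOPs) per cycle. A minimal C sketch of the arithmetic, assuming only the tile counts above and the 3.16 GHz operating point from the results slide:

```c
#include <stdio.h>

/* Peak single-precision FLOPS from the tile parameters above.
 * Assumption: each FPMAC retires one multiply-accumulate
 * (2 FLOPs) per cycle. */
int main(void) {
    const int tiles = 80;           /* 8 x 10 mesh of tiles */
    const int fpmacs_per_tile = 2;  /* dual FPMACs per PE */
    const int flops_per_mac = 2;    /* multiply + accumulate */
    const double freq_ghz = 3.16;   /* operating point from the results slide */

    double gflops = tiles * fpmacs_per_tile * flops_per_mac * freq_ghz;
    printf("Peak: %.0f GFLOPS at %.2f GHz\n", gflops, freq_ghz);
    /* 80 * 2 * 2 * 3.16 = ~1011 GFLOPS, i.e. just over 1 TFLOP */
    return 0;
}
```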
Fine-Grain Power Management
- 21 sleep regions per tile (not all shown)
- FP engines sleeping: 90% less power
- Router: 10% less power (stays on to pass traffic)
- Data memory sleeping: 57% less power
- Instruction memory sleeping: 56% less power
- Dynamic sleep
  - STANDBY: memory retains data, 50% less power per tile
  - FULL SLEEP: memories fully off, 80% less power per tile
Scalable power to match workload demands
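To see how chip power scales with workload, here is a minimal C sketch of the two dynamic sleep states; the percentage reductions come from this slide, while the active per-tile baseline is a hypothetical figure chosen purely for illustration:

```c
#include <stdio.h>

/* Illustrative model of per-tile power across the sleep states above.
 * Only the percentage reductions are from the slide; ACTIVE_W is a
 * hypothetical baseline chosen purely for illustration. */
typedef enum { ACTIVE, STANDBY, FULL_SLEEP } tile_state;

#define ACTIVE_W 2.0  /* hypothetical active power per tile, in watts */

static double tile_power_w(tile_state s) {
    switch (s) {
    case STANDBY:    return ACTIVE_W * 0.50;  /* memories retain data */
    case FULL_SLEEP: return ACTIVE_W * 0.20;  /* memories fully off   */
    default:         return ACTIVE_W;         /* tile fully awake     */
    }
}

int main(void) {
    /* Chip power scales with how many of the 80 tiles are awake. */
    for (int awake = 0; awake <= 80; awake += 20) {
        double chip_w = awake * tile_power_w(ACTIVE)
                      + (80 - awake) * tile_power_w(FULL_SLEEP);
        printf("%2d tiles awake: %5.1f W\n", awake, chip_w);
    }
    return 0;
}
```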
Leakage Savings
- Dynamic sleep: total measured idle power 13 W (~7 W sleeping)
- Regulated sleep for memory arrays with state retention
- Memory clamping
[Block diagram: estimated per-block leakage breakdown @ 1.2V, 110°C for the MSINTs, crossbar router, DMEM, IMEM, register file, RIB, and FPMACs, each annotated with awake vs. sleep leakage]
2X-5X leakage power reduction
Router Power Management
- Activity-based power management
- Individual port enables
- Queues put to sleep and clock-gated when a port is idle
- 7X power reduction for idle routers (from 924 mW)
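Conceptually, the per-port policy reduces to the check below; this is a C sketch of the idea, not the actual RTL, and all the field names are illustrative:

```c
#include <stdbool.h>
#include <stdio.h>

/* Conceptual sketch of activity-based router port gating. Each of the
 * router's ports is enabled individually; when a port's queue is empty
 * and no flit is arriving, the queue sleeps and its clock is gated. */
struct port {
    bool enabled;        /* individual port enable              */
    int  queue_depth;    /* flits currently buffered            */
    bool incoming_flit;  /* valid signal from neighboring tile  */
    bool clock_on;       /* result of the gating decision       */
    bool queue_asleep;   /* sleep transistors engaged on FIFO   */
};

void update_port_gating(struct port *p) {
    bool idle = (p->queue_depth == 0) && !p->incoming_flit;
    p->clock_on     = p->enabled && !idle;   /* gate clock when idle */
    p->queue_asleep = !p->enabled || idle;   /* put queue to sleep   */
}

int main(void) {
    struct port east = { .enabled = true, .queue_depth = 0,
                         .incoming_flit = false };
    update_port_gating(&east);
    printf("clock %s, queue %s\n", east.clock_on ? "on" : "gated",
           east.queue_asleep ? "asleep" : "awake");
    return 0;
}
```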
Power Performance Results (80°C, N=80)
[Frequency vs. Vcc: 0.32 TFLOP @ 1 GHz, 1 TFLOP @ 3.16 GHz, 1.63 TFLOP @ 5.1 GHz, 1.81 TFLOP @ 5.67 GHz]
[Measured power vs. Vcc: 1 TFLOP @ 97 W; 1.33 TFLOP @ 230 W; active/leakage power splits of 78 W / 15.6 W and 152 W / 26 W at intermediate points]
[Average power efficiency: 19.4 GFLOPS/W @ 394 GFLOPS, 10.5 GFLOPS/W @ ~1 TFLOP, 5.8 GFLOPS/W @ 1.33 TFLOP]
[Leakage as % of total power vs. Vcc, sleep disabled (all tiles awake) vs. sleep enabled (all tiles asleep): up to 2X reduction]
Stencil workload: 1 TFLOP @ 97 W, 1.07 V
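The efficiency points are consistent with simply dividing delivered GFLOPS by measured power; as a worked check, using only the numbers on this slide:

\[
\frac{1000\ \text{GFLOPS}}{97\ \text{W}} \approx 10.3\ \text{GFLOPS/W},
\qquad
\frac{1330\ \text{GFLOPS}}{230\ \text{W}} \approx 5.8\ \text{GFLOPS/W},
\]

and the peak-efficiency point, 394 GFLOPS at 19.4 GFLOPS/W, implies roughly \(394 / 19.4 \approx 20\ \text{W}\) of total power at that low-voltage operating point.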
Programming Results
- Not designed as a general software development vehicle
  - Small memory, ISA limitations, limited data ports
- Four kernels hand-coded to explore delivered performance (a generic sketch of the stencil follows below):
  - Stencil: 2D heat-diffusion equation
  - SGEMM on 100x100 matrices
  - Spreadsheet computing weighted sums
  - 64-point 2D FFT (on 64 tiles)
- Demonstrated the utility and high scalability of message-passing programming models on many-core
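For a flavor of what the Stencil kernel computes, here is a generic 5-point heat-diffusion update in plain C; this is not the hand-coded per-tile assembly, and the grid size and diffusion constant are arbitrary choices for illustration:

```c
/* Generic 5-point stencil for the 2D heat-diffusion equation, in the
 * spirit of the Stencil kernel above. Plain C for illustration, not
 * the hand-coded tile assembly; N and ALPHA are arbitrary. */
#define N 100
#define ALPHA 0.25f  /* diffusion coefficient x dt / h^2 */

void heat_step(float u[N][N], float v[N][N]) {
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            /* explicit Euler update: v = u + alpha * laplacian(u) */
            v[i][j] = u[i][j] + ALPHA * (u[i-1][j] + u[i+1][j]
                                       + u[i][j-1] + u[i][j+1]
                                       - 4.0f * u[i][j]);
        }
    }
}

int main(void) {
    static float u[N][N], v[N][N];  /* zero-initialized grids */
    u[N / 2][N / 2] = 1.0f;         /* point heat source      */
    heat_step(u, v);
    return 0;
}
```

On the research chip, the grid would be partitioned across tiles, with boundary rows exchanged as messages and the inner multiply-adds mapped onto the dual FPMACs.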
Key Learnings
- Teraflop performance is possible within a mainstream power envelope
  - Peak of 1.01 teraflops at 62 watts
  - Measured peak power efficiency of 19.4 GFLOPS/W
- The tile-based methodology fulfilled its promise
  - Design possible with half the team in half the time
  - Pre- and post-silicon debug reduced; fully functional on A0 silicon
- Fine-grained power management pays off
  - Hierarchical clock gating and sleep-transistor techniques
  - Up to 3X measured reduction in standby leakage power
  - Scalable low-power mesochronous clocking
- Excellent software performance is possible in this message-based architecture
  - Further improvements possible with additional instructions, larger memory, and wider data ports
Work in Progress: Stacked Memory Prototype
- 256 KB SRAM per core
- 4X C4 bump density
- 3,200 through-silicon vias
[Stack diagram: 80-tile processor "Polaris" atop memory die "Freya"; Cu bumps between them denser than C4 pitch, with C4 bumps and through-silicon vias down to the package]
Memory access to match the compute power
Teraflops on IA Pat Gelsinger – Intel Developer Forum 2007
Summary
- Emerging applications will demand teraflop performance
- Teraflop performance is possible within a mainstream power envelope
- Intel is developing technologies to enable Tera-scale computing
Questions
Acknowledgments Sriram Vangal, Jason Howard, Gregory Ruhl, Saurabh Dighe, Howard Wilson, James Tschanz, David Finan, Priya Iyer, Arvind Singh, Tiju Jacob, Shailendra Jain, Sriram Venkataraman, Yatin Hoskote, Nitin Borkar, Rob van der Wijngaart, Michael Frumkin and Tim Mattson