Xiaocheng Zhou Intel Labs China “Single-chip Cloud Computer” An experimental many-core processor from Intel Labs.

Presentation transcript:

Xiaocheng Zhou Intel Labs China “Single-chip Cloud Computer” An experimental many-core processor from Intel Labs

What is Tera-scale? TIPS of compute power operating on terabytes of data.
[Figure: performance (KIPS, MIPS, GIPS, TIPS) plotted against dataset size (kilobytes, megabytes, gigabytes, terabytes), moving from single-core through multi-core to tera-scale. Workloads shown: text, multimedia, 3D & video, RMS. Application areas: personal media creation and management, learning & travel, entertainment, health. Source: Electronic Visualization Lab, University of Illinois.]

Performance Scaling Challenges: Energy Efficiency, Emerging Applications, Programming Strategy, Design Complexity

Cloud Computing Today
• Cloud datacenters:
– 1000s of networked computers
– Millions of threads & petabytes of data
• Opportunity:
– Lower power, higher density via integration
– Greater efficiency and better programmability
• Example: Intel’s Open Cirrus testbed (Intel Labs Pittsburgh)

Single-chip Cloud Computer (SCC)
• Experimental many-core CPU on 45 nm Hi-K metal-gate silicon
• 48 IA-compatible cores
• Network of 2-core nodes mimics cloud computing at chip level
• Fine-grained power management scales from W
• Supports proven, highly parallel “scale-out” programming models

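Inside the SCC
• 24 tiles, 24 routers, 48 IA cores
• 2D mesh network
• 4 integrated DDR3 memory controllers (64GB addressable)
• Each tile is dual-core and has its own router (R)
[Figure: mesh of tiles, each with a router (R), plus memory controllers (MC); one tile is shown enlarged as a dual-core tile.]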

On-die Interconnect
• Architecture
– 6x4 2D mesh NoC
– 16B wide data links + 2B sideband
– 8 virtual channels in 2 classes
– Fixed (X-Y) routing
• Performance
– Target freq: 1.1V
– Link bandwidth: 64GB/s
– 4-cycle latency
• Power Management
– Independent frequency & voltage control
– Sleep mode, clock gating, low-power RF
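Fixed X-Y routing means a packet first travels along the X dimension until it reaches the destination column, then along Y. The following is a minimal sketch of that rule on a 6x4 mesh; the (x, y) coordinate convention and the function names are assumptions made for illustration and are not taken from the slides.

```python
# Toy model of fixed X-Y (dimension-ordered) routing on a 6x4 mesh.
# The coordinate convention (x in 0..5, y in 0..3) is an assumption.
MESH_W, MESH_H = 6, 4


def xy_route(src, dst):
    """Return the router coordinates visited from src to dst, inclusive."""
    for x, y in (src, dst):
        if not (0 <= x < MESH_W and 0 <= y < MESH_H):
            raise ValueError(f"({x}, {y}) is outside the {MESH_W}x{MESH_H} mesh")
    x, y = src
    path = [(x, y)]
    # Travel along X first until the column matches...
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # ...then along Y until the row matches.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hops(src, dst):
    """Number of router-to-router links crossed on the X-Y path."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(hops((0, 0), (5, 3)))
```

Because the path depends only on source and destination, every packet between the same pair of routers follows the same route, which is what "fixed" routing implies here.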

Memory Architecture
• Memory
– Up to 64GB DDR3 via 4 memory controllers (21.3GB/s)
– 16KB SRAM in each tile as Message Passing Buffer (MPB)
• Caching
– 32KB L1 per core (16KB instruction + 16KB data), 12MB L2 cache in total (256KB per core)
– No hardware cache-coherent shared memory
• Addressing
– Core physical addresses map to system physical addresses in 16MB sections
– Memory-mapped configuration & control registers

Address Translation: From Core Address to System Address
A Look Up Table (LUT) performs the physical-to-physical mapping from a core physical address space to the system physical address space.
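The LUT mapping can be illustrated with a small sketch. Only the 16MB section size comes from the slides; the 32-bit core address width, the dictionary-based LUT, and the sample entries below are assumptions chosen for illustration.

```python
# Sketch of a physical-to-physical lookup in 16MB sections.
SECTION_BITS = 24                      # 16MB = 2**24 bytes (from the slides)
SECTION_SIZE = 1 << SECTION_BITS
CORE_ADDR_BITS = 32                    # assumption: 32-bit core physical addresses
LUT_ENTRIES = 1 << (CORE_ADDR_BITS - SECTION_BITS)  # 256 under that assumption


def translate(lut, core_addr):
    """Map a core physical address to a system physical address.

    lut maps a section index (upper address bits) to a system section
    number; this layout is hypothetical.
    """
    if not 0 <= core_addr < (1 << CORE_ADDR_BITS):
        raise ValueError("core address out of range")
    index = core_addr >> SECTION_BITS          # which 16MB section
    offset = core_addr & (SECTION_SIZE - 1)    # position inside the section
    if index not in lut:
        raise LookupError(f"section {index:#x} is not mapped")
    return lut[index] * SECTION_SIZE + offset


if __name__ == "__main__":
    # Hypothetical entries: two core sections mapped to system sections.
    lut = {0x00: 0x400, 0x80: 0x0C0}
    print(hex(translate(lut, 0x80001234)))
    print(hex(translate(lut, 0x00ABCDEF)))
```

The offset within a section is preserved and only the section number is replaced, so a narrow core address can reach a larger system address space.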


Message Passing on SCC
• Regions of memory mapped to multiple cores
– Message Passing Buffer (MPB) for small, fast messages
– Larger buffers in off-die memory
• Message Passing Data Type (MPDT)
– Reads/writes bypass the L2 cache; lines are tagged in L1 as MPDT
– New instruction to selectively invalidate MPDT lines
• Read/write to another core’s MPB on-die
– Synchronize through special atomic register bits
– Core-to-core asynchronous interrupts
• High-level API for applications: “RCCE”
– One-sided (Get, Put) and two-sided (Send, Recv) communication
– MPB allocation, synchronization
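The flag-synchronized exchange through an MPB can be sketched as a toy model. This is not RCCE and not the SCC hardware interface: the Tile class, send, and recv are hypothetical names, and threading.Event stands in for the atomic register bits mentioned on the slide. Only the 16KB buffer size comes from the slides.

```python
import threading

MPB_BYTES = 16 * 1024  # 16KB MPB per tile (from the slides)


class Tile:
    """Toy tile: a shared buffer plus two flags standing in for atomic bits."""

    def __init__(self):
        self.mpb = bytearray(MPB_BYTES)
        self.length = 0
        self.ready = threading.Event()   # set when a message is in the buffer
        self.free = threading.Event()    # set when the buffer may be overwritten
        self.free.set()


def send(dest, payload):
    """Write payload into the destination tile's MPB, then raise its flag."""
    if len(payload) > MPB_BYTES:
        raise ValueError("message too large for the MPB; use off-die memory")
    dest.free.wait()          # wait until the previous message was consumed
    dest.free.clear()
    dest.mpb[:len(payload)] = payload
    dest.length = len(payload)
    dest.ready.set()          # signal the receiver


def recv(me):
    """Wait for the flag, copy the message out, and release the buffer."""
    me.ready.wait()
    me.ready.clear()
    data = bytes(me.mpb[:me.length])
    me.free.set()
    return data


def demo():
    tile, received = Tile(), []
    receiver = threading.Thread(
        target=lambda: received.extend(recv(tile) for _ in range(2)))
    receiver.start()
    send(tile, b"hello")
    send(tile, b"scc")
    receiver.join()
    return received


if __name__ == "__main__":
    print(demo())
```

The sender never overwrites a message that has not been read, and the receiver never reads a half-written one, because each side waits on the flag the other side sets.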

Improving Energy Efficiency
Fine-grain, software-controlled power management: 8 voltage and 28 frequency islands
– Each tile can run at a different frequency
– 6 banks of four tiles can run at different voltages
– Also independent V&F control for I/O network & MCs
[Figure: tile grid with a router (R) per tile and memory controllers, divided into voltage islands V1 to V6, with a frequency island (Fn) per tile.]

Package and Test Board
Technology: 45nm process
Package: 1567-pin LGA package, 14 layers (5-4-5)
Signals: 970 pins

SCC Platform Board Overview

SCC “Chipset”
• System Interface FPGA
– Connects to SCC mesh interconnect
– I/O capabilities such as PCIe, Ethernet & SATA
– Bitstream loaded by BMC
• Board Management Controller (BMC)
– JTAG interface for clocking, power, etc.
– USB stick to hold FPGA bitstream
– Network interface for user interaction via Telnet
– Status monitoring

Software Environment
• SCC Software
– Customized Linux
– Bare metal
– RCCE communication & power management
– Tools: selected Intel tools (e.g., icc, ifort, ...); Microsoft Research release of SCC extensions to Visual Studio
• Management Console PC Software
– PCIe driver with integrated TCP/IP driver
– Programming API for communication with SCC platform
– GUI for interaction with SCC platform
– Command-line tools for interaction with SCC platform

RCCE Communication API
• A compact, lightweight communication environment
– SCC and RCCE were designed side by side
– … a true HW/SW co-design project
• A research vehicle to understand how message passing APIs map onto many-core chips
• For experienced parallel programmers willing to work close to the hardware
• Static SPMD execution model
– Identical UEs (units of execution) are created together when a program starts (a standard approach familiar to message passing programmers)
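The static SPMD model can be illustrated without any SCC software: a fixed number of identical UEs start together, and each one uses its rank to pick its share of the work. In this sketch Python threads stand in for UEs, and run_spmd and partial_sum are hypothetical names, not RCCE calls.

```python
import threading


def run_spmd(num_ues, body):
    """Start num_ues identical workers together and collect their results."""
    results = [None] * num_ues

    def launch(rank):
        # Every UE runs the same body; only its rank differs.
        results[rank] = body(rank, num_ues)

    threads = [threading.Thread(target=launch, args=(r,)) for r in range(num_ues)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def partial_sum(rank, num_ues, n=1000):
    """Each UE sums a strided share of 0..n-1, selected by its rank."""
    return sum(range(rank, n, num_ues))


if __name__ == "__main__":
    # 48 UEs, matching the core count on the slides.
    print(sum(run_spmd(48, partial_sum)))
```

The set of UEs is fixed at launch and none are created later, which is what makes the model static.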

RCCE Power Management API
• RCCE power management emphasizes safe control: V/GHz are changed together within each 4-tile (8-core) power domain
– A master core sets V + GHz for all cores in the domain
– RCCE_istep_power(): steps V + GHz up or down, where GHz is the maximum for the selected voltage
– RCCE_wait_power(): returns when the power change is done
– RCCE_step_frequency(): steps only GHz up or down
• Power management latencies
– V changes: very high latency, O(million) cycles
– GHz changes: low latency, O(few) cycles
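The rules above (master-only control, frequency capped by voltage, slow voltage changes versus fast frequency changes) can be captured in a toy model. The method names echo the functions on the slide, but the signatures, the abstract operating levels, and the cycle constants are assumptions for illustration; this is not the RCCE API.

```python
VOLTAGE_CHANGE_CYCLES = 1_000_000  # "O(million) cycles" on the slide
FREQ_CHANGE_CYCLES = 4             # "O(few) cycles"; the value 4 is a placeholder


class PowerDomain:
    """Toy 4-tile (8-core) domain with abstract operating levels 0..top."""

    def __init__(self, master_core, top_level=3):
        self.master = master_core
        self.top = top_level
        self.voltage = top_level   # abstract voltage level
        self.freq = top_level      # abstract frequency step, never above voltage
        self.done_at = 0           # cycle at which a pending change completes

    def _check_master(self, core):
        if core != self.master:
            raise PermissionError("only the domain master may change power settings")

    def istep_power(self, core, delta, now):
        """Start a V+F step; frequency becomes the maximum for the new voltage."""
        self._check_master(core)
        self.voltage = min(self.top, max(0, self.voltage + delta))
        self.freq = self.voltage
        self.done_at = now + VOLTAGE_CHANGE_CYCLES

    def wait_power(self, now):
        """Return the cycle at which the pending power change is complete."""
        return max(now, self.done_at)

    def step_frequency(self, core, delta, now):
        """Step only the frequency; cheap, and capped by the current voltage."""
        self._check_master(core)
        self.freq = min(self.voltage, max(0, self.freq + delta))
        return now + FREQ_CHANGE_CYCLES


if __name__ == "__main__":
    domain = PowerDomain(master_core=0)
    domain.istep_power(0, -1, now=100)
    print(domain.voltage, domain.freq, domain.wait_power(200))
```

Splitting the request (istep_power) from the wait (wait_power) lets the master do other work while the slow voltage change settles; that split is one reading of the slide rather than documented behavior.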

sccGui for debugging: modify config registers, read system memory

sccBoot & sccReset
• sccBoot: a command-line tool that boots Linux on selected cores and reports which cores are currently booted.
• sccReset: a command-line tool that resets selected SCC cores.

sccKonsole
• Regular konsole, with automatic login to selected cores.
• Enables broadcasting amongst shells.

MARC - Many-core Application Research Community
• Worldwide research partnership program with academia & industry
• Providing access to SCC for many-core programming research
• Overwhelming interest: ~200 research proposals received
• SCC datacenter is online; community website up and running