EE382N-20: Intel Xeon Phi
Mattan Erez, The University of Texas at Austin

Intel’s version of throughput computing
- Similar goals as GPUs
  - Ignore latency
  - Focus on parallelism and efficiency
- But, start with traditional x86 cores
  - More general purpose
  - Reasonable caches
- Add more parallelism and more latency hiding

Knights Corner
- Pentium (P54) scalar cores
  - Efficient small core, but not best performing
- 61 cores (22nm process)
- 512KB shared L2 per core

Knights Landing
- Upgrade to Atom cores
- More cores (72, 14nm process)
- On-package “near” DRAM (allocation sketch below)
- Socket rather than PCIe
- Can be “self hosted”
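A minimal sketch of how software can target the on-package near DRAM, assuming the memkind library’s hbwmalloc interface (the mechanism commonly used for Knights Landing’s high-bandwidth memory); the array size and fallback policy here are illustrative, not from the slides:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>            /* memkind's high-bandwidth allocator */

    int main(void)
    {
        size_t n = 1 << 20;
        /* hbw_check_available() returns 0 when high-bandwidth memory exists;
         * otherwise fall back to ordinary DDR. */
        int hbw = (hbw_check_available() == 0);
        double *a = hbw ? hbw_malloc(n * sizeof *a)
                        : malloc(n * sizeof *a);
        if (!a) return 1;

        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;
        printf("a[n-1] = %f\n", a[n - 1]);

        if (hbw) hbw_free(a); else free(a);
        return 0;
    }

Link against the memkind library (e.g. -lmemkind) when building.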

Parallelism
- Forget OOO, explicit parallelism
- 512-bit vector units in each core
- “Many integrated cores” (MIC): currently 61 cores, and more soon
- A lot of the innovations are in the SIMD units (sketch below)
  - New instructions
  - Better arithmetic design
  - Scatter/gather (to cache)
  - Help with mask generation
  - Better permutations
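To make the SIMD features concrete, here is a hedged sketch of the kind of loop they serve: a gather through an index array plus a data-dependent mask. The function and variable names are invented for illustration; the standard #pragma omp simd hint encourages the compiler to vectorize this into gather / mask / masked-store sequences on the 512-bit units:

    #include <stddef.h>

    /* Gather in[idx[i]], then conditionally scale: the indexed load maps to
     * the scatter/gather support and the if-condition to a vector mask. */
    void masked_gather_scale(float *out, const float *in,
                             const int *idx, size_t n, float threshold)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; i++) {
            float v = in[idx[i]];      /* vector gather */
            if (v > threshold)         /* mask generation */
                out[i] = 2.0f * v;     /* masked store */
        }
    }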

Latency hiding
- 4 HW threads per core (usage sketch below)
- Many cores
- Big-ish caches
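A small sketch of putting those threads to work, assuming a 61-core Knights Corner part (61 cores x 4 HW threads = 244 OpenMP threads): when every hardware thread is populated, threads stalled on memory hide each other’s latency. The environment settings in the comment are Intel OpenMP runtime conventions:

    #include <omp.h>
    #include <stdio.h>

    /* Run with all hardware threads occupied, e.g.:
     *   OMP_NUM_THREADS=244 KMP_AFFINITY=balanced ./a.out  */
    int main(void)
    {
        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }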

Simplicity first
- Ring interconnect
- Extended MESI (kind of like MOESI, but not quite)
  - MESI + GOLS (globally owned locally shared)
- Simple global Tag Directory
- No real multi-threading support, just coherence
- Currently, synchronization is painful (see the padding sketch below)
  - ~250 cycles to get through the tag directory to find stuff
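One place this coherence machinery shows up in software is false sharing. A minimal sketch, assuming Xeon Phi’s 64-byte cache lines: padding per-thread counters to separate lines keeps updates in the local L2, instead of bouncing each write through the global tag directory. The thread count and struct layout are illustrative:

    #include <omp.h>

    enum { LINE = 64 };               /* cache-line size on Xeon Phi */

    /* One counter per cache line; unpadded counters sharing a line would
     * ping-pong between cores via the tag directory on every update. */
    struct padded { long count; char pad[LINE - sizeof(long)]; };
    static struct padded counters[244];

    void count_events(long iters)
    {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < iters; i++)
                counters[t].count++;  /* stays in this core's L2 */
        }
    }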

Programming
- Currently, accelerator as off-load mode (offload sketch below)
- Program with OpenMP/OpenACC or MPI for the most part
- Can share virtual addresses with host, and the compiler can insert copies (at offload boundaries)
- Still a bit rough around the edges
  - Performance bugs
  - Memory leaks
  - Inter-node comm not ideal
- Significant investment though; will get better
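A hedged sketch of the off-load style described above, using the Intel compiler’s offload pragma (Language Extensions for Offload); the in/out clauses mark the boundaries where the compiler inserts the host-to-card copies. The array size and computation are illustrative:

    #include <stdio.h>

    #define N 1024
    float a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = (float)i;

        /* Body executes on the first Xeon Phi card; a is copied in and
         * b is copied back out at the offload boundary. */
        #pragma offload target(mic:0) in(a) out(b)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }

        printf("b[N-1] = %f\n", b[N - 1]);
        return 0;
    }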