EE382N-20: Intel Xeon Phi
Mattan Erez, The University of Texas at Austin

Intel’s version of throughput computing
- Similar goals as GPUs
  - Ignore latency
  - Focus on parallelism and efficiency
- But, start with traditional x86 cores
  - More general purpose
  - Reasonable caches
- Add more parallelism and more latency hiding

Knights Corner
- Pentium (P54) scalar cores
  - Efficient small core, but not best performing
- 61 cores (22nm process)
- 512KB shared L2 per core

Knights Landing
- Upgrade to Atom cores
- More cores (72, 14nm process)
- On-package “near” DRAM (allocation sketch below)
- Socket rather than PCIe
- Can be “self hosted”
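A minimal sketch of how software can target the on-package near DRAM, assuming the memkind library’s hbwmalloc interface (the mechanism commonly used for Knights Landing’s high-bandwidth memory); the array size and fallback policy here are illustrative, not from the slides:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>            /* memkind's high-bandwidth allocator */

    int main(void)
    {
        size_t n = 1 << 20;
        /* hbw_check_available() returns 0 when high-bandwidth memory exists;
         * otherwise fall back to ordinary DDR. */
        int hbw = (hbw_check_available() == 0);
        double *a = hbw ? hbw_malloc(n * sizeof *a)
                        : malloc(n * sizeof *a);
        if (!a) return 1;

        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;
        printf("a[n-1] = %f\n", a[n - 1]);

        if (hbw) hbw_free(a); else free(a);
        return 0;
    }

Link against the memkind library (e.g. -lmemkind) when building.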

Parallelism
- Forget OOO, explicit parallelism
- 512-bit vector units in each core
- “Many integrated cores” (MIC): currently 61 cores, and more soon
- A lot of the innovations are in the SIMD units (sketch below)
  - New instructions
  - Better arithmetic design
  - Scatter/gather (to cache)
  - Help with mask generation
  - Better permutations
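To make the SIMD features concrete, here is a hedged sketch of the kind of loop they serve: a gather through an index array plus a data-dependent mask. The function and variable names are invented for illustration; the standard #pragma omp simd hint encourages the compiler to vectorize this into gather / mask / masked-store sequences on the 512-bit units:

    #include <stddef.h>

    /* Gather in[idx[i]], then conditionally scale: the indexed load maps to
     * the scatter/gather support and the if-condition to a vector mask. */
    void masked_gather_scale(float *out, const float *in,
                             const int *idx, size_t n, float threshold)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; i++) {
            float v = in[idx[i]];      /* vector gather */
            if (v > threshold)         /* mask generation */
                out[i] = 2.0f * v;     /* masked store */
        }
    }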

Latency hiding
- 4 HW threads per core (usage sketch below)
- Many cores
- Big-ish caches
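A small sketch of putting those threads to work, assuming a 61-core Knights Corner part (61 cores x 4 HW threads = 244 OpenMP threads): when every hardware thread is populated, threads stalled on memory hide each other’s latency. The environment settings in the comment are Intel OpenMP runtime conventions:

    #include <omp.h>
    #include <stdio.h>

    /* Run with all hardware threads occupied, e.g.:
     *   OMP_NUM_THREADS=244 KMP_AFFINITY=balanced ./a.out  */
    int main(void)
    {
        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }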

Simplicity first
- Ring interconnect
- Extended MESI (kind of like MOESI, but not quite)
  - MESI + GOLS (globally owned locally shared)
- Simple global Tag Directory
- No real multi-threading support, just coherence
- Currently, synchronization is painful (see the padding sketch below)
  - ~250 cycles to get through the tag directory to find stuff
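One place this coherence machinery shows up in software is false sharing. A minimal sketch, assuming Xeon Phi’s 64-byte cache lines: padding per-thread counters to separate lines keeps updates in the local L2, instead of bouncing each write through the global tag directory. The thread count and struct layout are illustrative:

    #include <omp.h>

    enum { LINE = 64 };               /* cache-line size on Xeon Phi */

    /* One counter per cache line; unpadded counters sharing a line would
     * ping-pong between cores via the tag directory on every update. */
    struct padded { long count; char pad[LINE - sizeof(long)]; };
    static struct padded counters[244];

    void count_events(long iters)
    {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < iters; i++)
                counters[t].count++;  /* stays in this core's L2 */
        }
    }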

Programming
- Currently, accelerator as off-load mode (offload sketch below)
- Program with OpenMP/OpenACC or MPI for the most part
- Can share virtual addresses with host, and the compiler can insert copies (at offload boundaries)
- Still a bit rough around the edges
  - Performance bugs
  - Memory leaks
  - Inter-node comm not ideal
- Significant investment though; will get better
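A hedged sketch of the off-load style described above, using the Intel compiler’s offload pragma (Language Extensions for Offload); the in/out clauses mark the boundaries where the compiler inserts the host-to-card copies. The array size and computation are illustrative:

    #include <stdio.h>

    #define N 1024
    float a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = (float)i;

        /* Body executes on the first Xeon Phi card; a is copied in and
         * b is copied back out at the offload boundary. */
        #pragma offload target(mic:0) in(a) out(b)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }

        printf("b[N-1] = %f\n", b[N - 1]);
        return 0;
    }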