OCR on Knights Landing (Xeon-Phi)

Presentation transcript:

OCR on Knights Landing (Xeon-Phi)
31 March 2016
Acknowledgment: This material is based upon work supported by the Department of Energy Office of Science under cooperative agreements DE-SC0008717 and DE-SC0014355, and Lawrence Livermore National Labs subcontract B608115.

Knights Landing Overview
  Three modes
    Self-boot processor
    Self-boot w/ integrated fabric
    Co-processor (PCIe add-on card)
  MCDRAM: three memory modes
    Flat – entirely addressable
    Cache – MCDRAM acts as a direct-mapped cache in front of DDR
    Hybrid – part cache, part memory
  Cluster modes (cache-coherent mesh interconnect)
    All-to-all: addresses uniformly hashed across all directories
    Quadrant: software-transparent; address hashed to a directory in the same quadrant as its memory
    Sub-NUMA: exposed as 4 NUMA nodes
  Source: KNL presentation at Hot Chips '15
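To make the "direct-mapped" property of cache mode concrete, here is a minimal sketch of how a direct-mapped cache assigns each memory line to exactly one slot. The 16 GB capacity, the 64-byte line size, and the plain modulo index are assumptions for illustration; the slides do not describe the actual hardware mapping, and the function names are hypothetical.

```python
# Illustrative model only: a direct-mapped cache gives every memory line
# exactly one possible slot. Sizes and the modulo mapping are assumptions.
LINE_BYTES = 64                # assumed cache-line size
MCDRAM_BYTES = 16 * 1024**3    # assumed MCDRAM capacity (16 GB)


def cache_slot(phys_addr, cache_bytes=MCDRAM_BYTES, line_bytes=LINE_BYTES):
    """Return the single slot index that the line holding phys_addr maps to."""
    num_slots = cache_bytes // line_bytes
    return (phys_addr // line_bytes) % num_slots


def conflicts(addr_a, addr_b, cache_bytes=MCDRAM_BYTES, line_bytes=LINE_BYTES):
    """True if two addresses lie in different lines that share one slot."""
    different_lines = addr_a // line_bytes != addr_b // line_bytes
    same_slot = (cache_slot(addr_a, cache_bytes, line_bytes)
                 == cache_slot(addr_b, cache_bytes, line_bytes))
    return different_lines and same_slot


if __name__ == "__main__":
    # Two DDR addresses exactly one cache-size apart evict each other.
    print(conflicts(0, MCDRAM_BYTES))
    # Neighbouring lines land in different slots.
    print(conflicts(0, LINE_BYTES))
```

Under this model, a working set larger than the cache, or two hot arrays spaced one cache-size apart, can thrash even when total traffic is modest. That is one reason flat mode with explicit placement can be worth comparing against cache mode.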

OCR on KNL
  1 policy domain with up to 288 workers
  MCDRAM in flat mode, with two allocators

    $ numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 255
    node 0 size: 98200 MB
    node 0 free: 90312 MB
    node 1 cpus:
    node 1 size: 16384 MB
    node 1 free: 15519 MB
    node distances:
    node   0   1
      0:  10  31
      1:  31  10

  Memory hints to choose allocator on MCDRAM (OCR_HINT_DB_HIGHBW)
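In the numactl listing above, node 1 has memory but no CPUs, which is how flat-mode MCDRAM shows up to the operating system. The following sketch parses that kind of output and picks out CPU-less nodes. The helper names are hypothetical, and the sample text is copied from the slide as transcribed (including the CPU list exactly as it appears there).

```python
import re

# Sample copied from the slide transcript; the cpus line is kept as transcribed.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 255
node 0 size: 98200 MB
node 0 free: 90312 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15519 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10
"""

NODE_LINE = re.compile(r"node (\d+) (cpus|size|free):\s*(.*)$")


def parse_numa_nodes(text):
    """Parse 'numactl -H' style text into {node_id: {cpus, size_mb, free_mb}}."""
    nodes = {}
    for line in text.splitlines():
        match = NODE_LINE.match(line.strip())
        if not match:
            continue  # skips the header and the distance table
        node_id, field, value = int(match.group(1)), match.group(2), match.group(3)
        entry = nodes.setdefault(
            node_id, {"cpus": [], "size_mb": None, "free_mb": None})
        if field == "cpus":
            entry["cpus"] = [int(cpu) for cpu in value.split()]
        else:
            entry[field + "_mb"] = int(value.split()[0])
    return nodes


def cpuless_nodes(nodes):
    """Return the ids of nodes that own memory but no CPUs."""
    return sorted(n for n, info in nodes.items() if not info["cpus"])


if __name__ == "__main__":
    parsed = parse_numa_nodes(SAMPLE)
    print("CPU-less (candidate MCDRAM) nodes:", cpuless_nodes(parsed))
```

Outside OCR, the same node id can be used to bind a whole process to MCDRAM, for example with `numactl --membind=1`. Within OCR, the slide indicates that per-datablock placement is driven by the OCR_HINT_DB_HIGHBW hint instead.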

Results – Stencil 2D weak scaling
  [Figure panels: Xeon, KNL]
  Preliminary results! Software under optimization

Results – MCDRAM vs DDR
  [Figure: Stencil 2D with 256 threads]
  Preliminary results! Software under optimization

Results – Stream
  Runtime bottlenecks? Profiling underway
  Limited vectorization opportunities?
  Preliminary results! Software under optimization
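For readers unfamiliar with how STREAM-style numbers are derived, here is a small sketch of the triad kernel and its conventional bandwidth accounting (two reads and one write per element, 8-byte doubles). This is not the OCR Stream code behind the results above. The function names are hypothetical, and pure Python is far too slow to say anything about KNL; it only illustrates the arithmetic.

```python
import time


def triad(a, b, c, scalar):
    """STREAM-style triad: a[i] = b[i] + scalar * c[i], updated in place."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, elem_bytes=8):
    """Bytes counted per triad pass: read b, read c, write a."""
    return 3 * n * elem_bytes


def bandwidth_gb_per_s(n, seconds, elem_bytes=8):
    """Sustained bandwidth in GB/s (10**9 bytes) for one pass over n elements."""
    return triad_bytes(n, elem_bytes) / seconds / 1e9


if __name__ == "__main__":
    n = 1_000_000
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start
    print(f"triad over {n} elements: {bandwidth_gb_per_s(n, elapsed):.3f} GB/s")
```

Because the kernel does only two floating-point operations per 24 bytes moved, it is bandwidth-bound on most hardware. A runtime that falls well short of the expected bandwidth on this kernel therefore points toward overheads or missed vectorization rather than compute limits, which matches the questions the slide raises.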

Next Steps
  Root-cause and fix MCDRAM performance
  Study all-to-all vs. sub-NUMA modes
  Single vs. multiple policy domains
  Performance counters & introspection