System Architecture: Near, Medium, and Long-term Scalable Architectures
Panel Discussion Presentation
Sandia CSRI Workshop on Next-generation Scalable Applications: When MPI-only is not enough
June 4, 2008
Kevin Pedretti
Scalable System Software Dept.
Sandia National Laboratories
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Near Term
Odds are good, but the goods are odd...
–Multi-core, many-core, mega-core
–Heterogeneous ISAs, cores, systems
–Accelerators: GPU, Cell, ClearSpeed, FPGA, etc.
–Embedded: Tilera, SPI, Ambric (336-core), Tensilica
Scalable Architectures
–Peak FLOPS is not the bottleneck
–Improving per-socket efficiency on real applications is "low-hanging fruit"
–Memory size and bandwidth per core are decreasing
–Symbiosis of architecture and system software

Near Term (Cont.)
Adapting MPI implementations to the architecture
–Shared-memory copies vs. NIC
–Cache pollution, injection
–Leverage hierarchy / intra-node locality
Adapting MPI applications to the architecture
–MPI + shared memory: LIBSM
–MPI + something else for intra-node parallelism (a minimal hybrid sketch follows this slide)
  OpenMP, Threading Building Blocks, ALF
  Streaming, CUDA, RapidMind, PeakStream/Google, etc.
  All incompatible, though some share similar concepts
Adapting the architecture to MPI?
–Leveraging interconnect capabilities for PGAS
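
To make the "MPI + something else for intra-node" option concrete, here is a minimal hybrid MPI + OpenMP sketch: one MPI rank per node, with OpenMP threads doing the intra-node work. It is illustrative only; the array size, the work loop, and the funneled threading level are assumptions for the example, not something taken from the slides.

    /* Minimal hybrid MPI + OpenMP sketch: MPI between nodes,
       OpenMP threads for intra-node parallelism (illustrative only). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000            /* placeholder problem size */

    static double x[N];

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Request a threading level that permits OpenMP inside a rank */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0, global = 0.0;

        /* Intra-node data parallelism: OpenMP threads share x[] */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < N; i++) {
            x[i] = (double)(i + rank);
            local += x[i];
        }

        /* Inter-node communication stays in MPI */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %g, threads per rank = %d\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

Built with something like mpicc -fopenmp, this runs one rank per node and fills the remaining cores with threads, which matches the division of labor the bullet above describes.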

OS Scalability
At 8192 nodes, CNL (2.0.44) is 49% worse than Catamount on this Partisn problem.
It does not appear to be a bandwidth issue.

Task and Memory Placement
–No standard mechanisms; most applications punt and hope for the best
–Explicit vs. implicit mechanisms (an explicit-pinning sketch follows this slide)
–More important than node placement?
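
As one concrete example of an explicit placement mechanism (a Linux-specific API chosen for illustration, not something from the original slides), a task can pin itself to a core and rely on first-touch for memory placement:

    /* Explicit task placement sketch (Linux-specific, illustrative):
       pin the calling process to one core; with first-touch policies,
       memory initialized afterwards lands on that core's NUMA node. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);                      /* core 2: arbitrary example */

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* First-touch: initialize memory from the pinned task so the
           pages are allocated near the core that will use them. */
        size_t bytes = 64UL * 1024 * 1024;
        double *buf = malloc(bytes);
        memset(buf, 0, bytes);

        printf("pinned to core 2, touched %zu MB locally\n", bytes >> 20);
        free(buf);
        return 0;
    }

The implicit alternative is to leave placement entirely to the OS scheduler and allocator, which is the "punt and hope for the best" behavior noted above.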

Intra-node MPI

Virtual Memory: Nice, but Gets in the Way
[Figure: performance on a dual-core Opteron. Dashed lines = small pages, solid lines = large pages; open shapes = existing logarithmic algorithm (Gibson/Bruck), solid shapes = new constant-time algorithm (Slepoy, Thompson, Plimpton).]
Unexpected behavior due to the TLB:
–TLB misses increased with large pages, but the time to service a miss decreased dramatically (10x).
–The page table fits in L1! (vs. 2 MB per GB of memory with small pages)
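
For reference, below is a minimal sketch of how an application can request large (2 MB) pages explicitly on Linux via mmap with MAP_HUGETLB. This particular API is an assumption for illustration; the measurements above were taken on a dual-core Opteron system and say nothing about this mechanism.

    /* Large-page allocation sketch (Linux, illustrative only):
       map an anonymous region backed by 2 MB huge pages so that far
       fewer TLB entries cover the same footprint.  Requires huge
       pages to be reserved beforehand (e.g. vm.nr_hugepages > 0). */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (64UL * 1024 * 1024)   /* 64 MB, a multiple of 2 MB */

    int main(void)
    {
        void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* e.g. no huge pages reserved */
            return 1;
        }

        memset(p, 0, LEN);                 /* touch the whole region */
        printf("mapped %lu MB with large pages\n", LEN >> 20);

        munmap(p, LEN);
        return 0;
    }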

So, Is the Answer Large Pages?
–DRAM bank conflicts can be considerable, depending on data alignment (a padding sketch follows this slide)
–OS-level and hardware mitigation strategies exist
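
One common software-side mitigation for alignment-driven conflicts, sketched below under the assumption of a dense 2-D array with a power-of-two row length (the pad size is an arbitrary illustrative choice): pad the leading dimension so successive rows do not all start on the same large power-of-two boundary.

    /* Padding sketch to reduce alignment-driven conflicts: give a
       2048-element row an extra 8 doubles so row starts are staggered
       instead of all falling on the same power-of-two boundary. */
    #include <stdlib.h>

    #define N    2048            /* power-of-two row length (worst case) */
    #define PAD  8               /* illustrative pad: 8 doubles = 64 B   */
    #define LDA  (N + PAD)       /* padded leading dimension             */

    double *alloc_padded_matrix(void)
    {
        /* rows are LDA*sizeof(double) bytes apart rather than N*sizeof(double) */
        return malloc((size_t)N * LDA * sizeof(double));
    }

    static inline double *row(double *a, int i)
    {
        return a + (size_t)i * LDA;  /* always index with the padded stride */
    }

Padding trades a small amount of memory for more uniform access timing; the OS-level and hardware strategies mentioned above attack the same problem without touching application code.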

Affects SpMV Also (28-Node HPCCG Run)
(The CSR kernel in question is sketched below.)
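
For context, the kernel at issue is sparse matrix-vector multiply; a minimal CSR version is sketched below (a generic interface, not HPCCG's actual code). It streams the matrix arrays once and gathers irregularly from x, so page size, TLB behavior, and bank conflicts show up directly in its runtime.

    /* Minimal CSR sparse matrix-vector product y = A*x, the kind of
       kernel at the heart of HPCCG.  Bandwidth-bound: it streams
       row_ptr, col_idx, and val once and gathers irregularly from x. */
    void spmv_csr(int nrows,
                  const int    *row_ptr,   /* size nrows + 1 */
                  const int    *col_idx,   /* size nnz       */
                  const double *val,       /* size nnz       */
                  const double *x,
                  double       *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }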

Medium Term
More accelerators, normalization
–Attractive power and memory efficiency
–Commodity processors will integrate GPUs on-chip
–HPC-centric off-chip accelerators
General-purpose cores are not getting much faster
Leverage the architecture for specific application domains
–Some common mechanism will/must emerge for dealing with data-parallel accelerators
General-purpose cores become more lightweight, a better match for lightweight system software
–Chip stacking
–Off-chip optics

Long Term
MPP-on-a-chip
On- and off-chip optics
More intelligent memory systems
Application-driven architectures