Prof. Srinidhi Varadarajan Director Center for High-End Computing Systems

- We need a paradigm shift to make supercomputers more usable for mainstream computational scientists.
- A similar shift occurred in computing in the 1970s, when the advent of inexpensive minicomputers in academia spurred a large body of computing research.
- Results from that research flowed back to industry, creating a growth cycle that led to computing becoming a commodity.
- This requires a comprehensive "rethink" of programming languages, runtime systems, operating systems, scheduling, reliability, and operations and management.
- Moving to petascale and exascale class systems significantly complicates this challenge.
- We need a computing environment that can efficiently and usably span the scales from department-sized systems to national resources.

- The majority of our supercomputers today are distributed memory systems that use the message passing model of parallel computation.
- The shared/distributed memory view is a dichotomy imposed by hardware constraints.
- Modern high-performance interconnects such as InfiniBand are memory-based systems.
- This provides the hardware basis to envision DSM systems that deepen the memory hierarchy.
- The most common operations are accelerated through hardware offload.
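
For readers less familiar with the message passing model referenced above, here is a minimal sketch in C using MPI (illustrative only, not part of the presented system): each rank owns its own memory, and data moves only through explicit sends and receives.

/* Minimal MPI sketch of the message passing model: each rank owns its
 * memory; data moves only via explicit MPI_Send / MPI_Recv calls.
 * Run with at least two ranks, e.g. mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                              /* data lives in rank 0's memory  */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);   /* explicit copy, no shared memory */
    }

    MPI_Finalize();
    return 0;
}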

- A common question: "My application runs on my desktop, but it takes too long. Can I just run it on the supercomputer and make it run faster?"
- Short answer: no. Longer answer: almost certainly not.
- As core frequencies have flattened, multi-core and many-core architectures are here to stay.
- This is increasing the prevalence of threaded codes.
- Can we take standard threaded codes and run them on a cluster supercomputer without any modifications or recompilation?

- The goal of our work is to enable Pthreads-based threaded codes to run transparently on cluster supercomputers.
- The DSM system acts as the runtime and provides a globally consistent memory abstraction.
- A new consistency algorithm with release consistency semantics guarantees correct operation for valid threaded codes.
- No, it won't fix your bugs, but it may make deadlock and livelock detection easier, possibly even automatic.
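
As an illustration of the target workload (an assumption, not code from the talk), a correctly synchronized Pthreads program looks like the following; under release consistency, the DSM only has to make shared updates globally visible at the lock release points.

/* Sketch of a valid Pthreads program whose shared accesses are protected
 * by a mutex. Under release consistency, the DSM only has to make
 * 'counter' globally visible at unlock (release) time. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                         /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);               /* acquire */
        counter++;                               /* shared update */
        pthread_mutex_unlock(&lock);             /* release: updates may be flushed here */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}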

- Separation of concerns: the system is divided into a consistency layer and a lower-level communication layer.
- The communication layer uses a well-defined architecture, similar to MPI's ADI (Abstract Device Interface), to support a wide variety of lower-level interconnects.
- The system consists of either dedicated memory servers, or nodes that contribute a portion of their memory to a global pool.
- Dedicated memory servers are essentially low-end servers that host a large amount of memory behind a fast interconnect.
- Memory striping algorithms are employed to mitigate memory access hotspots.
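
The slides do not specify the striping algorithm; the sketch below is one plausible scheme, assuming fixed-size pages and a round-robin mapping from page number to memory server, so that a hot region of the global pool is spread across several servers.

/* Sketch of page striping across memory servers (an assumption; the talk
 * does not give the actual placement algorithm). Consecutive pages are
 * spread round-robin so a hot region is not served by a single node. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                   /* 4 KiB pages */
#define NUM_SERVERS 8                   /* hypothetical pool size */

/* Map a global virtual address to (server, local page index on that server). */
static void locate_page(uint64_t gaddr, int *server, uint64_t *local_page)
{
    uint64_t page = gaddr >> PAGE_SHIFT;
    *server = (int)(page % NUM_SERVERS);        /* round-robin stripe */
    *local_page = page / NUM_SERVERS;           /* dense index on that server */
}

int main(void)
{
    for (uint64_t addr = 0; addr < 8 * 4096; addr += 4096) {
        int s; uint64_t lp;
        locate_page(addr, &s, &lp);
        printf("gaddr 0x%llx -> server %d, local page %llu\n",
               (unsigned long long)addr, s, (unsigned long long)lp);
    }
    return 0;
}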

- The DSM architecture uses a global scheduler that treats cluster nodes as a set of processors.
- Thread migration is simple and relatively inexpensive.
- This enables load balancing through runtime migration.
- Two issues: compute load imbalance and data affinity.
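
As a rough illustration of these two issues, the following hypothetical heuristic (not the scheduler described in the talk, and with made-up weights) scores candidate nodes by trading off spare compute capacity against data affinity, i.e. the fraction of a thread's working set already resident on a node.

/* Hypothetical migration heuristic (illustrative only): pick the target
 * node that best trades off spare compute capacity against data affinity. */
#include <stdio.h>

struct node {
    double load;          /* 0.0 = idle, 1.0 = fully loaded       */
    double affinity;      /* fraction of thread's pages held here */
};

/* Higher score = better migration target. Weights are assumptions. */
static double score(const struct node *n)
{
    const double w_load = 0.6, w_affinity = 0.4;
    return w_load * (1.0 - n->load) + w_affinity * n->affinity;
}

int main(void)
{
    struct node nodes[3] = {
        { .load = 0.9, .affinity = 0.8 },   /* busy but holds most of the data */
        { .load = 0.2, .affinity = 0.1 },   /* idle but would fault heavily    */
        { .load = 0.5, .affinity = 0.5 },
    };
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (score(&nodes[i]) > score(&nodes[best]))
            best = i;
    printf("migrate thread to node %d (score %.2f)\n", best, score(&nodes[best]));
    return 0;
}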

- Extending the threads model to support adaptivity: transactional memory.
- An artifact of our consistency protocol lets us provide transactional memory semantics fairly inexpensively.
- This enables speculative and/or adaptive execution models, particularly in hard-to-parallelize sequential sections of code.
- Speculation lets us explore multiple execution paths, with the DSM guaranteeing that there are no memory side effects; invalid paths are simply pruned.
- Adaptive execution allows optimistic and conservative algorithms to be started concurrently.
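
The talk does not show a transactional memory interface, so the sketch below models only the semantics: a speculative path works on a private copy of the data and either commits it or discards it, leaving no memory side effects on the shared state.

/* Toy model of transactional/speculative semantics (not the system's
 * actual TM interface): speculative work happens on a private copy; if
 * validation fails the copy is discarded, so shared state is untouched. */
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define N 8
static int shared[N];                    /* stands in for DSM-visible memory */

static bool speculative_update(void)
{
    int tx[N];
    memcpy(tx, shared, sizeof tx);       /* begin: snapshot into private copy */

    for (int i = 0; i < N; i++)          /* speculative work on the copy only */
        tx[i] += i * i;

    bool valid = (tx[N - 1] < 1000);     /* stand-in validation check         */
    if (valid)
        memcpy(shared, tx, sizeof tx);   /* commit: publish the updates       */
    return valid;                        /* abort: just drop 'tx', no effects */
}

int main(void)
{
    printf("committed: %s, shared[7] = %d\n",
           speculative_update() ? "yes" : "no", shared[7]);
    return 0;
}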

- Current threaded and message passing models are inadequate for peta- and exascale systems.
- Growth in heterogeneous multi-core systems significantly complicates this problem.
- We need more comprehensive runtime systems that can aid in load balancing, profile-guided optimization, and code adaptation.
- The compilers community's move toward greater emphasis on dynamic analysis is a step in this direction.

- We are working on hybrid programming models that embed von Neumann, program-counter-based elements within dataflow constructs.
- A new model must provide insight into problem decomposition as well as map existing decomposition methods.
- We also need coordination models that can operate at peta- and exascale.
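
Purely as an illustration of the idea (the hybrid model itself is not specified here), the toy sketch below wires ordinary imperative C task bodies into a dataflow-style graph in which a node fires only once all of its inputs have arrived.

/* Toy hybrid sketch: node bodies are ordinary imperative C code, but a
 * node executes (fires) only when all of its inputs are available,
 * dataflow-style. Illustration only, not the proposed model. */
#include <stdio.h>
#include <stdbool.h>

struct node {
    int inputs_needed;    /* tokens required before firing       */
    int inputs_ready;
    int value;            /* accumulated result when node fires  */
    bool fired;
};

/* Deliver a token to 'dst'; run its imperative body when all inputs arrive. */
static void send(struct node *dst, int token)
{
    dst->value += token;                         /* body: accumulate inputs */
    if (++dst->inputs_ready == dst->inputs_needed) {
        dst->fired = true;
        printf("node fired with value %d\n", dst->value);
    }
}

int main(void)
{
    struct node a = { .inputs_needed = 1 };      /* a and b feed into c */
    struct node b = { .inputs_needed = 1 };
    struct node c = { .inputs_needed = 2 };

    send(&a, 3);  send(&c, a.value);             /* a fires, forwards to c */
    send(&b, 4);  send(&c, b.value);             /* b fires, then c fires  */
    return c.fired ? 0 : 1;
}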

- Methods to evolve applications easily when requirements change.
- We are working with the compilers, programming languages, architectures, software engineering, applications, and systems communities to realize this goal.

- System G: 2600-core Intel x86 Xeon (Penryn) cluster with Quad Data Rate InfiniBand; 12,000 thermal sensors and 5,000 power sensors.
- System X: 2200-processor PowerPC cluster with an InfiniBand interconnect.
- Anantham: 400-processor Opteron cluster with a Myrinet interconnect.
- Several 8-32 processor research clusters.
- 12-processor SGI Altix shared memory system.
- 8-processor AMD Opteron shared memory system.
- 16-core AMD Opteron shared memory system.
- 16-node PlayStation 3 cluster.