Prof. Srinidhi Varadarajan Director Center for High-End Computing Systems

- We need a paradigm shift to make supercomputers more usable for mainstream computational scientists.
- A similar shift occurred in computing in the 1970s, when the advent of inexpensive minicomputers in academia spurred a large body of computing research.
- Results from that research flowed back to industry, creating a growth cycle that led to computing becoming a commodity.
- This requires a comprehensive "rethink" of programming languages, runtime systems, operating systems, scheduling, reliability, and operations and management.
- Moving to petascale and exascale class systems significantly complicates this challenge.
- We need a computing environment that can efficiently and usably span the scales from department-sized systems to national resources.

- The majority of our supercomputers today are distributed memory systems that use the message passing model of parallel computation.
- The shared/distributed memory view is a dichotomy imposed by hardware constraints.
- Modern high-performance interconnects such as InfiniBand are memory-based systems.
- This provides the hardware basis to envision DSM systems that deepen the memory hierarchy.
- The most common operations are accelerated through hardware offload.
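
For readers less familiar with the message passing model referenced above, here is a minimal sketch in C using MPI (illustrative only, not part of the presented system): each rank owns its own memory, and data moves only through explicit sends and receives.

/* Minimal MPI sketch of the message passing model: each rank owns its
 * memory; data moves only via explicit MPI_Send / MPI_Recv calls.
 * Run with at least two ranks, e.g. mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                              /* data lives in rank 0's memory  */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);   /* explicit copy, no shared memory */
    }

    MPI_Finalize();
    return 0;
}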

- A common question: "My application runs on my desktop, but it takes too long. Can I just run it on the supercomputer and make it run faster?"
- Short answer: no. Longer answer: almost certainly not.
- As core frequencies have flattened, multi-core and many-core architectures are here to stay.
- This is increasing the prevalence of threaded codes.
- Can we take standard threaded codes and run them on a cluster supercomputer without any modifications or recompilation?

- The goal of our work is to enable Pthreads-based threaded codes to run transparently on cluster supercomputers.
- The DSM system acts as the runtime and provides a globally consistent memory abstraction.
- A new consistency algorithm with release consistency semantics guarantees correct operation for valid threaded codes.
- No, it won't fix your bugs, but it may make deadlock and livelock detection easier, possibly even automatic.
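
As an illustration of the target workload (an assumption, not code from the talk), a correctly synchronized Pthreads program looks like the following; under release consistency, the DSM only has to make shared updates globally visible at the lock release points.

/* Sketch of a valid Pthreads program whose shared accesses are protected
 * by a mutex. Under release consistency, the DSM only has to make
 * 'counter' globally visible at unlock (release) time. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                         /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);               /* acquire */
        counter++;                               /* shared update */
        pthread_mutex_unlock(&lock);             /* release: updates may be flushed here */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}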

- Separation of concerns: the system is divided into a consistency layer and a lower-level communication layer.
- The communication layer uses a well-defined architecture, similar to MPI's ADI (Abstract Device Interface), to support a wide variety of lower-level interconnects.
- The system consists of either dedicated memory servers, or nodes that contribute a portion of their memory to a global pool.
- Dedicated memory servers are essentially low-end servers that host a large amount of memory behind a fast interconnect.
- Memory striping algorithms are employed to mitigate memory access hotspots.
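
The slides do not specify the striping algorithm; the sketch below is one plausible scheme, assuming fixed-size pages and a round-robin mapping from page number to memory server, so that a hot region of the global pool is spread across several servers.

/* Sketch of page striping across memory servers (an assumption; the talk
 * does not give the actual placement algorithm). Consecutive pages are
 * spread round-robin so a hot region is not served by a single node. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                   /* 4 KiB pages */
#define NUM_SERVERS 8                   /* hypothetical pool size */

/* Map a global virtual address to (server, local page index on that server). */
static void locate_page(uint64_t gaddr, int *server, uint64_t *local_page)
{
    uint64_t page = gaddr >> PAGE_SHIFT;
    *server = (int)(page % NUM_SERVERS);        /* round-robin stripe */
    *local_page = page / NUM_SERVERS;           /* dense index on that server */
}

int main(void)
{
    for (uint64_t addr = 0; addr < 8 * 4096; addr += 4096) {
        int s; uint64_t lp;
        locate_page(addr, &s, &lp);
        printf("gaddr 0x%llx -> server %d, local page %llu\n",
               (unsigned long long)addr, s, (unsigned long long)lp);
    }
    return 0;
}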

- The DSM architecture uses a global scheduler that treats cluster nodes as a set of processors.
- Thread migration is simple and relatively inexpensive.
- This enables load balancing through runtime migration.
- Two issues: compute load imbalance and data affinity.
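
As a rough illustration of these two issues, the following hypothetical heuristic (not the scheduler described in the talk, and with made-up weights) scores candidate nodes by trading off spare compute capacity against data affinity, i.e. the fraction of a thread's working set already resident on a node.

/* Hypothetical migration heuristic (illustrative only): pick the target
 * node that best trades off spare compute capacity against data affinity. */
#include <stdio.h>

struct node {
    double load;          /* 0.0 = idle, 1.0 = fully loaded       */
    double affinity;      /* fraction of thread's pages held here */
};

/* Higher score = better migration target. Weights are assumptions. */
static double score(const struct node *n)
{
    const double w_load = 0.6, w_affinity = 0.4;
    return w_load * (1.0 - n->load) + w_affinity * n->affinity;
}

int main(void)
{
    struct node nodes[3] = {
        { .load = 0.9, .affinity = 0.8 },   /* busy but holds most of the data */
        { .load = 0.2, .affinity = 0.1 },   /* idle but would fault heavily    */
        { .load = 0.5, .affinity = 0.5 },
    };
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (score(&nodes[i]) > score(&nodes[best]))
            best = i;
    printf("migrate thread to node %d (score %.2f)\n", best, score(&nodes[best]));
    return 0;
}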

- Extending the threads model to support adaptivity: transactional memory.
- An artifact of our consistency protocol lets us provide transactional memory semantics fairly inexpensively.
- This enables speculative and/or adaptive execution models, particularly in hard-to-parallelize sequential sections of code.
- Speculation lets us explore multiple execution paths, with the DSM guaranteeing that there are no memory side effects; invalid paths are simply pruned.
- Adaptive execution allows optimistic and conservative algorithms to be started concurrently.
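
The talk does not show a transactional memory interface, so the sketch below models only the semantics: a speculative path works on a private copy of the data and either commits it or discards it, leaving no memory side effects on the shared state.

/* Toy model of transactional/speculative semantics (not the system's
 * actual TM interface): speculative work happens on a private copy; if
 * validation fails the copy is discarded, so shared state is untouched. */
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define N 8
static int shared[N];                    /* stands in for DSM-visible memory */

static bool speculative_update(void)
{
    int tx[N];
    memcpy(tx, shared, sizeof tx);       /* begin: snapshot into private copy */

    for (int i = 0; i < N; i++)          /* speculative work on the copy only */
        tx[i] += i * i;

    bool valid = (tx[N - 1] < 1000);     /* stand-in validation check         */
    if (valid)
        memcpy(shared, tx, sizeof tx);   /* commit: publish the updates       */
    return valid;                        /* abort: just drop 'tx', no effects */
}

int main(void)
{
    printf("committed: %s, shared[7] = %d\n",
           speculative_update() ? "yes" : "no", shared[7]);
    return 0;
}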

- Current threaded and message passing models are inadequate for peta- and exascale systems.
- Growth in heterogeneous multi-core systems significantly complicates this problem.
- We need more comprehensive runtime systems that can aid in load balancing, profile-guided optimization, and code adaptation.
- The compilers community's move toward greater emphasis on dynamic analysis is a step in this direction.

- We are working on hybrid programming models that embed von Neumann, program-counter-based elements within dataflow constructs.
- A new model must provide insight into problem decomposition as well as map existing decomposition methods.
- We also need coordination models that can operate at peta- and exascale.
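
Purely as an illustration of the idea (the hybrid model itself is not specified here), the toy sketch below wires ordinary imperative C task bodies into a dataflow-style graph in which a node fires only once all of its inputs have arrived.

/* Toy hybrid sketch: node bodies are ordinary imperative C code, but a
 * node executes (fires) only when all of its inputs are available,
 * dataflow-style. Illustration only, not the proposed model. */
#include <stdio.h>
#include <stdbool.h>

struct node {
    int inputs_needed;    /* tokens required before firing       */
    int inputs_ready;
    int value;            /* accumulated result when node fires  */
    bool fired;
};

/* Deliver a token to 'dst'; run its imperative body when all inputs arrive. */
static void send(struct node *dst, int token)
{
    dst->value += token;                         /* body: accumulate inputs */
    if (++dst->inputs_ready == dst->inputs_needed) {
        dst->fired = true;
        printf("node fired with value %d\n", dst->value);
    }
}

int main(void)
{
    struct node a = { .inputs_needed = 1 };      /* a and b feed into c */
    struct node b = { .inputs_needed = 1 };
    struct node c = { .inputs_needed = 2 };

    send(&a, 3);  send(&c, a.value);             /* a fires, forwards to c */
    send(&b, 4);  send(&c, b.value);             /* b fires, then c fires  */
    return c.fired ? 0 : 1;
}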

- Methods to evolve applications easily when requirements change.
- We are working with the compilers, programming languages, architectures, software engineering, applications, and systems communities to realize this goal.

- System G: 2600-core Intel x86 Xeon (Penryn) cluster with Quad Data Rate InfiniBand; 12,000 thermal sensors and 5,000 power sensors.
- System X: 2200-processor PowerPC cluster with an InfiniBand interconnect.
- Anantham: 400-processor Opteron cluster with a Myrinet interconnect.
- Several 8-32 processor research clusters.
- 12-processor SGI Altix shared memory system.
- 8-processor AMD Opteron shared memory system.
- 16-core AMD Opteron shared memory system.
- 16-node PlayStation 3 cluster.