Slide 1: Memory Architectures for Protein Folding: MD on million PIM processors
University of Illinois at Urbana-Champaign
Fort Lauderdale, May 03,

Slide 2: Overview
- EIA: "ITR: Intelligent Memory Architectures and Algorithms to Crack the Protein Folding Problem"
- PIs:
  - Josep Torrellas and Laxmikant Kale (University of Illinois)
  - Mark Tuckerman (New York University)
  - Michael Klein (University of Pennsylvania)
  - Also associated: Glenn Martyna (IBM)
- Period: 8/00 - 7/03

Slide 3: Project Description
- Multidisciplinary project in computer architecture and software, and computational biology
- Goals:
  - Design improved algorithms to help solve the protein folding problem
  - Design the architecture and software of general-purpose parallel machines that speed up the solution of the problem

Slide 4: Some Recent Progress: Ideas
- Developed REPSWA (Reference Potential Spatial Warping Algorithm)
  - A novel algorithm for accelerating conformational sampling in molecular dynamics, a key element in protein folding
  - Based on a "spatial warping" variable transformation, designed to shrink barrier regions on the energy landscape and grow attractive basins without altering the equilibrium properties of the system (sketched below)
  - Result: large gains in sampling efficiency
  - Reference: Z. Zhu, M. E. Tuckerman, S. O. Samuelson and G. J. Martyna, "Using novel variable transformations to enhance conformational sampling in molecular dynamics," Phys. Rev. Lett. 88 (2002)
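To make the mechanism concrete, here is a minimal sketch of the general change-of-variables argument such transformations rely on; the specific REPSWA warp and its reference potential are not reproduced here, and the notation (g, J_g) is ours, not the paper's.

```latex
% Change of variables x = g(u): sampling u under a Jacobian-corrected potential
% reproduces the original Boltzmann distribution in x, so equilibrium averages
% are unchanged no matter how g reshapes the landscape.
\[
  \tilde{V}(u) \;=\; V\!\bigl(g(u)\bigr) \;-\; k_B T \,\ln\bigl|\det J_g(u)\bigr|,
  \qquad
  e^{-\beta \tilde{V}(u)}\,du
  \;=\; e^{-\beta V(g(u))}\,\bigl|\det J_g(u)\bigr|\,du
  \;=\; e^{-\beta V(x)}\,dx .
\]
```

A warp g chosen, with the help of a reference potential, to compress barrier regions and expand basins therefore flattens the landscape seen by the dynamics in the u variables while leaving all equilibrium properties in x intact, which is the property claimed on this slide.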

Slide 5: Some Recent Progress: Tools
- Developed LeanMD, a parallel molecular dynamics program that targets very large parallel machines
  - A research-quality program based on the Charm++ parallel object-oriented language
  - A descendant of NAMD (another parallel molecular dynamics application), which achieved unprecedented speedups on thousands of processors
  - LeanMD is intended to run on next-generation parallel machines with tens of thousands or even millions of processors, such as Blue Gene/L or Blue Gene/C
  - This requires a new parallelization strategy that breaks the simulation problem up in a more fine-grained manner, generating enough parallelism to effectively distribute work across a million processors (a decomposition sketch follows below)
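The sketch below is not LeanMD code; it is a back-of-the-envelope illustration in plain C++, with assumed box and cutoff sizes, of why a NAMD-style cell-plus-pairwise-compute decomposition yields many more schedulable objects than a one-patch-per-processor decomposition.

```cpp
// Conceptual sketch (not LeanMD source): decompose the simulation box into
// small "cells" plus one "compute" object per interacting cell pair.  The
// number of independent work units greatly exceeds the processor count, which
// is what lets a runtime like Charm++ spread work over very many processors.
#include <cstdio>

int main() {
    const double boxSide  = 120.0;   // Angstroms (hypothetical system size)
    const double cutoff   = 12.0;    // non-bonded cutoff (hypothetical)
    const double cellSide = cutoff;  // "one-away" scheme: cell edge equals the cutoff

    const int  n     = static_cast<int>(boxSide / cellSide);   // cells per dimension
    const long cells = static_cast<long>(n) * n * n;

    // With periodic boundaries, each cell interacts with itself and its 26
    // neighbors; counting each unordered cell pair once gives 13 pair-computes
    // per cell plus 1 self-compute.
    const long computes = cells * (13 + 1);

    std::printf("cells = %ld, computes = %ld, total work units = %ld\n",
                cells, computes, cells + computes);
    // For this 120 A box: 1,000 cells and 14,000 computes, i.e. ~15,000
    // independently schedulable objects from a fairly small system -- far more
    // than a one-patch-per-processor decomposition would provide.
    return 0;
}
```

One common way to push parallelism further is to use cells smaller than the cutoff, at the cost of more pair-computes per cell.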

Slide 6: Some Recent Progress: Tools
- Developed a high-performance communication library for collective communication operations
  - All-to-all personalized communication, all-to-all multicast, and all-reduce
  - These operations can be complex and time-consuming on large parallel machines
  - They are especially costly for applications that involve all-to-all patterns, such as 3D FFT and sorting
  - The library optimizes collective communication operations by performing message combining over an imposed virtual topology (sketched below)
  - The overhead of all-to-all communication for 76-byte message exchanges among 2058 processors is in the low tens of milliseconds
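As a rough illustration of why combining over a virtual topology helps, the sketch below (assuming a 2D-mesh virtual topology; the library's actual topologies and code are not shown on this slide) compares the per-processor message counts of naive all-to-all with two-phase row/column combining.

```cpp
// A minimal sketch of message combining on an assumed sqrt(P) x sqrt(P)
// virtual mesh: instead of sending P-1 direct messages, each processor sends
// combined messages along its row and then its column, trading message count
// for extra bytes forwarded through an intermediate hop.
#include <cmath>
#include <cstdio>

int main() {
    const int procs[] = {256, 1024, 4096};          // perfect squares for a square mesh
    for (int P : procs) {
        const int side     = static_cast<int>(std::lround(std::sqrt(P)));
        const int direct   = P - 1;                 // naive: one message per peer
        const int combined = 2 * (side - 1);        // phase 1: row peers, phase 2: column peers
        std::printf("P=%5d  direct msgs/proc=%5d  2D-mesh combined msgs/proc=%3d\n",
                    P, direct, combined);
    }
    return 0;
}
```

Fewer, larger messages amortize the per-message software overhead that dominates for small payloads such as the 76-byte messages quoted above, at the cost of forwarding each byte through an intermediate processor.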

Slide 7: Some Recent Progress: People
- The following graduate student researchers have been supported:
  - Sameer Kumar (University of Illinois)
  - Gengbin Zheng (University of Illinois)
  - Jun Nakano (University of Illinois)
  - Zhongwei Zhu (New York University)

Slide 8: Overview
- Rest of the talk:
  - Objective: develop a molecular dynamics program that will run effectively on a million processors, each with a low memory-to-processor ratio
  - Method:
    - Use the parallel objects methodology
    - Develop an emulator/simulator that allows one to run full-fledged programs on a simulated architecture
  - Presenting today:
    - Simulator details
    - LeanMD simulation on BG/L and BG/C

Slide 9: Performance Prediction on Large Machines
- Problem:
  - How to predict the performance of applications on future machines?
  - How to do performance tuning without continuous access to a large machine?
- Solution:
  - Leverage virtualization
  - Develop a machine emulator
  - Simulator: accurate time modeling
  - Run a program on "100,000 processors" using only hundreds of processors

Slide 10: Blue Gene Emulator: Functional View
(Diagram: each emulated node has an inBuff feeding affinity message queues, non-affinity message queues, and a correction queue, serviced by worker threads and communication threads; the nodes sit on top of the Converse scheduler and Converse queue. A scheduling sketch follows below.)
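The following is a minimal sketch assumed from the diagram labels alone; the queue names, thread roles, and the workerStep routine are ours, not the emulator's API. It shows one plausible reading of the picture: each worker thread prefers work targeted at it (its affinity queue) and falls back to the shared non-affinity queue.

```cpp
// Minimal sketch (assumptions based only on the diagram labels, not on the
// emulator's source): workers drain their own affinity queue first, then the
// shared non-affinity queue, modeling per-thread targeted messages vs. work
// that any worker on the node may execute.
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

struct Message { std::string handler; };

struct EmulatedNode {
    std::vector<std::queue<Message>> affinityQ;  // one queue per worker thread
    std::queue<Message> nonAffinityQ;            // shared by all workers on the node

    // Returns true if worker `w` found and "executed" a message.
    bool workerStep(int w) {
        if (!affinityQ[w].empty()) {
            std::printf("worker %d runs affinity msg %s\n",
                        w, affinityQ[w].front().handler.c_str());
            affinityQ[w].pop();
            return true;
        }
        if (!nonAffinityQ.empty()) {
            std::printf("worker %d runs shared msg %s\n",
                        w, nonAffinityQ.front().handler.c_str());
            nonAffinityQ.pop();
            return true;
        }
        return false;  // idle; a real emulator would return to the scheduler
    }
};

int main() {
    EmulatedNode node;
    node.affinityQ.resize(2);
    node.affinityQ[0].push({"integrateCell"});   // hypothetical handler names
    node.nonAffinityQ.push({"pairCompute"});
    while (node.workerStep(0) || node.workerStep(1)) {}
    return 0;
}
```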

Slide 11: Emulator to Simulator
- Emulator: study the programming model and application development
- Simulator: adds performance-prediction capability
  - Models communication latency based on a network model
  - Does not model on-chip memory access or network contention
- Parallel performance is hard to model
  - Communication subsystem: out-of-order messages, communication/computation overlap
  - Event dependencies
- Approach: Parallel Discrete Event Simulation
  - The emulation program executes in parallel, with event timestamp correction
  - Exploits the inherent determinacy of the application

Slide 12: How to Simulate?
- Timestamping events
  - Per-thread timers (sharing one physical timer)
  - Timestamp messages
- Calculate communication latency based on the network model
- Parallel event simulation
  - When a message is sent out, calculate its predicted arrival time at the destination BlueGene processor
  - When a message is received, update the current time as: currTime = max(currTime, recvTime)
  - Timestamp correction (see the sketch below and the next slide)
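A minimal sketch of the two rules on this slide, using a hypothetical per-hop/per-byte latency model and field names of our own choosing (the simulator's actual network model and data structures are not specified here):

```cpp
// Sender: stamp each message with a predicted arrival time derived from the
// sender's virtual clock plus modeled network latency.
// Receiver: advance the virtual clock to max(currTime, recvTime) so a message
// is never executed "before" it could have arrived.
#include <algorithm>
#include <cstdio>

struct NetworkModel {
    double perHopLatency = 1.0e-7;   // seconds per hop (assumed)
    double perByteCost   = 2.0e-9;   // seconds per byte (assumed)
    double latency(int hops, int bytes) const {
        return hops * perHopLatency + bytes * perByteCost;
    }
};

struct SimThread {
    double currTime = 0.0;           // per-thread virtual clock

    // Sender side: returns the recvTime carried in the message.
    double send(const NetworkModel& net, int hops, int bytes) const {
        return currTime + net.latency(hops, bytes);
    }
    // Receiver side: the rule from the slide, then advance past the handler's work.
    void receive(double recvTime, double executionCost) {
        currTime = std::max(currTime, recvTime);
        currTime += executionCost;
    }
};

int main() {
    NetworkModel net;
    SimThread sender, receiver;
    sender.currTime   = 5.0e-6;
    receiver.currTime = 2.0e-6;
    double recvTime = sender.send(net, /*hops=*/4, /*bytes=*/76);
    receiver.receive(recvTime, /*executionCost=*/1.0e-6);
    std::printf("predicted recvTime = %.3g s, receiver clock now = %.3g s\n",
                recvTime, receiver.currTime);
    return 0;
}
```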

Slide 13: Parallel Correction Algorithm
- Sort message executions by receive time
- Adjust timestamps when needed
- Use correction messages to announce the change in an event's start time
- Send correction messages along the path the original message took
- Events already placed in the timeline may have to move (see the sketch below)
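The sketch below is a simplified, single-timeline illustration of this step (data layout and function names are ours, not the simulator's): re-sort by receive time, recompute start times, and flag the events that moved, which are the ones for which correction messages would be propagated further along the message path.

```cpp
// When an event's receive time changes, rebuild the timeline ordered by
// receive time and report every event whose start time shifted.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Event {
    int    id;
    double recvTime;   // earliest time the message could arrive
    double duration;   // modeled execution cost
    double startTime;  // assigned when the timeline is (re)built
};

// Rebuild start times; return ids of events whose start time changed.
std::vector<int> rebuildTimeline(std::vector<Event>& timeline) {
    std::sort(timeline.begin(), timeline.end(),
              [](const Event& a, const Event& b) { return a.recvTime < b.recvTime; });
    std::vector<int> moved;
    double clock = 0.0;
    for (Event& e : timeline) {
        const double newStart = std::max(clock, e.recvTime);
        if (newStart != e.startTime) moved.push_back(e.id);
        e.startTime = newStart;
        clock = newStart + e.duration;
    }
    return moved;
}

int main() {
    std::vector<Event> timeline = {
        {1, 0.0, 2.0, 0.0}, {2, 1.0, 1.0, 2.0}, {3, 2.5, 1.0, 3.0},
    };
    rebuildTimeline(timeline);                 // initial placement
    timeline[1].recvTime = 4.0;                // a correction arrives: event 2 is later than thought
    for (int id : rebuildTimeline(timeline))   // events that shifted need corrections sent on
        std::printf("event %d moved; send correction downstream\n", id);
    return 0;
}
```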

Slides 14-17: Timestamp Correction (animation)
(Diagrams: an execution timeline of messages M1-M8 ordered by receive time; a correction message for M4 changes its receive time, and the affected events already on the timeline are repositioned.)

Slide 18: Validation
(Graph: predicted time vs. latency factor.)

Slide 19: LeanMD
- LeanMD is a molecular dynamics simulation application written in Charm++
- It is the next generation of NAMD, the Gordon Bell Award winner at SC2002
- It requires a new parallelization strategy: break the problem up in a more fine-grained manner to effectively distribute work across an extremely large number of processors

Slide 20: LeanMD Performance Analysis
(Note left on the slide: need readable graphs; one to a page is fine, but with larger fonts and thicker lines.)
