High Performance Computing and the FLAME Framework
Prof C Greenough, LS Chin and Dr DJ Worth, STFC Rutherford Appleton Laboratory
Prof M Holcombe and Dr S Coakley, Computer Science, Sheffield University

Why High Performance Computing?
- The application cannot be run on a conventional computing system:
  - insufficient memory
  - insufficient compute power
- High Performance Computing (HPC) generally now means:
  - large multi-processor systems
  - complex communications hardware
  - specialised attached processors
  - GRID/Cloud computing

Issues in High Performance Computing
- Parallel systems are in constant development.
- Their hardware architectures are ever changing:
  - simple distributed memory across multiple processors
  - shared memory between multiple processors
  - hybrid systems:
    - clusters of shared-memory multiprocessors
    - clusters of multi-core systems
  - the processors often have a multi-level cache system

Issues in High Performance Computing
- Most have high-speed multi-level communication switches.
- GRID architectures are now being used for very large simulations:
  - many large high-performance systems
  - loosely coupled together over the internet
- Performance can be improved by optimising for a specific architecture.
- Codes can very easily become architecture dependent.

The FLAME Framework
[Diagram slide]

Characteristics of FLAME
- Based on X-machines.
- Agents:
  - have memory
  - have states
  - communicate through messages
- Structure of an application:
  - embedded in XML and C code
  - application generation driven by the state graph
  - agent communication managed by a library
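To make these characteristics concrete, the sketch below shows the shape of an agent at the C level: a block of private memory and a transition function that runs on a state change and may read or post messages. This is an illustrative hand-written sketch, not the code FLAME generates from the XMML model; all names in it are hypothetical.

```c
/* Illustrative only: FLAME generates the real agent code from the XMML
 * model.  The names used here are hypothetical.                        */
typedef struct {
    double x, y;          /* private memory carried between states */
    double resource;
} agent_memory;

/* One transition of the X-machine: executed on a state change, it may
 * read messages posted by other agents, update the agent's private
 * memory, and post new messages for the next iteration.               */
int agent_transition(agent_memory *mem)
{
    /* read incoming messages ...  (via the message board library) */
    mem->resource += 1.0;     /* placeholder for the real model logic  */
    /* post outgoing messages ... (via the message board library)  */
    return 0;                 /* status code returned to the framework */
}
```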

Characteristics of FLAME
- The data load:
  - size of the agents' internal memory
  - the number and size of message boards
- The computational load:
  - work performed in any state change
  - any I/O performed
- The FLAME framework:
  - programme generator (serial/parallel)
  - provides control of states
  - provides the communications network

Initial Parallel Implementation
- Based on:
  - the distribution of agents (computational load)
  - the distribution of message boards (data load)
- Agents only communicate via message boards (MBs).
- Cross-node message information is made available to agents by message board synchronisation.
- Communication between nodes is minimised by:
  - halo regions
  - message filtering

Geometric Partitioning
[Figure: the domain is divided geometrically across processors P1 to P12; agents near a partition boundary lie within halo regions of a given radius.]
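A minimal sketch of the idea behind the geometric partitioning above: agents are assigned to processors according to their position, so most communication partners end up on the same node and only agents within the halo radius of a boundary need their messages exported. The uniform grid decomposition below is an assumption for illustration only, not FLAME's own partitioner.

```c
/* Illustrative uniform 2D block decomposition: map an agent's (x, y)
 * position to one of px * py processors.  A sketch of the idea, not
 * FLAME's own partitioning code.                                      */
int owning_processor(double x, double y,
                     double xmin, double xmax,
                     double ymin, double ymax,
                     int px, int py)
{
    int ix = (int)((x - xmin) / (xmax - xmin) * px);
    int iy = (int)((y - ymin) / (ymax - ymin) * py);

    if (ix >= px) ix = px - 1;   /* clamp agents sitting on the upper edge */
    if (iy >= py) iy = py - 1;

    return iy * px + ix;         /* processor rank in row-major order */
}

/* An agent lies in the halo of a neighbouring partition if it is within
 * the halo radius of that partition's boundary; its messages must then
 * be sent to that neighbour during message board synchronisation.       */
```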

Parallelism in FLAME
[Diagram slide]

Issues with HPC and FLAME
- Parallelism is hidden in the XML model and the C code:
  - in terms of agent locality or groupings
- Communications are captured in the XML:
  - in agent function descriptions
  - in message descriptions
- The states are the computational load:
  - their weight is not known until run time
  - they could be fine- or coarse-grained
- The initial distribution is based on a static analysis.
- The final distribution may need to be based on dynamic behaviour.

Parallelism in FLAME
- Agents are grouped on parallel nodes.
- Messages are synchronised.
- The message board library allows both serial and parallel versions to work.
- Implementation details are hidden from modellers.
- The system automatically manages the simulation.

Message Boards
- Decoupled from the FLAME framework.
- Well-defined Application Program Interface (API).
- Includes functions for creating, deleting, managing and accessing information on the message boards.
- Details such as internal data representations, memory management and communication strategies are hidden.
- Uses multi-threading for work and communications.

FLAME & the Message Boards
[Diagram slide]

Message Board API
- MB management:
  - create, delete, add message, clear board
- Access to message information (iterators):
  - plain, filtered, sorted, randomised
- MB synchronisation:
  - moving information between nodes
  - full data replication - very expensive
  - filtered information using tagging
  - overlapped with computation

The MB Environment
- Message board management:
  - MB_Env_Init - initialises the MB environment
  - MB_Env_Finalise - finalises the MB environment
  - MB_Create - creates a new message board object
  - MB_AddMessage - adds a message to a message board
  - MB_Clear - clears a message board
  - MB_Delete - deletes a message board
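A minimal sketch of one board's life cycle using the management calls listed above. The message struct is made up for illustration, and the argument types follow the usual libmboard conventions (integer return codes, a board handle plus the message size on creation) as an assumption; the libmboard documentation is the authoritative reference.

```c
#include "mboard.h"   /* libmboard header (name assumed) */

typedef struct { double x, y; int sender_id; } location_msg;  /* illustrative */

/* Sketch of one board's life cycle; return codes are ignored for brevity. */
int board_lifecycle_demo(void)
{
    MBt_Board board;                       /* board handle type from libmboard */
    location_msg msg = { 1.0, 2.0, 42 };

    MB_Env_Init();                             /* initialise the MB environment */
    MB_Create(&board, sizeof(location_msg));   /* one board per message type    */
    MB_AddMessage(board, &msg);                /* post a message to the board   */

    /* ... agents read the board through iterators here ... */

    MB_Clear(board);                           /* empty the board between iterations */
    MB_Delete(&board);                         /* destroy the board                  */
    MB_Env_Finalise();                         /* shut down the MB environment       */
    return 0;
}
```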

The Message Board API (2)
- Message selection and reading - iterators:
  - MB_Iterator_Create - creates an iterator
  - MB_Iterator_CreateSorted - creates a sorted iterator
  - MB_Iterator_CreateFiltered - creates a filtered iterator
  - MB_Iterator_Delete - deletes an iterator
  - MB_Iterator_Rewind - rewinds an iterator
  - MB_Iterator_Randomise - randomises an iterator
  - MB_Iterator_GetMessage - returns the next message
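A sketch of reading messages through an iterator, in the spirit of the calls above. The exact GetMessage contract (a returned copy that the caller frees, NULL at the end of the list, MB_SUCCESS as the success return code) is an assumption; consult the libmboard documentation for the precise behaviour.

```c
#include <stdlib.h>
#include "mboard.h"   /* libmboard header (name assumed) */

typedef struct { double x, y; int sender_id; } location_msg;  /* illustrative */

/* Sketch: traverse all messages currently visible on a board. */
void read_all_locations(MBt_Board board)
{
    MBt_Iterator iter;
    location_msg *msg = NULL;

    MB_Iterator_Create(board, &iter);     /* snapshot of the local board content */

    while (MB_Iterator_GetMessage(iter, (void **)&msg) == MB_SUCCESS && msg != NULL) {
        /* ... use msg->x, msg->y, msg->sender_id here ... */
        free(msg);                        /* assumed: caller releases each returned copy */
    }

    MB_Iterator_Delete(&iter);
}
```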

The Message Board API (3)
- Message synchronisation: synchronisation of boards involves the propagation of message data out across the processing nodes, as required by the agents on each node.
  - MB_SyncStart - synchronises a message board
  - MB_SyncTest - tests for synchronisation completion
  - MB_SyncComplete - completes the synchronisation
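A sketch of the intended usage pattern for these calls: start the synchronisation, carry on with work that does not need remote messages, and only then complete it, so communication is overlapped with computation. The structure follows the call names above; the helper function is hypothetical and the details are assumptions.

```c
#include "mboard.h"   /* libmboard header (name assumed) */

/* Hypothetical helper standing in for model work that does not need
 * messages from other nodes.                                         */
static void do_local_work(void) { /* ... model-specific computation ... */ }

/* Sketch: overlap message board synchronisation with local computation. */
void finish_iteration(MBt_Board board)
{
    MB_SyncStart(board);        /* begin propagating messages between nodes */

    do_local_work();            /* carry on with work while messages travel */

    /* A non-blocking check is also possible:
     *   int done = 0;
     *   MB_SyncTest(board, &done);
     */

    MB_SyncComplete(board);     /* block until the board is fully synchronised */
}
```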

The Message Board API (4)
- MB synchronisation:
  - The simplest form is full replication of message data - very expensive in communication and memory.
  - The MB library uses message tagging to reduce the volume of data being transferred and stored.
  - Tagging uses message FILTERs to select the message information to be transferred.
  - FILTERs are specified in the model file (XMML).

The Message Board API (5)
- Selection is based on filters.
- Filters are defined in the XMML.
- Filters can be used:
  - in creating iterators, to reduce the local message list
  - during synchronisation, to minimise cross-node communications
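As an illustration of how a filter might look at the library level, the sketch below passes a predicate to MB_Iterator_CreateFiltered so that only nearby messages are returned. The predicate signature and the extra parameter argument are assumptions about the libmboard interface (in a FLAME model the filter would normally be declared in the XMML, as noted above); treat this as a sketch rather than the definitive API.

```c
#include <math.h>
#include "mboard.h"   /* libmboard header (name assumed) */

typedef struct { double x, y; } location_msg;      /* illustrative message type */
typedef struct { double x, y, range; } filter_params;

/* Predicate: accept only messages within 'range' of the querying agent.
 * The (message, params) signature is an assumption about the library.   */
static int within_range(const void *m, const void *p)
{
    const location_msg  *msg = (const location_msg *)m;
    const filter_params *fp  = (const filter_params *)p;
    double dx = msg->x - fp->x, dy = msg->y - fp->y;
    return sqrt(dx * dx + dy * dy) <= fp->range;
}

void read_nearby(MBt_Board board, double x, double y, double range)
{
    MBt_Iterator iter;
    filter_params fp = { x, y, range };

    /* Assumed argument order: board, iterator, predicate, predicate parameters. */
    MB_Iterator_CreateFiltered(board, &iter, within_range, &fp);
    /* ... MB_Iterator_GetMessage loop as in the earlier sketch ... */
    MB_Iterator_Delete(&iter);
}
```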

MB Iterators (1)
- Iterators are objects used for traversing message board content. They give users access to messages while isolating them from the internal data representation of the boards.
- Creating an iterator generates a list of the messages available within the board against a specific criterion. This is a snapshot of the content of a local board.

MB Iterators (2)
[Diagram slide]

Porting to Parallel Platforms
- FLAME has been successfully ported to various HPC systems:
  - SCARF - 360 x 2.2 GHz AMD Opteron cores, 1.3 TB total memory
  - HAPU - 128 x 2.4 GHz Opteron cores, 2 GB memory per core
  - NW-Grid - 384 x 2.4 GHz Opteron cores, 2 or 4 GB memory per core
  - HPCx - 2560 x 1.5 GHz Power5 cores, 2 GB memory per core
  - Legion (Blue Gene/P) - 1026 x PowerPC 850 MHz, 4096 cores
  - Leviathan (UNIBI) - 3 x Intel Xeon E5355 (quad core), 24 cores

Test Models
- Circles Model (see the sketch after this slide):
  - very simple agents; all have position data
  - x, y, fx, fy, radius in memory
  - repulsion from neighbours
  - 1 message type
  - domain decomposition
- Model:
  - mix of agents: Malls, Firms, People
  - a mixture of state complexities
  - all have position data
  - agents have a range of influence
  - 9 message types
  - domain decomposition
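A sketch of what the Circles model's data might look like in C, based on the memory variables and the single message type listed above (position, force and radius). The struct layout and the sender identifier field are illustrative, not the model's actual source.

```c
/* Illustrative layout of the Circles model data described above. */
typedef struct {
    double x, y;        /* position                     */
    double fx, fy;      /* accumulated repulsion force  */
    double radius;      /* interaction radius           */
} circle_agent_memory;

/* The model's single message type: each agent posts its position, and
 * neighbours within 'radius' compute a repulsive force in response.   */
typedef struct {
    int    id;          /* hypothetical sender identifier */
    double x, y;
    double radius;
} location_message;
```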

Circles Model
[Diagram slide]

Model
[Diagram slide]

Bielefeld Model
[Diagram slide]

Dynamic Load Balancing
- Work has only just started.
- The goal is to move agents between compute nodes to:
  - reduce the overall elapsed time
  - increase parallel efficiency
- There is an interaction between computational efficiency and overall elapsed time.
- The requirements of communications and load may conflict!

Balance - Load vs. Communication
- Distribution 1:
  - P1: 13 agents
  - P2: 3 agents
  - P2 to P1: 1 communication channel
- Distribution 2:
  - P1: 9 agents
  - P2: 7 agents
  - P1 to P2: 6 communication channels
[Figure: the two distributions of agents across P1 and P2, with frequent communication within a processor and occasional communication between them.]

Moving Wrong Agents
- Moving the wrong agents could increase the elapsed time.

HPC Issues in CLIMACE
- Size of the agent population.
- Granularity of the agents:
  - is there a large computational load?
  - how often do they communicate?
- Inherent parallelism (locality) in the model:
  - are the agents in groups?
  - do they have short-range communication?
- Size of the initial data.
- Size of the outputs.

HPC Challenges for ABM
- Effective initial static distributions.
- Effective dynamic agent migration algorithms.
- Sophisticated communication strategies:
  - to reduce the number of communications
  - to reduce synchronisations
  - to reduce communication volumes
  - pre-tagging information to allow pre-fetching
- Overlapping of computation with communications.
- Efficient use of multi-core nodes on large systems.
- Efficient use of attached processors.