Scaling Agent-based Simulation of Contagion Diffusion over Dynamic Networks on Petascale Machines
Keith Bisset, Jae-Seung Yeom, Ashwin Aji
Network Dynamics and Simulation Science Lab, Virginia Bioinformatics Institute, Virginia Tech
ndssl.vbi.vt.edu

Contagion Diffusion
The problem we are trying to solve:
– Contagion propagation across large interaction networks
– ~300 million nodes; ~1.5 billion / ~70 billion edges (person-location / person-person)
Examples:
– Infectious disease
– Norms and fads (e.g., smoking, obesity)
– Digital viruses (e.g., computer viruses, cell phone worms)
– Human immune system modeling

EpiSimdemics
EpiSimdemics is an individual-based modeling environment:
– Each individual is represented, based on a synthetic population of the US
– Each interaction between two co-located individuals is represented
Uses a people-location bipartite graph as the underlying network.
Planned: add a people-people graph for direct interactions.
Features:
– Time-dependent and location-dependent interactions
– A scripting language to specify complex interventions
– PTTS (probabilistic timed transition system) representation of disease and behavior
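To make the PTTS idea concrete, here is a minimal sketch of a timed, probabilistic disease automaton. The states, dwell times, and probabilities are invented for illustration; this is not the EpiSimdemics disease model.

    // Minimal sketch of a PTTS-style (probabilistic timed transition system)
    // disease automaton. States, dwell times, and probabilities are assumed.
    #include <cstdio>
    #include <random>
    #include <vector>

    enum class State { Susceptible, Latent, Infectious, Recovered };

    // One PTTS edge: move to `next` with probability `prob` after `dwell`
    // timesteps (days) spent in the current state.
    struct Transition { State next; double prob; int dwell; };

    std::vector<Transition> transitionsFor(State s) {
      switch (s) {
        case State::Latent:     return {{State::Infectious, 1.0, 2}};
        case State::Infectious: return {{State::Recovered,  1.0, 4}};
        default:                return {};  // Susceptible leaves only via exposure
      }
    }

    // Advance one individual by one day, firing a transition once the dwell
    // time in the current state has elapsed.
    State step(State s, int& daysInState, std::mt19937& rng) {
      ++daysInState;
      std::uniform_real_distribution<double> u(0.0, 1.0);
      for (const Transition& t : transitionsFor(s))
        if (daysInState >= t.dwell && u(rng) < t.prob) {
          daysInState = 0;
          return t.next;
        }
      return s;
    }

    int main() {
      std::mt19937 rng(42);
      State s = State::Latent;  // a newly exposed individual
      int days = 0;
      for (int day = 0; day < 8; ++day) {
        s = step(s, days, rng);
        std::printf("day %d -> state %d\n", day, static_cast<int>(s));
      }
    }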

Example Person-Person Graph
(Image courtesy SDSC)

EpiSimdemics Algorithm
For each timestep (e.g., a day):
– In parallel, each person:
  – Determines where they will go
  – Sends a message to each location they will visit
– In parallel, each location:
  – Converts each message into an event pair
  – Calculates the probability of infection between each co-located pair of infectious and susceptible people
  – Sends a message to each newly infected person
– In parallel, each infected person updates its state
– Global simulation state is updated
A sequential sketch of one timestep follows.
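The sketch below captures the three phases of a timestep in plain C++. The real code runs each phase in parallel across Charm++ chares; the Person/Visit types, the itinerary stub, and the infection probability are assumptions.

    // Sequential sketch of one EpiSimdemics timestep (illustrative only).
    #include <random>
    #include <unordered_map>
    #include <vector>

    struct Visit { int personId; bool infectious, susceptible; };

    struct Person {
      int id = 0;
      bool infectious = false, susceptible = true;
      // Stub; the real model draws visits from the synthetic population.
      std::vector<int> itinerary(int day) const { return {(id + day) % 100}; }
    };

    void timestep(std::vector<Person>& people, int day, std::mt19937& rng) {
      // Phase 1: each person "sends" a visit message to each location.
      std::unordered_map<int, std::vector<Visit>> inbox;  // location -> visits
      for (const Person& p : people)
        for (int loc : p.itinerary(day))
          inbox[loc].push_back({p.id, p.infectious, p.susceptible});

      // Phase 2: each location pairs co-located infectious and susceptible
      // visitors and draws infections with an assumed per-contact probability.
      std::uniform_real_distribution<double> u(0.0, 1.0);
      const double pInfect = 0.01;  // assumed value
      std::vector<int> newlyInfected;
      for (auto& kv : inbox)
        for (const Visit& a : kv.second)
          for (const Visit& b : kv.second)
            if (a.infectious && b.susceptible &&
                a.personId != b.personId && u(rng) < pInfect)
              newlyInfected.push_back(b.personId);

      // Phase 3: each newly infected person updates its state (simplified).
      for (int id : newlyInfected) {  // assumes people[i].id == i
        people[id].susceptible = false;
        people[id].infectious  = true;
      }
    }

    int main() {
      std::mt19937 rng(1);
      std::vector<Person> people(1000);
      for (int i = 0; i < 1000; ++i) people[i].id = i;
      people[0].infectious = true;
      people[0].susceptible = false;  // seed one infection
      for (int day = 0; day < 5; ++day) timestep(people, day, rng);
    }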

Charm++ Implementation
(Diagram: Person Manager chares (PM 1 … PM n) hold person data and Location Manager chares (LM 1 … LM n) hold location and visit data, distributed across PEs under main. The processing steps of an iteration flow through sendInteractors(), computeInfection(), done(), and endOfDay(), synchronized by Charm++'s completion detection (CD).)
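To make the diagram concrete, here is a hedged sketch of what the location-side chare might look like. The entry-method names follow the diagram; the signatures, the interface file, and the logic are assumptions, not the actual EpiSimdemics source, and it builds only under charmc with the matching .ci file.

    // Hedged Charm++ sketch of the LocationManager chare from the diagram.
    //
    //   // episim.ci (assumed)
    //   mainchare Main {
    //     entry Main(CkArgMsg*);
    //     entry [reductiontarget] void endOfDay(int);
    //   };
    //   readonly CProxy_Main mainProxy;
    //   array [1D] LocationManager {
    //     entry LocationManager();
    //     entry void addVisit(int, bool, bool);
    //     entry void computeInfection();
    //   };

    #include "episim.decl.h"  // generated by charmc from the assumed episim.ci
    #include <vector>

    extern /* readonly */ CProxy_Main mainProxy;

    struct Visit { int personId; bool infectious, susceptible; };

    class LocationManager : public CBase_LocationManager {
      std::vector<Visit> visits;  // visit messages received this timestep
     public:
      LocationManager() {}

      // One call per visit message sent by a PersonManager chare.
      void addVisit(int personId, bool infectious, bool susceptible) {
        visits.push_back({personId, infectious, susceptible});
      }

      // Called once all visits have arrived (the "Sync. by Charm's CD" step).
      void computeInfection() {
        int newInfections = 0;
        for (const Visit& a : visits)
          for (const Visit& b : visits)
            if (a.infectious && b.susceptible)
              ++newInfections;  // real code: message the infected person
        visits.clear();
        // Reduce to Main, which ends the day and starts the next iteration.
        contribute(sizeof(int), &newInfections, CkReduction::sum_int,
                   CkCallback(CkReductionTarget(Main, endOfDay), mainProxy));
      }
    };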

Data Organization
P-L graph: explicit, defines communication.
P-P graph: implicit, defines computation; 50x more edges.
Both graphs evolve over time.
US population:
– 270 million people, 70 million locations
– 1.5 billion edges in the P-L graph
– ~75 billion edges in the P-P graph (potential interactions per step)

Complex, Layered Interventions

Intervention      Population                          Compliance         When
Vaccination       Adults / Children / Crit. Workers   25% / 60% / 100%   Day 0
School Closure    Children                            60%                1.0% of children diagnosed (by county)
  / Reopen
Quarantine        Crit. Workers                       100%               1.0% of adults diagnosed
Self Isolate      All                                 20%                2.5% of adults diagnosed

    # stay home when symptomatic
    intervention symptomatic
      set num_symptomatic++
      apply diagnose with prob=0.60
      schedule stayhome 3
    trigger disease.symptom >= 2
      apply symptomatic

    # vaccinate 25% of adults
    intervention vaccinate_adult
      treat vaccine
      set num_vac_adult++
    trigger person.age > 18
      apply vaccinate_adult with prob=0.25

Effects of Interventions

Blue Waters Setup
– Charm++ SMP mode, Gemini network layer
– 4 processes per node
– 3 compute threads + 1 comm thread per process
– Application-based message coalescing

Weak Scaling

Location Granularity
– Location load depends on the number of visits
– Location size follows a power law
– Not apparent until running at scale
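A small, self-contained illustration (with an assumed Pareto distribution and exponent) of why the heavy tail only shows up at scale: the largest location is negligible when each PE holds many locations, but can exceed the ideal per-PE load once work is split across many PEs.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <random>

    int main() {
      std::mt19937 rng(7);
      std::uniform_real_distribution<double> u(0.0, 1.0);
      const double alpha = 2.0;      // assumed power-law exponent
      const long n = 1'000'000;      // number of locations
      double total = 0, biggest = 0;
      for (long i = 0; i < n; ++i) {
        // Inverse-CDF sample of a Pareto(alpha) visit count.
        double visits = std::pow(1.0 - u(rng), -1.0 / alpha);
        total += visits;
        biggest = std::max(biggest, visits);
      }
      for (int pes : {16, 256, 4096, 65536}) {
        double idealPerPe = total / pes;
        std::printf("PEs=%6d  largest location = %.2fx the ideal per-PE load\n",
                    pes, biggest / idealPerPe);
      }
    }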

Scaling for US Population

Static Partitioning
Round Robin:
– Random distribution
– Low overhead
– Works well for small geographic areas (metro area)
Graph Partitioner:
– METIS-based partitioning
– Multi-constraint (two phases separated by a sync)
– Higher overhead
– Helps as the geographic area increases (state, national)
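For reference, a hedged sketch of a multi-constraint METIS call in the spirit of the slide: two weights per vertex, one per simulation phase, so both phases stay balanced within the same partition. The tiny ring graph and the weights are made up; only the METIS_PartGraphKway API is real.

    #include <metis.h>
    #include <cstdio>
    #include <vector>

    int main() {
      // 4-vertex ring graph in CSR form (illustrative data).
      std::vector<idx_t> xadj   = {0, 2, 4, 6, 8};
      std::vector<idx_t> adjncy = {1, 3, 0, 2, 1, 3, 0, 2};

      idx_t nvtxs = 4;
      idx_t ncon  = 2;  // two balance constraints, e.g., per-phase loads
      // ncon weights per vertex: {phase-1 load, phase-2 load}.
      std::vector<idx_t> vwgt = {5, 1,  1, 5,  5, 1,  1, 5};

      idx_t nparts = 2, objval = 0;
      std::vector<idx_t> part(nvtxs);
      int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                   vwgt.data(), /*vsize=*/nullptr,
                                   /*adjwgt=*/nullptr, &nparts,
                                   /*tpwgts=*/nullptr, /*ubvec=*/nullptr,
                                   /*options=*/nullptr, &objval, part.data());
      if (rc != METIS_OK) return 1;
      for (idx_t v = 0; v < nvtxs; ++v)
        std::printf("vertex %d -> part %d\n", (int)v, (int)part[v]);
      std::printf("edge cut = %d\n", (int)objval);
    }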

Static Partitioning: Results
– SendInteractor(): person-side computation that generates visit messages
– AddVisitMessage(): location-side handling of received messages
– ComputeInfections(): location-side computation of interactions among visitors

Message Volume: Round Robin vs. Graph Partitioner
(256 nodes, 10 million people)

Graph Sparsification
Goal: improve the runtime of graph partitioning.
Procedure:
– Randomly remove edges from high-degree nodes
– Partition the sparsified graph
– Use the full graph for execution
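A minimal sketch of the sparsification step under assumed parameters (degree cap and keep probability): randomly thin only the edges that touch high-degree vertices, partition the thinned graph, and still run the simulation on the full graph.

    #include <cstdio>
    #include <random>
    #include <utility>
    #include <vector>

    using Edge = std::pair<int, int>;

    std::vector<Edge> sparsify(const std::vector<Edge>& edges, int nvtxs,
                               int degreeCap, double keepProb,
                               std::mt19937& rng) {
      std::vector<int> degree(nvtxs, 0);
      for (const Edge& e : edges) { ++degree[e.first]; ++degree[e.second]; }

      std::bernoulli_distribution keep(keepProb);
      std::vector<Edge> kept;
      for (const Edge& e : edges) {
        bool high = degree[e.first] > degreeCap || degree[e.second] > degreeCap;
        if (!high || keep(rng))  // drop edges only at high-degree endpoints
          kept.push_back(e);
      }
      return kept;  // partition this graph; execute on `edges`
    }

    int main() {
      std::mt19937 rng(3);
      std::vector<Edge> edges = {{0,1}, {0,2}, {0,3}, {0,4}, {1,2}, {3,4}};
      auto sparse = sparsify(edges, 5, /*degreeCap=*/3, /*keepProb=*/0.5, rng);
      std::printf("kept %zu of %zu edges\n", sparse.size(), edges.size());
    }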

Impact of GPU Acceleration on Execution Profile
(Chart annotations: 70.9%, 7.7x)
Assumes 1 CPU core per GPU device; in practice, CPU cores outnumber GPUs.

GPU-CharmSimdemics Scenarios
Scenario 1:
– All chares from all CPU processes offload simultaneously to the GPU
– GPUs (Kepler) maintain task queues from the different processes
– Inefficient: CPUs sit idle waiting for GPU execution to complete
Scenario 2:
– Chares from only select CPU processes offload to the GPU
– A 1:1 ratio can be maintained between "GPU" processes and GPUs
– But "GPU" chares finish sooner than "CPU" chares, i.e., load imbalance
– Use Charm++ load balancing methods to rebalance chares (see the sketch below)
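A hedged sketch of the rebalancing hook in Scenario 2. The usesAtSync / AtSync() / ResumeFromSync() / pup() calls are the standard Charm++ load balancing API; the chare name EpiChare, its entry methods, and the interface file are hypothetical.

    #include "epichare.decl.h"  // generated from the hypothetical .ci file

    class EpiChare : public CBase_EpiChare {
     public:
      EpiChare() { usesAtSync = true; }  // opt in to measurement-based LB
      EpiChare(CkMigrateMessage*) {}

      void endOfDay() {
        // Hand control to the LB framework; the chare may migrate here and
        // resumes in ResumeFromSync() once rebalancing is done.
        AtSync();
      }
      void ResumeFromSync() override {
        thisProxy[thisIndex].startNextDay();  // hypothetical entry method
      }
      void startNextDay() { /* next simulation iteration */ }

      void pup(PUP::er& p) {  // serialize state so the chare can migrate
        CBase_EpiChare::pup(p);
      }
    };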

Future Work
– Dynamic load balancing with semantic information:
  – Prediction model based on past runs
  – Information from simulation state variables
  – Use dynamic interventions (more variable load)
– Try the Charm++ meta load balancer
– Further improvements to initial partitioning: minimize message imbalance as well as edge cut
– Message reduction
– Sequential replicates to amortize data load time
– Scale to the global population: 10 billion people

Acknowledgements
NSF HSD Grant SES , NSF PetaApps Grant OCI , NSF NETS Grant CNS , NSF CAREER Grant CNS , NSF NetSE Grant CNS , NSF SDCI Grant OCI , DTRA R&D Grant HDTRA , DTRA Grant HDTRA , DTRA CNIMS Contract HDTRA1-11-D , DOE Grant DE-SC , PSU/DOE 4345-VT-DOE-4261, US Naval Surface Warfare Center Grant N D-3017 DEL ORDER 13, NIH MIDAS Grant 2U01GM , NIH MIDAS Grant 3U01FM S1, LLNL Fellowship SubB, DOI Contract D12PC00337
UIUC Parallel Programming Lab
NDSSL Faculty, Staff and Students