MIDAS: Molecular Dynamics Analysis Tutorial, February 2017


MIDAS: Molecular Dynamics Analysis Tutorial, February 2017
Software: MIDAS HPC-ABDS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Ioannis Paraskevakos, Andre Luckow, Oliver Beckstein, Shantenu Jha

Material:
https://becksteinlab.github.io/SPIDAL-MDAnalysis-Midas-tutorial/
https://github.com/radical-cybertools/MIDAS-tutorial
https://github.com/Becksteinlab/SPIDAL-MDAnalysis-Midas-tutorial

1. Introduction

The Convergence of HPC and "Data-Intensive" Computing I

MIDAS addresses the integration of HPC with Big Data at multiple levels:
- Applications
- Micro-architectural ("near-data computing" processors)
- Macro-architectural (e.g., file systems)
- Software environment (e.g., analytical libraries)

Objective: bring ABDS capabilities to HPDC.
- HPC: simple functionality, complex stack, high performance
- ABDS: advanced functionality

The Convergence of HPC and "Data-Intensive" Computing II

See the paper "A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures", http://arxiv.org/abs/1403.1528

RADICAL-Cybertools: An Abstraction-Based Approach to HPC

RADICAL-Cybertools provide:
- Flexible execution and resource management: "static" versus "dynamic" resource execution, with the Pilot-Abstraction for higher-level resource management
- Management of heterogeneity: middleware variants (syntax), infrastructure utilization (semantics), and (some) architectural features
- Building blocks upon which other components can be built

Application/domain-specific toolkits: Ensemble Toolkit, Data-Intensive Application Libraries (SPIDAL)

Infrastructural abstractions:
- Pilot-Abstraction: general, scalable HPDC resource management
- SAGA: standardized API
- Execution Strategy: decouples workload from execution management
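
To make the "standardized API" role of SAGA concrete, here is a minimal sketch using the saga-python package of this era (the SSH host is a placeholder; this example is not from the tutorial itself):

    import saga

    # Job service on a (placeholder) remote host, reached over SSH;
    # the same code works against other backends by changing the URL.
    js = saga.job.Service("ssh://remote.host.example")

    # Describe a simple task in a middleware-independent way
    jd = saga.job.Description()
    jd.executable = "/bin/date"

    job = js.create_job(jd)
    job.run()
    job.wait()
    print("Job finished in state: %s" % job.state)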

MIDAS: Pilot-based Middleware for Data-intensive Analysis and Science I

- When an application is integrated deeply with its infrastructure, performance is great, but extensibility and flexibility suffer.
- With multiple levels of functionality, indirection, and abstraction, good performance is often difficult to achieve.
- Challenge: how to find the "sweet spot", a "neck of the hourglass" that serves multiple applications and infrastructures?

MIDAS: Pilot-based Middleware for Data-intensive Analysis and Science II

MIDAS is the middleware that supports the analytical libraries by providing:
- Resource management: Pilot-Hadoop for managing ABDS frameworks on HPC
- Coordination and communication: Pilot-Spark for supporting iterative analytical algorithms
- Heterogeneity handling at the infrastructure level: file and storage abstractions
- Flexible, multi-level compute-data coupling

MIDAS must have a well-defined API and semantics that can then be used by applications and the SPIDAL library layer.

MIDAS: Pilot-based Middleware for Data-intensive Analysis and Science III

The Pilot-Abstraction provides a well-defined resource management layer for MIDAS:
- Application-level scheduling is well suited to the fine-grained data parallelism of data-intensive applications.
- Data-intensive applications are more heterogeneous and thus more demanding with respect to their resource management needs.
- Application-level scheduling enables the implementation of a data-aware resource manager for analytics applications.
- MIDAS acts as an interoperability layer between Hadoop/ABDS and HPC.

Why Hadoop/Spark on HPC?

- Data parallelism, SQL, and machine-learning capabilities often have to be implemented manually on HPC.
- Code is hidden behind executables (written in C/C++), which makes it difficult to reuse and to build complex data pipelines and applications.
- Powerful frameworks such as Spark provide implicit parallelism, reusability, and flexibility; however, they are difficult to use on HPC.
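
To make "implicit parallelism" concrete, here is a minimal PySpark sketch of a word count (the file name and app name are placeholders, not part of the tutorial). The map and reduce steps are distributed across the cluster without any explicit process or communication management by the user:

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    # Spark partitions the input and parallelizes each step implicitly
    lines  = sc.textFile("input.txt")               # placeholder input
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, n in counts.collect():
        print("%s %d" % (word, n))

    sc.stop()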

HPC and Hadoop Ecosystem

HPC and ABDS Interoperability

https://github.com/radical-cybertools/MIDAS-tutorial/wiki

2. Pilot-Abstraction

Introduction to the Pilot-Abstraction

Working definition: a system that generalizes a placeholder job to provide multi-level scheduling, allowing application-level control over the system scheduler via a scheduling overlay.

[Diagram: in user space, a user application submits tasks to a Pilot-Job system governed by policies; in system space, the resource manager places Pilot-Jobs onto resources A, B, C, and D.]

The Pilot-Jobs acquire a set of resources. As soon as those Pilot-Jobs are active, the user application can start submitting tasks. Submission can be based either on which pilot has enough free resources to satisfy a task's requirements, or on a more complicated, user-defined policy.

Introduction to the Pilot-Abstraction: Advantages

Advantages of the Pilot-Abstraction ("the perfect pilot"):
- Decouples workload from resource management
- Flexible resource management: enables fine-grained (i.e., "slicing and dicing") management of resources
- Tighter temporal control and the other advantages of application-level scheduling (avoiding the limitations of system-level-only scheduling)
- Lets higher-level frameworks be built without explicit resource management

Pilot-Jobs emerged as application-level resource managers because of limitations in the middleware and infrastructure. Different middleware has different limitations, so pilot-job systems differ both semantically and functionally. Pilots tell you what to do (N tasks on M resources), but not why or how.

RADICAL-Pilot: Architecture

A general overview of the RADICAL-Pilot architecture:
- Resource level: the Pilot Description describes the resource requirements (resource, nodes, memory, wall time) that the pilot will use.
- Application level: the Unit Description describes the compute task (executable, cores, environment, etc.) that the pilot will execute.
- RADICAL-Pilot client:
  - Pilot Manager and Pilot Launcher: the Pilot Manager handles communication between the application and the pilots; the Pilot Launcher launches each pilot onto the resource defined in its pilot description.
  - Unit Manager and Scheduler: the Unit Manager uses its scheduler to schedule units across the various pilots.
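
A minimal sketch of these pieces through the RADICAL-Pilot Python API of this era (the resource label, core count, and runtime are placeholders; see the tutorial notebook for the exact setup used there):

    import radical.pilot as rp

    session = rp.Session()

    # Pilot Description: the resources the pilot will acquire
    pmgr  = rp.PilotManager(session=session)
    pdesc = rp.ComputePilotDescription()
    pdesc.resource = "xsede.stampede"   # placeholder resource label
    pdesc.cores    = 16
    pdesc.runtime  = 30                 # minutes

    pilot = pmgr.submit_pilots(pdesc)

    # Unit Description: the compute task to run on the pilot
    umgr = rp.UnitManager(session=session)
    umgr.add_pilots(pilot)

    cud = rp.ComputeUnitDescription()
    cud.executable = "/bin/date"
    cud.cores      = 1

    umgr.submit_units([cud])
    umgr.wait_units()

    session.close()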

RADICAL-Pilot: Architecture (Agent)

A more detailed picture of the architecture, showing the internal modules of RADICAL-Pilot's client and agent.

- LRMS queue: the queue of the resource to which the pilot is submitted, e.g. the SLURM queue on Stampede.
- Within the RP pilot (agent):
  - Task Spawner: responsible for starting the execution of a task. The launch method determines how a task is executed, e.g. mpirun, ssh, or spark-submit.
  - LRMS*: discovers the resources the pilot has acquired (for example, which exact nodes) and informs the Scheduler*.
  - Scheduler*: knows how many resources are free and how many are in use, and decides where each task will be executed.

The asterisk (*) distinguishes the agent's resource manager and scheduler from the resource manager and scheduler of the HPC machine itself: the RP agent's resource manager discovers the resources the agent has acquired (e.g., node names, number of cores, memory), and its scheduler places units onto those nodes.
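
How a unit ends up launched via mpirun rather than ssh is driven by the unit description. A hedged sketch using the `mpi` flag that RADICAL-Pilot of this era exposed (the binary name and core count are placeholders):

    import radical.pilot as rp

    # Marking a unit as MPI steers the agent's Task Spawner toward an
    # MPI launch method (e.g. mpirun) instead of a plain ssh/fork launch.
    cud = rp.ComputeUnitDescription()
    cud.executable = "./md_analysis"   # hypothetical MPI binary
    cud.cores      = 32
    cud.mpi        = True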

RADICAL-Pilot: State Models

- Well-defined state models (for pilots and units)
- Scalability (up and out) and portability (often contradictory design objectives!)
- Supports research while supporting production scalable science
- Pluggable schedulers; a high degree of introspection and provenance
- A modular pilot agent adaptable to different architectures

RADICAL-Pilot is a Python implementation of the Pilot-Abstraction. It has well-defined models for the states of pilots and of units (compute tasks), and it allows a user application to scale both up and out. Its implementation can be used on virtually any system, since its only dependency is Python, and its architecture lets it adapt to new resources by simply adding a resource configuration file.

Tutorial: https://github.com/radical-cybertools/MIDAS-tutorial/blob/master/pilot/Radical_Pilot.ipynb
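
The state models are exposed through the API, so an application can observe every transition. A minimal sketch using RADICAL-Pilot's callback registration (the exact callback signature may vary slightly across versions):

    import radical.pilot as rp

    def unit_state_cb(unit, state):
        # Invoked on every unit state transition; this is the hook that
        # enables the introspection and provenance mentioned above.
        print("unit %s is now in state %s" % (unit.uid, state))

    session = rp.Session()
    umgr = rp.UnitManager(session=session)
    umgr.register_callback(unit_state_cb)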

Pilot-Hadoop

- Pilot-Hadoop can manage a Hadoop cluster on an HPC resource.
- The pilot's agent is responsible for managing the Hadoop resources.
- Units are submitted to Hadoop's resource manager, YARN.

The pilot is responsible for starting a Hadoop or Spark cluster on the resources that have been acquired. Specifically, the pilot downloads Hadoop/Spark, sets up the cluster, starts it, and submits applications to it according to the resources it has available at any given time. During unit submission, the pilot checks the resource requirements of the unit (e.g., cores, memory) and, if enough resources are free, submits the application to Hadoop or Spark through a yarn or spark-submit command. MongoDB is used within RADICAL-Pilot to facilitate communication between its components, i.e., the application, the RP client, and the RP agent.

Notes:
- The Hadoop deployment includes HDFS, YARN, and the MapReduce libraries. In the case of Spark, the core distribution (including RDDs, DataFrames, MLlib, PySpark, and SparkR) is deployed.
- ZooKeeper is not needed for MapReduce or Spark; we deploy ZooKeeper only in streaming mode, as Kafka requires it.
- Filesystem: for the molecular dynamics use cases, we primarily run Spark directly on top of Lustre or other HPC filesystems. HDFS is only beneficial for workloads with large volumes of sequential reads. Deploying HDFS on top of HPC resources in user space is possible but not very practical; in many cases it is not faster, since data first needs to be copied into HDFS, which is a significant overhead.
- When HDFS is provided at the resource level (as on the Wrangler native Hadoop nodes), we of course use it.

Tutorial: https://github.com/radical-cybertools/MIDAS-tutorial/blob/master/pilot/Pilot-Hadoop.ipynb
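
A hedged sketch of what a unit destined for YARN might look like; Pilot-Hadoop translates such a unit into a submission to the YARN resource manager it started on the acquired nodes (the jar path and arguments are placeholders, not from the tutorial notebook):

    import radical.pilot as rp

    # A compute unit whose executable is a YARN application; the agent
    # submits it via a `yarn` command once enough resources are free.
    cud = rp.ComputeUnitDescription()
    cud.executable = "yarn"
    cud.arguments  = ["jar", "hadoop-mapreduce-examples.jar",  # placeholder jar
                      "wordcount", "input/", "output/"]
    cud.cores      = 8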

Pilot-Spark

- The pilot downloads and configures a Spark cluster.
- The master node is co-located with the pilot's agent.
- Once Spark is launched, Spark applications can be submitted.

When the pilot is launched, it downloads and configures Spark. The node where the agent runs is selected as the master node of the Spark cluster, and the remaining nodes become its workers. As soon as the cluster is up, Spark applications can be submitted and executed: the scheduler communicates with the master node to discover the available resources and, if there are enough, the Task Spawner executes a spark-submit command. From that point on, execution is handled by the Spark cluster.

Tutorial: https://github.com/radical-cybertools/MIDAS-tutorial/blob/master/pilot/Pilot-Spark.ipynb
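
A hedged sketch of the Pilot-Spark path, mirroring the Pilot-Hadoop example above (the script name and core count are placeholders, not from the tutorial notebook):

    import radical.pilot as rp

    # A compute unit carrying a PySpark script; with Pilot-Spark the
    # Task Spawner effectively runs something like
    #   spark-submit --master spark://<agent-node>:7077 my_analysis.py
    # against the master node started next to the agent.
    cud = rp.ComputeUnitDescription()
    cud.executable = "my_analysis.py"   # hypothetical PySpark script
    cud.cores      = 16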