Joint Research Centre (JRC) - EO&SS@BigData pilot project - Text and Data Mining Unit - Directorate I: Competences
Experiences with HTCondor Universes on a Petabyte Scale Platform for Earth Observation Data Processing
D. Rodriguez, V. Vasilev, and P. Soille
European HTCondor Workshop 2017, DESY, Hamburg, 6-9 June 2017

OUTLINE
- JRC Earth Observation Data and Processing Platform
- Batch processing with HTCondor universes
- Use cases and achievements: Global Human Settlement Layer, Search for Unidentified Maritime Objects, Marine Ecosystem Modelling
- HTCondor Python bindings
- Lessons learned and open questions

Earth Observation & Social Sensing Big Data Pilot Project - Motivation
- Wide usage of Earth Observation (EO) data at the JRC
- The EU Copernicus Programme with its Sentinel fleet of satellites acts as a game changer: an expected 10 TB/day of free and open data
- New approaches for data management and processing are required
- JRC pilot project launched in 2015
- Major goal: set up a central infrastructure, the JRC Earth Observation Data Processing Platform (JEODPP)
Sentinel launch dates: S1A 03/04/14, S1B 25/04/16, S2A 23/06/15, S2B 07/03/17, S3A 16/02/16, S3B scheduled mid-2017

JRC Earth Observation Data and Processing Platform (JEODPP)
Versatile platform bringing the users to the data and allowing for:
- Running large-scale batch processing of existing scientific workflows, thanks to lightweight virtualisation based on Docker orchestrated by HTCondor
- Remote desktop capability for fast prototyping in legacy environments
- Interactive data visualisation and analysis with Jupyter

JRC Earth Observation Data Processing Platform (JEODPP)
- Main focus on satellite image data
- Support for existing processing workflows and environments (C/C++, Python, MATLAB, Java)
- Development of a Python processing API for EO data, based on C/C++ modules, used for both low-level batch processing and high-level interactive processing

Components of the JEODPP
[Architecture diagram: platform components linked by a high-performance network]

HTCondor universes
Software & tools require different computing environments
[Chart from a 2015 JRC survey: software and tools grouped under the vanilla and parallel universes]

JEODPP HTCondor setup
- HTCondor workload manager
- Processing cluster of commodity hardware: currently ~1000 CPU cores, ~15 TB RAM, 10 Gb/s connectivity with dedicated switches
- Central manager and submit machine on the same node: simplifies management and maximises the computing capacity (each CPU is important)
- Docker containers used for executing jobs: flexible management of processing environments

Workload manager - HTCondor
A great variety of applications requires great flexibility:
- Vanilla universe: allows running almost any single-core application
- Docker universe: a container instance is the HTCondor job (job sandboxing)
- Parallel universe: for jobs whose processes must run at the same time on different machines
Use cases: Global Human Settlement Layer, MARitime SECurity, SEACOAST, hydrodynamic and ecosystem simulations
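As an illustration only (not taken from the slides), a minimal sketch of how the choice of universe shows up in a submit description, written here with the HTCondor Python bindings; the image and script names are hypothetical:

    import htcondor

    # A Docker-universe job: the container instance is the job (sandboxing).
    # Image and script names are made up for illustration.
    docker_job = htcondor.Submit({
        "universe": "docker",
        "docker_image": "jeodpp/worker:latest",
        "executable": "process.sh",
        "arguments": "$(ProcId)",
        "request_cpus": "1",
        "request_memory": "2GB",
        "output": "out.$(ClusterId).$(ProcId)",
        "error": "err.$(ClusterId).$(ProcId)",
        "log": "job.$(ClusterId).log",
    })

    # The vanilla universe uses the same description without docker_image;
    # the parallel universe additionally sets machine_count so that all
    # processes of one job start at the same time on different machines.
    print(docker_job)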

Global Human Settlement Layer
Running in the Vanilla and Docker universes
- Pre-processing of the 5,026 products, resulting in a volume of ~19 TB; produced data (S1 GHSL output): ~4 TB
- Duration of batch processing: ~12 h with ~240 concurrent jobs
- Single-core application based on MATLAB
- Several small containers run on each host (more than one container per host); dynamic or static slots both work
[Diagram: submit node dispatching a Condor job array to worker containers on the hosts]
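For concreteness, a hedged sketch of queuing such a job array through the Python bindings; the wrapper script, chunk file names and resource requests are assumptions, not taken from the slide:

    import htcondor

    # One cluster of single-core jobs, each processing its own chunk of products.
    sub = htcondor.Submit({
        "universe": "vanilla",
        "executable": "run_ghsl.sh",                   # hypothetical wrapper around the MATLAB application
        "arguments": "chunks/products_$(ProcId).txt",  # hypothetical pre-split product lists
        "request_cpus": "1",
        "request_memory": "4GB",
        "log": "ghsl_$(ClusterId).log",
    })

    schedd = htcondor.Schedd()
    with schedd.transaction() as txn:           # classic transaction-based API
        cluster_id = sub.queue(txn, count=240)  # roughly the concurrency reported on the slide
    print("submitted cluster", cluster_id)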

Search for Unidentified Maritime Objects (SUMO)
Running in the Docker universe
[Image: ship detection; credit: Carlos Santamaria, JRC, 2017]
- Processing of 11,451 Sentinel-1 products, resulting in a volume of ~7.5 GB; 480 concurrent jobs at a time; the whole process completed in 1h30
- Extraction of S1 metainfo (3.4 GB) with a MATLAB script; 390 concurrent jobs at a time; the whole process completed in 1 h
- Multi-core application, large containers, no communication between containers
[Diagram: submit node dispatching a Condor job array to one worker container per host]
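Since the containers do not talk to each other, the plain Docker universe is enough even for the multi-core case; a sketch of the only knob that changes relative to the single-core description, with image, wrapper and resource figures as illustrative assumptions:

    import htcondor

    # Multi-core, large-container job: several cores per container,
    # no communication between containers.
    sub = htcondor.Submit({
        "universe": "docker",
        "docker_image": "jeodpp/sumo:latest",       # hypothetical image
        "executable": "detect_ships.sh",            # hypothetical wrapper
        "arguments": "s1_products_$(ProcId).txt",
        "request_cpus": "4",                        # multi-core application
        "request_memory": "16GB",
        "log": "sumo_$(ClusterId).log",
    })
    print(sub)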

Marine Ecosystem Modelling
Running in the Parallel universe
- Hydrodynamic and ecosystem simulations: http://mcc.jrc.ec.europa.eu
- OpenMPI application based on FORTRAN
- Model grid execution divided into 64 domains for the 10 km resolution of the Mediterranean Sea
- Workflow executed in two steps: one job triggers the startup of a virtual HPC cluster based on Docker containers, then the job runs the working script on the virtual HPC cluster using OpenMPI
[Diagram: submit node sends a Condor job into a virtual HPC environment with a master container and several worker containers connected through MPI communications; based on S. Yakubov et al.]
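For comparison, a minimal parallel-universe submit sketch in the same Python style; it does not reproduce the Docker-based virtual HPC cluster bootstrap from the slide, and the wrapper path, model binary and the node/core split are assumptions:

    import htcondor

    # Parallel-universe sketch for an OpenMPI job. "openmpiscript" is the MPI
    # wrapper distributed with HTCondor (its path varies by installation).
    sub = htcondor.Submit({
        "universe": "parallel",
        "executable": "openmpiscript",
        "arguments": "ocean_model input.nml",   # hypothetical FORTRAN binary and input
        "machine_count": "8",
        "request_cpus": "8",                    # e.g. 8 nodes x 8 cores = 64 MPI domains
        "should_transfer_files": "YES",
        "when_to_transfer_output": "ON_EXIT",
        "log": "medsea_$(ClusterId).log",
    })

The key difference from the other universes is machine_count, which makes HTCondor claim all the machines at the same time so the MPI ranks can start together.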

Ongoing work - HTCondor Python bindings
What can the user do?
- Trigger a dedicated container in order to run an HTCondor notebook
- Import the classad and htcondor modules
- Edit job requirements
- Submit jobs
- Monitor jobs and the queue status
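A hedged end-to-end sketch of that workflow with the bindings; the image name, script and requirements expression are placeholders, and the transaction-based API shown is the one current at the time of the workshop:

    import classad
    import htcondor

    schedd = htcondor.Schedd()

    # Build a submit description and edit its requirements expression
    sub = htcondor.Submit({
        "universe": "docker",
        "docker_image": "jeodpp/notebook-worker:latest",   # hypothetical image
        "executable": "process.py",                        # hypothetical script
        "request_memory": "2GB",
        "log": "notebook_jobs.log",
    })
    sub["requirements"] = str(classad.ExprTree('OpSysAndVer == "CentOS7"'))  # example constraint

    # Submit a small cluster of jobs
    with schedd.transaction() as txn:
        cluster_id = sub.queue(txn, count=10)

    # Monitor the jobs and the queue status
    status = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}
    for ad in schedd.query("ClusterId == %d" % cluster_id, ["ProcId", "JobStatus"]):
        print(ad["ProcId"], status.get(ad["JobStatus"], ad["JobStatus"]))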

Problems experienced
- Schedd limits: a maximum of about 20K jobs per submission, mainly due to memory usage and disk transactions
- Jobs are therefore forced to run on a stack of inputs (a chunk); if one job fails, the whole chunk has to be reprocessed
[Diagram: a chunk of inputs mapped to jobs in the queue]

Problems experienced
- Jobs can enter a vicious circle: for example, a host that cannot start Docker containers keeps accepting jobs from the same job cluster

Problems experienced
- The Docker universe does not force multithreaded applications to run on a single core
- If we are lucky, we can limit the thread count through environment variables (e.g. OMP_NUM_THREADS, GDAL_NUM_THREADS)
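A sketch of that workaround in submit terms, via the Python bindings, with hypothetical image and script names; which variables matter depends on the application, OpenMP and GDAL being the examples from the slide:

    import htcondor

    # Pin the thread count through the job environment, since the Docker
    # universe does not enforce the core limit on multithreaded code.
    sub = htcondor.Submit({
        "universe": "docker",
        "docker_image": "jeodpp/worker:latest",   # hypothetical image
        "executable": "process.sh",               # hypothetical script
        "request_cpus": "1",
        "environment": '"OMP_NUM_THREADS=1 GDAL_NUM_THREADS=1"',
        "log": "job_$(ClusterId).log",
    })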

Lessons learned
- Increasing the load per job improves the overall performance.
- Keep everything local: reading and writing directly on the shared file system can drastically decrease the overall performance (should_transfer_files = YES).
- Logs must be organised, e.g. /submit_machine/local/job$(Cluster)_$(Process).
- Set a time limit for job execution (a job can otherwise stay in the running state forever).
- Foresee big scratch space for your cluster nodes and place /var/lib/docker/ and /var/lib/condor on a separate volume: some unforeseen errors can quietly fill up the space.
- A parallel shell is a good alternative for condor_config (to stop, start and reconfigure not just condor but also docker, etc.).
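The same points expressed as a hedged submit sketch in Python; paths, image name and the concrete time limit are assumptions:

    import htcondor

    MAX_RUNTIME = 24 * 60 * 60  # example wall-time limit of one day

    # Local file transfer instead of the shared file system, per-cluster and
    # per-process logs, and a periodic_remove expression as a job time limit.
    sub = htcondor.Submit({
        "universe": "docker",
        "docker_image": "jeodpp/worker:latest",   # hypothetical image
        "executable": "process.sh",               # hypothetical script
        "should_transfer_files": "YES",
        "when_to_transfer_output": "ON_EXIT",
        "output": "local/job$(Cluster)_$(Process).out",
        "error": "local/job$(Cluster)_$(Process).err",
        "log": "local/job$(Cluster).log",
        "periodic_remove": "(JobStatus == 2) && ((time() - EnteredCurrentStatus) > %d)" % MAX_RUNTIME,
    })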

Open questions
- Is it possible to add a universe? For example, a combination of the Docker universe with the Parallel universe:
  universe = PVM  # Parallel Virtual Machine

Thank you all
EO&SS@BigData pilot project, Text and Data Mining Unit, Directorate I: Competences