JASMIN Overview – UKMO Visit, 24/11/2014 – Matt Pritchard



What is it?
– Petascale storage and cloud computing for big data challenges in environmental science
– 13 Petabytes disk
– 3600 computing cores (HPC, virtualisation)
– High-performance network design
– Private clouds for virtual organisations
For whom?
– Entire NERC community
– Met Office
– European agencies
– Industry partners
For what?
– Everything CEDA did before: curation, facilitation (e.g. BADC, ESGF, …)
– Collaborative workspaces
– Scientific analysis environment

CEDA: Evolution

Data growth
Chart legend: light blue = total of all tape at STFC; green = Large Hadron Collider (LHC) Tier 1 data on tape; dark blue = data on disk in JASMIN.
Data growth on JASMIN has been limited by:
– Not enough disk (now fixed… for a while)
– Not enough local compute (now fixed… for a while)
– Not enough inbound bandwidth (now fixed… for a while)
Projection of unique data volume (Pb) for JASMIN; CMIP6?

Missing piece
– Urgency to provide better environmental predictions
– Need for higher-resolution models
– HPC to perform the computation
– Huge increase in observational capability/capacity
But…
– Massive storage requirement: observational data transfer, storage, processing
– Massive raw data output from prediction models
– Huge requirement to process raw model output into usable predictions (graphics/post-processing)
Hence JASMIN…
Images: ARCHER supercomputer (EPSRC/NERC); JASMIN (STFC/Stephen Kill)

JASMIN Phase 1
UK Government capital investment:
– JASMIN: climate science and Earth System modelling focus; supports UK and European HPC facilities
– CEMS (facility for Climate and Environmental Monitoring from Space): Earth Observation focussed; an industry-academic partnership

JASMIN Phase 1: Logical view (diagram)
– Infrastructure
– Curation: BADC, CEMS Academic Archive, NEODC, IPCC-DDC, UKSSDC
– Analysis environments: virtual machines; group workspaces (NCAS, NCEO, other NERC); LOTUS cluster; CEMS Cloud

JASMIN Phase 1
Configured as a storage and analysis environment.
2 types of compute:
– virtual/cloud environment – flexibility
– batch compute – performance
Both connect to 5 Pb of fast parallel disk.
Network:
– Internal: Gnodal
– External: JANET (UK); OPNs to key partner institutions

Model for JASMIN 1 (diagram)
– SSH login gateways: jasmin-login1, cems-login1 (example session below)
– Data transfer nodes: jasmin-xfer1, cems-xfer1
– Science/analysis VMs: jasmin-sci1, cems-sci1 (general purpose), plus project-specific VMs
– Batch processing cluster: lotus.jc.rl.ac.uk
– Group workspaces: /group_workspaces/jasmin/, /group_workspaces/cems/
– Data centre archive: /badc, /neodc
– CEMS-Academic cloud
Key: general-purpose resources, project-specific resources, data centre resources; external access via firewall.
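To make the access model concrete, here is a minimal sketch of a typical session. The host short names come from the diagram above; the fully qualified domain names, username, source file and group workspace name are assumptions added for illustration only.

```bash
# Log in to the SSH gateway with agent forwarding so you can hop onward
# (the ".ceda.ac.uk" suffix and the username are assumed for illustration)
ssh -A jdoe@jasmin-login1.ceda.ac.uk

# From the gateway, hop to a general-purpose science/analysis VM
ssh jasmin-sci1

# From your own machine, move data in or out via the transfer node
# (hypothetical source file and group workspace path)
rsync -av results.nc jdoe@jasmin-xfer1.ceda.ac.uk:/group_workspaces/jasmin/myproject/
```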

JASMIN 1: Results and lessons learnt
JASMIN has:
– proprietary parallel file system (Panasas) with high I/O performance
– bare-metal compute cluster
– virtualisation and cloud via VMware vCloud
Success for batch compute:
– ATSR full-mission reprocessing: one month’s L1B data processed in 12 minutes, where on the previous system it took 3 days!
– Plot: jobs running in parallel with no I/O issues; each colour represents a node, 12 cores/node. Network diagram shows switches saturating but the storage still has spare bandwidth.
Virtualisation rather than full cloud:
– hosts provisioned for users via helpdesk and virtualisation tools
Usability and user management:
– technically difficult for some users
– not enough control for other users (root)
– help and support too labour intensive
Summary: service delivery needs to catch up with raw infrastructure power!

JASMIN 1 success: UPSCALE
– 250 Tb in 1 year from the PRACE supercomputing facility in Germany (HERMIT)
– Network transfer to JASMIN
– Analysed by Met Office scientists as soon as available
– Deployment of VMs running custom scientific software, co-located with the data
– Outputs migrated to the long-term archive (BADC)
Image: P-L Vidale & R. Schiemann, NCAS
Mizielinski et al. (Geoscientific Model Development, submitted), “High resolution global climate modelling; the UPSCALE project, a large simulation campaign”

Phase 2/3 expansion
2013 NERC Big Data capital investment.
JASMIN hard upgrade:
– Phase 2 (by March 2014): +7 Petabytes disk, +6 Petabytes tape, + compute cores, network enhancement
– Phase 3 (by March 2015): +o(2) Petabytes disk, +o(800) compute cores, network enhancement
JASMIN soft upgrade: virtualisation software, scientific analysis software, cloud management software, dataset construction, documentation.
Wider scope: support projects from new communities, e.g.
– EOS Cloud: environmental ‘omics; Cloud BioLinux platform
– Geohazards: batch compute of Sentinel-1a SAR for large-scale, high-resolution Earth surface deformation measurement. Image: Sentinel-1a (ESA)

JASMIN Now

Storage: Panasas
Used for:
– archive and group workspaces
– home directories
Parallel file system (cf. Lustre, GPFS, pNFS etc):
– single namespace
– 140 GB/sec benchmarked (95 shelves of PAS14)
– access via PanFS client/NFS/CIFS
– POSIX filesystem out of the box, mounted on physical and virtual machines (example below)
Hardware:
– 103 shelves PAS + 95 shelves PAS14, each shelf connected at 10Gb (20Gb for PAS14)
– 2,244 ‘blades’ (each with its own network address!)
– JASMIN is the largest single Panasas realm in the world
– one management console
TCO: big capital, small recurrent; and JASMIN 2 £/TB < GPFS/Lustre offerings.
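Because the storage is presented as an ordinary POSIX filesystem on both physical and virtual machines, everyday tools work unchanged. A minimal sketch follows; the archive path is from the earlier diagram, while the group workspace name and file names are hypothetical.

```bash
# Browse the read-only archive directly on the mounted filesystem
ls /badc

# Check space on a (hypothetical) group workspace
df -h /group_workspaces/jasmin/myproject

# Copy a file between archive and workspace with standard tools
cp /badc/some_dataset/file.nc /group_workspaces/jasmin/myproject/
```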

Storage: NetApp
Used for:
– virtual machine OS image storage
– cloud data storage (on VMDKs)
– lower performance (NFS)
900 TB. Clustered configuration of 4 x FAS6250 controllers:
– redundant pair per disc chain
– SAS disc chains of 10 shelves x 24 discs
One management console for the whole system.
TCO: medium capital, medium performance, small recurrent.
More complex than Panasas to deploy:
– 1 week install + 1 week configuration for 900 TB, vs.
– 3 day physical install and configuration for 7 PB of Panasas

Storage: Elastic tape
Robot tape already in use for the CEDA Archive secondary copy:
– CERN CASTOR system, as used for the LHC Tier 1
– Oracle/StorageTek T10KC
Requirement:
– enable JASMIN GWS managers to make best use of (expensive!) high-performance disk
– move data to/from group workspaces
– tools for them to do it themselves
– not a traditional “backup” system: scales and use cases too diverse
Image: isgtw.org / gridpp

Compute
Batch compute (“LOTUS”) and virtualisation.

Model                 | Processor                     | Cores | Memory
194 x Viglen HX525T2i | Intel Xeon E v2 “Ivy Bridge”  | 16    | 128GB
14 x Viglen HX545T4i  | Intel Xeon E v2 “Ivy Bridge”  | 16    | 512GB
6 x Dell R620         | Intel Xeon E “Sandy Bridge”   | 16    | 128GB
8 x Dell R610         | Intel Xeon X5690 “Westmere”   | 12    | 48GB
3 x Dell R610         | Intel Xeon X5675 “Westmere”   | 12    | 96GB
1 x Dell R815         | AMD Opteron                   | 48    | 256GB

– 226 bare-metal hosts, 3556 cores
– 2 x 10Gb Ethernet (second interface for MPI traffic)
– Intel / AMD processors available
– 17 large-memory hosts
– More than 1.3M jobs over two years
– Hosts can easily be redeployed as VMware/LOTUS nodes

Compute: LOTUS
RHEL + Platform LSF 8 (LSF 9 upgrade planned)
Storage:
– CEDA (BADC, NEODC) archives mounted read-only
– group workspaces mounted read-write
– /home/users
Software:
– PGI and Intel compilers
– Platform MPI
– JASMIN Analysis Platform
– /apps for user-requested software
Job submission:
– from the LOTUS head node, the *sci VMs or specific project VMs (example below)
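As a concrete illustration of submitting work to Platform LSF, here is a minimal sketch; the queue name, job script and resource values are assumptions for illustration, not JASMIN-specific settings.

```bash
# Submit a 16-core, 2-hour job to LSF (queue name and script are hypothetical)
bsub -q lotus -n 16 -W 02:00 -o job_%J.out ./run_analysis.sh

# Check the state of your jobs
bjobs

# Inspect the output once the job has finished (job ID is an example)
less job_12345.out
```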

Networking: key features
It’s big:
– > Gb, >26 switches
– ability to expand to > Gb
High performance, low latency:
– any port to any port is non-blocking: no contention
Outside-world connections:
– 40GbE via RAL site & firewall
– 10GbE Science DMZ
– OPNs (light paths) to UKMO, Edinburgh, Leeds (1–2 GbE)
Separate management network (10/100/1000bE) of >30 switches, 500 ports.

Network: internal
48 compute servers per rack, 5 network cables per server:
– Red: 100Mb network management console
– Blue: redundant 1Gbit virtualisation management network
– Black: 2 x 10Gb network cables
96 x 10Gb cables per rack, patched to 2 Mellanox switches at the bottom of the rack. Mellanox provides unique technology to minimise network contention.
– Orange: 12 x 40Gb uplinks per switch
23 such 10Gb switches: >1000 x 10G ports, >3 Terabit/sec.

Network: internal
12 x 40Gb Mellanox switches:
– 1 connection to each bottom-of-rack switch
– complete redundant mesh
For comparison, the RAL site has a 40Gb connection to JANET/internet using the same type of 40Gb connections.
204 x 40Gb cables provide bandwidth of over 1 Terabyte/sec internal to JASMIN 2.
Phase 3 connects JASMIN 1 to JASMIN 2 via yellow 56Gbit cables.

Design challenge: space to expand (machine-room layout diagram: JASMIN 1, JASMIN 2, JASMIN 3 (2014–15 …), Science DMZ)

Model for JASMIN 1 (recap of the earlier diagram, for comparison with the current model below)

Model for JASMIN now (diagram): as for JASMIN 1 (SSH login gateway jasmin-login1, data transfer node jasmin-xfer1, science/analysis VM jasmin-sci1, batch processing cluster lotus.jc.rl.ac.uk, group workspaces under /group_workspaces/jasmin/, data centre archive /badc, /neodc), plus a Science DMZ containing additional data transfer and ingest nodes (xfer2, ftp2, arrivals, esgf-dn, …) and a private ingest processing cluster.

Management & monitoring
– Kickstart/deployment system
– Puppet control of key configs
– Yum repos (100+ science RPMs)
– LDAP authentication, driven from the CEDA user management DB
– Ganglia web (including power and humidity); user accessible
– Network monitoring (Cacti, Observium, sFlow); overview user-accessible via ‘Dashboard’
– Nagios alerting (hardware and services)
– Intrusion detection (AIDE)
– Universal syslog (Graylog2)
– Root command logging
– Patch monitoring (Pakiti)
– Dedicated helpdesks
– Full-time machine room operations staff

JASMIN Analysis Platform
Software stack enabling scientific analysis on JASMIN:
– a multi-node infrastructure requires a way to install tools quickly and consistently
– the community needs a consistent platform wherever it needs to be deployed
– users need help migrating analysis to JASMIN

What JAP provides
Standard analysis tools (command-line example below):
– NetCDF4, HDF5, GRIB
– Operators: NCO, CDO
– Python stack: NumPy, SciPy, Matplotlib; IRIS, cf-python, cdat_lite; IPython
– GDAL, GEOS
– NCAR Graphics, NCL
– R, Octave, …
Parallelisation and workflow:
– Python MPI bindings
– Jug (simple Python task scheduling)
– IPython notebook
– IPython-parallel
JASMIN Community Intercomparison Suite
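To give a feel for the standard tools, here is a minimal sketch using the NetCDF utilities and operators listed above; the file and variable names are hypothetical.

```bash
# Inspect the header (dimensions, variables, attributes) of a NetCDF file
ncdump -h surface_temperature.nc

# NCO: extract a single variable into a new file
ncks -v tas surface_temperature.nc tas_only.nc

# CDO: compute monthly means from the time series
cdo monmean tas_only.nc tas_monthly_mean.nc
```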

Community Intercomparison Suite (CIS)
CIS is a component of JAP.
Plot types: time series, scatter plots, histograms, global plots, curtain plots, line plots, overlay plots.

Dataset              | Format
AERONET              | Text
MODIS                | HDF
CALIOP               | HDF
CloudSAT             | HDF
AMSRE                | HDF
TRMM                 | HDF
CCI aerosol & cloud  | NetCDF
SEVIRI               | NetCDF
Flight campaign data | RAF
Models               | NetCDF

CIS – Co-location
Model gives global output every 3 hours for a full month. Observations are daytime site measurements, every 15 minutes for a full month. Collocation: the model is the source; the observations provide the sampling.

cis col : :colocator=lin -o
cis plot : : --type comparativescatter \
    --logx --xlabel 'Observations AOT 675nm' --xmin 1.e-3 --xmax 10 \
    --logy --ylabel 'Model AOT 670nm' --ymin 1.e-3 --ymax 10
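A filled-in sketch of the same workflow is shown below. The variable and file names are hypothetical placeholders (the slide omits them); the flags and overall structure simply follow the commands shown above.

```bash
# Collocate model aerosol optical thickness onto the observation points
# (variable and file names are hypothetical)
cis col od550aer:model_aot.nc obs_sites.nc:colocator=lin -o model_on_obs.nc

# Compare the collocated model values against the observations as a scatter plot
cis plot AOT_675:obs_sites.nc od550aer:model_on_obs.nc --type comparativescatter \
    --logx --xlabel 'Observations AOT 675nm' --xmin 1.e-3 --xmax 10 \
    --logy --ylabel 'Model AOT 670nm' --ymin 1.e-3 --ymax 10
```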

Vision for JASMIN 2 (applying lessons from JASMIN 1)
Support a spectrum of usage models. Some key features:
– nodes are general purpose: boot as bare metal or as hypervisors
– use a cloud tenancy model to make virtual organisations
– networking: make an isolated network inside JASMIN to give users greater freedom: full IaaS, root access to hosts, …
Diagram: data archive and compute; high-performance global file system; bare-metal compute; virtualisation; internal private cloud (isolated part of the network); cloud federation API to external cloud providers; cloud-burst as demand requires.

JASMIN Cloud architecture (diagram)
– JASMIN internal network: data centre archive; LOTUS batch compute; managed cloud (PaaS, SaaS) with project organisations (Project1-org, Project2-org), each containing storage, science analysis VMs and a login VM; direct access to the batch processing cluster; direct file system access.
– External network inside JASMIN: unmanaged cloud (IaaS, PaaS, SaaS), e.g. eos-cloud-org with storage, science analysis VMs, CloudBioLinux desktop and fat node, a compute cluster VM and a file server VM; ssh via public IP; standard remote access protocols (ftp, http, …); an IPython Notebook VM could access the cluster through a Python API.
– JASMIN Cloud management interfaces.
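In the unmanaged-cloud case, access is via standard protocols to the tenant's own VMs rather than through the JASMIN login gateways. A trivial sketch, with a documentation-range IP address, username and file path assumed for illustration:

```bash
# SSH straight to a tenant VM on its public IP (hypothetical address and user)
ssh ubuntu@192.0.2.10

# Or fetch data the VM serves over a standard protocol (hypothetical path)
curl -O http://192.0.2.10/results/output.nc
```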

Management via consortia

Consortium                           | Manager
Atmospheric & Polar Science          | Grenville Lister
Oceanography & Shelf Seas            |
Solid Earth & Mineral Physics        |
Genomics                             |
Ecology & Hydrology                  |
Earth Observation & Climate Services | Victoria
Geology                              |
Archive                              | Sam
Director’s cut                       | Bryan

Allocation hierarchy: NERC HPC Committee → Consortium → Project
– MJVO: Managed-cloud JASMIN Virtual Organisation
– UJVO: Unmanaged-cloud JASMIN Virtual Organisation
– GWS: Group Workspace
– ???: non-JASMIN resources, e.g. ARCHER, RDF

Consortium structure (diagram): Project 1 with its GWS, an MJVO (sci, bastion, web bastion) and a UJVO; Project 2 with its GWS and MJVO (sci); and a consortium-level project (ncas_generic) with its own MJVO/UJVO (sci, bastion, web bastion).

Further info
– JASMIN – Centre for Environmental Data Archival (CEDA)
– JASMIN paper: Lawrence, B.N., V.L. Bennett, J. Churchill, M. Juckes, P. Kershaw, S. Pascoe, S. Pepler, M. Pritchard, and A. Stephens. Storing and manipulating environmental big data with JASMIN. Proceedings of IEEE Big Data 2013, pp. 68–75.