Required Data Centre and Interoperable Services: CEDA


Required Data Centre and Interoperable Services: CEDA
Philip Kershaw, Victoria Bennett, Martin Juckes, Bryan Lawrence, Sam Pepler, Matt Pritchard, Ag Stephens
(JASMIN photo: STFC/Stephen Kill)

CEDA, JASMIN and UK Research Councils
CEDA is part of STFC and runs data centres on behalf of NERC.

Centre for Environmental Data Analysis
CEDA's mission is "to support environmental science, further environmental data archival practices, and develop and deploy new technologies to enhance access to data": curation and facilitation, delivered through the JASMIN facility.

Data Science and Food: an Analogy
- Food: quality ingredients (organised and discoverable); suitable tools, a plan (a recipe) and experts (well-trained people); a wide variety of products targeted at different consumers.
- Data science: quality data (organised and discoverable); suitable platforms, algorithms and experts (well-trained people); a wide variety of products targeted at different consumers.
(Slide courtesy of Bryan)

CEDA + JASMIN Functional View

JASMIN
A petascale storage, hosted processing and analysis facility for big data challenges in environmental science:
- 16 PB high-performance storage (~250 GB/s)
- high-performance computing (~4,000 cores)
- non-blocking networking (>3 Tbit/s) and optical private network WANs
- coupled with cloud hosting capabilities
Serving the entire UK NERC community, the Met Office, European agencies and industry partners. You can get food ready-made, but you can also go into the kitchen and make your own (IaaS). (JASMIN photo: STFC/Stephen Kill)

JASMIN
Four services provided to the community:
- storage (disk and tape)
- batch computing ("Lotus")
- hosted computing
- cloud computing

Challenges
The big data V's: volume, velocity and variety (complexity). How do we provide a holistic, cross-cutting technical solution that delivers performance, multi-tenancy and flexibility, while also meeting the needs of the long tail of science users?
- All the data available all of the time.
- Maximise utilisation of compute, network and storage (the 'Tetris' problem).
- With an agile deployment architecture.

Volume and Velocity: data growth
(For comparison: Large Hadron Collider Tier 1 data held on tape at STFC.) The JASMIN 3 upgrade addressed growth in disk, local compute and inbound bandwidth. Looking forward, a mix of disk and nearline tape storage will be needed; cloud bursting is a possible answer for compute growth.

Volume and Velocity: CMIP data at CEDA
For CMIP5, CEDA holds 1.2 PB of model output data. For CMIP6, estimates are "1 to 20 Petabytes within the next 4 years", plus 10-50 PB of HighResMIP data on tape with a 2 PB disk cache. Archive growth is not constant: it depends on the timeline of outputs becoming available from the modelling centres. (Figure: schematic of the proposed experiment design for CMIP6.)

Volume and Velocity: Sentinel data at CEDA
The Sentinels are a new family of ESA Earth observation satellite missions for the Copernicus programme (formerly GMES). Copernicus is a joint EU and ESA initiative to ensure timely and easily accessible information to improve management of the environment, understand and mitigate the effects of climate change, and ensure civil security. CEDA will be the UK ESA relay hub.
CEDA Sentinel Archive: recent data (O(6-12) months) stored online; older data stored nearline; growth is predictable over time.

Mission          Daily data rate        Product archive/year
Sentinel-1A, 1B  1.8 TB/day raw data    2 PB/year
Sentinel-2A      1.6 TB/day raw data    2.4 PB/year
Sentinel-3A      0.6 TB/day raw data    -

S-1A launched 3 April 2014; S-2A launched 23 June 2015; S-3A expected November 2015. Expected total: 10 TB/day when all missions are operational.

Variety (complexity)
Headline figures for projects hosted using ESGF: 3 PB archive, ~250 datasets, >200 million files, 23,000 registered users. Projects hosted using ESGF include CMIP5, SPECS, CCMI, CLIPC and the ESA CCI Open Data Portal. ESGF's faceted search and federation capabilities are powerful, but we need effective means to integrate other heterogeneous sources of data. All CEDA data are hosted through the common CEDA web presence: the MOLES metadata catalogue, OPeNDAP (PyDAP) and FTP. The CEDA user base has been diversifying.

Variety example 1: Elasticsearch project
Indexing file-level metadata for the archive (3 PB, ~250 datasets, >200 million files) using the Lotus cluster on JASMIN, in phases: file attributes (e.g. checksums), then file variables, then geo-temporal information. An OpenSearch facade will be added to the CEDA Elasticsearch service to provide an ESA-compatible search API for Sentinel data. The EUFAR flight finder project piloted the use of Elasticsearch on heterogeneous airborne datasets and transformed the accessibility of that data.
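The three indexing phases can be pictured as one document per file. The sketch below builds such a document; the field names (md5, variables, spatial, temporal) and the example path are illustrative assumptions, not the production CEDA schema.

```python
# Sketch: build one per-file metadata document of the kind the three
# indexing phases imply. Field names and the path are illustrative only.
import hashlib
import json

def build_file_doc(path, data, variables, bbox, start, end):
    """Build an Elasticsearch-style document for one archive file.

    bbox is (west, south, east, north); start/end are ISO 8601 strings.
    """
    return {
        "path": path,
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),   # phase 1: file attributes
        "variables": sorted(variables),          # phase 2: file variables
        "spatial": {                             # phase 3: geo-temporal info
            "type": "envelope",                  # Elasticsearch geo_shape form
            "coordinates": [[bbox[0], bbox[3]], [bbox[2], bbox[1]]],
        },
        "temporal": {"start": start, "end": end},
    }

doc = build_file_doc(
    "/neodc/sentinel1a/data/IW/L1_GRD/2015/01/scene.nc",  # hypothetical path
    b"dummy bytes standing in for file content",
    {"backscatter", "incidence_angle"},
    (-11.0, 49.0, 2.0, 61.0),
    "2015-01-01T00:00:00Z", "2015-01-01T00:04:59Z",
)
print(json.dumps(doc, indent=2))
```

In production, documents like these would be bulk-indexed by Lotus batch jobs walking the archive, with the OpenSearch facade translating external queries into Elasticsearch searches over them.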

Variety example 2: ESA CCI Open Data Portal
The ESA Climate Change Initiative (CCI) responds directly to UNFCCC/GCOS requirements within the internationally coordinated context of GEO/CEOS; it is delivered through the ESA Earthwatch programme element "Global Monitoring of Essential Climate Variables" (GMECV). The United Nations Framework Convention on Climate Change (UNFCCC) concerns systematic observation and the development of data archives related to the climate system, and the Global Climate Observing System (GCOS, established 1992) maintains a list of high-impact Essential Climate Variables (ECVs). The goal of the portal is to provide a single point of access to the subset of mature and validated ECV data products for climate users. It builds on the ESGF architecture, but the datasets are very heterogeneous, not like well-behaved model outputs ;-)

CCI Open Data Portal Architecture
- Web presence / user interface: data discovery and other user services; consumes the search interfaces below.
- Vocabulary Server (SPARQL interface): single point of reference for the CCI DRS; the DRS is defined with SKOS and OWL classes.
- ISO19115 catalogue (OGC CSW): ISO records are created at catalogue generation and tagged with the appropriate DRS terms to link CSW and ESGF search results.
- ESGF Index Node: search services over a Solr index created by the ESG Publisher.
- ESGF Data Node: data download services for the user community (THREDDS, GridFTP, OPeNDAP, WCS, WMS, FTP), applying access policy, logging and auditing.
- CCI Data Archive: data ingest via a quality checker.

CCI Open Data Portal: DRS Ontology
Specifies the DRS (Data Reference Syntax) vocabulary for the CCI project, and could be applied to other ESGF projects. Some terms, such as organisation and frequency, are common with CMIP5; CCI-specific terms, such as Essential Climate Variable, are added. SKOS allows relationships with similar terms to be expressed.
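As a sketch of what a SKOS-based DRS vocabulary of this shape might look like (the namespace, concept URIs and labels here are illustrative assumptions, not the real CCI ontology):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix cci:  <http://vocab.example.org/cci/drs/> .   # hypothetical namespace

cci:ecv a skos:ConceptScheme ;
    skos:prefLabel "Essential Climate Variable"@en .

cci:sea_surface_temperature a skos:Concept ;
    skos:inScheme cci:ecv ;
    skos:prefLabel "sea surface temperature"@en ;
    skos:altLabel "SST"@en ;
    # SKOS relationships can point at similar terms in other schemes,
    # e.g. a CMIP5 DRS variable (URI is a placeholder):
    skos:related <http://vocab.example.org/cmip5/drs/variable/tos> .
```

The `skos:related` link is what lets heterogeneous CCI products be connected to near-equivalent terms in other ESGF projects' vocabularies.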

JASMIN Evolution 1) HTC (High Throughput Computing)
Success came through recognising that the workloads are I/O bound: storage and analysis on a global file system, where group workspaces now exceed the space taken by the curated archive. The infrastructure supports a spectrum of usage models, different slices through the same hardware:
- Data archive and compute: bare-metal compute on the high-performance global file system.
- Internal private cloud: virtualisation, for flexibility and simplification of management.
- JASMIN Cloud: an isolated part of the network/infrastructure, needed for IaaS where users take full control of what they want installed and how; flexibility and multi-tenancy.

JASMIN Evolution 2) Cloud Architecture
The JASMIN cloud sits on an external network inside JASMIN, behind a firewall, with cloud management interfaces and access for hosted services. Two flavours of tenancy:
- Managed cloud (PaaS, SaaS): e.g. a "Project1-org" tenancy of science analysis VMs built from the JASMIN Analysis Platform VM in an appliance catalogue, with direct (Panasas) file system access and direct access to the Lotus batch compute cluster.
- Unmanaged cloud (IaaS, PaaS, SaaS): e.g. an "another-org" tenancy behind a firewall + NAT with an ssh bastion, database VM and web application server VM; or an "eos-cloud-org" tenancy with CloudBioLinux desktop VMs (with dynamic RAM boost), a file server VM and a CloudBioLinux fat node, using standard remote access protocols (ftp, http, ...) and NetApp storage.

Custom Cloud Portal
The JASMIN Cloud Management Interface is built on top of the VMware vCloud API.
- Keep it simple: provide just enough functionality to provision and configure VMs.
- Right tools for the right users: scientists, developers and administrators.
- An abstraction layer over vCloud also provides a route to cloud federation / bursting: alongside the vCloud API client, OpenStack, public cloud and other clients can implement the same interface.
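The abstraction layer described above can be sketched as a small interface with interchangeable backends. This is a minimal illustration only: the class and method names are assumptions, not the portal's real API, and the backend methods are stubs standing in for real vCloud/OpenStack calls.

```python
# Sketch of a portal abstraction layer over multiple cloud backends.
# Names are illustrative; real backends would call the vCloud or Nova APIs.
from abc import ABC, abstractmethod

class CloudClient(ABC):
    """Just enough functionality to provision and configure VMs."""

    @abstractmethod
    def provision_vm(self, name: str, template: str) -> str:
        """Create a VM from an appliance template; return its identifier."""

class VCloudClient(CloudClient):
    def provision_vm(self, name, template):
        # Would call the VMware vCloud REST API here.
        return f"vcloud:{name}"

class OpenStackClient(CloudClient):
    def provision_vm(self, name, template):
        # Would call the OpenStack compute API: the route to federation/bursting.
        return f"openstack:{name}"

def provision(client: CloudClient, name: str, template: str) -> str:
    # Portal code depends only on the abstract interface, so backends
    # (vCloud, OpenStack, public cloud) are interchangeable.
    return client.provision_vm(name, template)

print(provision(VCloudClient(), "analysis-vm-0", "jasmin-analysis-platform"))
print(provision(OpenStackClient(), "analysis-vm-0", "jasmin-analysis-platform"))
```

Because the portal only ever sees `CloudClient`, swapping the underlying provider (or bursting to an external one) does not touch the user-facing provisioning code.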

JASMIN Evolution 3) JASMIN Cloud
How can we effectively bridge between different technologies and usage paradigms, and make the most effective use of finite resources?
- Storage: a 'traditional' high-performance global file system doesn't sit well with the cloud model, although the JASMIN PaaS provides a dedicated VM NIC for Panasas access.
- Compute: batch and cloud are kept separate (cattle and pets), and segregation means less effective use of the overall resource. VM appliance templates cannot deliver portability across infrastructures, and spin-up time for VMs on disk storage can be slow.
The spectrum of usage models (data archive and compute on bare metal; internal private cloud via virtualisation; the isolated cloud network) now extends to external cloud providers through cloud federation / bursting.

JASMIN Evolution 4) Object storage and container technologies
Object storage enables scaling of global access (REST API) both inside and external to the data centre (cf. cloud bursting); an STFC CEPH object store is being prepared for production use, making workloads more amenable to bursting to public cloud or other research clouds. Container technologies offer easy scaling, portability between infrastructures (for bursting) and responsive start-up. The OPTIRAD project provided initial experiences with containers and container orchestration.

OPTIRAD Deployment Architecture
The OPTIRAD JASMIN cloud tenancy sits behind a firewall, with browser access for users:
- A shared-services VM runs NFS, LDAP and JupyterHub; JupyterHub manages users and the provisioning of notebooks.
- Docker Swarm manages the allocation of containers for notebooks across a pool of VMs; the Jupyter (IPython) notebooks and their kernels run in Docker containers.
- Further VMs act as nodes for parallel processing, running IPython parallel controllers and engines alongside the notebook kernels.
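A deployment of this shape is usually driven by a JupyterHub configuration file. The fragment below is a sketch under stated assumptions: it uses the `dockerspawner` and `ldapauthenticator` packages, and the image name and server addresses are placeholders, not OPTIRAD's actual settings.

```python
# jupyterhub_config.py: sketch of an OPTIRAD-style deployment.
# Assumes the dockerspawner and ldapauthenticator packages; image and
# addresses below are placeholders.
c = get_config()  # provided by JupyterHub at startup

# Spawn each user's notebook server as a Docker container.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/scipy-notebook"  # placeholder image

# The Hub runs on the shared-services VM; spawned containers must reach
# it over the tenancy network rather than via localhost.
c.JupyterHub.hub_ip = "0.0.0.0"

# Authenticate against the tenancy's LDAP service.
c.JupyterHub.authenticator_class = "ldapauthenticator.LDAPAuthenticator"
c.LDAPAuthenticator.server_address = "ldap.internal.example"  # placeholder
```

Swarm-based scheduling across the VM pool would substitute a Swarm-aware spawner for the plain `DockerSpawner`, leaving the rest of the configuration unchanged.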

Challenges for implementation of a container-based solution
- Managing elasticity of compute with both containers and host VMs, and extending the use of containers to parallel compute.
- Which orchestration solution? Swarm, Kubernetes... This provoked some fundamental questions about how we blend cloud with batch compute. Apache Mesos treats the data centre as a server: containers blur the traditional lines between OS, hosted application and hosting environment, and Mesos integrates popular frameworks (Hadoop, Spark, ...) in one.
- Managing elasticity of storage: provide object storage with a REST API (CEPH is the likely candidate, with an S3 interface), BUT users will need to re-engineer POSIX interfaces to use the flat key-value interface of an object store.
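The POSIX-to-object-store re-engineering the last point describes can be made concrete with a toy example. The class below is a stand-in for an S3-style store (a flat key-to-bytes mapping, no directories, whole-object puts and gets); the bucket-like key names are illustrative only.

```python
# Sketch of the flat key-value model an object store imposes: no directory
# tree, no partial writes; 'listing a directory' becomes a prefix query.
class FlatObjectStore:
    """Stand-in for an S3-style store: a flat key -> bytes mapping."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, body: bytes):
        # Whole-object write; there is no seek/append as in POSIX.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_prefix(self, prefix: str):
        # Directory listings become prefix queries over flat keys.
        return sorted(k for k in self._objects if k.startswith(prefix))

store = FlatObjectStore()
# A POSIX path simply becomes an opaque key; '/' has no special meaning.
store.put("badc/cmip5/data/output1/tas_day.nc", b"netcdf bytes")
store.put("badc/cmip5/data/output1/pr_day.nc", b"netcdf bytes")
print(store.list_prefix("badc/cmip5/"))
```

Code that assumed `open()`/`seek()`/`readdir()` on the global file system has to be restructured around exactly these three operations, which is the re-engineering cost noted above.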

Further information
JASMIN: http://jasmin.ac.uk/
- EO Science From Big EO Data On The JASMIN-CEMS Infrastructure, Proceedings of the ESA 2014 Conference on Big Data from Space (BiDS'14)
- Storing and manipulating environmental big data with JASMIN, IEEE Big Data Conference, Santa Clara, CA, Sept 2013. http://home.badc.rl.ac.uk/lawrence/static/2013/10/14/LawEA13_Jasmin.pdf
OPTIRAD:
- The OPTIRAD Platform: Cloud-Hosted IPython Notebooks for Collaborative EO Data Analysis and Processing, EO Open Science 2.0, ESA-ESRIN, Frascati, Oct 2015
- Optimisation Environment For Joint Retrieval Of Multi-Sensor Radiances (OPTIRAD), Proceedings of the ESA 2014 Conference on Big Data from Space (BiDS'14). http://dx.doi.org/10.2788/1823
Deploying JupyterHub with Docker: https://developer.rackspace.com/blog/deploying-jupyterhub-for-education/