EMI INFSO-RI Cloud Task Force Eric Yen and Simon Lin 22 Nov. 2010
Toward the DCI on Grid and Cloud
Enabling collaboration to realize that the whole is greater than the sum of its parts
WWG realized the global e-Infrastructure to share resources over the Internet
Mário Campolargo, European Commission - DG INFSO – OGF 23, Barcelona, June 2008
Cloud offers versatile granularity and new usage patterns to the DCI services
  – Granularity: service-oriented layers in infrastructure, platform, software, data, network, etc.
  – Usage pattern: on-demand elasticity
  – More user-customized and user-controlled environments on remote resources
DCI: e-Science Infrastructure
Driven by Data Deluge
  – Turning data into insight and knowledge base efficiently
  – Open, consistent and well-designed data format, interface, protocol and quality code
  – Searchability, accessibility and sustainability
Resources and tools are sharable across disciplines
Enable Service-Oriented Science
  – “scientific research enabled by distributed networks of interoperating services”
  – New e-Infrastructure is required to host both the data and services
Definition of Cloud Computing
Definition (NIST)
  – Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
The challenge of IT is complexity rather than scale
Therefore, the goals of DCI are to have
  – Elastic services & resources: on-demand, unlimited scalability, dynamic & low-cost migration
  – Customized applications/tools/services to accelerate knowledge discovery: simple tools to answer complex questions
  – …
Requirements from e-Science Use Cases
Job and resource matchmaking now depends on performance (architecture features), reliability, cost and data availability, rather than being dominated by resource availability
  – Then queued by priority at the target site
  – Auto migration, VM management, Cloud federation
Jobs could overflow to collaborating sites (of the same VO or even other VOs) whenever necessary, based on performance requirements (e.g., Time to Finish)
Computing environment on-demand
  – E.g., MapReduce computing environment; distributed database like HBase; file system (GPFS/HDFS/Lustre/NFS), etc.
Storage space on-demand?
  – Have to retain the original consistent storage system
  – Have to meet performance requirement
Data as a Service
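The matchmaking and overflow requirement above can be sketched in a few lines. This is only an illustration under stated assumptions, not an EMI component: the `Site` fields, the ranking order (data locality first, then reliability-adjusted time to finish, then cost) and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    # All fields are hypothetical illustrations of the criteria on the slide.
    name: str
    est_time_to_finish_h: float  # estimated hours until the job would finish here
    reliability: float           # fraction of jobs that complete successfully (0..1]
    cost_per_job: float          # arbitrary cost units
    has_data: bool               # True if the input data is already at the site


def meets_requirements(site: Site, deadline_h: float, max_cost: float) -> bool:
    """A site is eligible if it can finish in time and within budget."""
    return site.est_time_to_finish_h <= deadline_h and site.cost_per_job <= max_cost


def rank_sites(sites: List[Site], deadline_h: float, max_cost: float) -> List[Site]:
    """Return eligible sites, best first (assumed ordering, see lead-in)."""
    eligible = [s for s in sites if meets_requirements(s, deadline_h, max_cost)]
    return sorted(
        eligible,
        key=lambda s: (not s.has_data,
                       s.est_time_to_finish_h / s.reliability,
                       s.cost_per_job),
    )


def place_job(home: Site, partners: List[Site],
              deadline_h: float, max_cost: float) -> Optional[Site]:
    """Keep the job at the home site if possible; otherwise overflow to a partner."""
    if meets_requirements(home, deadline_h, max_cost):
        return home
    ranked = rank_sites(partners, deadline_h, max_cost)
    return ranked[0] if ranked else None
```

A job is only moved when the home site cannot meet the Time to Finish requirement, which matches the "whenever necessary" wording; a real matchmaker would also take queue priority at the target site into account.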
Requirements from Cloud Service Providers
Easier and friendlier AAI
Storage as a Service
Cloud Federation
  – Interoperability (DCI information system), automatic scaling, HA provisioning, policy, AAI, accounting and monitoring, …
Standards to prevent vendor lock-in
Network as a Service
Collaborations
OCCI (and OGF)
VenusC
StratusLab
WNoD
RESERVOIR (OpenNebula)
EDGI
SIENA (Standards and Interoperability for eInfrastructure implemeNtation initiAtive)
Globus and Nimbus
DMTF (Distributed Management Task Force)
OMG (Object Management Group)
SNIA (Storage Networking Industry Association)
OCC (Open Cloud Consortium)
What we have from the Grid Middleware
EMI middleware stack
Identify those components which should be integrated with Cloud technology
Challenges of Cloud at this moment
Application management
  – Community appliance management and customized computing environment provisioning
AAI trust framework
Reliability
Elasticity
  – Automatic overflow and scalability
  – Scale out to new communities with diverse needs
Performance of deployment and runtime
  – Data trans.: LanTorrent, streaming and min. congestion, avoid duplicate trans.
  – 1,000 VMs in 10 min (by Nimbus)
Cost
Standardization
Interoperability and Federation (with both Clouds and Grids)
What Should the Cloud Task Force Do?
Interoperability between grids, supercomputers and emerging computing models like clouds and desktop grids will be extended to address scalability and accessibility requirements.
Evaluate integration scenarios with off-the-shelf computing cloud systems
Bridge the gap between grid and cloud infrastructure
Report with architecture/infrastructure sketch
M30 - Successful computational usage of emerging computing models, i.e., clouds with EMI components (and a test bed)
Plan for the next 6 months
Identify probable scenarios for the evolution of EMI and Cloud
  – Based on e-Science applications
  – Barriers to effective computing, vision of future computing model from user community, etc.
Identify which EMI components should be adapted to, made interoperable with, or integrated with which Cloud technologies, with steps and schedule
  – OCCI focuses on generic standardization issues from the use case analysis, performance assessment metrics, etc.
  – EMI should aim for the scope between EMI software stacks and Clouds
  – What cloud technology has to be deployed
Test bed design and evaluation
Make recommendations on EMI components from the evolution roadmap
Backup Slides
Deliverable & Milestone
Identify which EMI components should be adapted to, made interoperable with, or integrated with which Cloud technologies, with steps and schedule
  – EMI MW stack, or just in terms of the EMI areas first
  – What cloud technology has to be deployed
Make recommendations on EMI components from the evolution roadmap
Experiences from User Communities
STAR
  – Nimbus + EC2 (Magellan)
Ocean Observatories Initiative
  – Adaptive, reliable and elastic computing, and resource strategy
  – HA
  – Nimbus + EC2 + FutureGrid
BaBar
  – Nimbus for appliance preparation and cloud scheduler
CANFAR (Canadian Adv Network for Astro Research)
  – Nimbus + WestGrid for appliance management
CloVR
  – Portable virtual appliance for automated sequence analysis: push-button pipelines, from desktop to cloud
  – Nimbus + EC2 + Magellan
Sky Computing
Network as a Service
The LHC experiments, with their distributed Computing Models and world-wide hands-on involvement in LHC physics, have brought renewed focus on networks
  – This has given the experiments the confidence to seek more agile and effective models of data distribution and/or remote access
  – Bringing new challenges and opportunities, both for the Computing Models and a broader network infrastructure
An exponential growth in capacity
  – 10X in usage every 47 months in ESnet over 18 years
  – 6M times capacity growth over 25 years across the Atlantic (LEP3Net in 1985 to US LHCNet in 2010)
  – The transitions from 10G to 40G ( ) and 100G ( ) are the next steps
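To compare the two growth figures quoted above on a common scale, both can be converted to an equivalent annual growth factor. The sketch below assumes smooth exponential growth over the stated periods, which the slide does not claim; the function and variable names are illustrative only.

```python
import math


def annual_factor(total_growth: float, years: float) -> float:
    """Equivalent per-year growth factor, assuming smooth exponential growth."""
    return total_growth ** (1.0 / years)


def doubling_time_years(factor_per_year: float) -> float:
    """Years needed to double at the given annual growth factor."""
    return math.log(2.0) / math.log(factor_per_year)


# ESnet: 10X in usage every 47 months
esnet_annual = annual_factor(10.0, 47.0 / 12.0)
# Implied total growth if that rate holds over the full 18 years
esnet_total_18y = esnet_annual ** 18

# Transatlantic capacity: 6M times over 25 years (1985 to 2010)
atlantic_annual = annual_factor(6e6, 25.0)

if __name__ == "__main__":
    print(f"ESnet annual factor:        {esnet_annual:.2f}")
    print(f"ESnet growth over 18 years: {esnet_total_18y:,.0f}x")
    print(f"Atlantic annual factor:     {atlantic_annual:.2f}")
    print(f"ESnet doubling time (y):    {doubling_time_years(esnet_annual):.2f}")
    print(f"Atlantic doubling time (y): {doubling_time_years(atlantic_annual):.2f}")
```

Under this assumption both series work out to a little under a doubling per year (roughly 1.8x and 1.9x annually), so the two figures on the slide describe broadly similar growth rates.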
Purposes of the TF Session & from the AHM
Work Plan (with directional issue)
  – Ask for comment
  – Call for contribution
Identify the function between TF of Virtualization and Cloud
Define agenda for the next 6 months
Request a session at the OGF in Taipei, and at the EGI UF as well
IaaS | PaaS | SaaS | DaaS | NaaS
Common: On-demand, Information & Monitoring
Job Management
Auto overflow to other site/cloud
Customized comp env. provisioning
HPC, HTC
Data search, access
Light speed connection with flexible bandwidth
Security
Notes
Automatically scale out and overflow computations to any available resources
  – CEs capable of scaling out to clouds and cloud services
Data transmission at the speed of light
Security & trust framework: security team?
Review the US solutions
Cloud Federation
Summary from HEPiX (summarized slides)
Storage Benchmarking
Organisation of HEPiX Virtualization WG
5 work areas
  – Image Generation Policy
  – Image Exchange
  – Image Expiry/Revocation
  – Image Contextualisation
  – Multiple Hypervisor Support
Policy for Trusted Image Generation
You recognise that VM base images, VO environments and VM complete images, must be generated according to current best practice, the details of which may be documented elsewhere by the Grid. These include but are not limited to:
  – any image generation tool used must be fully patched and up to date;
  – all operating system security patches must be applied to all images and be up to date;
  – images are assumed to be world-readable and as such must not contain any confidential information;
  – there should be no installed accounts, host/service certificates, ssh keys or user credentials of any form in an image;
  – images must be configured such that they do not prevent Sites from meeting the fine-grained monitoring and control requirements defined in the Grid Security Traceability and Logging policy to allow for security incident response;
  – the image must not prevent Sites from implementing local authorisation and/or policy decisions, e.g. blocking the running of Grid work for a particular user.
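One of the policy items above (no certificates, ssh keys or user credentials in an image) lends itself to a simple automated pre-check. The sketch below is an assumption-laden illustration, not a tool used by the working group: it assumes the image has been unpacked or mounted at a directory, and the filename list is a hypothetical starting point that a real check would need to extend.

```python
import os
from typing import List

# Hypothetical list of filenames that usually indicate embedded credentials.
# A production check would need a much broader set of names and content scans.
FORBIDDEN_NAMES = {
    "id_rsa",
    "id_dsa",
    "authorized_keys",
    "hostcert.pem",
    "hostkey.pem",
}


def find_credential_files(image_root: str) -> List[str]:
    """Return paths (relative to image_root) of files whose names are forbidden."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(image_root):
        for name in filenames:
            if name in FORBIDDEN_NAMES:
                full_path = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full_path, image_root))
    return sorted(hits)
```

An empty result does not show that an image complies with the policy; it only means that none of the listed filenames were found.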
Image Cataloguing and Exchange
Image Contextualisation
Contextualisation is needed so that sites can configure images to interface to local infrastructure
  – e.g. for syslog, monitoring & batch scheduler.
Contextualisation is limited to these needs! Sites may not alter the image contents in any way.
  – Any site concerned about the security aspects of an image should refuse to instantiate it and notify the endorser.
Contextualisation mechanism
  – Images should attempt to mount a CDROM image provided by the site and, if successful, invoke two scripts from the CDROM image:
    prolog.sh before network initialisation
    epilog.sh after network initialisation
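The ordering of the contextualisation mechanism above (prolog.sh, then network initialisation, then epilog.sh) can be sketched as follows. This is a minimal illustration, assuming the site's CDROM image has already been mounted at some directory; the function names, the use of `/bin/sh`, and the `bring_up_network` callback are assumptions for the example, not part of the HEPiX specification.

```python
import os
import subprocess
from typing import Callable, List


def run_context_script(mount_point: str, script_name: str) -> bool:
    """Run a script from the mounted CDROM image; return False if it is absent."""
    path = os.path.join(mount_point, script_name)
    if not os.path.isfile(path):
        return False
    # check=True raises CalledProcessError if the script exits non-zero
    subprocess.run(["/bin/sh", path], check=True)
    return True


def boot_with_contextualisation(mount_point: str,
                                bring_up_network: Callable[[], None]) -> List[str]:
    """Run prolog.sh, initialise the network, then run epilog.sh.

    If the CDROM image could not be mounted (directory missing), the network
    is still initialised and no scripts are run.
    """
    mounted = os.path.isdir(mount_point)
    ran = []
    if mounted and run_context_script(mount_point, "prolog.sh"):
        ran.append("prolog.sh")
    bring_up_network()
    if mounted and run_context_script(mount_point, "epilog.sh"):
        ran.append("epilog.sh")
    return ran
```

In this sketch a failing script aborts the sequence with an exception; whether a real image should continue booting in that case is a policy choice the slide does not address.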
Multiple Hypervisor Support
Andrea
  – Surveyed sites; results show that kvm and Xen dominate as hypervisors, especially in batch virtualisation area.
  – Documented method to produce VM image that can be used with both kvm and Xen
Method tested by Sebastien Goasguen and Abdeslem Djaoui (RAL)
Current Status
Generation policy
  – Clear, but probably needs to be formally approved by JSPG?
Contextualisation & kvm/Xen support
  – Also clear.
Image Cataloguing & Exchange
  – Ideas sound, and working internally at CERN, but we need functioning inter-site exchange!
  – Key issue is lack (to date) of working group member(s) with management support to deliver (and support!) a solution.
  – StratusLab, as Michel will report later this week, has developed similar ideas. Joint intention to explore collaboration, but no opportunity to do so before late November.
Other thoughts
The CernVM filesystem offers an attractive way to ensure sites have the correct VO software, reducing the need for VM images as a mechanism for this.
  – See Ian Collier’s presentation shortly.
  – The CVMFS team has asked Romain Wartel to lead a security audit. This should allay any fears from sites about using CVMFS.
Virtualisation is an area with much scope for communication failures!
  – We must be clear that “image endorsement” is a very rapid process. The person creating an image endorses it, and it is then immediately available for instantiation by all sites that trust the endorser; there is no need for a lengthy process of verification at sites.
  – Some sites talk about restricting instantiated VM images, but the actual impact for end-users is likely small. E.g. VM images would have no need to connect to an NFS-based shared storage area, so it would not matter if “isolated from the rest of the network” just means “no access to our NFS servers”. This appears to be the likely situation at NIKHEF, one of the sites most reluctant to enable instantiation of remotely generated images.
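The "rapid endorsement" model described above means a site's decision reduces to a lookup against its list of trusted endorsers, with no per-image verification. The sketch below illustrates that idea only; the record fields, the `revoked` flag (borrowed from the Image Expiry/Revocation work area) and the function name are hypothetical and do not describe an actual catalogue format.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class ImageRecord:
    # Hypothetical catalogue entry; field names are illustrative only.
    image_id: str
    endorser: str
    revoked: bool = False


def may_instantiate(image: ImageRecord, trusted_endorsers: Set[str]) -> bool:
    """A site accepts an image as soon as its endorser is trusted and it is not revoked."""
    return image.endorser in trusted_endorsers and not image.revoked
```

For example, a site that trusts a single endorser would accept every non-revoked image from that endorser immediately and reject all others, without inspecting the image itself.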
Summary
The working group has made good progress in establishing policies to allow the exchange of VM images…
  … but not such good progress in delivering a distributed catalogue of endorsed images.
CVMFS is probably the neatest solution to the problem of VO software distribution…
  … but VM exchange remains interesting
  – as an option for sites to run hypervisors not OSes and automatically migrate to latest patched system as images instantiate, and
  – if the VM images can contact pilot job frameworks directly, simplifying the scheduling problems at sites.