Operational Considerations From Running Grid Services on Cloud Resources
C. Loomis (CNRS/LAL) and V. Floros (GRNET)
EGI-TF (Amsterdam), 14 September 2010
Contents
StratusLab Project
  - Project Facts
  - Vision and Benefits
  - Administrator Survey Results
Operational Considerations
  - Dynamic Deployment
  - Performance Considerations
  - Accounting
  - Trust
Conclusions
Project Facts
Facts
  - Started: 1 June 2010
  - Duration: 24 months
  - Partners: 6 from 5 countries
Goal
  - Provide coherent, open-source private cloud distribution
Contacts
  - Website: http://stratuslab.eu/
  - Twitter: StratusLab
Partners
  - CNRS (FR), UCM (ES), GRNET (GR), SIXSQ (CH), TID (ES), TCD (IE)
Grid Services Over Cloud Resources
[Diagram. Labels: users; Grid Resource Center; Grid Services; Cloud API; StratusLab Distribution; Private Cloud]
Benefits
Ease of Deployment
  - Prepared virtual machine images for grid services
  - Decouple OS on physical systems from grid requirements
Increased Reliability/Robustness
  - Migration for load balancing or to avoid hardware failures
  - Better isolate user jobs/machines from others
Customized Environments
  - Responsibility for environments rests with VOs
  - Flexible environments appeal to a more diverse scientific community
Administrator Survey Results
Surveys for Users and Administrators
  - Collect use cases and requirements from target communities
  - Analysis: D2.1 (http://stratuslab.eu/doku.php?id=deliverables)
Important Feedback
  - 33% of users are regular cloud users
  - 68% of administrators will deploy virtualization/cloud technologies
  - Half have already deployed; remainder within 1 year
  - Administrators are reluctant to trust user-generated VMs
Early Results
Complete Grid Sites at GRNET
  - Two StratusLab clouds deployed: Ubuntu 10.04 and CentOS 5.5
  - Functional (pre-production) grid site on each cloud
Worker Node Tests at LAL
  - Worker nodes deployed via Quattor in StratusLab cloud
  - Run in production for >1 month without problems
Appliance Repository at TCD
  - Based on standard Apache web server
  - Contains stock images for supported operating systems
  - Will contain grid service images (“appliances”)
Dynamic Deployment
Easier Grid Deployment
  - Grid service appliances → easier, wider deployment
  - VO service and tool appliances → lower barrier to creating VOs
  - Elastic site resources → site capacity changes significantly and rapidly
Streamlined Procedures and Policies
  - Larger load from more, smaller, and more diverse sites
  - Site validation procedures → more sites with faster startup
  - VO procedures → adapted for more dynamic, temporary VOs
  - Grid administrator may not have physical control of machines
Dynamic Grid Services
  - Info. system, scheduling policies adapt to dynamic environment
  - Cloud will help with broader, more systematic middleware testing
Performance Considerations
Maximize Efficiency
  - Pack more jobs/machines onto a physical machine
  - Energy savings through shutdown of unused machines
  - Minimize performance losses (esp. IO, communication)
Better Knowledge of “Task” Requirements
  - Can easily enforce memory, disk, CPU, bandwidth limits
  - Efficient packing requires this information at the site
  - Users/system must systematically transmit this information
  - Data access information critical for using hybrid cloud infrastructure
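To illustrate why efficient packing depends on knowing task requirements at the site, here is a minimal sketch of a first-fit decreasing placement. It is not the StratusLab scheduler: the `Task` and `Host` classes, their fields, and the choice of first-fit decreasing are all assumptions made for illustration. Only declared CPU and memory are considered; disk, bandwidth, and data locality are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A job or virtual machine with declared requirements (hypothetical fields)."""
    name: str
    cpus: int
    memory_gb: float


@dataclass
class Host:
    """A physical machine with fixed capacity (hypothetical fields)."""
    name: str
    cpus: int
    memory_gb: float
    tasks: list = field(default_factory=list)

    def fits(self, task):
        # A task fits only if both CPU and memory stay within capacity.
        used_cpus = sum(t.cpus for t in self.tasks)
        used_mem = sum(t.memory_gb for t in self.tasks)
        return (used_cpus + task.cpus <= self.cpus
                and used_mem + task.memory_gb <= self.memory_gb)


def pack(tasks, hosts):
    """First-fit decreasing: place the largest tasks first, each on the
    first host with room. Returns the tasks that could not be placed."""
    unplaced = []
    for task in sorted(tasks, key=lambda t: (t.memory_gb, t.cpus), reverse=True):
        for host in hosts:
            if host.fits(task):
                host.tasks.append(task)
                break
        else:
            unplaced.append(task)
    return unplaced


def idle_hosts(hosts):
    """Hosts left empty after packing: candidates for shutdown."""
    return [h.name for h in hosts if not h.tasks]
```

With three 8-core, 16 GB hosts and four small tasks, `pack` fills the first host before touching the second, and `idle_hosts` then lists the third as a shutdown candidate. Without the declared `cpus` and `memory_gb` values the site could not make this decision, which is the point of the slide.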
Accounting
Short Term Integration: Easy
  - Running grid services over cloud
  - Simple accounting integration, just use existing mechanisms
  - May want to add information about execution environment
Longer Term Integration: More Challenging
  - Users deploying their own machines and services
  - CPU accounting fairly simple, must avoid double counting
  - New types of resources: bandwidth, storage, VM types
  - Incremental reporting for long-running services
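One way to picture incremental reporting for a long-running service is to publish only the usage accrued since the previous report, so that summing the records never counts the same CPU time twice. The sketch below assumes the hypervisor exposes a cumulative CPU counter per virtual machine; the `UsageRecord` layout and the `IncrementalReporter` class are hypothetical and do not correspond to any existing grid accounting record format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One accounting record covering a single reporting interval (hypothetical layout)."""
    vm_id: str
    start: int          # start of the interval, epoch seconds
    end: int            # end of the interval, epoch seconds
    cpu_seconds: float  # CPU time consumed within the interval only


class IncrementalReporter:
    """Turns cumulative CPU counters into non-overlapping interval records."""

    def __init__(self):
        # vm_id -> (time of last report, cumulative CPU seconds at that time)
        self._last = {}

    def report(self, vm_id, now, cumulative_cpu_seconds, started_at):
        # The first report for a VM covers the period since it started.
        last_time, last_cpu = self._last.get(vm_id, (started_at, 0.0))
        record = UsageRecord(vm_id, last_time, now,
                             cumulative_cpu_seconds - last_cpu)
        self._last[vm_id] = (now, cumulative_cpu_seconds)
        return record
```

Because each record starts where the previous one ended, the accounting system can sum records for a VM at any time and obtain its usage so far, without waiting for the service to terminate.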
Trust
Reluctance to Run User-Defined Images
  - Realizing full potential of cloud infrastructures will require building (more) trust between users, VOs, and system administrators
  - Trust will vary with actors involved → sliding scale that balances level of trust with allowed capabilities
Requirements
  - Define security requirements for images (HEPiX)
  - Matchmaking: appliance metadata vs. site requirements
  - Enhanced monitoring (e.g. ports and connections)
  - Additional trust between firewall “controllers” and site administrators
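The matchmaking requirement can be sketched as a check of appliance metadata against a site policy. Everything in this sketch is assumed for illustration: the metadata keys (`os`, `endorser`, `open_ports`), the policy keys, and the rule set are hypothetical and are not taken from the StratusLab appliance repository or from the HEPiX requirements.

```python
def check_appliance(metadata, policy):
    """Return a list of reasons why the appliance violates the site policy.

    An empty list means the appliance is acceptable to this site.
    Both arguments are plain dictionaries with hypothetical keys.
    """
    problems = []

    # The image must run an operating system the site supports.
    if metadata.get("os") not in policy["allowed_os"]:
        problems.append(f"operating system {metadata.get('os')!r} not allowed")

    # The image must be endorsed by someone the site trusts.
    if metadata.get("endorser") not in policy["trusted_endorsers"]:
        problems.append(f"endorser {metadata.get('endorser')!r} not trusted")

    # The image may only request ports the site firewall permits.
    extra_ports = set(metadata.get("open_ports", [])) - set(policy["allowed_ports"])
    if extra_ports:
        problems.append(f"ports not permitted: {sorted(extra_ports)}")

    return problems
```

A site that trusts a given VO more could simply apply a more permissive `policy` dictionary to that VO's appliances, which is one way to realize the sliding scale between trust and allowed capabilities.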
Conclusions
Combining Grid and Cloud Technologies
  - Easier deployment, maintenance of sites for administrators
  - More flexible, capable, and dynamic infrastructure for users
Operational Considerations
  - Dynamic infrastructure → streamlined procedures/policies
  - Better performance → better knowledge of resource requirements
  - Accounting → expanded to include new types of resources
  - User-defined images → more trust, more metadata, policies, …