Published by Ursula Caldwell. Modified over 9 years ago.
Federico Calzolari (1), Silvia Arezzini (2), Alberto Ciampa (2), Enrico Mazzoni (2)
(1) Scuola Normale Superiore, Pisa, Italy
(2) National Institute of Nuclear Physics (INFN), Pisa, Italy
Contact: federico.calzolari@sns.it
CHEP 2009, Prague

Aims: a zero-cost solution to the high availability problem.
Requirements: full exploitation of virtual environment features (start, stop, and move virtual machines between physical hosts); a reliable shared storage infrastructure.
Solution: with virtualization, every service running in a data center can be made redundant by distributing the running virtual machines over only those physical servers that are up and running.

Summary

gridce.sns.it [SNS-Pisa Grid CE] crashes from system overload at 4:00 AM

Scenario
Grid data center
Infrastructure: reliable shared storage
Unified management
Local controller and monitoring service
Installation tool: PXE technology
Availability of all system components

Spin-off: Host on-demand
Host on-demand: basic concepts
Virtualization and a PXE architecture allow a server to be brought up in a few minutes.
Possibility to offer hosts on demand:
CPU: n cores
RAM: n GB
Disk: n TB
Operating system: Linux [several distros], Windows
Middleware and applications
for a time T; at the end of time T, hosts will be erased!

High Availability
System design protocol that ensures a certain degree of operational continuity during a given period.
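The host on-demand offer above (n cores, n GB of RAM, n TB of disk, an operating system, and a lifetime T after which the host is erased) can be pictured as a small lease record. The following is only a sketch: the class names, field names, and example values are hypothetical and do not come from the poster.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HostRequest:
    """Hypothetical request for a temporary host (all names are assumptions)."""
    cpu_cores: int
    ram_gb: int
    disk_tb: float
    operating_system: str  # e.g. a Linux distro or Windows
    lease: timedelta       # the time T after which the host is erased


@dataclass(frozen=True)
class HostLease:
    """A granted request plus the moment the host was brought up."""
    request: HostRequest
    started_at: datetime

    @property
    def expires_at(self) -> datetime:
        return self.started_at + self.request.lease

    def must_be_erased(self, now: datetime) -> bool:
        # At the end of time T the host is erased.
        return now >= self.expires_at


# Illustrative values only.
request = HostRequest(cpu_cores=4, ram_gb=8, disk_tb=1.0,
                      operating_system="Linux", lease=timedelta(hours=48))
lease = HostLease(request, started_at=datetime(2009, 3, 23, 9, 0))
print(lease.expires_at)                                   # 2009-03-25 09:00:00
print(lease.must_be_erased(datetime(2009, 3, 24, 9, 0)))  # False
print(lease.must_be_erased(datetime(2009, 3, 25, 9, 0)))  # True
```

Keeping the expiry as a derived property rather than a stored field means the lease time T stays the single source of truth for when the host is wiped.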
Virtualization
Abstraction of computer resources: an abstraction layer that allows each physical server to run one or more virtual servers, decoupling the operating system and applications from the underlying physical server.

[Figure labels: Classical solution; Virtualized solution; Operation in a real crash example]

Proposal
Relaxed high availability service: a system able to restore any previously running application in less than ten minutes from the crash time.

[Figure labels: Primary server; Secondary server]

Pro & Contra
Zero-cost solution
Server consolidation
Relaxed recovery time [~3 minutes]
Sessions are NOT kept alive

Outcomes
RECOVER a crashed machine in 3 min
REINSTALL a broken machine in 9 min
SNS-PISA is the first EGEE/LCG Grid node that is fully virtualized (services + WN) and highly available
NO downtime after service crash

3 Re-Cycle
Finite state machine with hysteresis:
REBOOT the virtual machine
RESTART the virtual layer
REINSTALL from scratch via PXE

Goals
Relaxed high availability: < 10 min
Backup ONLY at disaster time
Each physical server can back up each virtual machine

3RC High Availability Project

Requirements
Remote redundant controller
Reliable storage:
SAN or NAS via FC or NFS
RAID over network: DRBD

Experimental data
Recovery time distribution: Gaussian, mean 181 s, sigma 10 s
Reinstall time distribution: Gaussian, mean 542 s, sigma 17 s
Non-destructive test: overhead; shutdown
Destructive test: rm /boot; dd 0 on filesystem; reboot
10,000 crash tests
5,000 crash tests

Several redundancy strategies for several availability levels
Virtual machines/disks on external storage ►► problems if software crashes
Scheduled virtual machine dumps (disk, RAM, registers) ►► scheduled dumps: recovery at time T_{n-1}
Virtual machines ready to be mounted ►► virgin machine from disk copy
Install from scratch (operating system and middleware) ►► virgin machine from real installation via PXE
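The reported recovery and reinstall distributions can be checked against the relaxed high availability goal of under 10 minutes. The sketch below takes the stated Gaussian means and sigmas at face value and computes how far each lies below 600 s; the function and variable names are illustrative, not from the poster.

```python
import math


def normal_cdf(x: float, mean: float, sigma: float) -> float:
    """P(X <= x) for X ~ Normal(mean, sigma), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


GOAL_S = 10 * 60  # relaxed high availability goal: under 10 minutes

# (mean, sigma) in seconds, as reported in the experimental data.
measured = {"recover": (181.0, 10.0), "reinstall": (542.0, 17.0)}

for name, (mean, sigma) in measured.items():
    margin = (GOAL_S - mean) / sigma            # distance to the goal in sigmas
    p_within = normal_cdf(GOAL_S, mean, sigma)  # probability of meeting the goal
    print(f"{name}: {margin:.1f} sigma below the goal, "
          f"P(t < 600 s) = {p_within:.4f}")
```

If the Gaussian model holds in the tails, recovery sits about 42 sigma below the goal, while a full reinstall sits about 3.4 sigma below it, which corresponds to a probability of roughly 0.9997 of finishing within ten minutes.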
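The Re-Cycle finite state machine (REBOOT the virtual machine, RESTART the virtual layer, REINSTALL from scratch) can be read as an escalation ladder. The poster does not say how its hysteresis works, so the sketch below assumes one plausible reading: escalate after a fixed number of consecutive failed health checks, and fall back to the mildest action only after a sustained run of healthy checks. The class name, method name, and both thresholds are assumptions.

```python
from enum import IntEnum
from typing import Optional


class Action(IntEnum):
    REBOOT = 1     # reboot the virtual machine
    RESTART = 2    # restart the virtual layer
    REINSTALL = 3  # reinstall from scratch via PXE


class ReCycle:
    """Escalation ladder with simple hysteresis (thresholds are assumptions)."""

    def __init__(self, failures_to_escalate: int = 2,
                 successes_to_reset: int = 3) -> None:
        self.failures_to_escalate = failures_to_escalate
        self.successes_to_reset = successes_to_reset
        self.level = Action.REBOOT
        self._failures = 0
        self._successes = 0

    def observe(self, healthy: bool) -> Optional[Action]:
        """Feed one health-check result; return the action to take, if any."""
        if healthy:
            self._successes += 1
            self._failures = 0
            # Hysteresis: return to the mildest action only after a
            # sustained run of healthy checks, not on the first one.
            if self._successes >= self.successes_to_reset:
                self.level = Action.REBOOT
            return None
        self._successes = 0
        self._failures += 1
        action = self.level
        # Escalate once the current action has failed often enough.
        if (self._failures >= self.failures_to_escalate
                and self.level < Action.REINSTALL):
            self.level = Action(self.level + 1)
            self._failures = 0
        return action


fsm = ReCycle()
history = [fsm.observe(h) for h in (False, False, True, False)]
print([a.name if a else None for a in history])
# ['REBOOT', 'REBOOT', None, 'RESTART']
```

In the usage example a single healthy check between failures does not reset the ladder, so the next failure is handled with RESTART rather than starting again from REBOOT.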