High Availability for OPNFV May. 2015
Agenda
- Project introduction
- Scenarios discussion
- Requirement doc
- Gap analysis
- Blueprints
- Open questions: Storage HA
Project Introduction Fu Qiao
Project Detail
Project page: https://wiki.opnfv.org/high_availability_for_opnfv
Weekly meeting: Wednesday, 13:00-14:00 UTC
https://wiki.opnfv.org/high_availability_project_meetings
Mailing list: opnfv-tech-discussion [availability]
Participants:
- Hui Deng [denghui@chinamobile.com]
- Jolliffe, Ian [ian.jolliffe@windriver.com]
- Maria Toeroe [maria.toeroe@ericsson.com]
- Qiao Fu [fuqiao@chinamobile.com]
- Xue Yifei [xueyifei@huawei.com]
- Yuan Yue [yuan.yue@zte.com.cn]
- Yao Cheng LIANG [ycliang@astri.org]
- Sean Winn [Sean.Winn@emc.com]
- Joe Huang [joehuang@huawei.com]
- Georg Kunz [georg.kunz@ericsson.com]
- Basavaprabhu Badami [basavaprabhu.badami@velankani.com]
- ….
Project progress
The ongoing work on the first OPNFV release already includes some HA schemes, e.g. OpenStack HA and active/active or active/passive operation of RabbitMQ and MySQL, which are described in section 5 of the requirement doc. In this project, we further discuss the scenarios, framework, detailed requirements, and API definition of HA in the OPNFV platform.
Project outputs:
- Service HA scenario analysis
- Requirement document: https://etherpad.opnfv.org/p/High_Availabiltiy_Requirement_for_OPNFV
- Gap analysis of the OpenStack HA scheme: https://etherpad.opnfv.org/p/ha_gap_analysis
- Blueprints: https://etherpad.opnfv.org/p/Blue_Print_From_HA_project and https://blueprints.launchpad.net/keystone/+spec/keystone-ha-multisite
- HA API description
Scenarios Discussion Fu Qiao
Service Availability Levels for Carrier Grade VNFs

SAL 1
  Recovery time: e.g. 5-6 seconds
  Customer type: Network Operator Control Traffic; Government/Regulatory Emergency Services
  Recommendations: Redundant resources to be made available on-site to ensure fast recovery.

SAL 2
  Recovery time: e.g. 10-15 seconds
  Customer type: Enterprise and/or large scale Customers; Network Operators service traffic
  Recommendations: Redundant resources to be available as a mix of on-site and off-site as appropriate. On-site resources to be utilized for recovery of real-time services. Off-site resources to be utilized for recovery of data services.

SAL 3
  Recovery time: e.g. 20-25 seconds
  Customer type: General Consumer Public and ISP Traffic
  Recommendations: Redundant resources to be mostly available off-site. Real-time services should be recovered before data services.

Source: ETSI GS NFV-REL 001 V1.1.1
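The recovery-time column can be read as a per-level time budget. The following is a minimal sketch under that assumption: it uses the "e.g." ranges from the table as illustrative values (they are not normative limits), and the names `SAL_RECOVERY_SECONDS`, `meets_sal` and `strictest_sal_met` are hypothetical.

```python
from typing import Optional

# Example recovery-time ranges in seconds per Service Availability Level.
# These are the "e.g." values from the table above, not normative limits.
SAL_RECOVERY_SECONDS = {
    1: (5, 6),
    2: (10, 15),
    3: (20, 25),
}


def meets_sal(level: int, measured_recovery_s: float) -> bool:
    """Return True if the measured recovery time is within the upper example bound."""
    _, upper = SAL_RECOVERY_SECONDS[level]
    return measured_recovery_s <= upper


def strictest_sal_met(measured_recovery_s: float) -> Optional[int]:
    """Return the most demanding SAL (lowest number) that is satisfied, or None."""
    for level in sorted(SAL_RECOVERY_SECONDS):
        if meets_sal(level, measured_recovery_s):
            return level
    return None
```

With these example values, a service that recovers in 12 seconds would satisfy SAL 2 and SAL 3 but not SAL 1.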
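Scenarios

VNF state   Redundancy in VNF   Failure detection   Use case
Stateful    yes                 VNF only            UC1
Stateful    yes                 VNF & NFVI          UC2
Stateful    no                  VNF only            UC3
Stateful    no                  VNF & NFVI          UC4
Stateless   yes                 VNF only            UC5
Stateless   yes                 VNF & NFVI          UC6
Stateless   no                  VNF only            UC7
Stateless   no                  VNF & NFVI          UC8

UC9: Repeated failure in VNF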
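The scenario table is a lookup on three yes/no properties. A small sketch of that lookup follows; the `Scenario` type and the `classify` function are illustrative names, and UC9 (repeated failure) is left out because it is not defined by these three properties.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    stateful: bool       # True: stateful VNF, False: stateless VNF
    redundant: bool      # True: redundancy inside the VNF
    nfvi_detects: bool   # False: detection by VNF only, True: by VNF & NFVI


USE_CASES = {
    Scenario(True, True, False): "UC1",
    Scenario(True, True, True): "UC2",
    Scenario(True, False, False): "UC3",
    Scenario(True, False, True): "UC4",
    Scenario(False, True, False): "UC5",
    Scenario(False, True, True): "UC6",
    Scenario(False, False, False): "UC7",
    Scenario(False, False, True): "UC8",
}


def classify(stateful: bool, redundant: bool, nfvi_detects: bool) -> str:
    """Map the three scenario properties to the use case number in the table."""
    return USE_CASES[Scenario(stateful, redundant, nfvi_detects)]
```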
UC1: Stateful VNF with Redundancy
2014-12-12
Failure detection: VNF only
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF isolates VNFC
5. VNF fails over
6. NF recovers
7. VNF repairs VNFC
Nothing new in this scenario.
*Steps 1 & 2 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through active (ACT) and standby (STB) VNFCs, each running on a VM in the NFVI; a "Recovery time" span is marked next to the steps.]
UC2: Stateful VNF with Redundancy
Failure detection: VNF & NFVI
1. VM fails
2. VM Service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure
6a. VNF fails over
7a. NF recovers
5b. NFVI detects the failure
6b. NFVI reports to VIM
7b. VIM reports to VNFM
8. VNFM ok to VIM
9. VIM repairs VM
10. VM Service recovers
11. VNF repairs VNFC
*Steps 1-4 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through active (ACT) and standby (STB) VNFCs, each running on a VM in the NFVI; a "Recovery time" span is marked next to the steps.]
UC3: Stateful VNF with No Redundancy
Failure detection: VNF only
VNFC checkpoints its state to VD, which is HA.
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF isolates VNFC
5. VNF repairs VNFC
6. VNFC gets state
7. NF recovers
*Steps 1 & 2 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through an active (ACT) VNFC whose state is held on a VD; VMs and the VD sit in the NFVI; a "Recovery time" span is marked next to the steps.]
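The checkpoint and restore steps of UC3 can be sketched as follows. This is an illustration only: it assumes "VD" stands for a virtual disk that survives the VNFC failure, and the classes `VirtualDisk` and `Vnfc` as well as the function `repair` are hypothetical names, not part of any OPNFV API.

```python
class VirtualDisk:
    """Stand-in for the HA virtual disk (VD) that holds the checkpoints."""

    def __init__(self):
        self._checkpoints = {}

    def save(self, vnfc_id, state):
        self._checkpoints[vnfc_id] = dict(state)

    def load(self, vnfc_id):
        return dict(self._checkpoints.get(vnfc_id, {}))


class Vnfc:
    def __init__(self, vnfc_id, disk):
        self.vnfc_id = vnfc_id
        self.disk = disk
        self.state = {}
        self.alive = True

    def handle(self, key, value):
        """Update the in-memory state and checkpoint it to the VD."""
        self.state[key] = value
        self.disk.save(self.vnfc_id, self.state)

    def fail(self):
        """Simulate a VNFC failure: the in-memory state is lost."""
        self.alive = False
        self.state = {}


def repair(failed, disk):
    """Start a replacement VNFC and fetch the last checkpoint from the VD."""
    replacement = Vnfc(failed.vnfc_id, disk)
    replacement.state = disk.load(failed.vnfc_id)
    return replacement
```

Checkpointing after every update keeps the sketch simple; a real VNFC would more likely batch or throttle checkpoints, which trades recovery point for throughput.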
UC4: Stateful VNF with No Redundancy
Failure detection: VNF & NFVI
VNFC checkpoints its state to VD, which is HA.
1. VM fails
2. VM Service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure
6a. VNF reports to VNFM
5b. NFVI detects the failure
6b. NFVI reports to VIM
7. VIM reports to VNFM
8. VNFM ok to VIM
9. VIM repairs VM
10. VM Service recovers
11. VIM informs VNFM
12. VNFM repairs VNFC
13. VNFC gets state
14. NF recovers
*Steps 1-4 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through an active (ACT) VNFC whose state is held on a VD; VMs and the VD sit in the NFVI; a "Recovery time" span is marked next to the steps.]
High Availability Flow Chart
A service failure happens (it may be caused by a failure of the VNF or of the NFVI).
Step 1 - Service recovery (time constraint: a carrier-grade VNF should be recovered within seconds):
- Failure detection (by service heartbeat loss or NFVI report of failure)
- Service failover
The chart marks this phase as the recovery time, during which the service is unavailable.
After failover, the flow branches:
- VNF failure only: VNFC repair/restart
- NFVI failure: VM recovery and VNFC recovery
- Repeated failure (branch label in the chart)
Step 2 - NFVI recovery or repair (less time constraint)
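One possible reading of the flow chart, written as a function that returns the ordered actions for a single failure. The treatment of a repeated VNF failure as an NFVI-level problem is an assumption taken from the UC9 slides, and `FailureSource` and `recovery_plan` are illustrative names.

```python
from enum import Enum
from typing import List


class FailureSource(Enum):
    VNF = "VNF failure only"
    NFVI = "NFVI failure"


def recovery_plan(source: FailureSource, repeated: bool = False) -> List[str]:
    """Return the ordered actions for one failure, following the flow chart."""
    # Time-constrained part: detect the failure and restore the service first.
    plan = ["failure detection", "service failover"]
    # Assumption: a repeated VNF failure is handled like an NFVI failure.
    if source is FailureSource.NFVI or repeated:
        plan.append("VM recovery and VNFC recovery")
    else:
        plan.append("VNFC repair/restart")
    return plan
```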
Requirement doc & Gap Analysis doc. Ian Jolliffe
Requirement Doc. Details
Framework
- The ultimate goal is to provide high availability for the upper-layer service.
- Service high availability is provided by recovering the service (including service restart and failover) within seconds, according to the SAL.
- Repair or recovery of the failed layer should happen afterwards.
- Ensure that a failure in one layer does not cause cascading failures in other layers.
- A single layer can detect failures in other layers and help recover failed components.
Layers (top to bottom):
- Service layer: Service
- Application/VM layer: VNF/VNFC, VNFM
- NFVI/VIM layer: NFVI, VIM
- Hardware layer: Hardware
Requirement doc. outline
1 Overall Principle for High Availability in NFV
  1.1 Framework for High Availability in NFV
  1.2 Definitions
  1.3 Overall requirements
  1.4 Time requirement
2 Hardware HA
3 Virtualization Facilities (Host OS, Hypervisor)
4 Virtual Infrastructure HA - Requirements
  4.1 Virtual Compute
  4.2 Virtual Storage
  4.3 Virtual Network
5 VIM High Availability
  5.1 Architecture requirement of VIM HA
  5.2 Fault detection and alarm requirement of VIM
  5.3 HA mechanism of VIM provided for NFV
  5.4 SDN controller
6 VNF HA
  6.1 Service Availability
  6.2 Service Continuity
7 Storage
Gap Analysis
14+ HA-related gaps have been discovered:
- Nova: 6 gaps covering scheduler, consoleauth and health status of compute node
- Neutron: 2 gaps covering L3 agent and DHCP agent
- Cinder: 2 gaps covering HA configuration and multi-attachment
- VIM NBI: 1 gap for error reporting
- QoS: 1 gap for QoS management
References:
- https://etherpad.openstack.org/p/kilo-crossproject-ha-integration
- https://etherpad.openstack.org/p/kilo-summit-ops-ha
- https://blueprints.launchpad.net/openstack
Blue Prints Joe Huang
Escape from site-level Keystone failure
Today: only one Keystone server can be configured for token validation or for retrieving the revoke list.
[Diagram: API request (Nova/Cinder/Neutron…) -> OpenStack service (Nova/Cinder/Neutron…) with Keystone middleware -> validate token (Fernet, UUID) or retrieve revoke list (PKI) -> Keystone]
Proposal: allow a secondary Keystone server to be configured in case of a site-level Keystone failure.
[Diagram: the same flow, with the Keystone middleware able to reach Site1 Keystone and Site2 Keystone]
Cons of DNS-based load balancing: delayed failover due to caching issues, and unpredictable routing.
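The proposed behaviour amounts to trying an ordered list of Keystone servers instead of a single one. The sketch below shows only that control flow. It does not use the real Keystone middleware API: each validator is a hypothetical callable standing in for "validate this token against one site", and `KeystoneUnavailable` and `validate_with_fallback` are illustrative names.

```python
from typing import Any, Callable, Sequence


class KeystoneUnavailable(Exception):
    """Raised by a validator when its Keystone server cannot be reached."""


def validate_with_fallback(
    token: str, validators: Sequence[Callable[[str], Any]]
) -> Any:
    """Try each site's validator in order and return the first answer.

    A validator that raises KeystoneUnavailable is skipped; any other
    outcome (including "token invalid") is returned as the final answer.
    """
    last_error = None
    for validate in validators:
        try:
            return validate(token)
        except KeystoneUnavailable as exc:
            last_error = exc
    raise KeystoneUnavailable("no Keystone server reachable") from last_error
```

Only an unreachable server triggers the fallback here. A token rejected by Site1 is not retried at Site2, since a rejection is a valid answer rather than a site failure.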
Open Questions: Storage HA Georg Kunz
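Storage Architecture
[Diagram, by layer:
- VNF/VNFC: storage service component offering file, (object) storage
- NFVI, with VIM: storage service component; block, file, object storage offered on top of distributed storage and on top of a storage array
- Hardware: hosts (distributed storage), storage array with controllers Ctrl1 and Ctrl2, switches]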
Storage HA - Network Failure
- Storage network link fails
- Storage network detects failure
- Storage network switches to standby link(s)
  - iSCSI multi-pathing
  - bonding
- Report failure to O&M
[Diagram: block, file, object storage; VIM; distributed storage on hosts; storage array with Ctrl1 and Ctrl2; switches]
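The link failover sequence above can be sketched as a device with several paths, where I/O uses the first healthy path and every failure is reported. This is a simplified model rather than how iSCSI multi-pathing or bonding is configured in practice; `MultipathDevice` and its methods are hypothetical names, and the `reports` list stands in for the O&M interface.

```python
class MultipathDevice:
    """Storage device reachable over several network paths."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.healthy = {path: True for path in self.paths}
        self.reports = []  # stand-in for failure reports sent to O&M

    def mark_failed(self, path):
        """Record a detected link failure and report it."""
        self.healthy[path] = False
        self.reports.append("link failure on " + path)

    def active_path(self):
        """Return the first healthy path, i.e. switch to a standby if needed."""
        for path in self.paths:
            if self.healthy[path]:
                return path
        raise RuntimeError("all storage paths failed")
```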
Storage HA - Failure in Storage Array
- Component within storage array fails
- Array-internal fail-over kicks in
  - RAID
  - Redundant controllers, NICs, …
- Report failure to O&M
[Diagram: block, file, object storage; VIM; distributed storage on hosts; storage array with Ctrl1 and Ctrl2; switches]
Storage HA - Host Failure
- Storage host fails
- Distributed storage layer detects failure
- Distributed storage layer rebalances data
[Diagram: block, file, object storage; VIM; distributed storage on hosts; storage array with Ctrl1 and Ctrl2; switches]
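Rebalancing after a host failure means re-creating the replicas that lived on the failed host on the surviving hosts. A toy sketch follows, assuming a fixed replica count and a simple "first survivor that does not hold the object yet" placement rule. Real distributed storage systems use more elaborate placement, and the function name `rebalance` and its parameters are illustrative.

```python
def rebalance(placement, failed_host, hosts, replicas=2):
    """Return a new placement with replicas from failed_host re-created elsewhere.

    placement maps an object name to the list of hosts that hold a replica.
    """
    survivors = [host for host in hosts if host != failed_host]
    new_placement = {}
    for obj, holders in placement.items():
        kept = [host for host in holders if host != failed_host]
        for host in survivors:
            if len(kept) >= replicas:
                break
            if host not in kept:
                kept.append(host)
        new_placement[obj] = kept
    return new_placement
```

The copy traffic generated by this step is what the later open question on rebuild impact refers to: the more data the failed host held, the longer the surviving hosts spend re-creating replicas.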
Non-HA Block Storage (legacy)
- Mirroring of block devices on VNF level
[Diagram: VNF with VNFC (active) and VNFC (passive), mirroring between them, on top of the NFVI]
HA Block Storage: Active/passive configuration
- Failover supervised by clustering software in VNF
- Requires multi-attach capability of Cinder
[Diagram labels: VNF, VNFM, VNFC (active), VNFC (standby), VNFC (active), NFVI, VIM]
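The failover supervision done by the clustering software can be illustrated with a heartbeat timeout. In this sketch the time is passed in explicitly so that the behaviour is deterministic, the class `ActivePassivePair` is a hypothetical name, and the volume is assumed to be attached to both VNFCs already, which is why the slide lists multi-attach as a requirement.

```python
class ActivePassivePair:
    """Two VNFCs attached to the same volume; only the active one uses it."""

    def __init__(self, active, standby, timeout_s):
        self.active = active
        self.standby = standby
        self.timeout_s = timeout_s
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Record a heartbeat received from the active VNFC."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote the standby if the active VNFC has been silent too long."""
        if now - self.last_heartbeat > self.timeout_s:
            self.active, self.standby = self.standby, self.active
            self.last_heartbeat = now
            return True
        return False
```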
HA Block Storage: Active/active configuration
- Clustered file system enables concurrent access
- Requires multi-attach capability of Cinder
[Diagram labels: VNF, VNFM, VNFC (active), VNFC (active), NFVI, VIM]
VNF level HA for Multiple Backends
- Block devices provided by multiple backends
- Mirroring of block devices on VNF level
- Pro-active failover possible
[Diagram labels: VNF, VNFM, VNFC (active), VNFC (passive), mirroring, NFVI, VIM, backend 1, backend 2]
Open Questions
- Can the NFVI storage system provide a sufficient level of HA to meet the SALs?
- Failover/recovery times heavily depend on the deployed solution
- How much does a rebuild of data impact performance?
File Storage
- Legacy deployments: file storage service provided by VNFC, layered on top of block storage services
- File storage service provided by NFVI / hardware: OpenStack Manila
[Diagram labels: NFVI (shown for each of the two options)]
Ephemeral Storage
Main use: file systems of VMs booted from image
Location:
- On local disks of compute host
  - Isolation of failover domains
  - VM unaffected by failure of storage system
  - Disk failure corresponds to host failure
  - Limits live migration capabilities
- On distributed or external storage
  - Correlated failures possible
  - Failure of storage backend impacts VMs
  - Properties of respective storage backend apply
Appendix
UC3-b: Stateful VNF with No Redundancy
Failure detection: VNF only
VNFC checkpoints its state to VD, which is HA.
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF reports to VNFM
5. VNFM isolates VNFC
6. VNFM repairs VNFC
7. VNFC gets state
8. NF recovers
*Steps 1 & 2 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through an active (ACT) VNFC whose state is held on a VD; VMs and the VD sit in the NFVI; a "Recovery time" span is marked next to the steps.]
UC5: Stateless VNF with Redundancy
Failure detection: VNF only
Spare VNFC may or may not be instantiated.
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF isolates VNFC
5. VNF fails over
6. NF recovers
7. VNF restores redundancy
Nothing new in this scenario.
*Steps 1 & 2 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through active (ACT) and spare VNFCs, each running on a VM in the NFVI; a "Recovery time" span is marked next to the steps.]
UC6: Stateless VNF with Redundancy
Failure detection: VNF & NFVI
Spare VNFC may or may not be instantiated.
1. VM fails
2. VM Service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure
6a. VNF fails over
7a. NF recovers
5b. NFVI detects the failure
6b. NFVI reports to VIM
7b. VIM reports to VNFM
8. VNFM ok to VIM
9. VIM repairs VM
10. VM Service recovers
11. VNF restores redundancy
*Steps 1-4 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through active (ACT) and spare VNFCs, each running on a VM in the NFVI; a "Recovery time" span is marked next to the steps.]
UC7: Stateless VNF with No Redundancy
Failure detection: VNF only
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF reports to VNFM
5. VNF isolates VNFC
6. VNF repairs VNFC
7. NF recovers
*Steps 1 & 2 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through an active (ACT) VNFC; VMs and a VD sit in the NFVI; a "Recovery time" span is marked next to the steps.]
UC8: Stateless VNF with No Redundancy
Failure detection: VNF & NFVI
1. VM fails
2. VM Service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure
6a. VNF reports to VNFM
5b. NFVI detects the failure
6b. NFVI reports to VIM
7. VIM reports to VNFM
8. VNFM ok to VIM
9. VIM repairs VM
10. VM Service recovers
11. VIM informs VNFM
12. VNF repairs VNFC
13. NF recovers
*Steps 1-4 are simultaneous; they are separated for clarity.
[Diagram: NFVO, VNFM and VIM; the VNF provides the NF through an active (ACT) VNFC; VMs and a VD sit in the NFVI; a "Recovery time" span is marked next to the steps.]
UC9: Stateless VNF with No Redundancy
Failure detection: VNF only, but repeatedly
1. VNFC fails
2. NF fails
3. VNF detects the failure and counts
4. VNF isolates VNFC
5. VNF repairs VNFC
6. NF recovers
…. VNFC fails…. 2
…. VNFC fails…. 3
…. VNFC fails…. 4
Fault is not in the VNFC!
[Diagram: NFVO, VNFM and VIM; the step sequence is labelled UC7; the active (ACT) VNFC is shown four times with failure counts 1 to 4; VMs and a VD sit in the NFVI.]
UC9: Stateless VNF with No Redundancy (continued)
Failure detection: VNF only, but repeatedly
1. VNFC fails
2. NF fails
3. VNF detects the failure and counts
4. VNF isolates VNFC
5. VNF repairs VNFC
6. NF recovers
…. VNFC fails…. 2
…. VNFC fails…. 3
…. VNFC fails…. 4
N. VNF reports to VNFM
N+1. VNFM reports to VIM
N+2. VIM isolates VM
N+3. VIM repairs VM
N+4. VM Service recovers
N+5. VNF repairs VNFC
N+6. NF recovers
[Diagram: NFVO, VNFM and VIM; the active (ACT) VNFC with failure count 4; VMs and a VD sit in the NFVI.]
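The "detects the failure and counts" step of UC9 can be sketched as a counter with an escalation threshold. The threshold value is an assumption: the slides show four failures before the VNF reports to the VNFM but do not state a rule. `RepeatedFailureMonitor` is a hypothetical name.

```python
class RepeatedFailureMonitor:
    """Counts VNFC failures and escalates once a threshold is reached."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def on_vnfc_failure(self):
        """Return the action the VNF takes for this failure."""
        self.count += 1
        if self.count >= self.threshold:
            # The fault is probably below the VNFC, so hand over to the VNFM.
            self.count = 0
            return "report to VNFM"
        return "repair VNFC"
```

Resetting the counter after escalation is a design choice of this sketch; a time window in which failures are counted would be another reasonable option.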
Scenario chart
- Scenarios 1, 2, 5, 6
- Add all the scenarios as an appendix
- Should the NFVI provide an HA API to the VNF?
- OpenSAF is a PaaS; it actually acts as HA middleware
- Stateful and stateless VNFs may require different schemes in the NFVI. If the VNF has no redundancy, we may need VM redundancy; in this case the VNF problem may still not be solved.