High Availability for OPNFV


High Availability for OPNFV, May 2015

Agenda
- Project introduction
- Scenarios discussion
- Requirement doc
- Gap analysis
- Blue prints
- Open questions: storage HA

Project Introduction Fu Qiao

Project Detail
Project page: https://wiki.opnfv.org/high_availability_for_opnfv
Weekly meeting: Wednesday, 13:00-14:00 UTC, https://wiki.opnfv.org/high_availability_project_meetings
Mailing list: opnfv-tech-discussion, [availability] tag
Participants: Hui Deng [denghui@chinamobile.com], Ian Jolliffe [ian.jolliffe@windriver.com], Maria Toeroe [maria.toeroe@ericsson.com], Qiao Fu [fuqiao@chinamobile.com], Xue Yifei [xueyifei@huawei.com], Yuan Yue [yuan.yue@zte.com.cn], Yao Cheng Liang [ycliang@astri.org], Sean Winn [Sean.Winn@emc.com], Joe Huang [joehuang@huawei.com], Georg Kunz [georg.kunz@ericsson.com], Basavaprabhu Badami [basavaprabhu.badami@velankani.com], ...

Project progress
The ongoing work for the 1st release of OPNFV already includes some HA schemes, e.g. OpenStack HA and active/active or active/passive configuration of RabbitMQ and MySQL, which are described in section 5 of the requirement doc. In this project, we further discuss the scenarios, the framework, and the detailed requirements and API definitions of HA in the OPNFV platform.
Project outputs:
- Service HA scenario analysis
- Requirement document: https://etherpad.opnfv.org/p/High_Availabiltiy_Requirement_for_OPNFV
- Gap analysis of the OpenStack HA scheme: https://etherpad.opnfv.org/p/ha_gap_analysis
- Blueprints: https://etherpad.opnfv.org/p/Blue_Print_From_HA_project and https://blueprints.launchpad.net/keystone/+spec/keystone-ha-multisite
- HA API description

Scenarios Discussion Fu Qiao

Service Availability Levels for Carrier-Grade VNFs (source: ETSI GS NFV-REL 001 V1.1.1)
SAL 1: recovery time e.g. 5-6 seconds. Customer type: network operator control traffic; government/regulatory emergency services. Recommendation: redundant resources to be made available on-site to ensure fast recovery.
SAL 2: recovery time e.g. 10-15 seconds. Customer type: enterprise and/or large-scale customers; network operators' service traffic. Recommendation: redundant resources to be available as a mix of on-site and off-site as appropriate; on-site resources to be utilized for recovery of real-time services, off-site resources for recovery of data services.
SAL 3: recovery time e.g. 20-25 seconds. Customer type: general consumer public and ISP traffic. Recommendation: redundant resources to be mostly available off-site; real-time services should be recovered before data services.
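The SAL targets above can be encoded as a simple lookup. This is an illustrative sketch only; the helper function and its name are hypothetical, and the numbers are the upper bounds of the example ranges from ETSI GS NFV-REL 001.

```python
# Upper bound of each SAL's example recovery-time range, in seconds.
SAL_MAX_RECOVERY_SECONDS = {
    1: 6,    # SAL 1: e.g. 5-6 s  (operator control traffic, emergency services)
    2: 15,   # SAL 2: e.g. 10-15 s (enterprise / operator service traffic)
    3: 25,   # SAL 3: e.g. 20-25 s (general consumer / ISP traffic)
}

def meets_sal(level: int, measured_recovery_seconds: float) -> bool:
    """Return True if a measured recovery time is within the SAL target."""
    return measured_recovery_seconds <= SAL_MAX_RECOVERY_SECONDS[level]

print(meets_sal(1, 5.5))   # within the SAL 1 target
print(meets_sal(3, 30.0))  # misses even the most lenient target
```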

Scenarios (use cases by VNF state, redundancy in the VNF, and failure detection)
UC1: stateful VNF, redundancy in the VNF, failure detection by the VNF only
UC2: stateful VNF, redundancy in the VNF, failure detection by VNF & NFVI
UC3: stateful VNF, no redundancy, failure detection by the VNF only
UC4: stateful VNF, no redundancy, failure detection by VNF & NFVI
UC5: stateless VNF, redundancy in the VNF, failure detection by the VNF only
UC6: stateless VNF, redundancy in the VNF, failure detection by VNF & NFVI
UC7: stateless VNF, no redundancy, failure detection by the VNF only
UC8: stateless VNF, no redundancy, failure detection by VNF & NFVI
UC9: repeated failure in the VNF
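The matrix above can be written as a lookup keyed by (stateful?, redundant?, detection mode), with UC9 as the special repeated-failure case. The naming here is purely illustrative.

```python
# Use-case matrix: (stateful, redundant, detection) -> use case.
USE_CASES = {
    (True,  True,  "VNF only"):   "UC1",
    (True,  True,  "VNF & NFVI"): "UC2",
    (True,  False, "VNF only"):   "UC3",
    (True,  False, "VNF & NFVI"): "UC4",
    (False, True,  "VNF only"):   "UC5",
    (False, True,  "VNF & NFVI"): "UC6",
    (False, False, "VNF only"):   "UC7",
    (False, False, "VNF & NFVI"): "UC8",
}

def use_case(stateful: bool, redundant: bool, detection: str,
             repeated_failure: bool = False) -> str:
    if repeated_failure:
        return "UC9"  # repeated failure in the VNF is its own scenario
    return USE_CASES[(stateful, redundant, detection)]

print(use_case(True, False, "VNF & NFVI"))  # UC4
```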

UC1: Stateful VNF with Redundancy (failure detection: VNF only)
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF isolates the VNFC
5. VNF fails over (active/standby)
6. NF recovers
7. VNF repairs the VNFC
Nothing new in this scenario. *Steps 1 and 2 are simultaneous; they are separated for clarity.

UC2: Stateful VNF with Redundancy (failure detection: VNF & NFVI)
1. VM fails
2. VM service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure / 5b. NFVI detects the failure
6a. VNF fails over / 6b. NFVI reports to VIM
7a. NF recovers / 7b. VIM reports to VNFM
8. VNFM gives the ok to VIM
9. VIM repairs the VM
10. VM service recovers
11. VNF repairs the VNFC
*Steps 1-4 are simultaneous; they are separated for clarity.

UC3: Stateful VNF with No Redundancy (failure detection: VNF only)
The VNFC checkpoints its state to a virtual disk (VD), which is itself HA.
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF isolates the VNFC
5. VNF repairs the VNFC
6. VNFC retrieves its state
7. NF recovers
*Steps 1 and 2 are simultaneous; they are separated for clarity.

UC4: Stateful VNF with No Redundancy (failure detection: VNF & NFVI)
The VNFC checkpoints its state to a virtual disk (VD), which is itself HA.
1. VM fails
2. VM service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure / 5b. NFVI detects the failure
6a. VNF reports to VNFM / 6b. NFVI reports to VIM
7. VIM reports to VNFM
8. VNFM gives the ok to VIM
9. VIM repairs the VM
10. VM service recovers
11. VIM informs VNFM
12. VNFM repairs the VNFC
13. VNFC retrieves its state
14. NF recovers
*Steps 1-4 are simultaneous; they are separated for clarity.

High Availability Flow Chart
A service failure occurs (it may be caused by a failure of the VNF or of the NFVI).
Step 1, service recovery (time-constrained: a carrier-grade VNF should be recovered within seconds, following the SAL):
- Failure detection (service heartbeat loss, or NFVI report of the failure)
- The service is unavailable for the duration of the recovery time
- Service failover
- VNF failure only: VNFC repair/restart
- NFVI failure: VM recovery, then VNFC recovery
- Repeated failures escalate to NFVI recovery
Step 2, NFVI recovery or repair (less time-constrained).
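The two-step flow above can be sketched as a small function: step 1 restores the service quickly, step 2 repairs the failed layer afterwards. All names here are hypothetical, for illustration only.

```python
def recover(failure_source: str, repeated: bool = False) -> list:
    """Return the ordered recovery actions for a detected service failure."""
    # Step 1: service recovery (time-constrained, within the SAL).
    steps = ["detect failure (heartbeat loss or NFVI report)",
             "service failover"]
    if failure_source == "VNF" and not repeated:
        steps.append("VNFC repair/restart")
    else:
        # NFVI failure, or repeated VNF-level failures escalated to the NFVI.
        steps.append("VM recovery, then VNFC recovery")
        # Step 2: NFVI recovery or repair (less time-constrained).
        steps.append("NFVI repair")
    return steps

for action in recover("VNF", repeated=True):
    print(action)
```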

Requirement doc & Gap Analysis doc. Ian Jolliffe

Requirement Doc. Details
Framework: The ultimate goal is to provide high availability for upper-layer services. Service high availability is provided by recovery of the service (including service restart and failover) within seconds, following the SAL. Repair or recovery of the failed layer should happen afterwards. Ensure that no failure in one layer causes a cascading failure in other layers. A single layer can detect failures in other layers and help recover failed components.
Layers: service layer (Service); application/VM layer (VNF/VNFC, VNFM); NFVI/VIM layer (NFVI, VIM); hardware layer (Hardware).

Requirement doc. outline
1 Overall Principle for High Availability in NFV
1.1 Framework for High Availability in NFV
1.2 Definitions
1.3 Overall requirements
1.4 Time requirement
2 Hardware HA
3 Virtualization Facilities (Host OS, Hypervisor)
4 Virtual Infrastructure HA – Requirements:
4.1 Virtual Compute
4.2 Virtual Storage
4.3 Virtual Network
5 VIM High Availability
5.1 Architecture requirement of VIM HA
5.2 Fault detection and alarm requirement of VIM
5.3 HA mechanism of VIM provided for NFV
5.4 SDN controller
6 VNF HA
6.1 Service Availability
6.2 Service Continuity
7 Storage

Gap Analysis
14+ HA-related gaps have been discovered:
- Nova: 6 gaps covering the scheduler, consoleauth, and the health status of compute nodes.
- Neutron: 2 gaps covering the L3 agent and the DHCP agent.
- Cinder: 2 gaps covering HA configuration and multi-attach.
- VIM NBI: 1 gap for error reporting.
- QoS: 1 gap for QoS management.
References:
https://etherpad.openstack.org/p/kilo-crossproject-ha-integration
https://etherpad.openstack.org/p/kilo-summit-ops-ha
https://blueprints.launchpad.net/openstack

Blue Prints Joe Huang

Escape from site-level Keystone failure
Today, only one Keystone server can be configured for token validation or revocation-list retrieval: an API request to an OpenStack service (Nova/Cinder/Neutron, ...) passes through the Keystone middleware, which validates the token (Fernet, UUID) or retrieves the revocation list (PKI) from that single Keystone server.
Proposal: allow a secondary Keystone server to be configured, to be used in case of a site-level Keystone failure (e.g. Site1 Keystone fails, Site2 Keystone takes over). Cons of DNS-based load balancing instead: delayed failover due to caching issues, and unpredictable routing.
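The blueprint's idea can be sketched as a try-primary-then-secondary loop. This is a hedged illustration only: the endpoint URLs and the injected validate() callable are hypothetical, and a real deployment would implement the fallback inside keystonemiddleware rather than in application code.

```python
# Hypothetical endpoints: primary (site 1) and secondary (site 2) Keystone.
KEYSTONE_ENDPOINTS = [
    "https://keystone.site1.example.com:5000",
    "https://keystone.site2.example.com:5000",
]

def validate_token(token: str, validate) -> bool:
    """Try each Keystone endpoint in order; return the first answer."""
    last_error = None
    for endpoint in KEYSTONE_ENDPOINTS:
        try:
            return validate(endpoint, token)
        except ConnectionError as err:  # site-level failure: try the next site
            last_error = err
    raise last_error

# Simulate site 1 being unreachable:
def fake_validate(endpoint, token):
    if "site1" in endpoint:
        raise ConnectionError("site 1 unreachable")
    return token == "good-token"

print(validate_token("good-token", fake_validate))  # True, served by site 2
```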

Open Questions: Storage HA Georg Kunz 

Storage Architecture
The VNF/VNFC consumes file (and object) storage through a storage service component; the NFVI exposes block, file, and object storage and is managed by the VIM. The storage layer is backed either by distributed storage across hosts or by a storage array (redundant controllers Ctrl1/Ctrl2 behind redundant switches).

Storage HA – Network Failure
1. A storage network link fails.
2. The storage network detects the failure.
3. The storage network switches to standby link(s): iSCSI multi-pathing, bonding.
4. The failure is reported to O&M.
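The multi-path failover above can be modeled as picking the first healthy path. This is a toy sketch of the active/standby idea, not a real iSCSI or multipathd API; all names are hypothetical.

```python
def pick_path(paths: dict) -> str:
    """Return the first healthy path, mimicking active/standby link failover."""
    for name, healthy in paths.items():
        if healthy:
            return name
    raise RuntimeError("no healthy storage path; report to O&M")

paths = {"link-A": True, "link-B": True}
print(pick_path(paths))   # link-A (active)
paths["link-A"] = False   # the active link fails
print(pick_path(paths))   # link-B (standby takes over)
```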

Storage HA – Failure in Storage Array
1. A component within the storage array fails.
2. Array-internal failover kicks in: RAID; redundant controllers, NICs, ...
3. The failure is reported to O&M.

Storage HA – Host Failure
1. A storage host fails.
2. The distributed storage layer detects the failure.
3. The distributed storage layer rebalances the data.

Non-HA Block Storage (legacy)
Block devices are mirrored at the VNF level: the active VNFC mirrors its block device to the passive VNFC.

HA Block Storage – Active/Passive Configuration
Failover is supervised by clustering software inside the VNF. Requires the multi-attach capability of Cinder.
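The active/passive pattern above can be sketched as a supervisor that promotes the standby VNFC when the active one fails, relying on the shared (multi-attached) volume being visible to both. The class and method names are hypothetical; in practice, clustering software such as Pacemaker plays this role inside the VNF.

```python
class ActivePassivePair:
    """Toy model of an active/passive VNFC pair sharing one Cinder volume."""

    def __init__(self, active: str, standby: str):
        self.active, self.standby = active, standby
        # Multi-attach lets both VNFCs see the volume; the active one uses it.
        self.volume_used_by = active

    def on_active_failure(self) -> str:
        """Fail over: promote the standby VNFC to active."""
        failed = self.active
        self.active, self.standby = self.standby, failed
        self.volume_used_by = self.active
        return self.active

pair = ActivePassivePair(active="vnfc-1", standby="vnfc-2")
print(pair.on_active_failure())  # vnfc-2 is now active
```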

HA Block Storage – Active/Active Configuration
A clustered file system enables concurrent access by multiple active VNFCs. Requires the multi-attach capability of Cinder.

VNF-Level HA for Multiple Backends
Block devices are provided by multiple backends and mirrored at the VNF level, which makes pro-active failover possible.

Open Questions
Can the NFVI storage system provide a sufficient level of HA to meet the SALs? Failover and recovery times depend heavily on the deployed solution. How much does the rebuild of data impact performance?

File Storage
Legacy deployments: the file storage service is provided by a VNFC, layered on top of the NFVI block storage services.
Alternative: the file storage service is provided by the NFVI/hardware, e.g. via OpenStack Manila.

Ephemeral Storage
Main use: file systems of VMs booted from an image.
Location option 1, on local disks of the compute host: isolates failover domains (the VM is unaffected by a failure of the storage system, and a disk failure corresponds to a host failure), but limits live-migration capabilities.
Location option 2, on distributed or external storage: correlated failures are possible, since a failure of the storage backend impacts the VMs; the properties of the respective storage backend apply.

Appendix

UC3: Stateful VNF with No Redundancy (failure detection: VNF only)
The VNFC checkpoints its state to a virtual disk (VD), which is itself HA.
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF isolates the VNFC
5. VNF repairs the VNFC
6. VNFC retrieves its state
7. NF recovers
*Steps 1 and 2 are simultaneous; they are separated for clarity.

UC3-b: Stateful VNF with No Redundancy (failure detection: VNF only)
The VNFC checkpoints its state to a virtual disk (VD), which is itself HA.
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF reports to VNFM
5. VNFM isolates the VNFC
6. VNFM repairs the VNFC
7. VNFC retrieves its state
8. NF recovers
*Steps 1 and 2 are simultaneous; they are separated for clarity.

UC4: Stateful VNF with No Redundancy (failure detection: VNF & NFVI)
The VNFC checkpoints its state to a virtual disk (VD), which is itself HA.
1. VM fails
2. VM service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure / 5b. NFVI detects the failure
6a. VNF reports to VNFM / 6b. NFVI reports to VIM
7. VIM reports to VNFM
8. VNFM gives the ok to VIM
9. VIM repairs the VM
10. VM service recovers
11. VIM informs VNFM
12. VNFM repairs the VNFC
13. VNFC retrieves its state
14. NF recovers
*Steps 1-4 are simultaneous; they are separated for clarity.

UC5: Stateless VNF with Redundancy (failure detection: VNF only)
The spare VNFC may or may not be instantiated.
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF isolates the VNFC
5. VNF fails over
6. NF recovers
7. VNF restores redundancy
Nothing new in this scenario. *Steps 1 and 2 are simultaneous; they are separated for clarity.

UC6: Stateless VNF with Redundancy (failure detection: VNF & NFVI)
The spare VNFC may or may not be instantiated.
1. VM fails
2. VM service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure / 5b. NFVI detects the failure
6a. VNF fails over / 6b. NFVI reports to VIM
7a. NF recovers / 7b. VIM reports to VNFM
8. VNFM gives the ok to VIM
9. VIM repairs the VM
10. VM service recovers
11. VNF restores redundancy
*Steps 1-4 are simultaneous; they are separated for clarity.

UC7: Stateless VNF with No Redundancy (failure detection: VNF only)
1. VNFC fails
2. NF fails*
3. VNF detects the failure
4. VNF reports to VNFM
5. VNF isolates the VNFC
6. VNF repairs the VNFC
7. NF recovers
*Steps 1 and 2 are simultaneous; they are separated for clarity.

UC8: Stateless VNF with No Redundancy (failure detection: VNF & NFVI)
1. VM fails
2. VM service fails
3. VNFC fails
4. NF fails*
5a. VNF detects the failure / 5b. NFVI detects the failure
6a. VNF reports to VNFM / 6b. NFVI reports to VIM
7. VIM reports to VNFM
8. VNFM gives the ok to VIM
9. VIM repairs the VM
10. VM service recovers
11. VIM informs VNFM
12. VNF repairs the VNFC
13. NF recovers
*Steps 1-4 are simultaneous; they are separated for clarity.

UC9: Stateless VNF with No Redundancy (failure detection: VNF only, but repeatedly)
1. VNFC fails
2. NF fails
3. VNF detects the failure and counts it (as in UC7)
4. VNF isolates the VNFC
5. VNF repairs the VNFC
6. NF recovers
... the VNFC fails a 2nd, 3rd, 4th time ...
The fault is not in the VNFC!

UC9 (continued): after repeated failures, recovery is escalated below the VNF:
N. VNF reports to VNFM
N+1. VNFM reports to VIM
N+2. VIM isolates the VM
N+3. VIM repairs the VM
N+4. VM service recovers
N+5. VNF repairs the VNFC
N+6. NF recovers

Scenario chart (discussion notes)
- Keep scenarios 1, 2, 5, and 6 in the main deck; add all the scenarios as an appendix.
- Should the NFVI provide an HA API to the VNF?
- OpenSAF is effectively a PaaS, acting as HA middleware.
- Stateful and stateless VNFs may require different schemes in the NFVI; if the VNF itself is not redundant, we may need VM redundancy, in which case a VNF-internal problem may still not be solved.