Published by William Edwards. Modified over 6 years ago.
1
Infrastructure Maintenance & Upgrade: Zero VNF Downtime with OPNFV Doctor on OCP Hardware
AirFrame Open Rack V1 and V2 compatible Demo video:
2
OPNFV Doctor Maintenance Use Case
Actors: CloudAdmin; Consumers C1, C2 and C3; Virtualized Infrastructure Manager (VIM), e.g. OpenStack, reached through the OpenStack Northbound Interface.

Steps:
1. Maintenance Request (Server S3)
2. Which VMs are affected? Find the Consumer owning the VM(s) from the database.
3. Maintenance Notification (VM ID)
4. Switch to SBY configuration
5. Instruction (VM ID)
6. Execute Instruction, e.g. migrate VM

Resource Map
Server - VM mapping:
  Server S1: VM-1, VM-2
  Server S2: VM-7
  Server S3: VM-4
Ownership information:
  VM-1, VM-7: Consumer C1
  VM-2: Consumer C2
  VM-4: Consumer C3

Resource Pool: Hardware Servers S1, S2 and S3, each with a Hypervisor, hosting VM-1, VM-2, VM-7 and VM-4.
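Step 2 of the use case (finding the affected VMs and their owners) can be sketched as a lookup over the two tables of the Resource Map. This is only an illustration: the dictionaries and the function name `affected_consumers` are assumptions made for this example, not part of any Doctor or OpenStack API.

```python
# Hypothetical in-memory copy of the Resource Map shown on the slide.
SERVER_TO_VMS = {
    "S1": ["VM-1", "VM-2"],
    "S2": ["VM-7"],
    "S3": ["VM-4"],
}
VM_TO_CONSUMER = {
    "VM-1": "C1",
    "VM-7": "C1",
    "VM-2": "C2",
    "VM-4": "C3",
}


def affected_consumers(server):
    """Return {consumer: [vm, ...]} for all VMs hosted on the given server."""
    result = {}
    for vm in SERVER_TO_VMS.get(server, []):
        result.setdefault(VM_TO_CONSUMER[vm], []).append(vm)
    return result


# Maintenance request for Server S3: only Consumer C3 needs a notification.
print(affected_consumers("S3"))  # {'C3': ['VM-4']}
```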
3
Test case
- A single compute host at a time is maintained or upgraded, in rolling fashion.
- There needs to be one empty, already maintained host to which the application payload (VMs) is moved in interaction with the application manager.
- All capacity is in use at the beginning, so the application first needs to scale down to make room for an empty compute host.
- After scaling down there is enough free capacity, but it might not all be on a single compute host, so some VMs might still need to be migrated.
- Once there is an empty compute host, maintenance can begin. The empty host is maintained first, then the occupied hosts one at a time.
- VMs that are moved now always go to an already maintained host.
- The application can take its own action to re-instantiate a VM, or tell whether it wants the VM migrated.
- The application can do a switchover and even wait for a transaction to end, to guarantee zero downtime.
- The application learns about new capabilities it can take into use.
- When maintenance is complete, the application can scale back up to full operation.
4
Beyond
- Multi-tenancy (multiple VNFs) is supported, but not used in the test case. With it, you should always get free capacity through down scaling, even if down scaling is not possible for all VNFs.
- Running several parallel maintenance sessions is also supported, but it is not suitable for a test case with 4 (3) compute nodes. You would want a session for each different type of hardware, to ensure that moving VMs around always works.
- A realistic option is to have more than one compute host "empty" for maintenance. This is supported; what could be added is that the rolling upgrade continues on at least one host while another is used for the infrastructure admin's verification tests before the compute is put back into production. Parallel maintenance of compute nodes would also be possible, to speed up the operation.
- You might not have enough compute nodes for down scaling to be possible, or you might simply not want to scale down. In that case, you can use OPNFV Promise to reserve more compute capacity.
- This concept is generic and can be used for hybrid cloud as well.
5
Demo Setup
- There are 56 VCPUs on each compute host and each VM takes 28 VCPUs, meaning 2 VMs fit on each compute.
- Application with 2 types of VMs:
  - doctor_ha_app_#: 2 instances. A floating IP is set for the active instance. Uses anti-affinity, so the instances are assured to be on different hosts.
  - doctor_nonha_app_#: 6 instances
- The instances take 4 Nokia Open Rack compatible AirFrame computes:

overcloud-novacompute-0 (nova-compute: enabled)
  763f06ae cf2-823f-bb579475dad6: doctor_nonha_app_2
  768d4f1d-531b-4c9d-9c b03e50e: doctor_ha_app_0
overcloud-novacompute-1 (nova-compute: enabled)
  2d c-ab35-6e553505cc82: doctor_nonha_app_4
  03d b-48a3-8b0d-51622d85416b: doctor_nonha_app_5
overcloud-novacompute-2 (nova-compute: enabled)
  0af47a fb-9fa0-b01c18b2cf4b: doctor_nonha_app_0
  4b5eac21-b6af-4d18-ac9b-c9cc1f38fe2b: doctor_nonha_app_1
overcloud-novacompute-3 (nova-compute: enabled)
  13df92ef-7b4d-4f b71ab68e05d8: doctor_ha_app_1
  cbbe5dc6-52e8-422f-9a14-cea860b997a7: doctor_nonha_app_3
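The capacity arithmetic behind this setup can be worked through in a few lines. The numbers come from the slide; the variable names are chosen for this example only.

```python
VCPUS_PER_HOST = 56
VCPUS_PER_VM = 28
COMPUTE_HOSTS = 4
HA_INSTANCES = 2
NONHA_INSTANCES = 6

vms_per_host = VCPUS_PER_HOST // VCPUS_PER_VM        # 56 // 28 = 2
total_slots = vms_per_host * COMPUTE_HOSTS           # 2 * 4 = 8
running_vms = HA_INSTANCES + NONHA_INSTANCES         # 2 + 6 = 8
free_slots = total_slots - running_vms               # 0: all capacity is used

# One empty host needs vms_per_host free slots, so the application
# has to scale down by the difference before maintenance can start.
scale_down_by = max(0, vms_per_host - free_slots)    # 2

print(vms_per_host, total_slots, running_vms, free_slots, scale_down_by)
# 2 8 8 0 2
```

Scaling down by 2 VMs frees enough capacity for one empty host, although the two free slots may still sit on different computes and require a migration.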
6
Doctor Test Case and Demo Design
Participants: Infra Admin, Admin tool, App manager, Inspector.

Infra Admin: maintenance session for compute hosts; Get status.

Workflow:
- Start maintenance process. Check for empty compute.
- MAINTENANCE: when maintenance starts and which VMs. Reply: ACK_MAINTENANCE.
- DOWN_SCALE: ACT -> STDBY, down scale. Reply: ACK_DOWN_SCALE.
- PREPARE_MAINTENANCE: if there is no empty host, one has to be made. Agree admin action to move VM (Migrate, Live Migrate). Reply: ACK_PREPARE_MAINTENANCE.
- Disable all nova-computes.
- Repeated for each compute:
  - PLANNED_MAINTENANCE if there are VMs on the host. Switch over (STDBY -> ACT), reply admin action. Reply: ACK_PLANNED_MAINTENANCE. Migrate, Live Migrate.
  - Disable host automatic fault management.
  - IN_MAINTENANCE: actual host maintenance done here.
  - Enable host automatic fault management.
  - MAINTENANCE_COMPLETE. Enable nova-compute. Next compute.
- MAINTENANCE_COMPLETE: up scale (ACT -> STDBY, STDBY -> ACT). Reply: ACK_MAINTENANCE_COMPLETE.

[Figure: VM placement on overcloud-novacompute-0 to overcloud-novacompute-3 as each nova-compute goes from enabled to disabled and back during the session.]
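The order of states in the session can be sketched as a simple event list. This is a simplified model built from the state names on the slide: it assumes DOWN_SCALE and PREPARE_MAINTENANCE always happen and that the set of hosts with VMs is known up front, whereas in the demo VMs move during the session. The function names are hypothetical.

```python
def session_events(hosts, hosts_with_vms):
    """Return (audience, state, host) tuples in the order they are sent."""
    events = [
        ("project", "MAINTENANCE", None),
        ("project", "DOWN_SCALE", None),
        ("project", "PREPARE_MAINTENANCE", None),
    ]
    for host in hosts:
        if host in hosts_with_vms:
            # Only hosts that still carry VMs need project interaction.
            events.append(("project", "PLANNED_MAINTENANCE", host))
        events.append(("host", "IN_MAINTENANCE", host))
        events.append(("host", "MAINTENANCE_COMPLETE", host))
    events.append(("project", "MAINTENANCE_COMPLETE", None))
    return events


def expected_ack(event):
    """Project states are acknowledged with ACK_<state>; host states are not."""
    audience, state, _host = event
    return "ACK_" + state if audience == "project" else None


for event in session_events(["compute-0", "compute-1"], {"compute-1"}):
    print(event, "->", expected_ack(event))
```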
7
Design
[Figure: architecture diagram. Labels recoverable from the slide:]
- Actors and components: Infra Admin; Application; App Manager / VNFM; Admin tool; Notifier; Inspector; Controller (x3); Virtualized Infrastructure (Resource Pool) with Physical host (x4)
- OpenStack projects: Heat (Orchestration), Ceilometer / Aodh, Vitrage, Congress, Nova, Ironic
- Application instances: HAPP 1 (act / stdby), HAPP 2 (stdby), NONHAPP1, NONHAPP2, NONHAPP3, NONHAPP4
- Host labels: Maintenance, Empty
- Actions and flows: Schedule maintenance; Create Alarm; Is project alarm enabled; Project maintenance alarm; Admin maintenance alarm; Maintenance state events; Maintenance workflow actions; Migrate, Live Migrate, Own action; Down or up scale; Re-instantiate (optional own action); Switch over; Ack + action
- Data: Resource Map, Alarm Conf., Failure Policy
- Legend: OpenStack project, Admin action, Project action, Cloud Infra Entity
8
Watch this
Here is a web camera on the 2x3 grid of demo servers:
When the maintenance is starting, you can see this in the log window:
  maintaining host overcloud-novacompute-1.opnfvlf.org
  | | | | | overcloud-novacompute-1 | | |
  --==Blue blink in 10 secs!!!==--
  host overcloud-novacompute-1.opnfvlf.org maintenance ongoing...
You can expect to see the blue LED shut off for a couple of seconds, as the server reboots while under maintenance.
9
Admin Tool: Infrastructure Admin APIs
POST /maintenance
Example Request1:
  {
    'hosts': ['overcloud-novacompute-1.opnfvlf.org',
              'overcloud-novacompute-3.opnfvlf.org',
              'overcloud-novacompute-0.opnfvlf.org',
              'overcloud-novacompute-2.opnfvlf.org'],
    'state': 'MAINTENANCE',
    'maintenance_at': ' :06:03',
    'metadata': {'openstack_version': 'Queens'}
  }
Response (200):
  'session_id': ee-1c4d-11e8-a9b0-0242ac110002

'session_id' is a unique ID used throughout the maintenance session. Typically a session should include a number of compute nodes of a similar type.

GET /maintenance
Request:
  'session_id': ee-1c4d-11e8-a9b0-0242ac110002
Response:
  'state': 'MAINTENANCE_COMPLETE'

POST /maintenance
Example Request2:
  {
    'state': 'REMOVE_MAINTENANCE_SESSION',
    'session_id': ee-1c4d-11e8-a9b0-0242ac110002
  }
Response (200):
  'state': 'ACK_REMOVE_MAINTENANCE_SESSION'

In the demo, the infra admin removes the maintenance session at the end.
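The two POST /maintenance request bodies can be assembled as plain dictionaries, as in the sketch below. It only builds the payloads and sends nothing over the network. The helper names, the timestamp format and the example session ID are assumptions for illustration; the slide does not show the full timestamp or session ID.

```python
import json


def start_session_request(hosts, maintenance_at, metadata=None):
    """Body for POST /maintenance that opens a maintenance session."""
    return {
        "hosts": list(hosts),
        "state": "MAINTENANCE",
        "maintenance_at": maintenance_at,
        "metadata": metadata or {},
    }


def remove_session_request(session_id):
    """Body for POST /maintenance that removes a finished session."""
    return {"state": "REMOVE_MAINTENANCE_SESSION", "session_id": session_id}


start = start_session_request(
    ["overcloud-novacompute-0.opnfvlf.org", "overcloud-novacompute-1.opnfvlf.org"],
    "2018-01-01 10:06:03",              # hypothetical timestamp and format
    {"openstack_version": "Queens"},
)
print(json.dumps(start, indent=2))
print(json.dumps(remove_session_request("example-session-id")))  # hypothetical ID
```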
10
Admin Tool: Project APIs
GET /<project_id>/maintenance
Example Request:
  GET /ead0dbcaf3564cbbb04842e3e54960e3/maintenance
  {
    'session_id': '76e55df8-1c51-11e ac110002'
  }
Response (200):
  'instance_ids': ['109e14d b3-93e f264d8f', ' f0fc-4428-a8b2-0b3edd64bcad']

PUT /<project_id>/maintenance
Example Request:
  PUT /ead0dbcaf3564cbbb04842e3e54960e3/maintenance
  {
    'instance_actions': {'109e14d b3-93e f264d8f': 'MIGRATE',
                         ' f0fc-4428-a8b2-0b3edd64bcad': 'MIGRATE'},
    'session_id': '76e55df8-1c51-11e ac110002',
    'state': 'ACK_PLANNED_MAINTENANCE'
  }

'state' can have different values depending on which action the reply is for:
- ACK_MAINTENANCE
- ACK_DOWN_SCALE
- ACK_PREPARE_MAINTENANCE
- ACK_PLANNED_MAINTENANCE
- ACK_MAINTENANCE_COMPLETE
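A reply for the project PUT call could be built along these lines. The sketch only constructs the path and body and performs no HTTP request. `build_project_reply` is a hypothetical helper, and including 'instance_actions' only when instances are given is an assumption; the slide shows that field only for ACK_PLANNED_MAINTENANCE.

```python
ACK_STATES = {
    "ACK_MAINTENANCE",
    "ACK_DOWN_SCALE",
    "ACK_PREPARE_MAINTENANCE",
    "ACK_PLANNED_MAINTENANCE",
    "ACK_MAINTENANCE_COMPLETE",
}


def build_project_reply(project_id, session_id, state, instance_ids=(), action="MIGRATE"):
    """Return (path, body) for PUT /<project_id>/maintenance."""
    if state not in ACK_STATES:
        raise ValueError("unknown reply state: " + state)
    body = {"session_id": session_id, "state": state}
    if instance_ids:
        # Same action for every instance; a real manager may decide per VM.
        body["instance_actions"] = {iid: action for iid in instance_ids}
    return "/" + project_id + "/maintenance", body


path, body = build_project_reply(
    "ead0dbcaf3564cbbb04842e3e54960e3",
    "example-session-id",               # hypothetical session ID
    "ACK_PLANNED_MAINTENANCE",
    ["instance-a", "instance-b"],       # hypothetical instance IDs
)
print(path)
print(body)
```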
11
Admin Tool: Notification for Admin
Event type: 'maintenance.host'
payload:
  {
    'service': 'admin_tool',
    'state': 'IN_MAINTENANCE',
    'session_id': '76e55df8-1c51-11e ac110002',
    'host': 'overcloud-novacompute-0.opnfvlf.org',
    'project_id': 'ead0dbcaf3564cbbb04842e3e54960e3'
  }

'state' can have the values 'MAINTENANCE_COMPLETE' or 'IN_MAINTENANCE'.
Note! The Inspector has a POST /maintenance API to catch this through an AODH event alarm.
12
Admin Tool: Notification for Project
Event type: 'maintenance.planned'
payload:
  {
    'service': 'admin_tool',
    'allowed_actions': ['MIGRATE', 'LIVE_MIGRATE', 'OWN_ACTION'],
    'instance_ids': '
    'reply_url': '
    'state': 'PLANNED_MAINTENANCE',
    'session_id': '76e55df8-1c51-11e ac110002',
    'actions_at': ' T06:40:16',
    'project_id': 'ead0dbcaf3564cbbb04842e3e54960e3',
    'metadata': {'openstack_version': 'Queens'}
  }

'allowed_actions' is valid for DOWN_SCALE, PREPARE_MAINTENANCE and PLANNED_MAINTENANCE.

'state' can have different values depending on what is wanted from the project:
- MAINTENANCE: tells the project when maintenance starts and which VMs are affected.
- DOWN_SCALE: the project needs to scale down to free compute capacity.
- PREPARE_MAINTENANCE: after down scaling, the free capacity might need further actions so that there is at least one empty compute host.
- PLANNED_MAINTENANCE: tells the project which VMs are to be moved as the host goes into maintenance.
- MAINTENANCE_COMPLETE: the whole maintenance session is complete; the application can scale up to full capacity if needed.

'actions_at' is the time by which the reply is needed.
'metadata' can tell about new capabilities that become available after the project replies with move actions for its VMs using ACK_PLANNED_MAINTENANCE.

The application manager can have a POST /maintenance API to catch this through an AODH event alarm.
The application manager calls the admin tool APIs to query instances and to reply with an action and ACK to requests.
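How an application manager might react to a 'maintenance.planned' payload can be sketched as follows. The handler name and the 'chosen_action' key are hypothetical and not part of the admin tool API; the sketch only shows that each state is answered with ACK_<state> and that an action may only be chosen from 'allowed_actions' in the three states where that field is valid.

```python
STATES_WITH_ACTIONS = {"DOWN_SCALE", "PREPARE_MAINTENANCE", "PLANNED_MAINTENANCE"}
KNOWN_STATES = STATES_WITH_ACTIONS | {"MAINTENANCE", "MAINTENANCE_COMPLETE"}


def handle_project_notification(payload, preferred_action="MIGRATE"):
    """Decide the ACK state (and action, where allowed) for a notification."""
    state = payload["state"]
    if state not in KNOWN_STATES:
        raise ValueError("unexpected state: " + state)

    reply = {"session_id": payload["session_id"], "state": "ACK_" + state}
    if state in STATES_WITH_ACTIONS:
        allowed = payload.get("allowed_actions", [])
        if preferred_action not in allowed:
            raise ValueError(preferred_action + " is not an allowed action")
        reply["chosen_action"] = preferred_action   # hypothetical key
    return reply


notification = {
    "state": "PLANNED_MAINTENANCE",
    "session_id": "example-session-id",             # hypothetical session ID
    "allowed_actions": ["MIGRATE", "LIVE_MIGRATE", "OWN_ACTION"],
}
print(handle_project_notification(notification, "LIVE_MIGRATE"))
```

In the flow described on the slide, the manager would then query the affected instances from the admin tool and send the per-instance actions together with the ACK.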
13
How We Got Here
- OPNFV Santa Clara hackfest, March 2016: Go for OpenStack Nova
- OpenStack Austin summit, April 2016: Nova + OPS session
- OpenStack Barcelona summit, October 2016: OPS -> Craton; Nova BP to continue later
- OpenStack Atlanta PTG, February 2017: At the end maybe no changes to Nova. Out of scope.
- OPS Milan, March 2017: Craton; a generic solution for cloud
- OSIC funding cut -> Craton development to halt in April 2017
- OPNFV Peking summit, June 2017: First Doctor POC and discussion
- Doctor design guideline, August 2017
- Second POC, September 2017
- OpenStack Sydney summit, November 2017: Upgrade a big topic. Time couldn't be better
- OPNFV plug-fest Portland, December 2017: POC
14
Let’s keep our clouds running with five nines
Demo video:
Thanks