How Adobe Has Built An OpenStack Cloud
  Jun Park (Ph.D., MBA), Solutions Architect at Adobe
  Arghya Banerjee, Sr. Systems Engineer at Adobe
  OpenStack Mitaka Summit at Tokyo, Oct 2015
Swiss Cheese Model
  Flaws in defense layers: if aligned, the flaws would allow an accident to occur
  (Figure from Wikipedia)
Two More Factors That Complicate Things
  Space-Time Continuum (Einstein)
  Interactions, Higgs Field & Boson
  (Figures from Wikipedia and YouTube)
Our Template To Analyze
  Diagram axes: Components, Dependencies, Time
  In red: bugs or issues
  In green: fix or stable
OpenStack Survey, May 2015
  The most common architecture: Ubuntu + KVM + OVS + Ceph
Adobe OpenStack Architecture
  Storage: Ceph RBD
  VMs: VM1, VM2, VM3, each with two interfaces (eth0, eth1)
  Private networks: VxLAN-based
  External provider networks: VLAN-based
  Adobe Network Firewall
  Adobe Corporate Networks
USE CASE: Mesos Clustering
Possible Combinations
  Diagram labels: Containers in Containers, Bare Metals, Containers, VMs, VMs
Mesos Cluster Via Heat
  Hosts: Host1, Host2, Host3
  VM1: mesos master (Marathon, Zookeeper)
    -> Ubuntu-mesos image available via diskimage-builder
    -> Post configuration for master
    -> Starting services
  VM2: mesos slave1 (http server, http server)
    -> Ubuntu-mesos image
    -> Post configuration for slave using mesos master IP
    -> Starting services
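The master/slave layout on this slide can be sketched as a Heat Orchestration Template. This is a minimal illustration, not the template used at Adobe: the image name, flavor, network name, and boot scripts (service names and the /etc/mesos/zk path) are assumptions. The one point it is meant to show is how the slave's post-configuration can receive the master's IP through get_attr and str_replace.

```python
import json


def build_mesos_stack(image="ubuntu-mesos", flavor="m1.medium", network="private-net"):
    """Return a HOT template (as a dict) with one Mesos master and one slave.

    All default names are hypothetical placeholders.
    """
    # Assumed boot scripts; the real post-configuration is not in the slides.
    master_boot = (
        "#!/bin/bash\n"
        "service zookeeper restart\n"
        "service mesos-master restart\n"
        "service marathon restart\n"
    )
    slave_boot = (
        "#!/bin/bash\n"
        "echo zk://MASTER_IP:2181/mesos > /etc/mesos/zk\n"
        "service mesos-slave restart\n"
    )

    def server(user_data):
        # Common OS::Nova::Server properties shared by master and slave.
        return {
            "type": "OS::Nova::Server",
            "properties": {
                "image": image,
                "flavor": flavor,
                "networks": [{"network": network}],
                "user_data_format": "RAW",
                "user_data": user_data,
            },
        }

    return {
        "heat_template_version": "2015-04-30",
        "resources": {
            "mesos_master": server(master_boot),
            # str_replace substitutes MASTER_IP once the master has an address,
            # which also makes Heat create the master before the slave.
            "mesos_slave1": server({
                "str_replace": {
                    "template": slave_boot,
                    "params": {
                        "MASTER_IP": {"get_attr": ["mesos_master", "first_address"]},
                    },
                },
            }),
        },
    }


template = build_mesos_stack()
print(json.dumps(template, indent=2))
```

The printed JSON is also syntactically valid YAML, so it could be saved to a file and adapted into a stack definition after replacing the placeholder names.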
Mesos Cluster with Marathon
  Request to run a micro-service via REST API -> Marathon
  Mesos Master with ZooKeeper
  Mesos Slave1: http server
  Mesos Slave2: http server
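A request to run a micro-service through Marathon's REST API might look like the sketch below. It builds a POST to Marathon's /v2/apps endpoint without sending it. The host name, app id, command, and resource sizes are all assumptions chosen for illustration.

```python
import json
import urllib.request


def build_marathon_request(marathon_url, app_id, cmd, instances=2, cpus=0.1, mem=32):
    """Build (but do not send) a POST that asks Marathon to run an app."""
    app = {
        "id": app_id,            # app name inside Marathon
        "cmd": cmd,              # shell command run on a Mesos slave
        "instances": instances,  # number of copies to keep running
        "cpus": cpus,
        "mem": mem,              # megabytes
    }
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and command; 8080 is Marathon's default HTTP port.
req = build_marathon_request(
    "http://mesos-master.example:8080",
    "/http-server",
    "python3 -m http.server $PORT",
)
print(req.get_method(), req.full_url)
# To submit against a live Marathon: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes the payload easy to inspect before it reaches a real cluster.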
Master + 2 slaves: Heat Stacks
Topology of Slave2
Marathon: Two Apps on Slave1
App Running On Slave
Mesos UI
Heat Template
  Diagram axes: Components, Dependencies, Time
What Happened At Networking?
A New Bug: OVS Sporadically Crashes In Adding A Port
(https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1336555 and 1449012)
  Neutron
    Restarting agents re-establishes entire flows
    Fix ready, not added
    Security Group O(N^2) issue: enhancement patch not yet integrated (e.g., 270 secs to 3 secs for 25K rules)
  OpenvSwitch (OVS)
    OVS 2.0.1 released: Mega Flow, multiprocessing
    This bug introduced with OVS Mega Flow
    OVS 2.3.0, OVS 2.1.3, OVS 2.0.2 released: bug fix in all OVS 2.x
  Ubuntu 14.04
    Ubuntu 14.04 Trusty released with OVS 2.0.1
    Bug report with OVS 2.0.1 in Ubuntu 14.04
    Cherry-pick on OVS 2.0.2 in Ubuntu 14.04.2
  Timeline axis: Jun '13, Dec '13, Apr '14, Jul '14, Aug '14, May '15
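To give a feel for the Security Group O(N^2) issue, the sketch below counts rules under one assumed cause of quadratic growth: a security group that allows traffic from its own members, so every port needs one rule per peer. The slide does not state the cause, so treat this model as an illustration only.

```python
def rules_for_self_referencing_group(num_ports):
    """Total per-peer rules if each port allows every other member individually."""
    return num_ports * (num_ports - 1)


for ports in (10, 50, 160):
    print(ports, "ports ->", rules_for_self_referencing_group(ports), "rules")
# 10 ports -> 90 rules
# 50 ports -> 2450 rules
# 160 ports -> 25440 rules
```

Under this model, about 160 ports in one group already produce roughly the 25K rules quoted on the slide.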
What Happened At Networking?
A New Bug: OVS Sporadically Crashes In Adding A Port
(https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1336555 and 1449012)
  OVS
    OVS 2.0.1 released: Mega Flow, multiprocessing
  Some companies reverted from OVS to LinuxBridge!
  Some pundits spread FUD about Neutron!
  OpenStack Summits: Atlanta (Icehouse), Paris (Juno), Vancouver (Kilo)
  Ubuntu 14.04
    Ubuntu 14.04 Trusty released with OVS 2.0.1
    Cherry-pick onto OVS 2.0.2 in Ubuntu 14.04
  Timeline axis: Dec '13, Apr '14, May '14, Nov '14, May '15
What Happened At Storage?
Ceph Operational Instability, Cinder Scalability Issue
  Cinder
    Cinder is stuck when Ceph is stuck (e.g., use local drive for copying an image)
    Enhancement solution not yet integrated (e.g., APIs stacked up -> multiprocessing)
  Ceph
    Failover instability with Firefly
    Hammer?
  Ubuntu 14.04
    Ubuntu 14.04 Trusty released with Ceph Firefly 0.79
    Ubuntu 14.04 updates with Ceph Firefly 0.80.10
  Timeline axis: Apr '14, May '14, July '15
What Happened At Data Node?
Kernel Memory Bug, Security Issue
  KVM security issue, security patch
  XFS deadlock bug, kernel bug fix
  Ubuntu 14.04
    Ubuntu 14.04 Trusty released with Kernel…
    Ubuntu 14.04 Trusty released with Kernel…
  Timeline axis: Dec '13, Apr '14, May '14, Nov '14, May '15, July '15
Takeaways: Workarounds & Tips
  Networks
    Understand OVS and find a stable OVS version
    Cherry-pick for Neutron scalability: firewall rules
    Our own out-of-band rate limiting on networks, e.g., 200 Mbps
    Set up the right MTU size on the OVS structure
    Turn off GRO/LRO on hosts
  Storage
    Decouple the storage system from OpenStack API services
    Cinder scalability
    Ceph stability: Hammer, reconfigure towards optimal
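The MTU tip can be made concrete with a little arithmetic. The sketch below assumes VxLAN over an IPv4 underlay with no VLAN tag on the inner frame, which gives the commonly cited 50 bytes of encapsulation overhead. The MTU values passed in are examples, not Adobe's settings.

```python
# Bytes added inside the underlay IP packet by VxLAN encapsulation
# (IPv4 underlay, untagged inner frame assumed).
OUTER_IPV4_HEADER = 20
OUTER_UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14
VXLAN_OVERHEAD = (
    OUTER_IPV4_HEADER + OUTER_UDP_HEADER + VXLAN_HEADER + INNER_ETHERNET_HEADER
)


def vm_mtu(underlay_mtu):
    """Largest MTU a VM can use without fragmentation on the underlay."""
    return underlay_mtu - VXLAN_OVERHEAD


def required_underlay_mtu(desired_vm_mtu):
    """Underlay MTU needed so VMs can keep the desired MTU."""
    return desired_vm_mtu + VXLAN_OVERHEAD


print(vm_mtu(1500))                 # 1450
print(required_underlay_mtu(1500))  # 1550
```

Either lower the MTU inside the VMs or raise it on the physical network; mixing the two inconsistently tends to show up as stalled large transfers while small packets still pass.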
How To Test at Scale
  Emulate the future production environment
  Create hundreds of VMs, inject workloads, and destroy all
  Recycle this entire test over and over again
  Findings: dead tokens stacked up
  Each component's scalability
    Neutron: OVS
    Cinder: Ceph
    Nova: KVM
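The create, load, destroy, repeat cycle described above can be sketched as a small harness. FakeCloud is a hypothetical in-memory stand-in so the example runs anywhere; a real test would swap in calls to the actual cloud API. Recording what is left behind after each cycle is one simple way to notice leftovers accumulating over repeated runs.

```python
import itertools


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API, for illustration only."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.servers = {}

    def create_server(self, name):
        server_id = next(self._ids)
        self.servers[server_id] = name
        return server_id

    def delete_server(self, server_id):
        del self.servers[server_id]


def run_cycle(cloud, vm_count, workload):
    """Create vm_count VMs, run the workload on each, then destroy them all."""
    ids = [cloud.create_server(f"scale-test-{i}") for i in range(vm_count)]
    try:
        return [workload(server_id) for server_id in ids]
    finally:
        # Always clean up, even if a workload raised.
        for server_id in ids:
            cloud.delete_server(server_id)


def soak(cloud, cycles, vm_count, workload):
    """Repeat the cycle and record how many servers remain after each one."""
    leftovers = []
    for _ in range(cycles):
        run_cycle(cloud, vm_count, workload)
        leftovers.append(len(cloud.servers))
    return leftovers


print(soak(FakeCloud(), cycles=3, vm_count=200, workload=lambda sid: sid))
# [0, 0, 0]
```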
Have We Done Enough? 4? 3?
It's not that I'm so smart, it's just that I stay with problems longer. - Albert Einstein
New Efforts In OpenStack
  OpenStack Product Working Group
    Link up between contributors and users
  Governance/DefCore Committee
    Defining OpenStack Core
  Large Deployment Team
    Operational issues for large deployments
  Open Virtual Network (OVN)
    In-kernel conntrack, DPDK, etc.
    Will run atop OVS
APPENDIX
Adobe OpenStack Architecture
  VM1 (eth0, eth1) -> Linux Bridge -> OpenvSwitch -> bond0 -> Physical VLANs
  External provider networks: VLAN-based
  Adobe Network Firewall
  Adobe Corporate Networks
Volume Management in OpenStack
  Glance API Server
    Set of images
    Image1: Ubuntu Trusty
  Cinder API Server
    1. Copy
       Volume1: Ubuntu Trusty (Copy-On-Write (COW) Ceph volume)
    2. Snapshot
       Snapshot1: Ubuntu Trusty, base volume for all three VMs
    3. Volumes (individual COW volumes)
       New Volume1 for VM1
       New Volume2 for VM2
       New Volume3 for VM3
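The copy-on-write relationship between the base snapshot and the three per-VM volumes can be illustrated with a toy model. This is a conceptual sketch only and does not reflect how Ceph RBD implements cloning. The class names and block contents are made up.

```python
class Snapshot:
    """Read-only base image shared by every clone."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, index):
        return self._blocks.get(index, b"")


class CowVolume:
    """Clone that stores only the blocks it has overwritten."""

    def __init__(self, parent):
        self.parent = parent
        self.own_blocks = {}

    def read(self, index):
        # Prefer the clone's private copy, otherwise fall through to the base.
        if index in self.own_blocks:
            return self.own_blocks[index]
        return self.parent.read(index)

    def write(self, index, data):
        # The base snapshot is never modified; the write stays private.
        self.own_blocks[index] = data


base = Snapshot({0: b"bootloader", 1: b"ubuntu-trusty-rootfs"})
volumes = [CowVolume(base) for _ in range(3)]  # one volume per VM

volumes[0].write(1, b"vm1-changes")
print(volumes[0].read(1))           # b'vm1-changes'
print(volumes[1].read(1))           # b'ubuntu-trusty-rootfs'
print(len(volumes[1].own_blocks))   # 0
```

New volumes start out empty and only grow as each VM diverges from the shared base, which is why creating them is fast and cheap in space.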