Hierarchical Fabric Designs: The Journey to Multisite
Lukas Krattiger, Principal Engineer
September 2017
A Single Fabric, a Single Data Center
  Leaf/Spine Topologies (aka Folded Clos, Fat-Tree)
  Similar to traditional 3-Tier Hierarchical Topologies (Access, Aggregation, Core)
  Roles and Location change
    Leaf: End-Point and First-Hop handling; Multi-Tenancy aware
    Spine: an IP Router
    Border: where we connect External; Multi-Tenancy aware
  Roles and Functions can be Collapsed
    New: Border + Spine = Border Spine
    New: Border + Leaf = Border Leaf
    Old: Core + Aggregation = Collapsed Core
  [Diagram: Pod 1 with Spine and VTEP nodes, connected to an External Layer-3 Network]
A Second Fabric, a Second Data Center
  [Diagram: Pod 1 and Pod 2, each with Spine and VTEP nodes, connected through an External Layer-3 Network]
A Tale of the Super-Spine
  Still a Leaf/Spine Topology (aka Folded Clos, Fat-Tree)
  Yet another Tier
  Hierarchical and well Structured
  [Diagram: Pod 1 and Pod 2, with multiple Spine and VTEP nodes]
Hierarchical Fabric Designs
  Nicely Structured and Tiered Topologies
  Allows Efficient Scale-Out
    More End-Points = More Leaf
    More Bandwidth, Resilience or Capacity = More Spine or Tiers
  Roles and Location change
    Leaf: Becomes more Intelligent
    Spine: Becomes less Intelligent (no touch-points)
    Super-Spine: Yet another Spine (next Tier for Capacity)
    Border: Your Way to the Rest of the World
  Best Approach for a Routed Fabric (Layer-3)
  Well described in RFC7938 - Use of BGP for Routing in Large-Scale Data Centers
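The scale-out rule on this slide (more leaves for more end-points, more spines for more bandwidth) can be illustrated with a little arithmetic. The following is a minimal sketch, not a sizing tool: the class name, the port counts and the link speeds are all assumptions chosen for illustration, and it assumes every leaf has exactly one uplink to every spine.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LeafSpineFabric:
    """Hypothetical two-tier fabric; each leaf has one uplink to each spine."""
    leaves: int
    spines: int
    host_ports_per_leaf: int  # assumed value, not from the slides
    host_port_gbps: int       # assumed value, not from the slides
    uplink_gbps: int          # assumed value, not from the slides

    def end_point_ports(self) -> int:
        # End-point capacity grows with the number of leaves.
        return self.leaves * self.host_ports_per_leaf

    def oversubscription(self) -> float:
        # Ratio of host-facing to spine-facing bandwidth on one leaf.
        downlink = self.host_ports_per_leaf * self.host_port_gbps
        uplink = self.spines * self.uplink_gbps
        return downlink / uplink


base = LeafSpineFabric(leaves=4, spines=2, host_ports_per_leaf=48,
                       host_port_gbps=10, uplink_gbps=100)
more_leaf = replace(base, leaves=8)    # more end-points
more_spine = replace(base, spines=4)   # more bandwidth

print(base.end_point_ports(), base.oversubscription())            # 192 2.4
print(more_leaf.end_point_ports(), more_leaf.oversubscription())  # 384 2.4
print(more_spine.end_point_ports(), more_spine.oversubscription())  # 192 1.2
```

Adding leaves raises the port count without changing the per-leaf ratio, while adding spines lowers the ratio without adding ports.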
What About Overlays?
A Single Fabric Overlay
  A Single Overlay Domain
  Roles and Location change, again
    Leaf: End-Point and First-Hop handling; Multi-Tenancy aware (VTEP)
    Spine: not seen
    Border: where we connect External; Multi-Tenancy aware (VTEP)
  [Diagram: Pod 1 with Spine and VTEP nodes under a single Overlay, connected to an External Layer-3 Network]
Two Fabric Overlay – WAN Connectivity
  [Diagram: Pod 1 and Pod 2, each with Spine and VTEP nodes and its own Overlay, connected through an External Layer-3 Network]
A Tale of the Super-Spine
  [Diagram: Pod 1 and Pod 2, with multiple Spine and VTEP nodes and an Overlay on top]
Hierarchical Fabric Designs
  Underlay
    Nicely Structured and Tiered Topologies
    Allows Efficient Scale-Out
      More End-Points = More Leaf
      More Bandwidth, Resilience or Capacity = More Spine or Tiers
    Roles and Location change
      Leaf: Becomes more Intelligent
      Spine: Becomes less Intelligent (no touch-points)
      Super-Spine: Yet another Spine (next Tier for Capacity)
      Border: Your Way to the Rest of the World
  Overlay
    End-to-End, Flat, No Hierarchy
Building Underlay Hierarchies – Non Hierarchical Overlay
  "The Single"
    Single Overlay Domain: End-to-End Encapsulation
    Single Overlay Control-Plane Domain: End-to-End EVPN Updates
    Single Underlay Domain: End-to-End
    Single Replication Domain for BUM
    Single VNI Administrative Domain
"Fixing" the Overlays!
A Single Fabric Overlay – NOT a Problem
  [Diagram: Pod 1 with Spine and VTEP nodes under a single Overlay, connected to an External Layer-3 Network]
Two Fabric Overlay – NOT a Problem
  [Diagram: Pod 1 and Pod 2, each with Spine and VTEP nodes and its own Overlay, connected through an External Layer-3 Network]
A Tale of the Super-Spine – A Problem(?!)
  [Diagram: Pod 1 and Pod 2, with multiple Spine and VTEP nodes and an Overlay on top]
"The Multiple" – Multisite
  [Diagram: Site 1 and Site 2, each with Spine and VTEP nodes and its own Overlay, interconnected by Overlay Multisite]
Underlay Isolation – Overlay Hierarchies
  "The Multiple"
    Multiple Overlay Domains: Interconnected & Controlled
    Multiple Overlay Control-Plane Domains: Interconnected & Controlled
    Multiple Underlay Domains: Isolated
    Multiple Replication Domains for BUM: Interconnected & Controlled
    Multiple VNI Administrative Domains
Scale, Extend and Grow
Lifecycle
  [Diagram: timeline (t) of fabrics attached to an External Layer-3 Network; labels: Applications 1-100, Applications 101-200, Applications 201-300, Non Critical]
Lifecycle with Migration
  [Diagram: timeline (t) of fabrics interconnected by Overlay Multisite; labels: Applications 1-100, Applications 1-98,100, Applications 101-200, Applications 201-300, Non Critical, 99]
Availability, Change and Control
  Overlay Multisite
    Same, Similar or Different Fabrics Interconnected
    Extended Availability through Hierarchies
    Smaller Failure and Operation Domain
    Simpler Adoption of New Technology: non Critical Application Deployment
Architecture Overview
Multisite – Hierarchical Overlay Domains
  Multiple Overlay Domains: Interconnected & Controlled
  Scaling and Segregating VXLAN EVPN Networks
  [Diagram: Site 1 and Site n, each with Spine and VTEP nodes and its own Overlay, interconnected by Overlay Multisite across an External Layer-3 Network; Unicast traffic between Baremetal hosts in different sites]
Multisite – Introducing the Border Gateway
  Two Modes: Anycast Mode and VPC Mode
  Border Gateway (BG): Anycast Cluster
  Any VXLAN EVPN VTEP
  [Diagram: Site 1 (Border VIP 10.1.1.111) and Site n (Border VIP 10.2.2.222), each with Spine and VTEP nodes and its own Overlay, interconnected by Overlay Multisite across an External Layer-3 Network]
Multisite – Anycast vs VPC Border Gateway
  Anycast Border Gateway
    Up to 4 Border Gateways
    Border Leaf (G-Release) or Border Spine (G-MR)
    Common Virtual IP (VIP) across BG
    BGP-based DF election
    Single-Homed End-Point possible
  VPC Border Gateway
    2 Border Gateways
    Border Leaf (G-MR)
    VPC-based DF election
    Multi-Homed End-Points possible
    Easy Migration (Conversion)
  [Diagram: Site 1 and Site n, each with Spine and VTEP nodes and its own Overlay, interconnected by Overlay Multisite across an External Layer-3 Network]
Multisite – Border Gateway & VRF-Lite
  Co-Existence of Multisite and External Connectivity
  VRF-Lite for External Layer-3 Connectivity
  [Diagram: Site 1 and Site n interconnected by Overlay Multisite; VRF-A, VRF-B and VRF-C extended to the External Layer-3 Network]
Multisite – VXLAN Tunnel Adjacencies
  Multiple Overlay Control-Plane Domains: Interconnected & Controlled
  Contained Overlay Control-Plane Update Propagation

  BG102# show nve peers
  Interface  Peer-IP      VNI     Up Time
  ---------- -----------  ------  ----------
  nve1       10.1.1.111   30000   00:12:16
  nve1       10.2.2.222   30000   00:12:23
  nve1       10.1.1.1     30000   00:12:23

  Leaf1-1# show nve peers          (VTEP 10.1.1.1)
  Interface  Peer-IP      VNI     Up Time
  ---------- -----------  ------  ----------
  nve1       10.1.1.1     30000   03:18:06
  nve1       10.1.1.111   30000   00:12:23

  Leaf2-7# show nve peers          (VTEP 10.2.2.7)
  Interface  Peer-IP      VNI     Up Time
  ---------- -----------  ------  ----------
  nve1       10.2.2.7     30000   01:12:06
  nve1       10.2.2.222   30000   00:12:25

  [Diagram: Site 1 (Border VIP 10.1.1.111) and Site n (Border VIP 10.2.2.222), each with Spine and VTEP nodes and its own Overlay, interconnected by Overlay Multisite across an External Layer-3 Network]
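The point of the adjacency slide is that a leaf never builds a tunnel to a VTEP in another site; it only sees its local Border Gateway VIP. A minimal sketch of that containment rule follows. It is a simplified model, not a reproduction of the `show nve peers` output above: the `Site` class and both functions are hypothetical names, and the second leaf address in each site is borrowed from the addressing plan of the deck.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    leaf_vteps: tuple[str, ...]
    bgw_vip: str  # anycast VIP shared by the site's Border Gateways


def _by_address(addresses):
    # Sort numerically rather than as strings.
    return sorted(addresses, key=ipaddress.ip_address)


def leaf_peers(site: Site, leaf: str) -> list[str]:
    """Assumed rule: other local leaves plus the local BGW VIP only."""
    others = [v for v in site.leaf_vteps if v != leaf]
    return _by_address(others + [site.bgw_vip])


def bgw_peers(local: Site, sites: list[Site]) -> list[str]:
    """Assumed rule: local leaves plus the VIP of every remote site."""
    remote_vips = [s.bgw_vip for s in sites if s.name != local.name]
    return _by_address(list(local.leaf_vteps) + remote_vips)


site_1 = Site("Site 1", ("10.1.1.1", "10.1.1.2"), "10.1.1.111")
site_n = Site("Site n", ("10.2.2.6", "10.2.2.7"), "10.2.2.222")
sites = [site_1, site_n]

print(leaf_peers(site_1, "10.1.1.1"))  # ['10.1.1.2', '10.1.1.111']
print(bgw_peers(site_1, sites))        # ['10.1.1.1', '10.1.1.2', '10.2.2.222']
```

In this model the number of tunnels on a leaf depends only on the size of its own site, which is the scaling argument behind the hierarchical overlay.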
Multisite – Separated Underlay Domains
  Multiple Underlay Domains: Isolated
  Isolated Underlay Domains: No need for Extension

  Site 1 Underlay Routing Table
    Leaf:   10.1.1.1 10.1.1.2 10.1.1.3 10.1.1.4 10.1.1.5 10.1.1.6 10.1.1.7
    Border: 10.1.1.101 10.1.1.102 10.1.1.111

  Site n Underlay Routing Table
    Leaf:   10.2.2.1 10.2.2.2 10.2.2.3 10.2.2.4 10.2.2.5 10.2.2.6 10.2.2.7
    Border: 10.2.2.101 10.2.2.102 10.2.2.222

  [Diagram: Site 1 (Border VIP 10.1.1.111; Border PIP 10.1.1.101, 10.1.1.102; VTEP 10.1.1.1) and Site n (Border VIP 10.2.2.222; Border PIP 10.2.2.101, 10.2.2.102; VTEP 10.2.2.7), connected through an External Layer-3 Network]
Multisite – Inter Site Network

  Inside Inter-Site Network Routing Table
    Border Site1: 10.1.1.101 10.1.1.102 10.1.1.111
    Border Site2: 10.2.2.101 10.2.2.102 10.2.2.222

  [Diagram: Site 1 (Border VIP 10.1.1.111; Border PIP 10.1.1.101, 10.1.1.102; VTEP 10.1.1.1) and Site n (Border VIP 10.2.2.222; Border PIP 10.2.2.101, 10.2.2.102; VTEP 10.2.2.7), connected through the Inter Site Network]
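The two routing-table slides state that each site keeps its leaf addresses to itself and that only Border Gateway addresses (PIPs and VIP) appear in the Inter-Site Network. The sketch below derives the inter-site table from per-site definitions using the addresses shown on the slides. The class and function names are hypothetical, and the model deliberately ignores how the routes are actually exchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteUnderlay:
    leaf_pips: frozenset
    border_pips: frozenset
    border_vip: str

    def site_table(self) -> frozenset:
        # Everything a node inside the site can reach in the underlay.
        return self.leaf_pips | self.border_pips | {self.border_vip}

    def exported(self) -> frozenset:
        # Only Border Gateway addresses leave the site.
        return self.border_pips | {self.border_vip}


def inter_site_table(sites) -> frozenset:
    table = frozenset()
    for site in sites:
        table |= site.exported()
    return table


site_1 = SiteUnderlay(
    leaf_pips=frozenset(f"10.1.1.{i}" for i in range(1, 8)),
    border_pips=frozenset({"10.1.1.101", "10.1.1.102"}),
    border_vip="10.1.1.111",
)
site_n = SiteUnderlay(
    leaf_pips=frozenset(f"10.2.2.{i}" for i in range(1, 8)),
    border_pips=frozenset({"10.2.2.101", "10.2.2.102"}),
    border_vip="10.2.2.222",
)

core = inter_site_table([site_1, site_n])
print(sorted(core))
# No leaf address is visible in the Inter-Site Network.
print(core.isdisjoint(site_1.leaf_pips | site_n.leaf_pips))  # True
# The two site underlays share no addresses.
print(site_1.site_table().isdisjoint(site_n.site_table()))   # True
```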
Multisite – BUM Replication
  Multiple Replication Domains for BUM: Interconnected & Controlled
  Individual BUM flooding domain with Traffic control
  [Diagram: BUM traffic from a Baremetal host crossing from Overlay Site 1 to Overlay Site n through Overlay Multisite and the External Layer-3 Network]
Multisite – BUM Enforcement
  Storm Control
    Broadcast: 0-100%
    Unknown Unicast: 0-100%
    Multicast: 0-100%
    Layer-2 Multicast: 0-100%
  [Diagram: BUM traffic from a Baremetal host crossing from Overlay Site 1 to Overlay Site n through Overlay Multisite and the External Layer-3 Network]
Multisite – BUM Replication Modes (1)
  Overlay Multisite (Inter-Site): Ingress Replication
  Overlay Site 1: Multicast
  Overlay Site n: Multicast
  [Diagram: Site 1 and Site n, each with Spine and VTEP nodes, connected through an External Layer-3 Network]
Multisite – BUM Replication Modes (2)
  Overlay Multisite (Inter-Site): Ingress Replication
  Overlay Site 1: Ingress Replication
  Overlay Site n: Ingress Replication
  [Diagram: Site 1 and Site n, each with Spine and VTEP nodes]
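The difference between the two replication modes comes down to how many copies the sending VTEP has to produce. With ingress (head-end) replication the sender emits one unicast copy per peer in its flood list; with multicast it emits a single copy to a group and lets the underlay replicate it. The sketch below shows only that counting argument. The function names and the multicast group address are assumptions, and the flood list reuses addresses from the Site 1 slides.

```python
def ingress_replicate(frame: bytes, flood_list: list[str]) -> list[tuple[str, bytes]]:
    """Head-end replication: one unicast copy of the frame per remote peer."""
    return [(peer, frame) for peer in flood_list]


def multicast_replicate(frame: bytes, group: str) -> list[tuple[str, bytes]]:
    """Multicast: a single copy to the group; the underlay fans it out."""
    return [(group, frame)]


frame = b"broadcast-frame"

# Flood list of a Site 1 leaf: two other leaves plus the local Border VIP.
flood_list = ["10.1.1.2", "10.1.1.3", "10.1.1.111"]

ir_copies = ingress_replicate(frame, flood_list)
mc_copies = multicast_replicate(frame, "239.1.1.1")  # hypothetical group

print(len(ir_copies))  # 3
print(len(mc_copies))  # 1
```

Because the flood list of a leaf stops at the local Border Gateway, the copy count with ingress replication grows with the size of the site, not with the total number of VTEPs across all sites.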
Border Gateways Deployment Considerations
Border Gateways Deployment Considerations
  Border Gateways used for two main functions:
    Interconnecting each site to the Inter-Site network (for East-West traffic flows)
    Connecting each site to the external Layer 3 domain (for North-South traffic flows)
  May also be used to connect End-Points and/or network service nodes (FWs, ADCs)
  Two deployment models supported:
    Anycast Border Gateways
    VPC Border Gateways
  [Diagrams: Anycast Border Gateways and VPC Border Gateways, each showing Site 1 with VTEP and BGW nodes]
Anycast Border Gateways
Anycast Border Gateway (1)
  Cisco Live 2017 9/11/2018
  Anycast Border Gateway
    Up to 4 Border Gateways
  Border Gateway
    Deploying at Leaf: 7.0(3)I7(1)
    Deploying at Spine: 7.0(3)I7(2)
  [Diagram: Site 1 with VTEP and BGW nodes]
Anycast Border Gateway (2)
  Anycast Border Gateway
    Common Virtual IP (VIP) across BGW
      VIP is used for Intra- and Inter-Site Communication
      VIP for communication between the Border Gateways in different Sites
      VIP for communication between Border Gateway and Leaf within a Site
    Individual Primary IP (PIP) per BGW
      Used for Broadcast, Unknown Unicast and Multicast (BUM) replication
      PIP for communication with Single-Homed End-Points (routed only), intra- and inter-Site
  Site 1 addressing
    Border VIP: 10.1.1.111
    PIP-BGW1:   10.1.1.101
    PIP-BGW2:   10.1.1.102
    PIP-BGW3:   10.1.1.103
    PIP-BGW4:   10.1.1.104
  [Diagram: Site 1 with VTEP and four BGW nodes sharing the Border VIP]
Anycast Border Gateway (3)
  Per-VNI Designated Forwarder (DF) election
    Each BGW can serve as DF for a single Layer-2 VNI or a set of Layer-2 VNIs
    DF election and assignment is automatic
  Using BGP EVPN Route Type 4 for DF election
    Auto Generated MAC-based (Type: 03)
    Six Octet Site Identifier (System MAC: 00:00:00:00:00:01)
    Multisite Discriminator (Ethernet-Segment: 00:03:09)
    Originator's IP Address (PIP): 10.1.1.101
    Layer-2 VNI: 30010
  Example Route Type 4
    Type: 03 | System MAC: 00:00:00:00:00:01 | Ethernet Segment: 00:03:09 | IP: 10.1.1.101 | VNI: 30010
  [Diagram: Site 1 with VTEP and BGW nodes, one DF each for VNI 30010, 30011, 30012 and 30099; BGP EVPN sessions to the Spine RR]
Anycast Border Gateway (4)
  Per-VNI Designated Forwarder (DF) election
    Round Robin
    Ordinal List of Originator IP (PIP)
    Ordered List of Layer-2 VNI
      1st Configuration Order
      2nd Ordinal List
  Example Route Type 4
    Type: 03 | System MAC: 00:00:00:00:00:01 | Ethernet Segment: 00:03:09 | IP: 10.1.1.101 | VNI: 30010
  [Diagram: Site 1 with VTEP and BGW nodes, one DF each for VNI 30010, 30011, 30012 and 30099; BGP EVPN sessions to the Spine RR]
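The round-robin DF assignment described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the NX-OS algorithm: it orders the Border Gateway PIPs numerically, orders the Layer-2 VNIs numerically, and hands out VNIs to PIPs in turn. The slide also mentions configuration order as the first ordering criterion for VNIs, which this sketch does not model. The function name is hypothetical; the addresses and VNIs come from the slides.

```python
import ipaddress


def elect_designated_forwarders(pips: list[str], vnis: list[int]) -> dict[int, str]:
    """Assign each Layer-2 VNI to one BGW PIP, round robin.

    Assumption: PIPs and VNIs are both sorted in ascending numeric order.
    """
    ordered_pips = sorted(set(pips), key=ipaddress.ip_address)
    if not ordered_pips:
        raise ValueError("at least one Border Gateway PIP is required")
    ordered_vnis = sorted(set(vnis))
    return {
        vni: ordered_pips[index % len(ordered_pips)]
        for index, vni in enumerate(ordered_vnis)
    }


pips = ["10.1.1.101", "10.1.1.102", "10.1.1.103", "10.1.1.104"]
vnis = [30010, 30011, 30012, 30099]

for vni, df in elect_designated_forwarders(pips, vnis).items():
    print(vni, "->", df)
# 30010 -> 10.1.1.101
# 30011 -> 10.1.1.102
# 30012 -> 10.1.1.103
# 30099 -> 10.1.1.104

# If one Border Gateway disappears, the same rule redistributes the VNIs.
print(elect_designated_forwarders(pips[:3], vnis)[30099])  # 10.1.1.101
```

Because every Border Gateway learns the same set of PIPs from the Route Type 4 advertisements and applies the same deterministic rule, each one can compute the assignment independently and arrive at the same result.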
Anycast Border Gateway (5)
  Single-Homed End-Points (Orphan End-Points)
    Routed attachment only (Layer-3 Interface, physical or logical, e.g. SVI)
    Services Appliance (e.g. Firewall, ADC)
    Multi-Homed & Layer-2 attached End-Points in future
  Advertised and Reachable through Individual Primary IP Address (PIP)
    Intra-Site: Leaf nodes use PIP to reach End-Points connected to Border Gateways
    Inter-Site: Remote Border Gateways use PIP to reach End-Points connected to local Border Gateway
  Site 1 addressing
    Border VIP: 10.1.1.111
    PIP-BGW1:   10.1.1.101
    PIP-BGW2:   10.1.1.102
    PIP-BGW3:   10.1.1.103
    PIP-BGW4:   10.1.1.104
  Example route
    Type  IP / Length       L3VNI / RT           Next-Hop
    5     192.168.10.0/24   50001, 65599:50001   10.1.1.104
  [Diagram: Site 1 with VTEP and four BGW nodes; ADC 192.168.10.101 attached to a Border Gateway]
Feature Overview
VXLAN EVPN – Multisite
  Inter Site Network
    Border Gateway (BG) to Border Gateway (BG) reachability required
    Reachability Back-to-Back (full-mesh) or via Layer-3 transport network
    Any Routing Protocol for BG reachability
    IPv4 Unicast Transport (Ingress Replication)
    BGP full-mesh or Route-Server (eBGP "Route Reflector") for Overlay Control-Plane
  Multisite Border Gateway (BG)
    Seamless insertion into existing VXLAN EVPN Fabrics (Border Gateways require Nexus 9x00-EX/-FX)
    Layer-2 and Layer-3 extension to other Sites
    BGP- or VPC-based Border Gateway (BG) Cluster (up to 4 nodes when using BGP)
    All Border Gateways (BG) represent a common Anycast VTEP (VIP)
    Failure containment through Broadcast, Unknown Unicast and Layer-2 Multicast limiter (off or rate-based)
    Co-Existence with VRF-lite for External Connectivity
    Core and Fabric link tracking
  [Diagram: Site 1 and Site n, each with Spine and VTEP nodes, connected through an External Layer-3 Network]
VXLAN EVPN – Multisite
  NX-OS 7.0(3)I7(1)
  Layer-2 and Layer-3 extension across VXLAN EVPN Sites
    Intra-Site using Multicast or Unicast (IR)
    Inter-Site using Unicast (Ingress Replication)
  Up to 4 Border Gateways (BG) per Site / 10* Sites Total
    BGP-based Anycast BG
    VPC-based Anycast BG (G-MR)
  Border Gateway
    Nexus 9300-EX/-FX
    Nexus 9500-EX/-FX - 7.0(3)I7(2)
  Control- and Data-Plane Separation
    Re-encapsulation at Border Gateway (Layer-2 and Layer-3)
    Control-Plane Update suppression
    Split Horizon
  Traffic Control of Broadcast, Unknown Unicast and Layer-2 Multicast (Disable or Rate-based)
  Uplink/Downlink tracking
  Symmetric VNI (same VNI)
  Co-Existence with VRF-Lite
  VXLAN OAM across Sites
  *Target
VXLAN EVPN – Multisite in the IETF
  New IETF Draft for Multisite Design
  Multisite EVPN based VXLAN using Border Gateways
  https://tools.ietf.org/html/draft-sharma-multi-site-evpn