Washington WASHINGTON UNIVERSITY IN ST LOUIS Substrate Control: Overview Fred Kuhns Applied Research Laboratory Washington University in St. Louis
2 Washington WASHINGTON UNIVERSITY IN ST LOUIS Fred Kuhns - 12/11/2015 Defining Terms and Models
The SPP Node
Slice instantiation:
–Allocate a virtual machine (VM) instance on a GPE
–May request a code option instance, NPE resources and bandwidth
Slices share a common set of (global) IP addresses
–The UDP/TCP port space is shared across GPEs and NPEs
Line card TCAM filters direct traffic
–Unregistered traffic originating outside the node is sent to the CP.
–Unregistered traffic originating within the node uses NAT (on the line card).
–An application may register server ports. This causes a filter to be inserted in the line card, directing traffic to a specific GPE.
–An application must register ports (or tunnels) associated with fast path instances.
It is assumed that fast path instances will use tunnels (overlays) to send traffic between routing nodes.
–Currently we only support UDP tunnels but will extend to include GRE and possibly others.
[Figure: SPP node block diagram. Labels: GPE (RMP, NMP, PlanetLab OS, VM x, app); NPE (SRAM, TCAM, SCD, mi-mux, code option FP x); LC connected to the Internet, with ingress "map flow to internal destination", egress "IP route table and ARP", and SCD (ARP, NAT). Local delivery/exceptions use an internal UDP tunnel.]

Meta-Interfaces and Tunnels
A slice fast path (code option instance plus allocated resources) is assumed to sit at one end of a tunnel
–Currently only UDP tunnels are supported.
–A UDP tunnel is defined by the 4-tuple
  UDP tunnel: {peer ipaddr, peer port, local ipaddr, local port}
–Meta-interface or MI: represents a tunnel endpoint as viewed by a slice’s fast path router. A meta-interface is defined by the local endpoint’s address
  Meta-Interface: {local ipaddr, local UDP port}
The encapsulated packet is processed by the fast path.
–The packet is always encapsulated within a tunnel by the substrate.
–The code option instance processes the encapsulated frame.
In the SPP context, the slice registers the MI and the substrate manages the encapsulation headers:
–Guards against forging the source address.
–A filter is installed in the corresponding line card’s TCAM to send matching packets to the correct NPE.
–The NPE’s decap module verifies the encapsulation header and provides isolation between slices (based on the local IP and port number values in the tunnel header).
–Fabric VLANs are used to provide link level isolation between slice instances. The VLAN label is also used by the substrate to associate packets with slice fast paths.
[Figure: fast path (FP x) with its meta-interfaces. MI: local tunnel endpoint (UDP), {external ipaddr, udp_port}. Table columns: MI, IP Address, UDP Port.]

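The relationship between a tunnel and its meta-interface can be sketched as two small records. This is only an illustration of the definitions above, not substrate code: the class names and the example addresses (taken from the RFC 5737 documentation ranges) are assumptions.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class MetaInterface:
    """Local tunnel endpoint as seen by the slice: {local ipaddr, local UDP port}."""
    local_ip: IPv4Address
    local_port: int


@dataclass(frozen=True)
class UdpTunnel:
    """UDP tunnel 4-tuple: {peer ipaddr, peer port, local ipaddr, local port}."""
    peer_ip: IPv4Address
    peer_port: int
    local_ip: IPv4Address
    local_port: int

    def meta_interface(self) -> MetaInterface:
        # The MI keeps only the local half of the 4-tuple.
        return MetaInterface(self.local_ip, self.local_port)


# Hypothetical endpoints: two tunnels to different peers, same local endpoint.
local = IPv4Address("192.0.2.1")
t1 = UdpTunnel(IPv4Address("198.51.100.7"), 30000, local, 20000)
t2 = UdpTunnel(IPv4Address("203.0.113.9"), 30001, local, 20000)
same_mi = t1.meta_interface() == t2.meta_interface()
```

Because the MI is defined by the local endpoint alone, both tunnels in the sketch map to the same meta-interface even though their peers differ.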
Lookup Table, TCAM, Use

Lookup Filters: Key, Action and Result
A lookup key is created from the packet’s header fields and the receiving meta-interface
–The code option extracts fields from the encapsulated packet.
–The substrate adds the receiving meta-interface identifier.
If no entry is found, the packet’s no_route exception attribute is set; otherwise a result is returned containing an action field and forwarding information (output meta-interface and next hop address)
–A code option may define additional exception attributes.
The complete filter specification: {lookup_key, result_vector}
lookup_key: {RxMI, *copt_key}
–RxMI: ID of the meta-interface on which the packet was received.
–copt_key: Lookup key defined by the code option. The IPv4 key: {daddr(32), saddr(32), sport(16), dport(16), tcp_flgs(8), proto(8)}
result_vector: {sindx, action[, qid, TxMI, nexthop]}
–sindx: stats index
–action: packet disposition, one of {drop, fwd, ld}
  drop: drop packet
  fwd: forward packet using next hop value (fwdkey)
  ld: local delivery, code option instance has local address information??
–qid: packet queue
–TxMI: Meta-interface used for sending the packet; corresponds to a previously registered local tunnel endpoint. Used to fill in the local address of the outgoing packet’s tunnel header.
–nexthop: Tunnel endpoint for the next hop. For UDP tunnels, this is the IP address and UDP port number of the next hop device.

Slice View of the Lookup Key
When a packet is received, the substrate creates a lookup key using the target slice’s xsid and the receiving meta-interface. The remaining bits are defined by the code option.
–xsid’: represents the internal slice ID and may differ from the value of xsid. For implementation efficiency, this is the VLAN identifier assigned to the slice.
–xmi: internal representation of the meta-interface (MI), an encoding of the receiving tunnel endpoint. For UDP tunnels this field includes a 4-bit interface id and the 16-bit local UDP port number. The 4-bit id is used as an index into a table of local IP addresses.
The IPv4 code option defined fields are shown below, where pr is the IP protocol field and tcp is the TCP header flags.
[Figure: lookup key layout with fields: slice defined fields, xmi, xsid’. Width labels as extracted: 128-N, N, 12. User specified lookup key ( bit words).]

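A minimal sketch of the xmi encoding for UDP tunnels follows. The slide gives only the field widths (4-bit interface id, 16-bit local UDP port); placing the id in the high bits, the function names, and the example address table are assumptions made for illustration.

```python
# Hypothetical table of local IP addresses, indexed by the 4-bit interface id.
LOCAL_IPS = ["192.0.2.1", "192.0.2.2"]


def encode_xmi(ifid: int, port: int) -> int:
    """Pack a 4-bit interface id and a 16-bit UDP port into a 20-bit value.

    Assumed layout: interface id in bits 19..16, port in bits 15..0.
    """
    if not 0 <= ifid <= 0xF:
        raise ValueError("interface id must fit in 4 bits")
    if not 0 <= port <= 0xFFFF:
        raise ValueError("UDP port must fit in 16 bits")
    return (ifid << 16) | port


def decode_xmi(xmi: int) -> tuple:
    """Split an xmi value back into (interface id, UDP port)."""
    return (xmi >> 16) & 0xF, xmi & 0xFFFF


def xmi_to_endpoint(xmi: int, ip_table: list) -> tuple:
    """Resolve an xmi to the meta-interface {local ipaddr, local UDP port}."""
    ifid, port = decode_xmi(xmi)
    return ip_table[ifid], port
```

For example, under this assumed layout interface id 1 with port 5050 (0x13BA) encodes to 0x113BA, and `xmi_to_endpoint` recovers the address/port pair through the index table.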
IPv4 TCAM Filter Formats (on NPE)
[Figure: bit layouts of the TCAM key and the 64-bit result; most bit-width labels did not survive as separate fields.]
Key:
–Substrate defined fields: vlan, T (1 bit), RX port. The RX port field represents the input meta-interface.
–If T = 0: normal lookup; T = 1: substrate only lookup.
–Fields defined by the IPv4 Code Option, 112 bits: daddr, saddr, sport, dport, tcp/proto.
Result, 64 bits:
–Flags: D: drop packet; L: local delivery.
–sindx: global stats index (SCD maps slice’s sindx to global value).
–QM, Sch, qid: internal qid (SCD maps slice’s miid to QM and Sch. SCD also maps slice’s qid to global qid value).
–TX IP daddr, TX dport, TX sport: the TX IP address and sport represent the output meta-interface. The dport is provided by the slice. (RMP maps miid to tx tunnel params, use dport provided by slice.)
Slice parameters:
–Key: input miid, IPv4 fltr {daddr, saddr, sport, dport, tcp/proto}
–Result: flags {Drop, GPE}, sindx, output miid, QID

Lookup
The parse block makes the copt_key. The substrate adds the xsid and xmi fields.
The substrate uses the TxMI and nexthop fields to construct the encapsulation header.
[Figure: lookup pipeline. decap and parse block (packet annotations: {xsid, RxMI}); Lookup A with key xsid:RxMI:copt_key (fields: slice defined fields, xmi, xsid’); result sindx;action:qid:TxMI:nexthop; TxMI:nexthop passed on for header construction.]

Version 2 and Multicast
In version 2 there will be 2 stages to the lookup.
–Lookup A: lookup_key gives action:sindx:rindx (result_index).
–Lookup B: rindx gives sindx:qid:TxMI:nexthop; sindx is passed from side A.
Add fanout (count) to Lookup B: if fanout > 1 then the result is overloaded with the address of the fanout table, else it is the result vector.
–Fanout table entries: qid:TxMI:nexthop
–Chain fanout blocks.
TxMI includes an interface vector: a 4-bit field that is used to look up the interface IP address and MAC address.
VLAN table in header format and VLAN table in Decap/Parse.
[Figure: decap and parse block (packet annotations: {xsid, RxMI}) feeding Lookup A (fields: slice defined fields, xmi, xsid’), then Lookup B and the fanout table.]

Lookup Example
When a code option is requested, the slice is allocated the requested number of TCAM entries; fid ∈ {0, ..., N_f - 1}
–All TCAM operations accept a TCAM entry ID (fid).
–Entries are listed in priority order, with fid = 0 the highest priority and entry N_f - 1 the lowest. It is up to the slice control path to order the lookup entries.
–For example, if we have the simple routing database:
   /32  Local delivery (GPE)
   /24  NH A
   /24  NH B
   /16  NH C
Then the control software could use the following:
write_fltr(fid, rxmi, {prefix,width}, action, {qid,TxMI,nexthop})
write_fltr(0, *, { , 0xFFFFFFFF}, LD)
write_fltr(1, *, { , 0xFFFFFF00}, fwd, {1, 1, NHA})
write_fltr(2, *, { , 0xFFFFFF00}, fwd, {2, 2, NHB})
write_fltr(3, *, { , 0xFFFF0000}, fwd, {3, 3, NHC})
Desired Route Table (LPM):
prefix  TxMI  nexthop
 /32    0*    Local
 /24    1     NH A
 /24    2     NH B
 /16    3     NH C
[Tables: Slice Meta-Interfaces (MI, IP Address, UDP Port); Slice Queue Bindings (QID, Interface, BW, max Bytes); Slice BW Allocations (Interface, BW, ipAddr).]

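The priority-ordered matching described on this slide can be imitated in software as a first-match scan over entries indexed by fid. The sketch below is not the control API: the `FilterTable` class, its simplified `write_fltr` signature (no rxmi field), and the 10.1.x.x prefixes are all hypothetical stand-ins, since the slide's own addresses are not available.

```python
from ipaddress import IPv4Address


class FilterTable:
    """Software stand-in for a slice's TCAM entries; fid 0 has highest priority."""

    def __init__(self, num_filters: int):
        self.entries = [None] * num_filters

    def write_fltr(self, fid, prefix, mask, action, result=None):
        # result is the optional {qid, TxMI, nexthop} triple.
        self.entries[fid] = (int(IPv4Address(prefix)), mask, action, result)

    def lookup(self, daddr: str):
        addr = int(IPv4Address(daddr))
        for entry in self.entries:          # scan in priority (fid) order
            if entry is None:
                continue
            value, mask, action, result = entry
            if (addr & mask) == (value & mask):
                return action, result
        return "no_route", None             # no entry matched


# Hypothetical routing database, most specific prefix first.
table = FilterTable(4)
table.write_fltr(0, "10.1.2.3", 0xFFFFFFFF, "ld")
table.write_fltr(1, "10.1.2.0", 0xFFFFFF00, "fwd", (1, 1, "NHA"))
table.write_fltr(2, "10.1.3.0", 0xFFFFFF00, "fwd", (2, 2, "NHB"))
table.write_fltr(3, "10.1.0.0", 0xFFFF0000, "fwd", (3, 3, "NHC"))
```

Because the /32 entry sits at fid 0, a packet to 10.1.2.3 is delivered locally even though it also matches the /24 and /16 entries; swapping the order would silently change the result, which is why ordering is left to the slice control path.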
Example IPv4 LPM
In general, for longest prefix match (LPM) a good strategy is to divide the allocated filters into 32 sets.
For example, assume 1024 TCAM entries have been allocated and we are using LPM.
–Divide the filters into 32 sets of 32 filters each and associate a prefix length with each set:
  Prefix Width    Filter ID Range
  w               (32-w)*32 + (0...31)
–Then add each prefix to the set for its prefix width.
–Entries within a set are non-overlapping so their order doesn’t matter.
–This is the scheme used by software written by IDT, the manufacturer of the TCAM we currently use.

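The width-to-range rule can be worked through in a few lines. This sketch assumes prefix widths 1 through 32 (with 32 sets of 32 filters, width 0 would fall outside the 1024 allocated entries), and the `LpmFilterAllocator` class is a hypothetical helper, not the IDT software mentioned above.

```python
from ipaddress import ip_network

SET_SIZE = 32  # filters per prefix-width set (1024 entries / 32 sets)


def fid_range(width: int) -> range:
    """Filter IDs reserved for a prefix width: (32 - w) * 32 + (0...31)."""
    if not 1 <= width <= 32:
        raise ValueError("sketch assumes prefix widths 1..32")
    start = (32 - width) * SET_SIZE
    return range(start, start + SET_SIZE)


class LpmFilterAllocator:
    """Picks the first free filter ID inside the set for a prefix's width."""

    def __init__(self):
        self.in_use = {}  # fid -> network

    def add(self, cidr: str) -> int:
        net = ip_network(cidr)
        for fid in fid_range(net.prefixlen):
            if fid not in self.in_use:
                self.in_use[fid] = net
                return fid
        raise RuntimeError("no free filter for prefix width %d" % net.prefixlen)
```

Longer prefixes land at lower filter IDs, so with fid 0 as the highest priority the TCAM's first match is also the longest match: /32 entries use IDs 0 to 31, /24 entries use 256 to 287, and /1 entries use 992 to 1023.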
Keeping Track of TCAM Entries
The substrate will have to manage the mapping of VM TCAM filter IDs to the actual filter IDs.
VM control software will use a normalized filter index list (it starts at 0 and has the requested number of filter entries).
The SCD (xscale daemon) must map the per-VM index into the actual TCAM index.
NPU A and B share a common TCAM and index range, so this must be managed across the two xscales.
Source for managing TCAM entries:
–See the C++ implementation of the RangeMap class in $WUSRC/range
–The class will also be used for managing the QID name space.

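One way to picture the per-VM to actual index mapping is a shared pool that hands each slice a list of real indices, addressed by the slice's normalized index. The sketch below is an assumption-laden illustration in Python; it is not the RangeMap class referenced above, and the `IndexPool` name and its methods are hypothetical.

```python
class IndexPool:
    """Maps each slice's normalized indices (0..n-1) to real table indices."""

    def __init__(self, size: int):
        self.free = list(range(size))   # real indices not yet assigned
        self.maps = {}                  # xsid -> list of real indices

    def allocate(self, xsid: int, count: int) -> None:
        if xsid in self.maps:
            raise ValueError("slice already has an allocation")
        if count > len(self.free):
            raise RuntimeError("not enough free entries")
        self.maps[xsid] = self.free[:count]
        del self.free[:count]

    def to_real(self, xsid: int, index: int) -> int:
        """Translate a slice-relative index into the real index."""
        table = self.maps[xsid]
        if not 0 <= index < len(table):
            raise IndexError("index outside the slice's allocation")
        return table[index]

    def release(self, xsid: int) -> None:
        self.free.extend(self.maps.pop(xsid))
        self.free.sort()
```

The bounds check in `to_real` is what keeps one slice from addressing another slice's entries. After a release, a later allocation may receive a non-contiguous set of real indices while the slice still sees a dense 0-based list; the same structure could serve the QID name space.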
Control Software: Resource Management

[Figure: control software components. CP: SRM (System Resource Manager) with Resource DB, and SNM. GPE: RMP, NMP, PlanetLab OS (root context, vnet), VM x (control). NPE: fast paths FP k and FP x, SRAM, TCAM, SCD, MUX. LC: SCD, TCAM. Annotations: node components not in the hub (switch, GPEs, development hosts); exception and local delivery traffic includes a shim header with RxMI; support fast path configuration via the PLC; SP.]

Partitioning of (Substrate) Responsibilities
Virtual Machine (slice control SW): application logic, code option specific control and data operations.
–traditional PlanetLab slice operations
–manages code option specific lookup tables, stats, memory and configuration blocks
–implements the interface with the fast path for exception and local delivery traffic
vnet
–flow isolation: filtering traffic through the Linux kernel
–add support for VLAN-based filtering and port reservation
Resource Manager Proxy (RMP, aka Local Resource Manager)
–all VM commands are issued to the RMP; the RMP is able to
  validate the command sender (authenticate)
  enforce access restrictions (authorize)
–decouples VMs from substrate control entities. That is, it maps exported abstractions and interfaces to specific hardware and software interfaces.
–verifies (or inserts) substrate message header slice IDs to prevent deliberate or accidental masquerading, as part of ensuring isolation and security.
–in tandem with the SRM, implements device independent logic
System Resource Manager (SRM)
–device independent logic
–responsible for implementing and enforcing
  system resource abstractions
  resource isolation and allocation policies
–facilitating SNM: implementing PlanetLab compatible behavior and abstractions
Substrate Control Daemon (SCD)
–intermediary between VM and code option instances (vouches for the VM)
–enforces policies on resource allocations and isolation in the control plane
–implements device dependent logic

Responsibilities
[Figure: tables held by each control component.
SRM (the “Decider”): System tables: Interfaces ifn:{type,ipaddr,linkBW,availBW}; NPE Table id:{addr,BW/Port,copts,fltrs,sram,Qs}; VLAN maps range:{start,end}, vlanid:xsid; endpoint (port) maps resvMap, availMap, usedMap, sxsidMap. Per Slice Tables: xsid, vlan, plab sliceID; meta-ifaces mi:endpoint; endpoints id:{type,ipaddr,port,proto,board,bw}; gpe {board id, BW}; NPE (allocated) {board ID, BW, sram {start,size}, #flts, #Qs, #Stats}.
SCD (NPE): Per Slice data xsid:{qidMap,FidMap,statsMap}; Interface BW; Slice Maps; Slice SRAM Assignments xsid:{sram_start,sram_size}. Tables in data path: SRAM base (xsid:size, xsid:offset); Lookup Table xsid:range; Queue Params xsid:range; Stats Table xsid:range; VLAN Table vlan, copt:sram_addr; per interface scheduler and rate limits. Ranges are not required to be contiguous. Slice sid, fid and qid each map to a “real” indx. HF Control Block? code option control blocks?
GPE (RMP): servMap, resvMap, endpoint (port) maps; controlIP, BW maps??
Arrows: request allocation, make allocation.]
RMP Responsibilities
–Translate slice MI to local endpoint. Either call SRM or cache mappings.
–Add xsid to subMsg header.
–Pass through identifiers mapped by SCD: qid, fid and stats.
–Pass through relative queue weights; SCD maps to global weight.
SCD Responsibilities
–Translate slice specific indices to global indices: qid, fid and stats.
–Knows the location of all tables.
–Interprets commands to add, remove and modify entries in data path tables.
–Knows per slice interface BW allocation and maps relative queue weight to global weight. Each interface scheduler is assigned (by SRM) a max rate.

Queuing and Allocating Interface Bandwidth

Simple Queuing Example
[Figure: on the NPE, FP slice 1 has queues q10, q11, ... q1n’ (qid in 0...n-1) and FP slice 2 has queues q20, q21, ... q2m’ (qid in 0...m-1); each slice's queues are served by a wrr scheduler at rates BW11 and BW21, with BW11 + BW21 = BW1. A further wrr stage feeds the physical port (linkBW, ipAddr) on the LC. GPE FP 1 and GPE FP 2 are also shown.]
Slice Interface and Queue Allocations: {Port, BW, QList}, QList = {{qid, weight, threshold}, ...}
Physical Port (Interface) Attributes: {ifn, type, ipaddr, linkBW, availBW}
–ifn: interface number
–type: {Internet, Peering}
Operations:
–get_interfaces()
–get_ifattrs(ifn)
–get_ifpeer(ifn)
–alloc_ifbw(ifn, xsid, bw)

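The bandwidth bookkeeping implied by linkBW, availBW and alloc_ifbw can be sketched as follows. This is a guess at the accounting, not the SRM's implementation: the `Interface` class, the boolean return value, and the proportional `queue_rates` helper (one plausible reading of mapping relative queue weights onto a slice's share) are all assumptions.

```python
class Interface:
    """Physical port with attributes {ifn, linkBW, availBW} plus per-slice shares."""

    def __init__(self, ifn: int, link_bw: int):
        self.ifn = ifn
        self.link_bw = link_bw
        self.avail_bw = link_bw
        self.slice_bw = {}  # xsid -> allocated bandwidth

    def alloc_ifbw(self, xsid: int, bw: int) -> bool:
        """Reserve bw for a slice; refuse if it exceeds what is still available."""
        if bw <= 0 or bw > self.avail_bw:
            return False
        self.slice_bw[xsid] = self.slice_bw.get(xsid, 0) + bw
        self.avail_bw -= bw
        return True


def queue_rates(slice_bw: float, weights: dict) -> dict:
    """Split a slice's interface bandwidth across its queues by relative weight."""
    total = sum(weights.values())
    return {qid: slice_bw * w / total for qid, w in weights.items()}
```

With a 1000-unit link, a 400-unit allocation succeeds and leaves 600 available, so a later 700-unit request is refused; the sum of slice allocations can never exceed linkBW, mirroring BW11 + BW21 = BW1 in the figure.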
Substrate Message Format

Substrate Message
[Figure: message layout: msg header with fields mlen, mid, cmd, cid, followed by body: 0-N (B).]
Header fields (the 4 header fields are each 16 bits):
–mlen: Total message length, including the header.
–mid: Message ID, used to support synchronous message processing.
–cid: Context identifier. Specifies the context within which the message is processed. A value of 0 indicates substrate context.
–cmd: Command to execute or a return code.
body: 0 or more bytes of command data.
Assume a simple command/response (two-way) messaging framework, but one-way schemes will also be supported.
Supports asynchronous communications using a message ID.
The command field is overloaded for the return code.
Every server is expected to implement a simple Version command (cmd == 0) which returns the server’s ID and version number as two 32-bit fields.
–The primary use is for monitoring the health of servers and debugging.
–All other command values are unique only to a particular server.
Uses UDP as the transport protocol.
All commands are expected to be idempotent.

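A minimal sketch of packing and parsing this header with Python's `struct` module follows. The four 16-bit fields and the Version reply come from the slide; the on-wire field order used here (mlen, mid, cid, cmd, following the order of the field list) and big-endian encoding of the header are assumptions, as are the function names.

```python
import struct

# Assumed wire order: mlen, mid, cid, cmd; each an unsigned 16-bit big-endian field.
HEADER = struct.Struct("!HHHH")
CMD_VERSION = 0


def pack_msg(mid: int, cid: int, cmd: int, body: bytes = b"") -> bytes:
    """Build a substrate message; mlen counts the header plus the body."""
    mlen = HEADER.size + len(body)
    return HEADER.pack(mlen, mid, cid, cmd) + body


def unpack_msg(data: bytes) -> tuple:
    """Return (mid, cid, cmd, body), checking mlen against the datagram size."""
    mlen, mid, cid, cmd = HEADER.unpack_from(data)
    if mlen != len(data):
        raise ValueError("mlen does not match datagram length")
    return mid, cid, cmd, data[HEADER.size:]


def version_reply(mid: int, cid: int, server_id: int, version: int) -> bytes:
    """Reply to the Version command: return code 0, then two 32-bit fields."""
    return pack_msg(mid, cid, 0, struct.pack("!II", server_id, version))
```

Echoing the request's mid in the reply is what lets a client match responses to outstanding requests over UDP, and idempotent commands make it safe to resend a request whose reply was lost.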
Overview
In the interface specifications I provide a C-like description of the operations and results. The descriptions are only intended to describe the actual message format, data fields and returned results. They are not meant to specify an application level library.
The arguments are to be encoded into the message body in the order in which they are given, using network byte order (big endian) and without padding.
Every command results in one of the following:
1. No return response: one-way call semantics.
2. An error occurs processing the message, or the command encounters an unexpected condition or error. In this case the return message will have the error return code in the cmd field.
3. The command completes and does not indicate an error to the message framework. The message result code then indicates success, and the message body contains any result data.

Example Message
A slice with an xsid of 0x10 requests the allocation of a global UDP (IP protocol number 17) port for the local IP address 0x80FC8222, letting the system assign the next available port number.
–Assume the alloc_port command ID is 4.
port = alloc_port(0x80FC8222, 0, 17)
The resource manager allocates port 5050 (0x13BA); the return code of 0 indicates success.
[Figure: byte layouts of the Command Message and the Reply Message.]

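The alloc_port exchange can be encoded by hand to see what goes on the wire. Several details below are assumptions rather than facts from the slide: the header field order (mlen, mid, cid, cmd), carrying the slice's xsid in cid, a message ID of 1, argument widths of 32 bits (address), 16 bits (port) and 8 bits (protocol), and a reply body that echoes the arguments with the assigned port filled in.

```python
import struct

# Assumed layout: 4 x 16-bit header fields, then ipaddr(32), port(16), proto(8),
# all in network byte order and without padding.
ALLOC_PORT_MSG = struct.Struct("!HHHHIHB")

XSID = 0x10            # slice context, assumed to travel in the cid field
CMD_ALLOC_PORT = 4     # command ID given in the example
MID = 1                # hypothetical message ID
LOCAL_IP = 0x80FC8222
PROTO_UDP = 17

# Command: port 0 asks the system to pick the next available port.
command = ALLOC_PORT_MSG.pack(
    ALLOC_PORT_MSG.size, MID, XSID, CMD_ALLOC_PORT, LOCAL_IP, 0, PROTO_UDP
)

# Reply: cmd carries return code 0 (success); the port is now 5050 (0x13BA).
reply = ALLOC_PORT_MSG.pack(
    ALLOC_PORT_MSG.size, MID, XSID, 0, LOCAL_IP, 5050, PROTO_UDP
)

mlen, mid, cid, rc, ipaddr, port, proto = ALLOC_PORT_MSG.unpack(reply)
```

Under these assumptions both messages are 15 bytes long (8 header bytes plus 7 body bytes), so mlen is 0x000F, and the reply differs from the command only in the cmd field and the port value.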
NAT

Problem:
–UDP, TCP: 2 or more GPEs attempt to use the same global IP, port and proto
–ICMP: ???