1 © 2002, Cisco Systems, Inc. All rights reserved. Cisco Confidential Fault Management for Provider Bridges Ali Sajassi & Norm Finn June 5, 2003 Cisco.

2 Agenda: OAM Layering, OAM Functionality, OAM Message Types, Putting the pieces together, OAM Message Format

3 Layered Approach Reference Model [Figure: reference model showing the endpoints and connecting points of the corresponding levels for the Ethernet Virtual Connection / Trail and the Transport Virtual Connection / Trail]

4 Some Terminology An Ethernet Trail spans multiple Ethernet connecting points (e.g., multiple Ethernet switches). A Transport Trail spans multiple transport connecting points. Examples of Transport Trails are MPLS PW, ATM VC, FR VC, etc.

5 OAM Types CEs: Customer Equipment. PBs: Provider Bridges serve as Ethernet connecting points (they can also terminate an Ethernet Trail). Ps: Provider nodes serve as transport connecting points. [Figure: CEs attached to PBs across Network A and Network B, with P nodes in between; labels: Ethernet Link, Ethernet VC / Trail, Ethernet Virtual Connection / Trail]

6 OAM Types - Continue 1. Ethernet Service Level (or Ethernet Trail) OAM: needed for 802.1ad. 2. Transport Level OAM: defined by the underlying transport, e.g., by the IETF if the transport is MPLS/IP. 3. Ethernet Link OAM: defined by 802.3ah.

7 Ethernet OAM Trails within a Provider’s Network PB: Provider Bridge. u-PE: User-facing PE. n-PE: Network-facing PE. P: Provider Node. xPE devices have Ethernet switching capability. [Figure: Ethernet Trails across Network A between PBs, passing through u-PE, n-PE, and P nodes]

8 Ethernet Trail OAM Types Two general types: – End-to-end (e.g., between Edge PBs) – Segment (between any two PBs along the end-to-end path)

10 OAM Scope (FCAPS): Fault Management, Configuration Management, Accounting Management, Performance Management, Security Management

11 OAM Scope - Continue Fault Management – Fault Notification – Fault Isolation – Fault Detection – Fault Recovery Scope is limited to the first three bullets

12 Fault Management – ATM Example Fault Detection – Continuity Check for soft failure Fault Notification – Alarm Indication Signal (AIS) and Remote Defect Indication (RDI) Fault Isolation (Connectivity Verification) – Loopback test Fault Recovery – Automatic Protection Switching (APS)

13 Fault Detection: ATM Continuity Check The Tx endpoint sends a cell periodically. The Rx endpoint can distinguish between an idle connection and a failed connection. The Rx node generates an RDI in the upstream direction if it doesn’t receive a CC within a time interval. [Figure: ATM CC sent from End point 1 through Nodes 1 to 4 to End point 2; when the CC is lost, an RDI is returned upstream]

14 Fault Notification: ATM AIS + RDI The downstream node adjacent to the failure generates an AIS signal to indicate a failure upstream. The receiving endpoint generates an RDI signal for the upstream endpoint (this is an end-to-end OAM message). [Figure: Nodes 1 to 4 with Failure-A and Failure-B; AIS-A and AIS-B travel downstream from each failure, RDI-A and RDI-B travel upstream]

15 Fault Isolation: ATM Loopback Test The Tx endpoint sends either a segment or an end-to-end Loopback cell. In segment loopback, Rx connecting points that are not the loopback destination extract the cell and forward it. In end-to-end loopback, Rx connecting points relay the OAM cell transparently. The Rx endpoint extracts and processes the OAM cell and, after modifying it, sends a response in the opposite direction.

16 Ethernet Trail OAM Objectives – To follow a proper layering structure – To avoid duplicating functionality in other layers – To have minimum impact on existing IEEE standards

17 Ethernet Service Level (Trail) OAM Fault Detection – Continuity Check to detect soft failure (mis-config and software failure) as well as hard failure (node or link failure) Fault Notification – AIS/RDI ? Fault Isolation – Loopback & Traceroute Fault Recovery – Spanning Tree Protocol (STP)

18 Fault Detection – Continuity Check u-PE, Agg, and n-PE are all Provider Bridges (PBs). Each u-PE periodically broadcasts a CC message to all members of a service instance. [Figure: Islands A, B, and C, each with CE, u-PE, Agg, and n-PE devices, joined by LAN Emulation for the blue VPLS]

19 Fault Detection – Continuity Check CC messages are generated only by u-PEs, periodically (e.g., once a minute). [Figure: CC messages flowing between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS]

20 Fault Detection – Continuity Check
Description:
– A heartbeat message is sent periodically from the end points, similar to ATM CC. If no heartbeat message is received within a given interval, a soft or hard failure is detected.
Operation:
– The Edge PB generates an “All Bridge” broadcast message as a heartbeat on a per customer service instance basis (e.g., per P-VLAN). It is assumed that a PB, besides forwarding the message, also passes it to its control plane; if not, a special multicast address needs to be used for it.
Characteristics:
– It is intended only for Edge Provider Bridges (E-PBs), e.g., the bridges with a UNI for a given customer service instance
– It is used to detect both soft and hard failures
– This message needs to be blocked at the customer UNI
– Relative to the loopback function, it is a light-load OAM function, and no response is required from the far-end E-PB
– Intermediate nodes use this message to remember the E-PBs’ MAC addresses
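The heartbeat timing described on this slide can be sketched as a small per-peer timer table. This is only an illustration: the class name `CcMonitor`, the 60-second interval, and the three-missed-heartbeats threshold are assumptions, not values defined in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class CcMonitor:
    """Tracks the last heartbeat (CC) seen from each far-end Edge PB."""

    interval_s: float = 60.0   # assumed CC transmission interval
    loss_threshold: int = 3    # assumed number of missed CCs before a fault is declared
    last_seen: dict = field(default_factory=dict)

    def on_cc(self, source_mac: str, now: float) -> None:
        # Receiving a CC refreshes the timer for that E-PB; no response is sent.
        self.last_seen[source_mac] = now

    def failed_peers(self, now: float) -> list:
        # A peer is suspect once no CC has arrived within the timeout window.
        timeout = self.interval_s * self.loss_threshold
        return sorted(mac for mac, t in self.last_seen.items() if now - t > timeout)


monitor = CcMonitor()
monitor.on_cc("00:00:5e:00:53:01", now=0.0)
monitor.on_cc("00:00:5e:00:53:02", now=0.0)
monitor.on_cc("00:00:5e:00:53:01", now=170.0)   # peer 01 keeps sending
suspects = monitor.failed_peers(now=200.0)       # peer 02 has been silent for 200 s
print(suspects)
```

Because the receiver only keeps timers and never answers, the load stays at one message per sender per interval, which is the light-load property the slide points out.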

21 Fault Notification For a point-to-point Ethernet connection it is conceivable to have an AIS + RDI similar to ATM case But in case of Multipoint Ethernet connection, is there a need for AIS and/or RDI ? If so, what purpose would it serve ?

22 Fault Isolation – Loopback test [Figure: Islands A, B, and C, each with CE, u-PE, Agg, and n-PE devices, joined by LAN Emulation for the blue VPLS, showing an end-to-end loopback and a segment loopback]

23 Fault Isolation - Loopback [Figure: segment loopback and end-to-end loopback paths between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS]

24 Fault Isolation – Loopback test
Description:
– It is used to check the connectivity between two service nodes, similar to the ATM loopback test
Operation:
– The originating end-point generates a unicast request (aka ping request) and expects a response from the destination node
Characteristics:
– It can be both Segment and End-to-End
– Loopback mode cannot be “hard” loopback (e.g., cross connecting the circuits)
– Only the destination node with the designated MAC DA can respond
– This message needs to be blocked at the customer UNI
Note: 802.2 has defined XID and TEST for loopback test / ping; however, since we need additional FM messages such as TR and CC, we are defining a consistent format for all these messages

25 Fault Isolation – Trace-route [Figure: Islands A, B, and C, each with CE, u-PE, Agg, and n-PE devices, joined by LAN Emulation for the blue VPLS]

26 Fault Isolation - TraceRoute [Figure: Trace Route messages and TR responses between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS]

27 Fault Isolation – Trace Route
Description:
– It is used to isolate the location of the fault along the end-to-end Ethernet Trail. There is no equivalent function in ATM.
Operation:
– The originating end-point generates a multicast request with the destination MAC DA embedded in it. This message traverses the path hop by hop. Each intermediate service node, upon receiving this message, checks whether it has the MAC DA. If so, it forwards the message to the port associated with the MAC DA and responds to the originator. If not, it ignores the message.
Characteristics:
– It can be both Segment and End-to-End
– Only the destination nodes can respond, in order to reduce the number of responses
– This message needs to be blocked at the customer UNI
– It can also use a multicast/broadcast address as the MAC DA, in which case the trace route can be used to discover network topology (active nodes)
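The hop-by-hop operation above can be sketched as a walk over per-node forwarding tables. This is a simplified model with assumptions: each node's table maps the target MAC straight to the next node's name (`None` at the destination), every node on the path responds, and the node names and the `down` set are invented for the example.

```python
def trace_route(fdbs, first_hop, target_mac, down=frozenset(), max_hops=16):
    """Return the (hop, node) responses a relayed TR request would collect."""
    responses = []
    node, hop = first_hop, 0
    while node is not None and node not in down and hop < max_hops:
        if target_mac not in fdbs[node]:
            break                     # no entry for the target: the node ignores the request
        hop += 1
        responses.append((hop, node))  # the node responds to the originator
        node = fdbs[node][target_mac]  # and relays the request toward the target
    return responses


TARGET = "00:00:5e:00:53:02"
fdbs = {
    "Agg-A": {TARGET: "n-PE-A"},
    "n-PE-A": {TARGET: "n-PE-B"},
    "n-PE-B": {TARGET: "u-PE-B"},
    "u-PE-B": {TARGET: None},          # destination Edge PB owns the target MAC
}

healthy = trace_route(fdbs, "Agg-A", TARGET)
broken = trace_route(fdbs, "Agg-A", TARGET, down={"n-PE-B"})
print(healthy)
print(broken)
```

In the broken case the last response comes from `n-PE-A`, so the fault lies on the link or node just beyond it.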

29 OAM Message Types - Three Types Simple – OAM message is completely transparent to the intermediate service nodes and simply gets switched by them Snoop – OAM message gets snooped by each service node along the path as well as getting forwarded to the next node Relay – OAM message gets intercepted by each service node along the path and then gets generated again

30 OAM Message Types [Figure: Simple, Snoop, and Relay message handling shown between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS]

31 FM Messages and their types Loopback – Simple Continuity Check – Snoop Trace-Route – Relay
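The three handling types and their mapping to FM messages can be summarised as a lookup of what an intermediate service node does with each message. The enum, dictionary keys, and function name below are illustrative, not taken from any specification.

```python
from enum import Enum


class Handling(Enum):
    SIMPLE = "simple"   # switched transparently by intermediate nodes
    SNOOP = "snoop"     # switched, and also copied to the control plane
    RELAY = "relay"     # intercepted, then generated again toward the next node


# Mapping stated on the slide: Loopback is Simple, CC is Snoop, Trace Route is Relay.
FM_HANDLING = {
    "loopback": Handling.SIMPLE,
    "continuity_check": Handling.SNOOP,
    "trace_route": Handling.RELAY,
}


def intermediate_node_actions(message: str) -> dict:
    """Describe what an intermediate service node does with a given FM message."""
    handling = FM_HANDLING[message]
    return {
        "switch_in_data_plane": handling in (Handling.SIMPLE, Handling.SNOOP),
        "copy_to_control_plane": handling in (Handling.SNOOP, Handling.RELAY),
        "regenerate_toward_next_hop": handling is Handling.RELAY,
    }


for name in FM_HANDLING:
    print(name, intermediate_node_actions(name))
```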

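32 Loopback: Simple [Figure: Loopback or Ping request and Ping Response switched transparently between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS]

33 Continuity Check: Snoop [Figure: CC messages snooped along the path between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS]

34 Trace Route: Relay [Figure: TR Relay hop by hop between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS]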

35 Response Types for TR There are two response types for TR messages – Response from selected nodes – e.g., n-PE only or u-PE + n-PE – Response from all nodes – e.g., u-PE, Agg, n-PE

36 Relay TR Response: Selected vs. All [Figure: RELAY Trace Route between CE, u-PE, Agg, and n-PE devices across LAN Emulation for the blue VPLS, comparing responses from selected nodes with responses from all nodes]

38 Fault Isolation – is there an issue ? Even though CC can be used as a periodic refresh of the MAC aging timers for the u-PEs’ MAC addresses, upon a failure and network isolation the u-PEs’ MAC addresses get aged out of the intermediate nodes. Therefore, in order to do proper fault isolation, the intermediate nodes need to remember the u-PEs’ MAC addresses.

39 Steps for Fault Detection & Isolation 1. Issue CC w/ a multicast address from the u-PEs belonging to a given service instance. 2. If CC fails, then issue PING w/ the PE MAC address for connectivity verification to the far-end PE (MAC SA of the CC msg). 3. If PING fails, then issue Trace Route w/ the target PE MAC address. 4. If TR fails and the last egress node is an MPLS/IP node, then issue LSP ping or VCCV to isolate the failure within the Transport Trail.

40 Failure Management scenario – an Example [Figure: Islands A, B, and C, each with CE, u-PE, Agg, and n-PE devices, joined by LAN Emulation for the blue VPLS]

41 Step 1: Continuity Check Each u-PE periodically multicasts its heartbeat to the other members of the VPLS. Heartbeat msgs are forwarded by n-PEs and Agg devices but not intercepted. Heartbeat messages (CCs) are of type Snoop. Heartbeat msgs are terminated by u-PEs, and the receiving u-PE starts a timer for each far-end u-PE.

42 Step 2: PING When a u-PE doesn’t receive the CC (heartbeat) msgs from a remote u-PE (due to a LINK, NODE, or SOFT failure), it initiates a PING toward that remote PE. The PING msg is of type SIMPLE and uses the MAC address of that remote PE. If PING fails, then the operator is notified for further diagnostics (e.g., to run Trace Route). If PING passes and CC fails, then this is an abnormal case (e.g., SW corruption) and needs to be reported to the operator.

43 Step 3: RELAY Trace Route If PING fails, then the failure needs to be located. The first diagnostic test is to run Trace Route. The last TR response has the cumulative path info and points to the failed node/link. There are two types of RELAY TR: – A) response comes from selected nodes (n-PEs and u-PEs) – B) response comes from every node

44 Step 4: Transport Trail test If the Relay TR fails, then the failure can be isolated to the last egress PB along the path If the connection between last PB and its adjacent downstream PB is virtual (e.g., PW over MPLS/IP), then one can further isolate the failure along the transport trail by using facilities available at the transport level – e.g., in case of MPLS – use MPLS ping and TR

45 Special Case If RELAY TR passes, then there may be a FIB corruption, and the operator can initiate multiple loopback tests, one for each pair consisting of the u-PE and one of the intermediate service nodes reported in the relay TR.

46 Agenda OAM Layering OAM Functionality Putting the pieces together OAM Message Types OAM Message Format

47 OAM Message Format The following are some examples of Ethernet OAM message formats for Fault Management messages (e.g., Loopback, Continuity Check, and Trace Route). There are some proposals in MEF for a Loopback OAM message format (but not for CC and TR, since these concepts haven’t been discussed to my knowledge). We need to work with MEF and other interested parties in defining and finalizing the format for these messages.

48 OAM msg format Frame fields: OAM MAC DA, OAM MAC SA, Ether type (VLAN), VLAN Tag, Ether type (OAM), Type, Sub-type, OAM Data, FCS

49 OAM message format - continue OAM Type: Fault Management; OAM Sub-types: Continuity Check, Loopback, Trace Route. OAM Type: Performance Management; OAM Sub-types: TBD.
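To make the frame layout concrete, here is a sketch that packs the header fields in order. Several values are assumptions: the deck assigns no OAM EtherType, so the IEEE local-experimental value 0x88B5 stands in as a placeholder, and the one-byte Type and Sub-type fields and their numeric codes are invented. The FCS is left out because the sending hardware normally appends it.

```python
import struct

ETHERTYPE_VLAN = 0x8100   # IEEE 802.1Q tag protocol identifier
ETHERTYPE_OAM = 0x88B5    # placeholder (IEEE local-experimental), not an assigned OAM value
OAM_TYPE_FM = 1           # hypothetical code for "Fault Management"
SUBTYPES = {"continuity_check": 1, "loopback": 2, "trace_route": 3}  # hypothetical codes


def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6-byte form."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_oam_frame(dst: str, src: str, vlan_id: int, subtype: str, data: bytes = b"") -> bytes:
    """Pack MAC DA, MAC SA, VLAN tag, OAM EtherType, Type, Sub-type, and OAM data."""
    tci = vlan_id & 0x0FFF  # priority and drop-eligible bits left at zero
    header = mac_bytes(dst) + mac_bytes(src)
    header += struct.pack("!HHH", ETHERTYPE_VLAN, tci, ETHERTYPE_OAM)
    return header + struct.pack("!BB", OAM_TYPE_FM, SUBTYPES[subtype]) + data


frame = build_oam_frame("ff:ff:ff:ff:ff:ff", "00:00:5e:00:53:01", 100, "continuity_check")
print(len(frame), frame.hex())
```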

50 OAM Data: Continuity Check No OAM data part has been identified for CC currently

51 OAM Data: Trace Route Request Fields: Addr type, Resp. mode, Address of Requesting Service Node, Min. hop to resp., Max. hop to resp., Correlation Tag, Requested info bit map, Hop #, Reserved, Piggyback Info, Piggyback flag, MAC address of the destination Edge-PB

52 OAM Data: Trace Route Request
Hop number – The hop number of the adjacent upstream node. The receiving node increments this number by one before passing the TR request to the next downstream node.
Response mode – Indicates which nodes need to respond
» 01: All nodes need to respond
» 02: Only n-PEs need to respond
» 03: Only the nodes with min. and/or max. hops value need to respond
» etc.
Min. hop to respond – Only nodes with a hop count greater than or equal to this value can respond (a value of zero means there is no minimum hop count)
Max. hop to respond – Only nodes with a hop count less than or equal to this value can respond (a value of zero means there is no maximum hop count)
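The response-mode and hop-window fields combine into a per-node decision on whether to answer a TR request. The sketch below takes one possible reading: the min/max window applies in every mode, and mode 03 means "respond only when a window is set and the node falls inside it". That reading, and the function names, are assumptions.

```python
def next_hop_number(received_hop: int) -> int:
    """The receiving node increments the hop number before relaying the request."""
    return received_hop + 1


def should_respond(hop: int, node_type: str, resp_mode: int,
                   min_hop: int = 0, max_hop: int = 0) -> bool:
    """Decide whether a node at a given hop answers a TR request."""
    if min_hop and hop < min_hop:   # zero means "no minimum"
        return False
    if max_hop and hop > max_hop:   # zero means "no maximum"
        return False
    if resp_mode == 1:              # 01: all nodes respond
        return True
    if resp_mode == 2:              # 02: only n-PEs respond
        return node_type == "n-PE"
    if resp_mode == 3:              # 03: assumed to mean "hop window only"
        return bool(min_hop or max_hop)
    raise ValueError(f"unsupported response mode: {resp_mode}")


hop = next_hop_number(1)
print(hop, should_respond(hop, "Agg", resp_mode=1))
print(should_respond(5, "Agg", resp_mode=3, min_hop=4, max_hop=6))
```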

53 OAM Data: Trace Route Request
Piggyback Flag – Indicates whether piggybacking of info is allowed or not
Address type – Indicates what type of address is used to identify the requesting service node (the same type must also be used in the response).
» 01: IPv4 (4-byte)
» 02: IPv6
» 03: VPN-id (8-byte)
» etc.
Address of Service Node – The address of the service node, of the type specified by the previous field
Correlation Tag – The tag used to associate the TR responses with the TR request

54 OAM Data: Trace Route Request MAC address of the destination Edge-PB – MAC address of the Edge-PB to which connectivity has been lost (the one whose CC messages are no longer being received). Requested Info Bit Map – A bit map that indicates what info should be sent in the response.

55 OAM Data: Trace Route Response Fields: Node Hop #, Node Type, Address of the Requesting Service Node, Correlation Tag, Info from node 1, Info from responding node, EOM

56 OAM Data: Trace Route Response Node hop # – Indicates the hop count of the responding node Node Type – Indicates the type of the node Address of Requesting Service Node – Indicates the address of the requesting service node (copied from the request) Correlation Tag – The tag used to associate the TR responses with the TR request (copied from the TR request)

57 OAM Data: Loopback Request Fields: Addr type, Address of Requesting Service Node, Correlation Tag, Reserved, Time Stamp (optional)

59 How to remember Edge-PB MAC addresses Two basic approaches: 1. Maintain a separate table of E-PBs’ MAC addresses in software to associate a given E-PB’s MAC address with an egress port. 2. Disable the aging timer for E-PBs’ MAC addresses by using semi-static entries in the forwarding table.

60 How to remember Edge-PB MAC addresses – Software Approach
In the software approach
– Upon a TCN, the primary forwarding table gets flushed; however, the software forwarding table remains intact, and its entries maintain the association between E-PBs’ MAC addresses and the corresponding egress ports
– Since TR goes hop by hop (through each node’s control plane), the software forwarding table can be used for forwarding TR messages
– When a TCN occurs, the entries of the software table may become obsolete; however, since this table is not used for forwarding customer data, this should be O.K. Also, upon the next OAM CC message from an E-PB, the corresponding entry for that E-PB gets corrected
– When a failure occurs without a TCN, the entries of the software table have the correct values, which can be used for tracing. The entries of the primary table may have already been aged out.
– In this approach the E-PBs can be identified without the use of special MAC addresses
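The software approach amounts to keeping a second table that a TCN does not flush. A minimal sketch with illustrative class and method names:

```python
class BridgeTables:
    """Primary FDB for customer data plus a software table used only for OAM TR."""

    def __init__(self):
        self.primary_fdb = {}   # MAC -> egress port; flushed on a TCN
        self.oam_table = {}     # E-PB MAC -> egress port; survives a TCN

    def learn(self, mac, port):
        # Ordinary source-address learning from data frames.
        self.primary_fdb[mac] = port

    def on_cc(self, epb_mac, port):
        # A snooped CC refreshes both tables, correcting any stale OAM entry.
        self.primary_fdb[epb_mac] = port
        self.oam_table[epb_mac] = port

    def on_tcn(self):
        # A topology change flushes only the primary table.
        self.primary_fdb.clear()

    def tr_egress_port(self, epb_mac):
        # TR requests are forwarded by the control plane using the software table.
        return self.oam_table.get(epb_mac)


tables = BridgeTables()
tables.learn("00:00:5e:00:53:aa", 1)   # a customer MAC
tables.on_cc("00:00:5e:00:53:01", 3)   # a far-end E-PB heard on port 3
tables.on_tcn()
print(tables.primary_fdb, tables.tr_egress_port("00:00:5e:00:53:01"))
```

After the TCN the entry may point at a stale port until the next CC arrives, which the slide accepts because customer data never uses this table.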

61 How to remember Edge-PB MAC addresses – “Hardware” Approach
In the “Hardware” approach
– A special MAC address needs to be used for each E-PB for the purpose of OAM management
– The aging of these MAC addresses is disabled through the use of static values in the primary forwarding table
– When a CC is received, the corresponding entry for the E-PB’s MAC address of that CC gets updated
– Upon a TCN, all other MAC entries get flushed except these special MAC addresses. These special MAC entries will point to the correct egress ports again upon receiving subsequent CC messages
– Since these special MAC addresses are used only for Fault Management purposes, they should not impact any other traffic flow
– When a failure occurs without a TCN, these static entries point to the correct egress ports, which can be used in tracing

62 Hardware approach - continue
This is not an issue for the following reasons:
– Only OAM messages are affected; NO customer traffic is affected at any time
– In case of a TCN, the next broadcast CC corrects the MAC entry of a given intermediate node by overwriting it w/ the correct port-id, so subsequently received CCs are handled correctly
– Continuity Check messages get broadcast periodically from each u-PE. So if a failure results in:
» Topology change, then the entry for the u-PE MAC address gets overwritten w/ the proper port-id upon receiving the next CC msg
» No topology change, then the entries for the u-PE MAC address in all the service nodes remain intact and point toward the failure point
– Given that the CC transmission interval is on the order of a minute versus a topology change (which is on the order of seconds), at worst only a single CC message can get lost

63 Ethernet Layer-2 Layer 2 world (in blue) consists of endstations, e.g. hosts and routers, at the ends, with bridges in the middle, connected by Layer 1 media.

64 Proper Layering Proper layering allows you to put together a network using very different Layer 1 media. Layer 3 tunnel (e.g. EoMPLS) is an emulated Layer 1 (pseudowire) medium in this environment!

65 Ethernet Trace Route Ethernet Trace Route shall be independent from underlying physical or emulated links Ethernet TR shall operate at L2 level

66 Mixed Layering One could define an ATM OAM (or MPLS OAM cell), for example, that carried a MAC address from bridge to bridge, “popping up” at each one, and re-submerging into ATM to traverse the next VC. But that would create dependency on L1 and lots of inter-operability issues among different emulated L1 interfaces (e.g., ATM 1483, FR 1490, EoMPLS, EoL2TPv3, EoSONET, etc.)

67 Issues 1. How do we differentiate the Customer’s OAM frames from the Provider’s OAM frames when the same Ether Type is used for both? – If we do translation of the Ether Type, then that may be inconsistent w/ our 802.1ad strategy (802.1ad doesn’t translate the Ether Type, it simply adds an additional Ether Type) – If we do insertion of an Ether Type, then we will end up inserting a double tag in front of the customer’s tag !! 2. An easier alternative would be to use different Ether Types for the provider’s OAM versus the customer’s OAM – If the customer uses the provider’s OAM Ether Type, then drop the packet – Otherwise the packet is passed through the provider’s network as a regular packet, and the provider’s bridges are oblivious to it.

68 Issues – Using SA MAC for Forwarding Decision The majority of the time, the DA MAC is used for the forwarding decision. However, there are two scenarios where the SA MAC can be used: – Load balancing over a set of 802.3ad links (besides the SA MAC, other fields within the payload, such as the IP five-tuple, can be used for hashing) – Load balancing over blocked ports (proprietary mechanism) In such scenarios, customer frames can take a different path than the provider’s OAM frames
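A toy example shows why SA-based hashing can send OAM frames and customer frames down different member links. The hash here (XOR of the source MAC bytes, modulo the link count) is invented for illustration; real 802.3ad implementations use vendor-specific hash inputs and functions.

```python
def pick_link(src_mac: str, n_links: int) -> int:
    """Toy hash: XOR of the source MAC bytes, modulo the number of member links."""
    digest = 0
    for part in src_mac.split(":"):
        digest ^= int(part, 16)
    return digest % n_links


customer_sa = "00:00:5e:00:53:10"   # SA of a customer data frame
pe_sa = "00:00:5e:00:53:01"         # SA of the provider's OAM frame (PE MAC)

customer_link = pick_link(customer_sa, n_links=2)
oam_link = pick_link(pe_sa, n_links=2)
print(customer_link, oam_link)
```

With these addresses the two frames land on different links, so an OAM test can pass while customer traffic on the other link is still failing.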

69 Failure Type E2E path failure types: – Permanent (e.g., node or link failure without an alternative path): TR pinpoints the failed node/link (Relay TR can pinpoint the failed nodes at any time because the PE MAC doesn’t age out) – Transient (e.g., node or link failure with an alternative path, causing a TCN): TR doesn’t come into the picture (some nodes will point incorrectly, but they get corrected upon the next periodic OAM CC message; only OAM CC msgs are affected, for a short period)

70 PE MACs versus Customer MACs Advantages of PE MACs in OAM msgs: – fully under control of SP –Age timers can be disabled » if a link/node fails w/o re-route, then use of TR w/ PE MAC address can pinpoint the failure location » if a link/node fails w/ re-route, then MAC address learning mechanism will self correct the FDB entries upon receiving the next message from the source Disadvantages of Customer MACs in OAM msgs: – can timeout quickly and thus making it very difficult to do tracing for fault isolation » will result in flooding and lots of non-relevant responses » can’t pinpoint the problem because there is no way to identify the egress node that used to have the customer MAC address

71 OAM msgs and MAC addresses Known Unicast (PE MAC) Known Unicast (Cust. MAC) Unknown Unicast (PE MAC) Unknown Unicast (Cust. MAC) Multicast Continuity Check N/A SNOOP OAM-type = Heartbeat Set MAC SA timer Ping SIMPLE SNOOP ? only egress u-PE returns ACK N/A Trace Route SNOOP RELAY SNOOP ? RELAY ? N/A SNOOP RELAY

72 CC versus PING for fault detection CC & PING comparison – CC requires O(n) msgs; whereas, PING requires O(n**2) – CC requires no ACKs; whereas, PING requires ACKs – both CC & PING need to start after VPLS instance initialization in order to populate FDB along the path properly – PING cannot be started after failure occurs (w/o CC) since most likely there won’t be any ACK and TR won’t be able to pinpoint the problem because of unknown PE MAC addresses
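The O(n) versus O(n**2) comparison can be checked with a quick count per detection round. The assumptions are one multicast CC per u-PE, and for PING a full mesh in which every u-PE pings every other u-PE and receives an ACK for each request.

```python
def cc_messages(n: int) -> int:
    """One multicast heartbeat per u-PE per interval, with no ACKs."""
    return n


def ping_messages(n: int) -> int:
    """Full-mesh unicast: n*(n-1) requests plus the same number of ACKs."""
    return 2 * n * (n - 1)


for n in (2, 10, 50):
    print(n, cc_messages(n), ping_messages(n))
```

For a 10-site service instance this comes to 10 CC messages against 180 PING messages per round.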

73 Special Case-2: Ping
1. If PING passes and CC fails, then there may be SW corruption in the egress u-PE, although SW corruption should typically disable both CC and Ping.
2. If OAM ping (w/ PE MAC) passes and the problem still persists, then one may want to do an OAM ping w/ the customer MAC. In such a case the problem can be:
A) at the egress interfaces, in which case OAM ping w/ the customer MAC won’t detect the problem either
B) w/ FDB corruption; however, a checksum on the FIB should catch this
C) a wrong ACL on the customer MAC
D) some sort of load balancing (across either an 802.3ad set of links or across blocked links; the latter is a proprietary method)
3. Does it make sense to use Ping w/ the customer MAC, since after a few minutes the entries get aged out and so become untraceable ?
