Packet Drop in HP Switches (Guoming)
Cause: packet-based hashing in the F10 LAG + limited HP switch buffer
Assumption: link utilization 50%
In the hashing, several events with different IP_IDs can end up on the same output link; over the long term, however, the load is still very well balanced among all LAG members.
[Diagram: F10 ports feeding an HP switch; two events hashed onto the same LAG link congest one output toward a single destination. Another round starts after the events have been sent to all farm nodes.]
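To make the collision concrete, here is a minimal sketch, assuming the LAG hash is simply IP_ID modulo the number of member links (the actual F10 hash function is not specified on this slide):

```python
# Assumed hash: IP_ID modulo link count (illustrative only; the real
# F10 LAG hash is not given in these slides).
def lag_output_link(ip_id: int, num_links: int) -> int:
    """Pick the LAG member link carrying all packets of one event."""
    return ip_id % num_links

# Two events whose IP_IDs collide modulo the link count share one link
# for their whole duration, even though long-term load stays balanced.
print([lag_output_link(ip_id, num_links=4) for ip_id in (100, 104)])  # [0, 0]
```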
What makes things worse?
HP switch available buffer: 350 ~ 500 KB (the available buffer depends on the frame size)
Big event size, e.g. the case in slide 2: two events contend for the same output port in the HP switch, and packets get dropped if the event size is bigger than the buffer.
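A back-of-the-envelope check of the drop condition; the frame and event sizes below are hypothetical placeholders (only the buffer range comes from this slide):

```python
# Hypothetical numbers for illustration; only the 350 KB buffer figure
# (low end of the 350-500 KB range) is from the slide.
frame_kb = 1.5                         # assumed full-size frame
pkts_per_event = 300                   # hypothetical event size
event_kb = frame_kb * pkts_per_event   # 450 KB

buffer_kb = 350                        # worst-case available HP buffer
# While one event drains, the second contending event must be buffered.
excess = event_kb - buffer_kb
print("drops ~%d KB" % excess if excess > 0 else "fits in buffer")
```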
Simulation Studies
Assumptions / simplifications:
1) All frames have the same size
2) Full farm size: 5-port LAG x 100, 30 nodes/rack
3) To speed up the simulation: 12 pkts/event, 1.2 KB/frame
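The simulator itself is not shown in these slides; the following toy model, under the listed assumptions, illustrates how such a study can measure the maximum queue length (the modulo hash and all names here are assumptions):

```python
import random

def simulate(num_links=5, pkts_per_event=12, num_events=100_000,
             utilization=0.8, seed=0):
    """Toy model: each event hashes to one LAG link; every link drains
    at a fixed rate sized so the aggregate load equals `utilization`."""
    rng = random.Random(seed)
    drain = pkts_per_event / (num_links * utilization)  # pkts/link/slot
    queues = [0.0] * num_links
    max_queue = 0.0
    ip_id = 0
    for _ in range(num_events):
        ip_id += rng.randint(1, 8)      # IP_IDs of consecutive events vary
        link = ip_id % num_links        # assumed modulo hash
        queues[link] += pkts_per_event  # the whole event lands on one link
        max_queue = max(max_queue, queues[link])
        for i in range(num_links):      # all links drain each slot
            queues[i] = max(0.0, queues[i] - drain)
    return max_queue

print(simulate())  # max queue length (packets) on the busiest link
```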
Result: Max. Queue Length vs MEP Factor
12 pkts/event, link utilization: 80%
No clear correlation between queue length and MEP factor.
[Plot: F10Q Max (pkt) and HPQ Max (pkt) vs MEP factor]
Result: Max. Queue Length vs Link Number per LAG
Link utilization: 80%
[Plot: F10Q Max and HPQ Max vs links per LAG (5 to 8)]
Result: Max. Queue Length vs Link Utilization
5 links/LAG x 100
[Plot: F10Q Max and HPQ Max vs link utilization (%)]
Possible solutions
Enable flow control.
Others, if flow control does not help:
– Small MEP factor
– Change IP_IDENT (see the result on the next slide and the sketch after this list):
  1) half of the senders use IP_IDENT, the other half IP_IDENT+1
  2) 1/3 use IP_IDENT, 1/3 IP_IDENT+1, 1/3 IP_IDENT+2
  3) some other schemes...
– Change back to the original scheme: no LAG, small VLANs
– Feature request for F10: round-robin hashing
– Upgrade the HP switches
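Under the same assumed modulo hash, a sketch of the IP_IDENT idea: offsetting IP_IDENT for part of the senders separates events that would otherwise collide on one link (the offsets follow scheme 1 above; the link count and IP_IDs are illustrative):

```python
NUM_LINKS = 4
ids = [100, 104]                 # both hash to link 0 as-is

plain = [i % NUM_LINKS for i in ids]                            # [0, 0]
# Scheme 1: half the senders keep IP_IDENT, the other half use +1.
halved = [(i + k % 2) % NUM_LINKS for k, i in enumerate(ids)]   # [0, 1]
print(plain, halved)             # contention vs. events spread over links
```

Scheme 2 works the same way with offsets 0, 1, and 2 across thirds of the senders.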
Simulation Result: changing IP_IDENT (1)
[Plots: F10Q Max and HPQ Max vs link utilization (%), left for the 1/2 + 1/2 scheme, right for the 1/3 + 1/3 + 1/3 scheme]