Published by Lucas Davies. Modified over 11 years ago.
1
Intel Compiler Lab and USTC, PPoPP08
Scalable Packet Classification Using Interpreting -- Cross-platform Multi-core Solution
Haipeng Cheng & Bei Hua, Univ. of Science & Technology of China (USTC)
Xinan Tang, Intel Compiler Lab
2
Outline
Background
Packet Classification Problem
Review RFC Algorithm
TIC Algorithm
Experimental Results and Analysis
Future Work
3
10GbE: Smaller, Cheaper, and Denser
                     2006     2007     2008     2009
Switch Ports         8-12     20-24    48       96
Servers with 10GbE   10%      30-40%   50-60%   >60%
Port Cost            $2-5K    $1-2K    <$400    <$250
4
Background (Networking)
10 Gbps offers too much bandwidth for multi-core computers to handle
Traffic complexity: triple-play (voice, video, and data) support is essential
Traffic types: P2P packets account for 70% of total network traffic
Packet classification is increasingly important for identifying and controlling traffic
5
Background (Multi-core)
Multi-core is becoming prevalent
Networking (Intel IXP, Cavium Octeon, RMI XLR)
Multimedia (IBM Cell, Intel Larrabee)
General-purpose
  – Intel Core 2 Duo
  – AMD Barcelona
  – IBM Power5
  – Sun Niagara
Comment: finding an efficient solution for one multi-core architecture is hard; finding a cross-platform solution is even harder
6
Classification Problem
The process of partitioning packets into groups is called packet classification
Packet classification typically uses the 5-tuple
It enables value-added services:
  – Security: classify packets based on security policies
  – QoS: sort packets and ensure that they receive an appropriate bandwidth share
  – P2P management: tame P2P traffic
7
Packet Classification Example
Packet (000, 010)
Which rule does the packet match?
How to match?
8
Why Is Packet Classification Hard?
Packet classification is NP-hard
Heuristic solutions aim for O(1) lookup time
At 10 Gbps (OC-192), a 64-byte packet must be classified within 40 ns
  – one DRAM access time
  – 100 cycles on a 2.5 GHz CPU
9
Packet Classification Solutions
At 10 Gbps (OC-192), classification is done by
  – Special ASIC
  – TCAM
  – Algorithms (?)
      – Hierarchical Tries
      – Recursive Flow Classification (RFC)
      – Two-stage Interpreter based Classification (TIC)
10
RFC Example
Even though the search space is huge, (2^3)*(2^3)*(2^3), the number of rules actually matched per field is limited for a given packet
A class bitmap can be used to describe the matched rules:
  – 0001 means R4 is the matched rule
  – 1101 means R1, R2, and R4 are the matched rules
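The class bitmap idea can be sketched in a few lines of Python. This is only an illustration: it assumes four rules with the leftmost bit standing for R1 (as the two bitmaps above suggest), and the helper names `matched_rules` and `intersect` are made up for this sketch. Intersecting per-field bitmaps with a bitwise AND leaves the rules that match on every field.

```python
RULES = ["R1", "R2", "R3", "R4"]  # assumption: leftmost bit of the bitmap is R1


def matched_rules(bitmap):
    """Decode a class bitmap into the names of the rules it marks."""
    n = len(RULES)
    return [RULES[i] for i in range(n) if bitmap & (1 << (n - 1 - i))]


def intersect(*bitmaps):
    """AND per-field bitmaps: a rule survives only if every field matches it."""
    result = (1 << len(RULES)) - 1  # start with all rules set
    for b in bitmaps:
        result &= b
    return result


print(matched_rules(0b0001))                     # ['R4']
print(matched_rules(0b1101))                     # ['R1', 'R2', 'R4']
print(matched_rules(intersect(0b1101, 0b0101)))  # ['R2', 'R4']
```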
11
RFC Exam.
12
Recursive Flow Classification
Map an S-bit string, concatenated from the d fields of the packet header, to a T-bit number through multiple phases (T << S)
S-IP (32b)   D-IP (32b)   S-Port (16b)   D-Port (16b)   Proto (8b)
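A toy version of this multi-phase reduction might look as follows. It is a sketch under stated assumptions, not the authors' implementation: the rule set, the two 3-bit fields, and all function names are hypothetical, and the tables are built by brute force. Phase 0 maps each field value to an equivalence-class ID; phase 1 maps a pair of IDs to the highest-priority matching rule, so a lookup costs three table reads here (the full 5-tuple version discussed in these slides needs 13).

```python
# Hypothetical rule set: each rule is ((lo, hi) for field 0, (lo, hi) for field 1).
RULES = [((0, 1), (2, 3)), ((0, 3), (0, 3)), ((4, 7), (0, 7)), ((0, 7), (0, 7))]
BITS = 3  # width of each toy field


def phase0_table(field):
    """Map every value of one field to an equivalence-class ID."""
    bitmaps, table = [], []
    for v in range(1 << BITS):
        bm = 0
        for lo, hi in (r[field] for r in RULES):
            bm = (bm << 1) | (lo <= v <= hi)  # first rule ends up in the top bit
        if bm not in bitmaps:
            bitmaps.append(bm)
        table.append(bitmaps.index(bm))
    return table, bitmaps


def phase1_table(bm0, bm1):
    """Combine two class IDs into the index of the first matching rule."""
    n = len(RULES)
    table = []
    for a in bm0:
        row = []
        for b in bm1:
            both = a & b
            row.append(n - both.bit_length() if both else None)
        table.append(row)
    return table


T0, BM0 = phase0_table(0)
T1, BM1 = phase0_table(1)
FINAL = phase1_table(BM0, BM1)


def classify(f0, f1):
    """Three indexed reads: two in phase 0, one in phase 1."""
    return FINAL[T0[f0]][T1[f1]]


print(classify(0b000, 0b010))  # 0, i.e. R1 under this toy rule set
```

The memory cost is visible even at this scale: the phase-1 table grows with the product of the class counts of its inputs.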
13
What's Wrong with RFC?
Memory explosion
Updates are too slow in practice
However, at 13 memory accesses per lookup, RFC is the fastest classification algorithm
14
Two-stage Interpreting based Classification
Domain knowledge: divide RFC into two stages
Search the source-destination prefix pair
  – 99.9% of the time, no more than 5 rules match a given source-destination prefix pair
Search the list of port-range expressions
  – Range [2..14] expressed as prefixes: 001*, 01**, 10**, 110*, 1110
  – Range search is based on calculation (<, >, =)
  – Encode the type of each range expression intelligently
  – Evaluate the expressions sequentially
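The prefix list for the range [2..14] can be reproduced with the standard range-to-prefix expansion, sketched below. The function name `range_to_prefixes` and the 4-bit width are choices made for this illustration. The point of the comparison-based approach is that a single check `lo <= port <= hi` replaces all of the prefixes the expansion produces.

```python
def range_to_prefixes(lo, hi, width):
    """Split the integer range [lo, hi] into the minimal list of bit prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << width
        # ... shrunk until it fits inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        k = size.bit_length() - 1  # number of wildcard bits in this prefix
        bits = format(lo >> k, "0{}b".format(width - k)) if k < width else ""
        prefixes.append(bits + "*" * k)
        lo += size
    return prefixes


print(range_to_prefixes(2, 14, 4))
# ['001*', '01**', '10**', '110*', '1110']
```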
15
TIC Main Ideas
L2 cache sizes are in the megabyte range
Network applications are memory intensive
Memory is best accessed sequentially
  – 64-byte cache line on Core 2 Duo
  – 64-byte local memory on IXP
Can compression be used to optimize performance?
  – CISC encoding for a smaller memory footprint
16
Putting Everything Together
Domain knowledge: two-stage classification
Architecture features:
  – Plenty of CPU cycles
  – Large L2 cache
  – Block-based sequential access
  – Branch prediction can eliminate infrequently executed paths
17
Port-Range Expressions
There are five types of range expressions
  – WC (wildcard)
  – HI ([1024, 65535])
  – LO ([0, 1023])
  – AR (arbitrary range)
  – EM (exact match)
For (s-port, d-port, proto), there are at least 5x5x2=50 operators
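A minimal sketch of how a port range could be assigned one of the five types, and where the count of 50 comes from. The function name `range_type` is hypothetical, and reading the factor of 2 as "protocol is either wildcard or exact match" is an assumption, since the slide does not spell it out.

```python
from itertools import product

PORT_TYPES = ["WC", "HI", "LO", "AR", "EM"]
PROTO_TYPES = ["WC", "EM"]  # assumption: protocol is wildcard or exact match


def range_type(lo, hi):
    """Classify a 16-bit port range [lo, hi] into one of the five types."""
    if (lo, hi) == (0, 65535):
        return "WC"
    if (lo, hi) == (1024, 65535):
        return "HI"
    if (lo, hi) == (0, 1023):
        return "LO"
    if lo == hi:
        return "EM"
    return "AR"


# One combined operator per (s-port type, d-port type, proto type) triple.
operators = list(product(PORT_TYPES, PORT_TYPES, PROTO_TYPES))
print(range_type(80, 80), range_type(6000, 6063), len(operators))  # EM AR 50
```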
18
Characteristics of Range Expressions for Destination Port
Classifier   WC       HI       LO   EM       AR
seed1        30.42%   -        -    57.89%   11.6%
seed2        9.25%    13.96%   -    65.75%   11.04%
seed3        8.56%    12.15%   -    68.08%   11.21%
seed4        30.00%   4.08%    -    60.72%   5.20%
seed5        55.46%   6.52%    -    35.48%   2.53%
19
Encoding and Interpreting
Eliminate WC calculation
Introduce HI and LO operators without storing the constants
  – HI ([1024, 65535])
  – LO ([0, 1023])
Store AR and EM parameters in the operand fields
NOP for code block alignment
20
Can We Afford to Increase the Number of Operators?
The interpreter is a big switch-case statement
The compiler stores the starting address of each case in a jump table
The interpreter executes two instructions per iteration:
  – load an address from the jump table into a register
  – jump to that address using indirect addressing
The IXP –E compiler can optimize switch-case with
  – Default Case Removal
  – Switch Block Packing
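The interpreter idea can be illustrated with a table of handlers indexed by opcode, which plays the role of the compiler-generated jump table. Everything concrete here is an assumption made for the sketch: the opcode numbering, the `(opcode, a, b, rule_id)` instruction layout, the single destination-port field, and the rule names. It does follow the encoding described in these slides: WC needs no comparison, HI and LO carry their constants in the opcode, EM and AR read their parameters from operand fields, and NOP is padding.

```python
NOP, WC, HI, LO, EM, AR = range(6)  # hypothetical opcode numbering

# Dispatch table: one handler per opcode, indexed like a jump table.
HANDLERS = [
    lambda port, a, b: False,           # NOP: alignment padding, never matches
    lambda port, a, b: True,            # WC: always matches, no comparison
    lambda port, a, b: port >= 1024,    # HI: constant implied by the opcode
    lambda port, a, b: port <= 1023,    # LO: constant implied by the opcode
    lambda port, a, b: port == a,       # EM: value stored in operand a
    lambda port, a, b: a <= port <= b,  # AR: bounds stored in operands a, b
]


def interpret(program, port):
    """Evaluate the expressions sequentially; return the first matching rule."""
    for opcode, a, b, rule_id in program:
        if HANDLERS[opcode](port, a, b):  # one table load, one indirect call
            return rule_id
    return None


program = [
    (EM, 80, 0, "R1"),
    (HI, 0, 0, "R2"),
    (NOP, 0, 0, None),
    (AR, 20, 21, "R3"),
    (WC, 0, 0, "R4"),
]
print([interpret(program, p) for p in (80, 8080, 21, 443)])
# ['R1', 'R2', 'R3', 'R4']
```

Adding an operator only adds one entry to the table, so the per-iteration dispatch cost does not depend on the number of operators.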
21
Experimental Setup
Intel Xeon 5160 (Core 2 Duo) running at 3.00 GHz with a 4 MB L2 cache and a 1333 MHz system bus
Cycle-accurate IXP2800 simulator; each ME runs at 1.2 GHz with 8 threads
Packet traces are generated with ClassBench; low-locality traces are used to cancel out the effect of locality
22
Space Reduction
Size   Classifier   #Rules   RFC (MB)   TIC (MB)
2K     DB1          1921     2.46       1.55
2K     DB2          2020     14.86      2.79
2K     DB3          2008     11.93      2.48
2K     DB4          1671     2.82       2.10
2K     DB5          2012     47.31      2.89
4K     DB6          3461     2.66       1.59
4K     DB7          3989     39.17      6.80
4K     DB8          4009     36.81      6.23
4K     DB9          2925     2.93       2.09
4K     DB10         3688     82.52      2.97
23
Relative Speedups on Core 2 Duo
              1T       2T       3T       4T
DB1    RFC    15.42    21.54    24.31    26.61
       TIC    12.89    20.52    25.28    30.02
       Imp.   -16.4%   -4.7%    3.9%     12.8%
DB2    RFC    10.89    14.59    17.09    20.49
       TIC    11.43    16.08    18.57    21.13
       Imp.   4.9%     10.2%    8.6%     3.1%
DB3    RFC    11.47    15.57    17.96    20.96
       TIC    11.72    16.38    19.73    21.27
       Imp.   2.1%     5.2%     9.8%     1.5%
DB4    RFC    14.84    19.08    22.52    24.43
       TIC    13.06    19.43    22.95    24.87
       Imp.   -12%     1.9%              1.8%
DB5    RFC    9.05     12.14    15.19    16.44
       TIC    10.64    14.82    21.63    18.59
       Imp.   17.5%    22%      42.4%    13.1%
Ave.   RFC    12.33    16.58    19.42    21.78
       TIC    11.94    17.44    21.63    23.18
       Imp.   -3.1%    5.2%     11.4%    6.39%
24
Speedups on IXP (RFC vs. TIC)
              1T       2T       4T       8T       16T      32T
DB1    RFC    1.81     3.59     6.91     11.01    21.06    35.29
       TIC    1.38     2.38     4.76     5.72     11.22    20.49
       Imp.   -23.7%   -33.7%   -31.1%   -48.1%   -46.7%   -41.9%
DB2    RFC    1.78     3.59     6.89     10.98    20.61    34.98
       TIC    1.7      3.24     6.25     9.39     18.23    29.98
       Imp.   -4.5%    -9.7%    -9.3%    -14.5%   -11.5%   -14.3%
DB3    RFC    1.77     3.57     6.86     10.89    20.43    35.01
       TIC    1.67     3.2      6.19     9.02     17.99    30.07
       Imp.   -5.6%    -10.4%   -9.8%    -17.2%   -11.9%   -14%
DB4    RFC    1.78     3.58     6.86     10.73    20.75    35.11
       TIC    1.59     3.01     5.79     9.08     17.28    26.9
       Imp.   -10.7%   -15.9%   -15.6%   -15.4%   -16.7%   -23.4%
DB5    RFC    1.75     3.51     6.84     10.9     20.59    35.03
       TIC    1.71     3.23     6.24     9.41     18.16    29.58
       Imp.   -2.3%    -7.9%    -8.8%    -13.7%   -11.8%   -15.6%
25
Why Is RFC Better than TIC on IXP?
Block size plays an important role in the IXP architecture, since SRAM is optimized for 32-bit access
       #SRAM Accesses   #Words Accessed
RFC    13               13 W
TIC    7 + 1 = 8        7 + 8 = 15 W
26
Block Size Impacts on IXP
27
Future Work
Improve TIC performance on IXP
Improve TIC performance on firewall rules
Improve update speeds