SCALABLE PACKET CLASSIFICATION USING INTERPRETING: A CROSS-PLATFORM MULTI-CORE SOLUTION
Authors: Haipeng Cheng, Zheng Chen, Bei Hua and Xinan Tang
Publisher/Conf.: ACM/PPoPP '08 (the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
Speaker: Han-Jhen Guo
Date:
OUTLINE
Developing TIC Algorithm
  RFC Reduction Tree
  TIC Algorithm Description
  Instruction Encoding
  The Range Interpreter
Architecture-aware Design and Implementation
Simulation and Performance Analysis
  Relative Speedups for Core 2 Duo
  Relative Speedups for IXP2800
DEVELOPING TIC ALGORITHM - RFC REDUCTION TREE (1/2)
A simple example of an RFC reduction tree
DEVELOPING TIC ALGORITHM - RFC REDUCTION TREE (2/2)
Actual architecture of the RFC reduction tree
  4 phases
  13 memory accesses per packet
  disadvantage: the cost of memory explosion
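A minimal sketch of an RFC-style reduction-tree lookup in C, for orientation only: the chunk split, table shapes and phase wiring below are illustrative assumptions, not the paper's exact configuration, but the phase count and the 13 table accesses match the numbers on this slide.

```c
#include <stdint.h>

/* Each table maps one raw header chunk (phase 0) or a pair of class IDs
 * (later phases) to a smaller equivalence-class ID, precomputed offline. */
typedef struct {
    const uint16_t *eq;    /* precomputed equivalence-class IDs */
    uint32_t        cols;  /* #classes of the right-hand input  */
} rfc_table_t;

static inline uint16_t lut1(const rfc_table_t *t, uint32_t i) {
    return t->eq[i];                                 /* one memory access   */
}
static inline uint16_t lut2(const rfc_table_t *t, uint16_t a, uint16_t b) {
    return t->eq[(uint32_t)a * t->cols + b];         /* cross-product index */
}

/* Assumed table set; a real classifier builds these during preprocessing. */
extern rfc_table_t p0_sip_hi, p0_sip_lo, p0_dip_hi, p0_dip_lo,
                   p0_sport, p0_dport, p0_proto;
extern rfc_table_t p1_src, p1_dst, p1_misc, p2_a, p2_b, p3_final;

static uint16_t rfc_classify(uint32_t sip, uint32_t dip,
                             uint16_t sp, uint16_t dp, uint8_t proto) {
    /* phase 0: 7 accesses on raw 16-bit chunks */
    uint16_t a = lut1(&p0_sip_hi, sip >> 16), b = lut1(&p0_sip_lo, sip & 0xffff);
    uint16_t c = lut1(&p0_dip_hi, dip >> 16), d = lut1(&p0_dip_lo, dip & 0xffff);
    uint16_t e = lut1(&p0_sport, sp), f = lut1(&p0_dport, dp);
    uint16_t g = lut1(&p0_proto, proto);
    /* phase 1: 3 accesses combining phase-0 class IDs */
    uint16_t h = lut2(&p1_src, a, b), i = lut2(&p1_dst, c, d);
    uint16_t j = lut2(&p1_misc, e, f);
    /* phase 2: 2 accesses */
    uint16_t k = lut2(&p2_a, h, i), l = lut2(&p2_b, j, g);
    /* phase 3: 1 access yields the matching rule ID (13 accesses total) */
    return lut2(&p3_final, k, l);
}
```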
DEVELOPING TIC ALGORITHM - TIC ALGORITHM DESCRIPTION
Two-stage Interpreting based Classification (TIC) algorithm
  Stage 1: source IP address, destination IP address
    retrieve the list of range expressions (possibly matched rules) from the source and destination IP addresses
  Stage 2: source port, destination port, protocol
    search for matched rules with source port, destination port and protocol in the code block using the Range Interpreter (RI)
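A hedged sketch of the two-stage flow in C; the function names and the shape of the stage-1 result are assumptions made for illustration, not the authors' interfaces.

```c
#include <stdint.h>

typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
} pkt_hdr_t;

/* Stage 1: map the two IP addresses to the address of the first code block
 * holding the range instructions of the possibly matched rules. */
const uint8_t *tic_stage1(uint32_t src_ip, uint32_t dst_ip);

/* Stage 2: the Range Interpreter scans that code block and returns the
 * first (highest-priority) rule whose port/protocol ranges match; 0 = miss. */
uint16_t tic_interpret(const uint8_t *code, uint16_t src_port,
                       uint16_t dst_port, uint8_t proto);

static uint16_t tic_classify(const pkt_hdr_t *p) {
    const uint8_t *code = tic_stage1(p->src_ip, p->dst_ip);
    return tic_interpret(code, p->src_port, p->dst_port, p->proto);
}
```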
DEVELOPING TIC ALGORITHM - INSTRUCTION ENCODING (1/2)
Operator (8-bit): protocol-srcPORT-desPORT class combination
Operand0 (8-bit): protocol
Operand1~Operand4 (16-bit each): srcPORT (begin|end), desPORT (begin|end)
RuleID (16-bit): ID of the matched rule; the maximum # of rules in a classifier is 64K (2^16)
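A sketch of one fully specified TIC instruction, using the field widths from this slide; the field order and packing are assumptions, not the authors' exact binary layout.

```c
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  op;          /* operator: protocol-srcPORT-desPORT class combo */
    uint8_t  proto;       /* operand0: protocol value (when not wildcard)   */
    uint16_t sport_begin; /* operand1: source-port range begin              */
    uint16_t sport_end;   /* operand2: source-port range end                */
    uint16_t dport_begin; /* operand3: destination-port range begin         */
    uint16_t dport_end;   /* operand4: destination-port range end           */
    uint16_t rule_id;     /* matched rule ID; 16 bits -> at most 64K rules  */
} tic_insn_full_t;        /* 12 bytes when every operand is present         */
#pragma pack(pop)
```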
DEVELOPING TIC ALGORITHM - INSTRUCTION ENCODING (2/2)
Port-range classes of ClassBench:
  WC (wildcard)
  HI ([1024 : 65535])
  LO ([0 : 1023])
  AR (arbitrary range)
  EM (exact match)
Protocol-range classes: WC, EM
A small example of instruction encoding: an EM-WC-WC rule uses a 4-byte instruction
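A hedged encoder sketch for the 4-byte case mentioned above: with an exact-match protocol and both ports wildcarded, all four port operands can be dropped. The opcode value and byte order are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define OP_EM_WC_WC 0x01  /* assumed opcode for the EM-WC-WC class combination */

static size_t emit_em_wc_wc(uint8_t *buf, uint8_t proto, uint16_t rule_id) {
    buf[0] = OP_EM_WC_WC;               /* operator                       */
    buf[1] = proto;                     /* operand0: exact protocol value */
    buf[2] = (uint8_t)(rule_id >> 8);   /* RuleID, high byte              */
    buf[3] = (uint8_t)(rule_id & 0xff); /* RuleID, low byte               */
    return 4;                           /* 4-byte instruction, vs 12B max */
}
```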
DEVELOPING TIC ALGORITHM - THE RANGE INTERPRETER
All instruction blocks are stored in external memory
The address of the first code block is obtained after stage 1
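A minimal Range Interpreter loop, assuming the encoding sketched earlier; the opcode set, the end-of-block marker, and the first-match-wins policy are assumptions made for illustration.

```c
#include <stdint.h>

#define OP_END      0x00  /* assumed end-of-block marker      */
#define OP_EM_WC_WC 0x01  /* exact protocol, ports wildcarded */
#define OP_EM_AR_AR 0x02  /* exact protocol, two port ranges  */

static uint16_t ri_run(const uint8_t *code, uint16_t sp, uint16_t dp,
                       uint8_t proto) {
    for (;;) {
        uint8_t op = *code++;           /* decode operator, then its operands */
        switch (op) {
        case OP_END:
            return 0;                   /* no rule matched                    */
        case OP_EM_WC_WC: {             /* 4-byte instruction                 */
            uint8_t  p   = *code++;
            uint16_t rid = (uint16_t)(code[0] << 8 | code[1]); code += 2;
            if (p == proto) return rid;
            break;
        }
        case OP_EM_AR_AR: {             /* 12-byte instruction                */
            uint8_t  p   = *code++;
            uint16_t sb  = (uint16_t)(code[0] << 8 | code[1]);
            uint16_t se  = (uint16_t)(code[2] << 8 | code[3]);
            uint16_t db  = (uint16_t)(code[4] << 8 | code[5]);
            uint16_t de  = (uint16_t)(code[6] << 8 | code[7]);
            uint16_t rid = (uint16_t)(code[8] << 8 | code[9]); code += 10;
            if (p == proto && sp >= sb && sp <= se && dp >= db && dp <= de)
                return rid;
            break;
        }
        default:
            return 0;                   /* unknown opcode: stop interpreting  */
        }
    }
}
```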
ARCHITECTURE-AWARE DESIGN AND IMPLEMENTATION (1/3)
Hardware
  Intel Core 2 Duo with two levels of cache (multi-core architecture)
    4MB L2 cache and 64B cache lines
  Intel IXP2800 without cache (multi-threaded architecture)
ARCHITECTURE-AWARE DESIGN AND IMPLEMENTATION (2/3)
Space Reduction
  CISC-style instruction encoding produces a smaller program size than RISC encoding
  the variable-size CISC encoding saves up to 15% of memory compared with a fixed 8-byte RISC encoding
ARCHITECTURE-AWARE DESIGN AND IMPLEMENTATION (3/3)
Latency hiding of memory accesses
  Core 2 Duo
    one CPU core runs a helper thread that warms up the shared L2 cache
    the main thread on the other core executes faster because the needed cache lines have already been fetched
  IXP2800
    1) issue outstanding memory requests whenever possible
      memory operations of phase 0 in the first stage can be issued simultaneously
    2) overlap memory accesses with ALU execution
      memory-address calculation can be overlapped with other memory operations
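A hedged sketch of the Core 2 Duo helper-thread idea: one core prefetches the code blocks of queued packets so the shared L2 is warm when the main thread interprets them. The queue structure, batch size and prefetch hints are assumptions, not the authors' implementation.

```c
#include <pthread.h>
#include <stdint.h>

#define BATCH 32

typedef struct {
    const uint8_t *code_block[BATCH];  /* stage-1 results awaiting stage 2 */
    volatile int   count;
} work_queue_t;

static void *helper_warm_l2(void *arg) {
    work_queue_t *q = (work_queue_t *)arg;
    for (int i = 0; i < q->count; i++) {
        /* Touch each pending code block; __builtin_prefetch pulls the line
         * toward the shared L2 without stalling the helper core. */
        __builtin_prefetch(q->code_block[i], 0 /* read */, 2 /* keep in L2 */);
    }
    return NULL;
}

/* Launch: pthread_create(&tid, NULL, helper_warm_l2, &queue);
 * the main thread then runs the Range Interpreter on the same code blocks
 * and finds most of the lines already resident in L2. */
```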
SIMULATION AND PERFORMANCE ANALYSIS Effective Space Reduction
SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR CORE 2 DUO (1/2)
In the worst cases
  quite a large number of L2 cache misses, so the overhead of the RI is insignificant
  TIC's worst-case classification speed is slower than RFC's in the 1- and 2-thread cases but faster in the 3- and 4-thread cases
  TIC's performance is better than RFC's when more threads are available
SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR CORE 2 DUO (2/2)
In the average cases
  the main thread has very few L2 cache misses, so the interpreter overhead might be noticeable
  TIC is still faster than RFC in terms of classification speed when RFC's memory footprint is bigger than the L2 cache size
SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR IXP2800 (1/3) TIC’s performance is worse than RFC’s
SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR IXP2800 (2/3)
Factors behind the poor performance
  fewer memory accesses but more long-word accesses
    the FIFO size of TIC is bigger than that of RFC in both the average and the worst cases
    an SRAM operation stays in the FIFO longer for TIC than for RFC
SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR IXP2800 (3/3)
Block Size Impact on IXP2800
  both the classification speed and the speedup are higher with a 32B block size than with 64B