1
StrideBV: Single chip 400G+ packet classification Author: Thilan Ganegedara, Viktor K. Prasanna Publisher: HPSR 2012 Presenter: Chun-Sheng Hsueh Date: 2012/12/12
2
Introduction From a hardware perspective, the main bottleneck in implementing packet classification engines has been the amount of memory required to store the rule set. To address this, various solutions have been proposed to reduce the memory footprint of rule set storage. Most of these solutions exploit some properties or features of the rule set to achieve memory efficiency.
3
Introduction This work treats throughput as the primary goal, while memory efficiency is kept secondary. StrideBV is a high-throughput (400+ Gbps) packet classification scheme based on the field-split algorithm, and its performance is independent of rule set features.
4
BACKGROUND AND RELATED WORK FPGAs are widely used in various networking applications. Even though the operating frequency of an FPGA is low, fine-grained pipelining can be used to dramatically improve performance.
5
ALGORITHM AND CLASSIFICATION PROCESS Outline: Problem Definition; Field-split Algorithm; StrideBV: Algorithm; Multi-match to Single-match; StrideBV: Lookup Process and Hardware Architecture
6
Problem Definition The most widely used scheme is 5-field packet classification in which the following tuple of headers of each incoming packet is inspected: Source IP (SA), Destination IP (DA), Source Port (SP), Destination Port (DP), Protocol (PR)
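As a rough illustration of the 5-tuple above, the sketch below concatenates the five fields into a single lookup key. The field widths (32, 32, 16, 16 and 8 bits) are the standard IPv4/TCP/UDP widths and are an assumption here, as are the function name `pack_header` and the choice of SA as the most significant field; the slides do not prescribe a bit layout.

```python
import ipaddress

# Assumed field widths in bits, in the order SA, DA, SP, DP, PR (IPv4).
FIELD_WIDTHS = (32, 32, 16, 16, 8)

def pack_header(sa, da, sp, dp, pr):
    """Concatenate the five header fields into one integer key (SA most significant)."""
    values = (
        int(ipaddress.IPv4Address(sa)),
        int(ipaddress.IPv4Address(da)),
        sp,
        dp,
        pr,
    )
    key = 0
    for value, width in zip(values, FIELD_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit its width")
        key = (key << width) | value
    return key

# Hypothetical packet: TCP (protocol 6) from 10.0.0.1:1234 to 192.168.1.7:80.
key = pack_header("10.0.0.1", "192.168.1.7", 1234, 80, 6)
print(sum(FIELD_WIDTHS), "bit key:", format(key, "026x"))
```

The widths sum to 104 bits, which is the number of header bits a classifier has to inspect per packet.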
7
Problem Definition Given a packet classification rule set of N rules over d packet header fields f0, f1, ..., fd−1, devise: 1) a lookup scheme whose performance is independent of the features or properties of the rule set; 2) a hardware architecture that performs wire-speed packet classification at 400 Gbps and beyond.
8
Field-split Algorithm
9
StrideBV: Algorithm This paper applies the field-split algorithm to all 5 fields. The challenge in doing so is the pipeline length: with single-bit inspection, the pipeline needs one stage per header bit. In the case of 5-field packet classification, this amounts to 104 pipeline stages.
10
StrideBV: Algorithm The pipeline length can be reduced by inspecting multiple bits per stage (a k-bit stride) rather than a single bit. This can be done in two different ways, by storing: 1) bit vectors corresponding to the 2^k combinations of the k-bit stride, loading a single bit vector per stage; 2) 2×k bit vectors corresponding to the individual bits of the k-bit stride, loading multiple bit vectors per stage.
11
StrideBV: Algorithm The first method consumes more memory while reducing memory bandwidth; the second method saves memory at the cost of memory bandwidth.
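The trade-off between the two methods can be made concrete with a small back-of-the-envelope model. This is a sketch under stated assumptions, not the paper's own accounting: every stage is treated as a full k-bit stride (the stage count is rounded up), each bit vector is N bits wide, and the function names are hypothetical.

```python
import math

def stages(total_bits, k):
    """Number of pipeline stages when k header bits are inspected per stage."""
    return math.ceil(total_bits / k)

def method1_cost(total_bits, n_rules, k):
    """Method 1: store 2^k N-bit vectors per stage, read one of them per stage."""
    s = stages(total_bits, k)
    return {"stages": s,
            "memory_bits": s * (2 ** k) * n_rules,
            "vectors_read_per_stage": 1}

def method2_cost(total_bits, n_rules, k):
    """Method 2: store 2*k N-bit vectors per stage, read k of them per stage."""
    s = stages(total_bits, k)
    return {"stages": s,
            "memory_bits": s * 2 * k * n_rules,
            "vectors_read_per_stage": k}

# 104 header bits, 512 rules, stride k = 4.
print(method1_cost(104, 512, 4))
print(method2_cost(104, 512, 4))
```

For k = 1 the two methods coincide; as k grows, the first method's memory grows as 2^k per stage while its reads per stage stay at one, and the second method's memory grows only linearly while its reads per stage grow with k.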
12
StrideBV: Algorithm However, it should be noted that with the second method (2×k bit vectors per stride, multiple bit vectors loaded per stage), k N-bit AND operations need to be performed in a single stage. This increases the amount of work done per stage, which causes the clock period to increase.
13
StrideBV: Algorithm Since the goal of this paper is to implement a high-throughput packet classification engine, we opt for the first method at the cost of increased memory consumption.
14
Multi-match to Single-match The output of the lookup engine is the bit vector that indicates the matching rules for the input packet header. However, in packet classification, only the highest-priority match is reported, since routing is the main concern. The rules of a classifier are sorted in order of decreasing priority, so this task can be easily realized using a priority encoder. However, as the length of the bit vector increases, the time required to report the highest-priority match increases significantly.
15
Multi-match to Single-match As a remedy, we introduce a Pipelined Priority Encoder (PPE). A PPE for an N-bit vector consists of log N stages, and since the work done per stage is trivial, the PPE is able to operate at very high frequencies.
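One way to see why log N stages suffice is a pairwise reduction: each stage compares neighbouring candidates and keeps the higher-priority valid one, halving the number of candidates. The sketch below models that idea in software. The binary (base-2) tree, the power-of-two length restriction and the name `ppe` are assumptions for illustration; the slides do not give the internal structure of the PPE.

```python
def ppe(bits):
    """Software model of a pipelined priority encoder.

    bits[0] is the highest-priority rule. Returns (index, stages_used), where
    index is the position of the first set bit, or None if no bit is set.
    """
    n = len(bits)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # Each candidate carries a valid flag and the rule index it stands for.
    level = [(bool(b), i) for i, b in enumerate(bits)]
    stages_used = 0
    while len(level) > 1:
        # One pipeline stage: keep the left (higher-priority) candidate if it
        # is valid, otherwise pass the right one through.
        level = [left if left[0] else right
                 for left, right in zip(level[0::2], level[1::2])]
        stages_used += 1
    valid, index = level[0]
    return (index if valid else None), stages_used

print(ppe([0, 0, 1, 0, 1, 0, 0, 0]))
```

Each stage does only a two-way select per pair, which matches the slide's point that the per-stage work is trivial.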
16
StrideBV: Lookup Process and Hardware Architecture The output of each stage memory is the N-bit vector corresponding to the input stride. This bit vector is ANDed with the bit vector from the preceding stage to produce the intermediate result.
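The lookup described above can be sketched in software as follows. This is a minimal model under several assumptions that are not in the slides: rules are given as ternary strings over '0', '1' and '*', the header width is a multiple of k, bit i of each Python integer stands for rule i, and the function names are hypothetical. Each stage memory holds 2^k bit vectors (the first method), indexed by the stride value.

```python
def build_stage_memories(rules, k):
    """Build one table of 2^k bit vectors per k-bit stride of the header."""
    width = len(rules[0])
    if width % k or any(len(r) != width for r in rules):
        raise ValueError("all rules must share a width that is a multiple of k")
    memories = []
    for start in range(0, width, k):
        table = []
        for value in range(2 ** k):
            pattern = format(value, f"0{k}b")
            vector = 0
            for i, rule in enumerate(rules):
                chunk = rule[start:start + k]
                # Rule i matches this stride value if every bit is equal or a wildcard.
                if all(c in ("*", p) for c, p in zip(chunk, pattern)):
                    vector |= 1 << i
            table.append(vector)
        memories.append(table)
    return memories

def lookup(memories, header, k, n_rules):
    """AND the per-stage bit vectors selected by each stride of the header."""
    result = (1 << n_rules) - 1  # start with every rule still a candidate
    for stage, table in enumerate(memories):
        stride = int(header[stage * k:(stage + 1) * k], 2)
        result &= table[stride]
    return result

# Toy 4-bit rule set; rule 0 has the highest priority.
rules = ["10**", "1***", "0101", "****"]
memories = build_stage_memories(rules, k=2)
print(format(lookup(memories, "1011", 2, len(rules)), "04b"))
```

The result is the multi-match bit vector; in hardware each loop iteration corresponds to one pipeline stage, so a new header can enter every clock cycle.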
17
StrideBV: Lookup Process and Hardware Architecture This process is implemented as a linear pipeline based on Static Random Access Memory (SRAM). The output of this pipeline is the multi-match result. In order to extract the highest-priority match, the StrideBV pipeline is followed by a PPE.
18
PERFORMANCE ANALYSIS ON FPGA In this section, we provide a detailed analysis of the StrideBV architecture under different configurations. The performance of the proposed architecture is measured in terms of throughput, memory efficiency, power and resource usage. A state-of-the-art Xilinx Virtex-6 device (XC6VLX760) was used for the experiments, and the results presented here are based on post place-and-route performance. Since StrideBV does not rely on rule set features, we evaluated performance using rule set sizes ranging from 32 to 512 rules, which are representative of real-life firewall classifiers.
19
PERFORMANCE ANALYSIS - Throughput There were two options for stage memory: 1) distributed RAM or 2) block RAM. The tradeoff is memory size vs. clock period. This paper opted for distributed RAM, since the memory requirement for real-life classifiers in this application is less than the maximum distributed RAM available on the considered FPGA.
20
PERFORMANCE ANALYSIS - Throughput A single pipeline is not adequate to support 400 Gbps: it would require an operating frequency of 1.25 GHz, which current FPGA devices do not support. Therefore, this paper employed 4 pipelines for rule set sizes below 512 and 6 pipelines for rule set size 512.
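The 1.25 GHz figure follows from simple arithmetic, assuming minimum-size 40-byte packets and one packet accepted per pipeline per clock cycle. The sketch below reproduces it and shows how the required clock drops as pipelines are added; the function names are hypothetical.

```python
def required_packet_rate(throughput_bps, packet_bytes):
    """Packets per second needed to sustain the given throughput."""
    return throughput_bps / (packet_bytes * 8)

def per_pipeline_clock(throughput_bps, packet_bytes, pipelines):
    """Clock frequency each pipeline needs if it accepts one packet per cycle."""
    return required_packet_rate(throughput_bps, packet_bytes) / pipelines

for pipelines in (1, 4, 6):
    hz = per_pipeline_clock(400e9, 40, pipelines)
    print(pipelines, "pipeline(s):", round(hz / 1e6, 1), "MHz")
```

With 4 pipelines each one needs about 312.5 MHz, and with 6 pipelines about 208.3 MHz, which is why larger rule sets (with longer bit vectors and slower clocks) call for more pipelines.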
21
PERFORMANCE ANALYSIS - Throughput Figure 3 shows the throughput variation with the size of the classifier for various stride sizes for minimum packet size (40 Bytes).
22
PERFORMANCE ANALYSIS - Memory Efficiency The authors employed only distributed RAM (built using logic) in this work. Since the stage memory is dual-ported, each copy of the memory can serve two pipelines. To calculate the total memory consumption for 4 and 6 parallel pipelines, multiplication factors of 2× and 3× therefore have to be introduced, respectively.
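The 2× and 3× factors are consistent with one memory copy per pair of pipelines. A minimal sketch of that accounting, with hypothetical function names and a made-up single-pipeline memory size used only as an example input:

```python
import math

PORTS_PER_MEMORY = 2  # dual-ported stage memory, as stated in the slides

def memory_copies(pipelines):
    """Number of stage-memory copies needed when each copy serves two pipelines."""
    return math.ceil(pipelines / PORTS_PER_MEMORY)

def total_memory_bits(single_pipeline_bits, pipelines):
    """Total memory: the single-pipeline footprint times the number of copies."""
    return single_pipeline_bits * memory_copies(pipelines)

# Example input only: 100,000 bits for one pipeline is not a figure from the paper.
print(memory_copies(4), memory_copies(6), total_memory_bits(100_000, 6))
```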
23
PERFORMANCE ANALYSIS – Power Per Unit Throughput To measure the power consumption of the design, we used the XPower Analyzer tool available in the Xilinx ISE 12.4 suite. Using a small stride size yields lower power efficiency, mainly because of the extensive resource usage.
24
PERFORMANCE ANALYSIS - Resource Consumption
25
Comparison with Existing Approaches We compare the worst-case performance of several existing solutions to illustrate the benefits of StrideBV. For this evaluation, we considered a 5-field classification rule set with 512 rules for all the schemes.