GLOBECOM (Global Communications Conference), 2012

Performance Evaluation of Packet Classification on FPGA-based TCAM Emulation Architectures
Presenter: NTHU 李若萍

Outline
Introduction
Related Work
TCAM Emulation
RAM-based TCAM Architecture
Performance Evaluation
Conclusion

Introduction
Packet fields are used as keys to determine the best matching rule and apply the corresponding action. Packet classification is widely deployed in routers, switches, firewalls, and network services.
Exact matching: host addressing and protocol flag fields. The packet field must be identical to the corresponding rule field.
Prefix matching: network addressing. The rule field value is a prefix of the corresponding packet field; the longer the prefix, the higher its priority.
Range matching: specifying port ranges. The packet field value must fall within the range given by the corresponding rule field.
How to find the best matching rule? A packet may match several rules, so each rule is assigned a cost.
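The three match types and the cost-based choice can be sketched in software as follows. This is a minimal illustration, not the hardware method evaluated in the talk: the `Rule` fields, the sample rules, and the convention that a lower cost means a better rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    cost: int        # assumed convention: lower cost = preferred rule
    protocol: int    # exact-match field
    prefix: int      # prefix-match field, left-aligned in 32 bits
    prefix_len: int  # number of significant leading bits
    port_lo: int     # range-match field, inclusive bounds
    port_hi: int


def matches(rule: Rule, protocol: int, addr: int, port: int) -> bool:
    exact_ok = protocol == rule.protocol
    shift = 32 - rule.prefix_len
    prefix_ok = (addr >> shift) == (rule.prefix >> shift)
    range_ok = rule.port_lo <= port <= rule.port_hi
    return exact_ok and prefix_ok and range_ok


def classify(rules: List[Rule], protocol: int, addr: int, port: int) -> Optional[Rule]:
    """Linear search: collect every matching rule, return the cheapest one."""
    hits = [r for r in rules if matches(r, protocol, addr, port)]
    return min(hits, key=lambda r: r.cost) if hits else None


# Hypothetical rule set: a specific web rule and a TCP catch-all.
RULES = [
    Rule("web", 1, 6, 0x0A000000, 8, 80, 80),
    Rule("any-tcp", 5, 6, 0x00000000, 0, 0, 65535),
]
```

A packet to 10.1.2.3 port 80 matches both rules here, and the cost decides between them.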

Introduction (cont.)
CAM (Content Addressable Memory) and TCAM (Ternary Content Addressable Memory)
[Figure: CAM and TCAM cell schematics. Labels: Data, Key, Mask, Actual Data, SRAM cell, VCC, Match, Match line, X (don't care).]
SRAM and CAM: each bit has two states, 0 and 1, so only exact match is supported.
TCAM: each bit has three states, 0, 1, and X (don't care), so exact, prefix, and range match are supported.
CAM: Match = (Key ≠ Data); Match line = !Match
TCAM: Match = (Key ≠ Data) & Mask; Match line = !Match
In both cells the intermediate signal labeled Match is asserted when a compared bit differs, so the match line stays high only if no compared bit mismatches.
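The two cell equations can be written at word level as a short sketch. The function names are made up for illustration, and the mask convention (1 = compare this bit, 0 = don't care) is assumed from the formula Match = (Key ≠ Data) & Mask.

```python
def cam_match(key: int, data: int) -> bool:
    """Binary CAM word: the match line is high only if no bit differs."""
    mismatch = key ^ data          # per-bit (key != data)
    return mismatch == 0


def tcam_match(key: int, data: int, mask: int) -> bool:
    """Ternary CAM word: bits whose mask bit is 0 are don't care (X)."""
    mismatch = (key ^ data) & mask  # only masked-in bits can mismatch
    return mismatch == 0
```

With `data = 0b1000` and `mask = 0b1100`, the entry behaves like the ternary pattern `10XX`: any key starting with `10` matches.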

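Introduction (cont.)
TCAMs (Ternary Content Addressable Memories)
[Figure: TCAM lookup path. Labels: compared key, TCAM (stores rules), compared result (N bits), priority encoder, memory address, RAM (memory address used as index to find the corresponding action).]
Rules are stored in the memory array in decreasing order of priority. The search key is compared against every entry in parallel, bit by bit, and the result is an N-bit vector. An N-bit priority encoder finds the address of the highest-priority matching rule, and that address is used as an index to fetch the associated action (permit or deny) from memory. The flow is simple and fast: one classification completes in a single clock cycle.
Drawbacks:
Capacity constraints due to core cell area: how much TCAM fits on a chip is limited by the size of the core cell.
Storage inefficiency due to range expansion.
High power consumption: every search has to examine all the data on the chip.
Limited scalability for wide keys: the key cannot easily be widened.
As a result, for the same chip size a RAM stores more data than a TCAM.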
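The lookup path described above (parallel compare, priority encoder, action RAM) can be modeled sequentially as a sketch. The entry list, the action strings, and the function name are hypothetical; real hardware evaluates all entries in the same clock cycle rather than in a loop.

```python
from typing import List, Optional, Tuple


def tcam_lookup(entries: List[Tuple[int, int]],
                actions: List[str],
                key: int) -> Optional[Tuple[int, str]]:
    """entries holds (data, mask) pairs stored in decreasing priority order."""
    # Step 1: compare the key with every entry (parallel in hardware).
    match_vector = [((key ^ data) & mask) == 0 for data, mask in entries]
    # Step 2: priority encoder, the lowest matching address wins.
    for address, hit in enumerate(match_vector):
        if hit:
            # Step 3: the address indexes the action memory.
            return address, actions[address]
    return None


# Hypothetical 3-bit rule table: 101, 10X, XXX.
ENTRIES = [(0b101, 0b111), (0b100, 0b110), (0b000, 0b000)]
ACTIONS = ["deny", "permit", "deny"]
```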

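Introduction (cont.)
Purpose: we investigated performance and trade-offs related to TCAM emulation in FPGAs (Field-Programmable Gate Arrays), not ASICs (Application-Specific Integrated Circuits). We considered the impact of encoding different key ranges on rules for different configurations in terms of the search key length and the number of rules.
An ASIC is an integrated circuit designed to a custom specification for a particular use; it is a fully custom circuit.
An FPGA implements logic described in a hardware description language (Verilog or VHDL); synthesis and place-and-route tools let a design be programmed onto the FPGA quickly for testing.
Compared with an ASIC, an FPGA is slower, cannot accommodate designs as complex, and consumes more power.
In return, an FPGA gives a fast path to a working product: its internal logic can be modified repeatedly by the designer to fix errors, and debugging on an FPGA is comparatively cheap.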

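Related Work
Hardware-assisted packet classification
Decision tree: a decision tree is built from the classifier's rules and then traversed using the packet fields. Actual efficiency and data requirements are hard to predict, and the hierarchical splitting of rule patterns complicates incremental updates.
Decomposition: the multi-field search is decomposed into independent single-field searches whose results are then merged. This suits hardware implementation but has the cross-producting stage issue: the cross-product table must be rebuilt whenever rules change, so it fits static classifiers that rarely change.
Exhaustive search: every rule in the classifier is checked. Memory requirements are predictable but large.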

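TCAM Emulation
[Figure: native TCAM (left) and emulated TCAM (right). In both, the example key matches the first and the last rule.]
Native TCAM (left figure): the Altera APEX family provides on-chip native TCAM, which completes a classification in 1 clock cycle. Its drawbacks are (1) no scalability and (2) high cost.
With a linear search in RAM over n stored rules, classifying one key can take up to n clock cycles.
Emulated TCAM (right figure): registers are used to emulate the TCAM structure, and the key serves directly as the address that returns the data (key 101 returns the first and the last rule). This approach is (1) simple, but (2) its scalability is limited by routing delays and limited resources.
The most obvious problem is memory expansion: the memory has to hold the match vector for every possible key value. A more scalable approach is to implement a restricted address expansion, which stores fewer match vectors yet still covers all keys.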
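The memory expansion problem can be made concrete with a small sketch of full address expansion: a table with one match vector per possible key value. The rule set below is hypothetical and was chosen only so that key 101 hits the first and the last rule; the function name is also an assumption.

```python
from typing import List, Tuple


def build_expanded_ram(rules: List[Tuple[int, int]], m: int) -> List[int]:
    """Precompute, for every m-bit key, the bit vector of matching rules.

    rules holds (data, mask) pairs; bit i of an entry is set when rule i
    matches the key used as the address. The table has 2**m entries.
    """
    ram = [0] * (1 << m)
    for key in range(1 << m):
        for i, (data, mask) in enumerate(rules):
            if ((key ^ data) & mask) == 0:
                ram[key] |= 1 << i
    return ram


# Hypothetical 3-bit rules: 101, 010, 1XX.
RULES = [(0b101, 0b111), (0b010, 0b111), (0b100, 0b100)]
RAM = build_expanded_ram(RULES, m=3)
```

A lookup is then a single read, `RAM[key]`, but the table doubles in size with every extra key bit, which is why the address expansion has to be restricted for realistic key widths.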

RAM-based TCAM Architecture
To reduce the address space, the m-bit key is cut into segments of width w, and each segment addresses its own RAM block of depth 2^w (for one rule, each segment contributes one block of that depth).
Configurations shown in the figure for an m-bit key with m = 10:
w = 1: m/w = 10/1 = 10 RAM blocks, block size = 2^w = 2 (addresses 0 to 1); equivalent to native TCAM
w = 2: m/w = 10/2 = 5 RAM blocks, block size = 2^w = 2^2 = 4 (addresses 0 to 3)
w = m-2 = 8: m/w = 10/8 = 1 RAM block, block size = 2^w = 2^8 (addresses 0 to 2^8-1)
w = m-1 = 9: m/w = 10/9 = 1 RAM block, block size = 2^w = 2^9 (addresses 0 to 2^9-1)
w = m = 10: m/w = 1 RAM block, block size = 2^w = 2^10 (addresses 0 to 2^10-1); full address expansion in a single RAM block
Exact and prefix matches are barely affected by the choice of w, but range matches differ markedly: the larger w is, the larger the range interval, and the more memory is required. A smaller w consumes less memory but is also weaker with respect to rule expansion.
An M9K BRAM can be configured from 8K×1 to 256×36, and an M20K BRAM from 16K×1 to 512×40.
BRAM demand: (m/w) × 2^w bits
BRAM mode = depth × width
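The (m/w) × 2^w expression can be tabulated with a few lines of code. This sketch assumes that m/w is rounded up so that every key bit is covered by some segment; the function name is hypothetical.

```python
from math import ceil
from typing import Tuple


def emulation_bits_per_rule(m: int, w: int) -> Tuple[int, int, int]:
    """Return (RAM blocks, block depth, total bits) for one rule.

    Assumption: the key is cut into ceil(m / w) segments, each addressing
    a block of 2**w one-bit entries.
    """
    blocks = ceil(m / w)
    depth = 1 << w
    return blocks, depth, blocks * depth


if __name__ == "__main__":
    for w in (1, 2, 5, 10):
        print(w, emulation_bits_per_rule(10, w))
```

For a 10-bit key the per-rule cost stays at 20 bits for w = 1 and w = 2, then grows quickly as w approaches m, which is the trade-off between memory and supported range width.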

RAM-based TCAM Architecture (cont.)
Example: 64 rules with a 16-bit key (n = 64, m = 16).
Native TCAM needs 64 × 16 = 1,024 bits (1 Kb).
Emulating with a single RAM block needs 64 × 2^16 = 4,194,304 bits (4 Mb).
Figure labels: 16-bit key; 2^16 × 64; m/w = 16/6 = 2; w = 6; 2^8 × 32 × 4; RAM block size = 2^w = 2^6 = 64.
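The storage figures for this example can be checked with a short calculation. The sketch counts bits only and assumes the key splits evenly into w-bit segments; the w = 8 case corresponds to the 2^8 × 32 × 4 organization (four 256 × 32 blocks). The function names are hypothetical.

```python
def native_bits(n: int, m: int) -> int:
    """Native TCAM storage: one ternary word of m bits per rule."""
    return n * m


def emulated_bits(n: int, m: int, w: int) -> int:
    """Emulated storage: m/w segments, each 2**w deep and n bits wide."""
    if m % w != 0:
        raise ValueError("this sketch assumes w divides m")
    return (m // w) * (1 << w) * n


N_RULES, KEY_BITS = 64, 16
full_expansion = emulated_bits(N_RULES, KEY_BITS, 16)  # single RAM block
two_segments = emulated_bits(N_RULES, KEY_BITS, 8)     # key cut in two halves
```

Cutting the key into two 8-bit halves brings the emulation from about 4 Mb down to 32 Kb, against 1 Kb for the native TCAM.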

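RAM-based TCAM Architecture (cont.)
[Figure: emulated TCAM datapath; the orange boxes are RAM blocks.]
n = 64: the upper and lower outputs are ANDed into a 64-bit vector whose bits indicate whether each rule matches.
The key is cut into two halves, so each 8-bit half addresses 256 locations.
The Rule register selects whether the upper or the lower block is accessed; each block outputs 32 bits, and the outputs are merged into 64 bits.
Update: one of the 64 rules is updated at a time, and the update is applied for every key value. The Rule register specifies which rule to update. Since n = 64 = 2^6, 6 bits have to be updated, which takes 2^w = 2^6 clock cycles.
There are many ways to implement this function, but the authors consider that this one performs a complete rule set update in the least time.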
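A behavioral model may help picture the segmented lookup and the per-rule update. It is a sketch under assumptions, not the authors' design: rules are given as (data, mask) pairs, w divides m, and the class and method names are invented. The 32-bit-wide physical blocks are folded into one n-bit integer per address.

```python
class SegmentedTcam:
    """Software model of a RAM-based TCAM with m/w blocks of depth 2**w."""

    def __init__(self, n: int, m: int, w: int) -> None:
        if m % w != 0:
            raise ValueError("this sketch assumes w divides m")
        self.n, self.m, self.w = n, m, w
        self.segments = m // w
        # ram[s][addr] is an n-bit vector: bit i set = rule i matches addr.
        self.ram = [[0] * (1 << w) for _ in range(self.segments)]

    def _chunk(self, value: int, s: int) -> int:
        """Extract segment s (0 = most significant) of an m-bit value."""
        shift = self.m - (s + 1) * self.w
        return (value >> shift) & ((1 << self.w) - 1)

    def update_rule(self, index: int, data: int, mask: int) -> None:
        """Rewrite one rule's bit at all 2**w addresses of every block."""
        bit = 1 << index
        for s in range(self.segments):
            d, k = self._chunk(data, s), self._chunk(mask, s)
            for addr in range(1 << self.w):
                if ((addr ^ d) & k) == 0:
                    self.ram[s][addr] |= bit
                else:
                    self.ram[s][addr] &= ~bit

    def lookup(self, key: int) -> int:
        """AND the vectors read from each block; returns the match vector."""
        vector = (1 << self.n) - 1
        for s in range(self.segments):
            vector &= self.ram[s][self._chunk(key, s)]
        return vector


tcam = SegmentedTcam(n=64, m=16, w=8)
tcam.update_rule(0, 0xAB00, 0xFF00)   # hypothetical rule 0: 0xAB followed by don't care
tcam.update_rule(5, 0xABCD, 0xFFFF)   # hypothetical rule 5: exact 0xABCD
```

Each update walks the 2^w addresses of a block, which mirrors the 2^w-cycle update cost; in hardware the blocks could be written in parallel, so the inner loop over segments is only an artifact of the software model.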

Performance Evaluation
Resource Utilization
A TCAM bit typically demands 16 transistors, while a RAM bit demands only 6.
TCAM: w × m × 16 transistors
TCAM emulation: (m/w) × 2^w × 6 transistors
For larger ranges, the emulated TCAM needs more cell area; for small and medium ranges, it needs less area.
This comparison does not depend on the key size: both expressions grow linearly with m, so larger keys demand more transistors from both the TCAM and the emulated one, while their ratio depends only on w.
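Taking the two transistor expressions at face value, the crossover can be located numerically. The sketch assumes w divides m, and the function names are hypothetical; the ratio simplifies to 6·2^w / (16·w²), in which m cancels.

```python
def tcam_transistors(m: int, w: int) -> int:
    """Native TCAM estimate from the slide: w * m * 16."""
    return w * m * 16


def emulated_transistors(m: int, w: int) -> int:
    """Emulated TCAM estimate from the slide: (m / w) * 2**w * 6."""
    if m % w != 0:
        raise ValueError("this sketch assumes w divides m")
    return (m // w) * (1 << w) * 6


def emulated_to_native_ratio(w: int) -> float:
    """Ratio of the two estimates; the key width m cancels out."""
    return (6 * (1 << w)) / (16 * w * w)
```

With m = 32, a 4-bit segment gives 768 transistors for the emulation against 2,048 for the TCAM, while an 8-bit segment gives 6,144 against 4,096, which matches the statement that the emulation wins for small and medium ranges and loses for large ones.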

Performance Evaluation (cont.)
Memory per rule: (m/w) × 2^w bits
Small intervals can be encoded with 0.1 Kb to 1 Kb, depending on the key size. Larger intervals tend to consume from 1 Kb to 100 Kb per rule.

Classification Throughput
A crucial factor for evaluating emulated TCAM performance on FPGA is the actual classification throughput in terms of packets per second (pps).
Small keys (e.g., 32-bit) reach 300 Mpps for all rule sets.
Larger keys result in throughput below 300 Mpps, but at least 200 Mpps is guaranteed in every case.

Range Impact
We assess the impact of supporting different ranges in terms of memory requirements and classification rate.
Mid-size ranges (11-bit width) need a reasonable amount of memory and still give good classification performance.
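One way to picture why range width drives memory is to look at what a single rule stores in one w-bit segment. The sketch below assumes the range fits entirely inside one segment and is encoded directly as a lookup column; this is an illustrative assumption, not necessarily the exact encoding used in the evaluated design, and the function name is hypothetical.

```python
from typing import List


def encode_range_column(w: int, lo: int, hi: int) -> List[int]:
    """Return the 2**w one-bit entries one rule occupies in a w-bit segment.

    Entry addr is 1 when lo <= addr <= hi, so any interval inside the
    segment is matched with a single read and no rule expansion.
    """
    if not (0 <= lo <= hi < (1 << w)):
        raise ValueError("range must fit inside the w-bit segment")
    return [1 if lo <= addr <= hi else 0 for addr in range(1 << w)]


column = encode_range_column(4, 3, 9)
```

Under this assumption an 11-bit range needs a segment at least 11 bits wide, that is 2,048 entries per rule in that block, which is where the memory cost of wider ranges comes from.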

Conclusion
Classification rates above 300 Mpps for both large keys and rule sets can be implemented with only a few megabits of RAM when considering up to medium size range intervals.
Support for both large ranges and large rule sets tends to demand much more memory, which also penalizes the resulting classification rate.

Thank you! The End.
