Published by Francis Gordon. Modified over 9 years ago.
1
John DeHart jdd@arl.wustl.edu http://www.arl.wustl.edu/projects/techX Block Design Review: Lookup for IPv4 MR, LC Ingress and LC Egress
2
2 - John DeHart - 1/5/2016
Revision History
10/11/06 (JDD):
»Created
10/23/06 (JDD):
»Finished for presentation on 10/24/06
10/24/06 (JDD):
»Updates from comments during review.
»Added more TCAM info
»Added information on format of Database entry files
3
Guidelines for Design Reviews
Definition of interfaces
»In/Out
Block diagram of module
»Including list of files where code for each block/module exists.
Macros:
»List macros and files where they can be found
»For each macro, provide a few lines of comments in the code that describe the macro.
»Document local and global registers used by macro.
»Memory assumptions
  What addresses are pre-defined, etc.
Initialization of Memory
Data Structures
Control Blocks
Details of memory accesses, xfer register usage, signal usage.
Critical path
Testing
»Develop a well-defined acceptance test that convinces you that your block works
»Document acceptance test
  Pktgen "project" file?
Known bugs
Areas and suggestions for improvements.
4
4 - John DeHart - 1/5/2016 Contents Lookup Rx Tx QM Parse Header Format Substr Decap Lookup Phy Int Rx Switch Tx QM/Schd Key Extract Hdr Format Lookup Key Extract Switch Rx Phy Int Tx QM/Schd Hdr Format SWITCHSWITCH
5
File locations
Code
»src/applications/LC_Ingress/src/lookup/PL/lookup.uc
»src/applications/LC_Egress/src/lookup/PL/lookup.uc
»src/applications/IPv4_MR/src/lookup/PL/lookup.uc
Configuration and Database Entry Files:
»src/applications/LC_Ingress/build/PL/LCI_config.txt
  LC_Ingress_Database_64bKey_64bResult_BothQM.txt
»src/applications/LC_Egress/build/PL/LCE_config.txt
  LC_Egress_Database_24bKey_64bResult.txt
»src/applications/IPv4_MR/build/PL/IPv4_config.txt
  GM_Database_144Key_128bResult.txt
IDT Includes
»src/IDT_NSE/data_plane_IXP2XXX/include/Iipc.uc
  Which then includes Iipc.h from the same directory
IDT Simulation Library
»Typical installed location: C:/IDT_NSE/simulation/windows/IDT75K234.dll
»Repository location: src/IDT_NSE/simulation/windows/IDT75K234.dll
»Directions for adding simulation library to a simulation project:
  Simulation menu: select Options
  Simulation Options window: select Foreign Model tab
  Foreign Model DLLs panel: click on New(Insert) icon
  Use locator to go to the Repository location listed above
  Ø Hit return after selecting dll file
  Add Instance information in bottom panel
  Ø Click in Instance Name box and enter: IDT75K234
  Ø Click in Priority box and enter: 1
  Ø Click in Initialization String box and enter: IPv4_config.txt
6
TCAM Documentation
Docs are sprinkled through the different installation directories
»We have gathered most of the important stuff here: /project/techX/DataSheets/IDT
»The following documents are located in the above directory
Datasheet: (Under non-disclosure)
»75K72234_datasheet.pdf
User Manual:
»75K72234_UserManual.pdf
Instruction Latency Application Note:
»75K72234_latency.pdf
SLAM: Simulation
»IDT75K234SLAM_UsersManual.pdf
Dataplane Macros:
»NSEDataPlaneMacroAPIGuide.pdf
IMS API:
»IMS_API.pdf
7
WU Macros
LC Ingress:
»dl_nn_ring_init
»dl_source_1ME_NN_4words
»dl_sink_1ME_NN_4words
IPv4_MR:
»dl_nn_ring_init
»dl_source_1ME_NN_9words
»dl_sink_1ME_NN_4words
LC Egress:
»dl_nn_ring_init
»dl_source_1ME_NN_4words
»dl_sink_1ME_NN_5words
Diagnostics:
»GetTimeStamp
»CompareTimeStamps
8
IDT Macros
IipcStartTimestamp
»Does a CAP read and write to set a bit in MISC_CONTROL to start the timestamp counter.
IipcFormContextFromCsrMeCtx
»Sets up the Context field for the TCAM command word based on the ME and context
»128 Contexts per LA-1 Interface
IipcMakeBase
»Forms the base address word for any instruction for this context
»Address is a 22-bit WORD address, covering a 16 MByte address space
IipcMakeDirectInstruction
»Forms the command word for any of the 4 Direct instructions
»The results of IipcMakeBase and IipcMakeDirectInstruction are passed as the two address parameters to sram[write]:
  sram[write, $w00, iipc_base_word, iipc_command_word, count]
IipcDelayUsingFutureCount(cycles)
»Sets the Future Count register to this many cycles
»Sets the Future Count Signal register
»Ctx_arb on that signal
IipcSramRead
»Performs an SRAM read until the Done bit is set in the result.
»We don't use this any more.
»We do the sram[read] ourselves now and check the done bit. This allows us to more easily perform diagnostics and performance testing.
9
9 - John DeHart - 1/5/2016 Lookup Initialization and Control XScale utility to initialize NSE and Databases Control Plane and XScale mechanisms to read and write TCAM entries while system is active.
10
Lookup Miscellany
Bugs: No known bugs
Testing:
»Minimal testing done so far
»Some simple functional tests to show distribution of packets across all output ports based on Key fields for each of the three projects.
»More complete test plan needed.
Still To Do:
»Add information on how to configure Filters for Lookup engine.
»Handle init_done signal from Rx
»Turn on optimizer
»Substrate only lookup for IPv4_MR GPE NPE pkts
»Add second database to IPv4 MR
  DB1: GM/EM Database
  DB2: Route Lookup
»LD bit in Lookup Result
»Clean up definition of DB Ids.
»Consider making Lookup code one common file with #ifdef's to differentiate
»Consider removing #ifdef DONE_BIT_FIX code
  Refers to a Done bit bug in the Dual Port QDR (which is what we have)
  Ø I have not seen this bug mentioned anywhere else.
  I have not witnessed any such bug and I have not enabled this code
  We'll probably keep this code around, at least until we have done more thorough testing.
»Performance testing with both LC Ingress and LC Egress operating.
»Performance testing with second IPv4 MR Database
Data Structures: None
Performance
»Current analysis is with OPTIMIZER turned off! Turning it on should give immediate gains via branch and ctx_arb deferral slots.
»How does TCAM perform when both LC Ingress and LC Egress are operating?
11
TCAM Entries in Simulation
Four Parts to a TCAM Entry in simulation:
»dbindex
  Slot in database occupied by entry.
  Starts at 0
  Incremented by 1 for each entry
  Not dependent on size
»core
  What is matched against a provided key
»mask
  Indicates which part of the entry (core) has to match the supplied key to give a hit
»data
  Results data
Configuration and Database Entry files
»src/applications/LC_Ingress/build/PL/LCI_config.txt
  LC_Ingress_Database_64bKey_64bResult_BothQM.txt
»src/applications/LC_Egress/build/PL/LCE_config.txt
  LC_Egress_Database_24bKey_64bResult.txt
»src/applications/IPv4_MR/build/PL/IPv4_config.txt
  GM_Database_144Key_128bResult.txt
12
TCAM Entries in Simulation
LC Ingress Database entry from file:
»src/applications/LC_Ingress/build/PL/LC_Ingress_Database_64bKey_64bResult_BothQM.txt
{
  dbindex 0x0;
  core 0x51C0A80002110001;
    # SL Type: 0x5
    # Port: 1
    # IP DA=192.168.0.2
    # IP Proto: 17 (UDP)
    # UDP DPort: 0x0001
  # Exact Match everything, except wildcard Port
  mask 0xf0ffffffffffffff;
  data 0x0001004A01100001;
    # VLAN(16b)=0x0001
    # Stats_Index(16b)=74(0x4A)
    # DA=0x01
    # Port=1
    # QID=1
}
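The core/mask semantics of the LC Ingress entry can be illustrated with a minimal Python sketch. The function names `pack_lc_ingress_key` and `tcam_match` are hypothetical, and the field layout (SL 4b, Port 4b, IP DA 32b, Proto 8b, UDP DPort 16b) is an assumption read off the annotated entry; this is not the simulator's own code.

```python
import ipaddress

def pack_lc_ingress_key(sl, port, ip_da, proto, dport):
    """Pack a 64-bit key; layout assumed from the entry's comments:
    SL(4b) | Port(4b) | IP DA(32b) | Proto(8b) | UDP DPort(16b)."""
    return (sl << 60) | (port << 56) | (ip_da << 24) | (proto << 16) | dport

def tcam_match(key, core, mask):
    """Ternary match: only bit positions set in the mask must agree."""
    return (key & mask) == (core & mask)

CORE = 0x51C0A80002110001
MASK = 0xF0FFFFFFFFFFFFFF  # Port nibble is wildcarded

ip_da = int(ipaddress.IPv4Address("192.168.0.2"))
key_port1 = pack_lc_ingress_key(0x5, 1, ip_da, 17, 0x0001)
key_port3 = pack_lc_ingress_key(0x5, 3, ip_da, 17, 0x0001)
key_other_dport = pack_lc_ingress_key(0x5, 1, ip_da, 17, 0x0002)

print(hex(key_port1))                           # same bits as the core
print(tcam_match(key_port3, CORE, MASK))        # Port is ignored by the mask
print(tcam_match(key_other_dport, CORE, MASK))  # DPort must match exactly
```

A key arriving on a different port still hits because the mask zeroes the Port nibble, while a different UDP destination port misses.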
13
TCAM Entries in Simulation
IPv4 MR Database entry from file:
»src/applications/IPv4_MR/build/PL/GM_Database_144Key_128bResult.txt
{
  dbindex 0x0;
  core 0x0AAA0002C0A84001C0A82002000100020011;
    # MR ID (VLAN) = 0x0AAA
    # UDP DPort=0x0002
    # IP DA=192.168.64.1
    # IP SA=192.168.32.2
    # TCP/UDP SPort=0x0001
    # TCP/UDP DPort=0x0002
    # TCP_FLAGS_Proto=0x0011 (Proto=UDP, no TCP Flags)
  mask 0xffffffffffffffffffffffffffffffffffff;
    # Exact match everything
  data 0x0000003780FC99F95555666601000001;
    # Reserved(3b), Drop Bit(1b)
    # Reserved(12b)
    # Cntr_Index(16b)=55(0x37)
    # Tx IP DAddr=128.252.153.249
    # Tx UDP DPort=0x5555
    # Tx UDP SPort=0x6666
    # DA=0x01
    # Port=0
    # QID=1
}
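As a cross-check of the annotations on this entry, the following Python sketch packs the 144-bit key from its fields and unpacks the 128-bit result. The helpers `pack_fields` and `unpack_fields` are hypothetical, and the field order and widths are assumptions taken from the entry's comments (the DA/Port/QID widths of 8b/4b/20b come from the interface slides).

```python
import ipaddress

def pack_fields(fields):
    """Concatenate (value, width_in_bits) pairs, most significant first."""
    value = 0
    for field_value, width in fields:
        assert 0 <= field_value < (1 << width), "field does not fit its width"
        value = (value << width) | field_value
    return value

def unpack_fields(value, widths):
    """Split an integer into fields of the given widths, most significant first."""
    shift = sum(widths)
    out = []
    for width in widths:
        shift -= width
        out.append((value >> shift) & ((1 << width) - 1))
    return out

key = pack_fields([
    (0x0AAA, 16),                                      # MR ID (VLAN)
    (0x0002, 16),                                      # Rx UDP DPort
    (int(ipaddress.IPv4Address("192.168.64.1")), 32),  # IP DA
    (int(ipaddress.IPv4Address("192.168.32.2")), 32),  # IP SA
    (0x0001, 16),                                      # TCP/UDP SPort
    (0x0002, 16),                                      # TCP/UDP DPort
    (0x0011, 16),                                      # TCP flags / Proto
])

DATA = 0x0000003780FC99F95555666601000001
flags, cntr_index, tx_ip, tx_dport, tx_sport, da, port, qid = unpack_fields(
    DATA, [16, 16, 32, 16, 16, 8, 4, 20])

print(f"{key:036X}")
print(cntr_index, ipaddress.IPv4Address(tx_ip), hex(tx_dport), hex(tx_sport), da, port, qid)
```

Packing most-significant-field first keeps the hex string readable against the entry file, since each field occupies a whole number of hex digits.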
14
TCAM Entries in Simulation
LC Egress Database entry from file:
»src/applications/LC_Egress/build/PL/LC_Egress_Database_24bKey_64bResult.txt
{
  dbindex 0x0;
  core 0x11000100;
    # IP Proto (8b) = 0x11 (UDP)
    # UDP SPort (16b) = 1
    # Rsvd(8b) = 0
  mask 0xffffffff;
    # Exact Match.
  data 0x000101000021;
    # Rsvd(4b) = 0
    # VLAN(12b)=0x001
    # Rsvd(4b)=0
    # Port(4b)=1
    # Rsvd(4b)
    # QID(20b)=33 (0x00021)
}
15
Basics of TCAM Operation
Instruction is given to TCAM as an sram write:
»Address bus gives instruction
  4 Direct Instructions:
  Ø Lookup: This is all we use right now.
  Ø MultiHit Lookup (MHL) or Simultaneous Multi-Database Lookup
    – Which one is determined by a bit in a config register
  Ø Preload
  Ø Indirect: Uses data field to specify subinstruction
»Data bus gives:
  Subinstruction for Indirect instructions (There are 16 subinstructions)
  Data for all instructions
  Ø Our lookup keys go here.
»Example: IPv4 MR Lookup (Key of 144 bits in 5 words):
  Load xfer registers $w00, $w01, $w02, $w03, $w04 with the lookup key
  sram[ write, $w00, iipc_base_word, iipc_command_word, 5 ]
  More about iipc_base_word and iipc_command_word later
  5: number of data words needed for key
Result is read back from Context's Results Mailbox
»This is an SRAM read, not a TCAM Read instruction.
»Example: IPv4_MR Lookup result of 4 words:
  sram[read, $r00, iipc_base_word, 0, 4]
»Result is valid only when the high order bit of the first word in the mailbox is set.
  So, multiple reads may be necessary
  We can predict the latency of the TCAM instruction
  Ø More about this later when we look at the macros used.
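The "read until the high-order bit is set" rule can be sketched in Python as a polling loop. This is a model only, not microengine code: `FakeMailbox`, `poll_for_result`, the read limit, and the sample result words are all hypothetical stand-ins for the sram[read] of the Results Mailbox.

```python
DONE_BIT = 1 << 31  # high-order bit of the first mailbox word

def poll_for_result(read_mailbox, max_reads=8):
    """Re-read the mailbox until the Done bit is set in word 0.

    Returns (words, number_of_reads). max_reads is an arbitrary
    safety limit chosen for this sketch.
    """
    for attempt in range(1, max_reads + 1):
        words = read_mailbox()
        if words[0] & DONE_BIT:
            return words, attempt
    raise TimeoutError("result not valid after %d reads" % max_reads)

class FakeMailbox:
    """Hypothetical mailbox that becomes valid on the Nth read."""
    def __init__(self, result_words, reads_until_done):
        self.result_words = list(result_words)
        self.reads_left = reads_until_done

    def read(self):
        self.reads_left -= 1
        if self.reads_left > 0:
            return [0] * len(self.result_words)  # lookup still in flight
        return [self.result_words[0] | DONE_BIT] + self.result_words[1:]

mailbox = FakeMailbox([0x00000037, 0x80FC99F9, 0x55556666, 0x01000001],
                      reads_until_done=3)
words, attempts = poll_for_result(mailbox.read)
print(attempts, [hex(w) for w in words])
```

Delaying the first read by the predicted instruction latency should keep the number of reads close to one, which is the motivation for the timestamp delay described in the macros.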
16
16 - John DeHart - 1/5/2016 LC Ingress Lookup Main functions: »Perform TCAM Lookup »Pass Through Data: Buf Handle IP Pkt Length and Ethernet Header Length Single code path with possible loop around Result Read NN communication Uses 8 threads Lookup Phy Int Rx Switch Tx QM/Schd Key Extract Hdr Format SWITCHSWITCH
17
17 - John DeHart - 1/5/2016 LC Ingress: Lookup Block Interfaces Lookup Phy Int Rx Switch Tx QM/Schd Key Extract Hdr Format SWITCHSWITCH Lookup Key[63-32] (32b) Buf Handle(32b) IP Pkt Length (16b) Reserved (8b) Lookup Key[ 31-0] (32b) Buf Handle(32b) IP Pkt Length (16b) QID (20b) VLAN (16b)Stats Index (16b) DAddr (8b) Port (4b) Eth Hdr Len (8b) Reserved (8b) Eth Hdr Len (8b) D_Addr[31:8] (24b) D_Addr[7:0] (8b) SL (4b) Port (4b) UDP DPort (16b) Protocol (8b) Lookup Key: QID (20b) VLAN (16b)Stats Index (16b) DAddr (8b) Port (4b) Lookup Result: Rsvd (4b) Rsvd (4b)
18
18 - John DeHart - 1/5/2016 LC Ingress Lookup Block Diagram Load Xfer Regs Send Lookup Request TimeStamp Delay Read Result Reformat Output Wait for prev ctx Signal next ctx NN Enqueue (4W) Wait for prev ctx Signal next ctx NN Dequeue (4W) init signal dl_sink() dl_source() SRAM Write: 2W SRAM Read: 2W mem access Check Done Bit ctx_swap 15 cycles + 2 abort cycles 7 cycles + 2 abort cycles 1 cycles + 2 abort cycles 5 cycles + 0 abort cycles 12 cycles + 8 abort cycles 1 cycles + 2 abort cycles Totals: 41 processing cycles 16 abort cycles
19
19 - John DeHart - 1/5/2016 IPv4 MR Lookup Lookup Rx Tx QM Parse Header Format Substr Decap Main functions: »Perform TCAM Lookup »Pass Through Data: Buf Handle IP Pkt Length and Offset Slice Data Ptr Exception Bits Single code path with possible loop around Result Read NN communication Uses 8 threads
20
20 - John DeHart - 1/5/2016 IPv4 MR Lookup Block Interfaces Lookup DeMuxRx Tx QM Parse Header Format Lookup Key[111-80] DA (32b) Buf Handle(32b) IP Pkt Length (16b)IP Pkt Offset (16b) Lookup Key[ 79-48] SA (32b) Lookup Key[ 47-16] Ports (32b) Lookup Key Proto/TCP_Flags [15- 0] (16b) Exception Bits (12b) Lookup Key[143-112] Slice ID/Rx UDP DPort (32b) L Flags (4b) Port (4b) QID(20b) DA(8b) Tx IP DAddr (32b) Buf Handle(32b) IP Pkt Length (16b)IP Pkt Offset (16b) Cntr Index (16b) R S V d (1b) D (1b) H (1b) Exception Bits (12b) L D (1b) Rx UDP DPort(16b)Slice ID (VLAN) (16b) Tx UDP SPort(16b)Tx UDP DPort (16b) Slice Data Ptr (32b) Reserved (28b) Code (4b) Reserved (28b) Code (4b) IP DAddr (32b) IP SAddr (32b) SPort (16b) Slice ID/Rx UDP DPort (32b) Lookup Key (144b): DPort (16b) Proto/TCP_Flags(16b)
21
21 - John DeHart - 1/5/2016 IPv4 MR Functional Block Results As given to HF Lookup Result (128b): Stored in TCAM Lookup Result (128b): Port (4b) QID(20b) DA(8b) Tx IP DAddr (32b) Cntr Index (16b) D 1b Reserved (11b) Tx UDP SPort(16b)Tx UDP DPort (16b) D O N e 1b H I t 1b M H I t 1b Port (4b) QID(20b) DA(8b) Tx IP DAddr (32b) Cntr Index (16b) D (1b) Exception Bits (12b) Tx UDP SPort(16b)Tx UDP DPort (16b) R S V d (1b) H I t (1b) L D (1b) TCAM Status Bits L D 1b Lookup Key (144b): IP DAddr (32b) IP SAddr (32b) SPort (16b) Slice ID/Rx UDP DPort (32b) DPort (16b) Proto/TCP_Flags(16b)
22
22 - John DeHart - 1/5/2016 IPv4 MR Lookup Block Diagram Load Xfer Regs Send Lookup Request TimeStamp Delay Read Result Reformat Output Wait for prev ctx Signal next ctx NN Enqueue (9W) Wait for prev ctx Signal next ctx NN Dequeue (9W) init signal dl_sink() dl_source() SRAM Write: 5W SRAM Read: 4W mem access Check Done Bit ctx_swap 25 cycles + 2 abort cycles 7 cycles + 2 abort cycles 1 cycles + 2 abort cycles 5 cycles + 0 abort cycles 17 cycles + 8 abort cycles 2 cycles + 2 abort cycles Totals: 57 processing cycles 16 abort cycles
23
23 - John DeHart - 1/5/2016 LC Egress Lookup Main functions: »Perform TCAM Lookup »Pass Through Data: Buf Handle IP Pkt Length and Ethernet Header Length IP Destination Address Single code path with possible loop around Result Read NN communication Uses 8 threads Lookup Key Extract Switch Rx Phy Int Tx QM/Schd Hdr Format SWITCHSWITCH
24
24 - John DeHart - 1/5/2016 LC Egress: Lookup Block Interfaces Lookup Key Extract Switch Rx Phy Int Tx QM/Schd Hdr Format SWITCHSWITCH Buf Handle(32b) IP DAddr (32b) Buf Handle(32b) IP DAddr (32b) Lookup Result [63-32] (32b) Lookup Result [31-0] (32b) IP Pkt Length (16b) Reserved (8b) Eth Hdr Len (8b) IP Pkt Length (16b) Reserved (8b) Eth Hdr Len (8b) Lookup Key – UDP SPort (16b) Lookup Key IP Proto (8b) Reserved (8b) Lookup Key: Lookup Result: UDP SPort (16b) Protocol (8b) Reserved (8b) QID (20b) VLAN (12b)Stats Index (16b) Port (4b) Rsvd (4b) Rsvd (4b) Rsvd (4b)
25
25 - John DeHart - 1/5/2016 LC Egress Lookup Block Diagram Load Xfer Regs Send Lookup Request TimeStamp Delay Read Result Reformat Output Wait for prev ctx Signal next ctx NN Enqueue (5W) Wait for prev ctx Signal next ctx NN Dequeue (4W) init signal dl_sink() dl_source() SRAM Write: 1W SRAM Read: 2W mem access Check Done Bit ctx_swap 14 cycles + 2 abort cycles 7 cycles + 2 abort cycles 1 cycles + 2 abort cycles 5 cycles + 0 abort cycles 13 cycles + 8 abort cycles 3 cycles + 2 abort cycles Totals: 43 processing cycles 16 abort cycles
26
Performance
27
Packet Sizes
Ethernet VLAN Header: 18B
Substrate Header
IPv4 Header: 20B
UDP Header: 8B
Metanet Frame GPE to MPEn
IPv4 Header: 20B
UDP Header: 8B
Payload: n
Ethernet Pad: 0
Ethernet FCS: 4B
Total: 78B + internal + payload
Ethernet IFS: 12B
Total Physical: 90B + internal + payload
28
Cycle Budget (min eth packets)
To hit 5 Gb/s rate:
»76B per min IPv4 packet (64B min Eth + 12B IFS)
»1.4 GHz clock rate
»5 Gb/sec * 1B/8b * packet/76B = 8.22 Mp/sec
»1.4 Gcycle/sec * 1 sec/8.22 Mp = 170.3 cycles per packet
»compute budget: 170 cycles
»latency budget: (threads*170) 8 threads: 1360 cycles
To hit 10 Gb/s rate:
»76B per min IPv4 packet (64B min Eth + 12B IFS)
»1.4 GHz clock rate
»10 Gb/sec * 1B/8b * packet/76B = 16.44 Mp/sec
»1.4 Gcycle/sec * 1 sec/16.44 Mp = 85.16 cycles per packet
»compute budget: 85 cycles
»latency budget: (threads*85) 8 threads: 680 cycles
29
Cycle Budget (IPv4 MN packets)
To hit 5 Gb/s rate:
»90B per min IPv4 packet (78B min IPv4MN + 12B IFS)
»1.4 GHz clock rate
»5 Gb/sec * 1B/8b * packet/90B = 6.94 Mp/sec
»1.4 Gcycle/sec * 1 sec/6.94 Mp = 201.7 cycles per packet
»compute budget: 201 cycles
»latency budget: (threads*201) 8 threads: 1608 cycles
To hit 10 Gb/s rate:
»90B per min IPv4 packet (78B min IPv4MN + 12B IFS)
»1.4 GHz clock rate
»10 Gb/sec * 1B/8b * packet/90B = 13.88 Mp/sec
»1.4 Gcycle/sec * 1 sec/13.88 Mp = 100.86 cycles per packet
»compute budget: 100 cycles
»latency budget: (threads*100) 8 threads: 800 cycles
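The budget arithmetic on the two Cycle Budget slides can be reproduced with a short Python sketch. The function name `cycle_budget` is hypothetical; it uses integer arithmetic and rounds the per-packet budget down, which gives the same whole-cycle budgets as the slides.

```python
CLOCK_HZ = 1_400_000_000  # 1.4 GHz microengine clock
THREADS = 8

def cycle_budget(rate_bps, wire_bytes, clock_hz=CLOCK_HZ, threads=THREADS):
    """Return (compute_budget, latency_budget) in cycles per packet.

    cycles per packet = clock / (rate / (8 * bytes on the wire)),
    rounded down to a whole cycle.
    """
    compute = (clock_hz * 8 * wire_bytes) // rate_bps
    return compute, compute * threads

for label, wire_bytes in (("min eth (76B)", 76), ("IPv4 MN (90B)", 90)):
    for rate_gbps in (5, 10):
        compute, latency = cycle_budget(rate_gbps * 1_000_000_000, wire_bytes)
        print(f"{label} @ {rate_gbps} Gb/s: compute {compute}, latency {latency}")
```

The latency budget is simply the compute budget multiplied by the thread count, because each of the 8 threads can have one packet in flight while the others run.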
30
TCAM Instruction Latency Analysis
QDR Clock: 200 MHz, 5 ns period
TCAM core Clock: 200 MHz, 5 ns period
NPU Clock: 1400 MHz, 0.714 ns period
»1 QDR cycle == 1 TCAM cycle == 7 NPU cycles
TCAM Lookup Latencies:
»QDR xfer: 1 cycle per word in key
»Instruction Fifo: constant 2 cycles
»Synchronizer: constant 3 cycles
»Execution Latency: function of (key width, output data width)
  Table in IDT Latency Application Note
»Re-Synchronizer: constant 1 cycle
31
TCAM Instruction Latency Analysis
IPv4 MR
»Key: 144 bit (5 words)
»Output data: 128 bit
»QDR Xfer: 5 cycles
»Constants: 2 + 3 + 1 = 6 cycles
»Execution Latency: 36 cycles
»Total Latency: 47 TCAM cycles (235 ns) (329 NPU cycles)
LC Ingress
»Key: 64 bit (2 words)
»Output data: 64 bit
»QDR Xfer: 2 cycles
»Constants: 2 + 3 + 1 = 6 cycles
»Execution Latency: 32 cycles
»Total Latency: 40 TCAM cycles (200 ns) (280 NPU cycles)
LC Egress
»Key: 24 bit (1 word)
»Output data: 64 bit
»QDR Xfer: 1 cycle
»Constants: 2 + 3 + 1 = 6 cycles
»Execution Latency: 34 cycles
»Total Latency: 41 TCAM cycles (205 ns) (287 NPU cycles)
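The latency totals follow from summing the pipeline stages and converting between clock domains. A minimal Python sketch, with the hypothetical helper `tcam_latency`, shows the calculation for the IPv4 MR and LC Ingress cases; the execution latencies are the slide values taken from the IDT latency table, not computed here.

```python
TCAM_PERIOD_NS = 5        # 200 MHz QDR / TCAM core clock
NPU_CYCLES_PER_TCAM = 7   # 1400 MHz NPU clock / 200 MHz TCAM clock

def tcam_latency(key_words, exec_cycles, fifo=2, sync=3, resync=1):
    """Return (tcam_cycles, nanoseconds, npu_cycles) for one lookup.

    QDR transfer costs one TCAM cycle per key word; the FIFO,
    synchronizer and re-synchronizer are constant stages.
    """
    tcam_cycles = key_words + fifo + sync + resync + exec_cycles
    return (tcam_cycles,
            tcam_cycles * TCAM_PERIOD_NS,
            tcam_cycles * NPU_CYCLES_PER_TCAM)

ipv4_mr = tcam_latency(key_words=5, exec_cycles=36)
lc_ingress = tcam_latency(key_words=2, exec_cycles=32)
print("IPv4 MR:   ", ipv4_mr)
print("LC Ingress:", lc_ingress)
```

The NPU-cycle figure is the one that matters for sizing the timestamp delay before the first result read, since that delay is counted in microengine cycles.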
32
32 - John DeHart - 1/5/2016 TCAM Performance (Rates in M/sec) LC_Egress LC_Ingress IPv4 MR
33
TCAM Performance (Rates in M/sec)
Columns: Lookup Size, #LA-1 Words, Core Size, Assoc. Data, Single LA-1 Max Rate, Max Core Rate, Avg Shared Rate (Each of 2 LA-1s)
321363250 25 645025 1282512.5 362 3250 25 645025 1282512.5 6427232100 50 645025 1282512.5 723 326710050 645025 1282512.5 1284144325010050 645025 1282512.5 1445 324010040 645025 1282512.5 160528832405040 645025 1282512.5
LC_Ingress LC_Egress IPv4 MR
34
34 - John DeHart - 1/5/2016 IPv4: Performance Snapshot ~610 Cycles dl_source & Xfer reg loads sram write Timestamp Delay setup Timestamp Delay sram read dl_sink ctx_arb IPv4 MR lookup »Unloaded dl_sink processing Ctx_arb vs br_signal optimization
35
35 - John DeHart - 1/5/2016 IPv4: Performance Snapshot IPv4 MR lookup »Hack to Parse: loop and repeatedly call dl_sink with same buf_handle Should guarantee that there is always something in NN ring for lookup to pick up »Hack to HF : set dlNextBlock to IX_DROP Keep Tx from trying to transmit something bad. 34016– 33333= 683 Cycles Write issued At 33333 Write issued At 34016
36
36 - John DeHart - 1/5/2016 LC_Ingress: Performance Snapshots >=563 Cycles LC Ingress lookup »unloaded
37
37 - John DeHart - 1/5/2016 LC_Ingress: Performance Snapshots 60494 – 59888 = 606 Cycles LC Ingress lookup »Hack to KE stub: loop and repeatedly call dl_sink with same buf_handle Should guarantee that there is always something in NN ring for lookup to pick up »Hack to HF stub: set dl_next_block to IX_DROP Keep Tx from trying to transmit something bad. Write issued At 59888 Write issued At 60494
38
38 - John DeHart - 1/5/2016 LC_Egress: Performance Snapshots ~560 Cycles LC Egress lookup »Unloaded
39
39 - John DeHart - 1/5/2016 LC_Egress: Performance Snapshots LC Egress lookup »Loaded with KE and HF hacks. ~610 Cycles
40
Performance Summary
Processing Cycles:
»LC Ingress: 41
»IPv4 MR: 57
»LC Egress: 43
Abort Cycles:
»LC Ingress: 16
»IPv4 MR: 16
»LC Egress: 16
Latency Cycles:
»LC Ingress: 560 – 57 = 503?
»IPv4 MR: 610 – 73 = 537?
»LC Egress: 560 – 59 = 501?
Expected performance
»LC Ingress: 10Gb/s
»IPv4 MR: 5Gb/s +
»LC Egress: 10Gb/s
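The latency estimates in the summary appear to be the observed per-packet cycle counts from the performance snapshots minus the processing and abort cycles. The Python sketch below reproduces that subtraction; the dictionary layout and the name `latency_cycles` are hypothetical, and the observed totals are the approximate snapshot figures (560, 610, 560).

```python
# (processing cycles, abort cycles, approx. observed cycles per lookup)
BLOCKS = {
    "LC Ingress": (41, 16, 560),
    "IPv4 MR":    (57, 16, 610),
    "LC Egress":  (43, 16, 560),
}

def latency_cycles(processing, abort, observed):
    """Cycles not accounted for by instruction execution or aborts."""
    return observed - (processing + abort)

summary = {name: latency_cycles(*values) for name, values in BLOCKS.items()}
for name, cycles in summary.items():
    print(f"{name}: {cycles} latency cycles")
```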
41
Optimization Possibilities
There may still be some code we can move out of the processing loop, or at least between the sram write or read and the ctx swap.
dl_sink has a possible improvement.
»ctx_arb vs. br_signal/br_!signal