Author: Xianghui Hu, Xinan Tang, Bei Hua Lecturer: Bo Xu


High-performance IPv6 Forwarding Algorithm for Multi-core and Multithreaded Network Processor
Author: Xianghui Hu, Xinan Tang, Bei Hua
Lecturer: Bo Xu

Contributions
TrieC: a scalable IPv6 forwarding algorithm
Sustains 10 Gbps with routing tables of up to 400,000 entries
Investigation of the main software design issues on the NPU
Memory space reduction, instruction selection, data allocation, task partitioning, latency hiding, thread synchronization

Background
Traditionally, core routers rely on ASIC/FPGA to perform IP forwarding at line rate
NPUs are expected to retain ASIC-class performance while gaining a time-to-market advantage from their programmable architecture
Vendors include Intel, Freescale, AMCC and Agere
To meet OC-192 (10 Gbps), the IXP2800 has only 57 clock cycles to process a minimal IPv4 packet

Methods
Typical parallel-programming issues, such as data allocation and task partitioning, to reduce memory latencies;
micro-architectural factors that impact performance, such as instruction scheduling and selection;
thread-level parallelism for hiding memory latencies;
thread synchronization for coordinating potential parallelism among packets.

Methods
Moving data from DRAM to SRAM
Fast bit-manipulation instructions
Non-blocking memory access
Hardware-supported multithreading

Related Work
Binary trie -> multi-bit trie: fewer memory accesses, but larger worst-case memory
Hardware: Basic-24-8-DIR
At most two memory accesses per lookup
More than 32 Mbytes of memory

Related Work
Waldvogel et al.: binary search on hash tables
Evaluated on memory accesses, update time and memory requirement

Related Work
Lulea scheme & Tree Bitmap
Use multi-bit tries and bitmaps to compress redundant storage in trie nodes
Minimize memory space at the cost of extra memory accesses
TrieC
Similar to the above two algorithms
Uses a built-in bit-manipulation instruction to calculate the number of bits set

TrieC
Prefix expansion
The IXP2800 instruction POP_COUNT calculates the number of bits set in a 32-bit register in three clock cycles
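Outside the IXP2800, the same count can be computed in software. A minimal portable sketch of what POP_COUNT returns (the `pop_count` name is illustrative, not the hardware mnemonic's C binding):

```c
#include <stdint.h>

/* Portable equivalent of the IXP2800 POP_COUNT instruction: counts the
 * bits set in a 32-bit word.  The hardware does this in 3 cycles; this
 * loop runs once per set bit (Kernighan's method). */
static unsigned pop_count(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```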

TrieC
Modified Compact Prefix Expansion (MCPE)
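The slides do not reproduce the MCPE details. As background, a minimal sketch of plain controlled prefix expansion, which MCPE refines: a prefix shorter than the stride is replicated into every table slot it covers (the function name and the 8-bit stride are illustrative):

```c
#include <stdint.h>

#define STRIDE 8  /* expand all prefixes to an 8-bit boundary (illustrative) */

/* Expand a prefix of `len` bits (len <= STRIDE) into a 2^STRIDE next-hop
 * table: the prefix owns the 2^(STRIDE - len) consecutive entries whose
 * top `len` bits match it. */
static void expand_prefix(uint8_t table[1u << STRIDE],
                          uint32_t prefix, unsigned len, uint8_t nexthop)
{
    uint32_t base  = prefix << (STRIDE - len);
    uint32_t count = 1u << (STRIDE - len);
    for (uint32_t i = 0; i < count; i++)
        table[base + i] = nexthop;
}
```

After expansion, a lookup is a single table index with the packet's next STRIDE address bits.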

TrieC – Data Structures
Stride series: 24-8-8-8-16
Node types: TrieC15/6, TrieC4/4, Hash16

TrieC – Data Structures
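The bitmap-plus-popcount indexing that Lulea, Tree Bitmap and TrieC share can be sketched as follows. The `node32` layout and field names are illustrative, not the paper's exact TrieC15/6 or TrieC4/4 formats:

```c
#include <stdint.h>

/* Sketch of a bitmap-compressed 32-slot trie node: bit i of `bitmap` is
 * set iff slot i starts a new (distinct) result, and `results` stores
 * only those distinct next-hop indices, in slot order. */
struct node32 {
    uint32_t bitmap;
    uint8_t  results[32];
};

static unsigned bits_set(uint32_t x)   /* POP_COUNT stand-in */
{
    unsigned n = 0;
    for (; x; x &= x - 1) n++;
    return n;
}

/* Look up slot `idx` (0..31): the number of set bits at positions <= idx
 * is the 1-based index into the compressed results array. */
static uint8_t node_lookup(const struct node32 *n, unsigned idx)
{
    uint32_t mask = (idx == 31) ? 0xFFFFFFFFu : ((1u << (idx + 1)) - 1);
    return n->results[bits_set(n->bitmap & mask) - 1];
}
```

This is why a fast bits-set instruction matters: every lookup step performs one such count instead of storing 32 full entries per node.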

Memory Space Compression
DRAM size: 8x that of SRAM
DRAM latency: 2x that of SRAM
MCPE reduces the memory requirement so that tables can fit in SRAM

Memory Space Compression

Memory Space Compression

Instruction Selection
Counting the bits set in a 32-bit register:
~100 RISC/CISC instructions (ADD, SHIFT, AND, BRANCH)
POP_COUNT (3-cycle IXP2800 instruction) saves 97% of the cycles
BR_BSET saves 2 cycles for each check-and-branch operation
Hash16 table: the CRC unit on each ME computes a 32-bit CRC in 5 clock cycles and runs in parallel on each ME
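As an illustration of what the ME's CRC unit computes in hardware, here is a bitwise software CRC-32 using the IEEE 802.3 reflected polynomial; whether Hash16 uses exactly this polynomial configuration is not stated in the slides:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 (IEEE 802.3, reflected, poly 0xEDB88320,
 * init 0xFFFFFFFF, final XOR).  A hardware CRC unit produces the same
 * result without burning ALU cycles. */
static uint32_t crc32_bitwise(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int)(crc & 1u));
    }
    return ~crc;
}
```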

Instruction Selection
NPUs without POP_COUNT normally support another bit-manipulation instruction, FFS, which finds the first bit set in a 32-bit register in one clock cycle
TotalEntropy and PositionEntropy can then be computed with FFS by looping through the 32 bits
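That FFS-based loop can be sketched in portable C as follows (the `ffs32` helper stands in for the one-cycle hardware instruction):

```c
#include <stdint.h>

/* Find-first-set stand-in: index (0-based) of the lowest set bit,
 * or -1 if x == 0.  Hardware FFS does this in one cycle. */
static int ffs32(uint32_t x)
{
    if (x == 0) return -1;
    int i = 0;
    while (!(x & 1u)) { x >>= 1; i++; }
    return i;
}

/* Count bits set by repeatedly finding and clearing the first set bit,
 * as one would on an NPU that has FFS but no POP_COUNT. */
static unsigned pop_count_via_ffs(uint32_t x)
{
    unsigned n = 0;
    int i;
    while ((i = ffs32(x)) >= 0) {
        x &= ~(1u << (unsigned)i);   /* clear the bit just found */
        n++;
    }
    return n;
}
```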

Data Allocation
Four configurations were evaluated:
1. All tables stored on one SRAM controller;
2. Tables properly distributed across four SRAM controllers;
3. Tables distributed across SRAM and DRAM in a hybrid manner;
4. All tables on DRAM, with data structures redesigned to facilitate burst access.

Data Allocation
DRAM-128 and DRAM-256 mean that the bit width of the bit vector is 128 and 256 bits, respectively
The Hybrid-1 configuration stores the first level of the TrieC tree and all ExtraNHIA nodes in SRAM
The Hybrid-2 configuration stores all ExtraNHIA tables in SRAM

Task Partitioning
Multi-processing
A maximum of 8 threads per ME; a task can use multiple MEs if needed
Threads allocated on the same ME must compete for shared resources
Context pipelining
Allows a context to access more ME resources
Cost: communication between neighboring MEs

Task Partitioning

Latency Hiding
Hide memory-access latency by overlapping a memory access with the calculation of the bit-vector index in the same thread

Latency Hiding
The MicroengineC compiler provides only one switch to turn latency-hiding optimizations on or off

Packet Ordering
Even though no more than 8 MEs are used, enforcing packet order costs over 13% of performance

Relative Speedups
When the minimal IPv6 packet is 69 bytes (9-byte PPP header + 40-byte IPv6 header + 20-byte TCP header), a forwarding rate of 18.16 Mpps is required to achieve the OC-192 line rate

PROGRAMMING GUIDANCE ON NPU
Compress data structures and store them in SRAM whenever possible to reduce memory-access latency
Multi-processing is preferred over context pipelining for parallelizing network applications, because it is insensitive to workload balance. Unless the workload can be statically determined, use a combination of both to distribute load fairly among the processing stages

PROGRAMMING GUIDANCE ON NPU
In general, the NPU has many shared resources, such as command and data buses. Pay attention to how you use them, because they can become a bottleneck in the algorithm implementation
The NPU supports powerful bit-manipulation instructions. Select the instructions your application needs, even without compiler support

PROGRAMMING GUIDANCE ON NPU
Use compiler optimizations to schedule ALU instructions into the delay slots, hiding latency whenever possible
The cost of enforcing packet order cannot be neglected in practice. A signal ring can be used when the ring is small

The End! Q&A