Lec 11 – Multicore Architectures and Network Processors

Multi-Core Architecture (Intel Clovertown machine – 2 quad-core Xeon processors)
- Increased resource density
- Major challenges: parallelism and scalability in legacy programs
- Current solutions: software – naïve multithreading; hardware – Receive Side Scaling (RSS) in the NIC, with no multithreading

Nehalem Architecture – 2-Way SMT

Naïve Thread Scheduling
Major reasons for poor scalability:
- Round-robin workload distribution among threads wastes cache locality
- The O(1) scheduler in the OS adds load-balancing overhead
Potential solutions: improving cache locality; smarter load balancing

Optimized Multithreading
A connection-affinity-based multithreading scheme that load-balances the packets.
System architecture: packets in the same connection go to the same thread, and each thread is affinitized to a designated core (see the sketch below).
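As a rough illustration of this connection-affinity scheme, the C sketch below (illustrative names only, not the paper's implementation) hashes a packet's connection 4-tuple to choose a worker thread, so every packet of a connection lands on the same thread, and pins each worker to a designated core with pthread_setaffinity_np():

```c
/* Illustrative sketch (not the paper's code): connection-affinity dispatch
 * with per-core worker threads pinned via pthread_setaffinity_np(). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

#define NUM_WORKERS 8

struct pkt { uint32_t saddr, daddr; uint16_t sport, dport; /* payload ... */ };

/* Hash the connection 4-tuple so every packet of a flow maps to one worker. */
static unsigned worker_for(const struct pkt *p)
{
    uint32_t h = p->saddr ^ p->daddr ^ ((uint32_t)p->sport << 16 | p->dport);
    h ^= h >> 16; h *= 0x45d9f3bU; h ^= h >> 16;
    return h % NUM_WORKERS;
}

/* Pin the calling worker thread to a designated core. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

A dispatcher would then enqueue each packet on the queue of worker_for(p), and each worker thread calls pin_to_core() with its assigned core at startup.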

Experimental Results for L7-filter, a DPI Program (ANCS 2008)
Throughput and utilization in the native kernel.
Summary: throughput is improved by 51%, and performance scales close to linearly as more cores are used.

Sun Niagara 2
Characteristics of the Sun T5120 (Niagara 2): 2 pipelines per core, 4 hardware threads per pipeline.
(Figure: the bars show the average utilization (%) at each level in the core; the lines represent the range of peak-high and peak-low values (%).)

Hierarchical HRW (ANCS 2009)
Tested with L7-filter on the Sun Niagara 2 architecture.
What is HRW? Highest Random Weight (IEEE/ACM Transactions on Networking, 1998): a hash function that maps a connection to a core while load-balancing among homogeneous cores.
Proposed idea: apply HRW at each level – pick a core first, then a pipeline of the selected core, then a thread of the selected pipeline.
Extensions: packet-level load balancing, heterogeneous cores, and cache/thread locality.
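As a concrete illustration of HRW selection, the sketch below computes, for each candidate, a hash of the connection key combined with the candidate's id and keeps the candidate with the highest weight; the hierarchical variant repeats the choice at the core, pipeline, and thread levels. The mixing function and the salts are assumptions for illustration, not the hash used in the paper.

```c
/* Minimal sketch of Highest Random Weight (rendezvous) hashing: for each
 * candidate, combine the connection key with the candidate id and pick the
 * candidate with the largest hash value. Hash function and hierarchical
 * second stage are illustrative assumptions, not the ANCS 2009 code. */
#include <stdint.h>

static uint32_t mix32(uint32_t x)            /* simple integer mixer (assumed) */
{
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/* Pick one of n_cores for a connection identified by conn_key. */
int hrw_pick(uint32_t conn_key, int n_cores)
{
    int best = 0;
    uint32_t best_w = 0;
    for (int c = 0; c < n_cores; c++) {
        uint32_t w = mix32(conn_key ^ mix32((uint32_t)c));
        if (w > best_w) { best_w = w; best = c; }
    }
    return best;
}

/* Hierarchical use: pick a core, then a pipeline within it, then a thread. */
int hrw_pick_thread(uint32_t conn_key, int n_cores, int pipes_per_core,
                    int threads_per_pipe)
{
    int core = hrw_pick(conn_key, n_cores);
    int pipe = hrw_pick(conn_key ^ 0x9e3779b9U ^ (uint32_t)core, pipes_per_core);
    int thr  = hrw_pick(conn_key ^ 0x85ebca6bU ^ (uint32_t)pipe, threads_per_pipe);
    return (core * pipes_per_core + pipe) * threads_per_pipe + thr;
}
```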

Results (ANCS 2009): Throughput and Core Utilization
59.2% improvement in system throughput.
Fig. 4 illustrates the throughput and CPU utilization obtained with the different optimizations. "conn+aff." is the basic connection-locality plus thread-affinity optimization proposed in [7]; we adopt it as our baseline setup. Our adaptive multilayer hash scheduler ("3-HRW+Adp") increases system throughput by 130% (from 0.87 Gbps to 1.99 Gbps). One might reasonably question the fairness of this comparison, because the testbed in that paper was an Intel dual quad-core Clovertown machine, whereas we use a 64-thread, 8-core Sun Niagara 2 machine. We therefore applied a simple optimization ("conn+os") to the connection-locality technique: using the default software scheduler on Solaris, which balances load better than the thread-affinity setup and raises throughput by 43% (from 0.87 Gbps to 1.25 Gbps). To keep the comparison fair, we use "conn+os" as the default case against which our optimizations are measured. We also observe that HRW alone increases throughput by only 3.2%, while multilayer HRW reaches 1.54 Gbps, an additional improvement of 20%. With the adaptive multilayer scheduler, overall system throughput improves by 59.2% over "conn+os". CPU utilization grows as throughput increases, because better load balance reduces CPU idle time, so more CPU time is spent matching connection buffers. If the per-core workload is unevenly distributed, some cores may sit idle after finishing the work in their run queues while the more heavily loaded cores keep running, blocking work deeper in their run queues. The next subsection presents the workload distribution at the core, pipeline, and thread levels.

Power-Aware Scheduling of Network Applications (INFOCOM 2010)
- Use the SUIF compiler to generate basic blocks and a program dependency graph (PDG).
- Partition and map the application onto an AMD machine with two quad-core Opteron 2350 processors.
- Apply a power-aware scheduling algorithm to reduce power consumption.
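The toy sketch below illustrates only the general flavor of such a mapping step (it is not the paper's algorithm): greedily assign program blocks, such as those extracted from the PDG, to the least-loaded core, then flag lightly loaded cores as candidates for a lower power/frequency state. All block costs are made up.

```c
/* Toy sketch, not the INFOCOM 2010 algorithm: greedily map program blocks
 * to cores by load, then mark lightly loaded cores as candidates for a
 * lower power/frequency state. All numbers are illustrative. */
#include <stdio.h>

#define NUM_CORES 8

int main(void)
{
    double block_cost[] = { 120, 80, 300, 45, 200, 150, 60, 90, 210, 75 }; /* estimated cycles */
    int    n_blocks = sizeof block_cost / sizeof block_cost[0];
    double core_load[NUM_CORES] = { 0 };

    for (int b = 0; b < n_blocks; b++) {          /* least-loaded-core-first mapping */
        int best = 0;
        for (int c = 1; c < NUM_CORES; c++)
            if (core_load[c] < core_load[best]) best = c;
        core_load[best] += block_cost[b];
    }

    double peak = 0;
    for (int c = 0; c < NUM_CORES; c++)
        if (core_load[c] > peak) peak = core_load[c];

    for (int c = 0; c < NUM_CORES; c++)           /* cores well below peak could be slowed */
        printf("core %d: load %.0f%s\n", c, core_load[c],
               core_load[c] < 0.5 * peak ? "  -> candidate for lower frequency" : "");
    return 0;
}
```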

Network Processor & Its Applications

What Does the Internet Need?
- A huge and increasing volume of packets, plus per-packet work: routing, packet classification, encryption, QoS, new applications and protocols, etc.
- Requirements: high processing power, wire-speed support, programmability, scalability, specialization for network applications
- ASICs are large, expensive to develop, and inflexible; general-purpose RISC processors are not capable enough

Typical NP Architecture
(Diagram: multithreaded processing elements and a co-processor on the network-processor bus, with SDRAM as the packet buffer, SRAM for the routing table, and input/output ports.)

IXP1200 Block Diagram
- StrongARM processing core
- Microengines introduce a new ISA
- I/O: PCI, SDRAM, SRAM
- IX bus: PCI-like packet bus
- On-chip FIFOs: 16 entries, 64 B each

IXP1200 Microengine
- 4 hardware contexts; single-issue processor; explicit, optional context switch on SRAM access
- Registers: all single-ported; separate GPRs; 256 * 6 = 1536 registers total
- 32-bit ALU: can access GPR or XFER registers; barrel shifter
- Shared hash unit (1/2/3 values, 48 b/64 b) for IP-routing hashing
- Standard 5-stage pipeline
- 4 KB SRAM instruction store – not a cache!
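The "explicit optional context switch on SRAM access" is the microengine's latency-hiding mechanism: a thread issues a memory reference, voluntarily yields its hardware context, and resumes when the data arrives. The pseudo-C below only sketches that pattern; sram_read_async, ctx_swap, and the stubs are hypothetical names for illustration, not actual Microengine C.

```c
/* Pseudo-C sketch of microengine-style latency hiding (hypothetical API,
 * not actual Microengine C): issue the SRAM read, explicitly swap out the
 * hardware context, and resume when the completion signal fires. */
typedef struct { volatile int done; } signal_t;

/* Stubs so the sketch compiles; real hardware performs the asynchronous
 * read and runs another hardware context during the swap. */
static void sram_read_async(unsigned *dst, unsigned addr, int words, signal_t *sig)
{
    (void)addr;
    for (int i = 0; i < words; i++) dst[i] = 0;   /* pretend the data arrived */
    sig->done = 1;
}
static void ctx_swap(signal_t *sig) { while (!sig->done) { /* yield until signal */ } }

unsigned lookup_next_hop(unsigned table_addr, unsigned hash_index)
{
    unsigned entry[2];
    signal_t sig = { 0 };

    sram_read_async(entry, table_addr + hash_index * 8, 2, &sig);
    ctx_swap(&sig);        /* another hardware context runs while the read completes */

    return entry[1];       /* next-hop field of the routing-table entry */
}
```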

IXP2400 Block Diagram
- XScale core replaces the StrongARM
- Microengines: faster, and more of them – 2 clusters of 4 microengines each
- Local memory; next-neighbor routes added between microengines
- Hardware to accelerate CRC operations and random-number generation
- 16-entry CAM
(Diagram: microengines ME0–ME7, XScale core, scratch/hash/CSR unit, MSF unit, DDR DRAM controller, QDR SRAM, PCI.)

IXP2400 (detailed block diagram)
(Diagram: 8 MEv2 microengines; Intel XScale core with 32 KB I-cache and 32 KB D-cache; DDR DRAM controller; 2 QDR SRAM channels; 32-bit SPI-3 or CSIX media interface with Rbuf/Tbuf of 64 entries at 128 B; 64-bit, 66 MHz PCI; 16 KB scratch memory; 48/64/128-bit hash unit; CSRs – Fast_wr, UART, timers, GPIO, BootROM/slow port.)

IXP2800 (detailed block diagram)
(Diagram: 16 MEv2 microengines; Intel XScale core with 32 KB I-cache and 32 KB D-cache; 3 striped RDRAM channels; 4 QDR SRAM channels; 16-bit SPI-4 or CSIX media interface with Rbuf/Tbuf of 64 entries at 128 B; 64-bit, 66 MHz PCI; 16 KB scratch memory; 48/64/128-bit hash unit; CSRs – Fast_wr, UART, timers, GPIO, BootROM/slow port.)

MicroEngine v2 (internal block diagram)
(Diagram: control store of 4K/8K instructions; 640-word local memory with two LM address registers per context; 4 banks of 128 GPRs; 128 next-neighbor registers; 128 D and 128 S transfer-in registers; 128 D and 128 S transfer-out registers; 32-bit execution data path with add/shift/logical, multiply, CRC unit, find-first-bit, and pseudo-random number generation; 16-entry CAM with tags, lock bits, and status/LRU logic; local CSRs, timers, and timestamp; D/S push and pull buses.)

Example Toaster System: Cisco 10000
- Almost all data-plane operations execute on the programmable XMC (Express Micro Controller), a 2-way VLIW RISC engine
- Pipeline stages are assigned tasks, e.g. classification, routing, firewall, MPLS – a classic software load-balancing problem
- External SDRAM is shared by common pipe stages

IBM PowerNP
- 16 pico-processors and 1 PowerPC
- Each pico-processor supports 2 hardware threads and has a 3-stage pipeline: fetch/decode/execute
- Dyadic processing unit: two pico-processors, 2 KB shared memory, tree-search engine
- Focus is on layers 2–4
- PowerPC 405 for control-plane operations, with 16 KB I- and D-caches
- Target is OC-48

C-Port C-5 Chip Architecture

Design of a Web Switch (IEEE Micro 2006)
Application-level processing: the switch must look above TCP/IP at application data, e.g. "GET /cgi-bin/form HTTP/1.1, Host: www.site.com…".
Problems with application-level processing:
- Vertical processing in the network – a change in paradigm!
- Overhead to copy data between the two connections: data goes up through the O/S interrupt and protocol stack and is copied from kernel space to user space and back (see the relay sketch below)
- Oh, the PCI bottleneck! It takes a 1 GHz CPU to handle a 1 Gb/s network; we are now at 10 Gb/s, but don't have a 10 GHz CPU!
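The copy overhead described above comes from relaying bytes between the two TCP connections through user space. A minimal sketch of that naive proxy loop is below (error handling trimmed): every byte crosses the kernel/user boundary twice, once on recv() and once on send(), which is exactly what a TCP splicer on a network processor avoids by forwarding packets within the data path.

```c
/* Naive user-space relay between the client and server connections:
 * each chunk is copied kernel->user on recv() and user->kernel on send(),
 * illustrating the overhead that TCP splicing eliminates. */
#include <sys/socket.h>
#include <unistd.h>

static void relay(int client_fd, int server_fd)
{
    char buf[16 * 1024];
    ssize_t n;

    while ((n = recv(client_fd, buf, sizeof buf, 0)) > 0) {
        ssize_t off = 0;
        while (off < n) {                       /* handle short writes */
            ssize_t m = send(server_fd, buf + off, n - off, 0);
            if (m <= 0) return;
            off += m;
        }
    }
}
```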

Partition the Workload

Latency

Throughput