Multicore Architectures: Managing Wire Delay in Large CMP Caches. Bradford M. Beckmann and David A. Wood, Multifacet Project, University of Wisconsin-Madison.

Presentation transcript:

Multicore Architectures

Managing Wire Delay in Large CMP Caches. Bradford M. Beckmann and David A. Wood, Multifacet Project, University of Wisconsin-Madison. MICRO-37, 12/8/04

Static NUCA

Dynamic NUCA

Current CMP: IBM Power 5. Diagram: 2 CPUs (CPU 0 and CPU 1, each with L1 I$ and L1 D$) sharing 3 L2 cache banks.

Baseline: CMP-SNUCA. Diagram: 8 CPUs (CPU 0 through CPU 7, each with L1 I$ and L1 D$) arranged around the edges of a shared, statically mapped NUCA L2 bank array.

Block Migration: CMP-DNUCA. Diagram: the same 8-CPU layout; blocks A and B migrate through the L2 bank array toward the CPUs that access them.
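A minimal sketch of the block-migration idea, under assumptions: the three-level bankcluster chain and the one-step-per-hit promotion rule are illustrative, not the paper's exact policy.

```python
# Illustrative sketch of gradual D-NUCA block migration (assumed policy):
# on each L2 hit, a block moves one bankcluster closer to the requesting CPU.
BANKCLUSTERS = ["center", "intermediate", "local"]  # far -> near

class L2Block:
    def __init__(self, tag):
        self.tag = tag
        self.level = 0      # start in the center bankcluster
        self.owner = None   # CPU whose local bankcluster we migrate toward

def on_l2_hit(block, cpu):
    """Promote the block one bankcluster toward the requesting CPU."""
    if block.owner != cpu:
        block.owner = cpu                      # new sharer: retarget migration
        block.level = max(block.level - 1, 0)  # and back off one step
    elif block.level < len(BANKCLUSTERS) - 1:
        block.level += 1
    return BANKCLUSTERS[block.level]

blk = L2Block(tag=0xBEEF)
print(on_l2_hit(blk, cpu=0))  # center (CPU 0 becomes the target)
print(on_l2_hit(blk, cpu=0))  # intermediate
print(on_l2_hit(blk, cpu=0))  # local
```

Repeated hits by one CPU pull a block into that CPU's local bankcluster; a hit by a different CPU retargets the migration, which is why shared blocks tend to settle in the middle of the array.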

On-chip Transmission Lines
– Similar to contemporary off-chip communication
– Provide a different latency/bandwidth tradeoff
– Wires behave more "transmission-line"-like as frequency increases
– Utilize transmission-line qualities to our advantage: no repeaters, route directly over large structures, ~10x lower latency across long distances
– Limitations: require thick wires and dielectric spacing; increase manufacturing cost

Transmission Lines: CMP-TLC. Diagram: 8 CPUs (each with L1 I$ and L1 D$) connected by direct transmission-line links to a centralized L2.

Combination: CMP-Hybrid. Diagram: the CMP-SNUCA banked layout augmented with transmission-line links between the CPUs and the central L2 banks.

CMP-DNUCA: Organization. Diagram: L2 banks grouped into bankclusters: Local (adjacent to each CPU), Inter. (intermediate), and Center.

Hit Distribution: Grayscale Shading. Diagram: darker shading marks the banks with a greater % of L2 hits.

L2 Hit Latency. Bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid.

Overall Performance. Transmission lines improve both L2 hit latency and L2 miss latency.

I/O Acceleration in Server Architectures. Laxmi N. Bhuyan, University of California, Riverside.

Acknowledgement. Many slides in this presentation have been taken from (or modified from) Li Zhao's Ph.D. dissertation at UCR and Ravi Iyer's (Intel) presentation at UCR.

Enterprise Workloads: Key Characteristics
– Throughput-oriented: lots of transactions, operations, etc. in flight; many VMs, processes, threads, fibers, etc.; scalability and adaptability are key
– Rich (I/O) content: TCP, SoIP, SSL, XML; high throughput requirements; efficiency and utilization are key

Server Network I/O Acceleration Bottlenecks

Rate of Technology Improvement

Rich I/O Content – How does a server communicate with I/O Devices?

Communicating with the Server: The O/S Wall. Diagram: packet path from the NIC across the PCI bus, up through the kernel to the user-level application on the CPU.
Problems:
– O/S overhead to move a packet between the network and application level => protocol stack (TCP/IP)
– O/S interrupt handling
– Data copying from kernel space to user space and vice versa
– Oh, the PCI bottleneck!
Our aim: design server (CPU) architectures to overcome these problems!

TCP Offloading with Cavium Octeon, a multi-core MIPS64 processor.

Application Oriented Networking (AON). Diagram: an HTTP request (GET /cgi-bin/form HTTP/1.1, Host:) arriving from the Internet at a switch, with the application data encapsulated in TCP and IP.
The same problems arise with programmable routers: requests go through the network, IP, and TCP layers before application-level processing.
Solution: bring processing down to the network level => TCP offload (not a topic for discussion here)!
Ref: L. Bhuyan, "A Network Processor Based, Content Aware Switch", IEEE Micro, May/June 2006 (with L. Zhao, et al.).

Timing Measurement in a UDP Communication. X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication", JPDC, October 2005.

Rich I/O Content in the Enterprise
Trends:
– Increasing layers of processing on I/O data: business-critical functions (TCP, IP storage, security, XML, etc.), independent of actual application processing; exacerbated by high network rates
– High rates of I/O bandwidth with new technologies: PCI-Express technology; 10 Gb/s to 40 Gb/s network technologies, and it just keeps going
Problem statement:
– Data movement latency to deliver data: interconnect protocols; data structures used for shared-memory communication, serialization and locking; data movement instructions (e.g., rep mov)
– Data transformation latencies: SW efficiency (degree of IA optimization); IA cores vs. fixed-function devices; location of processing (core, uncore, chipset vs. device)
– Virtualization and real workload requirements
Diagram: network data flowing through TCP/IP, iSCSI, SSL, and XML layers up to the application on the platform.

Network Protocols
TCP/IP has 4 layers; the OSI reference model has 7, with OSI's top three layers (Application, Presentation, Session) mapping onto the single TCP/IP application layer.
OSI layer(s) / TCP/IP layer / Examples:
– Application, Presentation, Session / Application / HTTP, Telnet, XML, SSL
– Transport / Transport / TCP, UDP
– Network / Internet / IP, IPSec, ICMP
– Data Link, Physical / Link / Ethernet, FDDI; coax, signaling

Our Concentration in this talk: TCP/IP

Network Bandwidth is Increasing. Plot: GHz and Gbps vs. time. Network bandwidth outpaces Moore's Law, while TCP requirements follow the rule of thumb of 1 GHz for 1 Gbps. The gap between the rate of processing network applications and the fast-growing network bandwidth is increasing.

Profile of a Packet. Chart: per-packet cycles split among system overheads, descriptor & header accesses, IP processing, TCB accesses, TCP processing, and memory copies (compute vs. memory time). Total avg clocks/packet: ~21K. Effective bandwidth: 0.6 Gb/s (1 KB receive).
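A back-of-envelope check of those numbers; the clock rate is an assumption (the slide does not state it), everything else is from the slide. At ~21K cycles per 1 KB packet, a roughly 1.5 GHz core delivers about the quoted 0.6 Gb/s, which also shows why the earlier "1 GHz for 1 Gbps" rule of thumb is optimistic for small receives.

```python
# Sanity check of the slide's numbers. Assumption: ~1.5 GHz core clock
# (not stated on the slide); cycle count and payload are from the slide.
clock_hz = 1.5e9            # assumed CPU clock
cycles_per_packet = 21_000  # ~21K clocks per packet
payload_bits = 1024 * 8     # 1 KB receive

packets_per_sec = clock_hz / cycles_per_packet
gbps = packets_per_sec * payload_bits / 1e9
print(f"effective bandwidth: {gbps:.2f} Gb/s")            # ~0.59 Gb/s

# CPU clock needed for 1 Gb/s at this per-packet cost:
ghz_for_1gbps = (1e9 / payload_bits) * cycles_per_packet / 1e9
print(f"CPU needed for 1 Gb/s: {ghz_for_1gbps:.1f} GHz")  # ~2.6 GHz
```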

Five Emerging Technologies
– Optimized network protocol stack (ISSS+CODES, 2003)
– Cache optimization (ISSS+CODES, 2003; ANCHOR, 2004)
– Network stack affinity scheduling
– Direct cache access
– Lightweight threading
– Memory copy engine (ICCD 2005 and IEEE TC)

Stack Optimizations (Instruction Count)
– Separate data & control paths: TCP data-path focused; reduce # of conditionals; NIC assist logic (L3/L4 stateless logic)
– Basic memory optimizations: cache-line-aware data structures; SW prefetches
– Optimized computation: standard compiler capability
=> 3X reduction in instructions per packet

Reduce Protocol Overheads (TCP/IP)
– Data touching: copies (0-copy); checksum (offload to NIC; see the sketch after this list)
– Non-data touching: operating system (interrupt processing: interrupt coalescing; memory management); protocol processing (LSO)
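To make the checksum item concrete: what the NIC takes over is the 16-bit one's-complement Internet checksum (RFC 1071). A plain-software sketch of that computation, the per-byte "data touching" work removed from the CPU:

```python
# RFC 1071 Internet checksum: the data-touching work that checksum
# offload moves from the CPU onto the NIC.
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold end-around carry
    return ~total & 0xFFFF               # one's-complement of the sum

print(hex(internet_checksum(b"hello world")))
```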

Instruction Mix & ILP
– Higher % of unconditional branches, lower % of conditional branches
– Less sensitive to ILP: widening issue width 1→2 speeds up SPEC by 40% but TCP/IP by only 29%; widening 2→4 gives 24% (SPEC) vs. 15% (TCP/IP)

EX: Frequently Used Instruction Pairs in TCP/IP
1st instruction / 2nd instruction / Occurrence:
– ADDIU / BNE: 4.91%
– ANDI / BEQ: 4.80%
– ADDU / (…): 3.56%
– SLL / OR: 3.38%
Identify frequent instruction pairs with a dependence (RAW). Integer + branch pairs arise in header validation, packet classification, and state checking. Combining the two instructions into a new instruction reduces the number of instructions and cycles (a toy fusion pass is sketched below).
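A toy peephole pass in the spirit of this proposal: scan a MIPS-like instruction stream for adjacent RAW-dependent pairs from the table above and replace them with a fused macro-op. The tuple encoding and the FUSABLE table are invented for illustration, not a real ISA definition.

```python
# Toy peephole fuser: merge adjacent RAW-dependent pairs such as
# ADDIU->BNE into one macro-op, as the ISA optimization suggests.
FUSABLE = {("ADDIU", "BNE"), ("ANDI", "BEQ"), ("SLL", "OR")}

def fuse(instrs):
    """instrs: list of (opcode, dest, source...) tuples; dest may be None."""
    out, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs):
            op1, op2 = instrs[i], instrs[i + 1]
            raw_dep = op1[1] is not None and op1[1] in op2[2:]
            if (op1[0], op2[0]) in FUSABLE and raw_dep:
                # Emit a single fused macro-op covering both instructions.
                out.append((op1[0] + "_" + op2[0],) + op1[1:] + op2[2:])
                i += 2
                continue
        out.append(instrs[i])
        i += 1
    return out

code = [("ADDIU", "r1", "r1", "4"), ("BNE", None, "r1", "r2", "loop")]
print(fuse(code))  # [('ADDIU_BNE', 'r1', 'r1', '4', 'r1', 'r2', 'loop')]
```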

Execution Time Reduction. The number of instructions removed is not proportional to the execution-time reduction: a 1% instruction reduction yields a 6% to 23% execution-time reduction. Reductions by component: instruction access time 47%, CPU execution time 3%, data access time 14%. L. Bhuyan, "Architectural Analysis and Instruction Set Optimization for Network Protocol Processors", IEEE ISSS+CODES, October 2003 (with H. Xie and L. Zhao).

Cache Optimizations

Instruction Cache Behavior. TCP/IP places a higher requirement on L1 cache size due to its program structure; it benefits more from an L1 cache with a larger size, larger line size, and higher degree of set associativity.

Execution Time Analysis. Given a total L1 cache size on the chip, more area should be devoted to the I-cache and less to the D-cache.

Network Cache
– Two sub-caches: TLC (temporal data) and SB (non-temporal data)
– Benefits: reduces cache pollution; each cache has its own configuration

Reduce Compulsory Cache Misses
– NIC descriptors and TCP headers: Cache Region Locking w/ Auto Updates (CRL): lock a memory region, perform updates in place; support for CRL-AU via hybrid (update-based) protocols and auto-fill prefetch
– TCP payload: Cache Region Prefetching (CRP)

Network Stack Affinity. Diagram: a multi-CPU platform (CPUs, chipset, memory, I/O interface) with one CPU core dedicated to network I/O; Intel calls it onloading. A user-space sketch of the same idea follows this list.
– Assigns network I/O workloads to designated devices
– Separates network I/O from application work
– Reduces scheduling overheads
– More efficient cache utilization
– Increases pipeline efficiency
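On Linux, the affinity idea can be expressed from user space: pin the network-processing work to a dedicated core so it stops competing with application threads. A minimal sketch using os.sched_setaffinity (Linux-only); the core numbers are assumptions.

```python
# Pin network-I/O work to a dedicated core (Linux-only sketch).
import os

NETWORK_CORE = {0}       # assumed: core reserved for network I/O
APP_CORES = {1, 2, 3}    # assumed: remaining cores for application work

def pin_current_process(cores):
    """Restrict the calling process (pid 0) to the given CPU set."""
    os.sched_setaffinity(0, cores)

pin_current_process(NETWORK_CORE)
print("running on cores:", os.sched_getaffinity(0))
```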

Direct Cache Access (DCA). Diagrams compare the two flows among NIC, memory controller, memory, and CPU cache:
– Normal DMA write: (1) DMA write, (2) snoop invalidate, (3) memory write, (4) CPU read
– Direct cache access: (1) DMA write, (2) cache update, (3) CPU read
Eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
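A tiny model of the two flows above; the step names mirror the slide, while the DRAM-touch accounting is an illustrative assumption, not a measured count.

```python
# Step-by-step comparison of normal DMA writes vs. DCA (illustrative).
NORMAL_DMA = ["DMA write", "snoop invalidate",
              "memory write", "CPU read (miss to memory)"]
DCA_FLOW = ["DMA write", "cache update", "CPU read (cache hit)"]

def dram_touches(flow):
    # Count the steps that must reach DRAM (simplified accounting).
    return sum("memory" in step for step in flow)

for name, flow in [("Normal DMA write", NORMAL_DMA),
                   ("Direct Cache Access", DCA_FLOW)]:
    print(f"{name}: {len(flow)} steps, {dram_touches(flow)} DRAM touches")
```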

Lightweight Threading. Diagram: a single hardware context whose execution pipeline is shared by two S/W-controlled threads under a thread manager; on a memory informing event (e.g., a cache miss), the core continues computing in the single pipeline in the shadow of the miss. Builds on helper threads; reduces CPU stalls.
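The switch-on-miss idea can be sketched with generators standing in for the two software-controlled threads: when one "thread" reports a cache miss, the manager resumes the other in the shadow of that miss. Everything here (the MISS marker, the manager loop) is an illustrative model, not the hardware mechanism.

```python
# Model of lightweight threading: two S/W threads share one hardware
# context; a simulated cache miss hands the pipeline to the other thread.
def sw_thread(name, work):
    """Yields one item per 'cycle'; the string MISS models an
    informing memory event (e.g., a cache miss)."""
    for item in work:
        yield name, item, item == "MISS"

def thread_manager(threads):
    """Run the current thread; switch on a miss to hide the stall."""
    ready, i = list(threads), 0
    while ready:
        try:
            name, item, miss = next(ready[i])
            print(f"{name}: {'miss, switching' if miss else 'compute ' + item}")
            if miss:
                i = (i + 1) % len(ready)   # other thread fills the miss shadow
        except StopIteration:
            ready.pop(i)                   # thread finished; drain the rest
            if ready:
                i %= len(ready)

thread_manager([sw_thread("T1", ["a", "MISS", "b"]),
                sw_thread("T2", ["x", "y"])])
```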

Memory Copy Engines. L. Bhuyan, "Hardware Support for Bulk Data Movement in Server Platforms", ICCD, October 2005 (also to appear in IEEE TC), with L. Zhao, et al.

Memory Overheads. Chart: per-packet memory traffic split among NIC descriptors, mbufs, TCP/IP headers, and payload.

Copy Engines
Copy is time-consuming because:
– The CPU moves data at small granularity
– The source or destination is in memory (not cache)
– Memory accesses clog up resources
A copy engine can:
– Perform fast copies while reducing CPU resource occupancy
– Do copies in parallel with CPU computation
– Avoid cache pollution and reduce interconnect traffic
Low-overhead communication between the engine & the CPU requires:
– Hardware support for the engine to run asynchronously with the CPU
– Hardware support to share virtual addresses between the engine and the CPU
– Low-overhead signaling of completion

Design of the Copy Engine (CE)
– Triggering the CE: copy initiation, address translation, copy communication
– Communication between the CPU and the CE

Performance Evaluation

Asynchronous Low-Cost Copy (ALCC). Diagram: timeline contrasting serialized app processing + memory copy with the two overlapped. Today, memory-to-memory data copies require CPU execution. Build a copy engine and tightly couple it with the CPU (low communication overhead; asynchronous execution w.r.t. the CPU), so the CPU continues computing during memory-to-memory copies. A user-level sketch of the pattern follows.
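A user-level analogue of ALCC, under assumptions: the submit/wait interface and the worker thread stand in for a real engine's hardware descriptors and virtual-address sharing.

```python
# Asynchronous copy sketch: submit a copy, overlap app work, then wait,
# which is the ALCC pattern in user-space form.
import threading

class CopyEngine:
    def submit(self, dst, src):
        """Start an async copy; returns an event signaled on completion."""
        done = threading.Event()
        def do_copy():
            dst[:len(src)] = src   # the bulk data movement
            done.set()             # low-overhead completion signal
        threading.Thread(target=do_copy, daemon=True).start()
        return done

engine = CopyEngine()
src = bytearray(b"packet payload" * 1000)
dst = bytearray(len(src))

pending = engine.submit(dst, src)  # copy proceeds off the "CPU"
app_result = sum(range(10_000))    # app processing overlaps the copy
pending.wait()                     # synchronize only when data is needed
assert dst == src
```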

Total I/O Acceleration

Potential Efficiencies (10X). On-CPU, multi-gigabit, line-speed network I/O is possible, combining the benefits of affinity with the benefits of architectural techniques. Greg Regnier, et al., "TCP Onloading for DataCenter Servers," IEEE Computer, vol. 37, Nov. 2004.

I/O Acceleration: Problem Magnitude. Chart: costs of memory copies & effects of streaming, CRCs, crypto, and parsing/tree construction across storage over IP, networking, security, and services. I/O processing rates are significantly limited by the CPU in the face of data movement and transformation operations.

Building Block Engines (BBEs)
– Bulk data operations: copies/moves, scatter/gather, inter-VM communication
– Data transformation: encryption, compression, XML parsing
– Data validation: XORs, checksums, CRCs
Investigate architectural and platform support for building block engines in future servers. Questions:
– What are the characteristics of bulk data operations?
– Why are they performance bottlenecks today?
– What is the best way to improve their performance: parallelize the computation across many small CPUs, or build a BBE and tightly couple it with the CPU?
– How do we expose BBEs? What granularity? Reconfigurability (core vs. BBE)?
Diagram: a scalable on-die fabric connecting small cores, caches, BBEs, and integrated memory controllers.

Conclusions and Future Work
Studied architectural characteristics of key network protocols:
– TCP/IP requires more instruction cache and has a large memory overhead
Proposed several techniques for optimization:
– Caching techniques
– ISA optimization
– Data copy engines
Further investigation of network protocols & optimization:
– Heterogeneous chip multiprocessors
– Other I/O applications: SSL, XML, etc.
– Use of network processors and FPGAs