Multicore Architectures
Managing Wire Delay in Large CMP Caches. Bradford M. Beckmann and David A. Wood, Multifacet Project, University of Wisconsin-Madison. MICRO 2004, 12/8/04.
Static NUCA
Dynamic NUCA
Current CMP: IBM Power 5 (diagram: two CPUs, each with private L1 I$ and L1 D$, sharing three L2 cache banks)
Baseline: CMP-SNUCA (diagram: eight CPUs, each with private L1 I$ and L1 D$, arranged around a shared, statically mapped L2 bank array)
Block Migration: CMP-DNUCA (diagram: same eight-CPU layout; blocks A and B migrate through the L2 banks toward the CPUs that access them)
On-chip Transmission Lines
Similar to contemporary off-chip communication; provides a different latency / bandwidth tradeoff.
Wires behave more like transmission lines as frequency increases:
–Utilize transmission-line qualities to our advantage
–No repeaters; route directly over large structures
–~10x lower latency across long distances
Limitations:
–Requires thick wires and dielectric spacing
–Increases manufacturing cost
Transmission Lines: CMP-TLC (diagram: eight CPUs with private L1 I$ and L1 D$ connected to a central L2 by 16 8-byte transmission-line links)
Combination: CMP-Hybrid (diagram: the eight-CPU CMP-DNUCA layout with 8 32-byte transmission-line links connecting the cache to the CPUs)
CMP-DNUCA: Organization (diagram: the L2 banks are grouped into bankclusters, local, intermediate, and center, surrounding the eight CPUs)
Hit Distribution (figure: grayscale shading over the bank layout; darker banks serve a greater percentage of L2 hits)
L2 Hit Latency (figure: bars labeled D for CMP-DNUCA, T for CMP-TLC, and H for CMP-Hybrid)
Overall Performance: transmission lines improve both L2 hit latency and L2 miss latency.
I/O Acceleration in Server Architectures. Laxmi N. Bhuyan, University of California, Riverside. http://www.cs.ucr.edu/~bhuyan
Acknowledgement: Many slides in this presentation have been taken from, or modified from, Li Zhao's Ph.D. dissertation at UCR and Ravi Iyer's (Intel) presentation at UCR.
Enterprise Workloads: Key Characteristics
–Throughput-oriented
  Lots of transactions, operations, etc. in flight
  Many VMs, processes, threads, fibers, etc.
  Scalability and adaptability are key
–Rich (I/O) content
  TCP, SoIP, SSL, XML
  High throughput requirements
  Efficiency and utilization are key
Server Network I/O Acceleration Bottlenecks
Rate of Technology Improvement
Rich I/O Content – How does a server communicate with I/O Devices?
Communicating with the Server: The O/S Wall (figure: packet path from the NIC across the PCI bus, through the kernel, up to the user-level application on the CPU)
Problems: O/S overhead to move a packet between the network and the application level:
–Protocol stack (TCP/IP) processing
–O/S interrupt handling
–Data copying from kernel space to user space and vice versa
–And the PCI bus bottleneck
Our aim: design server (CPU) architectures to overcome these problems.
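To make the O/S-wall costs concrete, here is a minimal sketch, assuming a plain POSIX UDP socket, of the conventional receive path the slide describes: every packet crosses the user/kernel boundary through a system call and is copied from kernel buffers into the user buffer. The port number and buffer size are arbitrary illustration values.

```c
/* Conventional O/S receive path: one system call per packet, with a
 * kernel-to-user copy of the payload on every recv(). */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);      /* kernel-managed socket */
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                  /* arbitrary port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* Interrupt -> protocol stack -> kernel-to-user copy: exactly the
         * per-packet overheads listed on the slide. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0)
            break;
        printf("received %zd bytes\n", n);
    }
    close(fd);
    return 0;
}
```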
TCP Offloading with the Cavium Octeon, a multi-core MIPS64 processor
Application Oriented Networking (AON) (figure: a content-aware switch between the Internet and the servers inspects application-level data, e.g. "GET /cgi-bin/form HTTP/1.1 Host: www.site.com", carried over TCP/IP)
Programmable routers face the same problems: requests pass through the network, IP, and TCP layers before application-level processing.
Solution: bring processing down to the network level => TCP offload (not a topic for discussion here).
Ref: L. Bhuyan, "A Network Processor Based, Content Aware Switch", IEEE Micro, May/June 2006 (with L. Zhao, et al.).
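As a toy illustration of the application-level processing such a content-aware switch performs, the sketch below parses the request line of an HTTP request and picks a back-end pool from the URL prefix. The pool names and the routing rule are invented for illustration.

```c
/* Content-aware routing sketch: inspect the HTTP request line and choose a
 * server pool based on the requested URL. */
#include <stdio.h>
#include <string.h>

const char *route_by_url(const char *request) {
    char method[8], url[256];
    if (sscanf(request, "%7s %255s", method, url) != 2)
        return "default-pool";
    if (strncmp(url, "/cgi-bin/", 9) == 0)
        return "dynamic-content-pool";   /* script requests */
    return "static-content-pool";        /* everything else */
}

int main(void) {
    const char *req = "GET /cgi-bin/form HTTP/1.1\r\nHost: www.site.com\r\n";
    printf("routed to: %s\n", route_by_url(req));
    return 0;
}
```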
Timing Measurement in a UDP Communication. X. Zhang, L. Bhuyan, and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication", JPDC, October 2005.
Rich I/O Content in the Enterprise
Trends
–Increasing layers of processing on I/O data
  Business-critical functions (TCP, IP storage, security, XML, etc.)
  Independent of actual application processing
  Exacerbated by high network rates
–High I/O bandwidth with new technologies
  PCI Express technology
  10 Gb/s to 40 Gb/s network technologies, and it just keeps going
Problem Statement
–Data movement latencies to deliver data
  Interconnect protocols
  Data structures used for shared-memory communication, serialization, and locking
  Data movement instructions (e.g., rep mov)
–Data transformation latencies
  SW efficiency: degree of IA optimization
  IA cores vs. fixed-function devices
  Location of processing: core, uncore, chipset, or device
–Virtualization and real workload requirements
(figure: I/O data crosses TCP/IP, iSCSI, SSL, and XML layers between the network, the platform, and the application)
Network Protocols
TCP/IP protocol suite: 4 layers (Application, Transport, Internet, Link).
OSI reference model: 7 layers (Application, Presentation, Session, Transport, Network, Data Link, Physical); the TCP/IP application layer spans the top three OSI layers.
Mapping with examples:
–Application (OSI layers 5-7): HTTP, Telnet, XML, SSL
–Transport (OSI layer 4): TCP, UDP
–Internet (OSI layer 3): IP, IPSec, ICMP
–Link (OSI layers 1-2): Ethernet, FDDI; coax, signaling
Our Concentration in this talk: TCP/IP
Network Bandwidth is Increasing
(figure: log-scale plot, 1990-2010, of Gbps and GHz over time; network bandwidth outpaces Moore's Law and the CPU cycles available for TCP processing)
Rule of thumb: about 1 GHz of CPU for every 1 Gbps of network bandwidth, so a 10 Gb/s link would need roughly 10 GHz worth of processing.
The gap between the rate at which network applications can be processed and the fast-growing network bandwidth is widening.
Profile of a Packet
(figure: breakdown of per-packet cycles into system overheads, descriptor and header accesses, IP processing, TCB accesses, TCP processing, and memory copies, split between compute and memory time)
Total average clocks per packet: ~21K. Effective bandwidth: 0.6 Gb/s for a 1 KB receive (roughly what a ~1.5 GHz core can sustain at 21K cycles per packet).
Five Emerging Technologies
–Optimized network protocol stack (ISSS+CODES 2003)
–Cache optimization (ISSS+CODES 2003; ANCHOR 2004)
–Network stack affinity scheduling
–Direct cache access
–Lightweight threading
–Memory copy engine (ICCD 2005 and IEEE TC)
Stack Optimizations (Instruction Count)
Separate data and control paths
–TCP data-path focused
–Reduce the number of conditionals
–NIC assist logic (L3/L4 stateless logic)
Basic memory optimizations
–Cache-line-aware data structures
–SW prefetches
Optimized computation
–Standard compiler capability
Result: 3x reduction in instructions per packet.
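A hedged sketch of the two basic memory optimizations named above, assuming a 64-byte cache line and GCC/Clang builtins; the descriptor layout is hypothetical, not the stack's actual data structure.

```c
/* Cache-line-aware data structure plus a software prefetch in the
 * descriptor-processing loop. */
#include <stddef.h>
#include <stdint.h>

/* Keep each descriptor in exactly one 64-byte cache line. */
struct __attribute__((aligned(64))) rx_desc {
    uint64_t addr;      /* packet buffer address */
    uint32_t len;       /* packet length         */
    uint32_t status;    /* ownership/error bits  */
    uint8_t  pad[48];   /* pad to 64 bytes       */
};

uint32_t process_ring(struct rx_desc *ring, size_t n) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        /* SW prefetch: pull the next descriptor while handling this one. */
        if (i + 1 < n)
            __builtin_prefetch(&ring[i + 1], 0 /* read */, 3 /* keep in cache */);
        if (ring[i].status & 1u)
            total += ring[i].len;
    }
    return total;
}
```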
Reduce Protocol Overheads
TCP/IP
–Data touching: copies (zero-copy), checksum (offload to NIC)
–Non-data touching: protocol processing (LSO)
Operating system
–Interrupt processing: interrupt coalescing
–Memory management
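One concrete example of removing a data-touching copy, using the standard Linux sendfile() call rather than the stack modifications studied in this work: the kernel moves file data to the socket without a user-space copy. sock_fd and file_fd are assumed to be a connected TCP socket and an open file.

```c
/* Zero-copy transmit: the payload never passes through a user buffer. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

long send_file_zero_copy(int sock_fd, int file_fd) {
    struct stat st;
    if (fstat(file_fd, &st) != 0)
        return -1;
    off_t off = 0;
    long sent = 0;
    while (off < st.st_size) {
        ssize_t n = sendfile(sock_fd, file_fd, &off, st.st_size - off);
        if (n <= 0)
            return -1;          /* error or connection closed */
        sent += n;
    }
    return sent;                /* bytes moved with no user-space copy */
}
```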
Instruction Mix & ILP
Compared with SPEC, TCP/IP code has a higher percentage of unconditional branches, a lower percentage of conditional branches, and is less sensitive to ILP.
Benefit from increasing issue width (1 to 2, then 2 to 4): SPEC 40% and 24%; TCP/IP 29% and 15%.
Example: Frequently Used Instruction Pairs in TCP/IP

1st instruction | 2nd instruction | Occurrence
ADDIU           | BNE             | 4.91%
ANDI            | BEQ             | 4.80%
ADDU            |                 | 3.56%
SLL             | OR              | 3.38%

–Identify frequent instruction pairs with a dependence (RAW)
–Integer + branch pairs arise in header validation, packet classification, and state checking
–Combine the two instructions to create a new instruction => reduces the number of instructions and cycles
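The C fragment below is an invented illustration of where such integer + branch pairs come from: header validation masks or adjusts a field and immediately branches on the result, which a MIPS-like compiler typically lowers to ANDI/ADDIU-style instructions feeding BEQ/BNE. The struct and field names are illustrative only.

```c
/* Header-validation pattern that produces dependent integer + branch pairs. */
#include <stdint.h>

struct ip_hdr_min {
    uint8_t  ver_ihl;   /* version (4 bits) + header length (4 bits) */
    uint8_t  tos;
    uint16_t tot_len;
};

int validate(const struct ip_hdr_min *h) {
    /* Mask a field, then branch on the result (an ANDI feeding a branch). */
    if ((h->ver_ihl & 0xF0) != 0x40)
        return 0;                          /* not IPv4 */
    /* Scale a field, then branch on the comparison (shift/add feeding a branch). */
    if ((h->ver_ihl & 0x0F) * 4 < 20)
        return 0;                          /* header too short */
    return 1;
}
```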
Execution Time Reduction
The reduction in instruction count is not proportional to the reduction in execution time.
(figure: execution-time breakdown annotated with 1%, 6% to 23%, instruction access time 47%, CPU execution time 3%, and data access time 14%)
Ref: L. Bhuyan, "Architectural Analysis and Instruction Set Optimization for Network Protocol Processors", IEEE ISSS+CODES, October 2003 (with H. Xie and L. Zhao).
Cache Optimizations
Instruction Cache Behavior
TCP/IP places a higher demand on L1 instruction-cache size because of its program structure; it benefits more from an L1 cache with larger size, larger line size, and higher set associativity.
Execution Time Analysis: given a fixed total L1 cache budget on the chip, more area should be devoted to the I-cache and less to the D-cache.
Network Cache
Two sub-caches:
–TLC: temporal data
–SB: non-temporal data
Benefits:
–Reduces cache pollution
–Each sub-cache has its own configuration
Reduce Compulsory Cache Misses
NIC descriptors and TCP headers
–Cache Region Locking with Auto Updates (CRL): lock a memory region, perform updates in place
–Support for CRL-AU: hybrid (update-based) protocols; auto-fill (prefetch)
TCP payload
–Cache Region Prefetching (CRP)
Network Stack Affinity
(figure: a platform with CPUs, memory, and chipset in which one CPU and its I/O interface are dedicated to network I/O; Intel calls this onloading)
–Assigns network I/O workloads to designated devices
–Separates network I/O from application work
–Reduces scheduling overheads
–More efficient cache utilization
–Increases pipeline efficiency
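A software-level sketch of the affinity idea, assuming Linux and the GNU pthread_setaffinity_np extension: the thread that runs the network stack is pinned to a designated core so protocol processing stays off the application cores. The core number is an arbitrary choice.

```c
/* Pin the network-I/O thread to a dedicated core (onloading in software terms). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *net_io_loop(void *arg) {
    (void)arg;
    /* ...poll the NIC, run TCP/IP processing, hand payloads to applications... */
    return NULL;
}

int main(void) {
    pthread_t net_thread;
    pthread_create(&net_thread, NULL, net_io_loop, NULL);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                       /* dedicate core 1 to network I/O */
    int rc = pthread_setaffinity_np(net_thread, sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

    pthread_join(net_thread, NULL);
    return 0;
}
```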
Direct Cache Access (DCA)
Normal DMA write: (1) the NIC DMA-writes the packet, (2) the write snoops and invalidates the CPU cache, (3) the data is written to memory, (4) the CPU read misses and fetches it from memory.
Direct Cache Access: (1) the NIC DMA-writes the packet, (2) the memory controller updates the cache directly, (3) the CPU read hits in the cache.
Eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
Lightweight Threading
Single hardware context, single core pipeline: software-controlled thread 1 and thread 2 share the execution pipeline under a thread manager. On a memory informing event (e.g., a cache miss), execution continues in the other thread, so the pipeline keeps computing in the shadow of the miss. Builds on helper threads; reduces CPU stalls.
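A rough software stand-in for this mechanism, using POSIX ucontext (obsolescent but widely available) in place of the proposed hardware thread manager: two software-controlled threads share one pipeline and each yields at the point where it would stall. The stall is simulated explicitly here, since real cache-miss informing events require the hardware support described above.

```c
/* Two software threads interleave on one "pipeline"; each yields where it
 * would otherwise stall on a miss. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, t1_ctx, t2_ctx;

static void thread1(void) {
    for (int i = 0; i < 3; i++) {
        printf("thread1: work %d, simulated miss -> yield\n", i);
        swapcontext(&t1_ctx, &t2_ctx);   /* hide the "miss" behind thread2 */
    }
    swapcontext(&t1_ctx, &main_ctx);
}

static void thread2(void) {
    for (int i = 0; i < 3; i++) {
        printf("thread2: work %d, simulated miss -> yield\n", i);
        swapcontext(&t2_ctx, &t1_ctx);
    }
    swapcontext(&t2_ctx, &main_ctx);
}

int main(void) {
    static char s1[64 * 1024], s2[64 * 1024];

    getcontext(&t1_ctx);
    t1_ctx.uc_stack.ss_sp = s1;
    t1_ctx.uc_stack.ss_size = sizeof(s1);
    t1_ctx.uc_link = &main_ctx;
    makecontext(&t1_ctx, thread1, 0);

    getcontext(&t2_ctx);
    t2_ctx.uc_stack.ss_sp = s2;
    t2_ctx.uc_stack.ss_size = sizeof(s2);
    t2_ctx.uc_link = &main_ctx;
    makecontext(&t2_ctx, thread2, 0);

    swapcontext(&main_ctx, &t1_ctx);     /* interleave the two software threads */
    printf("done\n");
    return 0;
}
```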
Memory Copy Engines
L. Bhuyan, "Hardware Support for Bulk Data Movement in Server Platforms", ICCD, October 2005 (also to appear in IEEE TC), with L. Zhao, et al.
Memory Overheads
–NIC descriptors
–Mbufs
–TCP/IP headers
–Payload
Copy Engines
Copying is time-consuming because:
–The CPU moves data at small granularity
–The source or destination is in memory (not cache)
–Memory accesses clog up resources
A copy engine can:
–Perform fast copies while reducing CPU resource occupancy
–Run copies in parallel with CPU computation
–Avoid cache pollution and reduce interconnect traffic
Low-overhead communication between the engine and the CPU requires:
–Hardware support to let the engine run asynchronously with the CPU
–Hardware support to share the virtual address space between the engine and the CPU
–Low-overhead signaling of completion
Design of the Copy Engine
Triggering the CE:
–Copy initiation
–Address translation
–Copy communication
Communication between the CPU and the CE
Performance Evaluation
Asynchronous Low-Cost Copy (ALCC)
Today, memory-to-memory data copies require CPU execution (figure: app processing serialized with memory copies).
Build a copy engine and tightly couple it with the CPU: low communication overhead and asynchronous execution with respect to the CPU, so the application continues computing during memory-to-memory copies (figure: app processing overlapped with memory copies).
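A minimal sketch of the asynchronous-copy idea with a worker thread standing in for the proposed hardware copy engine: the copy is initiated, the "CPU" keeps computing, and completion is then checked, mirroring the overlapped timeline in the ALCC figure. Buffer sizes and the compute loop are arbitrary.

```c
/* Overlap a bulk copy (done by a stand-in "engine" thread) with computation. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct copy_req { void *dst; const void *src; size_t len; };

static void *copy_engine(void *arg) {          /* stand-in for the hardware engine */
    struct copy_req *r = arg;
    memcpy(r->dst, r->src, r->len);
    return NULL;
}

int main(void) {
    size_t len = 1 << 20;                      /* 1 MB payload */
    char *src = malloc(len), *dst = malloc(len);
    memset(src, 0xab, len);

    struct copy_req req = { dst, src, len };
    pthread_t engine;
    pthread_create(&engine, NULL, copy_engine, &req);   /* initiate the copy */

    /* "App processing" overlapped with the copy, as in the ALCC figure. */
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 10000000UL; i++)
        acc += i;

    pthread_join(engine, NULL);                /* completion signal */
    printf("compute result %lu, copied byte 0x%02x\n", acc, (unsigned char)dst[0]);
    free(src);
    free(dst);
    return 0;
}
```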
Total I/O Acceleration
Potential Efficiencies (10x)
On-CPU, multi-gigabit, line-speed network I/O is possible (figure: benefits of affinity plus benefits of architectural techniques).
Greg Regnier, et al., "TCP Onloading for Data Center Servers," IEEE Computer, vol. 37, Nov. 2004.
I/O Acceleration: Problem Magnitude
(figure: memory copies and the effects of streaming, CRCs and crypto, and parsing and tree construction dominate storage-over-IP, networking, security, and services workloads)
I/O processing rates are significantly limited by the CPU in the face of data movement and data transformation operations.
Building Block Engines
Bulk data operations: copies / moves, scatter / gather, inter-VM communication
Data transformation: encryption, compression, XML parsing, data validation, XORs, checksums, CRCs
Investigate architectural and platform support for building block engines (BBEs) in future servers.
Questions:
–What are the characteristics of bulk data operations?
–Why are they performance bottlenecks today?
–What is the best way to improve their performance: parallelize the computation across many small CPUs, or build a BBE and tightly couple it with the CPU?
–How do we expose BBEs? At what granularity? With what reconfigurability?
(diagram: cores, small cores, and BBEs with integrated memory controllers and cache on a scalable on-die fabric)
Conclusions and Future Work
Studied architectural characteristics of key network protocols
–TCP/IP requires more instruction cache and has a large memory overhead
Proposed several techniques for optimization
–Caching techniques
–ISA optimization
–Data copy engines
Further investigation on network protocols and optimization
–Heterogeneous chip multiprocessors
–Other I/O applications: SSL, XML, etc.
–Use of network processors and FPGAs