Multicore Architectures
Managing Wire Delay in Large CMP Caches. Bradford M. Beckmann and David A. Wood, Multifacet Project, University of Wisconsin-Madison. MICRO 2004, 12/8/04.
Static NUCA
Dynamic NUCA
Current CMP: IBM Power 5 (diagram: two CPUs, each with private L1 I$ and L1 D$, sharing three L2 cache banks)
Baseline: CMP-SNUCA (diagram: eight CPUs, each with private L1 I$ and L1 D$, arranged around a shared, statically mapped L2 bank array)
Block Migration: CMP-DNUCA (diagram: same eight-CPU layout; blocks A and B migrate through the L2 banks toward the CPUs that access them)
On-chip Transmission Lines
Similar to contemporary off-chip communication; provides a different latency / bandwidth tradeoff.
Wires behave more like transmission lines as frequency increases:
–Utilize transmission-line qualities to our advantage
–No repeaters; route directly over large structures
–~10x lower latency across long distances
Limitations:
–Requires thick wires and dielectric spacing
–Increases manufacturing cost
Transmission Lines: CMP-TLC (diagram: eight CPUs with private L1 I$ and L1 D$ connected to a central L2 by 16 8-byte transmission-line links)
Combination: CMP-Hybrid (diagram: the eight-CPU CMP-DNUCA layout with 8 32-byte transmission-line links connecting the cache to the CPUs)
CMP-DNUCA: Organization (diagram: the L2 banks are grouped into bankclusters, local, intermediate, and center, surrounding the eight CPUs)
Hit Distribution (figure: grayscale shading over the bank layout; darker banks serve a greater percentage of L2 hits)
L2 Hit Latency (figure: bars labeled D for CMP-DNUCA, T for CMP-TLC, and H for CMP-Hybrid)
Overall Performance: transmission lines improve both L2 hit latency and L2 miss latency.
I/O Acceleration in Server Architectures. Laxmi N. Bhuyan, University of California, Riverside. http://www.cs.ucr.edu/~bhuyan
Acknowledgement: Many slides in this presentation have been taken from, or modified from, Li Zhao's Ph.D. dissertation at UCR and Ravi Iyer's (Intel) presentation at UCR.
Enterprise Workloads: Key Characteristics
–Throughput-oriented
  Lots of transactions, operations, etc. in flight
  Many VMs, processes, threads, fibers, etc.
  Scalability and adaptability are key
–Rich (I/O) content
  TCP, SoIP, SSL, XML
  High throughput requirements
  Efficiency and utilization are key
Server Network I/O Acceleration Bottlenecks
Rate of Technology Improvement
Rich I/O Content – How does a server communicate with I/O Devices?
Communicating with the Server: The O/S Wall (figure: packet path from the NIC across the PCI bus, through the kernel, up to the user-level application on the CPU)
Problems: O/S overhead to move a packet between the network and the application level:
–Protocol stack (TCP/IP) processing
–O/S interrupt handling
–Data copying from kernel space to user space and vice versa
–And the PCI bus bottleneck
Our aim: design server (CPU) architectures to overcome these problems.
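To make the O/S-wall costs concrete, here is a minimal sketch, assuming a plain POSIX UDP socket, of the conventional receive path the slide describes: every packet crosses the user/kernel boundary through a system call and is copied from kernel buffers into the user buffer. The port number and buffer size are arbitrary illustration values.

```c
/* Conventional O/S receive path: one system call per packet, with a
 * kernel-to-user copy of the payload on every recv(). */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);      /* kernel-managed socket */
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                  /* arbitrary port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* Interrupt -> protocol stack -> kernel-to-user copy: exactly the
         * per-packet overheads listed on the slide. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0)
            break;
        printf("received %zd bytes\n", n);
    }
    close(fd);
    return 0;
}
```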
TCP Offloading with the Cavium Octeon, a multi-core MIPS64 processor
Application Oriented Networking (AON) (figure: a content-aware switch between the Internet and the servers inspects application-level data, e.g. "GET /cgi-bin/form HTTP/1.1 Host: www.site.com", carried over TCP/IP)
Programmable routers face the same problems: requests pass through the network, IP, and TCP layers before application-level processing.
Solution: bring processing down to the network level => TCP offload (not a topic for discussion here).
Ref: L. Bhuyan, "A Network Processor Based, Content Aware Switch", IEEE Micro, May/June 2006 (with L. Zhao, et al.).
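As a toy illustration of the application-level processing such a content-aware switch performs, the sketch below parses the request line of an HTTP request and picks a back-end pool from the URL prefix. The pool names and the routing rule are invented for illustration.

```c
/* Content-aware routing sketch: inspect the HTTP request line and choose a
 * server pool based on the requested URL. */
#include <stdio.h>
#include <string.h>

const char *route_by_url(const char *request) {
    char method[8], url[256];
    if (sscanf(request, "%7s %255s", method, url) != 2)
        return "default-pool";
    if (strncmp(url, "/cgi-bin/", 9) == 0)
        return "dynamic-content-pool";   /* script requests */
    return "static-content-pool";        /* everything else */
}

int main(void) {
    const char *req = "GET /cgi-bin/form HTTP/1.1\r\nHost: www.site.com\r\n";
    printf("routed to: %s\n", route_by_url(req));
    return 0;
}
```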
Timing Measurement in a UDP Communication. X. Zhang, L. Bhuyan, and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication", JPDC, October 2005.
Rich I/O Content in the Enterprise
Trends
–Increasing layers of processing on I/O data
  Business-critical functions (TCP, IP storage, security, XML, etc.)
  Independent of actual application processing
  Exacerbated by high network rates
–High I/O bandwidth with new technologies
  PCI Express technology
  10 Gb/s to 40 Gb/s network technologies, and it just keeps going
Problem Statement
–Data movement latencies to deliver data
  Interconnect protocols
  Data structures used for shared-memory communication, serialization, and locking
  Data movement instructions (e.g., rep mov)
–Data transformation latencies
  SW efficiency: degree of IA optimization
  IA cores vs. fixed-function devices
  Location of processing: core, uncore, chipset, or device
–Virtualization and real workload requirements
(figure: I/O data crosses TCP/IP, iSCSI, SSL, and XML layers between the network, the platform, and the application)
Network Protocols
TCP/IP protocol suite: 4 layers (Application, Transport, Internet, Link).
OSI reference model: 7 layers (Application, Presentation, Session, Transport, Network, Data Link, Physical); the TCP/IP application layer spans the top three OSI layers.
Mapping with examples:
–Application (OSI layers 5-7): HTTP, Telnet, XML, SSL
–Transport (OSI layer 4): TCP, UDP
–Internet (OSI layer 3): IP, IPSec, ICMP
–Link (OSI layers 1-2): Ethernet, FDDI; coax, signaling
Our Concentration in this talk: TCP/IP
Network Bandwidth is Increasing
(figure: log-scale plot, 1990-2010, of Gbps and GHz over time; network bandwidth outpaces Moore's Law and the CPU cycles available for TCP processing)
Rule of thumb: about 1 GHz of CPU for every 1 Gbps of network bandwidth, so a 10 Gb/s link would need roughly 10 GHz worth of processing.
The gap between the rate at which network applications can be processed and the fast-growing network bandwidth is widening.
Profile of a Packet
(figure: breakdown of per-packet cycles into system overheads, descriptor and header accesses, IP processing, TCB accesses, TCP processing, and memory copies, split between compute and memory time)
Total average clocks per packet: ~21K. Effective bandwidth: 0.6 Gb/s for a 1 KB receive (roughly what a ~1.5 GHz core can sustain at 21K cycles per packet).
Five Emerging Technologies
–Optimized network protocol stack (ISSS+CODES 2003)
–Cache optimization (ISSS+CODES 2003; ANCHOR 2004)
–Network stack affinity scheduling
–Direct cache access
–Lightweight threading
–Memory copy engine (ICCD 2005 and IEEE TC)
Stack Optimizations (Instruction Count)
Separate data and control paths
–TCP data-path focused
–Reduce the number of conditionals
–NIC assist logic (L3/L4 stateless logic)
Basic memory optimizations
–Cache-line-aware data structures
–SW prefetches
Optimized computation
–Standard compiler capability
Result: 3x reduction in instructions per packet.
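A hedged sketch of the two basic memory optimizations named above, assuming a 64-byte cache line and GCC/Clang builtins; the descriptor layout is hypothetical, not the stack's actual data structure.

```c
/* Cache-line-aware data structure plus a software prefetch in the
 * descriptor-processing loop. */
#include <stddef.h>
#include <stdint.h>

/* Keep each descriptor in exactly one 64-byte cache line. */
struct __attribute__((aligned(64))) rx_desc {
    uint64_t addr;      /* packet buffer address */
    uint32_t len;       /* packet length         */
    uint32_t status;    /* ownership/error bits  */
    uint8_t  pad[48];   /* pad to 64 bytes       */
};

uint32_t process_ring(struct rx_desc *ring, size_t n) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        /* SW prefetch: pull the next descriptor while handling this one. */
        if (i + 1 < n)
            __builtin_prefetch(&ring[i + 1], 0 /* read */, 3 /* keep in cache */);
        if (ring[i].status & 1u)
            total += ring[i].len;
    }
    return total;
}
```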
Reduce Protocol Overheads
TCP/IP
–Data touching: copies (zero-copy), checksum (offload to NIC)
–Non-data touching: protocol processing (LSO)
Operating system
–Interrupt processing: interrupt coalescing
–Memory management
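One concrete example of removing a data-touching copy, using the standard Linux sendfile() call rather than the stack modifications studied in this work: the kernel moves file data to the socket without a user-space copy. sock_fd and file_fd are assumed to be a connected TCP socket and an open file.

```c
/* Zero-copy transmit: the payload never passes through a user buffer. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

long send_file_zero_copy(int sock_fd, int file_fd) {
    struct stat st;
    if (fstat(file_fd, &st) != 0)
        return -1;
    off_t off = 0;
    long sent = 0;
    while (off < st.st_size) {
        ssize_t n = sendfile(sock_fd, file_fd, &off, st.st_size - off);
        if (n <= 0)
            return -1;          /* error or connection closed */
        sent += n;
    }
    return sent;                /* bytes moved with no user-space copy */
}
```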
Instruction Mix & ILP
Compared with SPEC, TCP/IP code has a higher percentage of unconditional branches, a lower percentage of conditional branches, and is less sensitive to ILP.
Benefit from increasing issue width (1 to 2, then 2 to 4): SPEC 40% and 24%; TCP/IP 29% and 15%.
Example: Frequently Used Instruction Pairs in TCP/IP

1st instruction | 2nd instruction | Occurrence
ADDIU           | BNE             | 4.91%
ANDI            | BEQ             | 4.80%
ADDU            |                 | 3.56%
SLL             | OR              | 3.38%

–Identify frequent instruction pairs with a dependence (RAW)
–Integer + branch pairs arise in header validation, packet classification, and state checking
–Combine the two instructions to create a new instruction => reduces the number of instructions and cycles
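The C fragment below is an invented illustration of where such integer + branch pairs come from: header validation masks or adjusts a field and immediately branches on the result, which a MIPS-like compiler typically lowers to ANDI/ADDIU-style instructions feeding BEQ/BNE. The struct and field names are illustrative only.

```c
/* Header-validation pattern that produces dependent integer + branch pairs. */
#include <stdint.h>

struct ip_hdr_min {
    uint8_t  ver_ihl;   /* version (4 bits) + header length (4 bits) */
    uint8_t  tos;
    uint16_t tot_len;
};

int validate(const struct ip_hdr_min *h) {
    /* Mask a field, then branch on the result (an ANDI feeding a branch). */
    if ((h->ver_ihl & 0xF0) != 0x40)
        return 0;                          /* not IPv4 */
    /* Scale a field, then branch on the comparison (shift/add feeding a branch). */
    if ((h->ver_ihl & 0x0F) * 4 < 20)
        return 0;                          /* header too short */
    return 1;
}
```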
Execution Time Reduction
The reduction in instruction count is not proportional to the reduction in execution time.
(figure: execution-time breakdown annotated with 1%, 6% to 23%, instruction access time 47%, CPU execution time 3%, and data access time 14%)
Ref: L. Bhuyan, "Architectural Analysis and Instruction Set Optimization for Network Protocol Processors", IEEE ISSS+CODES, October 2003 (with H. Xie and L. Zhao).
Cache Optimizations
Instruction Cache Behavior
TCP/IP places a higher demand on L1 instruction-cache size because of its program structure; it benefits more from an L1 cache with larger size, larger line size, and higher set associativity.
Execution Time Analysis: given a fixed total L1 cache budget on the chip, more area should be devoted to the I-cache and less to the D-cache.
Network Cache
Two sub-caches:
–TLC: temporal data
–SB: non-temporal data
Benefits:
–Reduces cache pollution
–Each sub-cache has its own configuration
Reduce Compulsory Cache Misses
NIC descriptors and TCP headers
–Cache Region Locking with Auto Updates (CRL): lock a memory region, perform updates in place
–Support for CRL-AU: hybrid (update-based) protocols; auto-fill (prefetch)
TCP payload
–Cache Region Prefetching (CRP)
Network Stack Affinity
(figure: a platform with CPUs, memory, and chipset in which one CPU and its I/O interface are dedicated to network I/O; Intel calls this onloading)
–Assigns network I/O workloads to designated devices
–Separates network I/O from application work
–Reduces scheduling overheads
–More efficient cache utilization
–Increases pipeline efficiency
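A software-level sketch of the affinity idea, assuming Linux and the GNU pthread_setaffinity_np extension: the thread that runs the network stack is pinned to a designated core so protocol processing stays off the application cores. The core number is an arbitrary choice.

```c
/* Pin the network-I/O thread to a dedicated core (onloading in software terms). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *net_io_loop(void *arg) {
    (void)arg;
    /* ...poll the NIC, run TCP/IP processing, hand payloads to applications... */
    return NULL;
}

int main(void) {
    pthread_t net_thread;
    pthread_create(&net_thread, NULL, net_io_loop, NULL);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                       /* dedicate core 1 to network I/O */
    int rc = pthread_setaffinity_np(net_thread, sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

    pthread_join(net_thread, NULL);
    return 0;
}
```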
Direct Cache Access (DCA)
Normal DMA write: (1) the NIC DMA-writes the packet, (2) the write snoops and invalidates the CPU cache, (3) the data is written to memory, (4) the CPU read misses and fetches it from memory.
Direct Cache Access: (1) the NIC DMA-writes the packet, (2) the memory controller updates the cache directly, (3) the CPU read hits in the cache.
Eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
Lightweight Threading
Single hardware context, single core pipeline: software-controlled thread 1 and thread 2 share the execution pipeline under a thread manager. On a memory informing event (e.g., a cache miss), execution continues in the other thread, so the pipeline keeps computing in the shadow of the miss. Builds on helper threads; reduces CPU stalls.
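A rough software stand-in for this mechanism, using POSIX ucontext (obsolescent but widely available) in place of the proposed hardware thread manager: two software-controlled threads share one pipeline and each yields at the point where it would stall. The stall is simulated explicitly here, since real cache-miss informing events require the hardware support described above.

```c
/* Two software threads interleave on one "pipeline"; each yields where it
 * would otherwise stall on a miss. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, t1_ctx, t2_ctx;

static void thread1(void) {
    for (int i = 0; i < 3; i++) {
        printf("thread1: work %d, simulated miss -> yield\n", i);
        swapcontext(&t1_ctx, &t2_ctx);   /* hide the "miss" behind thread2 */
    }
    swapcontext(&t1_ctx, &main_ctx);
}

static void thread2(void) {
    for (int i = 0; i < 3; i++) {
        printf("thread2: work %d, simulated miss -> yield\n", i);
        swapcontext(&t2_ctx, &t1_ctx);
    }
    swapcontext(&t2_ctx, &main_ctx);
}

int main(void) {
    static char s1[64 * 1024], s2[64 * 1024];

    getcontext(&t1_ctx);
    t1_ctx.uc_stack.ss_sp = s1;
    t1_ctx.uc_stack.ss_size = sizeof(s1);
    t1_ctx.uc_link = &main_ctx;
    makecontext(&t1_ctx, thread1, 0);

    getcontext(&t2_ctx);
    t2_ctx.uc_stack.ss_sp = s2;
    t2_ctx.uc_stack.ss_size = sizeof(s2);
    t2_ctx.uc_link = &main_ctx;
    makecontext(&t2_ctx, thread2, 0);

    swapcontext(&main_ctx, &t1_ctx);     /* interleave the two software threads */
    printf("done\n");
    return 0;
}
```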
Memory Copy Engines
L. Bhuyan, "Hardware Support for Bulk Data Movement in Server Platforms", ICCD, October 2005 (also to appear in IEEE TC), with L. Zhao, et al.
Memory Overheads
–NIC descriptors
–Mbufs
–TCP/IP headers
–Payload
Copy Engines
Copying is time-consuming because:
–The CPU moves data at small granularity
–The source or destination is in memory (not cache)
–Memory accesses clog up resources
A copy engine can:
–Perform fast copies while reducing CPU resource occupancy
–Run copies in parallel with CPU computation
–Avoid cache pollution and reduce interconnect traffic
Low-overhead communication between the engine and the CPU requires:
–Hardware support to let the engine run asynchronously with the CPU
–Hardware support to share the virtual address space between the engine and the CPU
–Low-overhead signaling of completion
Design of the Copy Engine
Triggering the CE:
–Copy initiation
–Address translation
–Copy communication
Communication between the CPU and the CE
Performance Evaluation
Asynchronous Low-Cost Copy (ALCC)
Today, memory-to-memory data copies require CPU execution (figure: app processing serialized with memory copies).
Build a copy engine and tightly couple it with the CPU: low communication overhead and asynchronous execution with respect to the CPU, so the application continues computing during memory-to-memory copies (figure: app processing overlapped with memory copies).
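A minimal sketch of the asynchronous-copy idea with a worker thread standing in for the proposed hardware copy engine: the copy is initiated, the "CPU" keeps computing, and completion is then checked, mirroring the overlapped timeline in the ALCC figure. Buffer sizes and the compute loop are arbitrary.

```c
/* Overlap a bulk copy (done by a stand-in "engine" thread) with computation. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct copy_req { void *dst; const void *src; size_t len; };

static void *copy_engine(void *arg) {          /* stand-in for the hardware engine */
    struct copy_req *r = arg;
    memcpy(r->dst, r->src, r->len);
    return NULL;
}

int main(void) {
    size_t len = 1 << 20;                      /* 1 MB payload */
    char *src = malloc(len), *dst = malloc(len);
    memset(src, 0xab, len);

    struct copy_req req = { dst, src, len };
    pthread_t engine;
    pthread_create(&engine, NULL, copy_engine, &req);   /* initiate the copy */

    /* "App processing" overlapped with the copy, as in the ALCC figure. */
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 10000000UL; i++)
        acc += i;

    pthread_join(engine, NULL);                /* completion signal */
    printf("compute result %lu, copied byte 0x%02x\n", acc, (unsigned char)dst[0]);
    free(src);
    free(dst);
    return 0;
}
```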
Total I/O Acceleration
Potential Efficiencies (10x)
On-CPU, multi-gigabit, line-speed network I/O is possible (figure: benefits of affinity plus benefits of architectural techniques).
Greg Regnier, et al., "TCP Onloading for Data Center Servers," IEEE Computer, vol. 37, Nov. 2004.
I/O Acceleration: Problem Magnitude
(figure: memory copies and the effects of streaming, CRCs and crypto, and parsing and tree construction dominate storage-over-IP, networking, security, and services workloads)
I/O processing rates are significantly limited by the CPU in the face of data movement and data transformation operations.
Building Block Engines
Bulk data operations: copies / moves, scatter / gather, inter-VM communication
Data transformation: encryption, compression, XML parsing, data validation, XORs, checksums, CRCs
Investigate architectural and platform support for building block engines (BBEs) in future servers.
Questions:
–What are the characteristics of bulk data operations?
–Why are they performance bottlenecks today?
–What is the best way to improve their performance: parallelize the computation across many small CPUs, or build a BBE and tightly couple it with the CPU?
–How do we expose BBEs? At what granularity? With what reconfigurability?
(diagram: cores, small cores, and BBEs with integrated memory controllers and cache on a scalable on-die fabric)
Conclusions and Future Work
Studied architectural characteristics of key network protocols
–TCP/IP requires more instruction cache and has a large memory overhead
Proposed several techniques for optimization
–Caching techniques
–ISA optimization
–Data copy engines
Further investigation on network protocols and optimization
–Heterogeneous chip multiprocessors
–Other I/O applications: SSL, XML, etc.
–Use of network processors and FPGAs