MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS
Garba Ya'u Isa
Master's Thesis Oral Defense
Computer Engineering, King Fahd University of Petroleum & Minerals
Saturday, 7th June 2003
Outline
- Introduction
- Problem Statement
- Analysis of Memory Accesses
- Measurement-Based Performance Evaluation
- Design and Implementation of Prototype
- Contributions
- Conclusions
- Future Work
Introduction
- Processor and memory performance discrepancy
- Growing network bandwidth: data rates in terabits per second possible; gigabit-per-second LANs already deployed
- High throughput servers in network infrastructure: streaming media servers, web servers, software routers
Dealing with the Performance Gap
- Hierarchical memory architecture exploits temporal locality and spatial locality
- Constraint: characteristics of network payload data
  - Large: won't fit into cache
  - Hardly reusable: poor temporal locality
Problem Statement
Network servers should:
- Deliver high throughput
- Respond to requests with low latency
- Serve a large number of clients
Our goal:
- Identify the specific conditions under which server memory becomes a bottleneck (includes cache, main memory, and virtual memory)
Benefits:
- Better server designs that alleviate memory bottlenecks
- Optimal performance can be achieved
Constraints:
- Large amounts of data flow through the CPU and memory
- Writing code that optimizes memory utilization is a challenge
Analysis of Memory Accesses: Data Flow Analysis
Four data transfer paths:
- Memory-CPU
- Memory-memory
- Memory-I/O
- Memory-network
Latency Model and Memory Overhead
Each transaction involves:
- CPU cycles
- Data transfers: one or more of the four identified types
Transaction latency:
T_trans = T_cpu + n1 T_m-c + n2 T_m-m + n3 T_m-disk + n4 T_m-net
where:
- T_cpu: total CPU time needed for the transaction
- T_m-c: time to transfer an entire PDU from memory to CPU for processing
- T_m-m: latency of a memory-memory copy of a PDU
- T_m-disk: latency of a memory-I/O read/write of a block of data
- T_m-net: latency of a memory-network read/write of a PDU
- n_i: number of data movement operations of each type
A small sketch of this model in code follows.
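To make the model concrete, here is a minimal C sketch of the transaction latency equation. The struct, function names, and sample latencies are illustrative assumptions (see the worked examples on the following slides), not values from the thesis.

```c
/* A minimal sketch of the transaction latency model:
 * T_trans = T_cpu + n1*T_m-c + n2*T_m-m + n3*T_m-disk + n4*T_m-net
 * All names and numbers here are illustrative, not from the thesis. */
#include <stdio.h>

struct latency_model {
    double t_cpu;                        /* total CPU time (usec) */
    double t_mc, t_mm, t_mdisk, t_mnet;  /* per-transfer latencies (usec) */
    int n1, n2, n3, n4;                  /* counts of each transfer type */
};

static double transaction_latency(const struct latency_model *m)
{
    return m->t_cpu + m->n1 * m->t_mc + m->n2 * m->t_mm
                    + m->n3 * m->t_mdisk + m->n4 * m->t_mnet;
}

int main(void)
{
    /* Assumed example: a 1500-byte PDU with one memory-CPU pass,
     * one protocol-layer copy, one disk read, one network transfer. */
    struct latency_model m = { 0.0, 0.015, 0.94, 50.0, 1.4, 1, 1, 1, 1 };
    printf("T_trans = %.2f usec\n", transaction_latency(&m));
    return 0;
}
```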
Memory-CPU Transfers
- PDU processing: checksum computation and header updating
- Typically one-way data flow (memory to CPU via cache)
- Memory stall cycles: number of memory stall cycles = (IC)(AR)(MR)(MP)
- Cache miss rate:
  - Worst case: MR = 1 (not as bad!)
  - Best case: MR = 0 (trivial)
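As a worked example with assumed values: for IC = 10,000 instructions per transaction, AR = 1, MR = 0.1, and MP = 10 cycles, stall cycles = 10,000 x 1 x 0.1 x 10 = 10,000, i.e. one stall cycle per instruction, matching the average case below where (MR)(MP) = 1 makes stall time equal the ideal execution time.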
Memory-CPU Transfers cont.
Cache overhead in various cases:
- Worst case: MR = 1, MP = 10, and (MR)(MP) = 10
- Best case: MR = 0 (trivial)
- Average case: MR = 0.1, MP = 10, and (MR)(MP) = 1
Memory-CPU latency depends on internal bus bandwidth:
T_m-c = S/(32 B_i) usec
where S is the PDU size and B_i is the internal bus bandwidth in MB/s
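For example, assuming a 1500-byte PDU (S = 1500) and a 3200 MB/s internal bus (B_i = 3200), T_m-c = 1500/(32 x 3200) ≈ 0.015 usec.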
Memory-Memory Transfers
Memory-memory transfer:
- Due to memory copies of PDUs between protocol layers
- Transfers go through the caches and CPU
- Stride = 1 (contiguous)
- Transfer involves memory -> cache -> CPU -> cache -> memory data movement
Latency: depends on internal (system) bus bandwidth
T_m-m = 2S/B_i usec
Memory-I/O and Memory-Network Transfers
Memory-network transfers:
- Pass over the I/O bus
- DMA can be used
- Again, stride = 1 (contiguous)
Latency: the limiting factor is the I/O bus bandwidth
T_m-net = S/B_e usec
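Continuing the assumed 1500-byte PDU: T_m-m = 2(1500)/3200 ≈ 0.94 usec on a 3200 MB/s internal bus, and T_m-net = 1500/1066 ≈ 1.4 usec over an assumed 1066 MB/s I/O bus (B_e). Copies and network transfers thus cost far more than the memory-CPU component (≈ 0.015 usec).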
Latency of Reference Applications
(1) RTP transaction latency
(2) HTTP transaction latency
(3) IP transaction latency
Peak Throughputs
Assumptions:
- CPU usage latency is negligible compared to data transfer latency and can be ignored
- Bus contention from multiple simultaneously executing transactions does not add overhead
Server throughput = S/T
- S = size of transaction data
- T = latency of a transaction, given by equations 1, 2 and 3
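Putting the pieces together, a hedged sketch of the peak-throughput computation; the transaction size and latency are the illustrative values from the examples above, not measured results.

```c
/* Peak throughput = S/T. Illustrative values only: a 1500-byte
 * transaction whose modeled latency is about 52.4 usec. */
#include <stdio.h>

int main(void)
{
    double s_bytes = 1500.0;   /* assumed transaction data size */
    double t_usec  = 52.4;     /* assumed transaction latency */
    /* bytes per usec equals MB/s (decimal); x8 converts to Mbit/s */
    double mbit_s = (s_bytes / t_usec) * 8.0;
    printf("Peak throughput = %.0f Mbit/s\n", mbit_s);
    return 0;
}
```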
Peak Throughputs cont.

Processor                  | Internal bus BW (MB/s) | IP forwarding (Mbit/s) | HTTP (Mbit/s) | RTP streaming (Mbit/s)
Intel Pentium IV 3.06 GHz  | 3200                   | 4264                   |               | 3640
AMD Athlon XP 3000+        | 2700                   | 4264                   |               | 3291
MIPS R16000 700 MHz        | 3200                   | 4264                   |               | 3640
Sun UltraSPARC III 900 MHz | 1200                   | 4264                   |               | 1862
Measurement Based Performance Evaluation
Experimental testbed:
- Dual-boot server (Pentium IV 2.0 GHz), 256 MB RAM, 1.0 Gbps NIC
- Closed LAN (Cisco Catalyst 3550 1.0 Gbps switch)
Tools:
- Intel VTune
- Windows Performance Monitor
- Netstat
- Linux tools: vmstat, sar, iostat
Platforms and Applications
Platforms:
- Linux (kernel 2.4.7-10)
- Windows 2000
Applications:
- Streaming media servers: Darwin Streaming Server, Windows Media Server
- Web servers: Apache web server, Microsoft Internet Information Server
- Software router: Linux kernel IP forwarding
Analysis of Operating System Role
Memory throughput test:
- ECT (extended copy transfer) - memperf
Locality of reference:
- Temporal locality: varying working set size (block size)
- Spatial locality: varying access pattern (strides), as in the sketch below
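A rough illustration of the stride test in the spirit of ECT/memperf — a sketch under assumed buffer and stride sizes, not the actual tool:

```c
/* Minimal stride microbenchmark: touch a buffer at growing strides
 * and time the walk. As stride exceeds the cache-line size, spatial
 * locality is lost and time-per-access rises. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64u * 1024 * 1024)   /* assumed 64 MB working set */

int main(void)
{
    char *buf = malloc(BUF_BYTES);
    if (!buf) return 1;
    memset(buf, 1, BUF_BYTES);          /* fault pages in up front */

    for (size_t stride = 1; stride <= 1024; stride *= 2) {
        struct timespec t0, t1;
        volatile char sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < BUF_BYTES; i += stride)
            sink += buf[i];             /* one touch per stride */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("stride %4zu: %8zu touches, %.4f s\n",
               stride, (size_t)(BUF_BYTES / stride), sec);
    }
    free(buf);
    return 0;
}
```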
Analysis of Operating System Role cont.
Context switching overhead
Streaming Media Servers: Experimental Design
Factors:
- Number of streams (streaming clients)
- Media encoding rate (56 kbps and 300 kbps)
- Stream distribution (unique and multiple media)
Metrics:
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput
Benchmarking tools:
- DSS: streaming load tool
- WMS: media load simulator
Cache Performance: L1 cache misses (56 kbps)
Cache Performance cont.: L1 cache misses (300 kbps)
Memory Performance: page faults (300 kbps)
Throughput (300 kbps)
Summary: Streaming Media Server Memory Performance
- Cache performance (both L1 and L2) degrades most when the number of clients is large and the encoding rate is 300 kbps with multiple media objects.
- When clients demand unique media objects, the page fault rate is constant; when requests are for multiple objects, the page fault rate increases with the number of clients.
- Throughput increases with the number of clients; the higher encoding rate (300 kbps) also yields higher throughput.
- The Darwin Streaming Server delivers less throughput than the Windows Media Server.
Web Servers: Experimental Design
Factors:
- Number of web clients
- Document size
Metrics:
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput
- Transactions/sec (connection rate)
- Average latency
Benchmarking tool: WebStone
Transactions
L1 Cache Miss
Page Fault
Throughput
Summary: Web Server Memory Performance Evaluation
Comparing Apache and IIS for an average file size of 10 KB:

Attribute                        | Apache | IIS
Max. transaction rate (conn/sec) | 2586   | 4178 (58% more than Apache)
Max. throughput (Mbps)           | 217    | 349 (62% more than Apache)
CPU utilization (%)              | 71     | 63
L1 misses (millions)             | 424    | 200
L2 misses (millions)             | 1673   | 117
Page fault rate (pf/sec)         | < 10   | < 10
Software Router: Experimental Design
Factors:
- Routing configurations
- TCP message size (64 bytes, 10 KB, and 64 KB)
Metrics:
- Throughput
- Number of context switches
- Number of active pages
Benchmarking tool: Netperf
Software Router Throughput
CPU Utilization
Context Switching
Active Pages
Summary: Software Router Performance Evaluation
- Maximum throughput of 449 Mbps for configuration 2 (full-duplex one-to-one communication)
- Highest CPU utilization was 84%
- Highest context-switching rate was 5378/sec
- The number of active pages is fairly uniformly distributed, indicating low memory activity
Design, Implementation and Evaluation of Prototype
- DB-RTP server architecture
- Implementation: Linux platform (C)
- Our own implementation of RTSP/RTP (why?)
Double Buffering and Synchronization
- Buffer read
- Buffer write
A minimal sketch of the scheme appears below.
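A minimal sketch of the double-buffering scheme, assuming one reader thread fills a buffer from disk while a sender thread drains the other, swapping roles under a mutex and condition variable. The names, buffer size, and I/O hooks are illustrative assumptions, not the thesis code.

```c
/* Double-buffering sketch: a producer fills one buffer while the
 * consumer drains the other (a two-slot ring under a mutex/condvar).
 * All names, sizes, and I/O hooks are illustrative assumptions. */
#include <pthread.h>

#define NBUF   2
#define BUF_SZ (64 * 1024)              /* assumed buffer size */

static char buf[NBUF][BUF_SZ];
static int  full[NBUF];                 /* 1 = filled, ready to send */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

/* Hypothetical I/O hooks standing in for disk reads and RTP sends. */
static void read_block(char *dst, int len) { (void)dst; (void)len; }
static void send_block(const char *src, int len) { (void)src; (void)len; }

static void *producer(void *arg)        /* buffer write */
{
    (void)arg;
    for (int i = 0; ; i = (i + 1) % NBUF) {
        pthread_mutex_lock(&lock);
        while (full[i])                 /* wait until slot i is drained */
            pthread_cond_wait(&cv, &lock);
        pthread_mutex_unlock(&lock);

        read_block(buf[i], BUF_SZ);     /* fill outside the lock */

        pthread_mutex_lock(&lock);
        full[i] = 1;                    /* publish and wake the sender */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)        /* buffer read */
{
    (void)arg;
    for (int i = 0; ; i = (i + 1) % NBUF) {
        pthread_mutex_lock(&lock);
        while (!full[i])                /* wait until slot i is filled */
            pthread_cond_wait(&cv, &lock);
        pthread_mutex_unlock(&lock);

        send_block(buf[i], BUF_SZ);     /* drain outside the lock */

        pthread_mutex_lock(&lock);
        full[i] = 0;                    /* release and wake the reader */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Because filling and draining happen outside the lock, disk reads overlap network sends; that overlap is the latency hiding that gives the DB-RTP server its throughput and jitter advantage.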
RTP Server Throughput
Jitter
Summary: DB-RTP Server Performance Evaluation
- Throughput: DB-RTP server 63.85 Mbps; RTP server 59 Mbps
- Both servers exhibit steady jitter, but the DB-RTP server has lower jitter than the plain RTP server
Contributions
- Cache overhead analysis
- Memory latency and bandwidth analysis
- Measurement-based performance evaluation
- Design, implementation, and evaluation of a prototype streaming server: the Double Buffer RTP (DB-RTP) server
Conclusions
- High throughput is possible with server design enhancements
- Server throughput is significantly degraded by excessive cache misses and page faults
- Latency hiding with prefetching and buffering can improve throughput and jitter performance
Future Work
- Server development: hybrid = multiplexing + multithreading
- Special architectures (network processors & ASICs)
- Resource scheduling
- Investigation of the role of I/O
- Use of IRAM (intelligent RAM) architectures
- Integrated network infrastructure server
Thank you
Array restructuring
Loop nest transformation
Array padding
Testbeds
- Streaming media/web server testbed
- Software router testbed
Communication Configurations
Backup Slides
Memory Performance: page faults (56 kbps and 300 kbps)
Streaming Server: CPU Utilization
Cache Performance cont.: L2 cache misses (56 kbps)
Cache Performance cont.: L2 cache misses (300 kbps)
Web Servers: Cache Performance
- L1 cache misses
- L2 cache misses
- Transactions
Web Servers: Latency and CPU Utilization
DB-RTP Server
- L1 cache misses
- L2 cache misses
- CPU utilization
Memory Performance Evaluation Methodologies
- Analytical: requires just paper and pencil; accuracy?
- Simulation: requires programming; time and cost?
- Measurement: requires a real system or a prototype; uses on-chip counters and benchmarking tools; more accurate
Server Performance Tuning
Memory performance tuning:
- Array padding
- Array restructuring
- Loop nest transformation
Latency hiding and multithreading:
- EPIC (IA-64)
- VIRAM
- Impulse
Multiprocessing and clustering:
- Task parallelization, e.g. Panama cluster router
Special architectures:
- Network processors
- ASICs and data flow architectures
A sketch of array padding and loop interchange follows.
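As a hedged illustration of two techniques from the list above (array padding and loop nest transformation), with sizes chosen purely for illustration:

```c
/* Illustrative sketches of array padding and loop interchange;
 * matrix sizes are assumptions, not from the thesis. */
#include <stdio.h>

#define N 1024

/* Array padding: rows of a[N][N] span a power-of-two number of bytes,
 * so same-column elements map to the same cache sets and conflict;
 * a few words of pad per row spreads them across sets. */
static float a_padded[N][N + 8];

/* Loop nest transformation (interchange): with j outer and i inner
 * the inner loop would walk the matrix column-wise with stride N;
 * putting i outer and j inner makes accesses stride-1 (contiguous). */
static void scale(float b[N][N], float s)
{
    for (int i = 0; i < N; i++)         /* interchanged: i outer */
        for (int j = 0; j < N; j++)     /* j inner => stride-1 */
            b[i][j] *= s;
}

int main(void)
{
    static float b[N][N];               /* static: too big for the stack */
    a_padded[0][0] = 1.0f;              /* touch the padded array */
    scale(b, 2.0f);
    printf("%f %f\n", a_padded[0][0], b[0][0]);
    return 0;
}
```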
Temporal vs. Spatial Locality
- A PDU lacks temporal locality
- Observation: PDU processing exhibits excellent spatial locality
- Suppose a data cache line is 32 bytes (or 16 words) long; with sequential accesses at stride = 1, accessing one word brings in the other 15 words as well
- Thus the effective MR = 1/16 = 6.2%, better than even scientific apps
- Generally, MR = W/L, where W = width of each memory access (in bytes) and L = length of each cache line (in bytes)
- Validation of the above observation: similar spatial locality characteristics reported via measurements in S. Sohoni et al., "A Study of Memory System Performance of Multimedia Applications," Proc. ACM SIGMETRICS 2001; the MR for a streaming media player is better than for SPEC benchmark apps!
Memory-CPU Transfers
- PDU processing: checksum computation and header updating
- Typically one-way data flow (memory to CPU via cache)
- Memory stall cycles: number of memory stall cycles = (IC)(AR)(MR)(MP)
  - IC: instruction count per transaction
  - AR: number of memory accesses per instruction (AR = 1)
  - MR: ratio of cache misses to memory accesses
  - MP: miss penalty in clock cycles
- Cache miss rate, worst case: MR = 1 while typically MP = 10, giving stall cycles = 10 x IC
Memory-CPU Transfers cont.
Determining cache overhead with respect to execution time:
- (Execution time)_no-cache-misses = (IC)(CPI)(CC)
- (Execution time)_with-cache-misses = (IC)(CPI)(CC){1 + (MR)(MP)}
- Cache overhead = 1 + (MR)(MP)
Cache overhead in various cases:
- Worst case: MR = 1 and MP = 10: cache misses make each transaction 11 times slower!
- Best case: MR = 0 (trivial)
- Average case: MR = 0.1 and MP = 10, so (MR)(MP) = 1: latency due to stalls equals the ideal execution time without stalls
Memory-CPU latency depends on internal bus bandwidth:
T_m-c = S/(32 B_i) usec, where S is the PDU size and B_i is the internal bus bandwidth in MB/s
Open Questions
- Role of special-purpose architectures in the performance of high throughput servers (e.g. network processors)
- Role of memory compression
- Role of scheduling