1 MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS
Garba Ya’u Isa
Master’s Thesis Oral Defense
Computer Engineering
King Fahd University of Petroleum & Minerals
Saturday, 7th June 2003

2 Outline
• Introduction
• Problem Statement
• Analysis of Memory Accesses
• Measurement Based Performance Evaluation
• Design and Implementation of Prototype
• Contributions
• Conclusions
• Future Work

3 Introduction
• Processor and memory performance discrepancy
• Growing network bandwidth
  • Data rates in terabits per second possible
  • Gigabit per second LANs already deployed
• High throughput servers in network infrastructure
  • Streaming media servers
  • Web servers
  • Software routers

4 Dealing with Performance Gap
• Hierarchical memory architecture
  • Temporal locality
  • Spatial locality
• Constraints
  • Characteristics of network payload data:
    • Large → won’t fit into cache
    • Hardly reusable → poor temporal locality

5 Problem Statement
• Network servers should:
  • Deliver high throughput
  • Respond to requests with low latency
  • Respond to a large number of clients
• Our goal
  • Identify the specific conditions under which server memory becomes a bottleneck
  • Includes: cache, main memory, and virtual memory
• Benefits
  • Better server design that alleviates memory bottlenecks
  • Optimal performance can be achieved
• Constraints
  • Large amount of data flowing through CPU and memory
  • Writing code to optimize memory utilization is a challenge

6 Analysis of Memory Accesses: Data Flow Analysis
Four data transfer paths:
• Memory-CPU
• Memory-memory
• Memory-I/O
• Memory-network

7 Latency Model and Memory Overhead
• Each transaction involves:
  • CPU cycles
  • Data transfers: one or more of the four identified types
• Transaction latency:
  $T_{\text{trans}} = T_{\text{cpu}} + n_1 T_{\text{m-c}} + n_2 T_{\text{m-m}} + n_3 T_{\text{m-disk}} + n_4 T_{\text{m-net}}$
  • $T_{\text{cpu}}$: total CPU time needed for the transaction
  • $T_{\text{m-c}}$: time to transfer an entire PDU from memory to CPU for processing
  • $T_{\text{m-m}}$: latency of a memory-memory copy of a PDU
  • $T_{\text{m-disk}}$: latency of a memory-I/O read/write of a block of data
  • $T_{\text{m-net}}$: latency of a memory-network read/write of a PDU
  • $n_i$: number of data movement operations of each type
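The latency model on this slide can be evaluated numerically. The following Python sketch is illustrative only: the function name `transaction_latency` and the sample values are assumptions made for the example, not figures from the thesis.

```python
def transaction_latency(t_cpu, t_mc, t_mm, t_mdisk, t_mnet, n=(1, 1, 1, 1)):
    """Return T_trans = T_cpu + n1*T_m-c + n2*T_m-m + n3*T_m-disk + n4*T_m-net.

    All latencies must use the same time unit (for example microseconds).
    n holds the operation counts (n1, n2, n3, n4), one per transfer type.
    """
    n1, n2, n3, n4 = n
    return t_cpu + n1 * t_mc + n2 * t_mm + n3 * t_mdisk + n4 * t_mnet


# Hypothetical transaction: one memory-CPU pass, two memory-memory copies,
# no disk access, and one memory-network transfer (values in microseconds).
example = transaction_latency(
    t_cpu=2.0, t_mc=1.0, t_mm=3.0, t_mdisk=0.0, t_mnet=4.0, n=(1, 2, 0, 1)
)
print(example)  # 13.0
```

Keeping the counts in a tuple makes it easy to compare applications that differ only in how many copies of each type they perform.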

8 Memory-CPU Transfers
• PDU processing
  • Checksum computation and header updating
  • Typically, one-way data flow (memory to CPU via cache)
• Memory stall cycles
  • Number of memory stall cycles = (IC)(AR)(MR)(MP)
• Cache miss rate
  • Worst case: MR = 1 (not as bad!)
  • Best case: MR = 0 (trivial)
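As a rough illustration of the stall-cycle formula, the sketch below multiplies the four factors. The function name and the instruction count of 1000 are assumptions chosen for the example; MP = 10 is the typical miss penalty used elsewhere in these slides.

```python
def memory_stall_cycles(ic, ar, mr, mp):
    """Number of memory stall cycles = (IC)(AR)(MR)(MP).

    ic: instruction count per transaction
    ar: memory accesses per instruction
    mr: cache miss rate, between 0 and 1
    mp: miss penalty in clock cycles
    """
    if not 0 <= mr <= 1:
        raise ValueError("miss rate must lie between 0 and 1")
    return ic * ar * mr * mp


# Hypothetical transaction of 1000 instructions with AR = 1 and MP = 10.
worst = memory_stall_cycles(ic=1000, ar=1, mr=1, mp=10)  # every access misses
best = memory_stall_cycles(ic=1000, ar=1, mr=0, mp=10)   # every access hits
print(worst, best)  # 10000 0
```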

9 Memory-CPU Transfers cont.
• Cache overhead in various cases:
  • Worst case: MR = 1, MP = 10 and (MR)(MP) = 10
  • Best case: MR = 0 → trivial
  • Average case: MR = 0.1, MP = 10 and (MR)(MP) = 1
• Memory-CPU latency dependent on internal bus bandwidth
  • $T_{\text{m-c}} = S / (32 B_i)$ μs, where $S$ is the PDU size and $B_i$ is the internal bus bandwidth in MB/s

10 Memory-Memory Transfers
• Memory-memory transfer:
  • Due to memory copy of PDU between protocol layers
  • Transfers through caches and CPU
  • Stride = 1 (contiguous)
  • Transfer involves memory → cache → CPU → cache → memory data movement
• Latency:
  • Dependent on internal (system) bus bandwidth
  • $T_{\text{m-m}} = 2S / B_i$ μs

11 Memory-I/O and Memory-Network Transfers
• Memory-network transfers:
  • Pass over the I/O bus
  • DMA can be used
  • Again, stride = 1 (contiguous)
• Latency:
  • Limiting factor is the I/O bus bandwidth
  • $T_{\text{m-net}} = S / B_e$ μs
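The three transfer latencies from the preceding slides can be computed side by side. This is a minimal sketch under stated assumptions: S is taken to be in bytes (so that bytes divided by MB/s gives microseconds), the function names are made up for the example, and the 1500-byte PDU and the 533 MB/s I/O bus figure are hypothetical inputs rather than values from the thesis.

```python
def t_mem_cpu(s_bytes, b_internal):
    """T_m-c = S / (32 * B_i), in microseconds (S in bytes, B_i in MB/s)."""
    return s_bytes / (32 * b_internal)


def t_mem_mem(s_bytes, b_internal):
    """T_m-m = 2S / B_i: the PDU crosses the internal bus twice (read, then write)."""
    return 2 * s_bytes / b_internal


def t_mem_net(s_bytes, b_external):
    """T_m-net = S / B_e: one pass over the I/O bus."""
    return s_bytes / b_external


# Hypothetical inputs: 1500-byte PDU, 3200 MB/s internal bus, 533 MB/s I/O bus.
S, B_I, B_E = 1500, 3200, 533
print(f"T_m-c   = {t_mem_cpu(S, B_I):.4f} us")
print(f"T_m-m   = {t_mem_mem(S, B_I):.4f} us")
print(f"T_m-net = {t_mem_net(S, B_E):.4f} us")
```

With these inputs the I/O bus transfer is the slowest of the three, which matches the slide's remark that the I/O bus bandwidth is the limiting factor for memory-network transfers.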

12 Latency of Reference Applications
• RTP transaction latency (equation 1)
• HTTP transaction latency (equation 2)
• IP transaction latency (equation 3)

13 Peak Throughputs
• Assumptions
  • CPU latency is negligible compared to data transfer latency and can be ignored
  • Bus contention from multiple simultaneously executing transactions does not result in any additional overhead
• Server throughput = S/T
  • S = size of transaction data
  • T = latency of a transaction, given by equations 1, 2 and 3
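The throughput relation S/T can be turned into a small helper. The sketch below assumes S in bytes and T in microseconds, so that S/T is in MB/s and a factor of 8 converts it to Mbits/sec. The function name and the sample bus bandwidth of 500 MB/s are hypothetical.

```python
def peak_throughput_mbps(s_bytes, t_usec):
    """Server throughput = S / T, converted from bytes/us (MB/s) to Mbits/sec."""
    if t_usec <= 0:
        raise ValueError("transaction latency must be positive")
    return 8 * s_bytes / t_usec


# Hypothetical transaction whose only cost is one memory-network transfer,
# T = S / B_e, over an I/O bus with B_e = 500 MB/s.
S = 1500                 # bytes
B_E = 500                # MB/s
T = S / B_E              # 3.0 microseconds
print(peak_throughput_mbps(S, T))  # 4000.0
```

In this single-transfer case the PDU size cancels out and the peak throughput is simply 8 times the bus bandwidth, which is why a bus-limited application reaches the same peak regardless of PDU size under the stated assumptions.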

14 Peak Throughputs cont.
Throughput of three network applications: IP forwarding, HTTP, and RTP streaming (Mbits/sec)
Processor                  | Internal bus bandwidth (MB/sec) | Throughput (Mbits/sec)
Intel Pentium IV 3.06 GHz  | 3200                            | 4264, 3640
AMD Athlon XP 3000+        | 2700                            | 4264, 3291
MIPS R16000 700 MHz        | 3200                            | 4264, 3640
Sun UltraSPARC III 900 MHz | 1200                            | 4264, 1862

15 Measurement Based Performance Evaluation
• Experimental testbed
  • Dual boot server (Pentium IV 2.0 GHz)
  • 256 MB RAM
  • 1.0 Gbps NIC
  • Closed LAN (Cisco Catalyst 3550 1.0 Gbps switch)
• Tools
  • Intel VTune
  • Windows Performance Monitor
  • Netstat
  • Linux tools: vmstat, sar, iostat

16 Platforms and Applications
• Platforms
  • Linux (kernel 2.4.7-10)
  • Windows 2000
• Applications
  • Streaming media servers
    • Darwin Streaming Server
    • Windows Media Server
  • Web servers
    • Apache web server
    • Microsoft Internet Information Server
  • Software router
    • Linux kernel IP forwarding

17 Analysis of Operating System Role
• Memory throughput test
  • ECT (extended copy transfer) – memperf
  • Locality of reference:
    • Temporal locality – varying working set size (block size)
    • Spatial locality – varying access pattern (strides)
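The two knobs named on this slide, working set size and stride, can be illustrated with a toy strided copy. This is not memperf and does not reproduce its method: it is a hypothetical Python sketch whose timings are dominated by interpreter overhead, so it shows only how the access pattern is generated. The function name and the sample sizes are assumptions.

```python
import time


def strided_copy_rate(block_size, stride):
    """Copy block_size bytes in stride order; return (copy_ok, bytes_per_second).

    block_size models the working set size (temporal locality);
    stride models the access pattern (spatial locality).
    """
    if block_size < 1 or stride < 1:
        raise ValueError("block_size and stride must be positive")
    src = bytes(i % 256 for i in range(block_size))
    dst = bytearray(block_size)
    start = time.perf_counter()
    # Visit every index exactly once: offsets 0..stride-1, each stepped by stride.
    for offset in range(min(stride, block_size)):
        for i in range(offset, block_size, stride):
            dst[i] = src[i]
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return bytes(dst) == src, block_size / elapsed


if __name__ == "__main__":
    for block_size in (4 * 1024, 256 * 1024):   # hypothetical working set sizes
        for stride in (1, 16, 64):              # hypothetical strides
            ok, rate = strided_copy_rate(block_size, stride)
            print(block_size, stride, ok, f"{rate / 1e6:.1f} MB/s")
```

A compiled benchmark would be needed to observe real cache effects; the point here is only that stride 1 touches consecutive addresses while larger strides jump across cache lines.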

18 Analysis of Operating System Role cont.
• Context switching overhead

19 Streaming Media Servers Experimental Design
• Factors
  • Number of streams (streaming clients)
  • Media encoding rate (56 kbps and 300 kbps)
  • Stream distribution (unique and multiple media)
• Metrics
  • Cache misses (L1 and L2 cache)
  • Page fault rate
  • Throughput
• Benchmarking tools
  • DSS – streaming load tool
  • WMS – media load simulator

20 Cache Performance
• L1 cache misses (56 kbps)

21 Cache Performance cont.
• L1 cache misses (300 kbps)

22 Memory Performance
• Page fault (300 kbps)

23 Throughput
• Throughput (300 kbps)

24 Summary: Streaming Media Server Memory Performance
• Cache performance (both L1 and L2) degrades most when the number of clients is large and the encoding rate is 300 kbps with multiple multimedia objects.
• When clients demand unique media objects, the page fault rate is constant. However, if the requests are for multiple objects, the page fault rate increases with the number of clients.
• Throughput increases with the number of clients. The higher encoding rate (300 kbps) also yields higher throughput. Darwin Streaming Server has lower throughput than Windows Media Server.

25 Web Servers Experimental Design
• Factors
  • Number of web clients
  • Document size
• Metrics
  • Cache misses (L1 and L2 cache)
  • Page fault rate
  • Throughput
  • Transactions/sec (connection rate)
  • Average latency
• Benchmarking tool
  • WebStone

26 Transactions

27 L1 Cache Miss

28 Page Fault

29 Throughput

30 Summary: Web Server Memory Performance Evaluation
Comparing Apache and IIS for an average file size of 10K
Attribute                        | Apache | IIS
Max. transaction rate (conn/sec) | 2586   | 4178 (58% more than Apache)
Max. throughput (Mbps)           | 217    | 349 (62% more than Apache)
CPU utilization (%)              | 71     | 63
L1 misses (millions)             | 424    | 200
L2 misses (millions)             | 1673   | 117
Page fault rate (pfs/sec)        | < 10   | < 10

31 Software Router
• Experimental design
  • Factors
    • Routing configurations
    • TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes)
  • Metrics
    • Throughput
    • Number of context switches
    • Number of active pages
  • Benchmarking tool
    • Netperf

32 Software Router Throughput

33 CPU Utilization

34 Context Switching

35 Active Page

36 Summary: Software Router Performance Evaluation
• Maximum throughput of 449 Mbps for configuration number 2 (full duplex one-to-one communication).
• Highest CPU utilization was 84%.
• Highest context switching rate was 5378/sec.
• The number of active pages is fairly uniformly distributed, which indicates low memory activity.

37 Design, Implementation and Evaluation of Prototype
DB-RTP Server Architecture
• Implementation
  • Linux platform (C)
  • Our implementation of RTSP/RTP (why?)

38 Double Buffering and Synchronization
• Buffer read
• Buffer write
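The slide gives only the labels of the double buffering scheme, so the following is a generic sketch of the technique, not the DB-RTP design (which the slides say was written in C on Linux). It assumes two buffers that alternate between a reader thread, which fills them, and a sender thread, which drains them. The function name and the use of Python's `queue.Queue` for synchronization are choices made for this example.

```python
import queue
import threading


def double_buffered_transfer(chunks):
    """Move chunks from a reader thread to a sender thread through two buffers."""
    free = queue.Queue()     # buffers that may be written
    filled = queue.Queue()   # buffers that are ready to be read
    for _ in range(2):       # exactly two buffers: one fills while the other drains
        free.put(bytearray())
    sent = []

    def reader():
        for chunk in chunks:
            buf = free.get()          # blocks while both buffers are still in use
            buf[:] = chunk            # buffer write: fill (for example from disk)
            filled.put(buf)
        filled.put(None)              # end-of-stream marker

    def sender():
        while True:
            buf = filled.get()        # blocks until a buffer has been filled
            if buf is None:
                break
            sent.append(bytes(buf))   # buffer read: drain (for example to the network)
            free.put(buf)             # hand the buffer back for refilling

    threads = [threading.Thread(target=reader), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent


print(double_buffered_transfer([b"a", b"bc", b"def"]))  # [b'a', b'bc', b'def']
```

The two queues provide the synchronization: the reader can never overwrite a buffer the sender is still draining, and the sender never reads a buffer that has not been completely filled.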

39 RTP Server Throughput

40 Jitter

41 Summary: DB-RTP Server Performance Evaluation
• Throughput
  • DB-RTP server: 63.85 Mbps
  • RTP server: 59 Mbps
• Both servers exhibit steady jitter, but the DB-RTP server has lower jitter than the RTP server.

42 Contributions
• Cache overhead analysis
• Memory latency and bandwidth analysis
• Measurement-based performance evaluation
• Design, implementation, and evaluation of a prototype streaming server: the Double Buffer RTP (DB-RTP) server

43 Conclusions
• High throughput is possible with server design enhancements.
• Server throughput is significantly degraded by excessive cache misses and page faults.
• Latency hiding with prefetching and buffering can improve throughput and jitter performance.

44 Future Work
• Server development
  • Hybrid = multiplexing + multithreading
• Special architectures (network processors and ASICs)
• Resource scheduling
• Investigation of the role of I/O
• Use of IRAM (intelligent RAM) architectures
• Integrated network infrastructure server

57 Memory Performance Evaluation Methodologies
• Analytical
  • Requires just paper and pencil
  • Accuracy?
• Simulation
  • Requires programming
  • Time and cost?
• Measurement
  • Real system or a prototype required
  • Using on-chip counters
  • Benchmarking tools
  • More accurate

58 Server Performance Tuning
• Memory performance tuning
  • Array padding
  • Array restructuring
  • Loop nest transformation
• Latency hiding and multithreading
  • EPIC (IA-64)
  • VIRAM
  • Impulse
• Multiprocessing and clustering
  • Task parallelization
  • E.g. Panama cluster router
• Special architectures
  • Network processors
  • ASICs and data flow architectures

59 Temporal vs. spatial locality
• A PDU lacks temporal locality
• Observation: PDU processing exhibits excellent spatial locality
  • Suppose a data cache line is 32 bytes (or 16 words) long
  • Sequential accesses with stride = 1
  • Accessing one word brings in the other 15 words as well
  • Thus, effective MR = 1/16 = 6.25% → better than even scientific apps
  • Thus, generally MR = W/L
    • W: width of each memory access (in bytes)
    • L: length of each cache line (in bytes)
• Validation of above observation:
  • Similar spatial locality characteristics reported via measurements: S. Sohoni et al., “A Study of Memory System Performance of Multimedia Applications,” in Proc. of ACM SIGMETRICS 2001
  • MR for streaming media player better than SPEC benchmark apps!
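The relation MR = W/L can be checked with the slide's own numbers. In the sketch below the function name is an assumption, and W = 2 bytes is inferred from the slide's statement that a 32-byte line holds 16 words.

```python
def sequential_miss_rate(access_width_bytes, line_length_bytes):
    """MR = W / L for stride-1 accesses: one miss per cache line, then hits."""
    if not 0 < access_width_bytes <= line_length_bytes:
        raise ValueError("expected 0 < W <= L")
    return access_width_bytes / line_length_bytes


# 32-byte cache line holding 16 words implies W = 2 bytes per access.
mr = sequential_miss_rate(2, 32)
print(f"MR = {mr:.4f} ({mr:.2%})")  # MR = 0.0625 (6.25%)
```

Wider accesses on the same line length raise the miss rate per access (for example W = 4 gives 1/8), although the number of misses per PDU stays the same.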

60 Memory-CPU Transfers
• PDU processing
  • Checksum computation and header updating
  • Typically, one-way data flow (memory to CPU via cache)
• Memory stall cycles
  • Number of memory stall cycles = (IC)(AR)(MR)(MP)
    • IC: instruction count per transaction
    • AR: number of memory accesses per instruction (AR = 1)
    • MR: ratio of cache misses to memory accesses
    • MP: miss penalty in clock cycles
• Cache miss rate
  • Worst case: MR = 1 while typically MP = 10
  • Stall cycles = 10 × IC

61 Memory-CPU Transfers cont.
• Determine cache overhead with respect to execution time:
  • $(\text{Execution time})_{\text{no-cache}} = (IC)(CPI)(CC)$
  • $(\text{Execution time})_{\text{with-cache}} = (IC)(CPI)(CC)\,\{1 + (MR)(MP)\}$
  • Cache overhead = 1 + (MR)(MP)
• Cache overhead in various cases:
  • Worst case: MR = 1 and MP = 10
    • Cache results in 11 times higher latency for each transaction!
  • Best case: MR = 0 → trivial
  • Average case: MR = 0.1 and MP = 10 and (MR)(MP) = 1
    • Latency due to stalls = ideal execution time without stalls
• Memory-CPU latency dependent on internal bus bandwidth
  • $T_{\text{m-c}} = S / (32 B_i)$ μs, where $S$ is the PDU size and $B_i$ is the internal bus bandwidth in MB/s
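The overhead factor 1 + (MR)(MP) and the three cases on this slide can be reproduced with a few lines. This is a sketch that follows the slide's formula as written; the function names and the sample instruction count and clock period in the usage line are assumptions.

```python
def cache_overhead(mr, mp):
    """Cache overhead factor = 1 + (MR)(MP), as given on the slide."""
    return 1 + mr * mp


def execution_time_with_cache(ic, cpi, cc, mr, mp):
    """(IC)(CPI)(CC) * {1 + (MR)(MP)}; cc is the clock cycle time in seconds."""
    return ic * cpi * cc * cache_overhead(mr, mp)


print(cache_overhead(1, 10))   # worst case: 11
print(cache_overhead(0, 10))   # best case: 1
print(cache_overhead(0.1, 10)) # average case: stalls roughly double the time

# Hypothetical transaction: 1000 instructions, CPI = 1, 1 ns clock, worst case.
t_worst = execution_time_with_cache(ic=1000, cpi=1, cc=1e-9, mr=1, mp=10)
```

In the average case the factor is about 2, which is the slide's point that the latency due to stalls equals the ideal execution time without stalls.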

62 Open Questions
• Role of special-purpose architectures in the performance of high throughput servers (e.g. network processors)
• Role of memory compression
• Role of scheduling

