File System Benchmarking
Advanced Research Computing
Outline
- IO benchmarks
  - What is benchmarked
  - Micro-benchmarks
  - Synthetic benchmarks
- Benchmark results for:
  - Shelter NFS server, client on hokiespeed
  - NetApp FAS 3240 server, clients on hokiespeed and blueridge
  - EMC Isilon X400 server, client on blueridge
IO BENCHMARKING
IO Benchmarks
- Micro-benchmarks: measure one basic operation in isolation
  - Read and write throughput: dd, IOzone, IOR
  - Metadata operations (file create, stat, remove): mdtest
  - Good for: tuning an operation, system acceptance
- Synthetic benchmarks: a mix of operations that models real applications
  - Useful if they are good models of real applications
  - Examples: kernel build, kernel tar and untar, NAS BT-IO
IO Benchmark pitfalls
- Not measuring what you want to measure: results masked by various caching and buffering mechanisms
- Examples of different behaviors:
  - Sequential bandwidth vs. random IO bandwidth
  - Direct IO bandwidth vs. bandwidth in the presence of the page cache (in the latter case an fsync is needed); see the dd sketch below
  - Caching of file attributes: stat-ing a file on the same node on which the file was written
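As an illustration, a minimal dd sketch of the direct-IO vs. page-cache distinction; paths and sizes are placeholders, not the exact parameters of these runs:

    # Buffered write: without conv=fsync, dd reports the speed of writing
    # into the client page cache, not of data reaching the server.
    dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 conv=fsync

    # Direct IO bypasses the client page cache entirely.
    dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 oflag=direct

    # Before a read test, drop the client page cache so the data really
    # travels over the network (requires root).
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/nfs/testfile of=/dev/null bs=1M iflag=direct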
What is benchmarked
What we measure is the combined effect of:
- the native file system on the NFS server (shelter)
- NFS server performance, which depends on factors such as enabling/disabling write delay and the number of server threads
  - Too few threads: the client retries several times
  - Too many threads: server thrashing
- the network between the compute cluster and the NFS server
- NFS client and mount options (sketched below)
  - synchronous or asynchronous
  - enable/disable attribute caching
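For illustration, a hedged sketch of an NFS mount line exercising these client options; the server name and export path are placeholders:

    # async : writes may be acknowledged before reaching stable storage
    # noac  : disable client attribute caching (actimeo=0 is similar)
    mount -t nfs -o async,noac shelter:/export /mnt/shelter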
Micro-benchmarks
- IOzone – measures read/write bandwidth
  - Historical benchmark with the ability to test multiple readers/writers
- dd – measures read/write bandwidth
  - Tests file write/read
- mdtest – measures metadata operations per second
  - file/directory create/stat/remove
Example invocations are sketched below.
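Illustrative invocations; file sizes, thread counts, and paths are assumptions, not the exact parameters of these runs:

    # IOzone throughput mode: 8 concurrent writers/readers,
    # 1 GB file per thread, 1 MB records (-i 0 write, -i 1 read).
    iozone -i 0 -i 1 -t 8 -s 1g -r 1m -F /mnt/shelter/f{1..8}

    # dd single-stream write bandwidth.
    dd if=/dev/zero of=/mnt/shelter/ddfile bs=1M count=1024 oflag=direct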
mdtest – metadata test
- Measures the rate of file/directory create, stat, and remove operations
- mdtest creates a tree of files and directories
- Parameters used (invocation sketched below):
  - tree depth: z = 1
  - branching factor: b = 3
  - number of files/directories per tree node: I = 256
  - stat run by a different node than the create node: N = 1
  - number of repeats of the run: i = 5
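The parameters above correspond to an invocation like the following; the target directory and process count are placeholders:

    # mdtest is an MPI program; N=1 shifts the stat phase to another rank.
    mpirun -np 2 mdtest -z 1 -b 3 -I 256 -N 1 -i 5 -d /mnt/shelter/mdtest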
Synthetic benchmarks
- tar-untar-rm – measures time (sketched below)
  - Tests creation/deletion of a large number of small files
  - Tests filesystem metadata creation/deletion
- NAS BT-IO – measures bandwidth and time doing IO
  - Solves a block-tridiagonal linear system arising from the discretization of the Navier-Stokes equations
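A minimal sketch of the tar-untar-rm sequence; the tarball and tree names are placeholders:

    time tar -cf linux.tar linux-src/   # create: reads many small files
    time tar -xf linux.tar              # extract: creates many small files
    time rm -rf linux-src/              # remove the directory tree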
Kernel source tar-untar-rm
- Run on 1 to 32 nodes
- Tarball size: 890 MB
- Total directories: 4732; max directory depth: 10
- Total files: 75984; max file size: 919 kB
- File size distribution (reproducible with the commands below):
  - <= 1 kB: 14490
  - <= 10 kB: 40190
  - <= 100 kB: 20518
  - <= 1 MB: 786
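For reference, counts like those above can be gathered with standard tools; the tree name is a placeholder:

    find linux-src -type d | wc -l              # total directories
    find linux-src -type f | wc -l              # total files
    find linux-src -type f -size -10k | wc -l   # files under 10 kB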
NAS BT-I/O
Test mechanism:
- BT is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional compressible Navier-Stokes equations. The finite-differences solution is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension.
- BT-I/O is a test of different parallel I/O techniques in BT (a sample run is sketched below)
- Reference: http://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf
What it measures:
- Multiple cores doing I/O to a single large file (blocking MPI calls mpi_file_write_at_all and mpi_file_read_at_all)
- I/O timing percentage, total data written, I/O data rate
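A hedged sketch of building and running the full-MPI-IO subtype of BT, following NPB conventions; class, process count, and paths vary by run:

    # In the NPB MPI source tree:
    make bt CLASS=C NPROCS=4 SUBTYPE=full
    mpirun -np 4 bin/bt.C.4.mpi_io_full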
SHELTER NFS RESULTS
dd throughput (MB/sec)
- Run on 1 to 32 nodes
- Two block sizes – 1MB and 4MB
- Three file sizes per block size (1GB, 5GB, 15GB at 1MB blocks; 4GB, 20GB, 60GB at 4MB blocks)

Block size   File size   Average   Median   Stdev
1M           1G          8.01      6.10     4.58
1M           5G          7.75      5.95     4.52
1M           15G         5.74      5.60     0.34
4M           4G          11.17     11.80    2.87
4M           20G         15.71     12.70    10.68
4M           60G         14.60     10.50    9.22
dd throughput (MB/sec)
IOzone write throughput
IOzone write vs. read (single thread)
Mdtest file/directory create rate
Mdtest file/directory remove rate
Mdtest file/directory stat rate
Tar-untar-rm time (sec)

tar:
           Real      User   Sys
Average    781.27    1.35   10.41
Median     1341.72   1.66   13.08
Stdev      644.16    0.44   3.39

untar:
           Real      User   Sys
Average    1214.82   1.51   18.02
Median     1200.13          17.90
Stdev      99.03     0.06   0.62

rm:
           Real      User   Sys
Average    227.48    0.22   3.91
Median     216.28           3.87
Stdev      64.21     0.02   0.16
BT-IO Results

Attribute                                          Class C            Class D
Problem size                                       162 x 162 x 162    408 x 408 x 408
Iterations                                         200                250
Number of processes                                4                  361
I/O timing percentage                              13.44              91.66
Total data written in a single file (MB)           6802.44            135834.62
I/O data rate (MB/sec)                             94.99              73.45
Data written or read per I/O instance
per processor, single file (MB/core)               42.5               7.5
NETAPP FAS 3240 RESULTS
Server and Clients
- NAS server: NetApp FAS 3240
- Clients running on two clusters: Hokiespeed and Blueridge
- Hokiespeed: Linux kernel compile, tar-untar, and rm tests were run both with nodes spread uniformly over racks and with consecutive nodes (rack-packed)
- Blueridge: Linux kernel compile, tar-untar, and rm tests were run on consecutive nodes
IOzone read and write throughput (KB/s) – Hokiespeed
dd bandwidth (MB/sec)
- Two node placement policies: packed on a rack, and spread across racks
- Direct IO was used (invocations sketched below)
- Two operations: read and write
- Two block sizes – 1MB and 4MB
- Three file sizes – 1GB, 5GB, 15GB
- Results show throughput in MB/s
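The invocations behind these numbers would look roughly like this; the mount point and file names are placeholders:

    dd if=/dev/zero of=/mnt/netapp/f bs=1M count=1024 oflag=direct   # write, 1 MB blocks, 1 GB file
    dd if=/mnt/netapp/f of=/dev/null bs=4M iflag=direct              # read back, 4 MB blocks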
dd read throughput (MB/sec), 1MB blocks – charts: Hokiespeed (nodes spread, nodes packed), BlueRidge (nodes packed)
dd read throughput (MB/sec), 4MB blocks – charts: Hokiespeed (nodes spread, nodes packed), BlueRidge (nodes packed)
dd write throughput (MB/sec), 1MB blocks – charts: Hokiespeed (nodes spread, nodes packed), BlueRidge (nodes packed)
dd write throughput (MB/sec), 4MB blocks – charts: Hokiespeed (nodes spread, nodes packed), BlueRidge (nodes packed)
Linux Kernel tests
- Two node placement policies: packed on a rack, and spread across racks
- Operations:
  - Compile: make -j 12 (sketched below)
  - Tar creation and extraction
  - Remove directory tree
- Results show execution time in seconds
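The compile step as named above; the kernel tree path is a placeholder:

    cd linux-src && time make -j 12   # 12-way parallel build on each node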
Linux Kernel compile time (sec)

Hokiespeed, nodes spread:
nodes  real   user  sys
1      817    4968  1096
2      990    5014  1138
4      993    5223  1171
8      939    5143  1167
16     1318   5112  1198
32     2561   5087  1183
64     4985   5111  1209

Hokiespeed, nodes packed:
nodes  real   user  sys
1      733    5001  1116
2      1546   5086  1233
4      3189   5146  1273
8      6343   5219  1317
16     9476   5251  1366
32     10012  5255  1339

BlueRidge, nodes packed:
nodes  real   user  sys
1      694    4589  951
2      1092   4572  993
4      2212   4631  1038
8      4451   4691  1073
16     5636   4716  1098
32     5999   4702  1111
64     6609   4699  1089
Tar extraction time (sec)

Hokiespeed, nodes spread:
nodes  real  user  sys
1      143   1.05  9.5
2      125   0.98  9.4
4      144   1.04  9.8
8      149
16     216   1.08  10.4
32     399   1.23  12.5
64     809   1.42  15.0

Hokiespeed, nodes packed:
nodes  real  user  sys
1      167   1.0   9.5
2      172   0.98
4      177   1.06  9.6
8      202   1.03  9.7
16     312   1.09  10.2
32     421   1.18  11.9

BlueRidge, nodes packed:
nodes  real  user  sys
1      98    0.6   6.6
2      103
4      106         6.5
8      130   0.7   7.1
16     217   0.8   9.1
32     406   1.2   13
64     818   1.1   14
Rm execution time (sec)

Hokiespeed, nodes spread:
nodes  real  user  sys
1      20    0.12  2.5
2      21    0.15  2.7
4      25    0.16  2.8
8      33    0.17
16     123   0.22  3.7
32     284   0.24  4.0
64     650   0.27  4.4

Hokiespeed, nodes packed:
nodes  real  user  sys
1      21    0.14  2.84
2      22          2.82
4            0.15  2.80
8      47    0.18  3.30
16     135   0.21  3.85
32     248   0.23  4.01
64     811   0.27  4.54

BlueRidge, nodes packed:
nodes  real    user  sys
1      19.21   0.07  1.69
2      19.14   0.10
4      26.68   0.11  1.98
8      63.75   0.16  3.16
16     152.59  0.22  4.24
32     324.90  0.26  4.98
64     699.04  0.25  5.06
Uplink switch traffic, runs on hokiespeed – charts: nodes spread, nodes packed
Mdtest file/directory create rate – IO ops/sec for mdtest -z 1 -b 3 -I 256 -i 10 -N 1 (Hokiespeed, BlueRidge)
Mdtest file/directory remove rate – IO ops/sec for mdtest -z 1 -b 3 -I 256 -i 10 -N 1 (Hokiespeed, BlueRidge)
Mdtest file/directory stat rate – IO ops/sec for mdtest -z 1 -b 3 -I 256 -i 10 -N 1 (Hokiespeed, BlueRidge)
NAS BT IO results (Class D)
- Iterations: 250 (I/O after every 5 steps)
- Number of jobs: 50
- Total data size (written/read): 6.5 TB (50 files of 135 GB each)

System                                  HokieSpeed          BlueRidge
Nodes per job                           3                   4
Total number of cores                   1800                3200
Average I/O timing (hours)              5.175    5.85       5.3      5.5
Average I/O timing (% of total time)    92.6     93.4       92.7     96.6
Average Mop/s/process                   80.6     72         79.6     44.5
Average I/O rate per node (MB/s)        2.44     2.15       2.34     1.71
Total I/O rate (MB/s)                   357.64   323.02     359.8    343.42
Uplink switch traffic for BT-IO on hokiespeed – chart; boxes 1, 2, 3 mark the three NAS BT IO runs (red is write, green is read)
EMC ISILON X400 RESULTS
dd bandwidth (MB/sec)
- Runs on BlueRidge, no special node placement policy
- Direct IO was used
- Two operations: read and write
- Two block sizes – 1MB and 4MB
- Three file sizes – 1GB, 5GB, 15GB
- Results show throughput in MB/s
dd read throughput (MB/sec), 1MB blocks – charts: Isilon vs. NetApp
dd read throughput (MB/sec), 4MB blocks – charts: Isilon vs. NetApp
dd write throughput (MB/sec), 1MB blocks – charts: Isilon vs. NetApp
dd write throughput (MB/sec), 4MB blocks – charts: Isilon vs. NetApp
Linux Kernel tests
- Runs on BlueRidge, no special node placement policy
- Operations:
  - Compile: make -j 12
  - Tar creation and extraction
  - Remove directory tree
- Results show execution time in seconds
Linux Kernel compile time (sec)

Isilon:
nodes  real  user  sys
1      701   4584  957
2      1094  4558  989
4      2228  4631  1038
8      4642  4713  1084
16     5860  4723  1107
32     6655  4754  1120
64     7181  4760  1113

NetApp:
nodes  real  user  sys
1      694   4589  951
2      1092  4572  993
4      2212  4631  1038
8      4451  4691  1073
16     5636  4716  1098
32     5999  4702  1111
64     6609  4699  1089
Tar creation time (sec)

Isilon:
nodes  real  user  sys
1      32    0.50  4.45
2            0.51  4.54
4            0.47  4.39
8            0.48  4.38
16     33    0.49  4.28
32     35          4.19
64     57          4.20

NetApp:
nodes  real  user  sys
1      30    0.51  4.50
2            0.49  4.46
4      34    0.50  4.51
8      41          4.45
16     62    0.54
32     116   0.60  4.83
64     238   0.89  7.10
Tar extraction time (sec)

Isilon:
nodes  real  user  sys
1      230   0.65  10.1
2      234   0.62  10.3
4      237   0.63  10.4
8      255   0.64  10.5
16     300   0.67  10.9
32     431   0.74  11.8
64     754   0.87  14.1

NetApp:
nodes  real  user  sys
1      98    0.6   6.6
2      103
4      106         6.5
8      130   0.7   7.1
16     217   0.8   9.1
32     406   1.2   13
64     818   1.1   14
Rm execution time (sec)

Isilon:
nodes  real  user  sys
1      110   0.23  4.76
2      113   0.24  4.80
4      124         4.82
8      158         4.85
16     234   0.25  4.93
32     340   0.26  4.99
64     655         5.27

NetApp:
nodes  real  user  sys
1      19.2  0.07  1.69
2      19.1  0.10
4      26.7  0.11  1.98
8      63.7  0.16  3.16
16     152   0.22  4.24
32     324   0.26  4.98
64     699   0.25  5.06
IOzone write throughput (KB/s) – charts: Isilon, buffered IO vs. direct IO on BlueRidge
IOzone read throughput (KB/s) – charts: Isilon, buffered IO vs. direct IO on BlueRidge
IOzone write throughput (KB/s) – charts: Isilon/BlueRidge vs. NetApp/HokieSpeed
IOzone read throughput (KB/s) – charts: Isilon/BlueRidge vs. NetApp/HokieSpeed
Thank you.