1
Achieving the Ultimate Efficiency for Seismic Analysis
James Coomer, March 2017
2
Options for Flash (Lustre)
1. All-SSD Lustre – Lustre HSM for data tiering to an HDD namespace; generic Lustre I/O; millions of read and write IOPS
2. SFX – block-level read cache; Instant Commit, DSS, fadvise(); millions of read IOPS
3. L2RC – OSS-level read cache; heuristics with FileHeat; millions of read IOPS
4. IME – I/O-level write and read cache; transparent operation plus hints; tens of millions of read/write IOPS
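The SFX option lists fadvise() among its integration points. As a rough sketch of how an application hands such hints to the I/O stack through the standard POSIX call (the file path, region size, and whether a given Lustre/SFX deployment actually acts on the hint are all assumptions here):

```cpp
#include <fcntl.h>    // open(), posix_fadvise(), POSIX_FADV_* flags
#include <unistd.h>   // pread(), close()
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical seismic input; the path and sizes are illustrative only.
    const char *path = "/lustre/project/shot_gather.segy";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    const off_t offset = 0;
    const off_t length = 1L << 30;   // advise on the first 1 GiB

    // Hint that this region will be read sequentially and soon, so the
    // kernel (and any cooperating cache layer) can start read-ahead.
    posix_fadvise(fd, offset, length, POSIX_FADV_SEQUENTIAL);
    posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);

    std::vector<char> buf(4 << 20);  // 4 MiB read buffer
    ssize_t n = pread(fd, buf.data(), buf.size(), offset);
    std::printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
```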
3
Paradigm Echos: Write Benchmark Job Stacking
- Mark "wirespeed" with the iozone benchmark
- Execute the Echos write-intensive benchmark on an Ethernet-based system (2x10G and 40G, single and dual socket)
- Increase job packing up to 18 jobs per node
4
Paradigm Echos: Read Benchmark Job Stacking
- Mark "wirespeed" with the iozone benchmark
- Execute the Echos read-intensive benchmark on an Ethernet-based system (2x10G and 40G, single and dual socket)
- Increase job packing up to 18 jobs per node
5
2. SFX & ReACT – Accelerating Reads Integrated with Lustre DSS
[Diagram: OSS with the SFX API – small reads and small re-reads are served from the DRAM cache and the warm SFX tier, while large reads go to the HDD tier]
6
2. 4 KiB Random I/O
[Chart: 4 KiB random I/O – first-time I/O versus second-time I/O served as an SFA read hit]
7
SFX Acceleration “Strided Read” benchmark
The I/O pattern reads the seismic data back in a different order than it was written, giving somewhat random access into the data set. A smaller test system was used, comparing 100 spindles alone against 100 spindles accelerated with 12 SSDs. Four clients were used, each running 3 instances of the strided-read benchmark.
No SFX: 1639 MB/s
SFX accelerated: 5110 MB/s
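The benchmark source is not included in the deck; the sketch below only approximates the access pattern described – fixed-size reads at a large stride, so data comes back in a different order than it was written. The path, record size, stride, and read count are assumptions.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    // Illustrative values only -- not the parameters of the actual benchmark.
    const char  *path   = "/lustre/project/seismic_volume.dat";
    const size_t record = 256 * 1024;          // 256 KiB per read
    const off_t  stride = 64LL * 1024 * 1024;  // 64 MiB between reads
    const int    nreads = 1000;

    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    if (st.st_size <= (off_t)record) { close(fd); return 1; }

    std::vector<char> buf(record);
    long long total = 0;
    for (int i = 0; i < nreads; ++i) {
        // Jump `stride` bytes each time, wrapping around the file, so the
        // read order differs from the write order ("somewhat random" access).
        off_t off = ((off_t)i * stride) % (st.st_size - (off_t)record);
        ssize_t n = pread(fd, buf.data(), record, off);
        if (n <= 0) break;
        total += n;
    }
    std::printf("strided read: %lld bytes in %d reads\n", total, nreads);
    close(fd);
    return 0;
}
```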
8
Scale-out, Flash-Native
IME: Scale-out, Flash-Native I/O System
9
A fundamental change in the handling of application I/O for the flash era: meet performance goals with an order-of-magnitude reduction in hardware.
10
DDN | IME I/O Dataflow
DDN | IME I/O dataflow – from diverse, high-concurrency applications on compute, through a fast-data layer (NVM & SSD), to persistent data on disk:
1. The application issues I/O to the IME client; erasure coding is applied.
2. The IME client sends fragments to the IME servers.
3. The IME servers write buffers to NVM and manage internal metadata.
4. The IME servers write aligned, sequential I/O to the SFA backend.
5. The parallel file system operates at maximum efficiency.
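IME's actual erasure code and wire protocol are not detailed in the deck; the fragment below is only a conceptual stand-in for the first two steps – a client-side write buffer split into data fragments plus a single XOR parity fragment, each of which would be sent to a different IME server – under the assumption that one lost fragment must be recoverable.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Conceptual stand-in for client-side erasure coding (NOT IME's actual code):
// split a write buffer into `ndata` equal fragments plus one XOR parity
// fragment.  Any single missing fragment can be rebuilt by XOR-ing the rest.
std::vector<std::vector<unsigned char>>
make_fragments(const unsigned char *buf, std::size_t len, std::size_t ndata) {
    std::size_t frag_len = (len + ndata - 1) / ndata;        // round up
    std::vector<std::vector<unsigned char>> frags(
        ndata + 1, std::vector<unsigned char>(frag_len, 0));

    for (std::size_t i = 0; i < ndata; ++i) {
        std::size_t off = i * frag_len;
        std::size_t n = off < len ? std::min(frag_len, len - off) : 0;
        std::memcpy(frags[i].data(), buf + off, n);           // data fragment i
        for (std::size_t b = 0; b < frag_len; ++b)
            frags[ndata][b] ^= frags[i][b];                   // update parity
    }
    // In the real system, each fragment would go to a different IME server.
    return frags;
}
```

A single-parity XOR code tolerates the loss of one fragment; IME's data protection scheme is configurable and not necessarily this one.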
11
Random, Shared File IO at Wirespeed
Scale-out flash layer delivering random, shared-file I/O at wirespeed; the I/O dataflow is the same as on the previous slide.
12
TORTIA – RTM Code Experimental Use Case
- Standard C++ with OpenMP & MPI
- Input and output data in SEG-Y format
- Requires a temporary scratch area
- The first half of the time loop dumps snapshots of the velocity fields
- The second half of the time loop reads the saved snapshots back
- LIFO (last-in, first-out) access pattern
- Three different I/O backends are implemented for the scratch: POSIX, MPI-IO, and in-memory (a.k.a. "no I/O")

Scenario   Total I/O size   Use case
Small      80 GB            Quick data validation
Medium     950 GB           Typical production run
Large      8.4 TB           High-resolution run
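A minimal sketch of what the POSIX scratch backend's LIFO pattern could look like (not the TORTIA source itself; the snapshot size, count, and scratch path are assumptions): snapshots are appended during the forward half of the time loop and read back in reverse order during the backward half.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Illustrative sizes only.
    const std::size_t snap_bytes = 64UL << 20;  // 64 MiB per snapshot
    const int         nsnaps     = 16;          // snapshots kept on scratch
    const char       *scratch    = "scratch_snapshots.bin";

    std::vector<float> field(snap_bytes / sizeof(float), 0.0f);

    int fd = open(scratch, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { std::perror("open"); return 1; }

    // First half of the time loop: dump snapshots of the velocity field.
    for (int t = 0; t < nsnaps; ++t) {
        // ... propagate the wavefield for step t ...
        pwrite(fd, field.data(), snap_bytes, (off_t)t * snap_bytes);
    }

    // Second half: read the snapshots back last-in, first-out.
    for (int t = nsnaps - 1; t >= 0; --t) {
        pread(fd, field.data(), snap_bytes, (off_t)t * snap_bytes);
        // ... correlate with the backward-propagated field for step t ...
    }

    close(fd);
    unlink(scratch);
    return 0;
}
```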
13
TORTIA: Scratch I/O pattern last in, first out
[Diagram: time loop with alternating compute and I/O phases – snapshots 1, 2, …, k-2, k-1 are written in order and then read back in reverse; the most recently written snapshots are likely to still be in cache on both the compute node and the storage side, while older snapshots have a high chance of a cache miss]
14
TORTIA on pre-GA DDN IME: Total Execution Time
Configuration: 6 nodes, 2 MPI ranks per node, 20 OpenMP threads per rank. I/O targets compared: in-memory, Lustre, and the IME burst buffer.
[Chart: normalized total execution time for the Small (80 GB), Medium (950 GB), and Large (8.4 TB) cases]
Up to 3x speedup. In-memory is not applicable to the Large case: not enough memory on the nodes.
15
TORTIA on pre-GA DDN IME: Independent Runs
[Chart: elapsed time in seconds for Lustre and IME, and the IME speedup relative to Lustre, versus the number of concurrent independent runs (1 to 8)]
Multiple independent runs of the Small test case; 1 run per compute node, with the node count ranging from 1 to 8.
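Assuming the speedup curve is the ratio of elapsed times at the same run count, the plotted quantity would be

\[
\text{speedup}(n) = \frac{T_{\text{Lustre}}(n)}{T_{\text{IME}}(n)},
\]

where $n$ is the number of concurrent independent runs.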
16
RTM: IME vs Lustre Time spent in I/O
17
Oakforest-PACS | JCAHPC (Universities of Tokyo and Tsukuba)
- #6 in the Top500; Japan's fastest supercomputer
- 8,208 Intel Xeon Phi processors with the Knights Landing architecture
- Peak performance: 25 PF
- 25 IME14K appliances with 940 TB of NVMe SSD (with erasure coding)
- Measured 1.2 TB/s for both file-per-process (FPP) and single-shared-file (SSF) workloads
18
IME Differentiation: Radical Shift in Performance per Watt, per RU, and per Device
- Flash-native implementation
- Extreme rebuild speeds
- Full data protection
- Improved efficiency of the parallel filesystem
- Radical shift in performance per watt, per rack unit, and per device
- Dramatic random I/O and shared-file I/O performance
- Self-optimising in noisy environments
- Intelligent read-ahead