Download presentation
Presentation is loading. Please wait.
Published byDelphia Francis Modified over 9 years ago
1
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D HPC Storage and IO Trends, Towards the Exascale Era and Beyond Gary Grider Division Leader High Performance Computing Division Los Alamos National Laboratory ggrider@lanl.gov LA-UR-15-23719 Fall 2015
2
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D The beginning of storage layer proliferation circa 2009 Economic modeling for large burst of data from memory shows bandwidth / capacity better matched for solid state storage near the compute nodes Economic modeling for archive shows bandwidth / capacity better matched for disk
3
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D The Hoopla Parade circa 2014 Data Warp
4
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D What are all these storage layers? Why do we need all these storage layers? Why BB: Economics (disk bw/iops too expensive) PFS: Maturity Memory Burst Buffer Parallel File System Campaign Storage Archive Memory Parallel File System Archive HPC Before Trinity HPC After Trinity 1-2 PB/sec Residence – hours Overwritten – continuous 4-6 TB/sec Residence – hours Overwritten – hours 1-2 TB/sec Residence – days/weeks Flushed – weeks 100-300 GB/sec Residence – months-year Flushed – months-year 10s GB/sec (parallel tape Residence – forever HPSS Parallel Tape Lustre Parallel File System DRAM Campaign: Economics (PFS Raid too expensive, PFS solution too rich in function, PFS metadata not scalable enough, PFS designed for scratch use not years residency, Archive BW too expensive/difficult, Archive metadata too slow) Archive: Maturity
5
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Isn’t that too many layers just for storage? If the Burst Buffer does its job very well (and indications are capacity of in system NV will grow radically) and campaign storage works out well (leveraging cloud), do we need a parallel file system anymore, or an archive? Maybe just a bw/iops tier and a capacity tier. Too soon to say, seems feasible longer term Memory Burst Buffer Parallel File System (PFS) Campaign Storage Archive Memory IOPS/BW Tier Parallel File System (PFS) Capacity Tier Archive Diagram courtesy of John Bent EMC Factoids (times are changing!) LANL HPSS = 53 PB and 543 M files Trinity 2 PB memory, 4 PB flash (11% of HPSS) and 80 PB PFS or 150% HPSS) Crossroads may have 5-10 PB memory, 40 PB solid state or 100% of HPSS with data residency measured in days or weeks We would have never contemplated more in system storage than our archive a few years ago
6
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D But that’s just a naïve old storage guy’s view… It’s a smorgasbord Amdahl Memory Of Node NV Off System Agile Disk Off System Capacity Disk Off System Tape Is there a way to reason about all this? Throughput memory Other On Chip Memory Off Chip DRAM On Node NV On Node DRAM Off System NV
7
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D But that is just economic arm waving. I know, I can turn to the workflow people for inspiration
8
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D What did I learn from the workflow-fest circa 04/2015? There are 57 ways to interpret insitu There is no common taxonomy that can be used to reason about data flow/workflow for architects or programmers
9
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Slide 9 V1 Taxonomy Attempt
10
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Slide 10 Circa 05/2015 V2 taxonomy attempt
11
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Workflows enter the realm of RFP/Procurement Trinity/Cori We didn’t specify flops, we specified running bigger app faster We wanted it to go forward 90% of the time We didn’t specify how much burst buffer, or speeds/feeds Vendors didn’t like this at first but realized it was degrees of freedom we were giving them Apex Crossroads/NERSC+1 Still no flops Still want it to go forward a large % of the time Vendors ask: where and how much flash/nvram/pixy dust do we put on node, in network, in ionode, near storage, blah blah We don’t care we want to get the most of these work flows through the system in 6 months Slide 11 V3 Taxonomy Attempt
12
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D How does one reason about a future software stack? Well we don’t think it can be like this anymore… 12 Past Evolution (add but never subtract)
13
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Coordinated DOE Storage Fast Forward Exascale Prototype IO Stack Project - (2 Tiers) Intel, EMC, HDF, DDN, Cray Async/non-blocking API’s, structure shipping Many Domain API’s (HDF5, HDFS, OOBD, etc.) Transactional/versioning Exploitation of interconnect Reliable server side storage, end-to-end integrity 2 Tiers (perf and cap) In-transit, on stg analysis Support many/per app namespaces Server Side Collectives, Gosip, Paxos Current Evolutionary Track Throw out the things you don’t like and things you want
14
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Stack slides courtesy Rob Ross ANL Revolutionary Track Everything as a Service
15
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D ECI Preliminary Planning Slide 15 Evolutionary Track Revolutionary Track
16
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D How do you reason about a future stack logically? Two logical tiers: Performance (on system) and Capacity (off system) Processing allowed in both tiers Revolutionary track Sea of objects that migrate between tiers based on hints, QoS, resilience, perf, etc. (preliminary work done by Sirocco (Sandia ASC) and Triton (ANL DOD) Massive flexible namespace that lives in the capacity tier and the performance tier “checks out” huge portions of it and then merges huge chunks of it back when done. Not just Posix namespace (hierarchy style graph) but others like app centric namespaces that can be utilized /drawn from programming/execution models/systems (preliminary work done by (ANL Graph representations, LANL Legion Persistence, HDF, ORNL ADIOS, LANL MDHIM, etc.) Evolutionary track (adding and subtracting) (informed by Revolutionary work) Objects that migrate between performance and capacity tiers Namespaces that migrate Flexible name space types Slide 16
17
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Thank You and RIPPFS Slide 17
18
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Backup Slide 18
19
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Slide 19 Circa 05/2015 Apply this future stack against current/future apps/workflows (application of new stack in DOE storage FF2)
20
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Slide 20
21
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Slide 21
22
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Slide 22
23
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D Slide 23
24
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D
25
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D
26
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D
27
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D
28
Operated by Los Alamos National Security, LLC for NNSA U N C L A S S I F I E D
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.