Data access performance through parallelization and vectored access


Data access performance through parallelization and vectored access
Some results

Fabrizio Furano, INFN – Istituto Nazionale di Fisica Nucleare, furano@pd.infn.it
Andrew Hanushevsky, SLAC – Stanford Linear Accelerator Center, abh@slac.stanford.edu

Motivation
- We want to make WAN data analysis convenient.
- The way HEP data is typically processed is (or can be) often known in advance; TTreeCache does an amazing job with this.
- xrootd: a fast and scalable server side makes things run quite smoothly.
- There is room for improvement at the client side, about WHEN the data is transferred:
  - There might be better moments to trigger a chunk transfer than the moment it is needed; better if the app does not have to pause while it receives data.
  - Transferring many chunks together also raises the throughput in many cases.

A simple job
- An extreme simplification of the problem: a single-source job that processes data "sequentially".
- I.e., the sequentiality is in the basic idea, but not necessarily in the data requests it produces, e.g. when scanning a ROOT persistent data structure.

  while not finished
      d = read(next_data_piece)
      process_data(d)
  end
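
A minimal C++ sketch of this pattern, to make the cost model explicit: every read blocks for a full round trip before the chunk can be processed. RemoteFile, ProcessData and the chunk geometry are hypothetical stand-ins, not the real XrdClient API.

#include <cstddef>
#include <vector>

// Hypothetical blocking remote-read interface, standing in for a client such as XrdClient.
struct RemoteFile {
    // Each call costs at least one network round trip before it returns.
    std::vector<char> Read(long long /*offset*/, std::size_t len) {
        return std::vector<char>(len);  // placeholder: a real client would fetch from the server
    }
};

void ProcessData(const std::vector<char>&) { /* user analysis code */ }

// The naive sequential job: latency is paid once per chunk, and nothing overlaps.
void RunJob(RemoteFile& f, long long nChunks, std::size_t chunkLen) {
    for (long long i = 0; i < nChunks; ++i) {
        std::vector<char> d = f.Read(i * chunkLen, chunkLen);  // round trip paid here, every time
        ProcessData(d);
    }
}

int main() {
    RemoteFile f;
    RunJob(f, 1000, 10 * 1024);  // e.g. 1000 chunks of 10 KB
}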

Simplified schema
- We can make some simple hypotheses:
  - the data are not local to the client machine;
  - there is a server which provides the data chunks.
- One client, one server, and a job consisting of several interactions: each step pays the network latency, then processes the data and prepares the next step.
[Diagram: client/server interaction timeline showing latency, data processing and preparation of the next step.]

Latency
- The transmission latency (TCP) may vary with the network load and the geographical distance (WAN); e.g. the minimum latency PD-SLAC is around 80 ms (about half of the ping time).
- E.g., if our job consists of 1e6 transactions, 160 ms × 1e6 makes it last 160,000 seconds longer than the ideal "no latency" one: 160,000 seconds (2 days!) of wasted time.
- Latency is the worst enemy of WAN data access; achievable throughput comes second. Can we do something about this?

Latency
Two ways to attack this issue:
- Hiding it through parallelism: request/send many chunks at the same time. If the app computes between chunks, we can transfer data while the app is computing.
- Gluing chunks together: divides the number of times we pay the latency by a constant factor. Examples: huge readaheads (e.g. for cp-like apps which read everything), vectored reads, or multiple overlapped single async requests (reordered by the client).
In any case we need to know something about the future, either statistically (the caching + read-ahead case, which leads to the "hit rate" concept) or exactly (the ROOT TTreeCache is fine for that).

Throughput
- The throughput is a measure of the actual data transfer speed; the transfer itself can be a non-negligible part of the delays.
- Throughput varies depending on the network capacity and distance. Over TCP WANs it is very difficult to use the available bandwidth: e.g. an scp transfer PD-SLAC does not score better than 350-400 KB/s, much less than the available bandwidth.
- Even more important: we want to access the data files where they are, by annihilating all the latencies and actually USING the bandwidth.

Throughput via multiple streams
[Diagram: client – network – server data flow, with a single stream and with multiple parallel streams.]

Intuitive scenarios (1/2)
Three access patterns, seen as timelines of data access vs. data processing:
- Pre-transfer the data "locally": overhead and the need for potentially useless replicas, but practical.
- Remote access: latency and wasted CPU cycles, but practical.
- Remote access + prefetching: interesting! Efficient and practical; sometimes (some %) the prediction can be wrong, which can be acceptable.

Intuitive scenarios (2/2)
- Vectored reads: all (or many) of the chunks are requested at once. Efficient, no data transfer overhead, only used chunks get prefetched; the latency is paid only once if there is no probability of mistakes, but the actual data transfer is paid before processing.
- Async vectored reads: inform the client that some chunks are needed; processing starts normally and waits only if it hits a chunk that is still "travelling". In principle, this can be made very unlikely by asking for chunks further in advance.

First evaluation
Processing at CERN (MacBook Duo), data at SLAC (160 ms RTT). Times in seconds.

Case               Total      Draw Branch   Open File
async (15 str.)    24.80076   14.921        9.87976
async (10 str.)    20.05859   12.5978       7.46079
async (5 str.)     16.91783   11.9273       4.99053
async (1 str.)     17.43715   14.4711       2.96605
sync readv         23.63164   21.2345       2.39714
Local File          6.978863   6.76074       0.218123

Courtesy of Leandro Franco and the ROOT team.

First evaluation
- Confirmation of the huge improvement: the simple strategy gives more than 30 minutes of wasted time!
- With some surprises:
  - Reading the single small chunks (a few KB each) asynchronously seemed faster than reading them as a unique readv.
  - 15 streams show a worse data rate than 10 streams.
  - Opening streams towards SLAC takes 0.6 seconds per stream. Maybe we can reduce this to less than 1 second in total; anyway it is not a dramatic problem, since the latency is paid only once when contacting a server and the connections are shared between multiple clients/files through the XrdClient connection multiplexing.
- We wanted to decide where it is worth optimising, what to do, and what to expect.

A simple benchmark using WANs
- One server in Padova, hosting a 500 MB file.
- A benchmark client app:
  - at SLAC (noric), ping time about 160 ms;
  - at IN2P3 (ccali), ping time about 23 ms.
- We want to read 50000 chunks of 10 KB each, spanning the whole file, using: legacy reads, async reads in advance (1024), huge readaheads, readV (up to 20480 subchunks).
- The sorting and the disk performance are of secondary importance here compared to the latency of the network; only the huge readaheads need the chunks to be sequential.
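
Below is a sketch of how the "legacy reads" baseline of such a benchmark could be timed. The Chunk type, the ReadChunk placeholder and the evenly spaced offsets are assumptions made for illustration; the slides do not show the actual benchmark code.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical chunk descriptor and read call; a real benchmark would go through
// the xrootd client (legacy reads, read_async, readV, ...).
struct Chunk { long long offset; std::size_t len; };
void ReadChunk(const Chunk&) { /* placeholder for one remote read */ }

int main() {
    const long long   fileSize = 500LL * 1024 * 1024;  // 500 MB test file
    const int         nChunks  = 50000;
    const std::size_t len      = 10 * 1024;            // 10 KB per chunk

    // Chunks span the whole file, evenly spaced (assumption: the exact offsets
    // used in the original benchmark are not given in the slides).
    std::vector<Chunk> chunks;
    chunks.reserve(nChunks);
    for (int i = 0; i < nChunks; ++i)
        chunks.push_back({ (fileSize / nChunks) * i, len });

    auto t0 = std::chrono::steady_clock::now();
    for (const Chunk& c : chunks) ReadChunk(c);         // "legacy" baseline: one round trip each
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "elapsed: " << s << " s, rate: "
              << (double(nChunks) * len) / s / (1024 * 1024) << " MB/s\n";
}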

Data rate vs stream count: IN2P3<--PD

IN2P3<--PD test case
- With no additional streams, async chunk transfer is slightly faster than sync readv. We also remember that async chunk transfer lets the server overlap its operations.
- The network bandwidth tends to saturate with a few streams, and the maximum data rates are comparable (9 MB/s vs 7.5 MB/s), but there is still room for improvement.
- We get a nearly 24X performance boost, reaching about 80% of the higher rate, and we did not even have a processing phase to hide latencies behind. Heading for 28X.

Time elapsed vs stream count: IN2P3<--PD

Data rate vs stream count: SLAC<--PD

SLAC<--PD test case
- With no additional streams, async chunk transfer still performs better than sync readv: with sync readv the app can proceed only when the whole block is complete, so a delayed TCP packet stalls the whole super-chunk.
- With the small chunks over multiple streams, async chunk transfer gets a boost of no better than 3X; with large chunks we reach a plateau of 9-10X (>8 MB/s). Hence, there is room for improvement in the many-streams, small-chunks case.
- As the sub-chunk size decreases (< 700 KB), performance decreases (see the 15 additional streams case with the huge readahead).
- 44X boost over the "legacy" case (heading for 150X!).

Time elapsed vs stream count: SLAC<--PD

Conclusion
- Implementation platform: xrootd. readV, readV_async and read_async are OK; readV_async is not multistream-aware yet, and we expect that to fill the performance gap for small chunks.
- ROOT TTrees have the TTreeCache; the async transfers are under advanced development/test. Add to this the fault tolerance and the parallel open.
- Purpose (other than raising performance): to be able to offer alternative choices for data or replica placement, possibly as fallbacks. Let people choose why they need replicas (more reliability, more performance, wanting to 'own' a copy of the data, ...), without forcing everybody to get (and keep track of!) potentially useless ones.

Prefetching
- Read ahead: the communication library asks for data in advance, sequentially.
- Informed prefetching: the application informs the communication library about its near-future data requests, one or a few at each request (async reads) or in big bunches (vectored reads).
- In both cases the client keeps asking for data in advance, in parallel, so later reads become cache hits.

But avoiding overheads!
- This needs more than a cache. Trouble: data could be requested several times, so the client has to keep track of the pending transfers, originally named "cache placeholders" (Patterson).
- We keep asking for data in advance, in parallel; when a read looks like a miss but the data is already in transit, it is better to wait for the right pending chunk than to re-request it.
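
A minimal sketch of the placeholder idea, assuming a simple future-based cache: an entry exists from the moment the request is issued, so a later read waits on the in-flight transfer instead of issuing a second one. The names and the std::async stand-in are illustrative, not XrdClient's internals.

#include <cstddef>
#include <future>
#include <map>
#include <vector>

// "Cache placeholders": offset -> data, present (and shared) from the moment
// the request is issued, whether the bytes have arrived yet or not.
class ChunkCache {
public:
    using Data = std::vector<char>;

    // Issue a prefetch unless the chunk is already cached or already travelling.
    void Prefetch(long long offset, std::size_t len) {
        if (cache_.count(offset)) return;              // placeholder already present
        cache_[offset] = std::async(std::launch::async,
                                    [offset, len] { return FetchRemote(offset, len); });
    }

    // A read waits only if the chunk is still travelling (or was never prefetched).
    Data Read(long long offset, std::size_t len) {
        Prefetch(offset, len);                         // a real miss: request it now
        return cache_.at(offset).get();                // wait for the pending transfer, not a new one
    }

private:
    static Data FetchRemote(long long /*offset*/, std::size_t len) {
        return Data(len);                              // placeholder for the real network fetch
    }
    std::map<long long, std::shared_future<Data>> cache_;
};

int main() {
    ChunkCache cache;
    cache.Prefetch(0, 10 * 1024);                      // ask in advance ...
    ChunkCache::Data d = cache.Read(0, 10 * 1024);     // ... the read waits for that same transfer
    (void)d;
}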

How to do it? (1/4)

  while not finished
      d = read(next_data_piece)
      process_data(d)
  end

How to do it? (2/4)
Vectored reads:
- Efficient: no data transfer overhead, only used chunks get prefetched.
- All (or many) of the chunks are requested at once; the latency is paid only once if there is no probability of mistakes, but the actual data transfer is paid up front.
- All this can be transparent to ROOT TTree users via TTreeCache, which is now used to feed HUGE sync vectored reads.

  enumerate_needed_chunks()
  readV(needed_chunks)
  populate_buffer(needed_chunks)
  while not finished
      d = get_next_chunk_frombuffer()
      process_data(d)
  end
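
A C++ sketch of this pattern, assuming a hypothetical ReadV interface (one request carrying many offset/length pairs, in the spirit of the xrootd readV but not its exact API):

#include <cstddef>
#include <vector>

// One vectored request carries many (offset, length) pairs, so the round-trip
// latency is paid once for the whole list.
struct ChunkReq  { long long offset; std::size_t len; };
struct ChunkData { long long offset; std::vector<char> bytes; };

std::vector<ChunkData> ReadV(const std::vector<ChunkReq>& reqs) {
    std::vector<ChunkData> out;
    for (const ChunkReq& r : reqs)
        out.push_back({ r.offset, std::vector<char>(r.len) });  // placeholder for one network request
    return out;
}

void ProcessData(const std::vector<char>&) { /* user analysis code */ }

int main() {
    // 1) Enumerate every chunk the job will need (exact knowledge of the future,
    //    e.g. what TTreeCache can provide for a ROOT TTree scan).
    const std::size_t len = 10 * 1024;
    std::vector<ChunkReq> needed;
    for (int i = 0; i < 1000; ++i)
        needed.push_back({ i * (long long)len, len });

    // 2) One vectored request: latency paid once, but the whole data transfer
    //    completes before processing starts.
    std::vector<ChunkData> buffers = ReadV(needed);

    // 3) Process from the local buffers, with no further network waits.
    for (const ChunkData& c : buffers)
        ProcessData(c.bytes);
}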

How to do it? (3/4)
Remote access+ (async single reads): XrdClient keeps the data as it arrives, in parallel!

  while not finished
      if (few_chunks_travelling)
          enumerate_next_M_chunks()
          foreach chunk
              Read_async(chunk)
      d = read(nth_chunk)
      process_data(d)
  end

- While the app processes one chunk, we want to keep many chunks travelling, as many as the network pipe can contain, to "fill" the latency: many data chunks are coming while this one is processed.
- We can also look further ahead (M = 2 ... 1024 ...).
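
A sketch of the same idea in C++, with std::future/std::async standing in for the client's read_async: a window of outstanding requests is kept "travelling" while earlier chunks are processed. The window size, chunk layout and FetchRemote placeholder are assumptions.

#include <cstddef>
#include <deque>
#include <future>
#include <vector>

using Data = std::vector<char>;

Data FetchRemote(long long /*offset*/, std::size_t len) { return Data(len); }  // placeholder network read
void ProcessData(const Data&) { /* user analysis code */ }

int main() {
    const int nChunks = 1000, window = 64;             // M = 64 chunks kept "travelling"
    const std::size_t len = 10 * 1024;

    std::deque<std::future<Data>> inFlight;
    int next = 0;

    for (int i = 0; i < nChunks; ++i) {
        // Top up the window: request chunks well before they are needed.
        // (A real client overlaps these on one connection instead of spawning threads.)
        while (next < nChunks && (int)inFlight.size() < window)
            inFlight.push_back(std::async(std::launch::async,
                                          FetchRemote, next++ * (long long)len, len));

        Data d = inFlight.front().get();               // waits only if chunk i is still travelling
        inFlight.pop_front();
        ProcessData(d);                                // overlaps with the transfers behind it
    }
}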

How to do it? (4/4)
Async vectored reads: XrdClient keeps the data as it arrives, in parallel! Inform the client that some chunks are needed; processing starts normally and waits only if it hits a chunk that is still "travelling". In principle, this can be made very unlikely by asking for chunks further in advance.

  while not finished
      if (few_chunks_travelling)
          enumerate_next_M_chunks()
          ReadV_async(M_chunks)
      d = read(nth_chunk)
      process_data(d)
  end

- Many future data chunks are coming while the nth one is processed.
- Unfortunately the multistream transfer is not yet ready for the async readVs :-(
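
A sketch of the batched variant, again with std::async standing in for ReadV_async: each batch of M chunks is requested with a single vectored call, and the next batch travels while the current one is processed (double buffering). ReadVBatch and the sizes are illustrative, not the real API.

#include <cstddef>
#include <future>
#include <vector>

using Data  = std::vector<char>;
using Batch = std::vector<Data>;

// Placeholder for one asynchronous vectored network request of m chunks.
Batch ReadVBatch(long long /*firstChunk*/, int m, std::size_t len) {
    return Batch(m, Data(len));
}
void ProcessData(const Data&) { /* user analysis code */ }

int main() {
    const int nChunks = 1024, M = 128;                 // M chunks per vectored request
    const std::size_t len = 10 * 1024;

    auto pending = std::async(std::launch::async, ReadVBatch, 0LL, M, len);
    for (int first = 0; first < nChunks; first += M) {
        Batch current = pending.get();                 // waits only if the batch is still travelling
        if (first + M < nChunks)                       // request the next batch before processing
            pending = std::async(std::launch::async, ReadVBatch,
                                 (long long)(first + M), M, len);
        for (const Data& d : current) ProcessData(d);  // overlaps with the next transfer
    }
}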

Multiple streams (1/3)
Clients see one physical connection per server.
[Diagram: an application with Client1, Client2 and Client3 sharing a single TCP connection (control + data) to the server.]
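
A sketch of the connection-multiplexing idea: logical clients that address the same server share one physical connection, so TCP setup and login are paid once per server. The classes below are illustrative, not XrdClient's actual implementation.

#include <map>
#include <memory>
#include <string>

struct PhysicalConnection {
    explicit PhysicalConnection(const std::string& host) : host_(host) {
        // connect + handshake would happen here, once per server
    }
    std::string host_;
};

class ConnectionPool {
public:
    std::shared_ptr<PhysicalConnection> Get(const std::string& host) {
        auto it = pool_.find(host);
        if (it != pool_.end())
            if (auto conn = it->second.lock()) return conn;    // reuse the live connection
        auto conn = std::make_shared<PhysicalConnection>(host);
        pool_[host] = conn;                                    // remember it for later clients
        return conn;
    }
private:
    std::map<std::string, std::weak_ptr<PhysicalConnection>> pool_;
};

int main() {
    ConnectionPool pool;
    auto c1 = pool.Get("server.example.org");   // hypothetical host name
    auto c2 = pool.Get("server.example.org");   // second logical client, same server
    (void)c1; (void)c2;                         // c1.get() == c2.get(): one physical connection
}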

Multiple streams (2/3)
Clients still see one physical connection per server; async data gets automatically split across the streams.
[Diagram: an application with Client1, Client2 and Client3; one TCP connection carries the control traffic and additional TCP connections carry the data towards the server.]

Multiple streams (3/3)
- It is not a copy-only tool to move data: it can be used to speed up access to remote repositories, totally transparently to apps making use of the *_async requests.
- xrdcp uses it (the -S option), with results comparable to other cp-like tools.
- For now only reads are implemented; writes are on the way (it is not easy to keep the fault tolerance).
- Client and server automatically agree on the TCP window size.

Resource access: open
- An application could need to open many files before starting the computation. We cannot rely on an optimized structure of the application, hence we assume:

  for (i = 0; i < 1000; i++)
      open file i
  Process_data()
  Cleanup and exit

- Problem: with a mean of only 5 seconds for locating/opening a file, it will take 5000 seconds for the app to start processing.

Parallel open
- Usage (XrdClient): no strange things, no code changes, nothing required. An open() call returns immediately and the request is treated as pending; threads take care of the pending opens. The opener is put to wait ONLY when it tries to access the data (while all the other open requests are still in progress).
- Usage (ROOT 5.xx): the ROOT primitives (e.g. Map()) semantically rely on the completeness of an open() call, so the parallel open has to be requested through TFile::AsyncOpenRequest(filename).
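
A generic sketch of the parallel-open idea, with std::async standing in for the background threads that handle the pending opens; the types and the URL are hypothetical, and this is not the XrdClient or ROOT implementation.

#include <future>
#include <string>
#include <utility>
#include <vector>

struct OpenedFile { std::string url; };                       // placeholder file handle

OpenedFile DoOpen(std::string url) {                          // ~seconds per file over a WAN
    return OpenedFile{ std::move(url) };                      // placeholder for locate + open
}

class AsyncFile {
public:
    // The "open" returns immediately; the request is pending in the background.
    explicit AsyncFile(const std::string& url)
        : pending_(std::async(std::launch::async, DoOpen, url)) {}

    // The first data access waits for the pending open, if it is not complete yet.
    OpenedFile& Handle() {
        if (pending_.valid()) handle_ = pending_.get();
        return handle_;
    }
private:
    std::future<OpenedFile> pending_;
    OpenedFile handle_;
};

int main() {
    // All opens progress in parallel; a real client would bound this with a small
    // pool of worker threads instead of one task per file.
    std::vector<AsyncFile> files;
    for (int i = 0; i < 1000; ++i)
        files.emplace_back("root://host.example//store/file" + std::to_string(i));
    for (auto& f : files) f.Handle();                         // each wait overlaps all the others
}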

Parallel open results
Test performed at SLAC towards the "Kanga cluster": 1576 sequential open requests are parallelized transparently.