Slide 1/29 Informed Prefetching in ROOT Leandro Franco 23 June 2006 ROOT Team Meeting CERN

Slide 2/29 Roadmap
● Description of the problem.
● Definition of a possible solution.
● Limitations of that solution.
● Implementation.
● Tests and comments.
● Future work.
● Conclusion.

Slide 3/29 Problem
● While processing a large (remote) file, the data must be transferred in small chunks.
● Working with such a file can be seen as:
   while ( NOT EOF )
      Read Buffer
      Process Data
      Go to Next Position
● The time spent waiting for the data will be a considerable part of the total time.
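For concreteness, a minimal sketch of this access pattern with the TFile API (the URL is invented for illustration; the 2.5KB block size is the typical value quoted later in the talk):

```cpp
#include "TFile.h"

// Naive access pattern: each ReadBuffer call pays one full network
// round trip before any processing can start.
void naiveLoop()
{
   TFile *f = TFile::Open("root://server//data/events.root"); // hypothetical URL
   if (!f) return;
   char buf[2500];                      // ~2.5KB, the typical block size
   Long64_t pos = 0;
   const Long64_t end = f->GetSize();
   while (pos < end) {                  // while ( NOT EOF )
      f->Seek(pos);
      f->ReadBuffer(buf, sizeof(buf));  // Read Buffer (blocks for one RTT)
      // ... Process Data ...
      pos += sizeof(buf);               // Go to Next Position
   }
   f->Close();
}
```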

Slide 4/29 Problem Description
[Diagram: three sequential reads between Client and Server, each paying latency, response time and client process time.]
Round Trip Time ( RTT ) = 2 * Latency + Response Time
Total Time = 3 * [ Client Process Time ( CPT ) ] + 3 * [ Round Trip Time ( RTT ) ]
           = 3 * ( CPT ) + 3 * ( Response Time ) + 3 * ( 2 * Latency )

Slide 5/29 Problem Description
● Depending on the conditions of the transmission, the latency can be greater than the time needed to process the data (the normal case).
● The total time of a given job is directly proportional to the latency; if latency >> CPT or latency >> response time, the total time is mostly unused waiting time.
● In that case, the best idea is to eliminate the latency time altogether.

Slide 6/29 Evolution of the problem
● Total time is directly proportional to the latency.
● The real time needed by the job is very small in comparison (it is what we obtain when latency = 0).
● The number of reads tells us exactly how the two are related: it is the slope of the line.
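To put numbers on that slope: the model here is Total Time ≈ T( latency = 0 ) + N_reads * ( 2 * Latency ). Taking the 4802 unprefetched calls from the test on slide 19 and a simulated one-way latency of 10ms, the waiting alone adds 4802 * 20ms ≈ 96s, regardless of how fast the client and server actually are.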

Slide 7/29

Slide 8/29 Idea ( diagram )
● Perform one big request instead of many small requests (only possible if the future reads are known!).
[Diagram: the three reads are merged into a single request, so the latency is paid only once.]
Total Time = 3 * ( CPT ) + 3 * ( Response Time ) + ( 2 * Latency )

Slide 9/29 Idea ( performance gain )
● Such a method would allow us to (almost) eliminate the dependence on the latency, which enters only as an additive constant.
● That constant is imperceptible compared to the original cost (but the latency is still there).

Slide 10/29 Idea ( limitations - xrootd )
● Transferring all the data in a single request is not realistic. The best we can do is to transfer blocks big enough to improve performance.
● Say our small blocks are typically 2.5KB: with a 256KB buffer we can pack about 100 requests into a single transfer.

Slide 11/29 Idea ( limitations - TCP )
● But even that is optimistic: the transfer size ultimately depends on the network (and the operating system). If the TCP window size has not been tuned, the default will probably be very small.
● A typical value is 64KB... which cuts the gain shown in the last graph by a factor of 4.

Slide 12/29 How can we get there?
● We need a class that can take many small requests, put them in a list, sort them and try to get them all at once.
● For that we have the class TFilePrefetch, created by Rene:
– Prefetch(Long64_t pos, Int_t len): puts a request in the list.
– ReadBuffer(char *buf, Long64_t pos, Int_t len): reads a buffer (if it's the first time, it sorts the list and tries to fetch everything through the underlying mechanism).
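In sketch form, assuming only the two-method interface listed above (the constructor argument and the offsets/lengths are invented for illustration), the usage pattern would be:

```cpp
// Sketch of the TFilePrefetch usage pattern described on this slide.
TFilePrefetch cache(file);    // 'file' is an open TFile (assumed constructor)
cache.Prefetch(1000, 2500);   // queue future reads: position, length
cache.Prefetch(9000, 2500);
cache.Prefetch(4000, 2500);

char buf[2500];
// The first ReadBuffer sorts the queued list and fetches everything in
// one vectored request; later calls are served from the local buffer.
cache.ReadBuffer(buf, 4000, 2500);
```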

Slide 13/29 What underlying mechanism?
● If a protocol doesn't implement it, we fall back to reading the requests one by one and returning them to TFilePrefetch:
– TFile::ReadBuffers(char *buf, Long64_t *pos, Int_t *len, Int_t nbuf): reads every element of the list and puts it in the buffer.
– Note that even this alone gains performance, since the sorted list avoids random seeks.
● To provide the full service, every descendant of TFile has to overload ReadBuffers() with a specialized version. For the moment, changes have been made to support http ( Fons ), rootd and xrootd.
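The generic fallback behaves roughly like the following sketch (the real code lives in TFile and may differ in detail):

```cpp
// Rough sketch of the generic fallback: no vectored protocol support,
// so each queued block is fetched with an ordinary ReadBuffer. Because
// the list arrives sorted by position, the seeks are sequential.
Bool_t TFile::ReadBuffers(char *buf, Long64_t *pos, Int_t *len, Int_t nbuf)
{
   Int_t offset = 0;
   for (Int_t i = 0; i < nbuf; i++) {
      Seek(pos[i]);
      if (ReadBuffer(buf + offset, len[i]))  // kTRUE signals an error
         return kTRUE;
      offset += len[i];
   }
   return kFALSE;
}
```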

Slide 14/29 How do we know which requests to pass to TFilePrefetch?
● Fortunately, that is possible when processing ROOT trees.
● It is done with a specialization of TFilePrefetch, called TTreeFilePrefetch:
– At the beginning it enters a "learning phase", adding to a list the branches where the requested events are found.
– After a given number of requests ( say 100, for example ) it stops registering branches and prefetches only the ones already recorded.
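A hypothetical sketch of that two-phase logic (none of the member or method names below are taken from the real class; only the behaviour is from the slide):

```cpp
#include "TBranch.h"
#include "TList.h"

// Hypothetical: register branches during the learning phase, then
// prefetch only the baskets of the branches recorded there.
void TTreeFilePrefetch::RegisterAccess(TBranch *branch)
{
   if (fNReads < fLearnEntries) {           // e.g. the first 100 requests
      if (!fBranches->FindObject(branch))
         fBranches->Add(branch);            // learning: remember this branch
      fNReads++;
   } else {
      // learning is over: queue only the baskets of the recorded branches
   }
}
```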

Slide 15/29 Does it work?

Slide 16/29 Example ( h2fast ) - Simulated latency ( xrootd )

Slide 17/29 Example ( h2fast ) - Simulated latency ( xrootd )

Slide 18/29 The same test on rootd instead of xrootd

Slide 19/29 Details about the test
● 4802 calls without prefetch.
● 57 calls with a big buffer (although it's capped by a 256KB limit on the server side).
● 97 calls with a 64KB buffer, which should be similar to the TCP window size (probably the most realistic case).
● The average size per call is around 1.3KB.
● The latency is simulated with a system sleep, which is not accurate below 10ms.

Slide 20/29 Comments about the test
● After the implementation we see an improvement as big as predicted.
● But, as we saw, there are restrictions on the block size:
– Client: limited to 4095 requests in one call ( if every request is 1KB this is around 4MB ).
– Server: the response is sent in 256KB chunks.
– Network: the TCP window size ( 64KB is a conservative assumption ).
● We are therefore limited by the smallest of the three; with the numbers above, that is the 64KB TCP window.

Slide 21/29 Comments
● In addition to avoiding network latency, TFilePrefetch can be a big improvement on the server side, since the calls arrive sorted.
● This is very useful when there are many clients, especially if we can guarantee the atomicity of the vectored read.
● i.e. it reduces the disk latency paid when switching contexts... average latency around 5ms?

Slide 22/29 What about a 'real' test? ( using http )
[Chart: results for various TFilePrefetch buffer sizes vs. no TFilePrefetch.]
The same transfer done with cp takes 3 seconds... could we get there?

Slide 23/29 Future work ( client side )
● We can try a parallel transfer ( multiple threads asking for different chunks of the same buffer ) to avoid latency ( protocol specific ). Recalling the first graphs, this would divide the slope by the number of threads; see the sketch after this list.
● We can implement a client-side read-ahead mechanism ( also multithreaded ) to ask the server for future chunks ( parallel if possible, but it can also be seen as one extra thread transferring data while the main thread does something else ).
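A minimal sketch of the parallel-transfer idea, written with standard C++ threads purely for illustration (fetchRange() is a hypothetical stand-in for a per-connection protocol read; a real implementation would use ROOT's threading and the protocol's vectored reads):

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in: one protocol-level read on its own connection.
void fetchRange(char *dst, long long pos, int len);

// Split one big prefetch request into per-thread chunks so the threads
// pay the network latency concurrently, dividing the slope of the first
// graphs by nThreads.
void parallelFetch(char *buf, long long pos, int len, int nThreads)
{
   std::vector<std::thread> workers;
   int chunk = len / nThreads;
   for (int i = 0; i < nThreads; i++) {
      int thisLen = (i == nThreads - 1) ? len - i * chunk : chunk;
      workers.emplace_back([=] {
         fetchRange(buf + (long long)i * chunk, pos + (long long)i * chunk, thisLen);
      });
   }
   for (auto &w : workers) w.join();
}
```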

Slide 24/29 Future work ( server side )
● We could use the pre-read mechanism specified in the xrootd protocol, for example, to avoid the disk latency; but this doesn't help much with the network latency.
– Although this is implemented in the server, the client must be modified ( we have to tell the server which buffers we want to pre-read ).

Slide 25/29 Future work ( different issue )
● Once the buffer holds all the requested data, create a thread to decompress the chunks that will be used. This hides the latency of the decompression and reduces the memory footprint, since right now the data is copied twice before being unzipped.
● This is not really related to the other subject, but it could be interesting ;).
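A rough sketch of the idea (decompressBasket() is a hypothetical stand-in for the unzip step; a real version needs per-basket synchronization, omitted here):

```cpp
#include <thread>

// Hypothetical stand-in for inflating one compressed basket.
char *decompressBasket(const char *raw);

// "Unzip ahead": once the prefetch buffer holds the raw (compressed)
// baskets, a helper thread inflates them while the main thread keeps
// processing earlier ones.
void unzipAhead(const char *raw[], char *unzipped[], int nBaskets)
{
   std::thread worker([&] {
      for (int i = 0; i < nBaskets; i++)
         unzipped[i] = decompressBasket(raw[i]);
   });
   // ... main thread consumes unzipped[i] as entries are needed ...
   worker.join();
}
```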

Slide 26/29 Conclusion
● TFilePrefetch
– State: implemented.
– Potential improvement: critical in high-latency networks ( can reach 2 orders of magnitude ).
● Pre-reads on the xrootd server
– State: already implemented on the server; the client-side modifications are easy.
– Potential improvement: reduced disk latency.
● Parallel reading
– State: working on it, beginning with one additional thread and moving to a pool.
– Potential improvement: avoids the block-size limitation of the xrootd server ( new latency = old latency / number of threads ).

Slide 27/29 Conclusion
● Read-ahead on the client side
– State: implemented independently of TFilePrefetch ( integration pending ).
– Potential improvement: use all the CPU time while data is transferred at the same time ( in a different thread ).
● Unzipping ahead?
– State: idea.
– Potential improvement: the application won't need to wait, since the data has been unzipped in advance ( by another thread ). This could yield a gain of a factor of 2.

Slide 28/29 Questions or comments?

Slide 29/29 Thank you !