1 Slide 1/29 Informed Prefetching in ROOT Leandro Franco 23 June 2006 ROOT Team Meeting CERN

2 Slide 2/29 Roadmap ● Description of the problem. ● Definition of a possible solution. ● Limitations of that solution. ● Implementation. ● Tests and comments. ● Future work. ● Conclusion.

3 Slide 3/29 Problem ● While processing a large (remote) file, the data must be transferred in small chunks. ● Working with such a file can be seen as: ● while ( NOT EOF ) ● Read Buffer ● Process Data ● Go to Next Position ● The time spent waiting for data becomes a considerable part of the total time.
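
A minimal sketch of that loop, assuming ROOT's TFile interface; the file URL, chunk size and the processData() step are illustrative and error handling is omitted.

    #include "TFile.h"

    void naiveLoop()
    {
       TFile *f = TFile::Open("root://server//data/events.root");   // remote file (example URL)
       const Int_t len = 2500;                                       // ~2.5KB chunks, as on slide 10
       char buf[len];
       Long64_t pos = 0;
       while (pos < f->GetSize()) {          // NOT EOF
          f->Seek(pos);                      // go to next position
          f->ReadBuffer(buf, len);           // one small request -> one full round trip
          // processData(buf, len);          // hypothetical processing step (the CPT)
          pos += len;
       }
       f->Close();
    }

Every iteration pays a full round trip to the server, which is exactly the cost the next slides quantify.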

4 Slide 4/29 Problem Description ● Timing diagram: the client issues three sequential requests to the server. ● Round Trip Time ( RTT ) = 2 * Latency + Response Time. ● Total Time = 3 * [Client Process Time ( CPT )] + 3 * [Round Trip Time ( RTT )] = 3 * ( CPT ) + 3 * ( Response Time ) + 3 * ( 2 * Latency ).
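
Stated generally for N sequential reads (a direct restatement of the formula above):

    RTT            = 2 * Latency + Response Time
    Total Time (N) = N * ( CPT + RTT ) = N * CPT + N * Response Time + 2 * N * Latency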

5 Slide 5/29 Problem Description ● Depending on the conditions of the transmission, the latency can be greater than the time needed to process the data (the normal case). ● The time for a given job is directly proportional to the latency, and if latency >> CPT or latency >> response time, the total time is mostly idle time. ● In that case the best idea would be to eliminate the latency altogether.

6 Slide 6/29 Evolution of the problem ● Total time is directly proportional to the latency. ● The real time needed by the job is very small in comparison ( obtained when latency = 0 ). ● The number of reads tells us exactly how they are related (the slope of the line).

7 Slide 7/29

8 Slide 8/29 Idea ( diagram ) ● Perform one big request instead of many small requests (only possible if the future reads are known!). ● Timing diagram: the same three reads now share a single round trip. ● Total Time = 3 * ( CPT ) + 3 * ( Response Time ) + ( 2 * Latency ).
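
With the reads gathered into a single request, the same accounting for N reads gives (again just a restatement of the slide's formula):

    Total Time (N) = N * CPT + N * Response Time + 2 * Latency

i.e. the latency is paid once instead of N times, which is the gain discussed on the next slide.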

9 Slide 9/29 Idea ( performance gain ) ● Such a method would allow us to (almost) eliminate the dependence on the latency, which enters only as an additive constant. ● That constant is imperceptible compared to the original time (but the latency is still there).

10 Slide 10/29 Idea ( limitations - xrootd ) ● Transferring all the data in a single request is not realistic. The best we can do is to transfer blocks big enough to give an improvement in performance. ● Let's say our small blocks are usually 2.5KB; with a 256KB buffer we can perform 100 requests in a single transfer.
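
The arithmetic behind that figure:

    256 KB / 2.5 KB = 102.4, i.e. roughly 100 small requests per 256KB transfer.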

11 Slide 11/29 Idea ( limitations - TCP ) ● But even that is optimistic: the transfer size ultimately depends on the network ( and the operating system ). If the TCP window size has not been changed, the default will probably be very small. ● A typical value is 64KB... which reduces the performance of the last graph by a factor of 4.

12 Slide 12/29 How can we get there? ● We need a class that can take many small requests, put them in a list, order them and try to get them all at once. ● For that we have the class TFilePrefetch, created by Rene. – Prefetch(Long64_t pos, Int_t len) : puts a request in the list. – ReadBuffer(char *buf, Long64_t pos, Int_t len) : reads a buffer (if it's the first time, it will sort the list and try to get everything from the underlying mechanism).
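
A sketch of the intended usage, with the two methods exactly as listed on this slide; the constructor arguments, file URL and offsets are assumptions for illustration only.

    TFile *file = TFile::Open("root://server//data/events.root");
    TFilePrefetch *pf = new TFilePrefetch(file, 256*1024);   // assumed constructor: file + buffer size

    // First declare the future reads...
    pf->Prefetch(1024,  2500);
    pf->Prefetch(40960, 2500);
    pf->Prefetch(81920, 2500);

    // ...then the first ReadBuffer() sorts the list and fetches it all through
    // the underlying mechanism; later calls are served from the local buffer.
    char chunk[2500];
    pf->ReadBuffer(chunk, 40960, 2500);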

13 Slide 13/29 What underlying mechanism? ● If we don't implement a specialized one, all the requests are read one by one and returned to TFilePrefetch. – TFile::ReadBuffers(char *buf, Long64_t *pos, Int_t *len, Int_t nbuf) : reads every element of the list and puts it in the buffer. – Note that even this alone gains in performance, since we avoid random seeks. ● If we want to provide the full service, every descendant of TFile has to overload ReadBuffers() with a specialized version. For the moment changes have been made to support http ( Fons ), rootd and xrootd.
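
A rough sketch of what the generic (non-specialized) path amounts to, assuming the usual ROOT convention that ReadBuffer() returns kTRUE on error; a protocol-specific subclass would instead send the whole list as one vectored request.

    Bool_t TFile::ReadBuffers(char *buf, Long64_t *pos, Int_t *len, Int_t nbuf)
    {
       Int_t offset = 0;
       for (Int_t i = 0; i < nbuf; i++) {
          Seek(pos[i]);                          // jump to the i-th chunk (still one seek per chunk...)
          if (ReadBuffer(buf + offset, len[i]))  // ...but the chunks are requested in sorted order
             return kTRUE;                       // propagate the error
          offset += len[i];
       }
       return kFALSE;
    }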

14 Slide 14/29 How do we know what requests we must pass to TFilePrefetch ? ● Fortunately, that is possible when processing ROOT trees. ● This is done with a specialization of TFilePrefetch, called TTreeFilePrefetch. – At the beginning it enters a “learning phase”, adding to a list the branches where the requested events are found. – After a given number of requests ( say 100, for example ) it stops registering branches and prefetches only the ones already recorded.
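
A very schematic sketch of that learning logic; the method and every member name below (fLearning, fEntriesSeen, fLearnEntries, fBranches) are hypothetical and only illustrate the behaviour described above, not the real class.

    void TTreeFilePrefetch::Register(TBranch *branch, Long64_t pos, Int_t len)
    {
       if (fLearning) {
          if (!fBranches->FindObject(branch))
             fBranches->Add(branch);            // remember every branch actually used
          if (++fEntriesSeen >= fLearnEntries)  // e.g. 100 requests, as on the slide
             fLearning = kFALSE;                // learning phase is over
       } else if (fBranches->FindObject(branch)) {
          Prefetch(pos, len);                   // afterwards: queue only the known branches
       }
    }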

15 Slide 15/29 Does it work?

16 Slide 16/29 Example ( h2fast ) - Simulated latency ( xrootd )

17 Slide 17/29 Example ( h2fast ) - Simulated latency ( xrootd )

18 Slide 18/29 The same test on rootd instead of xrootd

19 Slide 19/29 Details about the test ● 4802 calls without prefetch. ● 57 calls with a big buffer (although it is constrained by a 256KB limit on the server side). ● 97 calls with a buffer of 64KB, which should be similar to the TCP window size (probably the most realistic case). ● The average size per call is around 1.3KB. ● The latency is simulated with a system sleep, which is not accurate below 10ms.

20 Slide 20/29 Comments about the test ● After the implementation we see that the improvement is as big as predicted. ● But as we saw, there are restrictions on the block size. – Client: limited to 4095 requests in one call ( if every request is 1KB this is around 4MB ). – Server: the response will be sent in 256KB chunks. – Network: TCP window size limitation ( 64KB is a conservative assumption ). ● Therefore, we will be limited by the smallest of the three.

21 Slide 21/29 Comments ● In addition to avoiding network latency, TFilePrefetch can be a big improvement on the server side since the calls are ordered. ● This is very useful if there are many clients, especially if we can guarantee the atomicity of the vectored read. ● i.e. it reduces disk latency when the server switches contexts between clients... average disk latency around 5ms?

22 Slide 22/29 What about a 'real' test? ( using http ) ● Chart: transfer time as a function of the TFilePrefetch buffer size, compared with no TFilePrefetch. ● The same copy is done with cp in 3 seconds... could we get there?

23 Slide 23/29 Future work ( client side ) ● We can try a parallel transfer ( multiple threads asking for different chunks of the same buffer ) to avoid latency ( protocol specific ). Recalling the first graphs, we would be dividing the slope by the number of threads. ● We can implement a client-side ReadAhead mechanism ( also multithreaded ) to ask the server for future chunks ( in parallel if possible, but it can also be seen as one extra thread transferring data while the main thread does something else ).
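
A generic illustration of the parallel-transfer idea (plain standard C++ threads, not ROOT code; readRange() is a hypothetical stand-in for whatever protocol call actually fetches the bytes):

    #include <thread>
    #include <vector>

    // Hypothetical stand-in for the blocking network read of [pos, pos+len) into dst.
    void readRange(char *dst, long long pos, int len)
    {
       (void)dst; (void)pos; (void)len;   // placeholder body
    }

    void parallelRead(char *buf, long long pos, int len, int nThreads)
    {
       std::vector<std::thread> workers;
       int chunk = len / nThreads;
       for (int i = 0; i < nThreads; ++i) {
          long long off = static_cast<long long>(i) * chunk;
          int sz = (i == nThreads - 1) ? len - i * chunk : chunk;   // last thread takes the remainder
          workers.emplace_back(readRange, buf + off, pos + off, sz);
       }
       for (auto &t : workers)
          t.join();   // all sub-ranges arrive after roughly one latency, not nThreads of them
    }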

24 Slide 24/29 Future work ( server side ) ● We could use the pre-read mechanism specified in the xrootd protocol, for example, to avoid the disk latency, but this doesn't help much with the network latency. – Although this is implemented in the server, modifications must be made in the client ( we have to tell the server which buffers we want to pre-read ).

25 Slide 25/29 Future work ( different issue ) ● After filling the buffer with all the requests, create a thread to decompress the chunks that will be used. This hides the latency of the decompression and reduces the footprint, since right now the data is copied twice before being unzipped. ● This is not really related to the other subject but could be interesting ;).

26 Slide 26/29 Conclusion ● TFilePrefetch – State: Implemented – Potential Improvement: Critical in high-latency networks ( the gain can reach 2 orders of magnitude ). ● Pre-reads on the xrootd Server – State: Already implemented on the server; modifications on the client side are easy. – Potential Improvement: Reduce disk latency. ● Parallel Reading – State: Working on it, starting with one additional thread and then moving to a pool. – Potential Improvement: Avoid the block-size limitation of the xrootd server ( new latency = old latency / number of threads ).

27 Slide 27/29 Conclusion ● Read Ahead on the client side – State: Implemented independently of TFilePrefetch (integration pending). – Potential Improvement: Use the CPU time to transfer data at the same time ( in a different thread ). ● Unzipping Ahead? – State: Idea – Potential Improvement: The application won't need to wait, since the data has been unzipped in advance ( by another thread ). This could give a gain of up to a factor of 2.

28 Slide 28/29 Questions ?? or comments ?

29 Slide 29/29 Thank you !

