Algorithms for Selecting Mirror Sites for Parallel Download
Sonali Patankar
CS522 Semester Project
Dec 05, 2001
Download Mechanism

What happens when we download?
Popular software is downloaded frequently
The mirror site concept

With the advent of the Internet, information sharing has become easy. Whenever we need a document that exists on one of the computers on the Internet, we download it for our use. What really happens when we download? In the simplest terms, we set up a TCP connection with the computer that has the required file. Depending on the characteristics of the path selected for this download, it may take more or less time, for a variety of reasons such as packet loss and bottleneck bandwidth.

In the case of freeware and shareware, the cheapest way to distribute software is to offer it online for download. If the software is popular, it will receive many download requests. No matter how good the server hosting the software is, the actual download performance depends on many factors that are outside our network, and probably also outside the network of the user who is trying to download. More people accessing the same site makes this scenario even worse. How do we solve this problem?

Mirror sites are sites that contain an exact replica of the file we are offering for download. With mirrors, users can select the site geographically closest to their location, which distributes traffic across the network and avoids putting the whole load on a single site. This approach improves overall download performance, but it is still limited by the available bandwidth on the path. How can we improve this?
Parallel Access of Mirror Sites

Access mirror sites in parallel
Is it that simple?
What techniques can we use?
History-based TCP parallel access
Dynamic TCP parallel access

Now that we have multiple mirror sites, we can divide the work among them and get better results. For example, if we have a 10 MB file to download and 10 mirror sites, we fetch a 1 MB piece of the file from each mirror site and combine the pieces once we have received all of them. Dividing the work this way gives a better download time than downloading from a single site.

Though this approach looks very simple, there are many considerations to account for. If most of the mirror sites we use are geographically far away, each of them will take considerable time to send its piece. Meanwhile, a mirror site nearer to us may have already sent its piece and, from our point of view, sits idle. To benefit from this technique, we need to balance the parameters that determine how much time the procedure takes. Research has been conducted in this area to identify techniques that make parallel access an actual performance improvement. These techniques are history-based TCP parallel access and dynamic TCP parallel access.
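As a rough illustration of the equal-split idea on this slide, the sketch below divides a file into contiguous byte ranges, one per mirror, and estimates when the download finishes. The function names and the per-mirror rates are hypothetical and chosen only for illustration; they do not come from the presentation or the referenced paper.

```python
from typing import List, Tuple


def equal_ranges(file_size: int, n_mirrors: int) -> List[Tuple[int, int]]:
    """Split a file into near-equal inclusive byte ranges, one per mirror."""
    if file_size <= 0 or n_mirrors <= 0:
        raise ValueError("file_size and n_mirrors must be positive")
    base, extra = divmod(file_size, n_mirrors)
    ranges = []
    start = 0
    for i in range(n_mirrors):
        # The first `extra` mirrors carry one additional byte each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            break  # more mirrors than bytes; the rest get nothing
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def completion_time(ranges: List[Tuple[int, int]], rates: List[float]) -> float:
    """The download is complete only when the slowest piece has arrived."""
    return max((end - start + 1) / rate
               for (start, end), rate in zip(ranges, rates))


# 10 MB file, 10 mirrors: each mirror is asked for a 1 MB piece.
pieces = equal_ranges(10_000_000, 10)
# Assumed rates in bytes/s: nine fast mirrors and one slow, distant mirror.
rates = [1_000_000.0] * 9 + [100_000.0]
print(pieces[0], completion_time(pieces, rates))
```

With these assumed rates, the nine fast mirrors finish their pieces in one second and then sit idle while the slow mirror takes ten seconds, which is the imbalance the notes above describe.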
History-based TCP parallel access

History data of the servers is used
The client decides how much data each server should send
Limitations of this approach

In this technique, we use history data for each server, mainly its bandwidth and response time. This data is typically obtained by querying a database that maintains and provides such statistics for the server. With this data, the client has a somewhat better understanding of the servers' capabilities, and it will usually not request an equal share of the file from each server. The paths or mirror sites with lower bandwidth are asked to carry less data than those capable of carrying more data in less time. Based on these decisions, the client requests separate parts from different servers, collects the data, requests more data from a server if required, and the download completes.

For this technique to work at its best efficiency, all the servers must keep delivering useful data to the client until the download is complete. It is possible that one server has finished sending its piece while the download is still incomplete, because a piece from another server has not yet been fully received.

As with any technique, there are pros and cons. The pro is that download performance improves. The cons can make the technique look less attractive. The weakest link in the path to a server plays a major role in the outcome of the download, so dial-up modem users may not benefit from this technique because their access link becomes the bottleneck (the weakest link). Another important factor is the history data we use to decide how much data each server should carry. Network and server conditions are often not the same on two different occasions: at one time the server or network may be lightly loaded, and at another it may be heavily loaded. If we happen to use this algorithm at the busiest time for the server or network, we may not get the best possible results.

Now we look at the next technique.
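Before moving on, the history-based split can be sketched in a few lines. This is a minimal illustration, assuming that the history database yields one past-throughput figure per server and that the client assigns bytes in proportion to it. The names `history_split` and `finish_times` and the example rates are hypothetical.

```python
from typing import Dict


def history_split(file_size: int, past_rates: Dict[str, int]) -> Dict[str, int]:
    """Assign each server a share of the file proportional to its past rate."""
    total = sum(past_rates.values())
    if file_size <= 0 or total <= 0:
        raise ValueError("file_size and rates must be positive")
    shares = {name: file_size * rate // total
              for name, rate in past_rates.items()}
    # Integer division can leave a few bytes over; give them to the
    # server that history says is fastest.
    fastest = max(past_rates, key=past_rates.get)
    shares[fastest] += file_size - sum(shares.values())
    return shares


def finish_times(shares: Dict[str, int],
                 actual_rates: Dict[str, int]) -> Dict[str, float]:
    """Time each server needs for its share at the rates seen today."""
    return {name: shares[name] / actual_rates[name] for name in shares}


# Assumed history in bytes/s for three mirrors.
history = {"a": 300_000, "b": 100_000, "c": 100_000}
shares = history_split(10_000_000, history)

# If history is accurate, every server finishes at the same moment.
print(finish_times(shares, history))

# If server "a" is heavily loaded today, its large share dominates.
today = {"a": 100_000, "b": 100_000, "c": 100_000}
print(finish_times(shares, today))
```

When the history matches current conditions, all three servers finish together after 20 seconds. When server "a" turns out to be as slow as the others, its oversized share takes 60 seconds while the other two are idle after 20, which is the limitation described in the notes.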
Dynamic TCP parallel access

The client partitions the document into small blocks and makes the first request
On receipt of a block, it negotiates with the server for the next block

This technique does not base its decisions on any history data. The client first divides the document into small blocks. Since it knows nothing about the characteristics of the different servers, it requests a separate block from each server. When a block arrives from a server, the client requests from that server the next block that has not yet been requested. This process of deciding which block to request and actually requesting it from the server is called negotiation. During every negotiation, the client spends some time in which no useful data is being transmitted by the server involved. The block size chosen by the client also determines how many negotiations are necessary.

The immediate benefit of this technique is that it adapts to current network and server conditions and keeps all the servers busy working on the download, thereby improving performance. But sometimes a slow server and network combination may take a long time to send even the small amount of data it has been asked for. This may ultimately leave the client waiting for the missing data that has yet to be received.
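The block-by-block behaviour described above can be sketched as a small event-driven simulation. This is only an illustration under simplifying assumptions: each server has a fixed rate, every request costs a fixed idle negotiation delay, and the function name and parameters are hypothetical rather than taken from the referenced paper.

```python
import heapq
from typing import List, Tuple


def simulate_dynamic(n_blocks: int, block_size: float, rates: List[float],
                     negotiation_delay: float) -> Tuple[float, List[int]]:
    """Return (finish time, blocks delivered per server)."""
    per_block = [block_size / r for r in rates]
    events = []      # min-heap of (arrival time, server index, block index)
    next_block = 0   # next block that has not been requested yet

    # First round: request one distinct block from every server.
    for s in range(len(rates)):
        if next_block < n_blocks:
            heapq.heappush(
                events, (negotiation_delay + per_block[s], s, next_block))
            next_block += 1

    delivered = [0] * len(rates)
    finish = 0.0
    while events:
        now, s, _block = heapq.heappop(events)
        delivered[s] += 1
        finish = now
        # Negotiate the next unrequested block with the server that
        # has just finished, so faster servers end up carrying more.
        if next_block < n_blocks:
            heapq.heappush(
                events, (now + negotiation_delay + per_block[s], s, next_block))
            next_block += 1
    return finish, delivered


# Two servers, one twice as fast as the other, six blocks of 100 bytes.
print(simulate_dynamic(6, 100, [100.0, 50.0], negotiation_delay=0.0))
print(simulate_dynamic(6, 100, [100.0, 50.0], negotiation_delay=0.5))
```

In this example the faster server delivers four of the six blocks without any history data being consulted. Adding a negotiation delay of 0.5 seconds per request raises the finish time from 4.0 to 6.0 seconds, which shows why the block size matters: smaller blocks mean more negotiations and more idle time.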
Selecting a subset of mirror sites

Both techniques are good, but they have limitations
How can we select the best mirror sites to use and not worry about the rest?

As we saw, both techniques are good but have some limitations. Both of them use all the mirror sites in the set that has been provided. What if we do not want to use all the mirror sites? In that case, how do we choose which mirror sites to use and which to leave out? Using the two techniques discussed above, we can build a hybrid technique.
Hybrid Algorithm (Best 5 mirror sites)

Choose the 5 paths that have the highest bottleneck bandwidths and the lowest round-trip times (request a sample piece of the file)
For the chosen paths, request a relatively small piece of data from each path
Upon receipt of the response, use the actual time required to retrieve the data to determine how efficient that path really is, and decide how much data to request from that server

Given a set of 10 mirror sites, we need to find the best five to use for parallel download. As we learned earlier, the weakest link in a path is the most important part of that path. Using TCP diagnostic utilities such as pathchar, we can find the bottleneck bandwidth and the round-trip time (RTT) of a given path. We request a small piece of the file from all the sites and measure the time the client takes to receive it. Based on the bandwidth (if available from the earlier pathchar run) and the time required to receive the data, we choose the 5 paths that have the highest bottleneck bandwidths and the lowest round-trip times.

Next we request a relatively small piece of the file from each of the 5 chosen servers. When the servers respond, we can determine how much time each server really takes to deliver our data. For a server that responds early, we double the size of the next request. If a server responds worse than it did on its previous request, we do not increase the requested size. For servers that respond late, we keep requesting the same amount of data and watch for an improvement in the response; if there is an improvement, we increase the requested size. Increasing the requested size helps compensate for the time spent in negotiations. This way we make greater use of the paths that have given us better results, and so make the best use of current network conditions.
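The two steps of the hybrid algorithm can be sketched as follows. The slide does not say how bandwidth and RTT are combined into one ranking, so this sketch assumes the paths are ranked by the estimated time to fetch the sample piece (RTT plus sample size divided by bottleneck bandwidth). It also assumes that "responds early" means the measured throughput did not drop compared with the previous request from that server. All names, the `max_size` cap, and the numbers are hypothetical.

```python
from typing import Dict, List, NamedTuple


class PathInfo(NamedTuple):
    bandwidth: float  # estimated bottleneck bandwidth, bytes/s
    rtt: float        # round-trip time, seconds


def pick_best(paths: Dict[str, PathInfo], sample_size: int,
              k: int = 5) -> List[str]:
    """Rank mirrors by estimated time to fetch the sample; keep the best k."""
    def estimated_time(name: str) -> float:
        info = paths[name]
        return info.rtt + sample_size / info.bandwidth
    return sorted(paths, key=estimated_time)[:k]


def next_request_size(current_size: int, previous_throughput: float,
                      latest_throughput: float, max_size: int) -> int:
    """Double the request if the server held or improved its throughput,
    otherwise keep asking for the same amount."""
    if latest_throughput >= previous_throughput:
        return min(current_size * 2, max_size)
    return current_size


# Ten assumed mirrors with equal RTT and increasing bandwidth.
mirrors = {f"m{i}": PathInfo(bandwidth=(i + 1) * 100_000.0, rtt=0.05)
           for i in range(10)}
print(pick_best(mirrors, sample_size=10_000))

# A server that improved is asked for twice as much next time;
# a server that got worse keeps the same request size.
print(next_request_size(64_000, 1000.0, 1200.0, max_size=1_000_000))
print(next_request_size(64_000, 1000.0, 800.0, max_size=1_000_000))
```

Other ranking rules are equally consistent with the slide, for example sorting by bandwidth and breaking ties on RTT; the single estimated-time score is used here only because it yields one unambiguous ordering.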
Reference

Parallel-Access for Mirror Sites in the Internet – Pablo Rodriguez, Andreas Kirpal, Ernst Biersack