Beyond Music Sharing: An Evaluation of Peer-to-Peer Data Dissemination Techniques in Large Scientific Collaborations Thesis defense: Samer Al-Kiswany
2 Introduction. Data-intensive science: large-scale simulations and new scientific instruments generate huge volumes of data (PetaBytes). User communities: large and geographically dispersed. Requirement: efficient data dissemination tools. Samer Al-Kiswany /26
3 Introduction - Example
4 Question: What data dissemination strategies perform best in today's Grid deployments? Data dissemination solutions: IP-Multicast, Bullet, BitTorrent, SPIDER, OMNI, ALMI, Logistical-Multicast, Narada, Scribe, Grido, FastReplica, and many others.
5 Roadmap. Question: What data dissemination strategies perform best in today's Grid deployments? Workload characteristics; deployment platform characteristics; proposed data dissemination solutions; evaluation; recommendations.
6 Workload and Deployment Platform
Data-intensive scientific collaboration characteristics:
- Scale of data: massive data collections (TeraBytes).
- Data usage: uniform popularity distributions and co-usage.
- Near-real-time processing.
Deployment platform characteristics:
- Resource availability: low churn rate, high node availability, well-provisioned networks.
- Collaborative environments: no freeriding, so less effort is needed to control fair resource sharing.
7 Roadmap. Question: What data dissemination strategies perform best in today's Grid deployments? Workload characteristics; deployment platform characteristics; proposed data dissemination solutions; evaluation; recommendations.
8 Classification of Approaches
Technique | Protocol
Tree-based techniques | ALM and SPIDER
Swarming | Bullet and BitTorrent
Techniques employing intermediate storage capabilities | Logistical Multicasting
Base cases: IP-Multicast; parallel transfers (separate data channels from the source to each destination).
9 Separate Transfer from the Source to Every Destination. Drawbacks: overwhelms the source (does not scale); generates high duplicate traffic on the links around the source; does not exploit all available transport capacity.
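A back-of-the-envelope sketch of why separate transfers do not scale. It assumes, beyond what the slide states, that the source uplink is the only bottleneck and is shared equally by all channels; the function names and the MB/s units are illustrative.

```python
def separate_transfer_time(file_mb, source_uplink_mb_per_s, n_destinations):
    """Seconds until every destination holds the file.

    Assumption: the source uplink is the only bottleneck and each of the
    n_destinations channels receives an equal share of it.
    """
    per_channel_rate = source_uplink_mb_per_s / n_destinations
    return file_mb / per_channel_rate


def traffic_near_source_mb(file_mb, n_destinations):
    """Volume crossing the source's access link: one full copy per destination."""
    return file_mb * n_destinations


if __name__ == "__main__":
    # Completion time and duplicate traffic both grow linearly with the audience.
    for n in (1, 10, 100):
        print(n, separate_transfer_time(1000, 100, n), traffic_near_source_mb(1000, n))
```

Under these assumptions both the completion time and the traffic around the source grow linearly with the number of destinations, which is the scaling problem the slide points at.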
10 IP Multicasting
11 IP Multicast. Drawbacks: limited deployment; vulnerability to node failures; does not exploit all available transport capacity; throughput is limited by the bottleneck link.
12 Tree-Based Techniques: Application-Level Multicast (ALM) [Figure: ALM tree rooted at the source]
13 Tree-Based Techniques: Application-Level Multicast (ALM) [Figure: ALM tree rooted at the source] Drawbacks: vulnerability to node failures; does not exploit all possible routes in the network.
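A minimal sketch of how rates propagate down an application-level multicast tree. It assumes a simplified model that is not taken from the slides: each node splits its uplink equally among its children, and a node cannot forward faster than it receives. The names `alm_tree_rates`, `children`, and `uplink` are hypothetical.

```python
def alm_tree_rates(children, uplink, root):
    """Receive rate of every non-root node in an overlay tree.

    children: {node: [child, ...]} describing the tree.
    uplink:   {node: upload capacity} in arbitrary bandwidth units.
    Assumption: equal split of a node's uplink among its children.
    """
    rates = {}
    stack = [(root, float("inf"))]  # the root already holds the data
    while stack:
        node, incoming = stack.pop()
        kids = children.get(node, [])
        for kid in kids:
            share = uplink[node] / len(kids)
            rate = min(incoming, share)  # limited by the slowest ancestor hop
            rates[kid] = rate
            stack.append((kid, rate))
    return rates


if __name__ == "__main__":
    tree = {"src": ["a", "b"], "a": ["c"]}
    caps = {"src": 10, "a": 2, "b": 8, "c": 8}
    print(alm_tree_rates(tree, caps, "src"))
```

Every node has exactly one path from the source in this model, so a slow or failed interior node limits its whole subtree, and links outside the tree carry nothing.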
14 Swarming Techniques: BitTorrent and Bullet [Figure: complete file split into blocks 1, 2, 3, 4]
15 Swarming Techniques: BitTorrent and Bullet [Figure: complete file split into blocks 1, 2, 3, 4]
16 Swarming Techniques: BitTorrent and Bullet [Figure: complete file] Drawbacks: generates high duplicate traffic.
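A toy, round-based sketch of block swarming, in which peers forward blocks they already hold instead of waiting for the source. It is not BitTorrent or Bullet: there is no rarest-first selection, choking, or mesh construction. It assumes that each node uploads at most one block and each peer downloads at most one block per round. The name `swarm_rounds` is hypothetical.

```python
def swarm_rounds(n_peers, n_chunks):
    """Rounds until all peers hold all chunks in a toy swarm.

    Node 0 is the source and starts with every chunk. Per round, each node
    uploads at most one chunk and each peer downloads at most one chunk.
    """
    have = [set(range(n_chunks))] + [set() for _ in range(n_peers)]
    rounds = 0
    while any(len(h) < n_chunks for h in have[1:]):
        snapshot = [set(h) for h in have]  # senders offer start-of-round holdings
        receiving = set()                  # peers that already got a chunk this round
        for sender in range(len(have)):
            for receiver in range(1, len(have)):
                if receiver == sender or receiver in receiving:
                    continue
                missing = snapshot[sender] - have[receiver]
                if missing:
                    have[receiver].add(min(missing))  # lowest-index chunk first
                    receiving.add(receiver)
                    break  # this sender has used its one upload slot
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(swarm_rounds(3, 4))
```

Because peers upload while they are still downloading, the sketch finishes in fewer rounds than the `n_peers * n_chunks` uploads a lone source would need at one block per round.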
17 Logistical Multicasting
18 Roadmap. Question: What data dissemination strategies perform best in today's Grid deployments? Workload characteristics; deployment platform characteristics; proposed data dissemination solutions; evaluation; recommendations. Evaluation approaches: analytical modeling, deployment-based, simulation.
19 Methodology. Simulator design: block-level simulation; simulates link contention at the physical layer. Inputs: real topologies of three deployed Grid testbeds (LCG, GridPP, EGEE); 100 generated topologies (using BRITE).
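A minimal sketch of one way to model link contention in a block-level simulator. The slide does not say which contention model the thesis simulator uses; the equal-share rule below is an assumption for illustration, and `flow_rates` is a hypothetical name.

```python
from collections import Counter


def flow_rates(flows, capacity):
    """Approximate rate of each concurrent block transfer.

    flows:    {flow_id: [physical link, ...]} path of each transfer.
    capacity: {link: bandwidth}.
    Assumption: every link divides its capacity equally among the flows
    crossing it, and a flow runs at the smallest share along its path.
    """
    load = Counter(link for path in flows.values() for link in path)
    return {
        flow_id: min(capacity[link] / load[link] for link in path)
        for flow_id, path in flows.items()
    }


if __name__ == "__main__":
    flows = {"f1": ["core", "a"], "f2": ["core", "b"]}
    capacity = {"core": 10, "a": 100, "b": 2}
    print(flow_rates(flows, capacity))
```

The equal-share rule is simpler than max-min fairness: capacity left unused by a flow that is bottlenecked elsewhere is not redistributed to the other flows.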
20 Methodology
Success criteria | Metrics
Dissemination time | Transfer time
Overhead | MB x hop
Load balancing | Volume of in/out data
Fairness | Link stress
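A minimal sketch of how two of these metrics could be computed from a log of block transfers. The log format, the function names, and the definition of link stress used here (number of transfers crossing each physical link) are assumptions for illustration, not the thesis's definitions.

```python
from collections import Counter


def overhead_mb_hops(transfers):
    """Total MB x hop: each transfer's volume times its physical hop count.

    transfers: list of (megabytes, [physical link, ...]) tuples.
    """
    return sum(mb * len(path) for mb, path in transfers)


def link_stress(transfers):
    """Number of transfers that cross each physical link (assumed definition)."""
    return Counter(link for _, path in transfers for link in path)


if __name__ == "__main__":
    log = [(100, ["s-r1", "r1-a"]), (100, ["s-r1", "r1-b"])]
    print(overhead_mb_hops(log), dict(link_stress(log)))
```

In this example the link next to the source is crossed by both transfers, matching the duplicate traffic around the source that separate transfers generate.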
21 Transfer Time. Number of destinations that have completed the file transfer for the original EGEE topology.
22 Transfer Time with Reduced Core-Link Bandwidth. Number of destinations that have completed the file transfer: EGEE topology with core bandwidth reduced to 1/8 of the original. Conclusions: On well-provisioned topologies, even naïve algorithms perform well. On constrained topologies, application-level techniques perform uniformly well: they are among the first to finish the transfer, with good intermediate progress.
23 Summary. Motivating question: What data dissemination strategies perform best in today's Grid deployments? In this project, we simulated representative solutions, taking into account the characteristics of the workload and of the deployed platforms. Our results provide guidelines for selecting a data dissemination technique depending on the target environment, the overall system workload characteristics, and the success criteria.
24 Research Publications. This work resulted in two refereed publications and one journal submission:
- Beyond Music Sharing: An Evaluation of Peer-to-Peer Data Dissemination Techniques in Large Scientific Collaborations, S. Al-Kiswany, M. Ripeanu, A. Iamnitchi, and S. Vazhkudai. Submitted to the Journal of Grid Computing.
- Are P2P Data-Dissemination Techniques Viable in Today's Data-Intensive Scientific Collaborations?, S. Al-Kiswany, M. Ripeanu, A. Iamnitchi, and S. Vazhkudai, EuroPar, 2007, France (acceptance rate: 26%).
- A Simulation Study of Data Distribution Strategies for Large-scale Scientific Data Collaborations, S. Al-Kiswany and M. Ripeanu, IEEE CCECE 2007.
25 Other Research Work. I am involved in two other research projects:
- Scavenged Storage System: "stdchk: A Checkpoint Storage System for Desktop Grid Computing"; "A High-Performance GridFTP Server at Desktop Cost".
- StoreGPU: exploiting the GPU for computationally intensive storage system operations.
26 Thank you