Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks Author: P. Kokkinos, K. Christodoulopoulos, A. Kretsis, and E. Varvarigos.

Presentation transcript:

1 Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks Author: P. Kokkinos, K. Christodoulopoulos, A. Kretsis, and E. Varvarigos Department of Computer Engineering and informatics, University of Patras, Greece and Research Academic Computer Technology Institute, Patras, Greece Conference: CCGRID 2008

2 Outline Introduction Previous Work Problem Formulation Data Consolidation Techniques Simulation Conclusion

4 Introduction Many applications benefit from Grid computing:  Computation-intensive applications: solve computationally demanding problems on small datasets.  Data-intensive applications: perform computations on large datasets stored at geographically distributed resources. (NOTE: such a Grid is usually referred to as a Data Grid.)

5 Introduction We evaluate a task scheduling and data migration problem called data consolidation (DC).

6 Outline Introduction Previous Work Problem Formulation Data Consolidation Techniques Simulation Conclusion

7 Previous Work Most related works assume that each task needs only one large piece of data for its execution. As a result, the common scenario where a task requires several datasets, scattered across different sites, is ignored in most related works.

8 Previous Work In “Intelligent Scheduling and Replication in Datagrids: a Synergistic Approach”:  Each task needs one or more pieces of data for its execution.  A tabu-search scheduler is used.  The goal is to optimize execution time and system utilization.

9 Outline Introduction Previous Work Problem Formulation Data Consolidation Techniques Simulation Conclusion

10 Problem Formulation A Grid Network consists of:  a set R of N sites: each r ∈ R contains at least one of the following entities  computation resource  storage resource  network resource  Each computation resource has a local scheduler and a queue.  There is a central scheduler responsible for the task scheduling and data management. (This scheduler has complete knowledge of the static and dynamic characteristics of the sites)

11 Problem Formulation

12 Problem Formulation On receiving the user’s request, the central scheduler examines the computation and data related characteristics of the task. Based on the DC algorithm used, the central scheduler selects: 1. The sites that hold replicas of the datasets the task needs. 2. The site where these datasets will be consolidated and the task will be executed. (This site is called the DC site.) NOTE: the DC site must have enough free storage capacity to hold all the consolidated datasets (the slide states this as an inequality).

13 Problem Formulation The scheduler orders the data-holding sites to transfer their datasets to the DC site, and orders the user to transfer the task to the DC site. After the task finishes execution, the results are returned to the originating user.

14 Outline Introduction Previous Work Problem Formulation Data Consolidation Techniques Simulation Conclusion

15 Theoretical Analysis Assume that the scheduler has selected the data-holding sites r_k ∈ R for all datasets I_k, k = 1, 2, …, L, as well as the DC site. The DC site may already hold some of the datasets, so no transfer is required for those.

16 Theoretical Analysis In general, a data-intensive task experiences: 1. communication delay (D_comm) 2. processing delay (D_proc)

17 Theoretical Analysis 1. Communication delay (D_comm): D_comm = D_cons + D_output, where D_cons is the time needed to consolidate (transfer) all required datasets to the DC site and D_output is the time needed to return the task’s output to the originating user (the exact expression appears on the slide).

18 Theoretical Analysis 2. Processing delay (D_proc): the queuing delay at the DC site’s computation resource plus the task’s execution time (the exact expression appears on the slide).

19 Theoretical Analysis The total delay suffered by a task is D_DC = D_comm + D_proc.
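The delay model above can be sketched in a few lines (Python used only for illustration; all numbers are hypothetical):

```python
# Sketch of the task delay model from the slides:
# D_DC = D_comm + D_proc, with D_comm = D_cons + D_output.

def total_task_delay(d_cons, d_output, d_proc):
    """Total delay D_DC suffered by a data-intensive task (seconds)."""
    d_comm = d_cons + d_output  # communication delay
    return d_comm + d_proc      # D_DC = D_comm + D_proc

# Hypothetical example: consolidation 120 s, output return 10 s,
# queuing plus execution 300 s.
print(total_task_delay(120.0, 10.0, 300.0))  # 430.0
```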

20 Proposed Techniques We propose several categories of DC algorithms:  Time: ConsCost, ExecCost, TotalCost  Traffic: SmallTrans

21 Time 1. Consolidation-Cost (ConsCost) algorithm: we select the replicas and the DC site that minimize the data consolidation time (D_cons). Given a candidate DC site r_j, for each dataset I_k we search for the replica site r_i holding I_k whose transfer time to r_j is minimum; these minimum transfer times determine the data consolidation time of r_j. Finally, we select as DC site the candidate with the smallest consolidation time.
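The ConsCost selection might be sketched as follows; the function names and the toy transfer-time model are illustrative assumptions, not the paper’s exact formulation. D_cons of a candidate is taken here as the maximum of the per-dataset minimum transfer times, assuming transfers run concurrently:

```python
# Hedged sketch of ConsCost: for each candidate DC site r_j, pick the
# cheapest replica of every dataset I_k; D_cons(r_j) is the maximum of
# those per-dataset transfer times; the DC site minimizes D_cons.

def cons_cost_site(candidates, replicas, transfer_time):
    """candidates: candidate DC sites; replicas: dataset -> holder sites;
    transfer_time(src, dst, dataset) -> seconds (0 when src == dst).
    Returns (best DC site, its consolidation time)."""
    best_site, best_dcons = None, float("inf")
    for rj in candidates:
        d_cons = max(
            min(transfer_time(ri, rj, ik) for ri in holders)
            for ik, holders in replicas.items()
        )
        if d_cons < best_dcons:
            best_site, best_dcons = rj, d_cons
    return best_site, best_dcons

# Toy instance: dataset sizes in MB, assumed 1000 MB/s per hop.
sizes = {"I1": 7500, "I2": 7500}
hops = {("A", "B"): 1, ("B", "A"): 1, ("A", "C"): 2,
        ("C", "A"): 2, ("B", "C"): 1, ("C", "B"): 1}

def transfer_time(src, dst, ds):
    return 0.0 if src == dst else sizes[ds] * hops[(src, dst)] / 1000.0

site, d_cons = cons_cost_site(["A", "B", "C"],
                              {"I1": ["A"], "I2": ["B"]}, transfer_time)
print(site, d_cons)  # A 7.5
```

Site A already holds I1 (zero transfer cost), so only I2 must travel one hop, which beats every other candidate.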

22 Time 2. Execution-Cost (ExecCost) algorithm: we select the DC site that minimizes the task’s execution time, while the data replicas are chosen randomly. NOTE: the execution time at a resource is difficult to calculate exactly, but it can be estimated based on:  the tasks already assigned to it (r_i)  the average delay the tasks executed on it have experienced.
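The estimation hinted at in the NOTE could combine those two quantities as in the sketch below; this is a plain assumption about how queued workload and past delays might be merged, not the paper’s formula:

```python
# Hypothetical estimator for a resource's task delay, per the NOTE:
# drain time of the workload already queued at the resource, plus the
# average delay observed for tasks it has already executed.

def estimate_exec_delay(queued_workloads, capacity, avg_past_delay):
    """queued_workloads: workloads already assigned to the resource;
    capacity: its computation capacity (workload units per second);
    avg_past_delay: average delay of previously executed tasks."""
    queue_drain = sum(queued_workloads) / capacity
    return queue_drain + avg_past_delay

print(estimate_exec_delay([100.0, 200.0], 10.0, 5.0))  # 35.0
```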

23 Time 3. Total-Cost (TotalCost) algorithm: we select the replicas and the DC site that minimize the total task delay; that is, the algorithm combines the two algorithms above (ConsCost and ExecCost).

24 Traffic Smallest-Data Transfer (SmallTrans) algorithm: We select the DC site for which the smallest number of datasets (or the datasets with the smallest total size) need to be consolidated for the task’s execution.
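A minimal sketch of the size-based variant of SmallTrans (all names illustrative): pick the DC site that minimizes the total size of required datasets it does not already hold.

```python
# SmallTrans sketch: the best DC site is the one missing the least
# data, so consolidation moves as few bytes as possible.

def small_trans_site(candidates, holdings, required_sizes):
    """holdings: site -> set of datasets stored there;
    required_sizes: dataset -> size in MB."""
    def missing_size(site):
        return sum(size for ds, size in required_sizes.items()
                   if ds not in holdings[site])
    return min(candidates, key=missing_size)

holdings = {"A": {"I1"}, "B": {"I1", "I2"}, "C": set()}
required = {"I1": 5000, "I2": 5000, "I3": 5000}
print(small_trans_site(["A", "B", "C"], holdings, required))  # B
```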

25 Random 1. Random-Random (Rand) algorithm: the data replicas used by the task and the DC site are chosen randomly. 2. Random-Origin (RandOrig) algorithm: the data replicas used by the task are chosen randomly, and the DC site is the site where the task originated.

26 Outline Introduction Previous Work Problem Formulation Data Consolidation Techniques Simulation Conclusion

27 Simulation We use the NSFNET topology, which contains:  14 nodes; only 5 nodes are equipped with a computation and a storage resource (such nodes are called sites), and each site has equal storage and computation capacity. One additional node acts as a Tier 0 site and holds all the datasets.  21 links; all link capacities are equal to 1 Gbps.

28 NSFNET topology

29 Assumptions  Only one transmission is possible at a time over a link.  … is not taken into account.  50 datasets exist in the network initially. Two copies exist of each dataset: one is distributed among the 5 sites, the other is placed at the Tier 0 site.  In each experiment, users generate a total of 50,000 tasks.  We keep the average total data size S = L · I constant at 15000 MB (L: number of datasets a task requests; I: the average size of each dataset) and examine the following (L, I) pair values: (2, 7500), (3, 5000), (4, 3750), (6, 2500), (8, 1875), (10, 1500).  The workload of a task correlates with the average total data size (a is a parameter such that tasks are more data-intensive as a decreases).
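The listed (L, I) pairs all keep the average total data size S = L · I at 15000 MB, which a one-liner confirms:

```python
# Check that every (L, I) pair from the slide gives S = L * I = 15000 MB.
pairs = [(2, 7500), (3, 5000), (4, 3750), (6, 2500), (8, 1875), (10, 1500)]
assert all(L * I == 15000 for L, I in pairs)
print("all pairs give S = 15000 MB")
```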

30 Simulations DC probability: The probability that the DC site will not have all the required datasets

31 Simulations Task delay: the time between a task’s creation and its completion.

32 Simulations Network load depends on: 1. the size of datasets transferred. 2. the number of hops these datasets traverse.
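The network-load metric described above reduces to a size-times-hops sum (the numbers below are hypothetical):

```python
# Network load as described on the slide: each transfer contributes
# its dataset size multiplied by the number of hops it traverses.

def network_load(transfers):
    """transfers: iterable of (size_in_MB, hop_count) tuples."""
    return sum(size * hops for size, hops in transfers)

# e.g. 5000 MB over 2 hops plus 2500 MB over 3 hops
print(network_load([(5000, 2), (2500, 3)]))  # 17500
```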

33 Simulations

34

35 Outline Introduction Previous Work Problem Formulation Data Consolidation Techniques Simulation Conclusion

36 Conclusion If DC is performed efficiently, important benefits can be obtained in terms of task delay, network load, and other performance parameters of interest.

