A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems
Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen
Swinburne University of Technology, Melbourne, Australia
Outline > Part 1: Introduction to our Work > Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems
Part 1: Introduction to our Work >SwinDeW Workflow Series >SwinCloud System
SwinDeW Workflow Series
SwinDeW (Swinburne Decentralised Workflow) – foundation prototype based on P2P
–SwinDeW – past
–SwinDeW-S (for Services) – past
–SwinDeW-B (for BPEL4WS) – past
–SwinDeW-G (for Grid) – past
–SwinDeW-A (for Agents) – current
–SwinDeW-V (for Verification) – current
–SwinDeW-C (for Cloud) – current
SwinCloud
Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems >A Motivating Example and Problem Analysis >Important Concepts and Cost Model of Datasets Storage in the Cloud >A Cost-Effective Datasets Storage Strategy for Scientific Cloud Workflow Systems >Evaluation and Conclusion
Part 2: A Cost-Effective Data Storage Strategy >A Motivating Example and Problem Analysis
A Motivating Example >Parkes radio telescope and pulsar survey >Pulsar searching workflow
A Motivating Example >Current storage strategy –Delete all the intermediate data due to storage limitations >Some intermediate data should be stored; some need not be.
A Motivating Example >Scientific cloud workflow systems –a scientific workflow system running in the Cloud –Storage is no longer the bottleneck Large data centres Unlimited storage resources with a pay-for-use model –Data products can be shared easily All the data are managed in the data centres Internet-based access and SOA
Problem Analysis >Which datasets should be stored? –Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006] –Datasets should be stored based on the trade-off between computation cost and storage cost –Scientific workflows are very complex and there are dependencies among datasets –Furthermore, a single scientist can no longer decide the storage status of a dataset alone >A cost-effective datasets storage strategy is needed
Part 2: A Cost-Effective Data Storage Strategy >Important Concepts and Cost Model of Datasets Storage in the Cloud
Intermediate Data Dependency Graph (IDG) >A classification of the application data –Input data (original) and intermediate data (generated data) >Data provenance –A kind of metadata that records how data are generated >IDG –Built from data provenance; records the generation dependencies among the intermediate datasets
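As a minimal sketch, an IDG can be represented as an adjacency map built from provenance records; the dataset names below are invented for illustration and are not from the pulsar workflow.

```python
# A toy IDG: each intermediate dataset maps to the datasets directly
# generated from it. All names are hypothetical.
idg = {
    "raw_beam":    ["dedispersed"],
    "dedispersed": ["seek_result"],
    "seek_result": ["candidates"],
    "candidates":  [],
}

def predecessors(graph, target):
    """Direct predecessors of a dataset, recovered from the provenance edges."""
    return [d for d, succs in graph.items() if target in succs]
```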
Datasets Storage Cost Model >Cost = C + S –Cost: total cost of managing intermediate datasets –C: total cost of computation resources –S: total cost of storage resources >We use CostC (USD per time unit) and CostS (USD per time unit per unit of data size) to denote the prices of computation resources and storage resources
IDG with Cost Model >A dataset d_i in the IDG has the attributes: –size: size of d_i –flag: denotes the storage status of d_i –t_p: time to produce d_i from its direct predecessors –t: usage rate of d_i in the system –pSet: set of deleted datasets linked to d_i –fSet: set of deleted datasets linked by d_i –CostR: d_i's cost rate
IDG with Cost Model >Generation cost of d_i: genCost(d_i) = CostC × (d_i.t_p + Σ_{d_j ∈ d_i.pSet} d_j.t_p), i.e. the computation cost of regenerating d_i together with the deleted datasets it depends on >If d_i's storage status changes, the generation cost of all the datasets in d_i.fSet will be affected by genCost(d_i)
IDG with Cost Model >CostR: d_i's cost rate, i.e. the average cost per time unit of the dataset d_i in the system –If d_i is a stored dataset: d_i.CostR = d_i.size × CostS –If d_i is a deleted dataset: d_i.CostR = genCost(d_i) × d_i.t >The total cost rate of the system is: SCR = Σ_{d_i ∈ IDG} d_i.CostR >Given a time duration [T_0, T_n], the total system cost is the system cost rate summed over the duration: Total_Cost = ∫_{T_0}^{T_n} SCR dt
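The cost-rate definitions above can be sketched in Python. The `Dataset` class and the price constants are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

COST_C = 0.10   # price of computation, USD per time unit (assumed value)
COST_S = 0.15   # price of storage, USD per GB per time unit (assumed value)

@dataclass
class Dataset:
    size: float                 # size of the dataset, in GB
    t_p: float                  # time to produce it from direct predecessors
    t: float                    # usage rate (uses per time unit)
    stored: bool                # storage status flag
    pSet: list = field(default_factory=list)  # deleted datasets needed to regenerate it

def gen_cost(d: Dataset) -> float:
    """genCost(d_i): computation price times the total time to regenerate
    d_i and the deleted datasets in d_i.pSet."""
    return COST_C * (d.t_p + sum(p.t_p for p in d.pSet))

def cost_rate(d: Dataset) -> float:
    """CostR: average cost per time unit of keeping d_i in the system."""
    if d.stored:
        return d.size * COST_S      # pay for storage continuously
    return gen_cost(d) * d.t        # pay a regeneration cost on every use
```

Summing `cost_rate` over all datasets in the IDG gives the total cost rate SCR of the system.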
Part 2: A Cost-Effective Data Storage Strategy >A cost-effective strategy for intermediate data storage in scientific cloud workflow systems
Intermediate data storage strategy >Algorithm 1: deciding newly generated intermediate datasets’ storage status >Algorithm 2: managing stored intermediate datasets >Algorithm 3: deciding the regenerated intermediate datasets’ storage status
Algorithm 1 >Suppose d_0 is a newly generated intermediate dataset >First, we add its information to the IDG >Next, we decide whether d_0 needs to be stored by comparing its two possible cost rates: store d_0 iff d_0.size × CostS < genCost(d_0) × d_0.t
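Algorithm 1's comparison can be sketched as a single predicate; the parameter names and default price are assumptions for illustration.

```python
def should_store_new(size_gb, gen_cost, usage_rate, cost_s=0.15):
    """Store a newly generated dataset iff its storage cost rate
    (size * CostS) is lower than the regeneration cost rate it would
    incur if deleted (genCost * usage rate)."""
    return size_gb * cost_s < gen_cost * usage_rate
```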
Algorithm 2 >Suppose d_0 is a stored dataset >We set a threshold time t_θ for d_0 as the frequency at which to check d_0's storage status, where t_θ = genCost(d_0) / (d_0.size × CostS), i.e. the time for the accumulated storage cost to reach one regeneration cost >To check whether d_0 still needs to be stored, we compare the storage cost accumulated since d_0's last use with genCost(d_0)
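A sketch of Algorithm 2's check, under the assumption that the threshold is the time at which the accumulated storage cost equals one regeneration cost; names and the default price are illustrative.

```python
def threshold_time(size_gb, gen_cost, cost_s=0.15):
    """t_theta: the time t at which size * CostS * t equals genCost."""
    return gen_cost / (size_gb * cost_s)

def keep_stored(size_gb, gen_cost, time_since_last_use, cost_s=0.15):
    """Keep a stored dataset iff the storage cost paid since its last
    use has not yet reached the cost of one regeneration."""
    return time_since_last_use < threshold_time(size_gb, gen_cost, cost_s)
```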
Lemma and Theorem >Lemma: The deletion of a stored intermediate dataset d_i in the IDG does not affect the stored datasets adjacent to d_i >Theorem: If regenerated intermediate dataset d_i is stored, only the stored datasets adjacent to d_i in the IDG may need to be deleted to reduce the system cost
Algorithm 3 >Suppose d_0 is a regenerated dataset >We assume it should be stored, and calculate the potential cost benefit >Then we check whether the stored predecessor and successor datasets of d_0 still need to be stored, and accumulate the cost benefit >We calculate the final cost benefit to decide d_0's storage status: store d_0 iff the accumulated benefit (the reduction in the total system cost rate) is positive
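Algorithm 3's final decision can be sketched as accumulating local cost-rate changes. The inputs are assumed, simplified summaries of the IDG state (not the authors' data structures): the dataset's cost rates in both states, plus the saving each adjacent stored dataset would yield if deleted once d_0 is stored.

```python
def should_store_regenerated(deleted_rate, stored_rate, adjacent_savings):
    """Store a regenerated dataset iff the net reduction in the total
    system cost rate is positive. adjacent_savings holds, for each stored
    dataset adjacent to d_0 in the IDG, the (possibly negative) saving from
    deleting it once d_0 is stored; only beneficial deletions are counted."""
    benefit = (deleted_rate - stored_rate) + sum(s for s in adjacent_savings if s > 0)
    return benefit > 0
```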
Part 2: A Cost-Effective Data Storage Strategy > Evaluation and Conclusion
Evaluation >IDG of the pulsar searching workflow >Adopt Amazon’s cost model (EC2+S3): –$0.15 per Gigabyte per month for the storage resources. –$0.1 per CPU hour for the computation resources.
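With these prices, the storage-versus-regeneration trade-off is simple arithmetic; the dataset figures below are hypothetical, not from the pulsar workflow.

```python
COST_S = 0.15   # USD per GB per month (S3 price from the slides)
COST_C = 0.10   # USD per CPU hour (EC2 price from the slides)

# Hypothetical dataset: 90 GB, 20 CPU hours to regenerate, used once a month.
storage_rate = 90 * COST_S          # 13.5 USD/month if kept stored
regen_rate   = 20 * COST_C * 1      # 2.0 USD/month if deleted and regenerated on use
# Here deletion wins; a smaller or more frequently used dataset would flip this.
```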
Evaluation >Simulation strategies: 1) Store all the datasets; 2) Delete all the datasets; 3) Store high generation cost datasets; 4) Store often used datasets; 5) Dependency based strategy.
Conclusion and Future Work >Conclusion –Our strategy is cost-effective! –Based on datasets’ cost rates –Considered the dependencies among datasets >Future work –Data placement –Minimum cost benchmark
End >Questions?