A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems
Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen
Swinburne University of Technology, Melbourne, Australia
Outline > Part 1: Introduction to our Work > Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems
Part 1: Introduction to our Work >SwinDeW Workflow Series >SwinCloud System
SwinDeW Workflow Series
SwinDeW (Swinburne Decentralised Workflow) – foundation prototype based on P2P
–SwinDeW – past
–SwinDeW-S (for Services) – past
–SwinDeW-B (for BPEL4WS) – past
–SwinDeW-G (for Grid) – past
–SwinDeW-A (for Agents) – current
–SwinDeW-V (for Verification) – current
–SwinDeW-C (for Cloud) – current
SwinCloud
Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems >A Motivating Example and Problem Analysis >Important Concepts and Cost Model of Datasets Storage in the Cloud >A Cost-Effective Datasets Storage Strategy for Scientific Cloud Workflow Systems >Evaluation and Conclusion
Part 2: A Cost-Effective Data Storage Strategy >A Motivating Example and Problem Analysis
A Motivating Example >Parkes radio telescope and pulsar survey >Pulsar searching workflow
A Motivating Example >Current storage strategy –Delete all the intermediate data due to storage limitations >Some intermediate data should be stored; some need not be.
A Motivating Example >Scientific cloud workflow systems –a scientific workflow system running in the Cloud –Storage is no longer the bottleneck Large data centres Unlimited storage resources with a pay-for-use model –Data products can be shared easily All the data are managed in the data centres Internet-based access and SOA
Problem Analysis >Which datasets should be stored? –Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006] –Datasets should be stored based on the trade-off between computation cost and storage cost –Scientific workflows are very complex and there are dependencies among datasets –Furthermore, a single scientist can no longer decide the storage status of a dataset alone >A cost-effective datasets storage strategy is needed
Part 2: A Cost-Effective Data Storage Strategy >Important Concepts and Cost Model of Datasets Storage in the Cloud
Intermediate Data Dependency Graph (IDG) >A classification of the application data –Input data (original) and intermediate data (generated data) >Data provenance –A kind of metadata that records how data are generated >IDG –Built from data provenance; records the generation dependencies among the intermediate datasets
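As a minimal sketch, an IDG can be represented as an adjacency map built from provenance records; the dataset names below are invented for illustration and are not from the pulsar workflow.

```python
# A toy IDG: each intermediate dataset maps to the datasets directly
# generated from it. All names are hypothetical.
idg = {
    "raw_beam":    ["dedispersed"],
    "dedispersed": ["seek_result"],
    "seek_result": ["candidates"],
    "candidates":  [],
}

def predecessors(graph, target):
    """Direct predecessors of a dataset, recovered from the provenance edges."""
    return [d for d, succs in graph.items() if target in succs]
```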
Datasets Storage Cost Model >Cost = C + S –Cost: total cost of managing intermediate datasets –C: total cost of computation resources –S: total cost of storage resources >We use CostC (USD per time unit) and CostS (USD per time unit per unit of data size) to denote the prices of computation resources and storage resources
IDG with Cost Model >A dataset d_i in the IDG has the attributes: –size: size of d_i –flag: denotes the storage status of d_i –t_p: time to produce d_i from its direct predecessors –t: usage rate of d_i in the system –pSet: set of deleted datasets linked to d_i –fSet: set of deleted datasets linked by d_i –CostR: d_i's cost rate
IDG with Cost Model >Generation cost of d_i: genCost(d_i) = CostC × (d_i.t_p + Σ_{d_j ∈ d_i.pSet} d_j.t_p), i.e. the computation cost of regenerating d_i together with the deleted datasets it depends on >If d_i's storage status changes, the generation cost of all the datasets in d_i.fSet will be affected by genCost(d_i)
IDG with Cost Model >CostR: d_i's cost rate, i.e. the average cost per time unit of the dataset d_i in the system –If d_i is a stored dataset: d_i.CostR = d_i.size × CostS –If d_i is a deleted dataset: d_i.CostR = genCost(d_i) × d_i.t >The total cost rate of the system is: SCR = Σ_{d_i ∈ IDG} d_i.CostR >Given a time duration [T_0, T_n], the total system cost is the system cost rate summed over the duration: Total_Cost = ∫_{T_0}^{T_n} SCR dt
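The cost-rate definitions above can be sketched in Python. The `Dataset` class and the price constants are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

COST_C = 0.10   # price of computation, USD per time unit (assumed value)
COST_S = 0.15   # price of storage, USD per GB per time unit (assumed value)

@dataclass
class Dataset:
    size: float                 # size of the dataset, in GB
    t_p: float                  # time to produce it from direct predecessors
    t: float                    # usage rate (uses per time unit)
    stored: bool                # storage status flag
    pSet: list = field(default_factory=list)  # deleted datasets needed to regenerate it

def gen_cost(d: Dataset) -> float:
    """genCost(d_i): computation price times the total time to regenerate
    d_i and the deleted datasets in d_i.pSet."""
    return COST_C * (d.t_p + sum(p.t_p for p in d.pSet))

def cost_rate(d: Dataset) -> float:
    """CostR: average cost per time unit of keeping d_i in the system."""
    if d.stored:
        return d.size * COST_S      # pay for storage continuously
    return gen_cost(d) * d.t        # pay a regeneration cost on every use
```

Summing `cost_rate` over all datasets in the IDG gives the total cost rate SCR of the system.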
Part 2: A Cost-Effective Data Storage Strategy >A cost-effective strategy for intermediate data storage in scientific cloud workflow systems
Intermediate data storage strategy >Algorithm 1: deciding newly generated intermediate datasets’ storage status >Algorithm 2: managing stored intermediate datasets >Algorithm 3: deciding the regenerated intermediate datasets’ storage status
Algorithm 1 >Suppose d_0 is a newly generated intermediate dataset >First, we add its information to the IDG >Next, we decide whether d_0 needs to be stored by comparing its two possible cost rates: store d_0 iff d_0.size × CostS < genCost(d_0) × d_0.t
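Algorithm 1's comparison can be sketched as a single predicate; the parameter names and default price are assumptions for illustration.

```python
def should_store_new(size_gb, gen_cost, usage_rate, cost_s=0.15):
    """Store a newly generated dataset iff its storage cost rate
    (size * CostS) is lower than the regeneration cost rate it would
    incur if deleted (genCost * usage rate)."""
    return size_gb * cost_s < gen_cost * usage_rate
```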
Algorithm 2 >Suppose d_0 is a stored dataset >We set a threshold time t_θ for d_0 as the frequency at which to check d_0's storage status, where t_θ = genCost(d_0) / (d_0.size × CostS), i.e. the time for the accumulated storage cost to reach one regeneration cost >To check whether d_0 still needs to be stored, we compare the storage cost accumulated since d_0's last use with genCost(d_0)
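A sketch of Algorithm 2's check, under the assumption that the threshold is the time at which the accumulated storage cost equals one regeneration cost; names and the default price are illustrative.

```python
def threshold_time(size_gb, gen_cost, cost_s=0.15):
    """t_theta: the time t at which size * CostS * t equals genCost."""
    return gen_cost / (size_gb * cost_s)

def keep_stored(size_gb, gen_cost, time_since_last_use, cost_s=0.15):
    """Keep a stored dataset iff the storage cost paid since its last
    use has not yet reached the cost of one regeneration."""
    return time_since_last_use < threshold_time(size_gb, gen_cost, cost_s)
```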
Lemma and Theorem >Lemma: The deletion of a stored intermediate dataset d_i in the IDG does not affect the stored datasets adjacent to d_i >Theorem: If regenerated intermediate dataset d_i is stored, only the stored datasets adjacent to d_i in the IDG may need to be deleted to reduce the system cost
Algorithm 3 >Suppose d_0 is a regenerated dataset >We assume it should be stored, and calculate the potential cost benefit >Then we check whether the stored predecessor and successor datasets of d_0 still need to be stored, and accumulate the cost benefit >We calculate the final cost benefit to decide d_0's storage status: store d_0 iff the accumulated benefit (the reduction in the total system cost rate) is positive
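Algorithm 3's final decision can be sketched as accumulating local cost-rate changes. The inputs are assumed, simplified summaries of the IDG state (not the authors' data structures): the dataset's cost rates in both states, plus the saving each adjacent stored dataset would yield if deleted once d_0 is stored.

```python
def should_store_regenerated(deleted_rate, stored_rate, adjacent_savings):
    """Store a regenerated dataset iff the net reduction in the total
    system cost rate is positive. adjacent_savings holds, for each stored
    dataset adjacent to d_0 in the IDG, the (possibly negative) saving from
    deleting it once d_0 is stored; only beneficial deletions are counted."""
    benefit = (deleted_rate - stored_rate) + sum(s for s in adjacent_savings if s > 0)
    return benefit > 0
```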
Part 2: A Cost-Effective Data Storage Strategy > Evaluation and Conclusion
Evaluation >IDG of the pulsar searching workflow >Adopt Amazon’s cost model (EC2+S3): –$0.15 per Gigabyte per month for the storage resources. –$0.1 per CPU hour for the computation resources.
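With these prices, the storage-versus-regeneration trade-off is simple arithmetic; the dataset figures below are hypothetical, not from the pulsar workflow.

```python
COST_S = 0.15   # USD per GB per month (S3 price from the slides)
COST_C = 0.10   # USD per CPU hour (EC2 price from the slides)

# Hypothetical dataset: 90 GB, 20 CPU hours to regenerate, used once a month.
storage_rate = 90 * COST_S          # 13.5 USD/month if kept stored
regen_rate   = 20 * COST_C * 1      # 2.0 USD/month if deleted and regenerated on use
# Here deletion wins; a smaller or more frequently used dataset would flip this.
```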
Evaluation >Simulation strategies: 1) Store all the datasets; 2) Delete all the datasets; 3) Store high generation cost datasets; 4) Store often used datasets; 5) Dependency based strategy.
Conclusion and Future Work >Conclusion –Our strategy is cost-effective! –Based on datasets’ cost rates –Considered the dependencies among datasets >Future work –Data placement –Minimum cost benchmark
End >Questions?