A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
Many slides from the authors’ presentation at CLOUD 2011
Presenter: Guangdong Liu
Mar 13th, 2012
Dec 8th, 2011
Outline
Introduction
A Motivating Example
Problem Analysis
Important Concepts and Cost Model of Datasets Storage in the Cloud
A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
Evaluation and Simulation
Introduction
Scientific applications
–Computation and data intensive
    Generated datasets: terabytes or even petabytes in size
    Huge computation: e.g. scientific workflows
–Intermediate data: important!
    Reuse or reanalysis
    Sharing between institutions
    Regeneration vs. storing
Introduction
Cloud computing
–A new way of deploying scientific applications
–Pay-as-you-go model
Storing strategy
–Which generated datasets should be stored?
–Tradeoff between cost and user preference
–Cost-effective strategy
A Motivating Example
Parkes radio telescope and pulsar survey
Pulsar searching workflow
A Motivating Example
Current storage strategy
–Delete all the intermediate data, due to storage limitations
Some intermediate data should be stored
Some need not be
Problem Analysis
Which datasets should be stored?
–Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006]
–Different strategies correspond to different costs
–Scientific workflows are very complex and there are dependencies among datasets
–Furthermore, a single scientist can no longer decide the storage status of a dataset
–Data accessing delay
–Datasets should be stored based on the trade-off between computation cost and storage cost
A cost-effective datasets storage strategy is needed
Important Concepts
Data Dependency Graph (DDG)
–A classification of the application data
    Original data and generated data
–Data provenance
    A kind of metadata that records how data are generated
–DDG
Important Concepts
Attributes of a Dataset in DDG
–A dataset d_i in the DDG has the attributes:
    x_i ($) denotes the generation cost of dataset d_i from its direct predecessors.
    y_i ($/t) denotes the cost of storing dataset d_i in the system per time unit.
    f_i (Boolean) is a flag that denotes whether dataset d_i is stored or deleted in the system.
    v_i (Hz) denotes the usage frequency, which indicates how often d_i is used.
Important Concepts
Attributes of a Dataset in DDG
–provSet_i denotes the set of stored provenances that are needed when regenerating dataset d_i.
–CostR_i ($/t) is d_i’s cost rate, which means the average cost per time unit of d_i in the system.
Cost = C + S
–C: total cost of computation resources
–S: total cost of storage resources
Cost Model of Datasets Storage in the Cloud
Total cost rate of a DDG:
–S is the storage strategy of the DDG
For a DDG with n datasets, there are 2^n different storage strategies
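The size of the strategy space can be illustrated with a minimal sketch that enumerates all 2^n strategies of a small linear DDG. The cost rule used here is an assumption, not taken from the slides: a stored dataset is charged y_i per time unit, and a deleted dataset is charged its regeneration cost from the nearest stored predecessor multiplied by v_i. The names `Dataset`, `total_cost_rate`, and `brute_force` are hypothetical, and the original data before the first dataset is assumed to be always available.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Dataset:
    x: float  # generation cost from the direct predecessor ($)
    y: float  # storage cost per time unit ($/t)
    v: float  # usage frequency (uses per time unit)


def total_cost_rate(ddg, stored):
    """Total cost rate of a linear DDG under one storage strategy.

    Assumed rule: stored datasets cost y; deleted datasets cost
    (generation cost accumulated since the last stored dataset) * v.
    """
    total = 0.0
    regen = 0.0  # generation cost accumulated since the last stored dataset
    for d, keep in zip(ddg, stored):
        if keep:
            total += d.y
            regen = 0.0
        else:
            regen += d.x
            total += regen * d.v
    return total


def brute_force(ddg):
    """Try all 2^n strategies and return (minimum cost rate, strategy)."""
    return min(
        (total_cost_rate(ddg, s), s)
        for s in product([False, True], repeat=len(ddg))
    )


ddg = [Dataset(10, 1, 0.5), Dataset(20, 4, 0.1), Dataset(5, 2, 1.0)]
cost, strategy = brute_force(ddg)
print(cost, strategy)
```

Exhaustive search is only practical for very small n, since the number of strategies doubles with each additional dataset.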
CTT-SP Algorithm
To find the minimum cost storage strategy for a DDG
Philosophy of the algorithm:
–Construct a Cost Transitive Tournament (CTT) based on the DDG.
    In the CTT, the paths (from the start dataset to the end dataset) have a one-to-one mapping to the storage strategies of the DDG
    The length of each path equals the total cost rate of the corresponding storage strategy.
–The Shortest Path (SP) represents the minimum cost storage strategy
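The path-to-strategy mapping can be sketched for the simplest case, a linear DDG. This is an illustration under stated assumptions rather than the authors' implementation: virtual start and end datasets with zero cost are added, and the edge from d_i to d_j (i < j) is assumed to mean "store d_j and delete everything strictly between d_i and d_j", with weight y_j plus the regeneration cost rates of the deleted datasets. The function name `ctt_sp_linear` is hypothetical.

```python
from math import inf


def ctt_sp_linear(x, y, v):
    """Minimum cost rate and stored datasets (1-indexed) of a linear DDG.

    x, y, v are lists of generation cost, storage cost rate, and usage
    frequency for d_1..d_n. The edge weights are an assumed cost rule.
    """
    n = len(x)
    # Index 0 is a virtual start dataset, index n + 1 a virtual end dataset.
    X = [0.0] + list(x) + [0.0]
    Y = [0.0] + list(y) + [0.0]
    V = [0.0] + list(v) + [0.0]

    def weight(i, j):
        # Store d_j; datasets d_{i+1}..d_{j-1} are deleted and regenerated
        # from d_i whenever they are used.
        w = Y[j]
        regen = 0.0
        for k in range(i + 1, j):
            regen += X[k]
            w += regen * V[k]
        return w

    # Shortest path over the DAG with an edge for every pair i < j.
    best = [inf] * (n + 2)
    prev = [None] * (n + 2)
    best[0] = 0.0
    for j in range(1, n + 2):
        for i in range(j):
            c = best[i] + weight(i, j)
            if c < best[j]:
                best[j], prev[j] = c, i

    # Walk back from the virtual end; every real node on the path is stored.
    stored = []
    j = n + 1
    while prev[j] is not None:
        j = prev[j]
        if j != 0:
            stored.append(j)
    return best[n + 1], sorted(stored)


print(ctt_sp_linear([10, 20, 5], [1, 4, 2], [0.5, 0.1, 1.0]))
```

Each start-to-end path visits exactly the datasets that are stored, which is why paths and storage strategies correspond one to one in this sketch.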
CTT-SP Algorithm Example
The weights of cost edges:
A Local-Optimization based Datasets Storage Strategy
Requirements of Storage Strategy
–Efficiency and Scalability
    The strategy is used at runtime in the cloud and the DDG may be large
    The strategy itself takes computation resources
–Reflect users’ preference and data accessing delay
    Users may want to store some datasets
    Users may have certain tolerance of data accessing delay
A Local-Optimization based Datasets Storage Strategy
Introduce two new attributes of the datasets in the DDG to represent users’ accessing delay tolerance:
–T_i is a duration of time that denotes users’ tolerance of dataset d_i’s accessing delay
–λ_i is a parameter that denotes users’ cost-related tolerance of dataset d_i’s accessing delay; it is a value between 0 and 1
A Local-Optimization based Datasets Storage Strategy
Efficiency and Scalability
–A general DDG is very complex. The computation complexity of the CTT-SP algorithm is O(n^9), which is neither efficient nor scalable enough for large DDGs
    Partition the large DDG into small linear segments
    Apply the CTT-SP algorithm to the linear DDG segments in order to guarantee a localized optimum
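One possible way to obtain linear segments is to cut the DDG at every branch and merge point. The sketch below assumes exactly that partition rule, which the slides do not spell out, and represents the DDG as a hypothetical successor map; `linear_segments` is an illustrative name, and the input is assumed to be acyclic.

```python
def linear_segments(succ):
    """Split a DAG into maximal chains without branches or merges.

    succ maps each dataset name to the list of its direct successors.
    Assumed rule: a chain continues from u to w only if w is the sole
    successor of u and u is the sole predecessor of w.
    """
    nodes = sorted(set(succ) | {w for ws in succ.values() for w in ws})
    preds = {n: [] for n in nodes}
    for u, ws in succ.items():
        for w in ws:
            preds[w].append(u)

    def follows_predecessor(n):
        # True when n simply extends the chain of its only predecessor.
        return len(preds[n]) == 1 and len(succ.get(preds[n][0], [])) == 1

    segments = []
    for head in nodes:
        if follows_predecessor(head):
            continue  # already covered by the segment of its predecessor
        seg = [head]
        while True:
            outs = succ.get(seg[-1], [])
            if len(outs) == 1 and follows_predecessor(outs[0]):
                seg.append(outs[0])
            else:
                break
        segments.append(seg)
    return segments


ddg = {
    "d1": ["d2"],
    "d2": ["d3", "d5"],  # branch
    "d3": ["d4"],
    "d4": ["d6"],
    "d5": ["d6"],        # d6 is a merge point
    "d6": ["d7"],
}
print(linear_segments(ddg))
```

Each returned segment is small and linear, so a per-segment optimum can be computed independently; the combined result is a local optimum, not necessarily the global one.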
Evaluation
Use randomly generated DDGs for simulation
–Size: randomly distributed from 100 GB to 1 TB
–Generation time: randomly distributed from 1 hour to 10 hours
–Usage frequency: randomly distributed from 1 day to 10 days (time between every usage)
–Users’ delay tolerance (T_i): randomly distributed from 10 hours to one day
–Cost parameter (λ_i): randomly distributed from 0.7 to 1 for every dataset in the DDG
Adopt Amazon cloud services’ price model (EC2+S3):
–$0.15 per gigabyte per month for the storage resources
–$0.1 per CPU hour for the computation resources
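To give a feel for the magnitudes these parameters produce, the following sketch converts the two prices into daily cost rates for a single dataset. It assumes a 30-day month, one CPU instance per regeneration, and a dataset that is regenerated directly from a stored predecessor; the function names are hypothetical.

```python
STORAGE_PRICE_PER_GB_MONTH = 0.15  # $ per GB per month, from the slide
CPU_PRICE_PER_HOUR = 0.10          # $ per CPU hour, from the slide
DAYS_PER_MONTH = 30                # assumption for converting to a daily rate


def storage_cost_per_day(size_gb):
    """Daily cost of keeping a dataset of the given size stored."""
    return size_gb * STORAGE_PRICE_PER_GB_MONTH / DAYS_PER_MONTH


def regeneration_cost_per_day(generation_hours, days_between_uses):
    """Average daily cost of regenerating a deleted dataset on each use."""
    return generation_hours * CPU_PRICE_PER_HOUR / days_between_uses


# A mid-range dataset: 500 GB, 5 hours to generate, used every 5 days.
print(storage_cost_per_day(500))
print(regeneration_cost_per_day(5, 5))
```

For this mid-range dataset, storing costs about $2.50 per day while regenerating on demand costs about $0.10 per day, so in isolation deletion is cheaper. Regeneration cost grows when predecessors are also deleted, which is why the decision has to be made over the whole DDG.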
Evaluation
Compare different storage strategies with the proposed strategy
–Usage based strategy
–Generation cost based strategy
–Cost rate based strategy
Thanks