A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud Many slides from authors’ presentation.


A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
Many slides from the authors' presentation at CLOUD 2011
Presenter: Guagndong Liu
Mar 13th, 2012

Dec 8th, 2011
Outline
  – Introduction
  – A Motivating Example
  – Problem Analysis
  – Important Concepts and Cost Model of Datasets Storage in the Cloud
  – A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
  – Evaluation and Simulation

Introduction
Scientific applications
  – Computation and data intensive
      Generated data sets: terabytes or even petabytes in size
      Huge computation: e.g. scientific workflow
  – Intermediate data: important!
      Reuse or reanalyze
      For sharing between institutions
      Regeneration vs storing

Introduction
Cloud computing
  – A new way of deploying scientific applications
  – Pay-as-you-go model
Storing strategy
  – Which generated datasets should be stored?
  – Tradeoff between cost and user preference
  – Cost-effective strategy

A Motivating Example
Parkes radio telescope and pulsar survey
Pulsar searching workflow

A Motivating Example
Current storage strategy
  – Delete all the intermediate data, due to storage limitation
Some intermediate data should be stored
Some need not

Problem Analysis
Which datasets should be stored?
  – Data challenge: data volume doubles every year over the next decade and beyond [Szalay et al., Nature, 2006]
  – Different strategies correspond to different costs
  – Scientific workflows are very complex and there are dependencies among datasets
  – Furthermore, a single scientist can no longer decide the storage status of a dataset
  – Data accessing delay
  – Datasets should be stored based on the trade-off between computation cost and storage cost
A cost-effective datasets storage strategy is needed

Important Concepts
Data Dependency Graph (DDG)
  – A classification of the application data
      Original data and generated data
  – Data provenance
      A kind of meta-data that records how data are generated
  – DDG

Important Concepts
Attributes of a Dataset in DDG
  – A dataset d_i in DDG has the attributes:
      x_i ($) denotes the generation cost of dataset d_i from its direct predecessors.
      y_i ($/t) denotes the cost of storing dataset d_i in the system per time unit.
      f_i (Boolean) is a flag that denotes whether dataset d_i is stored or deleted in the system.
      v_i (Hz) denotes the usage frequency, which indicates how often d_i is used.

Important Concepts
Attributes of a Dataset in DDG
  – provSet_i denotes the set of stored provenances that are needed when regenerating dataset d_i.
  – CostR_i ($/t) is d_i's cost rate, i.e. the average cost per time unit of d_i in the system.
Cost = C + S
  – C: total cost of computation resources
  – S: total cost of storage resources
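The attributes above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the names `Dataset`, `gen_cost` and `cost_rate` are hypothetical, the DDG is assumed to be a linear chain, and the cost-rate rule (y_i when stored, regeneration cost times v_i when deleted, with regeneration starting from the nearest stored predecessor) is an assumption that the slides do not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    x: float      # x_i: generation cost from direct predecessors ($)
    y: float      # y_i: storage cost per time unit ($/t)
    v: float      # v_i: usage frequency (uses per time unit)
    stored: bool  # f_i: True if the dataset is kept in the system


def gen_cost(chain: List[Dataset], i: int) -> float:
    """Cost of regenerating chain[i] (assumed rule): its own x plus the x
    of every consecutive deleted predecessor back to the nearest stored one."""
    cost = chain[i].x
    j = i - 1
    while j >= 0 and not chain[j].stored:
        cost += chain[j].x
        j -= 1
    return cost


def cost_rate(chain: List[Dataset], i: int) -> float:
    """CostR_i (assumed rule): storage rate if stored,
    otherwise regeneration cost times usage frequency."""
    d = chain[i]
    return d.y if d.stored else gen_cost(chain, i) * d.v


if __name__ == "__main__":
    chain = [
        Dataset(x=10, y=1, v=0.1, stored=True),
        Dataset(x=20, y=2, v=0.1, stored=False),
        Dataset(x=30, y=3, v=0.5, stored=False),
    ]
    for i in range(len(chain)):
        print(i + 1, cost_rate(chain, i))
```

In this example the third dataset is expensive to leave deleted because regenerating it also requires regenerating its deleted predecessor.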

Cost Model of Datasets Storage in the Cloud
Total cost rate of a DDG: the sum of the cost rates CostR_i of all its datasets under a storage strategy S
  – S is the storage strategy of the DDG
For a DDG with n datasets, there are 2^n different storage strategies
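To make the 2^n search space concrete, the sketch below enumerates every storage strategy of a small linear DDG and picks the cheapest. It is only an illustration under stated assumptions: the function names are hypothetical, the DDG is a linear chain, and a deleted dataset is assumed to cost (accumulated generation cost since the last stored dataset) times its usage frequency.

```python
from itertools import product
from typing import Sequence, Tuple


def total_cost_rate(x: Sequence[float], y: Sequence[float],
                    v: Sequence[float], stored: Sequence[bool]) -> float:
    """Sum of per-dataset cost rates for one storage strategy (linear DDG)."""
    total = 0.0
    regen = 0.0  # generation cost accumulated since the last stored dataset
    for xi, yi, vi, fi in zip(x, y, v, stored):
        if fi:
            total += yi   # stored: pay the storage rate
            regen = 0.0
        else:
            regen += xi   # deleted: must be regenerated on every use
            total += regen * vi
    return total


def brute_force_best(x: Sequence[float], y: Sequence[float],
                     v: Sequence[float]) -> Tuple[Tuple[bool, ...], float]:
    """Try all 2^n strategies; feasible only for very small n."""
    strategies = product([False, True], repeat=len(x))
    best = min(strategies, key=lambda s: total_cost_rate(x, y, v, s))
    return best, total_cost_rate(x, y, v, best)


if __name__ == "__main__":
    x, y, v = [10, 20, 30], [1, 5, 3], [0.1, 0.1, 0.5]
    print(brute_force_best(x, y, v))
```

Exhaustive enumeration doubles in cost with every added dataset, which is why a more structured search is needed for realistic DDGs.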

CTT-SP Algorithm
To find the minimum cost storage strategy for a DDG
Philosophy of the algorithm:
  – Construct a Cost Transitive Tournament (CTT) based on the DDG.
      In the CTT, the paths (from the start to the end dataset) have a one-to-one mapping to the storage strategies of the DDG
      The length of each path equals the total cost rate of the corresponding storage strategy.
  – The Shortest Path (SP) represents the minimum cost storage strategy
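The idea can be sketched for a linear DDG as follows. This is a hedged illustration rather than the authors' code: the function name `ctt_sp_linear` is hypothetical, virtual start and end nodes are added, and the edge weight from d_i to d_j is assumed to be y_j plus the regeneration cost rate of every deleted dataset between them. Because the tournament is acyclic, the shortest path is computed here with a simple dynamic program in topological order instead of a general shortest-path routine.

```python
from typing import List, Sequence, Tuple


def ctt_sp_linear(x: Sequence[float], y: Sequence[float],
                  v: Sequence[float]) -> Tuple[List[int], float]:
    """Return (1-based indices of datasets to store, minimum total cost rate)."""
    n = len(x)
    end = n + 1  # node 0 is a virtual start, node n+1 a virtual end

    def weight(i: int, j: int) -> float:
        # Edge i -> j means: d_j is stored and d_{i+1} .. d_{j-1} are deleted.
        w = y[j - 1] if j <= n else 0.0
        regen = 0.0
        for k in range(i + 1, j):
            regen += x[k - 1]        # cost to regenerate d_k from d_i
            w += regen * v[k - 1]    # paid v_k times per time unit
        return w

    dist = [float("inf")] * (n + 2)
    prev = [-1] * (n + 2)
    dist[0] = 0.0
    for j in range(1, n + 2):        # nodes are already in topological order
        for i in range(j):
            cand = dist[i] + weight(i, j)
            if cand < dist[j]:
                dist[j], prev[j] = cand, i

    stored = []
    node = prev[end]
    while node > 0:                  # walk back along the shortest path
        stored.append(node)
        node = prev[node]
    return sorted(stored), dist[end]


if __name__ == "__main__":
    print(ctt_sp_linear([10, 20, 30], [1, 5, 3], [0.1, 0.1, 0.5]))
```

Each path from the virtual start to the virtual end visits exactly the stored datasets, so the path length is the total cost rate of that strategy.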

CTT-SP Algorithm Example
The weights of cost edges:

A Local-Optimization based Datasets Storage Strategy
Requirements of Storage Strategy
  – Efficiency and Scalability
      The strategy is used at runtime in the cloud and the DDG may be large
      The strategy itself takes computation resources
  – Reflect users' preference and data accessing delay
      Users may want to store some datasets
      Users may have certain tolerance of data accessing delay

A Local-Optimization based Datasets Storage Strategy
Introduce two new attributes of the datasets in DDG to represent users' accessing delay tolerance:
  – T_i is a duration of time that denotes users' tolerance of dataset d_i's accessing delay
  – λ_i is a parameter, with a value between 0 and 1, that denotes users' cost-related tolerance of dataset d_i's accessing delay


A Local-Optimization based Datasets Storage Strategy
Efficiency and Scalability
  – A general DDG is very complex. The computational complexity of the CTT-SP algorithm is O(n^9), which is neither efficient nor scalable enough for large DDGs
      Partition the large DDG into small linear segments
      Utilize the CTT-SP algorithm on linear DDG segments in order to guarantee a localized optimum
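One way to picture the partitioning step is shown below. The slides do not state the exact partition rule, so this sketch assumes a simple one: a segment is cut wherever a dataset has more than one successor or more than one predecessor. The function name `linear_segments` and the dictionary-based DDG representation are hypothetical.

```python
from typing import Dict, List


def linear_segments(succ: Dict[str, List[str]]) -> List[List[str]]:
    """Split an acyclic DDG (dataset -> list of successors) into linear chains."""
    nodes = set(succ) | {s for ss in succ.values() for s in ss}
    pred: Dict[str, List[str]] = {n: [] for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].append(n)

    def continues(n: str) -> bool:
        # True if n simply extends the chain of its single predecessor.
        return len(pred[n]) == 1 and len(succ.get(pred[n][0], [])) == 1

    segments = []
    for n in sorted(nodes):
        if continues(n):
            continue                 # n belongs to a segment started earlier
        seg, cur = [n], n
        while len(succ.get(cur, [])) == 1 and continues(succ[cur][0]):
            cur = succ[cur][0]
            seg.append(cur)
        segments.append(seg)
    return segments


if __name__ == "__main__":
    ddg = {
        "d1": ["d2"], "d2": ["d3", "d5"], "d3": ["d4"], "d5": ["d6"],
        "d4": ["d7"], "d6": ["d7"], "d7": ["d8"],
    }
    print(linear_segments(ddg))
```

Each resulting chain is small and linear, so an exact search can be run on it independently; the combined result is a local rather than a global optimum.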

Evaluation
Use randomly generated DDGs for simulation
  – Size: randomly distributed from 100GB to 1TB.
  – Generation time: randomly distributed from 1 hour to 10 hours
  – Usage frequency: randomly distributed from 1 day to 10 days (time between usages).
  – Users' delay tolerance (T_i): randomly distributed from 10 hours to one day
  – Cost parameter (λ_i): randomly distributed from 0.7 to 1 for every dataset in the DDG
Adopt Amazon cloud services' price model (EC2+S3):
  – $0.15 per Gigabyte per month for the storage resources.
  – $0.1 per CPU hour for the computation resources.
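The price model translates into dataset attributes with simple arithmetic. The sketch below uses the two prices quoted above; the function names, the 30-day month, and the assumption that a dataset is generated on a single CPU are illustrative choices, not part of the slides.

```python
STORAGE_PRICE_PER_GB_MONTH = 0.15  # $ per GB per month (from the slide)
CPU_PRICE_PER_HOUR = 0.10          # $ per CPU hour (from the slide)
DAYS_PER_MONTH = 30                # assumption used to align time units


def storage_cost_per_month(size_gb: float) -> float:
    """y_i: cost of keeping the dataset for one month."""
    return size_gb * STORAGE_PRICE_PER_GB_MONTH


def generation_cost(cpu_hours: float) -> float:
    """x_i: cost of generating the dataset once, assuming one CPU."""
    return cpu_hours * CPU_PRICE_PER_HOUR


def uses_per_month(days_between_uses: float) -> float:
    """v_i expressed per month from the time between usages."""
    return DAYS_PER_MONTH / days_between_uses


if __name__ == "__main__":
    size_gb, cpu_hours, days_between = 500, 5, 5
    y = storage_cost_per_month(size_gb)
    x = generation_cost(cpu_hours)
    v = uses_per_month(days_between)
    print(f"store: ${y:.2f}/month, regenerate on demand: ${x * v:.2f}/month")
```

For a single 500 GB dataset that takes 5 CPU hours to generate and is used every 5 days, regenerating on demand is far cheaper than storing under these prices; the picture changes once deleted predecessors must be regenerated as well.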

Evaluation
Compare different storage strategies with the proposed strategy
  – Usage based strategy
  – Generation cost based strategy
  – Cost rate based strategy


Thanks
©2007 The Board of Regents of the University of Nebraska. All rights reserved.