AMAZON S3 FOR SCIENCE GRIDS: A VIABLE SOLUTION? Mayur Palankar and Adriana Iamnitchi University of South Florida Matei Ripeanu University of British Columbia.

Slides:



Advertisements
Similar presentations
Wei Lu 1, Kate Keahey 2, Tim Freeman 2, Frank Siebenlist 2 1 Indiana University, 2 Argonne National Lab
Advertisements

GridNets2009, Athens 08/09/09 Business Models, Accounting and Billing Concepts in Grid-aware Networks Serafim Kotrotsos 1, Peter Racz 2, Cristian Morariu.
Designing Services for Grid-based Knowledge Discovery A. Congiusta, A. Pugliese, Domenico Talia, P. Trunfio DEIS University of Calabria ITALY
Jens G Jensen Atlas Petabyte store Supporting Multiple Interfaces to Mass Storage Providing Tape and Mass Storage to Diverse Scientific Communities.
S4: A Simple Storage Service for Sciences Matei Ripeanu Adriana Iamnitchi University of British Columbia University of South Florida.
STANFORD UNIVERSITY INFORMATION TECHNOLOGY SERVICES IT Services Storage And Backup Low Cost Central Storage (LCCS) January 9,
The Total Cost of (Non) Ownership of Storage In The Cloud Jinesh Varia Technology Evangelist.
PROVENANCE FOR THE CLOUD (USENIX CONFERENCE ON FILE AND STORAGE TECHNOLOGIES(FAST `10)) Kiran-Kumar Muniswamy-Reddy, Peter Macko, and Margo Seltzer Harvard.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
High Performance Computing Course Notes Grid Computing.
Data Grids Jon Ludwig Leor Dilmanian Braden Allchin Andrew Brown.
Grid Security Infrastructure Tutorial Von Welch Distributed Systems Laboratory U. Of Chicago and Argonne National Laboratory.
CoreGRID Workpackage 5 Virtual Institute on Grid Information and Monitoring Services Authorizing Grid Resource Access and Consumption Erik Elmroth, Michał.
Storage Management and Caching in PAST, a large-scale, persistent peer- to-peer storage utility Authors: Antony Rowstorn (Microsoft Research) Peter Druschel.
Are P2P Data-Dissemination Techniques Viable in Today's Data- Intensive Scientific Collaborations? Samer Al-Kiswany – University of British Columbia joint.
Large Scale Sharing GFS and PAST Mahesh Balakrishnan.
1 stdchk : A Checkpoint Storage System for Desktop Grid Computing Matei Ripeanu – UBC Sudharshan S. Vazhkudai – ORNL Abdullah Gharaibeh – UBC The University.
1 Exploring Data Reliability Tradeoffs in Replicated Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh Matei Ripeanu.
MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering.
1 Exploring Data Reliability Tradeoffs in Replicated Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh Advisor: Professor.
PhD course - Milan, March /09/ Some additional words about cloud computing Lionel Brunie National Institute of Applied Science (INSA) LIRIS.
Application and Usage of Cloud Computing and Data Security
CSC 456 Operating Systems Seminar Presentation (11/13/2012) Leon Weingard, Liang Xin The Google File System.
EE616 Technical Project Video Hosting Architecture By Phillip Sutton.
Cloud Computing. What is Cloud Computing? Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable.
Presenter: Dipesh Gautam.  Introduction  Why Data Grid?  High Level View  Design Considerations  Data Grid Services  Topology  Grids and Cloud.
1 Configurable Security for Scavenged Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh with: Samer Al-Kiswany, Matei Ripeanu.
November 7, 2001Dutch Datagrid SARA 1 DØ Monte Carlo Challenge A HEP Application.
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
1 A Framework for Data-Intensive Computing with Cloud Bursting Tekin Bicer David ChiuGagan Agrawal Department of Compute Science and Engineering The Ohio.
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.
A semi autonomic infrastructure to manage non functional properties of a service Pierre de Leusse Panos Periorellis Paul Watson Theo Dimitrakos UK e-Science.
Data Transfers in the Grid: Workload Analysis of Globus GridFTP Nicolas Kourtellis, Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi University of South.
Instrumentation of the SAM-Grid Gabriele Garzoglio CSC 426 Research Proposal.
Virtual Data Grid Architecture Ewa Deelman, Ian Foster, Carl Kesselman, Miron Livny.
Sharing Social Content from Home: A Measurement-driven Feasibility Study Massimiliano Marcon Bimal Viswanath Meeyoung Cha Krishna Gummadi NOSSDAV 2011.
09/02 ID099-1 September 9, 2002Grid Technology Panel Patrick Dreher Technical Panel Discussion: Progress in Developing a Web Services Data Analysis Grid.
Ames Research CenterDivision 1 Information Power Grid (IPG) Overview Anthony Lisotta Computer Sciences Corporation NASA Ames May 2,
Kia Manoochehri.  Background  Threat Classification ◦ Traditional Threats ◦ Availability of cloud services ◦ Third-Party Control  The “Notorious Nine”
1 NETE4631 Working with Cloud-based Storage Lecture Notes #11.
Amit Warke Jerry Philip Lateef Yusuf Supraja Narasimhan Back2Cloud: Remote Backup Service.
ESFRI & e-Infrastructure Collaborations, EGEE’09 Krzysztof Wrona September 21 st, 2009 European XFEL.
CEOS Working Group on Information Systems and Services - 1 Data Services Task Team Discussions on GRID and GRIDftp Stuart Doescher, USGS WGISS-15 May 2003.
High Energy FermiLab Two physics detectors (5 stories tall each) to understand smallest scale of matter Each experiment has ~500 people doing.
Elastic Cloud Caches for Accelerating Service-Oriented Computations Gagan Agrawal Ohio State University Columbus, OH David Chiu Washington State University.
CBM Computing Model First Thoughts CBM Collaboration Meeting, Trogir, 9 October 2009 Volker Friese.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces Shyamala Doraimani* and Adriana Iamnitchi University of South.
1 Workload Analysis of Globus’ GridFTP Nicolas Kourtellis Joint Work with:Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi, Dan Fraser University of South.
An Architectural Approach to Managing Data in Transit Micah Beck Director & Associate Professor Logistical Computing and Internetworking Lab Computer Science.
Cloud Archive By: Kimberly Nolan. What it is?  The goal of a cloud archiving service is to provide a data storage (ex. Google drive and SkyDrive) as.
IT-DSS Alberto Pace2 ? Detecting particles (experiments) Accelerating particle beams Large-scale computing (Analysis) Discovery We are here The mission.
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
Energy Management Solution
100% Exam Passing Guarantee & Money Back Assurance
Scalable Web Apps Target this solution to brand leaders responsible for customer engagement and roll-out of global marketing campaigns. Implement scenarios.
Amazon Simple Storage Service (S3)
Grid Security.
Amazon Storage- S3 and Glacier
Introduction to Data Management in EGI
Energy Management Solution
Scalable Web Apps Target this solution to brand leaders responsible for customer engagement and roll-out of global marketing campaigns. Implement scenarios.
Patrick Dreher Research Scientist & Associate Director
Haiyan Meng and Douglas Thain
Building a Database on S3
AWS Cloud Computing Masaki.
Data Management Components for a Research Data Archive
Content Distribution Network
Outline for today Oceanstore: An architecture for Global-Scale Persistent Storage – University of California, Berkeley. ASPLOS 2000 Feasibility of a Serverless.
Presentation transcript:

AMAZON S3 FOR SCIENCE GRIDS: A VIABLE SOLUTION? Mayur Palankar and Adriana Iamnitchi University of South Florida Matei Ripeanu University of British Columbia Simson Garfinkel Harvard University

Amazon S3 for science grids: a viable solution?2

3 Science Grids Data-intensive scientific collaborations Produce, analyze, and archive huge volumes of data (PetaBytes) –High data management and maintenance costs –Files are often used by groups of users and not individually Amazon Simple Storage Service (S3) Novel storage ‘utility’: –Direct access to storage Self-defined performance targets: –scalable, infinite data durability, 99.99% availability, fast data access Pay-as-you go pricing: –$0.15/month/GB stored and $0.10-$0.17/GB transferred Is offloading data storage from an in-house mass storage system to S3 feasible and cost- effective? Keeps decreasing

Amazon S3 for science grids: a viable solution?4 Approach Characterize S3 –Does it live up to its own expectations? Toy scenario: evaluate a representative scientific application (DZero) in this context –Estimate performance and costs –Is the functionality provided adequate? Outline S3 architecture S3 performance evaluation Toy scenario: S3-supported DZero: cost and functionality requirements Lessons/suggested improvements

Amazon S3 for science grids: a viable solution?5 S3 Architecture Two-level namespace –Buckets (think directories) Global names Two goals: data organization and charging –Data objects Opaque object (max 5GB) Metadata (attribute-value, up to 4KB) Functionality –Simple put/get functionality –Limited search functionality –Objects are immutable, cannot be renamed Data access protocols –SOAP –REST –BitTorrent

Amazon S3 for science grids: a viable solution?6 S3 Architecture (…cont) Security –Identities Assigned by S3 when initial contract is ‘signed’ –Authentication Public/private key scheme But private key is generated by Amazon! –Access control Access control lists (limited to 100 principals) ACL attributes –FullControl –Read & Write (objects cannot be written) –ReadACL & WriteACL (for buckets or objects) –Auditing (pseudo) S3 can provide a log record

Amazon S3 for science grids: a viable solution?7 Storage Service Requirements for Science Grids Data durability Data availability Access performance Usability Support for access control and privacy policies Low cost Approach Characterize S3 –Does it live up to its own expectations? Toy scenario: evaluate a representative scientific application (DZero) in this context –Estimate performance and costs –Is the functionality provided adequate?

Amazon S3 for science grids: a viable solution?8 S3 Characterization Methodology Black-box approach using PlanetLab nodes to estimate: durability availability access performance the effect of BitTorrent on cost savings

Amazon S3 for science grids: a viable solution?9 Evaluating Amazon S3 Durability: perfect –But limited scale experiment (1 year) –More than 137,000 requests for 10,000 files Availability –From EC2: higher than declared 99.99% –From over the Internet: 23 weeks of testing About 15,456 access requests (every 15 mins) Retry protocol, exponential back-off –95.89% availability after the original access –98.06% availability after the 1st retry –99.01% availability after the 2nd retry –99.76% availability after the 3rd retry –99.94% availability after the 4th retry –100% availability after the 5th retry

Amazon S3 for science grids: a viable solution?10 S3 Access Performance: Single Thread

Amazon S3 for science grids: a viable solution?11 S3 Access Performance: Multi-Thread

Amazon S3 for science grids: a viable solution?12 Remote Access Performance

13 Access Performance via BitTorrent

Amazon S3 for science grids: a viable solution?14 Approach Characterize S3 –Does it live up to its own expectations? Toy scenario: evaluate a representative scientific application (DZero) in this context –Estimate performance and costs –Is the functionality provided adequate?

15 Trace recording interval01/2003 – 03/2005 Number of jobs113,062 Hours of computation973,892 Total storage volume375 TB Total data processed5.2 PB Average data access rate273 GB/hour High-energy physics collaboration Traces from January ‘03 to March ’05 (27 months) 375 TB data, 5.2 PB transferred Shared data usage: no access control 561 users from 70+ institutions in 18 countries High intensity data usage: ~550Mbps sustained access rate in DZero 113,062 jobs running for 973,892 hours over the period of 27 months The DØ Experiment

Amazon S3 for science grids: a viable solution?16 S3 Evaluation: Cost All data stored at S3 and processed by DZero –Storage cost: $691,000/year ($829,440 for S3-Europe) –Transfer: $335,012/year  $85,500/month Reducing transfer costs: –Caching: 50TB cooperative cache: $43,888 per year in transfer costs (~10 times lower) –BitTorrent and distributed replicas –Use EC2: Replace transfer costs with $43,284/year Reducing storage costs: –Archive cold data – lifetime of 30% files < 24 hours, 40% < a week, 50% < a month –Throw away derived data –Distinguish between raw and derived data

Amazon S3 for science grids: a viable solution?17 Key Idea: Unbundling performance characteristics High availability, high durability, high performance are bundled at a single pricing point. Some applications need only one or two of them –A cache: availability and access performance –A backup solution (for DZero): durability and availability Solution: SLAs that allow the user to specify their requirements and chose pricing point.

Amazon S3 for science grids: a viable solution?18 The resources needed to provide high performance data access, high data availability and long data durability are different Application class DurabilityAvailability High performance data access CacheNoDependsYes Long-term archival YesNo Online production NoYes Batch productionNo Yes Unbundling Performance Characteristics

19 S3 Evaluation: Security Risks –Traditional risks with distributed storage are still a concern: Permanent data loss, Temporary data unavailability (DoS), Loss of confidentiality Malicious or erroneous data modifications –New risk: direct monetary loss Magnified as there is no built-in solution to limit loss Security scheme’s big advantage: it’s simple … but has limitations –Access control Hard to use ACLs in large systems – needs at least groups (now available) ACLs limited to 100 principals –No support for fine grained delegation –Implicit trust between users and the service S3 No support for un-repudiabiliy –No tools to limit risk

Amazon S3 for science grids: a viable solution?20 Lessons: Suggested Improvements To lower costs: unbundle performance characteristics –High availability, High durability, High performance are bundled at a single pricing point. –Some applications need only one or two of them –Solution: SLAs that allow the user to specify their requirements and chose pricing point. To provide specific support for science collaborations –Better security support for complex collaborations –Additional functionality for better usability: Metadata based searches Renaming and mutating objects –Relax hard-coded limitations: 100 buckets, 100 users in ACL, etc. Lesson for application integrators: Use application-level information to reduce costs –Raw vs. derived data –Exploit usage patterns: e.g., data gets cold. The answer: not yet.

Amazon S3 for science grids: a viable solution?21 Thank you