A Service for Data-Intensive Computations on Virtual Clusters. Rainer Schmidt, Christian Sadilek, and Ross King. Intensive 2009, Valencia, Spain.


A Service for Data-Intensive Computations on Virtual Clusters: Executing Preservation Strategies at Scale
Rainer Schmidt, Christian Sadilek, and Ross King
Intensive 2009, Valencia, Spain

Planets Project
 “Permanent Long-term Access through NETworked Services”
 Addresses the problem of digital preservation
 Driven by national libraries and archives
 Project instrument: FP6 Integrated Project, 5th IST Call
 Consortium: 16 organisations from 7 countries
 Duration: 48 months, June 2006 – May 2010
 Budget: 14 million euro

Outline
 Digital Preservation
 The Need for eScience Repositories
 The Planets Preservation Environment
 Distributed Data and Metadata Integration
 Data-Intensive Computations
 A Grid Service for Job Execution on AWS
 Experimental Results
 Conclusions

Drivers for Digital Preservation
 Exponential growth of digital data across all sectors of society
 Production of large quantities of often irreplaceable information
 Produced by academic and other research
 Data becomes significantly more complex and diverse
 Digital information is fragile
 Ensure long-term access and interpretability
 Requires data curation and preservation
 Development and assessment of preservation strategies
 Need for integrated and largely automated environments

Repository Systems and eScience
 Digital repositories and libraries are designed for the creation, discovery, and publication of digital collections
 Systems are often heterogeneous, closed, and limited in scale
 Data grid middleware focuses on the controlled sharing and management of large data sets in a distributed environment
 Limited in metadata management and preservation functionality
 Integrate repositories into a large-scale environment
 Distributed data management capabilities
 Integration of policies, tools, and workflows
 Integration with computational facilities

The Planets Environment
 A collaborative research infrastructure for the systematic development and evaluation of preservation strategies
 A decision-support system for the development of preservation plans
 An integrated environment for dynamically providing, discovering, and accessing a wide range of preservation tools, services, and technical registries
 A workflow environment for data integration, process execution, and the maintenance of preservation information

Service Gateway Architecture (diagram; components listed)
 Layers: User Applications; Portal Services; Application Execution and Data Services; Physical Resources, Computers, Networks
 Components: Preservation Planning (Plato), Experimentation Testbed Application, Workflow Execution UI, Administration UI, Workflow Execution and Monitoring, Notification and Logging System, Authentication and Authorization, Experiment Data and Metadata Repository, Service and Tool Registry, Application Services, Execution Services, Data Storage Services

Data and Metadata Integration
 Support for distributed and remote storage facilities: data registry and metadata repository
 Integration of different data sources and encoding standards
 Currently a range of interfaces/protocols (OAI-PMH, SRU) and encoding standards (DC, METS) are supported
 Development of a common digital object model
 Recording of provenance, preservation, integrity, and descriptive metadata
 Dynamic mapping of objects from different digital libraries (TEL, memory institutions, research collections)
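The mapping of harvested records onto a common digital object model can be pictured with a minimal sketch. The `DigitalObject` class and the fields chosen below are illustrative assumptions, not the actual Planets data model; only the Dublin Core namespace and element names are standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Dublin Core element namespace (standard); the model below is hypothetical.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

@dataclass
class DigitalObject:
    title: str
    creator: str
    provenance: list = field(default_factory=list)  # provenance events

def from_dublin_core(xml_text: str) -> DigitalObject:
    """Map a DC record (e.g. harvested via OAI-PMH) onto the object model."""
    root = ET.fromstring(xml_text)
    return DigitalObject(
        title=root.findtext(f"{DC_NS}title", default=""),
        creator=root.findtext(f"{DC_NS}creator", default=""),
    )

record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Executing Preservation Strategies at Scale</dc:title>
  <dc:creator>Rainer Schmidt</dc:creator>
</record>"""
obj = from_dublin_core(record)
print(obj.title)  # Executing Preservation Strategies at Scale
```

The same shape would apply to METS records, with a different namespace and element paths.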

Data and Metadata Registry (diagram)
 Components: Portal Application Service, Data Catalogue, Experiment Data Metadata Repository, Content Repository (DigObj, PUID), Temp Storage


Data-Intensive Applications
 Development of the Planets Job Submission Service (JSS)
 Allows job submission to a PC cluster (e.g. Hadoop, Condor)
 Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL)
 Demand for massive compute and storage resources
 Clusters are instantiated on top of (leased) cloud resources based on AWS EC2 + S3, or alternatively in-house in a PC lab
 Allows computation to be moved to the data, or vice versa
 On-demand cluster based on virtual machine images
 Many preservation tools are, or rely on, third-party applications
 Software can be preinstalled on virtual cluster nodes
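A job sent to such a service is described in JSDL. A minimal sketch of a job document follows; the namespaces and element names are from the JSDL 1.0 specification, but the tool invocation and file locations are illustrative assumptions, not taken from the project.

```xml
<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
                    xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <jsdl:JobDescription>
    <jsdl:Application>
      <!-- Illustrative migration-tool invocation -->
      <jsdl-posix:POSIXApplication>
        <jsdl-posix:Executable>/usr/bin/ps2pdf</jsdl-posix:Executable>
        <jsdl-posix:Argument>input.ps</jsdl-posix:Argument>
        <jsdl-posix:Argument>output.pdf</jsdl-posix:Argument>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
    <!-- Stage input data in from the storage service before execution -->
    <jsdl:DataStaging>
      <jsdl:FileName>input.ps</jsdl:FileName>
      <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
      <jsdl:Source>
        <jsdl:URI>s3://example-bucket/input.ps</jsdl:URI>
      </jsdl:Source>
    </jsdl:DataStaging>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
```

The HPC Basic Profile (HPC-BP) then defines the SOAP operations used to submit and monitor such a document.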

Experimental Architecture (diagram)
 Components: Workflow Environment, HPCBP, JSDL Job Description File, JSS, Data Service, Job, Digital Objects, Virtual Node (Xen), Virtual Cluster and File System (Apache Hadoop), Cloud Infrastructure (EC2), Storage Infrastructure (S3)

A Light-Weight Job Execution Service (diagram)
 Components: WS container with TLS/WS-Security, BES/HPCBP API, Account Manager, JSDL Parser, Session Handler, Execution Manager, Data Resolver, Input Generator, Job Manager
 Backends: Hadoop (HDFS), S3

The MapReduce Application
 MapReduce provides a framework and programming model for processing large documents (sorting, searching, indexing) on multiple nodes
 Automated decomposition (split)
 Mapping to intermediary pairs (map), optionally combined locally (combine)
 Merging of output (reduce)
 Provides an implementation model for data-parallel and I/O-intensive applications
 Migrating a digital object (e.g. web page, folder, archive):
 Decompose into atomic pieces (e.g. file, image, movie)
 On each node, process input splits
 Merge pieces and create new data collections
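The split/map/reduce steps above can be sketched in a few lines of pure Python. This is a minimal in-process illustration of the pattern, not the actual Hadoop-based implementation; the upper-casing "migration" stands in for a real format-migration tool.

```python
# Minimal sketch of the MapReduce pattern applied to migrating a digital
# object: split into atomic pieces, migrate each piece, merge the results.

def split(collection):
    """Decompose a digital object into atomic pieces (one per task)."""
    return list(collection.items())

def map_task(name, content):
    """Migrate a single piece; upper-casing stands in for a real tool."""
    return (name, content.upper())

def reduce_task(pairs):
    """Merge migrated pieces back into a new collection."""
    return dict(pairs)

collection = {"page1.txt": "hello", "page2.txt": "world"}
migrated = reduce_task(map_task(n, c) for n, c in split(collection))
print(migrated)  # {'page1.txt': 'HELLO', 'page2.txt': 'WORLD'}
```

In the Hadoop deployment described on the following slides, the map tasks run in parallel on the virtual cluster nodes and the input splits are served by HDFS or S3.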

Experimental Setup
 Amazon Elastic Compute Cloud (EC2): 1 – 150 cluster nodes, custom image based on Fedora 8 i386
 Amazon Simple Storage Service (S3): max. 1 TB I/O per experiment
 All measurements based on a single core per virtual machine
 Apache Hadoop (v0.18): MapReduce implementation, preinstalled command-line tools
 Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, and physical size

Experimental Results 1 – Scaling Job Size (chart omitted)
 Speedup x(1k) = t_seq / t_parallel, measured with 1000 tasks on 5 nodes
 Observed speedups: x(1k) = 3.5, 4.4, and 3.6 for the different workloads

Experimental Results 2 – Scaling #nodes (chart omitted; t = execution time, s = speedup, e = efficiency)
 n=1 (local): t=26
 n=1: t=36, s=0.72, e=72%
 n=5: t=4.5, s=3.3, e=66%
 n=10: t=4.5, s=5.5, e=55%
 n=50: t=1.68, s=16, e=31%
 n=100: t=1.03, s=26, e=26%

Results and Observations
 AWS + Hadoop + JSS: robust and fault tolerant, scalable up to large numbers of nodes
 ~32.5 MB/s download / ~13.8 MB/s upload (cloud-internal)
 Even small clusters of virtual machines are reasonable
 S = 4.4 for p = 5, with n = 1000 and s = 7.5 MB
 Ideally, the size of the cluster grows/shrinks on demand
 Overheads:
 SLE vs. Cloud (p=1, n=1000): 30% due to the file system master
 On average, less than 10% due to S3
 Small overheads due to coordination, latencies, and pre-processing
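The speedup and efficiency figures reported in the results slides follow the standard definitions S = t_seq / t_par and e = S / p. A short check, using the n = 100 node measurement reported above (the helper names are ours, not the paper's):

```python
# Standard parallel-performance metrics; timing values from the results slide.

def speedup(t_seq, t_par):
    """Ratio of sequential to parallel execution time."""
    return t_seq / t_par

def efficiency(s, p):
    """Speedup divided by the number of nodes p."""
    return s / p

# Local sequential run: t_seq = 26; the 100-node run reported s = 26,
# which gives the 26% efficiency stated on the slide.
print(efficiency(26.0, 100))  # 0.26
```

The declining efficiency with growing p (72% at n=1 down to 26% at n=100) reflects the coordination and file-system-master overheads listed above.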

Conclusion
 Challenges in digital preservation are diversity and scale
 Focus: scientific data, arts and humanities
 Repository and preservation systems need to be embedded within large-scale distributed environments: grids, clouds, eScience environments
 Cloud computing provides a powerful solution for getting on-demand access to an IT infrastructure
 Current integration issues: policies, legal aspects, SLAs
 We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies

Fin