A Service for Data-Intensive Computations on Virtual Clusters
Executing Preservation Strategies at Scale
Rainer Schmidt, Christian Sadilek, and Ross King
rainer.schmidt@arcs.ac.at
Intensive 2009, Valencia, Spain
Planets Project
“Permanent Long-term Access through NETworked Services”
- Addresses the problem of digital preservation, driven by national libraries and archives
- Project instrument: FP6 Integrated Project, 5th IST Call
- Consortium: 16 organisations from 7 countries
- Duration: 48 months, June 2006 – May 2010
- Budget: 14 million euro
http://www.planets-project.eu/
Outline
- Digital Preservation and the Need for eScience Repositories
- The Planets Preservation Environment
- Distributed Data and Metadata Integration
- Data-Intensive Computations
- Grid Service for Job Execution on AWS
- Experimental Results
- Conclusions
Drivers for Digital Preservation
- Exponential growth of digital data across all sectors of society
- Production of large quantities of often irreplaceable information, much of it produced by academic and other research
- Data is becoming significantly more complex and diverse
- Digital information is fragile: ensuring long-term access and interpretability requires data curation and preservation
- Development and assessment of preservation strategies requires integrated and largely automated environments
Repository Systems and eScience
- Digital repositories and libraries are designed for the creation, discovery, and publication of digital collections
  - Systems are often heterogeneous, closed, and limited in scale
- Data grid middleware focuses on the controlled sharing and management of large data sets in a distributed environment
  - Limited in metadata management and preservation functionality
- Goal: integrate repositories into a large-scale environment
  - Distributed data management capabilities
  - Integration of policies, tools, and workflows
  - Integration with computational facilities
The Planets Environment
- A collaborative research infrastructure for the systematic development and evaluation of preservation strategies
- A decision-support system for the development of preservation plans
- An integrated environment for dynamically providing, discovering, and accessing a wide range of preservation tools, services, and technical registries
- A workflow environment for data integration, process execution, and the maintenance of preservation information
Service Gateway Architecture
[Architecture diagram. Approximate layering: User Applications and Portal Services (Preservation Planning (Plato), Experimentation Testbed, Workflow Execution UI, Administration UI) on top of Application Services (Workflow Execution and Monitoring, Notification and Logging System, Service and Tool Registry, Authentication and Authorization), which sit on Application Execution and Data Services (Execution Services, Data Storage Services, Experiment Data and Metadata Repository), running on Physical Resources, Computers, Networks.]
Data and Metadata Integration
- Support for distributed and remote storage facilities via a data registry and metadata repository
- Integration of different data sources and encoding standards
  - Currently, a range of interfaces/protocols (OAI-PMH, SRU) and encoding standards (DC, METS) are supported (see the harvesting sketch below)
- Development of a common digital object model
  - Records provenance, preservation, integrity, and descriptive metadata
- Dynamic mapping of objects from different digital libraries (TEL, memory institutions, research collections)
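To make the protocol side concrete: OAI-PMH harvesting is plain HTTP, with the verb and metadata format passed as query parameters. Below is a minimal Java sketch of such a harvest; the endpoint URL is a placeholder, and the snippet is illustrative rather than Planets code.

```java
// Sketch: harvest Dublin Core records from an OAI-PMH endpoint.
// The endpoint URL is hypothetical.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OaiPmhHarvest {
    public static void main(String[] args) throws Exception {
        // OAI-PMH requests are HTTP GETs; ListRecords returns an XML
        // document of <record> elements in the requested format (oai_dc).
        String endpoint = "http://repository.example.org/oai"; // placeholder
        String url = endpoint + "?verb=ListRecords&metadataPrefix=oai_dc";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // A real harvester would parse the XML and follow any
        // <resumptionToken> to page through large result sets.
        System.out.println(response.body());
    }
}
```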
Data and Metadata Registry
[Diagram: a portal and an application service interact with the data catalogue, the experiment data metadata repository, a content repository, and temporary storage; digital objects (DigObj) are resolved via PUIDs.]
Data-Intensive Applications
- Development of the Planets Job Submission Services (JSS)
  - Allow job submission to a PC cluster (e.g. Hadoop, Condor)
  - Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL); see the job description sketch after this list
- Demand for massive compute and storage resources
  - Instantiate a cluster on top of (leased) cloud resources based on AWS EC2 + S3, or alternatively in-house in a PC lab
  - Allows computation to be moved to the data, or vice versa
- On-demand clusters based on virtual machine images
  - Many preservation tools are, or rely on, third-party applications
  - Software can be preinstalled on the virtual cluster nodes
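As an illustration of the submission interface, the sketch below assembles a minimal JSDL job description of the kind a client could send over the BES/HPC-BP SOAP interface. The element names follow the published JSDL 1.0 schema; the executable and file names are hypothetical, not taken from the Planets services.

```java
// Sketch: build and print a minimal JSDL 1.0 job description.
// A real client would embed this document in a BES/HPCBP CreateActivity call.
public class JsdlExample {
    public static void main(String[] args) {
        String jsdl = """
            <jsdl:JobDefinition
                xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
                xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
              <jsdl:JobDescription>
                <jsdl:Application>
                  <jsdl-posix:POSIXApplication>
                    <!-- hypothetical migration tool preinstalled on the VM image -->
                    <jsdl-posix:Executable>/usr/bin/ps2pdf</jsdl-posix:Executable>
                    <jsdl-posix:Argument>input.ps</jsdl-posix:Argument>
                    <jsdl-posix:Argument>output.pdf</jsdl-posix:Argument>
                  </jsdl-posix:POSIXApplication>
                </jsdl:Application>
              </jsdl:JobDescription>
            </jsdl:JobDefinition>
            """;
        System.out.println(jsdl);
    }
}
```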
Experimental Architecture
[Diagram: a workflow environment submits a job description file (JSDL) to the job submission service (JSS) via HPCBP; the JSS executes jobs on a virtual cluster and file system (Apache Hadoop) composed of virtual nodes (Xen) on the cloud infrastructure (EC2); digital objects are staged through a data service to the storage infrastructure (S3).]
A Light-Weight Job Execution Service
[Component diagram: a WS container exposes the BES/HPCBP API, secured via TLS/WS-Security; behind it, a job manager coordinates an account manager, JSDL parser, session handler, execution manager, data resolver, and input generator.]
A Light-Weight Job Execution Service (continued)
[The same diagram with back ends attached: Hadoop as the execution fabric, and HDFS and S3 as storage back ends.]
The MapReduce Application
- MapReduce implements a framework and programming model for processing large document sets (sorting, searching, indexing) on multiple nodes
  - Automated decomposition of the input (split)
  - Mapping to intermediate key/value pairs (map), with optional local aggregation (combine)
  - Merging of the output (reduce)
- Provides an implementation model for data-parallel and I/O-intensive applications
- Migrating a digital object (e.g. web page, folder, archive); see the sketch after this list:
  - Decompose it into atomic pieces (e.g. file, image, movie)
  - On each node, process the input splits
  - Merge the pieces and create new data collections
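A minimal sketch of such a migration job, written against the org.apache.hadoop.mapred API that Hadoop 0.18 shipped. This is illustrative, not the authors' code; the migration tool and target naming are hypothetical. Each input line names one file to migrate, the map task shells out to a preinstalled command-line tool, and the default identity reduce merges the result list.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MigrationJob {
    // Map: each input line is the path of one file to migrate; emit
    // (source path, migrated path) after invoking the migration tool.
    public static class MigrateMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String src = value.toString();
            String dst = src + ".pdf"; // hypothetical target name
            // Invoke a tool preinstalled on the cluster node (hypothetical).
            Process p = Runtime.getRuntime()
                    .exec(new String[] {"/usr/bin/ps2pdf", src, dst});
            try {
                p.waitFor();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            out.collect(new Text(src), new Text(dst));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MigrationJob.class);
        conf.setJobName("migrate");
        conf.setMapperClass(MigrateMapper.class);
        conf.setNumReduceTasks(1); // identity reduce collects the result list
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // Input/output may live on HDFS or S3 (hdfs:// or s3:// URIs).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```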
Experimental Setup
- Amazon Elastic Compute Cloud (EC2)
  - 1–150 cluster nodes
  - Custom image based on Fedora 8 i386
- Amazon Simple Storage Service (S3)
  - Max. 1 TB I/O per experiment
- All measurements based on a single core per virtual machine
- Apache Hadoop (v0.18)
  - MapReduce implementation
  - Preinstalled command-line tools
- Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, and physical size
Experimental Results 1 – Scaling Job Size
[Chart: speedup x(1k) = t_seq / t_parallel for 1000 tasks on 5 nodes; measured speedups x(1k) of 3.5, 4.4, and 3.6.]
Experimental Results 2 – Scaling the Number of Nodes
[Chart data, reconstructed:]
  n (nodes)    t (time)    s (speedup)    e (efficiency)
  1            36          0.72           72%
  5            4.5         3.3            66%
  10           4.5         5.5            55%
  50           1.68        16             31%
  100          1.03        26             26%
  1 (local)    26          –              –
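For reference, the speedup and efficiency columns follow the usual definitions, and the reported numbers are consistent with them (checking the n = 5 row):

```latex
S(n) = \frac{t_{\mathrm{seq}}}{t_{\mathrm{parallel}}(n)}, \qquad
E(n) = \frac{S(n)}{n}, \qquad
\text{e.g. } E(5) = \frac{3.3}{5} = 66\%.
```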
Results and Observations
- AWS + Hadoop + JSS: robust and fault-tolerant, scalable up to large numbers of nodes
- ~32.5 MB/s download / ~13.8 MB/s upload (cloud-internal)
- Even small clusters of virtual machines are worthwhile: S = 4.4 for p = 5, with n = 1000 tasks and s = 7.5 MB
- Ideally, the size of the cluster grows/shrinks on demand
- Overheads, SLE vs. cloud (p = 1, n = 1000):
  - 30% due to the file system master
  - On average, less than 10% due to S3
  - Small overheads due to coordination, latencies, and pre-processing
Conclusion
- The challenges in digital preservation are diversity and scale
  - Focus: scientific data, arts and humanities repositories
- Preservation systems need to be embedded within large-scale distributed environments
  - Grids, clouds, eScience environments
- Cloud computing provides a powerful solution for obtaining on-demand access to an IT infrastructure
  - Current integration issues: policies, legal aspects, SLAs
- We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies
Fin