A Service for Data-Intensive Computations on Virtual Clusters
Executing Preservation Strategies at Scale
Rainer Schmidt, Christian Sadilek, and Ross King
rainer.schmidt@arcs.ac.at
Intensive 2009, Valencia, Spain
Planets Project
“Permanent Long-term Access through NETworked Services”
- Addresses the problem of digital preservation, driven by national libraries and archives
- Project instrument: FP6 Integrated Project, 5th IST Call
- Consortium: 16 organisations from 7 countries
- Duration: 48 months, June 2006 – May 2010
- Budget: 14 million euro
http://www.planets-project.eu/
Outline
- Digital Preservation and the Need for eScience Repositories
- The Planets Preservation Environment
- Distributed Data and Metadata Integration
- Data-Intensive Computations
- Grid Service for Job Execution on AWS
- Experimental Results
- Conclusions
Drivers for Digital Preservation
- Exponential growth of digital data across all sectors of society
- Production of large quantities of often irreplaceable information, much of it produced by academic and other research
- Data is becoming significantly more complex and diverse
- Digital information is fragile: ensuring long-term access and interpretability requires data curation and preservation
- Development and assessment of preservation strategies requires integrated and largely automated environments
Repository Systems and eScience
- Digital repositories and libraries are designed for the creation, discovery, and publication of digital collections
  - Systems are often heterogeneous, closed, and limited in scale
- Data grid middleware focuses on the controlled sharing and management of large data sets in a distributed environment
  - Limited in metadata management and preservation functionality
- Goal: integrate repositories into a large-scale environment
  - Distributed data management capabilities
  - Integration of policies, tools, and workflows
  - Integration with computational facilities
The Planets Environment
- A collaborative research infrastructure for the systematic development and evaluation of preservation strategies
- A decision-support system for the development of preservation plans
- An integrated environment for dynamically providing, discovering, and accessing a wide range of preservation tools, services, and technical registries
- A workflow environment for data integration, process execution, and the maintenance of preservation information
Service Gateway Architecture
[Architecture diagram. Approximate layering: User Applications and Portal Services (Preservation Planning (Plato), Experimentation Testbed, Workflow Execution UI, Administration UI) on top of Application Services (Workflow Execution and Monitoring, Notification and Logging System, Service and Tool Registry, Authentication and Authorization), which sit on Application Execution and Data Services (Execution Services, Data Storage Services, Experiment Data and Metadata Repository), running on Physical Resources, Computers, Networks.]
Data and Metadata Integration
- Support for distributed and remote storage facilities via a data registry and metadata repository
- Integration of different data sources and encoding standards
  - Currently, a range of interfaces/protocols (OAI-PMH, SRU) and encoding standards (DC, METS) are supported (see the harvesting sketch below)
- Development of a common digital object model
  - Records provenance, preservation, integrity, and descriptive metadata
- Dynamic mapping of objects from different digital libraries (TEL, memory institutions, research collections)
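To make the protocol side concrete: OAI-PMH harvesting is plain HTTP, with the verb and metadata format passed as query parameters. Below is a minimal Java sketch of such a harvest; the endpoint URL is a placeholder, and the snippet is illustrative rather than Planets code.

```java
// Sketch: harvest Dublin Core records from an OAI-PMH endpoint.
// The endpoint URL is hypothetical.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OaiPmhHarvest {
    public static void main(String[] args) throws Exception {
        // OAI-PMH requests are HTTP GETs; ListRecords returns an XML
        // document of <record> elements in the requested format (oai_dc).
        String endpoint = "http://repository.example.org/oai"; // placeholder
        String url = endpoint + "?verb=ListRecords&metadataPrefix=oai_dc";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // A real harvester would parse the XML and follow any
        // <resumptionToken> to page through large result sets.
        System.out.println(response.body());
    }
}
```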
Data and Metadata Registry
[Diagram: a portal and an application service interact with the data catalogue, the experiment data metadata repository, a content repository, and temporary storage; digital objects (DigObj) are resolved via PUIDs.]
Data-Intensive Applications
- Development of the Planets Job Submission Services (JSS)
  - Allow job submission to a PC cluster (e.g. Hadoop, Condor)
  - Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL); see the job description sketch after this list
- Demand for massive compute and storage resources
  - Instantiate a cluster on top of (leased) cloud resources based on AWS EC2 + S3, or alternatively in-house in a PC lab
  - Allows computation to be moved to the data, or vice versa
- On-demand clusters based on virtual machine images
  - Many preservation tools are, or rely on, third-party applications
  - Software can be preinstalled on the virtual cluster nodes
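As an illustration of the submission interface, the sketch below assembles a minimal JSDL job description of the kind a client could send over the BES/HPC-BP SOAP interface. The element names follow the published JSDL 1.0 schema; the executable and file names are hypothetical, not taken from the Planets services.

```java
// Sketch: build and print a minimal JSDL 1.0 job description.
// A real client would embed this document in a BES/HPCBP CreateActivity call.
public class JsdlExample {
    public static void main(String[] args) {
        String jsdl = """
            <jsdl:JobDefinition
                xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
                xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
              <jsdl:JobDescription>
                <jsdl:Application>
                  <jsdl-posix:POSIXApplication>
                    <!-- hypothetical migration tool preinstalled on the VM image -->
                    <jsdl-posix:Executable>/usr/bin/ps2pdf</jsdl-posix:Executable>
                    <jsdl-posix:Argument>input.ps</jsdl-posix:Argument>
                    <jsdl-posix:Argument>output.pdf</jsdl-posix:Argument>
                  </jsdl-posix:POSIXApplication>
                </jsdl:Application>
              </jsdl:JobDescription>
            </jsdl:JobDefinition>
            """;
        System.out.println(jsdl);
    }
}
```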
Experimental Architecture
[Diagram: a workflow environment submits a job description file (JSDL) to the job submission service (JSS) via HPCBP; the JSS executes jobs on a virtual cluster and file system (Apache Hadoop) composed of virtual nodes (Xen) on the cloud infrastructure (EC2); digital objects are staged through a data service to the storage infrastructure (S3).]
A Light-Weight Job Execution Service
[Component diagram: a WS container exposes the BES/HPCBP API, secured via TLS/WS-Security; behind it, a job manager coordinates an account manager, JSDL parser, session handler, execution manager, data resolver, and input generator.]
A Light-Weight Job Execution Service (continued)
[The same diagram with back ends attached: Hadoop as the execution fabric, and HDFS and S3 as storage back ends.]
The MapReduce Application
- MapReduce implements a framework and programming model for processing large document sets (sorting, searching, indexing) on multiple nodes
  - Automated decomposition of the input (split)
  - Mapping to intermediate key/value pairs (map), with optional local aggregation (combine)
  - Merging of the output (reduce)
- Provides an implementation model for data-parallel and I/O-intensive applications
- Migrating a digital object (e.g. web page, folder, archive); see the sketch after this list:
  - Decompose it into atomic pieces (e.g. file, image, movie)
  - On each node, process the input splits
  - Merge the pieces and create new data collections
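A minimal sketch of such a migration job, written against the org.apache.hadoop.mapred API that Hadoop 0.18 shipped. This is illustrative, not the authors' code; the migration tool and target naming are hypothetical. Each input line names one file to migrate, the map task shells out to a preinstalled command-line tool, and the default identity reduce merges the result list.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MigrationJob {
    // Map: each input line is the path of one file to migrate; emit
    // (source path, migrated path) after invoking the migration tool.
    public static class MigrateMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String src = value.toString();
            String dst = src + ".pdf"; // hypothetical target name
            // Invoke a tool preinstalled on the cluster node (hypothetical).
            Process p = Runtime.getRuntime()
                    .exec(new String[] {"/usr/bin/ps2pdf", src, dst});
            try {
                p.waitFor();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            out.collect(new Text(src), new Text(dst));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MigrationJob.class);
        conf.setJobName("migrate");
        conf.setMapperClass(MigrateMapper.class);
        conf.setNumReduceTasks(1); // identity reduce collects the result list
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // Input/output may live on HDFS or S3 (hdfs:// or s3:// URIs).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```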
Experimental Setup
- Amazon Elastic Compute Cloud (EC2)
  - 1–150 cluster nodes
  - Custom image based on Fedora 8 i386
- Amazon Simple Storage Service (S3)
  - Max. 1 TB I/O per experiment
- All measurements based on a single core per virtual machine
- Apache Hadoop (v0.18)
  - MapReduce implementation
  - Preinstalled command-line tools
- Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, and physical size
Experimental Results 1 – Scaling Job Size
[Chart: speedup x(1k) = t_seq / t_parallel for 1000 tasks on 5 nodes; measured speedups x(1k) of 3.5, 4.4, and 3.6.]
Experimental Results 2 – Scaling the Number of Nodes
[Chart data, reconstructed:]
  n (nodes)    t (time)    s (speedup)    e (efficiency)
  1            36          0.72           72%
  5            4.5         3.3            66%
  10           4.5         5.5            55%
  50           1.68        16             31%
  100          1.03        26             26%
  1 (local)    26          –              –
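For reference, the speedup and efficiency columns follow the usual definitions, and the reported numbers are consistent with them (checking the n = 5 row):

```latex
S(n) = \frac{t_{\mathrm{seq}}}{t_{\mathrm{parallel}}(n)}, \qquad
E(n) = \frac{S(n)}{n}, \qquad
\text{e.g. } E(5) = \frac{3.3}{5} = 66\%.
```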
Results and Observations
- AWS + Hadoop + JSS: robust and fault-tolerant, scalable up to large numbers of nodes
- ~32.5 MB/s download / ~13.8 MB/s upload (cloud-internal)
- Even small clusters of virtual machines are worthwhile: S = 4.4 for p = 5, with n = 1000 tasks and s = 7.5 MB
- Ideally, the size of the cluster grows/shrinks on demand
- Overheads, SLE vs. cloud (p = 1, n = 1000):
  - 30% due to the file system master
  - On average, less than 10% due to S3
  - Small overheads due to coordination, latencies, and pre-processing
Conclusion
- The challenges in digital preservation are diversity and scale
  - Focus: scientific data, arts and humanities repositories
- Preservation systems need to be embedded within large-scale distributed environments
  - Grids, clouds, eScience environments
- Cloud computing provides a powerful solution for obtaining on-demand access to an IT infrastructure
  - Current integration issues: policies, legal aspects, SLAs
- We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies
Fin