SCAPE

The SCAPE Project: Overview, Objectives, and Approaches
Dr. Rainer Schmidt, AIT Austrian Institute of Technology GmbH
APA 2011 Conference, London, 8-9 November 2011
SCAPE – what is it about?
- Planning and managing resource-intensive (digital) preservation processes such as large-scale ingestion, analysis, or modification of digital data sets
- Focus on scalability, robustness, and automation
- SCAPE is a follow-up to the highly successful FP6 Integrated Project Planets
SCAPE Project Data
- Project instrument: FP7 Integrated Project, Call 6
- Objective ICT: Digital Libraries and Digital Preservation
- Target outcome (a): Scalable systems and services for preserving digital content
- Duration: 42 months, February 2011 – July 2014
- Budget: 11.3 million euro (8.6 million euro funded)
SCAPE Consortium

No.  Partner name                                              Short name  Country
1    AIT Austrian Institute of Technology GmbH (coordinator)   AIT         AT
2    British Library                                           BL          UK
3    Internet Memory Foundation                                IMF         NL
4    Ex Libris Ltd                                             EXL         IL
5    Fachinformationszentrum Karlsruhe                         FIZ         DE
6    Koninklijke Bibliotheek                                   KB          NL
7    KEEP Solutions                                            KEEPS       PT
8    Microsoft Research                                        MSR         UK
9    Österreichische Nationalbibliothek                        ONB         AT
10   Open Planets Foundation                                   OPF         UK
11   Statsbiblioteket Aarhus                                   SB          DK
12   Science and Technology Facilities Council                 STFC        UK
13   Technische Universität Berlin                             TUB         DE
14   Technische Universität Wien                               TUW         AT
15   University of Manchester                                  UNIMAN      UK
16   Université Pierre et Marie Curie (Paris 6)                UPMC        FR
SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
- A scalable infrastructure and tools for preservation actions
- Automated, quality-assured preservation workflows
- Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
- Digital repositories
- Web content
- Research data sets
The SCAPE consortium brings together a broad spectrum of expertise from:
- Memory institutions
- Data centres
- Research labs
- Universities
- Industrial firms
Selected SCAPE Data Collections
Data collections provided by 6 institutions:
- Complete web archives and snapshots of public domains (.dk, .it, .eu, gov.uk, …)
- Millions of digitised newspapers, posters, law gazettes, and th century broadsheets
- Collections of multi-file objects such as books, papyri, and incunabula (up to 230 MB/object)
- Images of East Asian manuscripts in different quality levels
- TBs of voluntary deposit material in a wide variety of formats
- 500 TB of broadcast radio and TV output (up to 73 GB/object)
- Many hundreds of thousands of data sets from synchrotron, neutron, and muon instruments
- Items from a selection of open-access journal articles
Selected SCAPE Testbed Scenarios

Characterise large video files
- The master MPEG-2 files are so large that JHOVE is difficult to apply, and it provides insufficient detail.
- A detailed characterisation of the MPEG-2 streams is needed in order to identify technical dependencies for extracting from or rendering the streams.
- This would enable preservation risks related to current access services to be monitored, and action taken as necessary to ensure continued access and preservation.

Carry out large-scale migrations
- Migrating from one format to another introduces the possibility of damaging the content or failing to capture significant properties of the original in the resulting destination format.
- Specific requirements include:
  - Tools that operate reliably at scale (80 TB, 2 million pages)
  - Automated QA, ideally with no manual intervention on a file-by-file basis
  - QA performed by a process independent of the migration itself
  - Strong evidence that significant properties are captured in the destination format

Quality assurance in web harvesting
- For large-scale crawls, automation of the quality-control process is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.

[Illustration from digitalbevaring.dk]
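The idea of QA performed by a process independent of the migration can be sketched as follows. This is a minimal illustration, not SCAPE's actual tooling: the property extractors here (byte size, word count, a digest of whitespace-normalised text) are hypothetical stand-ins for the format-independent significant properties a real characterisation tool would report.

```python
# Sketch: independent QA for a format migration. Compare format-independent
# "significant properties" of the source and destination files, rather than
# trusting the migration tool's own exit status. All property names here are
# illustrative stand-ins, not real SCAPE APIs.
import hashlib

def extract_properties(payload: bytes) -> dict:
    """Extract illustrative format-independent properties from a payload."""
    text = payload.decode("utf-8", errors="ignore")
    return {
        "non_empty": len(payload) > 0,
        "word_count": len(text.split()),
        # Digest of whitespace-normalised text, so line-ending or layout
        # changes between formats do not count as content damage.
        "normalised_text_digest": hashlib.sha256(
            " ".join(text.split()).encode()).hexdigest(),
    }

def qa_migration(original: bytes, migrated: bytes) -> dict:
    """Report, per property, whether the migration preserved it."""
    a, b = extract_properties(original), extract_properties(migrated)
    return {key: a[key] == b[key] for key in a}

# A migration that only changes line endings preserves all the properties:
src = b"Digital preservation at scale.\r\n"
dst = b"Digital preservation at scale.\n"
print(qa_migration(src, dst))  # every check reports True
```

Because the checker only sees the two byte streams, it can run as a separate batch process after the migration, which is exactly the independence the scenario calls for.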
Selected SCAPE Challenges
- Bridging the gap between experimental workflows and production scenarios
  - e.g. coping with the amount and size of payload data
- Employing data-intensive technologies for processing binary content
  - Generation and evaluation of workflow results
- Exploiting data locality
  - Avoiding data transfer by placing processors next to the data
- Repository integration
  - Horizontal scalability
  - Scalable ingest/access
- Preservation planning
  - Automation of monitoring and decision processes
- Automated quality assurance
  - Advanced image processing
- Scientific data
  - How to preserve contextual information?
SCAPE Solutions – SCAPE Platform
- Environment for carrying out preservation workflows at scale
- Software package and a shared deployment (the Central Instance)
- Dynamic deployment of environments
  - Virtualisation and cloud-based technologies
  - Support for native tools and environments
- Builds upon a data-centric execution platform (Hadoop/Stratosphere)
- Simple and natural tool support, and automated mapping of graphical (Taverna-based) workflows to a parallel programming model
- Three levels of parallelisation:
  - Distribution of files
  - Splitting of content
  - Parallel query execution
- Repository integration based on two open reference implementations

[Diagram: a PPL dataflow program is compiled into a multi-stage MapReduce flow]
SCAPE Solutions – OPF Results Evaluation Framework (REF)
- Large RDF quadstore for storing SCAPE workflow results, developed in cooperation with the University of Southampton
- Shared database to publish and query these results
- Supports progress tracking and monitoring over time
- Provides input for preservation planning and watch
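To make the quadstore idea concrete: workflow results live as (subject, predicate, object, graph) quads, and planning-and-watch components query them by pattern. The toy in-memory store below is purely illustrative; the identifiers and predicates are invented for the sketch, and the real REF exposes RDF and SPARQL, not this API.

```python
# Toy illustration of a quadstore: workflow results as
# (subject, predicate, object, graph) tuples, queried by pattern matching.
# All names (run:42, ref:outcome, ...) are invented for this sketch.

def match(quads, s=None, p=None, o=None, g=None):
    """Return all quads matching the pattern; None acts as a wildcard."""
    return [q for q in quads
            if all(want is None or have == want
                   for have, want in zip(q, (s, p, o, g)))]

results = [
    ("run:42", "ref:tool", "jhove", "graph:experiments"),
    ("run:42", "ref:outcome", "success", "graph:experiments"),
    ("run:43", "ref:outcome", "failure", "graph:experiments"),
]

# "Which runs failed?" - the kind of question watch would ask over time:
print(match(results, p="ref:outcome", o="failure"))
# [('run:43', 'ref:outcome', 'failure', 'graph:experiments')]
```

The fourth element (the named graph) is what makes this a *quad*store rather than a triplestore: it lets one shared database keep results from different experiments or institutions separable yet jointly queryable.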
SCAPE Solutions – Context-aware Planning and Watch
- Automated watch
  - Monitoring trends in web harvests and repositories
  - Linked with the Results Evaluation Framework (REF) database
- Formalised policy model and representation using semantic technologies
- Automated planning
  - Building on the Planets PLATO tool
  - Key factors and decision criteria
  - Automated policy-driven planning
SCAPE Solutions – Automated Quality Assurance
- QA in web harvesting and digitisation through automated comparison of rendered pages
- Characterisation – feature extraction
  - Level 1 – Metadata information: using characterisation components
  - Level 2 – Global content description: discriminant global features for individual media types
  - Level 3 – Structural content description: detecting structural similarities in images
- Comparison
  - Discrete solutions and smart metrics (levels 2+3)
  - Development of metrics and measures of similarity, quality, and relationship to user perception
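A minimal sketch of the level-2 idea, comparing rendered pages via a global metric: compute the mean absolute difference between two grayscale rasters and threshold it. This is only an illustration of turning "are these renderings similar?" into a number; SCAPE's actual metrics are far richer and perception-oriented, and the threshold here is an invented placeholder.

```python
# Sketch: a simple global comparison of two rendered pages, represented as
# grayscale rasters (nested lists of 0-255 pixel values). The metric and
# threshold are illustrative, not SCAPE's real QA measures.

def mean_abs_diff(img_a, img_b):
    """Mean absolute per-pixel difference between two equal-sized rasters."""
    if len(img_a) != len(img_b) or any(
            len(r) != len(s) for r, s in zip(img_a, img_b)):
        raise ValueError("rasters must have identical dimensions")
    total = count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count

def similar(img_a, img_b, threshold=10.0):
    """Flag a pair of renderings as matching if pixels differ little."""
    return mean_abs_diff(img_a, img_b) <= threshold

page = [[0, 255], [128, 64]]
rerender = [[2, 250], [130, 60]]       # slight rendering differences
print(mean_abs_diff(page, rerender))   # 3.25
```

In an automated crawl-QA pipeline, such a score replaces eyeballing random samples: every harvested page gets a number, and only pairs above the threshold are escalated for inspection.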
Selected Achievements
- Public website
- Development infrastructure hosted by the Open Planets Foundation and GitHub
- First deliverables available for download
- Publications: 13 in the first nine months, including 6 at iPRES last week
  - Report: Comparative analysis of identification tools
  - Report: Analysis of scalability challenges for Digital Object Repositories – Classification and design of approaches
- Platform infrastructure
  - 10-node (dual-core), 20 TB experimental cluster hosted by AIT
  - Virtualisation based on Xen + Eucalyptus
  - Hardware for the Platform's Central Instance currently being set up in the data centre at IMF
Thank you for your attention!