SCAPE

The SCAPE Project: Overview, Objectives, and Approaches
Dr. Rainer Schmidt, AIT Austrian Institute of Technology GmbH
APA 2011 Conference, London, 8-9 November 2011
SCAPE – what is it about?
- Planning and managing resource-intensive (digital) preservation processes such as large-scale ingestion, analysis, or modification of digital data sets
- Focus on scalability, robustness, and automation
- SCAPE is a follow-up to the highly successful FP6 Integrated Project Planets
SCAPE Project Data
- Project instrument: FP7 Integrated Project, Call 6
- Objective ICT: Digital Libraries and Digital Preservation
- Target outcome (a): Scalable systems and services for preserving digital content
- Duration: 42 months, February 2011 – July 2014
- Budget: 11.3 million euro (8.6 million euro funded)
SCAPE Consortium

No.  Partner name                                              Short name  Country
1    AIT Austrian Institute of Technology GmbH (coordinator)   AIT         AT
2    British Library                                           BL          UK
3    Internet Memory Foundation                                IMF         NL
4    Ex Libris Ltd                                             EXL         IL
5    Fachinformationszentrum Karlsruhe                         FIZ         DE
6    Koninklijke Bibliotheek                                   KB          NL
7    KEEP Solutions                                            KEEPS       PT
8    Microsoft Research                                        MSR         UK
9    Österreichische Nationalbibliothek                        ONB         AT
10   Open Planets Foundation                                   OPF         UK
11   Statsbiblioteket Aarhus                                   SB          DK
12   Science and Technology Facilities Council                 STFC        UK
13   Technische Universität Berlin                             TUB         DE
14   Technische Universität Wien                               TUW         AT
15   University of Manchester                                  UNIMAN      UK
16   Université Pierre et Marie Curie (Paris 6)                UPMC        FR
SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
- A scalable infrastructure and tools for preservation actions
- Automated, quality-assured preservation workflows
- Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
- Digital repositories
- Web content
- Research data sets
The SCAPE consortium brings together a broad spectrum of expertise from:
- Memory institutions
- Data centres
- Research labs
- Universities
- Industrial firms
Selected SCAPE Data Collections
Data collections provided by 6 institutions:
- Complete web archives and snapshots of public domains (.dk, .it, .eu, gov.uk, …)
- Millions of digitised newspapers, posters, law gazettes, and th century broadsheets
- Collections of multi-file objects such as books, papyri, and incunabula (up to 230 MB/object)
- Images of East Asian manuscripts in different quality levels
- TBs of voluntary deposit material in a wide variety of formats
- 500 TB of broadcast radio and TV output (up to 73 GB/object)
- Many hundreds of thousands of data sets from synchrotron, neutron, and muon instruments
- Items from a selection of open-access journal articles
Selected SCAPE Testbed Scenarios

Characterise large video files
- The master MPEG-2 files are so large that JHOVE is difficult to apply, and it provides insufficient detail.
- A detailed characterisation of the MPEG-2 streams is needed in order to identify technical dependencies for extracting from or rendering the streams.
- This would enable preservation risks related to current access services to be monitored, and action taken as necessary to ensure continued access and preservation.

Carry out large-scale migrations
- Migrating from one format to another introduces the possibility of damaging the content or failing to capture significant properties of the original in the resulting destination format.
- Specific requirements include:
  - Tools that operate reliably at scale (80 TB, 2 million pages)
  - Automated QA, ideally with no manual intervention on a file-by-file basis
  - QA performed by a process independent of the migration itself
  - Strong evidence that significant properties are captured in the destination format

Quality assurance in web harvesting
- For large-scale crawls, automation of the quality-control process is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.

[Illustration from digitalbevaring.dk]
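The idea of QA performed by a process independent of the migration can be sketched as follows. This is a minimal illustration, not SCAPE's actual tooling: the property extractors here (byte size, word count, a digest of whitespace-normalised text) are hypothetical stand-ins for the format-independent significant properties a real characterisation tool would report.

```python
# Sketch: independent QA for a format migration. Compare format-independent
# "significant properties" of the source and destination files, rather than
# trusting the migration tool's own exit status. All property names here are
# illustrative stand-ins, not real SCAPE APIs.
import hashlib

def extract_properties(payload: bytes) -> dict:
    """Extract illustrative format-independent properties from a payload."""
    text = payload.decode("utf-8", errors="ignore")
    return {
        "non_empty": len(payload) > 0,
        "word_count": len(text.split()),
        # Digest of whitespace-normalised text, so line-ending or layout
        # changes between formats do not count as content damage.
        "normalised_text_digest": hashlib.sha256(
            " ".join(text.split()).encode()).hexdigest(),
    }

def qa_migration(original: bytes, migrated: bytes) -> dict:
    """Report, per property, whether the migration preserved it."""
    a, b = extract_properties(original), extract_properties(migrated)
    return {key: a[key] == b[key] for key in a}

# A migration that only changes line endings preserves all the properties:
src = b"Digital preservation at scale.\r\n"
dst = b"Digital preservation at scale.\n"
print(qa_migration(src, dst))  # every check reports True
```

Because the checker only sees the two byte streams, it can run as a separate batch process after the migration, which is exactly the independence the scenario calls for.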
Selected SCAPE Challenges
- Bridging the gap between experimental workflows and production scenarios
  - e.g. coping with the amount and size of payload data
- Employing data-intensive technologies for processing binary content
  - Generation and evaluation of workflow results
- Exploiting data locality
  - Avoiding data transfer by placing processors next to the data
- Repository integration
  - Horizontal scalability
  - Scalable ingest/access
- Preservation planning
  - Automation of monitoring and decision processes
- Automated quality assurance
  - Advanced image processing
- Scientific data
  - How to preserve contextual information?
SCAPE Solutions – SCAPE Platform
- Environment for carrying out preservation workflows at scale
- Software package and a shared deployment (the Central Instance)
- Dynamic deployment of environments
  - Virtualisation and cloud-based technologies
  - Support for native tools and environments
- Builds upon a data-centric execution platform (Hadoop/Stratosphere)
- Simple and natural tool support, and automated mapping of graphical (Taverna-based) workflows to a parallel programming model
- Three levels of parallelisation:
  - Distribution of files
  - Splitting of content
  - Parallel query execution
- Repository integration based on two open reference implementations

[Diagram: a PPL dataflow program is compiled into a multi-stage MapReduce flow]
SCAPE Solutions – OPF Results Evaluation Framework (REF)
- Large RDF quadstore for storing SCAPE workflow results, developed in cooperation with the University of Southampton
- Shared database to publish and query these results
- Supports progress tracking and monitoring over time
- Provides input for preservation planning and watch
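To make the quadstore idea concrete: workflow results live as (subject, predicate, object, graph) quads, and planning-and-watch components query them by pattern. The toy in-memory store below is purely illustrative; the identifiers and predicates are invented for the sketch, and the real REF exposes RDF and SPARQL, not this API.

```python
# Toy illustration of a quadstore: workflow results as
# (subject, predicate, object, graph) tuples, queried by pattern matching.
# All names (run:42, ref:outcome, ...) are invented for this sketch.

def match(quads, s=None, p=None, o=None, g=None):
    """Return all quads matching the pattern; None acts as a wildcard."""
    return [q for q in quads
            if all(want is None or have == want
                   for have, want in zip(q, (s, p, o, g)))]

results = [
    ("run:42", "ref:tool", "jhove", "graph:experiments"),
    ("run:42", "ref:outcome", "success", "graph:experiments"),
    ("run:43", "ref:outcome", "failure", "graph:experiments"),
]

# "Which runs failed?" - the kind of question watch would ask over time:
print(match(results, p="ref:outcome", o="failure"))
# [('run:43', 'ref:outcome', 'failure', 'graph:experiments')]
```

The fourth element (the named graph) is what makes this a *quad*store rather than a triplestore: it lets one shared database keep results from different experiments or institutions separable yet jointly queryable.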
SCAPE Solutions – Context-aware Planning and Watch
- Automated watch
  - Monitoring trends in web harvests and repositories
  - Linked with the Results Evaluation Framework (REF) database
- Formalised policy model and representation using semantic technologies
- Automated planning
  - Building on the Planets PLATO tool
  - Key factors and decision criteria
  - Automated policy-driven planning
SCAPE Solutions – Automated Quality Assurance
- QA in web harvesting and digitisation through automated comparison of rendered pages
- Characterisation – feature extraction
  - Level 1 – Metadata information: using characterisation components
  - Level 2 – Global content description: discriminant global features for individual media types
  - Level 3 – Structural content description: detecting structural similarities in images
- Comparison
  - Discrete solutions and smart metrics (levels 2+3)
  - Development of metrics and measures of similarity, quality, and relationship to user perception
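A minimal sketch of the level-2 idea, comparing rendered pages via a global metric: compute the mean absolute difference between two grayscale rasters and threshold it. This is only an illustration of turning "are these renderings similar?" into a number; SCAPE's actual metrics are far richer and perception-oriented, and the threshold here is an invented placeholder.

```python
# Sketch: a simple global comparison of two rendered pages, represented as
# grayscale rasters (nested lists of 0-255 pixel values). The metric and
# threshold are illustrative, not SCAPE's real QA measures.

def mean_abs_diff(img_a, img_b):
    """Mean absolute per-pixel difference between two equal-sized rasters."""
    if len(img_a) != len(img_b) or any(
            len(r) != len(s) for r, s in zip(img_a, img_b)):
        raise ValueError("rasters must have identical dimensions")
    total = count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count

def similar(img_a, img_b, threshold=10.0):
    """Flag a pair of renderings as matching if pixels differ little."""
    return mean_abs_diff(img_a, img_b) <= threshold

page = [[0, 255], [128, 64]]
rerender = [[2, 250], [130, 60]]       # slight rendering differences
print(mean_abs_diff(page, rerender))   # 3.25
```

In an automated crawl-QA pipeline, such a score replaces eyeballing random samples: every harvested page gets a number, and only pairs above the threshold are escalated for inspection.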
Selected Achievements
- Public website
- Development infrastructure hosted by the Open Planets Foundation and GitHub
- First deliverables available for download
- Publications: 13 in the first nine months, including 6 at iPRES last week
  - Report: Comparative analysis of identification tools
  - Report: Analysis of scalability challenges for Digital Object Repositories – Classification and design of approaches
- Platform infrastructure
  - 10-node (dual-core), 20 TB experimental cluster hosted by AIT
  - Virtualisation based on Xen + Eucalyptus
  - Hardware for the Platform's Central Instance currently being set up in the data centre at IMF
Thank you for your attention!