SCAPE Dr. Ross King AIT Austrian Institute of Technology GmbH APA Conference Frascati, November 7, 2012 SCAPE Scalable Preservation Tools and Infrastructure.



Digital Preservation – New Motives
Some growth rates:
  Number of bytes stored: 60%
  Costs of storage media: -20%
  Cost to store: (1.6 × 0.8) − 1 = 28%
  Growth of IT budgets: 4%
This massive volume of digital material raises a number of issues:
  What is worth preserving?
  How to preserve so much?
  How to access preserved data?
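The arithmetic behind the "cost to store" figure can be checked with a short sketch. It assumes the quoted rates are annual and stay constant, which the slide does not state; the function names are illustrative only.

```python
# Growth rates quoted on the slide (assumed here to be per year).
BYTES_GROWTH = 0.60        # number of bytes stored
MEDIA_COST_CHANGE = -0.20  # cost of storage media
BUDGET_GROWTH = 0.04       # IT budgets


def cost_to_store_growth(bytes_growth: float, media_cost_change: float) -> float:
    """Growth in total storage cost: (1 + g_bytes) * (1 + g_cost) - 1."""
    return (1 + bytes_growth) * (1 + media_cost_change) - 1


def cost_to_budget_ratio(years: int) -> float:
    """Storage cost divided by IT budget after `years`, both starting at 1.0.

    Assumes constant rates, which is an assumption of this sketch.
    """
    cost = (1 + cost_to_store_growth(BYTES_GROWTH, MEDIA_COST_CHANGE)) ** years
    budget = (1 + BUDGET_GROWTH) ** years
    return cost / budget


if __name__ == "__main__":
    growth = cost_to_store_growth(BYTES_GROWTH, MEDIA_COST_CHANGE)
    print(f"cost-to-store growth: {growth:.0%}")
    for n in (1, 5, 10):
        print(f"after {n:2d} periods, cost/budget ratio: {cost_to_budget_ratio(n):.2f}")
```

Under these assumptions the cost to store outpaces the budget in every period, which is the gap the slide points to.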

SCAPE – what is it about?
Planning and executing computing-intensive digital preservation processes such as the large-scale ingestion, characterisation or migration of large (multi-Terabyte) and complex data sets.
SCAPE results include:
  Preservation scenarios
  Preservation tools
  Preservation workflows
  Preservation infrastructure
  Preservation best-practices
SCAPE is a follow-up to the highly successful FP6 IP Planets.

SCAPE Project Data
  Project instrument: FP7 Integrated Project
  6th Call
  Objective ICT: Digital Libraries and Digital Preservation
  Target outcome (a): Scalable systems and services for preserving digital content
  Duration: 42 months (February 2011 – July 2014)
  Budget: 11.3 Million Euro
  Funded: 8.6 Million Euro

SCAPE Consortium

Number           Partner name                                Short name  Country
1 (coordinator)  AIT Austrian Institute of Technology GmbH   AIT         AT
2                British Library                             BL          UK
3                Internet Memory Foundation                  IM          NL
4                Ex Libris Ltd                               EXL         IL
5                Fachinformationszentrum Karlsruhe           FIZ         DE
6                Koninklijke Bibliotheek                     KB          NL
7                KEEP Solutions                              KEEPS       PT
8                Microsoft Research                          MSR         UK
9                Österreichische Nationalbibliothek          ONB         AT
10               Open Planets Foundation                     OPF         UK
11               Statsbiblioteket Aarhus                     SB          DK
12               Science and Technology Facilities Council   STFC        UK
13               Technische Universität Berlin               TUB         DE
14               Technische Universität Wien                 TUW         AT
15               University of Manchester                    UNIMAN      UK
16               Pierre & Marie Curie Université Paris 6     UPMC        FR

SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
  Infrastructure and tools for scalable preservation actions
  A framework for automated, quality-assured preservation workflows
  Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
  Digital Repositories
  Web Content
  Research Data Sets
The SCAPE Consortium brings together a broad spectrum of expertise from:
  Memory institutions
  Data centres
  Research labs
  Universities
  Industrial firms
[Diagram: project structure. Labels: Preservation Components, Quality Assurance, Scalable Components, Automation-ready Tools, Platform, Automation, Workflows, Parallelization, Virtualization, Planning and Watch, Institutional Policies, Technical Watch, Automated Planning, Testbeds, Corpora, Integration, Benchmarking, Validation, Takeup, Stakeholders, Communities, Dissemination, Training Activities, Sustainability, Cross-project Activities, Project Management, Technical Coordination, Research Roadmap]

Selected SCAPE Testbed Scenarios

Carry out large scale image migrations
  The master files from legacy digitized image collections are typically TIFF files that can be costly to store due to their size. Migrating them to a more compact format can reduce that cost, but the cost benefit can only be realized if one can remove the original TIFFs, and this can only be done if one can provide evidence of successful migration.

Detect poor sound quality
  In a collection of MP3 files (20 TB) we have discovered files with very bad sound quality. Before ingesting everything into our DOMS we would like to be able to discover the bad files and potentially get those re-digitized from the original analogue media.

RAW to NEXUS conversion
  Apart from the file size and content volume challenges identified in IS29 for NeXus files, the RAW to NeXus migration tool can be customised to take account of various other types of experiment data files during the migration. The scalability challenge here is that these other file types vary significantly between instruments, which are specific to each facility.

Quality assurance in web harvesting
  Web crawling is a process that is highly susceptible to errors. Often, essential data is missed by the crawler and thus not captured and preserved. Currently, quality assurance requires manual effort, and because crawls often contain millions of pages, manual quality assurance will be neither very efficient nor effective.

(Image from digitalbevaring.dk)

Selected SCAPE Challenges
  Bridging the gap between test workflows and scalable workflows
  Applying Map/Reduce to binary data
  Locality of data
    Bring the data to the computation, or bring the computation to the data?
  Repository Integration
    Repository Consistency
    Scalable Ingest
  Preservation Planning
    How to scale?
    How to automate?
  Research data sets
    How to preserve contextual information?
(Image from digitalbevaring.dk)

SCAPE Solutions: Automated Planning Component
  Builds on the Planets PLATO tool and methodology
  Emphasizes simplicity, scalability and automation
  Makes use of the Taverna workflow engine
  Integrates with existing repositories
  Uses semantically formalized policies

SCAPE Solutions: Automated Watch Component
Based on:
  Gathering information from various external sources from diverse domains
  Creating a centralized knowledge base with information of interest for preservation
  Expressing preservation risks and opportunities as questions to this knowledge base
  Monitoring the result of question assessment to reveal significant events that indicate the existence of the defined risks and opportunities
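The watch cycle described above (knowledge base, questions, monitoring) might be sketched as follows. This is a minimal illustration, not the SCAPE watch component: the fact layout, the `Question` class, the `assess` function and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical knowledge base: facts gathered from external sources.
# Entity names, property names and values are invented for illustration.
KNOWLEDGE_BASE = [
    {"entity": "TIFF", "property": "tool_support", "value": 12},
    {"entity": "OldFormat", "property": "tool_support", "value": 1},
]


@dataclass
class Question:
    """A risk or opportunity expressed as a query plus a trigger condition."""
    name: str
    query: Callable[[list], list]      # selects the relevant facts
    condition: Callable[[list], bool]  # True when the result is significant


def assess(knowledge_base: list, questions: list) -> list:
    """Return the names of the questions whose condition currently holds."""
    return [q.name for q in questions if q.condition(q.query(knowledge_base))]


LOW_SUPPORT = Question(
    name="format with fewer than 2 supporting tools",
    query=lambda facts: [f for f in facts if f["property"] == "tool_support"],
    condition=lambda hits: any(f["value"] < 2 for f in hits),
)

if __name__ == "__main__":
    # Re-running assess() after each update of the knowledge base
    # corresponds to the monitoring step.
    print(assess(KNOWLEDGE_BASE, [LOW_SUPPORT]))
```

Keeping the query separate from the condition means the same gathered facts can back several risks or opportunities.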

SCAPE Solutions: SCAPE Platform
  Hadoop, Eucalyptus
    Virtualized cluster
  Repository integration
    HBase, HDFS - Fedora
  Three levels of parallelization
    Distribution of files
    Splitting binary files
    Parallelisation of algorithms
  Multiple instances (how-to)
  Mapping Taverna to Hadoop
(Image from digitalbevaring.dk)
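The first of the three levels, distribution of files, follows the map/reduce pattern. The sketch below shows only that pattern: a thread pool on one machine stands in for a Hadoop cluster, in-memory byte strings stand in for files, and the small magic-number table and all function names are assumptions of the example rather than part of the SCAPE platform.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# A few well-known magic numbers; a real identification tool knows far more.
MAGIC = {
    b"%PDF": "pdf",
    b"\x89PNG": "png",
    b"GIF8": "gif",
}


def map_identify(item: tuple) -> tuple:
    """Map step: (name, content) -> (format, 1), based on the leading bytes."""
    _name, content = item
    for magic, fmt in MAGIC.items():
        if content.startswith(magic):
            return fmt, 1
    return "unknown", 1


def reduce_counts(pairs) -> Counter:
    """Reduce step: sum the counts emitted for each format."""
    totals = Counter()
    for fmt, n in pairs:
        totals[fmt] += n
    return totals


def characterise(items: list, workers: int = 4) -> Counter:
    """Distribute the items over workers, then aggregate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce_counts(pool.map(map_identify, items))


SAMPLE = [
    ("a.pdf", b"%PDF-1.4 ..."),
    ("b.png", b"\x89PNG\r\n\x1a\n..."),
    ("c.bin", b"\x00\x01\x02"),
    ("d.pdf", b"%PDF-1.7 ..."),
]

if __name__ == "__main__":
    print(characterise(SAMPLE))
```

Because each file is handled independently in the map step, this level needs no change to the tool itself. The other two levels, splitting binary files and parallelising algorithms, are not covered by the sketch.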

SCAPE Solutions: SCAPE Platform
Use case: Characterisation of file formats in the JISC UK Domain dataset (35 TB)
  Compared the DROID engine and Apache Tika
Conclusions
  Apache Tika has a significantly lower failure rate than DROID-B
  Most formats last much longer than 5 years
  Network effects appear to stabilise formats
  New formats appear at a modest, manageable rate
  Hence the “Rosenthal hypothesis” is confirmed to some extent
  HOWEVER, this study is about format usage; it does not yet address format renderability
Source: A. Jackson.

SCAPE Solutions: Automated Quality Assurance
  QA in web harvesting through automated comparison of rendered pages (combined structural and image analysis): MarcAlizer
  QA in image migration through deep characterisation: Jpylyzer
  QA in image digitisation through automated duplicate detection: matchbox

SCAPE Solutions: Automated Quality Assurance – Jpylyzer
  Parses a file and tests it against the format specification (ISO/IEC )
  Tests for required boxes and restrictions defined by the standard
  Can prove that a file does not conform to the standard, but cannot prove that it does: “valid” means “probably valid”
Next steps
  Run on Hadoop cluster: reduce run time from 21 days to 21 hours!
  Add “repair” functionality to workflow
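The idea of testing for required boxes can be illustrated with a minimal sketch. It is not Jpylyzer's code and checks far less than Jpylyzer does: it walks only the top-level boxes of a JP2 file and reports which of four required box types are missing. The function names are hypothetical.

```python
import struct

# Top-level boxes a JP2 file needs, identified by their 4-byte type codes.
REQUIRED_BOXES = (b"jP  ", b"ftyp", b"jp2h", b"jp2c")
SIGNATURE_PAYLOAD = b"\r\n\x87\n"  # fixed content of the signature box


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit length field (helper for the usage example)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def top_level_boxes(data: bytes) -> list:
    """Return (type, payload) per top-level box; raise ValueError if malformed."""
    boxes, pos = [], 0
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated box header")
        length, box_type = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if length == 1:  # a 64-bit extended length follows the type
            if pos + 16 > len(data):
                raise ValueError("truncated extended length")
            (length,) = struct.unpack(">Q", data[pos + 8:pos + 16])
            header = 16
        elif length == 0:  # the box extends to the end of the file
            length = len(data) - pos
        if length < header or pos + length > len(data):
            raise ValueError("box length out of range")
        boxes.append((box_type, data[pos + header:pos + length]))
        pos += length
    return boxes


def problems(data: bytes) -> list:
    """List structural problems; an empty list means only 'probably valid'."""
    try:
        boxes = top_level_boxes(data)
    except ValueError as exc:
        return [str(exc)]
    found = [box_type for box_type, _ in boxes]
    issues = [f"missing box {t!r}" for t in REQUIRED_BOXES if t not in found]
    if boxes and boxes[0] != (b"jP  ", SIGNATURE_PAYLOAD):
        issues.append("first box is not a valid signature box")
    return issues


if __name__ == "__main__":
    incomplete = make_box(b"jP  ", SIGNATURE_PAYLOAD) + make_box(b"ftyp", b"jp2 ")
    print(problems(incomplete))
```

An empty result from `problems` shows only that none of these particular checks failed, which mirrors the slide's point that "valid" means "probably valid".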

SCAPE Solutions: Automated Quality Assurance – matchbox
Various sources in the digital book production process (e.g. different scanning sources, various book page image versions, etc.) can introduce image duplicates in the compiled version of a digital book.
matchbox provides an automated solution to the duplicate image detection problem using the following algorithm:
  Detection of salient regions and extraction of the most discriminative descriptors using the standard SIFT detector and descriptors.
  A visual dictionary following a Bag of Words approach is created from a set of spatially distinctive descriptors.
  Once the dictionary is set up, fingerprints (visual histograms expressing the term frequency for each visual word in the corresponding image) are extracted for each image.
  Comparison of images becomes matching of visual fingerprints and results in a ranked shortlist of possible duplicates.
Next steps
  Optimise performance
  Run on Hadoop cluster
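The fingerprint and matching steps can be sketched with toy data. This is not matchbox's implementation: the 2-D tuples stand in for 128-dimensional SIFT descriptors, the three-word dictionary is hand-picked rather than learned, and cosine similarity is an assumed comparison measure that the slide does not name.

```python
import math

# Hand-picked toy dictionary of visual words (an assumption of this sketch).
DICTIONARY = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]


def nearest_word(descriptor: tuple) -> int:
    """Index of the visual word closest to the descriptor (Euclidean distance)."""
    return min(range(len(DICTIONARY)),
               key=lambda i: math.dist(descriptor, DICTIONARY[i]))


def fingerprint(descriptors: list) -> list:
    """Normalised term-frequency histogram over the visual words."""
    hist = [0.0] * len(DICTIONARY)
    for d in descriptors:
        hist[nearest_word(d)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def similarity(a: list, b: list) -> float:
    """Cosine similarity between two fingerprints."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_duplicates(query: str, images: dict) -> list:
    """Names of the other images, most similar to `query` first."""
    prints = {name: fingerprint(desc) for name, desc in images.items()}
    scores = [(similarity(prints[query], fp), name)
              for name, fp in prints.items() if name != query]
    return [name for _score, name in sorted(scores, reverse=True)]


if __name__ == "__main__":
    pages = {
        "page1": [(0.1, 0.0), (0.9, 0.1), (1.0, 0.0)],
        "page1_rescan": [(0.0, 0.1), (1.1, 0.0), (0.9, 0.0)],
        "page2": [(0.0, 0.9), (0.1, 1.0), (0.0, 1.1)],
    }
    print(rank_duplicates("page1", pages))
```

Comparing short histograms rather than raw descriptors keeps the pairwise matching step cheap, at the cost of discarding where in the image each descriptor was found.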

Additional Resources of Interest
  Development Infrastructure
    Code repository hosted by the Open Planets Foundation and GitHub
    Development Wiki
  Experimental Workflows
  Publications
  Public Deliverables

First SCAPE Training Event
Keeping Control – Scalable Preservation Environments for Identification and Characterisation
6-7 December 2012
Archaeological Museum of the Martins Sarmento Society, Guimarães, Portugal
Hosted by KEEP Solutions
Registration:
This event is also supported by the European Capital of Culture 2012:

SCAPE Contact Information
Twitter: #scapeproject
Dr. Ross King
AIT Austrian Institute of Technology GmbH
Donau-City-Strasse 1
A-1220 Wien

Thank you for your attention!