The SCAPE Platform Overview. Rainer Schmidt, SCAPE Training Event, September 16th – 17th, 2013, The British Library.
SCAlable Preservation Environments SCAPE. Goal of the SCAPE Platform: A hardware and software platform to support scalable preservation in terms of computation and storage. Employs a scale-out architecture to support preservation activities against large amounts of data. Integrates existing tools, workflows, and data sources and sinks. A data center service providing a scalable execution and storage backend for different object management systems. Based on a minimal set of defined services for running tools and/or queries close to the data.
Underlying Technologies: The SCAPE Platform is built on top of existing data-intensive computing technologies. The reference implementation leverages the Hadoop software stack (HDFS, MapReduce, Hive, …). Virtualization and packaging model for dynamic deployments of tools and environments: Debian packages and IaaS support. Repository integration and services: Data/Storage Connector API (Fedora and Lily) and Object Exchange Format (METS/PREMIS representation). Workflow modeling, translation, and provisioning: Taverna Workbench and Component Catalogue, Workflow Compiler and Job Submission Service.
Architectural Overview (Core). [Figure: Component Catalogue, Workflow Modeling Environment, Component Lookup API, Component Registration API; annotation: Focus of this talk]
Hadoop Overview
The Framework: Open-source software framework for large-scale data-intensive computations running on large clusters of commodity hardware. Derived from Google's File System and MapReduce publications. Hadoop = MapReduce + HDFS. MapReduce: programming model (Map, Shuffle/Sort, Reduce) and execution environment. HDFS: virtual distributed file system overlay on top of local file systems.
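The three phases named above (Map, Shuffle/Sort, Reduce) can be illustrated with a small in-memory sketch. This is not Hadoop code: it runs in a single process, and all function names (`map_phase`, `shuffle_sort`, `reduce_phase`, `run_job`, `word_mapper`, `sum_reducer`) are hypothetical names chosen for illustration. On a real cluster the framework distributes these phases across nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_sort(pairs):
    """Group all values by key and return the groups ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Call the reducer once per key with the full list of values for that key."""
    return [reducer(key, values) for key, values in groups]


def run_job(records, mapper, reducer):
    """Chain the three phases: Map, then Shuffle/Sort, then Reduce."""
    return reduce_phase(shuffle_sort(map_phase(records, mapper)), reducer)


def word_mapper(line):
    """Example mapper: emit (word, 1) for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Example reducer: sum the counts collected for one key."""
    return (key, sum(values))
```

Calling `run_job(lines, word_mapper, sum_reducer)` on a list of text lines returns one `(word, count)` pair per distinct word. The user supplies only the mapper and reducer; grouping and ordering happen in between, which is the separation the programming model relies on.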
Programming Model: Designed for a write-once, read-many-times access model. Data IO is handled via HDFS. Data is divided into blocks (typically 64MB) that are distributed and replicated over data nodes. Parallelization logic is strictly separated from the user program. Automated data decomposition and communication between processing steps. Applications benefit from built-in support for data locality and fail-safety. Applications scale out on big clusters processing very large data volumes.
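As a worked example of the block decomposition described above, the following sketch estimates how many blocks and block replicas a file occupies. The 64MB block size comes from the slide; the replication factor of 3 is an assumption (it is the usual Hadoop default, but it is configurable per cluster), and `block_layout` is a hypothetical helper, not part of any Hadoop API.

```python
MB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Estimate block count, replica count, and raw storage for one file.

    replication=3 is an assumed value; check the cluster configuration.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    # Ceiling division without floating point: a partial last block still counts.
    blocks = -(-file_size_bytes // block_size_bytes)
    return {
        "blocks": blocks,
        "replicas": blocks * replication,
        # The last block only occupies its actual length, so raw usage
        # is the file size times the replication factor.
        "raw_bytes": file_size_bytes * replication,
    }
```

Under these assumptions a 1 GB file (1024 MB) is split into 16 blocks stored as 48 block replicas, which is what allows several nodes to process different parts of the same file locally.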
Cluster Set-up
Platform Deployment: There is no prescribed deployment model: private, institutionally shared, or external data center. Possible to deploy on “bare metal” or using virtualization and cloud middleware. Platform environment packaged as a VM image for automated and scalable deployment; presently supporting Eucalyptus (and AWS) clouds. SCAPE provides two shared Platform instances: a stable non-virtualized data-center cluster and a private-cloud based development cluster. Partitioning and dynamic reconfiguration.
Deploying Environments: IaaS enables packaging and dynamic deployment of (complex) software environments, but requires complex virtualization infrastructure. Data-intensive technology is able to deal with a constantly varying number of cluster nodes: node failures are expected and automatically handled, and the system can grow/shrink on demand. A Network Attached Storage solution can be used as a data source, but does not meet the scalability and performance needs of computation. SCAPE Hadoop clusters: Linux + preservation tools + SCAPE Hadoop libraries; optionally higher-level services (repository, workflow, …).
Using the Cluster
Wrapping Sequential Tools: Using a wrapper script (Hadoop Streaming API). PT’s generic Java wrapper allows one to use pre-defined patterns (based on the toolspec language). Works well for processing a moderate number of files, e.g. applying migration tools or FITS. Writing a custom MapReduce application: much more powerful and usually performs better. Suitable for more complex problems and file formats, such as Web archives. Using a high-level language like Hive or Pig: very useful for analysing (semi-)structured data, e.g. characterization output.
Available Tools: Preservation tools and libraries are pre-packaged so they can be automatically deployed on cluster nodes: SCAPE Debian packages, supporting the SCAPE Tool Specification Language. MapReduce libraries for processing large container files, for example METS and (W)ARC RecordReader. Application scripts based on Apache Hive, Pig, Mahout. Software components to assemble complex data-parallel workflows: Taverna and Oozie workflows.
Sequential Workflows: In order to run a workflow (or activity) on the cluster, it has to be parallelized first! A number of different parallelization strategies exist. The approach is typically determined on a case-by-case basis and may lead to changes of activities, workflow structure, or the entire application. Automated parallelization will only work to a certain degree. Trivial workflows can be deployed/executed without requiring individual parallelization (wrapper approach). SCAPE driver program for parallelizing Taverna workflows. SCAPE template workflows for different institutional scenarios have been developed.
Parallel Workflows: Are typically derived from sequential (conceptual) workflows created for a desktop environment (but may differ substantially!). Rely on MapReduce as the parallel programming model and Apache Hadoop as the execution environment. Data decomposition is handled by the Hadoop framework based on input format handlers (e.g. text, warc, mets-xml, etc.). Can make use of a workflow engine (like Taverna and Oozie) for orchestrating complex (composite) processes. May include interactions with data management systems (repositories) and sequential (concurrently executed) tools. Tool invocations are based on an API or command-line interface and performed as part of a MapReduce application.
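The idea of invoking a sequential command-line tool from within a map step can be sketched as follows. This is a minimal single-process illustration, not the SCAPE wrapper itself: `invoke_tool` and `map_over_paths` are hypothetical names, and the example tool is simply the Python interpreter, chosen so the sketch does not depend on any preservation tool being installed.

```python
import subprocess
import sys


def invoke_tool(command, input_path):
    """Run a command-line tool on one input and return (path, exit code, stdout).

    `command` is a list of arguments; the input path is appended as the
    last argument, as many command-line tools expect.
    """
    result = subprocess.run(command + [input_path], capture_output=True, text=True)
    return input_path, result.returncode, result.stdout.strip()


def map_over_paths(paths, command):
    """Map step: each input record is a file path, each output is the tool result."""
    for path in paths:
        yield invoke_tool(command, path)


# Stand-in "tool" for illustration: prints the length of its argument.
DEMO_TOOL = [sys.executable, "-c", "import sys; print(len(sys.argv[1]))"]
```

For example, `list(map_over_paths(["a.pdf", "b.tiff"], DEMO_TOOL))` yields one result tuple per path. In a MapReduce setting each mapper would process only its own share of the paths, so many tool instances run concurrently across the cluster.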
MapRed Tool Wrapper
Tool Specification Language: The SCAPE Tool Specification Language (toolspec) provides a schema to formalize command-line tool invocations. Can be used to automate a complex tool invocation (many arguments) based on a keyword (e.g. ps2pdfs). Provides a simple and flexible mechanism to define tool dependencies, for example those of a workflow; these can be resolved by the execution system using Linux packages. The toolspec is minimalistic and can easily be created for individual tools and scripts. Tools provided as SCAPE Debian packages come with a toolspec document by default.
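The keyword-based invocation described above can be sketched as a lookup from a keyword to a command template. The structure below is a simplified stand-in and not the actual toolspec schema: the dictionary layout and the `resolve` function are assumptions made for illustration, and only the keyword `ps2pdfs` and the `${input}` placeholder style are taken from these slides.

```python
import shlex
from string import Template

# Hypothetical, simplified stand-in for a toolspec document:
# each keyword maps to a command-line template with named placeholders.
TOOLSPEC = {
    "ps2pdfs": "ps2pdf ${input} ${output}",
}


def resolve(spec, keyword, **params):
    """Expand the command template registered under `keyword`.

    Parameter values are shell-quoted so paths containing spaces stay intact.
    Raises KeyError for an unknown keyword or a missing parameter.
    """
    template = Template(spec[keyword])
    quoted = {name: shlex.quote(value) for name, value in params.items()}
    return template.substitute(quoted)
```

With this sketch, `resolve(TOOLSPEC, "ps2pdfs", input="in.ps", output="out.pdf")` expands to the full command line, so a user only needs to know the keyword and the payload, not the tool's argument conventions.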
MapRed Toolwrapper: Hadoop provides scalability, reliability, and robustness, supporting the processing of data that does not fit on a single machine. Applications must, however, be made compliant with the execution environment. Our intention was to provide a wrapper that allows one to execute a command-line tool on the cluster in much the same way as in a desktop environment. The user simply specifies the toolspec file, command name, and payload data. Supports HDFS references and (optionally) standard IO streams. Supports the SCAPE toolspec to execute preinstalled tools or other applications available via the OS command-line interface.
Hadoop Streaming API: The Hadoop streaming API supports the execution of scripts (e.g. bash or python), which are automatically translated and executed as MapReduce applications. Can be used to process data with common UNIX filters using commands like echo, awk, tr. Hadoop is designed to process its input based on key/value pairs. This means the input data is interpreted and split by the framework, which is perfect for processing text but makes binary data difficult to process. The streaming API uses streams to read/write from/to HDFS. Preservation tools typically do not support HDFS file pointers and/or IO streaming through stdin/stdout. Hence, DP tools are difficult to use with the streaming API.
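To make the key/value convention concrete, here is a minimal sketch of a streaming-style mapper and reducer in Python. It assumes the common streaming convention of tab-separated `key<TAB>value` text lines and that the reducer receives its input sorted by key. The function names and the `map`/`reduce` command-line argument are choices made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input lines must already be sorted by key."""
    keyed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Only acts as a stdin/stdout filter when a stage name is given explicitly,
# e.g. `python wordcount.py map` (the script name is hypothetical).
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

The script can be tried without a cluster by piping text through the map stage, `sort`, and the reduce stage. The same line-oriented contract is why text works well here while binary payloads, which cannot be split safely on line boundaries, do not.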
Suitable Use-Cases: Use the MapRed Toolwrapper when dealing with (a large number of) single files. Be aware that this may not be an ideal strategy and that there are more efficient ways to deal with many files on Hadoop (Sequence Files, HBase, etc.). However, it is practical and sufficient in many cases, as no additional application development is required. A typical example is file format migration on a moderate number of files (e.g s), which can be included in a workflow with additional QA components. Very helpful when the payload is simply too big to be computed on a single machine.
Example – Exploring an uncompressed WARC: Unpacked a 1GB WARC.GZ on a local computer: 2.2 GB unpacked => files. `ls` took ~40s; counting *.html files with `file` took ~4 hrs => html files. Provided the corresponding bash command as toolspec: `if [ "$(file ${input} | awk "{print \$2}" )" == HTML ]; then echo "HTML" ; fi`. Moved the data to HDFS and executed pt-mapred with the toolspec. Results: 236min on local file system; 160min with 1 mapper on HDFS (this was a surprise!); 85min (2), 52min (4), 27min (8); 26min with 8 mappers and IO streaming (also a surprise).
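The reported timings can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. The sketch below uses only the HDFS timings from the slide (1, 2, 4, and 8 mappers) and takes the single-mapper run as the baseline; `speedup_table` is a hypothetical helper name.

```python
# Minutes per run on HDFS, keyed by number of mappers (values from the slide).
TIMINGS_MIN = {1: 160, 2: 85, 4: 52, 8: 27}


def speedup_table(timings):
    """Return (mappers, speedup, efficiency) rows relative to the 1-mapper run.

    speedup    = baseline time / time with n mappers
    efficiency = speedup / n  (1.0 would be perfect linear scaling)
    """
    baseline = timings[1]
    rows = []
    for mappers in sorted(timings):
        speedup = baseline / timings[mappers]
        rows.append((mappers, round(speedup, 2), round(speedup / mappers, 2)))
    return rows
```

With these numbers, 8 mappers give a speedup of roughly 5.9 over a single mapper, an efficiency of about 0.74. Compared with the 236min run on the local file system, the 8-mapper run is roughly 8.7 times faster.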
Ongoing Work: Source project and README on Github, presently under openplanets/scape/pt-mapred (will be migrated to its own repository soon). Presently it is required to generate an input file that specifies input file paths (along with optional output file names). TODO: Input binary data directly based on an input directory path, allowing Hadoop to take advantage of data locality. Input/output streaming and piping between toolspec commands has already been implemented. TODO: Add support for Hadoop Sequence Files. Look into possible integration with the Hadoop Streaming API.