SCAPE Rainer Schmidt SCAPE Information Day May 5th, 2014 Österreichische Nationalbibliothek The SCAPE Platform Overview

SCAlable Preservation Environments SCAPE 2 The SCAPE Platform Sub-project
- Develops an integrated environment based on a set of mature technologies, including Apache Hadoop, the Taverna workflow stack, and Fedora Commons.
- Aims at enabling the execution of preservation workflows in a scalable fashion, supporting different preservation tools, workflows, and various data sets.
- An infrastructure that has been deployed in different configurations and at various SCAPE institutions.
- Part of the SCAPE environment, which also involves planning, packaging, scenario development, …

SCAlable Preservation Environments SCAPE Architectural Overview (Core)
[Diagram: Component Catalogue, Workflow Modeling Environment, Component Lookup API, Component Registration API]

SCAlable Preservation Environments SCAPE Architectural Overview (Core)
[Same diagram as above, highlighting the focus of this talk]

SCAlable Preservation Environments SCAPE Execution Platform

SCAlable Preservation Environments SCAPE 6 Wrapping Sequential Tools
- Using a wrapper script (Hadoop Streaming API): PT's ToMaR supports pre-defined patterns (based on the toolspec language). Works well for processing a moderate number of files, e.g. applying migration tools or FITS.
- Writing a custom MapReduce application: much more powerful and usually performs better. Suitable for problems at larger scale and/or handling complex file formats (e.g. web archiving).
- Using a high-level language like Hive or Pig: very useful for performing analysis of (semi-)structured data, e.g. characterization output (see the sketch below).
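As a flavor of the third option, a single Hive query can aggregate characterization output once it has been loaded into a table (the table and column names here, fits_output and mime_type, are hypothetical):

    # Count files per MIME type from characterization output stored in Hive
    hive -e 'SELECT mime_type, COUNT(*) AS n
             FROM fits_output
             GROUP BY mime_type
             ORDER BY n DESC;'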

SCAlable Preservation Environments SCAPE Available Tools
- SCAPE Debian packages: preservation tools and libraries are pre-packaged so they can be automatically deployed on cluster nodes; these support the SCAPE Tool Specification Language.
- MapReduce libraries for processing large container files, for example METS and (W)ARC RecordReaders.
- Application scripts based on Apache Hive, Pig, and Mahout.
- Software components to assemble complex data-parallel workflows: Taverna, Pig, and Oozie workflows.

SCAlable Preservation Environments SCAPE MapRed Tool Wrapper - ToMaR

SCAlable Preservation Environments SCAPE 10 Hadoop Streaming API
- The Hadoop streaming API supports the execution of scripts (e.g. bash or python), which are automatically translated into and executed as MapReduce applications.
- Can be used to process data with common UNIX filters using commands like echo, awk, and tr (see the sketch below).
- Hadoop is designed to process its input based on key/value pairs, which means the input data is interpreted and split by the framework. This is perfect for processing text, but difficult for binary data.
- The streaming API uses streams to read/write from/to HDFS. Preservation tools typically support neither HDFS file pointers nor I/O streaming through stdin/stdout. Hence, DP tools are difficult to use with the streaming API.
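A minimal sketch of such a streaming job, built only from UNIX filters (the jar location and HDFS paths are assumptions and differ between installations):

    # Mapper and reducer are ordinary executables that read stdin and write stdout,
    # which is exactly the contract the streaming API expects.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input  /user/scape/text-input \
        -output /user/scape/line-count \
        -mapper  /bin/cat \
        -reducer '/usr/bin/wc -l'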

SCAlable Preservation Environments SCAPE 11 Tool-to-MapReduce (ToMaR)
- Hadoop provides scalability, reliability, and robustness, supporting the processing of data that does not fit on a single machine. Applications must, however, be made compliant with the execution environment.
- Our intention was to provide a wrapper that allows one to execute a command-line tool on the cluster in a similar way as on a desktop environment.
- The user simply specifies the toolspec file, a command name, and the payload data (see the sketch below).
- Supports file references and (optionally) standard I/O streams.
- Supports the SCAPE toolspec to execute preinstalled tools or other applications available via the OS command-line interface.
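A sketch of what this looks like in practice, reusing the invocation shown later in this deck (the control-file layout and all paths are assumptions for illustration):

    # inputfiles.txt lists the payload: one HDFS input path per line,
    # optionally followed by an output file name, e.g.
    #   /user/scape/input/page-001.ps  page-001.pdf
    #   /user/scape/input/page-002.ps  page-002.pdf

    # Execute the wrapped command-line tool on the cluster
    hadoop jar mpt-mapred.jar -j migration-job -i inputfiles.txt -r toolspecs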

SCAlable Preservation Environments SCAPE 12 Tool Specification Language
- The SCAPE Tool Specification Language (toolspec) provides a schema to formalize command-line tool invocations.
- Can be used to automate a complex tool invocation (many arguments) based on a keyword (e.g. ps2pdfs).
- Provides a simple and flexible mechanism to define tool dependencies, for example of a workflow; these can be resolved by the execution system using Linux packages.
- The toolspec is minimalistic and can easily be created for individual tools and scripts.
- Tools provided as SCAPE Debian packages come with a toolspec document by default.

SCAlable Preservation Environments SCAPE 13 Ghostscript Example
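The many-argument invocation that a keyword like ps2pdfs encapsulates can be illustrated with a plain Ghostscript call (standard gs options; the ${input}/${output} placeholders follow the convention used in the toolspec snippet later in this deck):

    # A PostScript-to-PDF migration spelled out in full; the toolspec reduces
    # this to a keyword plus input/output placeholders
    gs -dBATCH -dNOPAUSE -dSAFER \
       -sDEVICE=pdfwrite \
       -sOutputFile=${output} \
       ${input}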

SCAlable Preservation Environments SCAPE 14 Suitable Use-Cases
- Use the MapRed tool wrapper when dealing with (a large number of) single files.
- Be aware that this may not be an ideal strategy; there are more efficient ways to deal with many files on Hadoop (SequenceFiles, HBase, etc.). However, it is practical and sufficient in many cases, as no additional application development is required.
- A typical example is file format migration on a moderate number of files, which can be included in a workflow with additional QA components.
- Very helpful when the payload is simply too big to be computed on a single machine.

SCAlable Preservation Environments SCAPE 15 Example – Exploring an uncompressed WARC
- Unpacked a 1 GB WARC.GZ on a local computer: 2.2 GB unpacked => files.
- `ls` took ~40 s; counting *.html files with `file` took ~4 hrs => html files.
- Provided the corresponding bash command as a toolspec:

    if [ "$(file ${input} | awk '{print $2}')" == "HTML" ]; then echo "HTML"; fi

- Moved the data to HDFS and executed pt-mapred with the toolspec:
  236 min on the local file system;
  160 min with 1 mapper on HDFS (this was a surprise!);
  85 min (2 mappers), 52 min (4 mappers), 27 min (8 mappers);
  26 min with 8 mappers and I/O streaming (also a surprise).

SCAlable Preservation Environments SCAPE Workflow Support

SCAlable Preservation Environments SCAPE 17 What we mean by Workflow
- Formalized (and repeatable) processes/experiments consisting of one or more activities, interpreted by a workflow engine.
- Usually modeled as DAGs based on control-flow and/or data-flow logic.
- The workflow engine functions as a coordinator/scheduler that triggers the execution of the involved activities; it may be a desktop or server-side component.
- Example workflow engines are Taverna Workbench, Taverna Server, and Apache Oozie. They are not equally rich and are designed for different purposes: experimentation & science, SOA, and Hadoop integration, respectively.

SCAlable Preservation Environments SCAPE 18 ToMaR and Taverna Workbench
- No significant changes in workflow structure compared to the sequential workflow.
- Remote activities are orchestrated using Taverna's Tool Plugin over SSH.
- The Platform's MapRed tool wrapper is used to invoke command-line tools on the cluster:

    hadoop jar mpt-mapred.jar -j $jobname -i $infile -r toolspecs

SCAlable Preservation Environments SCAPE 19 ToMaR and MapReduce Workflows
- ToMaR can be utilized as a standalone MapReduce application or chained with other Hadoop applications.
- Apache Oozie provides a server-side (cluster-side) workflow engine and scheduler for Hadoop jobs, and supports ToMaR as MapReduce actions.
- ToMaR is very well suited to help implement ETL workflows dealing with binary content (e.g. metadata extraction).
- Apache Pig provides a widely used ETL tool for Hadoop implementing a script-based query language. We implemented a UDF that enables one to use ToMaR within Pig scripts, which enables the use of data-flow mechanisms and database abstractions to process the output of CLI-based tools (see the sketch below).
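For orientation, launching such workflows from the command line might look as follows (the Oozie server URL, property file, and Pig script name are hypothetical; oozie job and pig -f are the standard CLI entry points):

    # Submit an Oozie workflow whose actions include ToMaR MapReduce jobs
    oozie job -oozie http://oozie-host:11000/oozie -config tomar-workflow.properties -run

    # Or run a Pig script that invokes ToMaR through the UDF mentioned above
    pig -f tomar_etl.pig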

SCAlable Preservation Environments SCAPE 20 Ongoing Work
- Source project and README are available on GitHub.
- Presently it is required to generate an input file that specifies input file paths (along with optional output file names). Planned: consume binary input directly based on an input directory path, enabling Hadoop to take advantage of data locality.
- Input/output streaming and piping between toolspec commands have already been implemented.
- Support for Hadoop file types like SequenceFiles.
- Integration with the Hadoop Streaming API and Pig.

SCAlable Preservation Environments SCAPE Thank you! Questions?