The SCAPE Platform Overview
Rainer Schmidt
SCAPE Training Event, September 16th–17th, 2013, The British Library

Goal of the SCAPE Platform
- A hardware and software platform to support scalable preservation in terms of computation and storage.
- Employs a scale-out architecture to support preservation activities against large amounts of data.
- Integrates existing tools, workflows, and data sources and sinks.
- A data center service providing a scalable execution and storage backend for different object management systems.
- Based on a minimal set of defined services for executing tools and/or queries close to the data.

Underlying Technologies
- The SCAPE Platform is built on top of existing data-intensive computing technologies; the reference implementation leverages the Hadoop software stack (HDFS, MapReduce, Hive, …).
- Virtualization and packaging model for dynamic deployment of tools and environments: Debian packages and IaaS support.
- Repository integration and services: Data/Storage Connector API (Fedora and Lily); Object Exchange Format (METS/PREMIS representation).
- Workflow modeling, translation, and provisioning: Taverna Workbench and Component Catalogue; Workflow Compiler and Job Submission Service.

Architectural Overview (Core)
[Architecture diagram: Component Catalogue and Workflow Modeling Environment, with Component Lookup API and Component Registration API; a marker highlights the focus of this talk.]

Hadoop Overview

The Framework
- An open-source software framework for large-scale, data-intensive computations running on large clusters of commodity hardware.
- Derived from Google's File System and MapReduce publications.
- Hadoop = MapReduce + HDFS.
- MapReduce: programming model (Map, Shuffle/Sort, Reduce) and execution environment.
- HDFS: a virtual distributed file system overlaid on top of the local file systems.
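Interaction with the storage half of the stack typically goes through the HDFS shell. A minimal sketch (paths illustrative):

```bash
# Copy a local file into HDFS; the framework splits it into blocks
# and replicates them across the data nodes.
hadoop fs -put archive.warc /user/scape/input/

# Browse the HDFS namespace.
hadoop fs -ls /user/scape/input/

# Retrieve job output back to the local file system.
hadoop fs -get /user/scape/output/part-00000 result.txt
```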

Programming Model
- Designed for a write-once, read-many-times access model.
- Data I/O is handled via HDFS: data is divided into blocks (typically 64 MB), then distributed and replicated over the data nodes.
- Parallelization logic is strictly separated from the user program: the framework automates data decomposition and the communication between processing steps.
- Applications benefit from built-in support for data locality and fail-safety.
- Applications scale out on big clusters, processing very large data volumes.
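Because the shuffle phase is essentially a sort on the map output keys, the model can be prototyped locally with ordinary UNIX pipes before a job is submitted to the cluster. A minimal word-count sketch (file name illustrative):

```bash
# map:    emit one <word, 1> pair per word in the input
# sort:   stands in for the shuffle/sort phase, grouping equal keys
# reduce: sum the counts per key
cat input.txt \
  | tr -s '[:space:]' '\n' \
  | sed 's/$/\t1/' \
  | sort \
  | awk -F'\t' '{ c[$1] += $2 } END { for (w in c) print w "\t" c[w] }'
```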

Cluster Set-up

Platform Deployment
- There is no prescribed deployment model: private, institutionally shared, or an external data center.
- Possible to deploy on "bare metal" or using virtualization and cloud middleware.
- The Platform environment is packaged as a VM image for automated and scalable deployment; presently supporting Eucalyptus (and AWS) clouds.
- SCAPE provides two shared Platform instances: a stable, non-virtualized data-center cluster, and a private-cloud based development cluster supporting partitioning and dynamic reconfiguration.

Deploying Environments
- IaaS enables packaging and dynamic deployment of (complex) software environments, but requires a complex virtualization infrastructure.
- Data-intensive technology is able to deal with a constantly varying number of cluster nodes: node failures are expected and handled automatically, and the system can grow or shrink on demand.
- A Network Attached Storage solution can be used as the data source, but does not satisfy the scalability and performance needs of the computation.
- SCAPE Hadoop clusters: Linux + preservation tools + SCAPE Hadoop libraries, optionally with higher-level services (repository, workflow, …).

Using the Cluster

Wrapping Sequential Tools
- Using a wrapper script (Hadoop Streaming API): PT's generic Java wrapper allows one to use pre-defined patterns (based on the toolspec language). Works well for processing a moderate number of files, e.g. applying migration tools or FITS (see the sketch below).
- Writing a custom MapReduce application: much more powerful and usually performs better; suitable for more complex problems and file formats, such as web archives.
- Using a high-level language like Hive or Pig: very useful for analyzing (semi-)structured data, e.g. characterization output.
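The wrapper route boils down to a map-only job over a list of object references. A sketch of the mechanics using plain Hadoop streaming, with illustrative paths and a hypothetical migrate-one.sh helper (the SCAPE toolwrapper automates exactly this kind of setup):

```bash
# filelist.txt (in HDFS) lists one input path per line.
# NLineInputFormat hands each map task a single line; the mapper
# (migrate-one.sh, a hypothetical helper) fetches the object,
# runs the sequential tool on it, and stores the result.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.reduce.tasks=0 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -input /user/scape/filelist.txt \
  -output /user/scape/migration-log \
  -mapper migrate-one.sh \
  -file migrate-one.sh
```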

Available Tools
- Preservation tools and libraries are pre-packaged so they can be automatically deployed on the cluster nodes: SCAPE Debian packages, supporting the SCAPE Tool Specification Language.
- MapReduce libraries for processing large container files, for example a METS and a (W)ARC RecordReader.
- Application scripts based on Apache Hive, Pig, and Mahout (a Hive example follows below).
- Software components to assemble complex data-parallel workflows: Taverna and Oozie workflows.
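For instance, once characterization output has been loaded into a Hive table, distribution queries become one-liners. A sketch assuming a hypothetical fits_output table with a mime column:

```bash
# Aggregate characterization records by MIME type; Hive compiles
# the query into MapReduce jobs that run across the cluster.
hive -e '
  SELECT mime, COUNT(*) AS objects
  FROM fits_output
  GROUP BY mime
  ORDER BY objects DESC;'
```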

Sequential Workflows
- In order to run a workflow (or activity) on the cluster, it has to be parallelized first!
- A number of different parallelization strategies exist; the approach is typically determined on a case-by-case basis and may lead to changes of activities, the workflow structure, or the entire application.
- Automated parallelization will only work to a certain degree: trivial workflows can be deployed and executed without individual parallelization (wrapper approach).
- A SCAPE driver program for parallelizing Taverna workflows and SCAPE template workflows for different institutional scenarios have been developed.

Parallel Workflows
- Are typically derived from the sequential (conceptual) workflows created for a desktop environment (but may differ substantially!).
- Rely on MapReduce as the parallel programming model and Apache Hadoop as the execution environment.
- Data decomposition is handled by the Hadoop framework based on input format handlers (e.g. text, WARC, METS-XML, etc.).
- Can make use of a workflow engine (like Taverna or Oozie) for orchestrating complex (composite) processes.
- May include interactions with data management systems (repositories) and sequential (concurrently executed) tools.
- Tool invocations are based on an API or the command-line interface and are performed as part of a MapReduce application.

MapRed Tool Wrapper

Tool Specification Language
- The SCAPE Tool Specification Language (toolspec) provides a schema to formalize command-line tool invocations.
- Can be used to automate a complex tool invocation (many arguments) behind a simple keyword (e.g. ps2pdf).
- Provides a simple and flexible mechanism to define tool dependencies, for example of a workflow; these can be resolved by the execution system using Linux packages.
- The toolspec is minimalistic and can easily be created for individual tools and scripts.
- Tools provided as SCAPE Debian packages come with a toolspec document by default.
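To make the idea concrete, the sketch below shows the shape of such an entry: a keyword bound to a parameterized command line. The element names are illustrative approximations, not the normative toolspec schema; the ${input}/${output} placeholder style follows the example later in this deck.

```xml
<!-- Illustrative toolspec-style entry (element names approximate,
     not the normative schema): binds the keyword "ps2pdf" to a
     Ghostscript command line with input/output placeholders. -->
<tool name="ghostscript">
  <operation name="ps2pdf">
    <command>gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite
             -sOutputFile=${output} ${input}</command>
    <input name="input"/>
    <output name="output"/>
  </operation>
</tool>
```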

MapRed Toolwrapper
- Hadoop provides scalability, reliability, and robustness, and supports processing data that does not fit on a single machine; applications must, however, be made compliant with the execution environment.
- Our intention was to provide a wrapper that allows one to execute a command-line tool on the cluster in much the same way as on a desktop environment.
- The user simply specifies the toolspec file, the command name, and the payload data.
- Supports HDFS references and (optionally) standard I/O streams.
- Supports the SCAPE toolspec, to execute preinstalled tools or other applications available via the OS command-line interface.
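Conceptually, an invocation then looks like the sketch below. The option letters and file names are placeholders rather than the wrapper's actual CLI (see the project README for that); the moving parts match the description above: an input list, a toolspec, and an action keyword.

```bash
# Hypothetical toolwrapper invocation: inputpaths.txt lists one HDFS
# path per line, ghostscript.xml is the toolspec document, and
# ps2pdf is the action keyword defined in it.
hadoop jar pt-mapred.jar \
  -i inputpaths.txt \
  -t ghostscript.xml \
  -a ps2pdf
```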

Hadoop Streaming API
- The Hadoop Streaming API supports the execution of scripts (e.g. bash or python), which are automatically translated into and executed as MapReduce applications.
- Can be used to process data with common UNIX filters using commands like echo, awk, or tr.
- Hadoop is designed to process its input based on key/value pairs, meaning the input data is interpreted and split by the framework: perfect for processing text, but difficult for binary data.
- The streaming API uses streams to read from and write to HDFS, but preservation tools typically support neither HDFS file pointers nor I/O streaming through stdin/stdout; hence, DP tools are difficult to use with the streaming API.
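For text, the earlier word-count pipeline carries over almost verbatim; a sketch with illustrative paths and a hypothetical wc-map.sh helper:

```bash
# wc-map.sh (hypothetical helper, shipped to the nodes via -file)
# emits one word per line:   #!/bin/sh
#                            tr -s '[:space:]' '\n'
# The framework sorts the mapper output, so `uniq -c` on the reduce
# side counts consecutive duplicates, i.e. occurrences per word.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/scape/text \
  -output /user/scape/wordcount \
  -mapper wc-map.sh \
  -reducer 'uniq -c' \
  -file wc-map.sh
```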

Suitable Use-Cases
- Use the MapRed toolwrapper when dealing with a (large number of) single files.
- Be aware that this may not be an ideal strategy: there are more efficient ways to deal with many files on Hadoop (SequenceFiles, HBase, etc.). It is, however, practical and sufficient in many cases, as no additional application development is required.
- A typical example is file format migration on a moderate number of files, which can be included in a workflow with additional QA components.
- Very helpful when the payload is simply too big to be computed on a single machine.

Example – Exploring an Uncompressed WARC
- Unpacked a 1 GB WARC.GZ on a local computer: 2.2 GB unpacked => … files.
- `ls` took ~40 s; counting *.html files with `file` took ~4 hrs => … HTML files.
- Provided the corresponding bash command as a toolspec:

```bash
if [ "$(file ${input} | awk "{print \$2}")" == HTML ]; then echo "HTML"; fi
```

- Moved the data to HDFS and executed pt-mapred with the toolspec. Runtimes:
  - 236 min on the local file system
  - 160 min with 1 mapper on HDFS (this was a surprise!)
  - 85 min (2 mappers), 52 min (4 mappers), 27 min (8 mappers)
  - 26 min with 8 mappers and I/O streaming (also a surprise)

Ongoing Work
- Source project and README are on GitHub, presently under openplanets/scape/pt-mapred; will be migrated to its own repository soon.
- Presently one is required to generate an input file that specifies the input file paths (along with optional output file names). TODO: take binary input directly from an input directory path, allowing Hadoop to take advantage of data locality.
- Input/output streaming and piping between toolspec commands has already been implemented.
- TODO: add support for Hadoop SequenceFiles; look into a possible integration with the Hadoop Streaming API.
