Clemens Neudecker KB National Library of the Netherlands SCAPE & OPF Hackathon Vienna, 2 dec 2013 What is Hadoop? Hadoop Driven Digital Preservation.


2 Timeline
Dec 2004: Dean/Ghemawat (Google) MapReduce paper
2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (the name is derived from Doug’s son’s toy elephant)
2006: Yahoo runs Hadoop on 5-20 nodes
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐ (Grant Agreement number ).

3 Timeline
March 2008: Cloudera founded
July 2008: Hadoop wins TeraByte sort benchmark (1st time a Java program won this competition)
April 2009: Amazon introduces “Elastic MapReduce” as a service on S3/EC2
June 2011: Hortonworks founded

4 Timeline
27 Dec 2011: Apache Hadoop release
June 2012: Facebook claims “biggest Hadoop cluster”, totalling more than 100 PetaBytes in HDFS
2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day
15 Oct 2013: Apache Hadoop release (YARN)

5 Contributions
(Cf.

6 “Core” Hadoop
Hadoop Common (formerly Hadoop Core)
Hadoop MapReduce
Hadoop YARN (MapReduce 2.0)
Hadoop Distributed File System (HDFS)

7 The wider Hadoop Ecosystem
Ambari, Zookeeper (managing & monitoring)
HBase, Cassandra (database)
Hive, Pig (data warehouse and query language)
Mahout (machine learning)
Chukwa, Avro, Oozie, Giraph, and many more

8 The wider Hadoop Ecosystem
collins-charles-zedlewski-cloudera

9 “Hadoop is a hammer”
“Hadoop is a hammer. Start by figuring out what house you’re gonna build.” Alistair Croll
“If all you have is a hammer, throw away everything that is not a nail!” Jimmy Lin

10 MapReduce in 41 words (including “library”)
Goal: count the number of books in the library.
Map: You count up shelf #1, I count up shelf #2. (The more people we get, the faster this part goes.)
Reduce: We all get together and add up our individual counts.
(Cf.
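
The library analogy can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not Hadoop code: the shelves, the book titles and the function names count_shelf and add_counts are made up for the example, and everything runs in a single process, whereas Hadoop would hand each shelf to a different node.

```python
from functools import reduce

# Hypothetical library: each shelf is a list of book titles (invented data).
shelves = [
    ["Dune", "Emma", "Ulysses"],                  # shelf #1
    ["Hamlet", "Walden"],                         # shelf #2
    ["Beloved", "Ivanhoe", "Candide", "Faust"],   # shelf #3
]


def count_shelf(shelf):
    """Map step: one person counts the books on one shelf."""
    return len(shelf)


def add_counts(a, b):
    """Reduce step: add two partial counts together."""
    return a + b


# Every shelf is counted independently of the others, so this part
# could be spread over as many workers as there are shelves.
partial_counts = list(map(count_shelf, shelves))

# Everyone gets together and the individual counts are summed.
total = reduce(add_counts, partial_counts, 0)

print(partial_counts, total)
```

The map step needs no communication between workers, which is why adding people (or nodes) speeds it up; only the reduce step has to bring the partial results together.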

11 MapReduce in a nutshell
[Figure: Task 1, Task 2 and Task 3 each produce output data, which is combined into an aggregated result. © Sven Schlarb]

12 MapReduce “v1” issues
JobTracker as a single point of failure
Deficiencies in scalability, memory consumption, threading model, reliability and performance ( 278)
Aim to support programming paradigms other than MapReduce (BSP)

13 MapReduce vs YARN
(Cf.

14 When to use Hadoop?
Generally, whenever “standard tools” no longer work because of sheer data size (rule of thumb: if your data fits on a regular hard drive, you’re better off sticking to Python/SQL/Bash/etc.!)
Aggregation across large data sets: use the power of Reducers!
Large-scale ETL operations (extract, transform, load)
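
To make the “power of Reducers” a little more concrete, here is a minimal single-machine sketch of a keyed aggregation in the map, sort, reduce style. The record layout (file format, a tab, then size in bytes), the sample values and the names mapper and reducer are assumptions chosen for illustration; on a real cluster the framework performs the sort/shuffle between the two phases and runs many mappers and reducers in parallel.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: one "file_format<TAB>size_in_bytes" record per line.
records = [
    "pdf\t1200",
    "tiff\t50000",
    "pdf\t800",
    "jpeg\t3000",
    "tiff\t70000",
]


def mapper(lines):
    """Emit one (key, value) pair for every input record."""
    for line in lines:
        fmt, size = line.split("\t")
        yield fmt, int(size)


def reducer(pairs):
    """Sum the values of each key; expects the pairs sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# sorted() stands in for the shuffle/sort phase between map and reduce.
totals = dict(reducer(sorted(mapper(records))))

print(totals)
```

For five records this is plainly a job for the “standard tools” the slide mentions; the same two functions only start to pay off once the input no longer fits on one machine.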

15 Reading
Tom White: Hadoop. The Definitive Guide (get 3rd ed. for extra YARN chapter)
YARN explained (really quite well): 0-in-hadoop-0-23/
Jimmy Lin: Text Processing with MapReduce: ml

16 Happy Hadooping!