Bryon Gill Pittsburgh Supercomputing Center

Presentation transcript:

Hadoop: An Overview
Bryon Gill, Pittsburgh Supercomputing Center

What Is Hadoop?
  - Programming platform
  - Filesystem
  - Software ecosystem
  - Stuffed elephant

What Does Hadoop Do?
  - Distributes files
      - Replication
      - Closer to the CPU
  - Computes
      - Map/Reduce
      - Other

MapReduce
  - Map function
      - Maps key/value pairs to intermediate key/value pairs
  - Reduce function
      - Shuffle/Sort/Reduce
      - Aggregates results of map
  - Diagram: Data -> Map -> Shuffle/Sort -> Reduce -> Results
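
The data flow on this slide can be sketched in plain Python, without a cluster. This is only an illustration of the model, assuming a word-count job; the names `map_fn`, `shuffle_sort`, `reduce_fn`, and `run_job` are hypothetical and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: turn one input record into intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle_sort(pairs):
    """Shuffle/sort: group intermediate values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce: aggregate every value that shares a key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle/sort, and reduce in sequence on a list of lines."""
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))
    return [reduce_fn(key, values) for key, values in shuffle_sort(intermediate)]


if __name__ == "__main__":
    for word, count in run_job(["to be or not to be", "to do"]):
        print(word, count)
```

On a real cluster the map and reduce steps run in parallel on many nodes and the framework performs the shuffle/sort; here everything runs in one process so the three stages are easy to see.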

HDFS: Hadoop Distributed File System
  - Replication
      - Failsafe
      - Predistribution
  - Write Once Read Many (WORM)
      - Streaming throughput
      - Simplified data coherency
      - No random access (contrast with RDBMS)

HDFS: Hadoop Distributed File System
  - Meta filesystem
      - Requires underlying FS
      - Special access commands
  - Exports
      - NFS
      - Fuse
      - Vendor filesystems

HDFS Source: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

HDFS: Daemons
  - Namenode
      - Metadata server
  - Datanode
      - Holds blocks
      - Compute node

YARN: Yet Another Resource Negotiator
  - Programming interface (replaces MapReduce)
  - Includes MapReduce API (compatible with 1.x)
  - Assigns resources for applications

YARN: Daemons
  - ResourceManager
      - Applications Manager
      - Scheduler (pluggable)
  - NodeManager
      - Worker node
      - Containers (tasks from ApplicationManager)

YARN Source: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Using Hadoop
  - Load data to HDFS
      - Fs commands
  - Write a program
      - Java
      - Hadoop Streaming
  - Submit a job

Fs Commands
  “FTP-style” commands:
    hdfs dfs -put /local/path/myfile /user/$USER/
    hdfs dfs -cat /user/$USER/myfile    # | more
    hdfs dfs -ls
    hdfs dfs -get /user/$USER/myfile

Moving Files
    # on bridges:
    hdfs dfs -put /home/training/hadoop/datasets /
    # if you don't have permissions for / (e.g. shared cluster)
    # you can put it in your home directory
    # (making sure to adjust paths in examples):
    hdfs dfs -put /home/training/hadoop/datasets

Writing a MapReduce Program
  - Hadoop Streaming
      - Mapper and reducer scripts read/write stdin/stdout
      - Whole line is the key, value is null (unless there's a tab)
      - Use built-in utilities (wc, grep, cat)
      - Write in any language (Python)
  - Java (compile/jar/run)
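
The key/value rule on this slide can be illustrated with a few lines of Python. This is a sketch of the default behaviour only (text up to the first tab is the key, the rest is the value, and a line with no tab is all key with a null value); `split_streaming_line` is a hypothetical helper, not a function provided by Hadoop.

```python
def split_streaming_line(line, separator="\t"):
    """Split one text line into (key, value) the way the slide describes.

    The text before the first separator is the key and the remainder is
    the value. With no separator, the whole line is the key and the value
    is None (standing in for "null").
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    return key, (value if found else None)


if __name__ == "__main__":
    print(split_streaming_line("hello\t1\n"))
    print(split_streaming_line("a whole line with no tab\n"))
```

Only the first tab splits the line, so a value may itself contain tabs.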

Simple MapReduce Job (Hadoop Streaming)
  - cat as mapper
  - wc as reducer
    hadoop jar \
        $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
        -input /datasets/plays/ -output streaming-out \
        -mapper '/bin/cat' -reducer '/usr/bin/wc -l'

Python MapReduce (Hadoop Streaming)
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
        -file ~training/hadoop/mapper.py -mapper mapper.py \
        -file ~training/hadoop/reducer.py -reducer reducer.py \
        -input /datasets/plays/ -output pyout
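
The contents of mapper.py and reducer.py are not shown in the slides. The following is a minimal word-count sketch of what such a pair could look like, not the actual training files; the function names `mapper` and `reducer` are assumptions made for illustration. Both steps are written as functions over an iterable of lines so that they can be tried locally, with `sorted()` standing in for Hadoop's shuffle/sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word.

    Assumes every line is 'word<TAB>count' and that lines arrive sorted
    by word, which is what the shuffle/sort phase provides.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Local simulation: sorted() plays the role of shuffle/sort.
    sample = ["to be or not to be\n", "to do\n"]
    for record in reducer(sorted(mapper(sample))):
        print(record)
```

In an actual streaming job each function would live in its own script, read from `sys.stdin`, and print each record to stdout, since that is the interface Hadoop Streaming expects.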

MapReduce Java: Compile, Jar, Run
    cp /home/training/hadoop/*.java ./
    hadoop com.sun.tools.javac.Main WordCount.java
    jar cf wc.jar WordCount*.class
    hadoop jar wc.jar WordCount /datasets/compleat.txt output

Getting Output
    hdfs dfs -cat /user/$USER/streaming-out/part-00000 | more
    hdfs dfs -get /user/$USER/streaming-out/part-00000

Further Exploration
  - Write a Hadoop Streaming wordcount program in the language of your choice.
  - Advanced: Extend the word count example to use a custom input format, splitting the text by sentence rather than by line.

Questions? Thanks!