Hadoop: what is it?. Hadoop manages: – processor time – memory – disk space – network bandwidth Does not have a security model Can handle HW failure.

Slides:



Advertisements
Similar presentations
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
Advertisements

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Serverless Network File Systems. Network File Systems Allow sharing among independent file systems in a transparent manner Mounting a remote directory.
O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee.
File System Implementation
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
70-270, MCSE/MCSA Guide to Installing and Managing Microsoft Windows XP Professional and Windows Server 2003 Chapter Nine Managing File System Access.
Hadoop Setup. Prerequisite: System: Mac OS / Linux / Cygwin on Windows Notice: 1. only works in Ubuntu will be supported by TA. You may try other environments.
5.1 © 2004 Pearson Education, Inc. Exam Managing and Maintaining a Microsoft® Windows® Server 2003 Environment Lesson 5: Working with File Systems.
NFS. The Sun Network File System (NFS) An implementation and a specification of a software system for accessing remote files across LANs. The implementation.
Hadoop File System B. Ramamurthy 4/19/2017.
Jian Wang Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das Yahoo! Inc. Bangalore & Apache Software Foundation.
MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 7 Configuring File Services in Windows Server 2008.
File Systems (2). Readings r Silbershatz et al: 11.8.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
The Hadoop Distributed File System, by Dhyuba Borthakur and Related Work Presented by Mohit Goenka.
Dr. G Sudha Sadhasivam Professor, CSE PSG College of Technology Coimbatore INTRODUCTION TO HADOOP.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop Distributed File System by Swathi Vangala.
The Hadoop Distributed File System: Architecture and Design by Dhruba Borthakur Presented by Bryant Yao.
1 The Google File System Reporter: You-Wei Zhang.
CSC 456 Operating Systems Seminar Presentation (11/13/2012) Leon Weingard, Liang Xin The Google File System.
Lesson 7-Creating and Changing Directories. Overview Using directories to create order. Managing files in directories. Using pathnames to manage files.
The Hadoop Distributed File System
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
Pooja Shetty Usha B Gowda.  Network File Systems (NFS)  Drawbacks of NFS  Parallel Virtual File Systems (PVFS)  PVFS components  PVFS application.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
Overview Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Distributed File Systems
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Introduction to Hadoop and HDFS
10.1 Silberschatz, Galvin and Gagne ©2005 Operating System Principles 10.4 File System Mounting A file system must be mounted before it can be accessed.
1 Administering Shared Folders Understanding Shared Folders Planning Shared Folders Sharing Folders Combining Shared Folder Permissions and NTFS Permissions.
Introduction to DFS. Distributed File Systems A file system whose clients, servers and storage devices are dispersed among the machines of a distributed.
File Management Chapter 12. File Management File management system is considered part of the operating system Input to applications is by means of a file.
Introduction to HDFS Prasanth Kothuri, CERN 2 What’s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.
Introduction to HDFS Prasanth Kothuri, CERN 2 What’s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand.
Linux+ Guide to Linux Certification, Third Edition
Distributed and Parallel Processing Technology Chapter3. The Hadoop Distributed filesystem Kuldeep Gurjar 19 th March
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition File System Implementation.
HADOOP DISTRIBUTED FILE SYSTEM HDFS Reliability Based on “The Hadoop Distributed File System” K. Shvachko et al., MSST 2010 Michael Tsitrin 26/05/13.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
 Introduction  Architecture NameNode, DataNodes, HDFS Client, CheckpointNode, BackupNode, Snapshots  File I/O Operations and Replica Management File.
CSE 548 Advanced Computer Network Security Trust in MobiCloud using Hadoop Framework Updates Sayan Kole Jaya Chakladar Group No: 1.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Lecture 02 File and File system. Topics Describe the layout of a Linux file system Display and set paths Describe the most important files, including.
Review CS File Systems - Partitions What is a hard disk partition?
Distributed File System. Outline Basic Concepts Current project Hadoop Distributed File System Future work Reference.
BIG DATA/ Hadoop Interview Questions.
Presenter: Yue Zhu, Linghan Zhang A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint.
CCD-410 Cloudera Certified Developer for Apache Hadoop (CCDH) Cloudera.
File System Implementation
Hadoop: what is it?.
CSCE 822 Project Presentation Schedule
Gregory Kesden, CSE-291 (Storage Systems) Fall 2017
Gregory Kesden, CSE-291 (Cloud Computing) Fall 2016
GARRETT SINGLETARY.
Hadoop Distributed Filesystem
Outline Announcements Lab2 Distributed File Systems 1/17/2019 COP5611.
Chapter 15: File System Internals
Outline Review of Quiz #1 Distributed File Systems 4/20/2019 COP5611.
Lecture 4: File-System Interface
Network File System (NFS)
Presentation transcript:

Hadoop: what is it?

Hadoop manages: – processor time – memory – disk space – network bandwidth Does not have a security model Can handle HW failure

Issues: – race conditions – synchronization – deadlock i.e., same issues as distributed OS & distributed filesystem

Hadoop vs other existing approaches Grid computing: (What is this?) e.g. Condor – MPI model is more complicated – does not automatically distribute data – requires separate managed SAN

Hadoop: – simplified programming model – data distributed as it is loaded HDFS splits large data files across machines HDFS replicates data – failure causes additional replication

Distribute data at load time

MapReduce Core idea: records are processed in isolation Benefit: reduced communication Jargon: – mapper – task that processes records – Reducer – task that aggregates results from mappers

MapReduce

How is the previous picture different from normal grid/cluster computing? Grid/cluster: Programmer manages communication via MPI vs Hadoop: communication is implicit Hadoop manages data transfer and cluster topology issues

Scalability Hadoop overhead – MPI does better for small numbers of nodes Hadoop – flat scalabity  pays off with large data – Little extra work to go from few to many nodes MPI – requires explicit refactoring from small to larger number of nodes

Hadoop Distributed File System NFS: the Network File System – Saw this in OS class – Supports file system exporting – Supports mounting of remote file system

Operating System Concepts NFS Mounting: Three Independent File Systems

Operating System Concepts Mounting in NFS Mounts Cascading mounts

Operating System Concepts NFS Mount Protocol Establishes logical connection between server and client. Mount operation: name of remote directory & name of server – Mount request is mapped to corresponding RPC and forwarded to mount server running on server machine. – Export list – specifies local file systems that server exports for mounting, along with names of machines that are permitted to mount them.

Operating System Concepts NFS Mount Protocol server returns a file handle—a key for further accesses. File handle – a file-system identifier, and an inode number to identify the mounted directory The mount operation changes only the user’s view and does not affect the server side.

NFS Advantages – Transparency – clients unaware of local vs remote – Standard operations - open(), close(), fread(), etc. NFS disadvantages – Files in an NFS volume reside on a single machine – No reliability guarantees if that machine goes down – All clients must go to this machine to retrieve their data

Hadoop Distributed File System HDFS Advantages: – designed to store terabytes or petabytes – data spread across a large number of machines – supports much larger file sizes than NFS – stores data reliably (replication)

Hadoop Distributed File System HDFS Advantages: – provides fast, scalable access – serve more clients by adding more machines – integrates with MapReduce  local computation

Hadoop Distributed File System HDFS Disadvantages – Not as general-purpose as NFS – Design restricts use to a particular class of applications – HDFS optimized for streaming read performance  not good at random access

Hadoop Distributed File System HDFS Disadvantages – Write once read many model – Updating a files after it has been closed is not supported (can’t append data) – System does not provide a mechanism for local caching of data

Hadoop Distributed File System HDFS – block-structured file system File broken into blocks distributed among DataNodes DataNodes – machines used to store data blocks

Hadoop Distributed File System Target machines chosen randomly on a block-by-block basis Supports file sizes far larger than a single-machine DFS Each block replicated across a number of machines (3, by default)

Hadoop Distributed File System

Expects large file size – Small number of large files – Hundreds of MB to GB each Expects sequential access Default block size in HDFS is 64MB Result: – Reduces amount of metadata storage per file – Supports fast streaming of data (large amounts of contiguous data)

Hadoop Distributed File System HDFS expects to read a block start-to-finish – Useful for MapReduce – Not good for random access – Not a good general purpose file system

Hadoop Distributed File System HDFS files are NOT part of the ordinary file system HDFS files are in separate name space Not possible to interact with files using ls, cp, mv, etc. Don’t worry: HDFS provides similar utilities

Hadoop Distributed File System Meta data handled by NameNode – Deal with synchronization by only allowing one machine to handle it – Store meta data for entire file system – Not much data: file names, permissions, & locations of each block of each file

Hadoop Distributed File System

What happens if the NameNode fails? – Bigger problem than failed DataNode – Better be using RAID ;-) – Cluster is kaput until NameNode restored Not exactly relevant but: – DataNodes are more likely to fail. – Why?

Cluster Configuration First download and unzip a copy of Hadoop ( Or better yet, follow this lecture first ;-) Important links: – Hadoop website – Hadoop Users Guide dist/hadoop-hdfs/HdfsUserGuide.htmlhttp://hadoop.apache.org/docs/current/hadoop-project- dist/hadoop-hdfs/HdfsUserGuide.html – 2012 Edition of Hadoop User’s Guide

Cluster Configuration HDFS configuration is in conf/hadoop-defaults.xml – Don’t change this file. – Instead modify conf/hadoop-site.xml – Be sure to replicate this file across all nodes in your cluster – Format of entries in this file: property-name property-value

Cluster Configuration Necessary settings: 1.fs.default.name - describes the NameNode – Format: protocol specifier, hostname, port – Example: hdfs://punchbowl.cse.sc.edu: dfs.data.dir – path on the local file system in which the DataNode instance should store its data – Format: pathname – Example: /home/sauron/hdfs/data – Can differ from DataNode to DataNode – Default is /tmp – /tmp is not a good idea in a production system ;-)

Cluster Configuration 3.dfs.name.dir - path on the local FS of the NameNode where the NameNode metadata is stored – Format: pathname – Example: /home/sauron/hdfs/name – Only used by NameNode – Default is /tmp – /tmp is not a good idea in a production system ;-) 4.dfs.replication – default replication factor – Default is 3 – Fewer than 3 will impact availability of data.

Single Node Configuration fs.default.name hdfs://your.server.name.com:9000 dfs.data.dir /home/username/hdfs/data dfs.name.dir /home/username/hdfs/name

Configuration The Master Node needs to know the names of the DataNode machines – Add hostnames to conf/slaves – One fully-qualified hostname per line – (NameNode runs on Master Node) Create Necessary directories mkdir -p $HOME/hdfs/data mkdir -p $HOME/hdfs/name – Note: owner needs read/write access to all directories – Can run under your own name in a single machine cluster – Do not run Hadoop as root. Duh!

Starting HDFS Start by formatting the FS bin/hadoop namenode -format – Only do this once ;-) Start the File System bin/start-dfs.sh – This starts the NameNode server – Script will ssh into each slave to start each DataNode

Working with HDFS Most commands use bin/hadoop General format: bin/hadoop moduleName -cmd args... Where moduleName specifies HDFS functionality Where cmd specifies which command Example from previous slide: bin/hadoop namenode -format

Working with HDFS Listing files: bin/hadoop dfs -ls Note: nothing listed!!! -ls command returns zilch Why? No concept of current working directory With NO arguments, -ls refers to “home directory” in HDFS This is not /home/rose

Working with HDFS Try specifying a directory bin/hadoop dfs -ls / Found 2 items drwxr-xr-x - hduser supergroup :40 /hadoop drwxr-xr-x - hduser supergroup :08 /tmp Note: in this example hduser is the user name under which the hadoop daemons NameNode and DataNode were started.

Loading data First create a directory bin/hadoop dfs -mkdir /user/rose Next upload a file (uploads to /user/rose) bin/hadoop dfs -put InterestingFile.txt /user/rose/ Verify the file is in HDFS bin/hadoop dfs -ls /user/rose

Loading data You can also “put” multiple files Consider a directory: myfiles Contents: file1.txt file2.txt subdir/ Subdirectory: subdir/ Content: anotherFile.txt Let’s “put” all of these file in the HDFS bin/hadoop -put myfiles /user/rose bin/hadoop dfs -ls /user/rose Found 1 items -rw-r--r-- 1 hduser supergroup :29 /user/rose/README.txt bin/hadoop dfs -ls myfiles Found 3 items /user/rose/myfiles/file1.txt :59 rw-r--r-- hduser supergroup /user/rose/myfiles/file2.txt :59 rw-r--r-- hduser supergroup /user/rose/myfiles/subdir :59 rwxr-xr-x hduser supergroup

Loading data There was something new in the directory listing: /user/rose/myfiles/file1.txt :59 rw-r--r-- hduser supergroup Q: What does mean? A: The number of replicas of this file (Note: when I tried this on saluda there was no )

Getting data from HDFS We can use cat to display a file to stdout bin/hadoop dfs -cat /user/rose/README.txt We can “get” files from the Hadoop File System bin/hadoop dfs -get /user/rose/README.txt AAAREADME.txt ls AAAREADME.txthadoop-ant jarivyREADME.txt Binhadoop-client jarivy.xmlsbin build.xmlhadoop-core jarlibshare c++hadoop-examples jarlibexecsrc CHANGES.txthadoop-minicluster jarLICENSE.txtwebapps Confhadoop-test jarlogs Contribhadoop-tools jarNOTICE.txt

Shutting down HDFS bin/stop-dfs.sh Response: stopping namenode localhost: stopping datanode localhost: stopping secondarynamenode

Other commands Hadoop will list all commands that can be run with the FS Shell with the command: bin/hadoop dfs