HDFS - Hadoop Overview 2 - 2009.01.20 유현정


HDFS - Hadoop Overview 유현정

Data Replication
All blocks in a file except the last block are the same size.
The block size and replication factor are configurable per file.
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
–DataNodes send Heartbeats to the NameNode. The NameNode uses Heartbeats to detect DataNode failure.
–Each DataNode periodically sends a report of all its existing blocks (a Blockreport) to the NameNode.
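
Heartbeat-based failure detection can be sketched as follows. This is a minimal illustration, not the NameNode's actual code: the `HeartbeatMonitor` class, the node names, and the 30-second timeout are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the time of the last Heartbeat per DataNode (illustrative only)."""
    timeout_s: float  # hypothetical threshold, not a value from the slides
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, datanode: str, now: float) -> None:
        # Record that this DataNode was alive at time `now`.
        self.last_seen[datanode] = now

    def dead_nodes(self, now: float) -> list:
        # A DataNode is considered dead once it has been silent longer than the timeout.
        return sorted(dn for dn, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=40.0)
print(monitor.dead_nodes(now=45.0))  # dn2 has been silent for 45 seconds
```

Passing `now` explicitly instead of reading the system clock keeps the sketch deterministic.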

Replica Placement
For the common case, replication factor == 3
–One replica on one node in the local rack
–Another on a different node in the local rack
–The last on a different node in a different rack
–If replication factor > 3, additional replicas are randomly placed
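
The placement rule above might look like this in code. It is a sketch under stated assumptions: the `topology` layout (rack name to node list), the function name `place_replicas`, and the choice of the writer's own node as the first replica are all hypothetical, and the local rack is assumed to contain at least two nodes.

```python
import random


def place_replicas(topology, writer_node, replication=3, rng=None):
    """Pick target nodes for one block.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    Assumes the local rack has at least two nodes and a second rack exists.
    """
    if rng is None:
        rng = random.Random()
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)

    chosen = [writer_node]  # replica 1: a node in the local rack
    if replication >= 2:
        # Replica 2: a different node in the same (local) rack.
        same_rack = [n for n in topology[local_rack] if n not in chosen]
        chosen.append(rng.choice(same_rack))
    if replication >= 3:
        # Replica 3: a node in a different rack.
        remote = [n for r, nodes in topology.items() if r != local_rack for n in nodes]
        chosen.append(rng.choice(remote))

    # Any further replicas are placed at random on nodes not yet used.
    remaining = [n for nodes in topology.values() for n in nodes if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[: max(0, replication - len(chosen))])
    return chosen


topology = {"rackA": ["a1", "a2", "a3"], "rackB": ["b1", "b2"]}
print(place_replicas(topology, "a1", rng=random.Random(0)))
```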

Replica Placement
The policy does not impact data reliability and availability guarantees.
However, it does reduce the aggregate network bandwidth used when reading data (because the data is stored on 2 racks rather than 3).
The replicas of a file are not evenly distributed across the racks.
This policy is a work in progress.

Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader.
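
A minimal sketch of "closest replica" selection follows. The distance values (0 for the same node, 2 for the same rack, 4 for a different rack) and the `(rack, node)` tuple representation are assumptions made for illustration; the slides do not define a distance metric.

```python
def distance(reader, replica):
    """Network distance between two (rack, node) locations (assumed metric)."""
    if reader == replica:
        return 0  # same node
    if reader[0] == replica[0]:
        return 2  # same rack, different node
    return 4      # different rack


def closest_replica(reader, replicas):
    # Serve the read from the replica with the smallest distance to the reader.
    return min(replicas, key=lambda r: distance(reader, r))


reader = ("rackA", "a1")
replicas = [("rackB", "b1"), ("rackA", "a2"), ("rackB", "b2")]
print(closest_replica(reader, replicas))  # the same-rack replica wins
```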

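SafeMode
On startup, the NameNode enters SafeMode.
Replication of data blocks does not occur while the NameNode is in SafeMode.
The NameNode exits SafeMode after checking the percentage of safely replicated data blocks.
It then checks the list of data blocks that have fewer replicas than the specified replication factor.
The NameNode replicates these blocks to other DataNodes.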
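
The SafeMode exit check can be sketched as a simple threshold test. The function names, the `min_replicas` parameter, and the default threshold of 0.999 are illustrative assumptions; the slides only say that a percentage of safely replicated blocks is checked.

```python
def safely_replicated_fraction(block_replicas, min_replicas=1):
    """block_replicas: dict block id -> number of replicas reported so far."""
    if not block_replicas:
        return 1.0
    safe = sum(1 for n in block_replicas.values() if n >= min_replicas)
    return safe / len(block_replicas)


def can_leave_safemode(block_replicas, threshold=0.999, min_replicas=1):
    # Leave SafeMode only once enough blocks have the minimum number of replicas.
    return safely_replicated_fraction(block_replicas, min_replicas) >= threshold


reported = {"blk_1": 3, "blk_2": 1, "blk_3": 0}
print(can_leave_safemode(reported))                 # blk_3 has no replica yet
print(can_leave_safemode(reported, threshold=0.6))  # 2 of 3 blocks are safe
```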

NameNode Meta-data
The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata.
–E.g., creating a file, deleting a file, or changing the replication factor of a file
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
The EditLog and the FsImage are stored as files in the NameNode’s local file system.

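Checkpoint
When the NameNode starts up,
–The NameNode reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the FsImage, and saves the new version of the FsImage to disk
–It then discards the EditLog transactions, because they have already been applied to the FsImage
Currently, a checkpoint occurs only when the NameNode starts up.
Support for periodic checkpointing is being implemented.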
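
The checkpoint step amounts to replaying a log over a snapshot. The sketch below assumes a toy in-memory format: the FsImage is a dict of path to metadata, and EditLog entries are tuples such as `("create", path, replication)`. These formats are hypothetical and are not the real on-disk layouts.

```python
def apply_edit(fsimage, op):
    """Apply one EditLog transaction to an in-memory FsImage (toy format)."""
    kind, path = op[0], op[1]
    if kind == "create":
        fsimage[path] = {"replication": op[2]}
    elif kind == "delete":
        fsimage.pop(path, None)
    elif kind == "set_replication":
        fsimage[path]["replication"] = op[2]
    else:
        raise ValueError(f"unknown transaction type: {kind}")


def checkpoint(fsimage, editlog):
    # Copy the old image, replay every transaction, then return the new image
    # together with an empty EditLog (the replayed entries are discarded).
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for op in editlog:
        apply_edit(new_image, op)
    return new_image, []


image = {"/a": {"replication": 3}}
log = [("create", "/b", 3), ("set_replication", "/a", 2), ("delete", "/b")]
new_image, new_log = checkpoint(image, log)
print(new_image, new_log)
```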

The Communication Protocol
Layered on top of the TCP/IP protocol
Client Protocol: client ↔ NameNode
DataNode Protocol: DataNodes ↔ NameNode
A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol.
–The NameNode never initiates any RPCs
–The NameNode only responds to RPC requests issued by DataNodes or clients

Robustness
The three common types of failure
–NameNode failures
–DataNode failures
–Network partitions

Data Disk Failure
A network partition can cause a subset of DataNodes to lose connectivity with the NameNode.
–The NameNode detects this condition by the absence of a Heartbeat message
Re-replication may become necessary for several reasons:
–A DataNode may become unavailable (marked as dead)
–A replica may become corrupted
–A hard disk on a DataNode may fail
–The replication factor of a file may be increased
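
Each of the reasons listed above reduces to the same check: a block has fewer live replicas than its target. A minimal sketch, with hypothetical names and data structures (block id to set of holding DataNodes):

```python
def blocks_needing_replication(block_locations, live_nodes, replication_factor):
    """Return {block id: number of missing replicas}.

    block_locations: dict block id -> set of DataNodes holding a valid replica.
    live_nodes: set of DataNodes still sending Heartbeats.
    replication_factor: dict block id -> target number of replicas.
    """
    needed = {}
    for block, holders in block_locations.items():
        live = len(holders & live_nodes)  # replicas on dead nodes do not count
        target = replication_factor[block]
        if live < target:
            needed[block] = target - live
    return needed


locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
live = {"dn2", "dn3", "dn4"}  # dn1 stopped sending Heartbeats
print(blocks_needing_replication(locations, live, {"blk_1": 3, "blk_2": 3}))
```

A corrupted replica or a failed disk would be modeled by removing the DataNode from that block's holder set; a raised replication factor is modeled by a larger target.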

NameNode Failure
The NameNode is a single point of failure.
Currently, automatic restart and failover of the NameNode software to another machine are not supported.

Data Correctness/Integrity
Use checksums to validate data
–Use CRC32
The DataNode stores the checksum.
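
CRC32 validation can be illustrated with Python's standard `zlib.crc32`. This is a simplification: it checksums a whole block at once, and the helper names `store_block` and `verify_block` are hypothetical.

```python
import zlib


def store_block(data: bytes):
    # Compute the checksum when the block is written and keep it with the data.
    return data, zlib.crc32(data)


def verify_block(data: bytes, expected_crc: int) -> bool:
    # Recompute the checksum on read and compare it with the stored value.
    return zlib.crc32(data) == expected_crc


data, crc = store_block(b"hello hdfs")
print(verify_block(data, crc))           # unmodified data passes
print(verify_block(b"hellO hdfs", crc))  # a corrupted byte is detected
```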

Snapshots
A feature that stores a copy of the data as of a particular point in time.
Currently not supported.

Replication Pipelining
Each DataNode receives data from the previous DataNode in the pipeline and, at the same time, forwards it to the next DataNode in the pipeline.
The data is pipelined from one DataNode to the next.
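
The store-and-forward behavior can be modeled with chained generators. This is a single-threaded sketch, so it shows the data flow (each packet moves through every DataNode before the next packet is read) rather than true concurrent transfer; the function names and the packet granularity are assumptions.

```python
def datanode(name, incoming, storage):
    """Store each packet locally, then pass it on to the next stage."""
    for packet in incoming:
        storage[name].append(packet)
        yield packet


def write_pipeline(packets, nodes):
    storage = {n: [] for n in nodes}
    stream = iter(packets)
    # Chain the DataNodes: the output of one is the input of the next.
    for n in nodes:
        stream = datanode(n, stream, storage)
    for _ in stream:  # drain the last stage to drive the whole pipeline
        pass
    return storage


result = write_pipeline([b"p1", b"p2"], ["dn1", "dn2", "dn3"])
print(result)
```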

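File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS.
–The file is first renamed into the /trash directory
–While it remains in /trash, it can be restored
–After a set time, the NameNode deletes the file from the namespace
The file and its associated blocks are then freed.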

File Deletes and Undeletes
The /trash directory holds the most recent copy of each deleted file.
As long as a file remains in /trash, its deletion can be undone.
Current default policy:
–Files that have been in /trash for 6 hours or more are deleted
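
The delete, undelete, and expiry lifecycle can be sketched as below. The `Namespace` class and its methods are hypothetical, and the trash is modeled as a dict instead of a real /trash directory; only the 6-hour retention comes from the slides.

```python
TRASH_RETENTION_S = 6 * 3600  # 6-hour default policy stated in the slides


class Namespace:
    """Toy namespace with a trash area (illustrative only)."""

    def __init__(self):
        self.files = {}  # path -> file contents
        self.trash = {}  # original path -> (contents, time of deletion)

    def delete(self, path, now):
        # Deleting moves the file into the trash instead of removing it.
        self.trash[path] = (self.files.pop(path), now)

    def undelete(self, path):
        # A file still in the trash can be restored to its original path.
        data, _ = self.trash.pop(path)
        self.files[path] = data

    def expire(self, now):
        # Permanently remove entries that have been in the trash for 6 hours or more.
        expired = [p for p, (_, t) in self.trash.items() if now - t >= TRASH_RETENTION_S]
        for p in expired:
            del self.trash[p]
        return expired


ns = Namespace()
ns.files["/a"] = b"x"
ns.delete("/a", now=0)
print(ns.expire(now=5 * 3600))  # still within the retention window
print(ns.expire(now=6 * 3600))  # retention reached, the file is purged
```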