Hadoop & HDFS: Introduction, Architecture, and the Hadoop Distributed File System (Architecture of HDFS, NameNode, DataNode, HDFS Client, Replica Management)

Presentation transcript:

1 Hadoop & HDFS

2 OUTLINE
Introduction
Architecture
Hadoop Distributed File System
– Architecture of HDFS: NameNode, DataNode, HDFS Client
– Replica Management

4 What is Hadoop?

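5 Origins of Hadoop (2002–2004)
Founder: Doug Cutting
Lucene
– A high-performance document indexing engine API written in Java
– Indexes every word in a document, making search far more efficient than traditional word-by-word comparison
Nutch
– An open-source web search engine
– Built on the Lucene library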

6 Hadoop's Turning Point
Nutch hit a bottleneck when processing large volumes of web data.
Google published three key technologies:
– SOSP 2003: “The Google File System”
– OSDI 2004: “MapReduce: Simplified Data Processing on Large Clusters”
– OSDI 2006: “Bigtable: A Distributed Storage System for Structured Data”

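7 Origins of Hadoop (2004–present)
Following the techniques published by Google, Nutch implemented in turn:
– A distributed file system, the Nutch Distributed File System (NDFS)
– MapReduce
In 2006, the distributed computing part of Nutch was split off into a separate project named Hadoop.
NDFS was renamed the Hadoop Distributed File System (HDFS).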

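8 Features of Hadoop
– When the data have no interdependencies, they can be processed efficiently in parallel.
– Automatically maintained data replicas provide fault tolerance, so the system can recover automatically when an error occurs.
– Provides reliable data storage and analytical processing.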

9 Linux

10 Hadoop Architecture (1/3)
The Hadoop project includes several related subprojects: Hadoop Core, HDFS, MapReduce, HBase, Pig, Chukwa, Hive, Avro, and ZooKeeper.

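11 Hadoop Architecture (2/3)
– Hadoop Core: the core, containing key components and interfaces for distributed file systems and general I/O.
– Avro: an efficient, cross-language data serialization system for RPC.
– MapReduce: a distributed data processing model and execution environment.
– HDFS: a distributed file system.
– Pig: a data flow language and execution environment for processing large data sets.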

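12 Hadoop Architecture (3/3)
– HBase: a distributed, column-oriented database system.
– ZooKeeper: a distributed coordination service that provides primitives for building distributed applications.
– Hive: a distributed data warehouse system that manages data stored in HDFS and provides a SQL-based query language.
– Chukwa: a distributed data collection and analysis system.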

13 Google References
The Google File System [2003], MapReduce [2004], Bigtable [2006]
Google | Hadoop
Google File System | HDFS
MapReduce | MapReduce Framework
Bigtable | HBase

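14 Differences Between the Hadoop and Google Architectures
Item | Google | Hadoop
Development team | Google | Apache
Sponsors | Google | Yahoo, Amazon
Resources | open documents | open source
Operating system | Linux | Linux / GPL
Search engine | Google | Nutch
Programming model | MapReduce | Hadoop MapReduce
File system | GFS | HDFS
Database system | Bigtable | HBase
Domain-specific languages | Sawzall | Hive, Pig
Coordination service | Chubby | ZooKeeper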

17 Architecture of HDFS
[Figure: an HDFS client and a cluster made up of a NameNode (NN) and DataNodes (DN).]

18 File Storing
[Figure: a 100 MB file is split into a 64 MB block and a 36 MB block; the blocks are staged as temporary blocks and then stored as block copies on DataNodes (DN).]
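The splitting shown on this slide can be sketched in a few lines. This is only an illustration of the arithmetic, assuming the 64 MB block size from the figure; the function name `split_into_blocks` is hypothetical and not part of any Hadoop API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 64 * MB) -> list[int]:
    """Return the sizes (in bytes) of the blocks needed to hold a file.

    Hypothetical helper for illustration; 64 MB is the block size on the slide.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data it holds.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(100 * MB)
print([size // MB for size in blocks])  # [64, 36]
```

The 100 MB file from the slide yields one full 64 MB block and one 36 MB block, matching the figure.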

20 Responsibilities of NameNode
– Maintaining the namespace tree and the mapping of file blocks to DataNodes
– Replica management

21 Namespace
– Files and directories are represented by inodes.
– The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image.
– The persistent record of the image is called a checkpoint.
– The modification log of the image is called the journal.

22 Namespace Storing
– The NameNode keeps the image in RAM.
– The checkpoint and journal are stored in the local host’s native file system.

23 Checkpoint & Journal
[Figure: the journal and the checkpoint.]

24 NameNode’s Version

25 Protecting the Critical Information
If either the checkpoint or the journal is missing or corrupt, the namespace will be lost partly or entirely.
– Store the checkpoint and journal in multiple storage directories and on an NFS server.
– Create periodic checkpoints with either a CheckpointNode or a BackupNode, and store the checkpoint on it.

26 CheckpointNode Options
– Downloads the checkpoint and journal from the NameNode
– Combines the checkpoint and the journal to create a new checkpoint and an empty journal
– Returns the new checkpoint to the NameNode
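Combining a checkpoint with a journal amounts to replaying the logged modifications on top of the saved image. The sketch below is a toy model under stated assumptions: the image is a plain dict from path to block list, and the journal record format `(operation, path, blocks)` is invented for illustration. Real HDFS uses its own on-disk formats.

```python
import copy


def apply_journal(checkpoint: dict, journal: list) -> dict:
    """Replay journal records on a copy of the checkpoint; return the new image.

    The (op, path, blocks) record format is hypothetical.
    """
    image = copy.deepcopy(checkpoint)
    for op, path, blocks in journal:
        if op == "create":
            image[path] = list(blocks)
        elif op == "append":
            image[path].extend(blocks)
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown journal operation: {op}")
    return image


def create_checkpoint(checkpoint: dict, journal: list) -> tuple:
    """Return a new checkpoint together with an empty journal."""
    return apply_journal(checkpoint, journal), []


old_checkpoint = {"/logs/a.txt": ["blk_1"]}
journal = [
    ("create", "/logs/b.txt", ["blk_2"]),
    ("append", "/logs/a.txt", ["blk_3"]),
    ("delete", "/logs/b.txt", []),
]
new_checkpoint, new_journal = create_checkpoint(old_checkpoint, journal)
print(new_checkpoint)  # {'/logs/a.txt': ['blk_1', 'blk_3']}
print(new_journal)     # []
```

Because the replay works on a copy, the old checkpoint stays intact until the new one has been produced.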

27 BackupNode
The BackupNode works like a CheckpointNode, but in addition maintains an image in memory.

29 Responsibilities of Each DataNode
– Storing blocks and their metadata
– Sending block reports and heartbeats to the NameNode

30 Blocks & Metadata

31 DataNode’s Version

32 Verification Log

33 Block Report
– Sent once an hour
– Contains the block ID, the generation stamp, and the size of each block
– Provides important information for replica management

34 Heartbeats
– Sent once every three seconds
– Confirm that the block replicas are available
– Contain the total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress
– The NameNode controls the DataNode through its replies to the heartbeats
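The heartbeat exchange can be modeled as follows. This is a minimal sketch, not the HDFS protocol: the class names, the command strings, and the `dead_after_s` timeout are assumptions made for illustration. Only the three-second interval and the heartbeat fields come from the slide.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 3.0  # from the slide: once every three seconds


@dataclass
class Heartbeat:
    node: str
    capacity_bytes: int          # total storage capacity
    fraction_used: float         # fraction of storage in use
    transfers_in_progress: int   # data transfers currently in progress


class HeartbeatMonitor:
    """Toy NameNode-side bookkeeping (hypothetical, for illustration only)."""

    def __init__(self, dead_after_s: float) -> None:
        self.dead_after_s = dead_after_s  # assumed timeout, not from the slide
        self.last_seen: dict[str, float] = {}
        self.pending: dict[str, list[str]] = {}

    def queue_command(self, node: str, command: str) -> None:
        self.pending.setdefault(node, []).append(command)

    def receive(self, hb: Heartbeat, now: float) -> list[str]:
        """Record the heartbeat and reply with any queued commands."""
        self.last_seen[hb.node] = now
        return self.pending.pop(hb.node, [])

    def live_nodes(self, now: float) -> list[str]:
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.dead_after_s
        )


monitor = HeartbeatMonitor(dead_after_s=10.0)
monitor.queue_command("N00", "replicate blk_1 to N11")
reply = monitor.receive(Heartbeat("N00", 10**12, 0.50, 0), now=0.0)
monitor.receive(Heartbeat("N01", 10**12, 0.51, 2), now=9.0)
print(reply)                         # ['replicate blk_1 to N11']
print(monitor.live_nodes(now=12.0))  # ['N01']
```

Commands travel only in replies, so the NameNode never has to open a connection to a DataNode in this model.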

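36 Block Writing
[Figure: the HDFS client sends a request to the NameNode (NN), receives a list of DataNodes (DN), and writes the block to DataNodes in the cluster.]

37 Writing a Block

38 File Appending
[Figure: a file consisting of existing file data and appended data, with a client and arrows labeled Write and Read.]

39 Block Reading
[Figure: the HDFS client sends a request to the NameNode (NN), receives a list of DataNodes (DN), and reads the block from a DataNode in the cluster.]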

41 Topology Example
[Figure: two racks; Rack0 holds nodes N00, N01, N02 and Rack1 holds nodes N10, N11, N12.]

42 Read Example
[Figure: the same topology with a client and block replicas (BR); one replica is marked as the selected replica.]

43 Distance Example 1
[Figure: the same topology with a client, block replicas (BR), and a selected replica, labeled "Distance is 4".]

44 Distance Example 2
[Figure: the same topology with a client, block replicas (BR), and a selected replica, labeled "Distance is 2".]
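The distances of 4 and 2 in these examples are consistent with counting edges in a tree whose root connects the racks and whose leaves are the nodes. The sketch below assumes that model and assumes that the reader picks the replica with the smallest distance; the path notation `/Rack0/N00` and both function names are hypothetical.

```python
def distance(a: str, b: str) -> int:
    """Number of tree edges between two nodes given as '/rack/node' paths."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    # Climb from each node up to the closest common ancestor.
    return (len(path_a) - common) + (len(path_b) - common)


def closest_replica(client: str, replicas: list[str]) -> str:
    """Pick the replica nearest to the client (assumed selection rule)."""
    return min(replicas, key=lambda replica: distance(client, replica))


print(distance("/Rack0/N00", "/Rack1/N10"))  # 4: different racks
print(distance("/Rack0/N00", "/Rack0/N02"))  # 2: same rack
print(distance("/Rack0/N00", "/Rack0/N00"))  # 0: same node
print(closest_replica("/Rack0/N00", ["/Rack1/N10", "/Rack0/N02", "/Rack1/N12"]))
# /Rack0/N02
```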

45 Block Placement
[Figure: the same topology (Rack0: N00, N01, N02; Rack1: N10, N11, N12) with a client and the block replicas (BR) placed on nodes.]

46 At most one replica of a block on any one node

47 At most two replicas of a block in the same rack, if the number of nodes is twice the number of racks
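The two placement rules can be checked mechanically for a proposed set of replica locations. This is a minimal sketch assuming each location is a `(rack, node)` pair; the function name and the message wording are hypothetical.

```python
from collections import Counter


def placement_violations(replicas: list[tuple[str, str]]) -> list[str]:
    """Return the placement rules broken by the replicas of a single block.

    Rules from the slides: at most one replica per node,
    at most two replicas per rack.
    """
    problems = []
    per_node = Counter(replicas)
    per_rack = Counter(rack for rack, _node in replicas)
    for (_rack, node), count in sorted(per_node.items()):
        if count > 1:
            problems.append(f"{node} holds {count} replicas")
    for rack, count in sorted(per_rack.items()):
        if count > 2:
            problems.append(f"{rack} holds {count} replicas")
    return problems


good = [("Rack0", "N00"), ("Rack1", "N10"), ("Rack1", "N11")]
bad = [("Rack0", "N00"), ("Rack0", "N00"), ("Rack0", "N01")]
print(placement_violations(good))  # []
print(placement_violations(bad))   # ['N00 holds 2 replicas', 'Rack0 holds 3 replicas']
```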

48 Replication Management
– Over-replicated blocks
– Under-replicated blocks

49 Over-Replicated
[Figure: the same topology with block replicas (BR); two nodes are labeled with disk space utilization of 50% and 51%.]

50 Under-Replicated
[Figure: the same topology with block replicas (BR).]

51 Under-Replicated
[Figure: the same topology with block replicas (BR).]
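The two cases can be sketched as a comparison against a target replica count. The target of 3 is an assumption for illustration. The removal rule, dropping the replica on the fullest node, is one reading of the utilization labels in the over-replicated figure rather than something the slides state, and the node names attached to the 50% and 51% values are hypothetical.

```python
def replication_state(live_replicas: int, target: int = 3) -> str:
    """Classify a block by comparing its replica count with the target."""
    if live_replicas > target:
        return "over-replicated"
    if live_replicas < target:
        return "under-replicated"
    return "ok"


def replica_to_remove(utilization: dict[str, int]) -> str:
    """Pick the replica holder with the highest disk utilization (assumed rule)."""
    return max(utilization, key=utilization.get)


print(replication_state(4))  # over-replicated
print(replication_state(2))  # under-replicated
print(replica_to_remove({"N00": 50, "N10": 51}))  # N10
```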

52 Block Scanner
Verifies the blocks.

53 Balancer
[Figure: the same topology with block replicas (BR); per-node disk space utilization values of 52%, 51%, 50%, 62%, 40%, and 51%; cluster utilization 51%; threshold value 10%.]
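One way to read the numbers on this slide: a node counts as unbalanced when its utilization differs from the cluster utilization by more than the threshold. The sketch below assumes that rule. The assignment of the six utilization values to node names is hypothetical, since the transcript does not preserve which value belongs to which node.

```python
def unbalanced_nodes(
    utilization: dict[str, int], cluster_utilization: int, threshold: int
) -> dict[str, int]:
    """Return nodes whose utilization (in percent) is outside the threshold band."""
    return {
        node: used
        for node, used in utilization.items()
        if abs(used - cluster_utilization) > threshold
    }


# Values from the slide; the node-to-value mapping is an assumption.
utilization = {"N00": 52, "N01": 51, "N02": 50, "N10": 62, "N11": 40, "N12": 51}
print(unbalanced_nodes(utilization, cluster_utilization=51, threshold=10))
# {'N10': 62, 'N11': 40}
```

With a 10% threshold around 51%, only the 62% and 40% nodes fall outside the band, so under this reading the balancer would try to move blocks from the former toward the latter.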

54 Key Requirement
[Figure: the same topology with block replicas (BR); per-node disk space utilization values of 62%, 52%, 51%, 40%, 51%, and 50%; cluster utilization 51%; threshold value 10%; labeled "NO BLOCK CAN BE MOVED".]

