1
15-440: Hadoop Distributed File System
Allison Naaktgeboren
[Lolcat image: “Wut u mean? I iz loadin a HA-doop fileh. Ur doin' it rong kitteh”]
2
Announcements
- Go vote!
- Interpretive dances happen only after lecture
- Office hour change: Mon 6:30-9:30, Tues 6-7:30
- Exams are graded
3
Hadoop Core at 30,000 ft
4
Back to the MapReduce Model
Recall that:
- map (in_key, in_value) -> (inter_key, inter_value) list
- combine (inter_key, inter_value list) -> (inter_key, inter_value)
- reduce (inter_key, inter_value list) -> (out_key, out_value)
What resource are we most constrained by? “Oceans of Data, Skinny pipes”
How many types of data will the file system care about? How long will we need each kind? What is the common case for each?
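Below is a minimal plain-Java sketch of these three signatures, using word count as the running example. It deliberately avoids the real Hadoop API, so all names are illustrative; the combiner is modeled as the same fold as the reducer, which is the common case.

```java
import java.util.*;

// Plain-Java simulation of the map/combine/reduce signatures above,
// using word count as the example (no Hadoop dependency).
public class MapReduceSketch {
    // map(in_key, in_value) -> list of (inter_key, inter_value)
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.split("\\s+"))
            out.add(Map.entry(word, 1));
        return out;
    }

    // combine/reduce(inter_key, inter_value list) -> (inter_key, inter_value).
    // The combiner runs this same fold locally on each mapper's output,
    // shrinking the data before it crosses the skinny pipes.
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return Map.entry(word, sum);
    }

    public static void main(String[] args) {
        // Shuffle: group intermediate pairs by key, then reduce each group.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : map("doc1", "the cat sat on the mat"))
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            System.out.println(reduce(g.getKey(), g.getValue()));
    }
}
```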
6
What would a MR Filesystem need?
General use case: large files
- Mostly appends to the end, long sequential reads, few deletes
- Appends might be concurrent
Scalability
- Adding (or losing) machines should be relatively painless
- Nodes work on nearby data: minimize moving data between machines
- Bandwidth is our limiting resource; remember how much data there is
Failure (handling) is common
- Yeah, yeah, we know, we took 213, we know hardware sucks
- No, really: failure (handling) is common (constant)
- Disks, processors, whole nodes, racks, and datacenters
7
Addressing Those Concerns
- Sequential reads and appends need to be fast; deletes can be painful
- “Hot plug” machines
  - Add or lose machines while the system is running jobs
  - The system should auto-detect the change
- HDFS should distribute data somewhat evenly
  - So that all workers have a reasonable amount of data to chew on
  - While coordinating with the JobTracker (job master)
- Data replication should be spread out. Why? What type of problems could arise?
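One answer to the "spread out" question is rack-awareness. The toy sketch below captures the spirit of HDFS's default placement (first replica on the writer's node, second on a node in a different rack to survive a rack failure, third on another node in that same remote rack to save cross-rack bandwidth); the class and method names are hypothetical, not real HDFS code.

```java
import java.util.*;

// Toy rack-aware replica placement, in the spirit of HDFS's default policy.
public class PlacementSketch {
    record Node(String name, String rack) {}

    static List<Node> chooseReplicas(Node writer, List<Node> cluster) {
        List<Node> chosen = new ArrayList<>(List.of(writer)); // replica 1: writer's node
        // Replica 2: any node on a different rack than the writer.
        Node remote = cluster.stream()
                .filter(n -> !n.rack().equals(writer.rack()))
                .findFirst().orElseThrow();
        chosen.add(remote);
        // Replica 3: a different node on that same remote rack.
        Node peer = cluster.stream()
                .filter(n -> n.rack().equals(remote.rack()) && !n.equals(remote))
                .findFirst().orElseThrow();
        chosen.add(peer);
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("a1", "rackA"), new Node("a2", "rackA"),
                new Node("b1", "rackB"), new Node("b2", "rackB"));
        System.out.println(chooseReplicas(cluster.get(0), cluster));
    }
}
```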
8
Moving into the Details
Nodes in HDFS:
- NameNode (master), like the GFS master
- DataNodes (slaves), like GFS chunkservers
NB: Hadoop and HDFS are closely paired
- “Careful use of jargon defines the true expert”
- “Worker node A” and “data node 1” are frequently the same machine
Two types of masters:
- JobTracker (Hadoop job master)
- NameNode (file system master), which is what I mean by 'master' for the rest of the lecture
9
Your Data Goes In...
- Files are divided into chunks of 64 MB
- The mapping between filename and chunks goes to the master
- Each chunk is replicated and sent off to DataNodes
  - By default, 3 replicas
  - The master determines which DataNodes receive them
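As a concrete picture of the division step, here is a small sketch that carves a file length into 64 MB chunk descriptors; chunkify and its layout are illustrative, not HDFS internals.

```java
// Minimal sketch of carving a file into fixed-size chunks before
// replication; 64 MB matches the default chunk size named above.
public class ChunkSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Returns (offset, length) pairs, one per chunk; only the last
    // chunk may be shorter than 64 MB.
    static long[][] chunkify(long fileSize) {
        int n = (int) ((fileSize + CHUNK_SIZE - 1) / CHUNK_SIZE); // ceiling division
        long[][] chunks = new long[n][2];
        for (int i = 0; i < n; i++) {
            chunks[i][0] = i * CHUNK_SIZE;
            chunks[i][1] = Math.min(CHUNK_SIZE, fileSize - chunks[i][0]);
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (long[] c : chunkify(200L * 1024 * 1024)) // a 200 MB file -> 4 chunks
            System.out.println("offset=" + c[0] + " len=" + c[1]);
    }
}
```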
10
What the Clients Do
Where the data starts.
- On file creation, creates a separate file with a checksum
  - When data is fetched back from a DataNode, the checksum is computed again
- Caches file data
  - Avoids bothering the master too often
- When a client has one chunk's worth of data
  - Contacts the master; the master sends back the names of the DataNodes to send it to
  - ONLY sends it to the first one
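A minimal sketch of that checksum round trip, using java.util.zip.CRC32 as a stand-in for whatever checksum the real client stores in that separate file:

```java
import java.util.zip.CRC32;

// Compute a checksum when a chunk is written, recompute on read, and
// compare to detect corruption. CRC32 stands in for the real mechanism.
public class ChecksumSketch {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "some chunk bytes".getBytes();
        long stored = checksum(chunk);       // written alongside the data

        // ...later, after fetching the chunk back from a DataNode...
        boolean ok = checksum(chunk) == stored;
        System.out.println(ok ? "chunk verified" : "corruption detected");
    }
}
```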
11
What the DataNodes Do
- Heartbeat to the master
- Opens, closes, or replicates a chunk if requested by the master
- During replication, sends the data on to the next DataNode in the chain
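The heartbeat itself is just a periodic liveness report. A minimal sketch, with reportToMaster as a hypothetical stand-in for the real RPC; the 3-second period matches HDFS's classic default, but treat it as an assumption here:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A DataNode's heartbeat loop: report liveness to the master on a fixed period.
public class HeartbeatSketch {
    static void reportToMaster(String nodeId) {
        System.out.println(nodeId + ": heartbeat sent"); // placeholder for the RPC
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> reportToMaster("datanode-1"),
                0, 3, TimeUnit.SECONDS);
        // In a real DataNode this runs for the life of the process; the
        // master marks the node dead if heartbeats stop arriving.
    }
}
```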
12
What the Namespace Node Does
System metadata!
- Holds the name -> ID mapping
- Chunk replica locations
- Transaction log (EditLog) and checkpoint image (FSImage)
It is responsible for coherency
- Uses the logs atomically
- Addresses the concurrent-writes issue
It is checkpointed
- Similar to AFS volume snapshots
- Will pull the last consistent log upon restart
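A minimal sketch of the EditLog/FSImage interplay, assuming in-memory lists in place of the on-disk files: mutations append to the log, a checkpoint folds the log into a fresh image, and recovery is image plus replay.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// EditLog/FSImage sketch: the live namespace is FSImage + EditLog replay.
public class EditLogSketch {
    Map<String, String> fsImage = new HashMap<>(); // state as of last checkpoint
    List<String[]> editLog = new ArrayList<>();    // mutations since checkpoint

    // Every mutation is appended to the EditLog; this is what makes
    // recovery possible after a crash.
    synchronized void create(String name, String chunkId) {
        editLog.add(new String[]{name, chunkId});
    }

    // Checkpoint: fold the EditLog into a fresh FSImage, then truncate it.
    synchronized void checkpoint() {
        for (String[] op : editLog) fsImage.put(op[0], op[1]);
        editLog.clear();
    }

    // Restart: state = FSImage plus replay of surviving EditLog entries.
    Map<String, String> recover() {
        Map<String, String> state = new HashMap<>(fsImage);
        for (String[] op : editLog) state.put(op[0], op[1]);
        return state;
    }

    public static void main(String[] args) {
        EditLogSketch nn = new EditLogSketch();
        nn.create("/logs/a", "chunk-17");
        nn.checkpoint();
        nn.create("/logs/b", "chunk-18"); // not yet checkpointed
        System.out.println(nn.recover()); // both files survive a restart
    }
}
```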
13
What the Namespace Node Does
- Listens for heartbeats; listens for client requests
  - If no heartbeat arrives, marks a node as dead; its data is deregistered
- It selects DataNodes
  - Decides which nodes get which chunks
  - Signals creating, opening, closing
- Deletes
  - Orders a move to /trash
  - Starts a delete timer
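On the receiving side, dead-node detection can be as simple as timestamps plus a sweep. A sketch, with an arbitrary 30-second timeout (not HDFS's real value):

```java
import java.util.HashMap;
import java.util.Map;

// The master remembers each node's last heartbeat time and periodically
// sweeps for nodes that have been silent longer than a timeout.
public class DeadNodeSketch {
    static final long TIMEOUT_MS = 30_000; // illustrative, not HDFS's setting
    final Map<String, Long> lastHeartbeat = new HashMap<>();

    void onHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // Any node silent past the timeout is declared dead; its replicas
    // would then be deregistered and re-replicated elsewhere.
    void sweep() {
        long now = System.currentTimeMillis();
        lastHeartbeat.entrySet().removeIf(e -> {
            boolean dead = now - e.getValue() > TIMEOUT_MS;
            if (dead) System.out.println(e.getKey() + " marked dead");
            return dead;
        });
    }

    public static void main(String[] args) {
        DeadNodeSketch master = new DeadNodeSketch();
        master.onHeartbeat("datanode-1");
        master.sweep(); // datanode-1 is still fresh, so it stays alive
    }
}
```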
14
All Together Now!
15
Additional Resources
- Hadoop wiki
- YouTube → “Hadoop” → Google developer videos (1-3 will be helpful)
- Google University
  - Includes the UW course, the other UW course, and a couple of others
  - Use at your own risk
- “The Google File System” paper is rather readable, as research papers go