Jian Wang. Based on "Meet Hadoop! Open Source Grid Computing" by Devaraj Das, Yahoo! Inc. Bangalore & Apache Software Foundation
Need to process 10TB datasets
◦ On 1 node at 50MB/s: ~2.3 days (10 TB / 50 MB/s ≈ 200,000 s)
◦ On a 1000-node cluster at 50MB/s per node: ~3.3 min (≈ 200 s)
Need an efficient, reliable, and usable framework
◦ Google File System (GFS) paper
◦ Google's MapReduce paper
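As a quick sanity check on these numbers, a back-of-the-envelope calculation (a plain Python sketch, not from the original slides):

# Scan time for a 10 TB dataset at 50 MB/s per node.
dataset_bytes = 10 * 10**12      # 10 TB
rate_bps = 50 * 10**6            # 50 MB/s per node

one_node_s = dataset_bytes / rate_bps     # 200,000 s
print(one_node_s / 86400.0)               # ~2.3 days on a single node
print(one_node_s / 1000 / 60.0)           # ~3.3 min spread over 1000 nodes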
Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB by default) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ The default replication factor is 3 (configurable)
◦ It cannot be directly mounted by an existing operating system
Once you use the DFS (put something in it), relative paths are resolved against /user/{your user id}; e.g. if your id is jwang30, your "home dir" is /user/jwang30
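For example (data.txt is a made-up file name for illustration; the dfs commands are the ones listed later in this tutorial):
bin/hadoop dfs -put data.txt data.txt      (stored as /user/jwang30/data.txt)
bin/hadoop dfs -ls /user/jwang30           (same listing as bin/hadoop dfs -ls)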
Master-Slave Architecture
Master (irkm-1) runs the HDFS "Namenode" and the MapReduce "Jobtracker"
◦ The Jobtracker accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and Tasktracker status, re-executes tasks upon failure
Slaves (irkm-1 to irkm-6) run HDFS "Datanodes" and MapReduce "Tasktrackers"
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output
Hadoop is locally "installed" on each machine
◦ Version
◦ Install location is /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
If this is the first time you are using it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format
Most commands follow the same pattern:
◦ bin/hadoop "some command" options
◦ If you just type bin/hadoop you get a list of all possible commands (including undocumented ones)
hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]
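A few examples of these commands in use (file and directory names are made up for illustration):
bin/hadoop dfs -mkdir mydir                    (creates /user/<your id>/mydir)
bin/hadoop dfs -put local.txt mydir            (copy a local file into HDFS)
bin/hadoop dfs -cat mydir/local.txt            (print its contents)
bin/hadoop dfs -get mydir/local.txt copy.txt   (copy it back to the local FS)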
bin/start-all.sh – starts the master node and all slave nodes
bin/stop-all.sh – stops the master node and all slave nodes
Run jps to check the status
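If startup succeeded, jps should list the Hadoop daemons: on the master you would typically expect NameNode and JobTracker (plus DataNode and TaskTracker where the master doubles as a slave, as irkm-1 does here), and on a slave node DataNode and TaskTracker.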
Log in to irkm-1
rm -fr /tmp/hadoop-$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
After that: bin/hadoop dfs -ls
Mapper.py
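The slide's source is not reproduced here; the following is a minimal sketch of the standard Hadoop Streaming word-count mapper (the exact code on the slide may differ). It reads text from stdin and emits a tab-separated (word, 1) pair per word:

#!/usr/bin/env python
# mapper.py - word-count mapper for Hadoop Streaming (sketch).
# Reads lines from stdin and emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))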
Reducer.py
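Likewise for the reducer, a minimal sketch of the standard streaming word-count reducer; Hadoop delivers the mapper output sorted by key, so all counts for a given word arrive as consecutive lines:

#!/usr/bin/env python
# reducer.py - word-count reducer for Hadoop Streaming (sketch).
# Input lines are "word<TAB>count", sorted by word, so we can sum
# the counts of consecutive identical words.
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print('%s\t%s' % (current_word, current_count))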
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
  -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py \
  -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py \
  -input example -output java-output
bin/hadoop dfs -cat java-output/part
bin/hadoop dfs -copyToLocal java-output/part java-output-local
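Because streaming jobs just read stdin and write stdout, the mapper/reducer pair sketched above can be tested locally with a plain Unix pipeline before submitting the job:

echo "foo foo bar" | python mapper.py | sort | python reducer.py
(expected output: bar 1, then foo 2)

Note that the job writes its results as part files (part-00000, part-00001, ...) inside the java-output directory, which is why the -cat and -copyToLocal paths above refer to a part file.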
Hadoop job tracker
◦ Hadoop task tracker
◦ Hadoop dfs checker
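Assuming Hadoop's default ports (an assumption; the slide gives no URLs), each of these has a built-in web status page: the JobTracker at http://irkm-1:50030/, each TaskTracker on port 50060, and the Namenode's DFS health page at http://irkm-1:50070/. The DFS can also be checked from the command line with bin/hadoop fsck /.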