1
Jian Wang. Based on "Meet Hadoop! Open Source Grid Computing" by Devaraj Das, Yahoo! Inc. Bangalore & Apache Software Foundation
2
Need to process 10TB datasets
On 1 node: ◦ scanning @ 50MB/s = 2.3 days
On a 1000-node cluster: ◦ scanning @ 50MB/s = 3.3 min
Need an efficient, reliable, and usable framework ◦ Google File System (GFS) paper ◦ Google's MapReduce paper
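The scan-time figures above can be checked with a quick back-of-the-envelope calculation (a sketch assuming decimal units, 1TB = 10^6 MB):

```python
# Sanity check of the slide's scan-time figures (decimal units: 1 TB = 10**6 MB).
dataset_mb = 10 * 10**6            # 10 TB expressed in MB
rate_mb_s = 50                     # per-node sequential scan rate

one_node_days = dataset_mb / rate_mb_s / 86400.0          # seconds -> days
cluster_minutes = dataset_mb / (rate_mb_s * 1000) / 60.0  # 1000 nodes, seconds -> minutes

print(round(one_node_days, 1))     # ~2.3 days on a single node
print(round(cluster_minutes, 1))   # ~3.3 minutes on 1000 nodes
```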
3
Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system ◦ Files are divided into large blocks (64MB by default) and distributed across the cluster ◦ Blocks are replicated to handle hardware failure ◦ The default block replication is 3 (configurable) ◦ HDFS cannot be directly mounted by an existing operating system
Once you use the DFS (put something in it), relative paths are resolved from /user/{your user id}. E.g. if your id is jwang30, your "home dir" is /user/jwang30
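The block and replication accounting above can be illustrated with a small calculation (a sketch; the 1GB file size is a hypothetical example, the 64MB block size and replication factor 3 are the defaults stated on the slide):

```python
# Illustration of HDFS block accounting: a file is split into fixed-size
# blocks, and each block is stored `replication` times across the cluster.
import math

block_mb = 64        # default HDFS block size (per the slide)
replication = 3      # default replication factor (per the slide)

file_mb = 1000       # a hypothetical 1 GB file
blocks = int(math.ceil(file_mb / float(block_mb)))   # blocks the file occupies
stored_mb = file_mb * replication                    # raw cluster space consumed

print(blocks)        # 16 blocks
print(stored_mb)     # 3000 MB of raw storage
```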
4
Master-Slave Architecture
Master (irkm-1) runs the HDFS "Namenode" and the MapReduce "Jobtracker" ◦ Accepts MR jobs submitted by users ◦ Assigns Map and Reduce tasks to Tasktrackers ◦ Monitors task and tasktracker status, and re-executes tasks upon failure
Slaves (irkm-1 to irkm-6) run the HDFS "Datanodes" and the MapReduce "Tasktrackers" ◦ Run Map and Reduce tasks upon instruction from the Jobtracker ◦ Manage storage and transmission of intermediate output
6
Hadoop is locally "installed" on each machine ◦ Version 0.19.2 ◦ The installed location is /home/tmp/hadoop ◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
7
If this is the first time you use it, you need to format the namenode: ◦ log in to irkm-1 ◦ cd /home/tmp/hadoop ◦ bin/hadoop namenode -format
Most commands follow the same pattern ◦ bin/hadoop "some command" options ◦ If you just type bin/hadoop you get a list of all possible commands (including undocumented ones)
8
hadoop dfs ◦ [-ls &lt;path&gt;] ◦ [-du &lt;path&gt;] ◦ [-cp &lt;src&gt; &lt;dst&gt;] ◦ [-rm &lt;path&gt;] ◦ [-put &lt;localsrc&gt; &lt;dst&gt;] ◦ [-copyFromLocal &lt;localsrc&gt; &lt;dst&gt;] ◦ [-moveFromLocal &lt;localsrc&gt; &lt;dst&gt;] ◦ [-get [-crc] &lt;src&gt; &lt;localdst&gt;] ◦ [-cat &lt;src&gt;] ◦ [-copyToLocal [-crc] &lt;src&gt; &lt;localdst&gt;] ◦ [-moveToLocal [-crc] &lt;src&gt; &lt;localdst&gt;] ◦ [-mkdir &lt;path&gt;] ◦ [-touchz &lt;path&gt;] ◦ [-test -[ezd] &lt;path&gt;] ◦ [-stat [format] &lt;path&gt;] ◦ [-help [cmd]]
9
bin/start-all.sh – starts the master node and all slave nodes
bin/stop-all.sh – stops the master node and all slave nodes
Run jps to check that the daemons are running
10
Log in to irkm-1
rm -fr /tmp/hadoop/$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
After that: bin/hadoop dfs -ls
14
Mapper.py
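The mapper code shown on this slide is not preserved in this transcript. A minimal sketch of a Hadoop Streaming word-count mapper, consistent with the wordcount-py.example job run later, might look like this (the `map_line` helper is an illustrative name, not from the original slide):

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop Streaming word-count mapper (reconstruction;
# the original slide's code is not preserved in this transcript).
import sys

def map_line(line):
    """Split one line of input text into (word, 1) pairs."""
    return [(word, 1) for word in line.strip().split()]

if __name__ == "__main__":
    # Streaming mappers read raw text on stdin and emit
    # tab-separated key/value pairs on stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```

Hadoop Streaming then sorts the mapper output by key before handing it to the reducer.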
15
Reducer.py
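As with the mapper, the reducer code from this slide is not preserved. A typical Streaming word-count reducer exploits the fact that input arrives sorted by key, so equal words are adjacent; a sketch (the `reduce_stream` helper is an illustrative name, not from the original slide):

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop Streaming word-count reducer (reconstruction;
# the original slide's code is not preserved in this transcript).
import sys

def reduce_stream(lines):
    """Aggregate sorted 'word<TAB>count' lines into (word, total) pairs."""
    results = []
    current_word, current_count = None, 0
    for line in lines:
        word, _, count = line.strip().partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            # Key changed: because input is sorted, the previous word is done.
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        results.append((current_word, current_count))
    return results

if __name__ == "__main__":
    for word, total in reduce_stream(sys.stdin):
        print("%s\t%d" % (word, total))
```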
16
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py -input example -output java-output
bin/hadoop dfs -cat java-output/part-00000
bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
17
Hadoop job tracker ◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
Hadoop task tracker ◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
Hadoop dfs checker ◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp