1
Hadoop Setup
2
Hadoop Setup
Prerequisites:
- System: Mac OS / Linux / Cygwin on Windows.
  Notice:
  1. Only Ubuntu will be supported by the TA. You may try other environments as a challenge.
  2. Cygwin on Windows is not recommended because of its instability and unforeseen bugs.
- Java Runtime Environment: Java 1.6.x is recommended.
- ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.
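A quick way to check the prerequisites (a minimal sketch, assuming a stock Ubuntu install; the package name applies to Ubuntu only):
  $ java -version                         # should report a 1.6.x JRE
  $ sudo apt-get install openssh-server   # installs sshd if it is missing
  $ ssh localhost                         # should connect once sshd is running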
3
Hadoop Setup
Single Node Setup (usually for debugging)
- Untar hadoop-*.**.*.tar.gz to your user path.
  About the version: the latest stable version, 1.0.1, is recommended.
- Edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.
- Edit the following files to configure these properties (see the XML sketch after this list):
  conf/core-site.xml: fs.default.name = hdfs://localhost:9000
  conf/hdfs-site.xml: dfs.replication = 1
  conf/mapred-site.xml: mapred.job.tracker = localhost:9001
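The property/value pairs above go inside the <configuration> element of each file. A minimal sketch of the three files for a single-node (pseudo-distributed) setup:

conf/core-site.xml:
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

conf/hdfs-site.xml:
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

conf/mapred-site.xml:
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
  </configuration>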
4
Hadoop Setup
Cluster Setup (the only acceptable setup for the homework)
- Same steps as the single node setup.
- Set the dfs.name.dir and dfs.data.dir properties in hdfs-site.xml.
- Add the master's node name to conf/masters.
- Add all the slaves' node names to conf/slaves.
- Edit /etc/hosts on each node: add an "IP node-name" entry for every node.
  For example, if your master's node name is ubuntu1 and its IP is 192.168.0.2, add the line "192.168.0.2 ubuntu1" to the file.
- Copy the Hadoop folder to the same path on all nodes.
  Notice: JAVA_HOME may not be set the same on each node.
(See the configuration sketch after this list.)
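A sketch of the cluster-side files, assuming two hypothetical nodes: ubuntu1 (master, 192.168.0.2, as in the example above) and ubuntu2 (a slave, 192.168.0.3). The dfs.name.dir and dfs.data.dir paths are placeholders, not recommendations:

conf/masters:
  ubuntu1

conf/slaves:
  ubuntu2

/etc/hosts (on every node):
  192.168.0.2 ubuntu1
  192.168.0.3 ubuntu2

conf/hdfs-site.xml (in addition to dfs.replication):
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
  </property>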
5
Hadoop Setup
SSH setup
- Generate an ssh key pair with an empty passphrase, so that no passphrase is asked for when the daemons start up:
  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  $ ssh localhost
Execution
- Format a new distributed filesystem:
  $ bin/hadoop namenode -format
- Start the Hadoop daemons:
  $ bin/start-all.sh
- The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
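One way to check that the daemons actually came up (a sketch; jps ships with the JDK, and the ports below are the Hadoop 1.x web UI defaults):
  $ jps   # should list NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
The NameNode web UI is normally at http://localhost:50070/ and the JobTracker UI at http://localhost:50030/; if a daemon is missing, check its log under ${HADOOP_LOG_DIR}.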
6
Hadoop Setup
Execution (continued)
- Copy the input files into the distributed filesystem:
  $ bin/hadoop fs -put conf input
- Run some of the examples provided:
  $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
- Examine the output files by viewing them on the distributed filesystem:
  $ bin/hadoop fs -cat output/*
- When you're done, stop the daemons with:
  $ bin/stop-all.sh
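Alternatively, the output can be copied from HDFS to the local filesystem and examined there (a small sketch; the local directory name "output" is just an example):
  $ bin/hadoop fs -get output output
  $ cat output/*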
7
Hadoop Setup
Details About Configuration Files
Hadoop configuration is driven by two types of important configuration files:
1. Read-only default configuration:
   src/core/core-default.xml
   src/hdfs/hdfs-default.xml
   src/mapred/mapred-default.xml
   conf/mapred-queues.xml.template
2. Site-specific configuration:
   conf/core-site.xml
   conf/hdfs-site.xml
   conf/mapred-site.xml
   conf/mapred-queues.xml
8
Hadoop Setup
Details About Configuration Files (continued)

conf/core-site.xml:
  Parameter: fs.default.name
  Value: URI of the NameNode.
  Notes: e.g. hdfs://hostname/

conf/hdfs-site.xml:
  Parameter: dfs.name.dir
  Value: Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.
  Notes: If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

  Parameter: dfs.data.dir
  Value: Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
  Notes: If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
9
Hadoop Setup
Details About Configuration Files (continued)

conf/mapred-site.xml:
  Parameter: mapred.job.tracker
  Value: Host or IP and port of the JobTracker.
  Notes: host:port pair.

  Parameter: mapred.system.dir
  Value: Path on HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/.
  Notes: This is in the default filesystem (HDFS) and must be accessible from both the server and client machines.

  Parameter: mapred.local.dir
  Value: Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
  Notes: Multiple paths help spread disk I/O.

  Parameter: mapred.tasktracker.{map|reduce}.tasks.maximum
  Value: The maximum number of Map/Reduce tasks that are run simultaneously on a given TaskTracker, individually.
  Notes: Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware.

  Parameter: dfs.hosts / dfs.hosts.exclude
  Value: List of permitted/excluded DataNodes.
  Notes: If necessary, use these files to control the list of allowable DataNodes.

  Parameter: mapred.hosts / mapred.hosts.exclude
  Value: List of permitted/excluded TaskTrackers.
  Notes: If necessary, use these files to control the list of allowable TaskTrackers.

  Parameter: mapred.queue.names
  Value: Comma-separated list of queues to which jobs can be submitted.
  Notes: The Map/Reduce system always supports at least one queue with the name "default", so this parameter's value should always contain the string "default". Some job schedulers supported in Hadoop, such as the Capacity Scheduler, support multiple queues; if such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property mapred.job.queue.name in the job configuration. There may be a separate configuration file, managed by the scheduler, for configuring the properties of these queues; refer to the scheduler's documentation for details.
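As a concrete illustration, a conf/mapred-site.xml for a small cluster might look like the following (a sketch; the hostname ubuntu1, the path, and the task limits are placeholder values, not recommendations):

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>ubuntu1:9001</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/home/hadoop/mapred/local</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
  </configuration>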
10
Hadoop Setup
You can find more detailed information at:
- The official site: http://hadoop.apache.org
- Course slides & textbooks: http://www.cs.sjtu.edu.cn/~liwujun/course/mmds.html
- Michael G. Noll's blog (a good guide): http://www.michael-noll.com/
If you have good materials to share, please send them to the TA.