CSE 482: Big Data Analysis
Lecture 16B: Instructions on how to use Hadoop on Amazon Web Services
Using Amazon Web Services (AWS)
First, you must sign up for an AWS account:
- Go to http://aws.amazon.com and click Sign Up Now.
- Follow the on-screen instructions. (You need to provide your credit or debit card information.)
- AWS will notify you by email when your account is available for use.
Next, obtain your AWS education grant credit:
- Apply at https://aws.amazon.com/education/awseducate/apply/
- Once approved, add the promo code (which will be sent to your email address) to your AWS account.
- You will be given a $35 AWS credit and will not be charged until your credit runs out.
Services available on AWS
Amazon Elastic Compute Cloud (EC2)
- Provides access to the cloud computing platform
- You can launch an SSH terminal to interact with the server
Amazon Elastic MapReduce (EMR)
- An EC2 server with Hadoop, Pig, Hive, and other software pre-installed
- You can launch an SSH terminal to interact with the EMR server
Logging in to AWS
After Logging in to AWS Account
AWS Management Console
- My account: account information (can use this to close your account)
- My Billing Dashboard: check your bill; redeem your AWS credit
- My Security Credentials: create authentication tokens
Billing & Cost Management Dashboard Click here to redeem your AWS credit
Security Credentials Click here
Creating Access Keys Click here
Creating Access Keys Click here to download and save the access key file (needed to use the AWS API)
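Once the access key has been downloaded, command-line tools need to be told about it. The following is a minimal sketch, assuming you use the AWS CLI or an SDK, both of which read a credentials file in the INI format shown. The helper name `write_aws_credentials` and the key values in the usage comment are placeholders, not part of the original slides.

```shell
#!/bin/sh
# Sketch: store an access key pair in the INI format read by the AWS CLI and SDKs.
# write_aws_credentials is a hypothetical helper name.
# The body runs in a subshell so the umask change does not leak into the caller.
write_aws_credentials() (
    # $1 = target file, $2 = access key ID, $3 = secret access key
    umask 077                              # new files readable by the owner only
    mkdir -p "$(dirname "$1")" || exit 1
    printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
        "$2" "$3" > "$1" || exit 1
    chmod 600 "$1"
)

# Example (placeholder values, copy the real ones from the downloaded key file):
#   write_aws_credentials "$HOME/.aws/credentials" YOUR_ACCESS_KEY_ID YOUR_SECRET_ACCESS_KEY
```

The default location is `~/.aws/credentials`; a different file can be selected with the `AWS_SHARED_CREDENTIALS_FILE` environment variable.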
Using SSH to Connect to AWS Elastic MapReduce
To connect to an AWS cloud computer:
1. Create public/private key pairs for access to EC2
2. Copy the key file to the machine from which you want to run SSH
3. Edit the Security Group
   - The security group specifies who can connect to the compute cluster you’ve launched
4. Launch the AWS EMR cluster
5. Connect to the cluster using SSH
6. Terminate the cluster (*VERY IMPORTANT*)
Steps 1-3 need to be performed only once
Step 1: Creating Key Pairs on EC2
1. Sign in to AWS (aws.amazon.com).
2. Click on the "Amazon EC2" tab.
3. Click on the "Key Pairs" link.
4. Click on the "Create Key Pair" button.
5. Enter a name and save the key file (*.pem).
6. Download the key file onto the machine from which you want to run SSH. For example, if you want to run ssh from arctic, you will need to save the *.pem file in your CSE account on arctic.
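The same key pair can also be created from the command line. This is a sketch only, assuming the AWS CLI is installed and configured with your access key; the slides themselves use the web console. The function name `create_key_pair` and the key name `cse482-key` are made up for illustration. By default the script only prints the command (dry run).

```shell
#!/bin/sh
# Sketch: create an EC2 key pair with the AWS CLI and save the private key as <name>.pem.
# DRY_RUN defaults to 1, so the command is printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

create_key_pair() {
    # $1 = key pair name (hypothetical, e.g. cse482-key)
    if [ "$DRY_RUN" = "1" ]; then
        echo "aws ec2 create-key-pair --key-name $1 --query KeyMaterial --output text > $1.pem"
    else
        # KeyMaterial is the private key text returned by the service
        aws ec2 create-key-pair --key-name "$1" --query 'KeyMaterial' --output text > "$1.pem" &&
            chmod 400 "$1.pem"
    fi
}

create_key_pair cse482-key
```

Run with `DRY_RUN=0` to actually create the key pair. The private key can be downloaded only at creation time, so keep the resulting file safe.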
Step 1: Creating Key Pairs on EC2 Click on Services
Step 1: Creating Key Pairs on EC2 Click on EC2
Step 1: Creating Key Pairs on EC2 Click on Key Pairs and then the Create Key Pair button
Step 2: Copy the Key File
After creating the key pair, you will obtain a private key file (“*.pem”).
Connecting from Mac/Linux
- Save the file to the directory from which you will run SSH.
Connecting from Windows
- Convert the “*.pem” private key file to a “.ppk” file using puttygen.
- Download puttygen.exe from http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
- Follow the instructions on the following page to convert the .pem file to a .ppk file: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html
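On Mac/Linux, one detail is worth adding to the step above: OpenSSH refuses to use a private key file that other users can read, so the *.pem file should be made readable by its owner only. A minimal sketch follows; `secure_key` is a hypothetical helper name and `mykey.pem` is the placeholder file name used elsewhere in these slides.

```shell
#!/bin/sh
# Sketch: restrict a private key so that only its owner can read it.
# secure_key is a hypothetical helper name.
secure_key() {
    # $1 = path to the *.pem file
    if [ ! -f "$1" ]; then
        echo "key file not found: $1" >&2
        return 1
    fi
    chmod 400 "$1"    # owner read-only, no access for group or others
}

# Example, after copying the key to the machine you will run ssh from:
#   scp mykey.pem <your-user>@arctic.cse.msu.edu:~/
#   secure_key ~/mykey.pem
```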
Step 3: Edit the Security Group Click on EC2
Step 3: Edit the Security Group Click here to edit security group
Step 3: Edit the Security Group Select the Elastic MapReduce master and then click on the Inbound tab
Step 3: Edit the Security Group
Add the rule below to the security group:

    Type    Protocol    Port Range    Source
    SSH     TCP         22            Anywhere

This allows you to connect to the master node from anywhere. You can instead specify a single IP address as the source, so that connections are accepted only from that address.
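The rule above can also be expressed with the AWS CLI. This is a sketch under assumptions: the CLI is installed and configured, and `sg-EXAMPLE` stands in for the real ID of the Elastic MapReduce master security group, which you would look up in the console. The helper names `run` and `allow_ssh` are made up here. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch: add an inbound SSH rule (TCP port 22) to a security group.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

allow_ssh() {
    # $1 = security group ID (placeholder), $2 = source address range in CIDR notation
    run aws ec2 authorize-security-group-ingress \
        --group-id "$1" --protocol tcp --port 22 --cidr "$2"
}

allow_ssh sg-EXAMPLE 0.0.0.0/0          # Source "Anywhere"
allow_ssh sg-EXAMPLE 203.0.113.25/32    # a single address (example range, not a real host)
```

Only one of the two calls would be used in practice; the `/32` form corresponds to restricting access to one IP address.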
Step 4: Launch AWS EMR Cluster Click on EMR
Step 4: Launch AWS EMR Cluster Click on Create Cluster
Step 4: Launch AWS EMR Cluster
- Specify m1.medium as the instance type and set the number of instances to 2 (the more instances you use, the more it costs).
- Select the key pair from the list provided.
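For reference, the console settings above map roughly onto a single AWS CLI command. This is a hedged sketch, not the procedure from the slides: it assumes the AWS CLI is configured and that the default EMR roles exist in your account. The release label is deliberately left as a parameter because the slides do not name one, and whether m1.medium is offered depends on the release and region, which is not checked here. `run`, `launch_cluster`, and the names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: launch a 2-instance m1.medium EMR cluster with Hadoop, Pig and Hive.
# DRY_RUN defaults to 1, so the command is printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

launch_cluster() {
    # $1 = cluster name, $2 = EMR release label, $3 = EC2 key pair name from Step 1
    run aws emr create-cluster \
        --name "$1" \
        --release-label "$2" \
        --applications Name=Hadoop Name=Pig Name=Hive \
        --instance-type m1.medium \
        --instance-count 2 \
        --ec2-attributes KeyName="$3" \
        --use-default-roles
}

# Example (all three arguments are placeholders):
#   launch_cluster cse482-cluster <release-label> cse482-key
```

When executed for real, the command prints the new cluster's ID, which is needed later to check on and terminate the cluster.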
Step 4: Launch AWS EMR Cluster After you have launched the new cluster, wait for several minutes until the cluster has started.
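Instead of refreshing the console, the wait can be scripted. The sketch below assumes a configured AWS CLI and uses `j-EXAMPLE` as a placeholder cluster ID; `run` and `wait_for_cluster` are hypothetical helper names. It blocks until the cluster has started and then asks for the master node's public host name, which is the name to use when connecting with SSH.

```shell
#!/bin/sh
# Sketch: wait for an EMR cluster to start, then print the master node's host name.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

wait_for_cluster() {
    # $1 = cluster ID (placeholder)
    run aws emr wait cluster-running --cluster-id "$1" &&
        run aws emr describe-cluster --cluster-id "$1" \
            --query Cluster.MasterPublicDnsName --output text
}

wait_for_cluster j-EXAMPLE
```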
Step 4: Launch AWS EMR Cluster Click on the SSH link and read the document carefully. Follow the instructions to connect to the cluster using PuTTY (for Windows) or SSH (on Linux).
Step 5A: Connect to EMR from Windows Read the instructions on how to connect to the EMR cluster on Windows. The private key file they refer to is the .ppk file generated in Step 2.
Step 5A: Connect to EMR from Windows Start the PuTTY program and enter the cluster's host name, as given in the instructions, into the Host Name field
Step 5A: Connect to EMR from Windows Click on SSH -> Auth and provide the private key file (*.ppk) you generated in Step 2
Step 5A: Connect to EMR from Windows Click open
Step 5A: Connect to EMR from Windows Success! You’re now connected to the EMR cluster from PuTTY
Step 5B: Connect to EMR from Linux Read the instructions on how to connect to the EMR cluster on Mac/Linux. They show the host name to use to connect to the EMR cluster.
Step 5B: Connect to EMR from Linux
1. Log in to one of the machines on the CSE server (e.g. arctic.cse.msu.edu or black.cse.msu.edu).
2. Go to the directory that contains the *.pem file (see Step 2).
3. Invoke ssh to connect to the cluster:

       arctic> ssh hadoop@<cluster-name> -i mykey.pem

   - <cluster-name> is the host name given on the previous slide.
   - Replace mykey.pem with the name of your *.pem file.
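If you connect more than once, the host name, user, and key file from the ssh command above can be stored in an SSH configuration file so that they need not be retyped. This is an optional convenience sketched under assumptions: the alias `emr`, the helper name `write_emr_ssh_config`, and the file name `emr_ssh_config` are all made up here, while `Host`, `HostName`, `User`, and `IdentityFile` are standard OpenSSH configuration keywords.

```shell
#!/bin/sh
# Sketch: append an SSH configuration entry for the EMR master node.
# write_emr_ssh_config and the alias "emr" are hypothetical names.
write_emr_ssh_config() {
    # $1 = config file to append to, $2 = cluster host name, $3 = path to the *.pem file
    printf 'Host emr\n    HostName %s\n    User hadoop\n    IdentityFile %s\n' \
        "$2" "$3" >> "$1"
}

# Example (replace <cluster-name> and mykey.pem as in the ssh command above):
#   write_emr_ssh_config ./emr_ssh_config <cluster-name> "$PWD/mykey.pem"
#   ssh -F ./emr_ssh_config emr
```

Because the host name changes each time a new cluster is launched, the entry has to be rewritten for every cluster.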
Step 5B: Connect to EMR from Linux Result after opening the SSH connection
Step 6: Terminate Your Cluster
After you’ve completed the task, terminate the cluster you launched (VERY IMPORTANT). You will be charged for as long as your cluster is running, and you have only $35 of credit on AWS.
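Termination can also be done, and double-checked, from the command line. The sketch below assumes a configured AWS CLI; `j-EXAMPLE` is a placeholder cluster ID and `run`, `terminate_cluster`, and `list_active_clusters` are hypothetical helper names. Listing the active clusters afterwards is a simple way to confirm that nothing is still running and accruing charges.

```shell
#!/bin/sh
# Sketch: terminate an EMR cluster and list any clusters that are still active.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

terminate_cluster() {
    # $1 = cluster ID (placeholder)
    run aws emr terminate-clusters --cluster-ids "$1"
}

list_active_clusters() {
    run aws emr list-clusters --active
}

terminate_cluster j-EXAMPLE
list_active_clusters
```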