MapReduce in Amazon Web Services
Introduction
Amazon Elastic MapReduce
– Amazon provides the MapReduce framework and an interface to it
– Data store: Amazon Simple Storage Service (Amazon S3)
– Interface: Web, Console, API
Running Hadoop Manually
– Set up Amazon EC2 instances
– Set up Hadoop manually on the instances
Amazon Web Services
Amazon EC2
– Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
– i.e., a virtual computer; note that its local state is wiped when the instance is shut down
Amazon EBS
– Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance.
– i.e., an external disk that can be attached to an EC2 instance; data on it is stored persistently
Amazon S3
– Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.
– i.e., a distributed storage system similar to HDFS; reading and writing require a separate API (see the distcp sketch below)
Amazon Elastic MapReduce
– It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
– i.e., Amazon's MapReduce offering; it provides an interface for running MapReduce programs
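For example, since S3 is accessed through its own API rather than as a local file system, Hadoop's s3n:// bindings are typically used to move data between HDFS and S3. A minimal sketch with distcp, assuming a bucket named my-bucket and S3 credentials already configured for Hadoop:
# Copy a directory from HDFS into S3 (bucket name is an assumption)
bin/hadoop distcp hdfs:///user/hadoop/output s3n://my-bucket/output
# Copy it back from S3 into HDFS
bin/hadoop distcp s3n://my-bucket/output hdfs:///user/hadoop/restored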
Amazon Elastic MapReduce
Running Hadoop Manually
Setup Methods
1. Launch EC2 from an image with Hadoop pre-installed, then configure it manually
2. Install and copy Hadoop onto an EBS-backed AMI, then configure it manually
3. Use the hadoop-ec2 scripts shipped with Hadoop (see the sketch below)
4. Use Whirr
Methods 1 and 2 take considerable effort: you must launch the EC2 instances, find out their IP addresses, and configure Hadoop by hand.
Method 3 uses a program in Hadoop's contrib package; this work now continues in Whirr, and the contrib scripts are no longer actively maintained.
Method 4 is the most convenient.
– The drawback is that when the cluster is torn down, any changes to HDFS are lost.
– Data therefore needs to be stored on an external storage service such as EBS or S3.
Reference
– amazon-ec2/
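For reference, method 3's hadoop-ec2 contrib scripts were driven roughly as follows; this is only a sketch, since the scripts live under src/contrib/ec2 in older Hadoop releases and their options vary by version:
# Launch a cluster named test-cluster with 2 slave nodes
src/contrib/ec2/bin/hadoop-ec2 launch-cluster test-cluster 2
# Log in to the master node
src/contrib/ec2/bin/hadoop-ec2 login test-cluster
# Terminate the cluster when finished
src/contrib/ec2/bin/hadoop-ec2 terminate-cluster test-cluster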
Amazon Web Services Create an AWS Account
Amazon Web Services Account Information Payment Method
Amazon Web Services Payment Method Sign in to the AWS Management Console
Amazon Web Services AWS Management Console
whirr
Apache Incubator project
A library that automates installing, configuring, and launching services such as Hadoop on commercial cloud environments like Amazon EC2
Supported cloud environments and services:
Cloud provider            Cassandra  Hadoop  ZooKeeper  HBase  elasticsearch  Voldemort
Amazon EC2                Yes        Yes     Yes        Yes    Yes            Yes
Rackspace Cloud Servers   Yes        Yes     Yes        Yes    Yes            Yes
Preparation Security Credentials
Security Credentials
Create a new Access Key
Preparation
Download Hadoop and Whirr, then extract them
Whirr in 5 minutes
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
curl -O <URL of the Whirr incubating release tarball>
tar zxf whirr-*-incubating.tar.gz; cd whirr-*-incubating
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo
bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
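After launch-cluster returns, the running instances can be inspected before the destroy step; a sketch assuming this Whirr release supports the list-cluster command:
# List role, instance ID and addresses for each node in the cluster
bin/whirr list-cluster --config recipes/zookeeper-ec2.properties
# The same information is also written to ~/.whirr/zookeeper/instances
cat ~/.whirr/zookeeper/instances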
Whirr in 5 minutes
Configuring
Setting environment variables to specify AWS credentials
– AWS Access Key ID
– AWS Secret Access Key
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
Configure a Hadoop cluster
– Make a copy of hadoop-ec2.properties
– Edit hadoop-ec2-mod.properties
cd whirr-*-incubating
cp recipes/hadoop-ec2.properties ./hadoop-ec2-mod.properties
vim hadoop-ec2-mod.properties
Configuring
hadoop-ec2-mod.properties
whirr.cluster-user=hadoop
whirr.cluster-name=hadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${whirr.private-key-file}.pub
whirr.hardware-id=m1.xlarge
whirr.image-id=us-east-1/ami-08f40561
whirr.location-id=us-east-1d
# Expert: specify the version of Hadoop to install.
#whirr.hadoop.version=
#whirr.hadoop.tarball.url=${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
Configuring
whirr.instance-templates
– The number of instances to launch for each set of roles in a service
– e.g., 1 nn+jt,10 dn+tt means one instance with the roles nn (namenode) and jt (jobtracker), and ten instances each with the roles dn (datanode) and tt (tasktracker); see the sketch below
whirr.image-id
– The ID of the image to use for instances. If not specified, a vanilla Linux image is chosen.
– e.g., us-east-1/ami-08f40561 (as in the properties file above)
whirr.location-id
– The location to launch instances in. If not specified, an arbitrary location will be chosen.
– If you choose a different location, make sure whirr.image-id is updated too.
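Scaling the cluster is then just a matter of changing the counts in the template; a sketch of a larger layout using the same role names as the recipe above:
# One master (namenode + jobtracker) and ten workers (datanode + tasktracker)
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker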
Configuring
whirr.hardware-id
– The instance type (hardware) to use, e.g., m1.xlarge as set in the properties file above
Configuring Price of On-Demand Instances
Configuring
Generate a keypair
ssh-keygen -t rsa -P ''
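If you prefer not to reuse ~/.ssh/id_rsa, a dedicated passphrase-less key pair can be generated and referenced from the properties file, matching the id_rsa_whirr key used in the 5-minute example earlier (a sketch):
# Generate a key pair dedicated to Whirr
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
# Then point the properties file at it:
# whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
# whirr.public-key-file=${whirr.private-key-file}.pub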
Launch
Run the following command to launch a cluster:
bin/whirr launch-cluster --config hadoop-ec2-mod.properties
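Launching takes a few minutes. When the command returns, Whirr writes per-cluster files under ~/.whirr/<cluster-name>; a quick check (file names as used later in this section, exact contents may differ by Whirr version):
# The directory is named after whirr.cluster-name (here: hadoopcluster)
ls ~/.whirr/hadoopcluster
# Expect files such as hadoop-site.xml, hadoop-proxy.sh and instances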
Run a MapReduce Job
A hadoop-site.xml file is created in the directory ~/.whirr/hadoopcluster (named after the cluster).
You can use it to connect to the cluster by setting the HADOOP_CONF_DIR environment variable.
Run the proxy:
export HADOOP_CONF_DIR=~/.whirr/hadoopcluster
. ~/.whirr/hadoopcluster/hadoop-proxy.sh
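The proxy script keeps an SSH tunnel to the cluster open, so it has to stay running for as long as you use the cluster. A common pattern (a sketch, assuming nohup is available) is to run it in the background or in a separate terminal:
# Keep the proxy running in the background and capture its output
nohup ~/.whirr/hadoopcluster/hadoop-proxy.sh > proxy.log 2>&1 &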
Run a MapReduce Job
You should now be able to browse HDFS:
cd ..
cd hadoop-*/
bin/hadoop fs -ls /
Run a MapReduce Job
You can now run a MapReduce job from your local machine:
bin/hadoop fs -mkdir input
bin/hadoop fs -put LICENSE.txt input
bin/hadoop jar hadoop-*-examples.jar wordcount input output
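While the job is running, its progress can also be checked from the same shell; a minimal sketch using the standard Hadoop job client:
# List MapReduce jobs currently running on the cluster
bin/hadoop job -list
# Inspect the output directory once the job has completed
bin/hadoop fs -ls output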
Run a MapReduce Job
You can then view the result of the MapReduce job:
bin/hadoop fs -cat output/part-* | tail
Destroy a cluster
When you have finished using a cluster, you can terminate the instances and clean up resources with the following command. All data in HDFS will be deleted when you destroy the cluster, so copy out anything you want to keep first (see below).
bin/whirr destroy-cluster --config hadoop-ec2-mod.properties
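Because destroying the cluster wipes HDFS, anything worth keeping should be copied out beforehand; a minimal sketch (the local target directory is an assumption):
# Copy the job output to the local machine before destroying the cluster
bin/hadoop fs -get output ./wordcount-output
# Alternatively, copy it to S3 with distcp (see the earlier sketch) or stage it on EBS as shown below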
Using Amazon EBS
Transfer data that you want to reuse onto an EBS volume so that it survives the cluster teardown
Using Amazon EBS
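Before the volume can be formatted and mounted (next slide), it must be created in the same availability zone as the instance and attached to it. This can be done in the AWS Management Console shown above or, as a sketch, with the classic EC2 command-line API tools; the size and the vol-/i- IDs are placeholders, and the zone matches whirr.location-id from the properties file:
# Create a 10 GB EBS volume in the instance's availability zone
ec2-create-volume -s 10 -z us-east-1d
# Attach the volume to the instance as device /dev/sdf
ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdf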
ssh -i /home/xeryeon/.ssh/id_rsa ...
mkdir ebs
sudo mkfs.ext4 /dev/sdf
sudo mount /dev/sdf ./ebs/
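Once the volume is mounted on the instance, HDFS data can be staged onto it so that it outlives the cluster; a sketch (the HDFS path and the availability of the hadoop command on the instance's PATH are assumptions):
# Copy a directory out of HDFS onto the mounted EBS volume
hadoop fs -get /user/hadoop/output ./ebs/output
# Unmount cleanly before detaching the volume from the instance
sudo umount ./ebs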