MapReduce in Amazon Web Services. Introduction Amazon Elastic MapReduce – Amazon provides MapReduce framework and interface – Data Store: Amazon Simple.


1 MapReduce in Amazon Web Services

2 Introduction
Amazon Elastic MapReduce
– Amazon provides the MapReduce framework and interface
– Data store: Amazon Simple Storage Service (Amazon S3)
– Interface: web, console, API
Running Hadoop Manually
– Set up Amazon EC2 instances
– Set up Hadoop manually on the instances

3 Amazon Web Services
Amazon EC2
– Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
– i.e., a computer, except that it is reset when the instance is powered off
Amazon EBS
– Amazon Elastic Block Store (EBS) provides block-level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently of the life of an instance.
– i.e., an external hard drive that can be attached to EC2; its data is stored persistently
Amazon S3
– Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.
– i.e., a distributed storage system like HDFS; reads and writes go through a separate API
Amazon Elastic MapReduce
– It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
– i.e., Amazon's MapReduce solution, which provides an interface for running MapReduce programs

4 Amazon Elastic MapReduce

5

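6 Running Hadoop Manually
Setup Methods
1. Launch EC2 from an image that already has Hadoop installed, then configure it manually
2. Install and copy Hadoop onto an EBS-backed AMI, then configure it manually
3. Use hadoop-ec2, which is included with Hadoop
4. Use Whirr
Methods 1 and 2 take a lot of effort: you have to launch the EC2 instances, or find the IP addresses of the running EC2 instances, and then configure Hadoop yourself.
Method 3 is a program in Hadoop's contrib package. Its development has moved to Whirr, and it is no longer actively maintained.
Method 4 is the most convenient.
– Drawback: when the cluster goes down, any changes made to HDFS are lost
– Data therefore needs to be stored in an external storage service such as EBS or S3
Reference
– http://diveintodata.org/2011/03/whirr-usage-for-hadoop-cluster-in-amazon-ec2/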

7 Amazon Web Services http://aws.amazon.com/ Create an AWS Account

8 Amazon Web Services Account Information Payment Method

9 Amazon Web Services Payment Method Sign in to the AWS Management Console

10 Amazon Web Services AWS Management Console

11 Whirr
https://incubator.apache.org/whirr/
– Apache Incubator project
– A library that automatically installs, configures, and runs the services you want in commercial cloud environments such as Amazon EC2
Supported cloud environments and services
– Services: Cassandra, Hadoop, ZooKeeper, HBase, elasticsearch, Voldemort
– Cloud providers: Amazon EC2 (Yes), Rackspace Cloud Servers (Yes)

12 Preparation Security Credentials

13 Create a new Access Key Security Credentials

14 Preparation
– Download Hadoop and Whirr
– Extract them
Whirr in 5 minutes
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
    curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.5.0-incubating/whirr-0.5.0-incubating.tar.gz
    tar zxf whirr-0.5.0-incubating.tar.gz; cd whirr-0.5.0-incubating
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
    bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
    echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo
    bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

15 Whirr in 5 minutes

16 Configuring
Set environment variables to specify the AWS credentials
– AWS Access Key ID
– AWS Secret Access Key
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
Configure a Hadoop cluster
– Make a copy of hadoop-ec2.properties
– Edit hadoop-ec2-mod.properties
    cd whirr-0.5.0-incubating
    cp recipes/hadoop-ec2.properties ./hadoop-ec2-mod.properties
    vim hadoop-ec2-mod.properties
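A launch fails if either credential variable is unset, so a small pre-flight check can save a round trip. The following is only a sketch: the function name check_aws_credentials is hypothetical and is not part of Whirr.

```shell
#!/bin/sh
# Hypothetical helper (not part of Whirr): verify that both AWS credential
# variables are set and non-empty before launching a cluster.
check_aws_credentials() {
    missing=0
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use is to chain it in front of the launch command, e.g. `check_aws_credentials && bin/whirr launch-cluster --config hadoop-ec2-mod.properties`.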

17 Configuring hadoop-ec2-mod.properties
http://incubator.apache.org/whirr/configuration-guide.html
    whirr.cluster-user=hadoop
    whirr.cluster-name=hadoopcluster
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${whirr.private-key-file}.pub
    whirr.hardware-id=m1.xlarge
    whirr.image-id=us-east-1/ami-08f40561
    whirr.location-id=us-east-1d
    # Expert: specify the version of Hadoop to install.
    #whirr.hadoop.version=0.20.203.0
    #whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
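After editing the properties file it can be useful to read a value back and confirm the edit took effect. The sketch below assumes a plain key=value layout like the one on this slide; get_prop is a hypothetical helper, not a Whirr command, and it prints values literally without expanding placeholders such as ${env:...}.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from a Java-style
# .properties file, skipping comment lines that start with '#'.
get_prop() {
    key=$1
    file=$2
    # Split on the first '=' only, so values containing '=' stay intact.
    awk -v k="$key" '
        /^[[:space:]]*#/ { next }
        {
            i = index($0, "=")
            if (i > 0 && substr($0, 1, i - 1) == k) {
                print substr($0, i + 1)
                exit
            }
        }
    ' "$file"
}
```

For example, `get_prop whirr.cluster-name hadoop-ec2-mod.properties` would print the cluster name that later determines the directory under ~/.whirr/.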

18 Configuring
whirr.instance-templates
– The number of instances to launch for each set of roles in a service
– e.g., 1 nn+jt,10 dn+tt means one instance with the roles nn (namenode) and jt (jobtracker), and ten instances each with the roles dn (datanode) and tt (tasktracker)
whirr.image-id
– The ID of the image to use for instances. If not specified, a vanilla Linux image is chosen.
– e.g., http://alestic.com/
whirr.location-id
– The location to launch instances in. If not specified, an arbitrary location is chosen.
– If you choose a different location, make sure whirr.image-id is updated too
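Since each template entry starts with an instance count, the total cluster size is just the sum of those leading numbers. A minimal sketch, with count_instances as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: sum the instance counts in a whirr.instance-templates
# value such as "1 nn+jt,10 dn+tt".
count_instances() {
    # One template per line, then add up the first field of each line.
    printf '%s\n' "$1" | tr ',' '\n' | awk '{ total += $1 } END { print total + 0 }'
}
```

With the example above, `count_instances "1 nn+jt,10 dn+tt"` prints 11: one master plus ten workers.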

19 Configuring whirr.hardware-id – http://aws.amazon.com/ec2/instance-types/

20 Configuring Price of On-Demand Instances
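A rough cost estimate follows from the instance count, the hourly on-demand price, and the expected running time. The sketch below assumes cost scales with instance-hours; estimate_cost is a hypothetical helper, and the price used in the usage note is a placeholder, not an actual AWS price.

```shell
#!/bin/sh
# Hypothetical helper: estimate the on-demand cost of a cluster.
# Arguments: instance count, hourly price per instance (USD), hours.
estimate_cost() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v n="$1" -v price="$2" -v hours="$3" \
        'BEGIN { printf "%.2f\n", n * price * hours }'
}
```

For instance, `estimate_cost 3 0.50 4` prints 6.00 for three instances at a placeholder price of 0.50 per hour over four hours. Look up the real price for the chosen whirr.hardware-id on the pricing page.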

21 Configuring
Generate a keypair
    ssh-keygen -t rsa -P ''

22 Launch
Run the following command to launch a cluster:
    bin/whirr launch-cluster --config hadoop-ec2-mod.properties

23 Run a MapReduce Job
A hadoop-site.xml file is created in the directory ~/.whirr/<cluster-name> (here, ~/.whirr/hadoopcluster).
You can use it to connect to the cluster by setting the HADOOP_CONF_DIR environment variable:
    export HADOOP_CONF_DIR=~/.whirr/hadoopcluster
Run a proxy:
    . ~/.whirr/hadoopcluster/hadoop-proxy.sh
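If the launch did not finish, the configuration directory will not contain hadoop-site.xml and later hadoop commands will not reach the cluster. The sketch below only sets HADOOP_CONF_DIR when the file is present; use_cluster_conf is a hypothetical helper, and the directory layout is assumed from the paths on this slide.

```shell
#!/bin/sh
# Hypothetical helper: point HADOOP_CONF_DIR at the directory Whirr created
# for a cluster, but only if hadoop-site.xml is actually there.
use_cluster_conf() {
    conf_dir="$HOME/.whirr/$1"
    if [ -f "$conf_dir/hadoop-site.xml" ]; then
        HADOOP_CONF_DIR=$conf_dir
        export HADOOP_CONF_DIR
    else
        echo "no hadoop-site.xml in $conf_dir; was the cluster launched?" >&2
        return 1
    fi
}
```

Calling `use_cluster_conf hadoopcluster` in the current shell then has the same effect as the export on this slide, with an error message instead of a silent misconfiguration.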

24 Run a MapReduce Job
You should now be able to browse HDFS:
    cd ..
    cd hadoop-0.20.2/
    bin/hadoop fs -ls /

25 Run a MapReduce Job
You can submit a MapReduce job from the local machine:
    bin/hadoop fs -mkdir input
    bin/hadoop fs -put LICENSE.txt input
    bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output

26 Run a MapReduce Job
View the result of the MapReduce job:
    bin/hadoop fs -cat output/part-* | tail
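To sanity-check the job output on a small file, the same word count can be approximated locally with a shell pipeline whose stages loosely mirror map, shuffle, and reduce. This is a sketch, not the Hadoop example's implementation: local_wordcount is a hypothetical name, and tokenization and ordering may differ from the cluster result in edge cases.

```shell
#!/bin/sh
# Local approximation of the wordcount job, for small files only.
# Reads text on stdin and prints "word<TAB>count" lines sorted by word.
local_wordcount() {
    LC_ALL=C tr -s '[:space:]' '\n' |   # "map": one whitespace-separated token per line
        grep -v '^$' |                  # drop the empty line that leading whitespace produces
        LC_ALL=C sort |                 # "shuffle": bring identical tokens together
        LC_ALL=C uniq -c |              # "reduce": count each run of identical tokens
        awk '{ print $2 "\t" $1 }'      # reorder to word, tab, count
}
```

Running `local_wordcount < LICENSE.txt | tail` gives lines that can be compared by eye with the output of the hadoop fs -cat command above.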

27 Destroy a cluster
When you've finished using a cluster, you can terminate the instances and clean up resources with the following command. All data will be deleted when you destroy the cluster.
    bin/whirr destroy-cluster --config hadoop-ec2-mod.properties

28 Using Amazon EBS
Transfer the data you want to reuse

29 Using Amazon EBS

30
    ssh -i /home/xeryeon/.ssh/id_rsa hadoop@ec2-50-17-27-13.compute-1.amazonaws.com
    mkdir ebs
    sudo mkfs.ext4 /dev/sdf
    sudo mount /dev/sdf ./ebs/

