Download presentation
Presentation is loading. Please wait.
Published byReynold Dorsey Modified over 9 years ago
1
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center
2
Hadoop Overview Framework for Big Data Map/Reduce Platform for Big Data Applications
3
Map/Reduce Apply a Function to all the Data Harvest, Sort, and Process the Output
4
4 © 2010 Pittsburgh Supercomputing Center © 2014 Pittsburgh Supercomputing Center Big Data … Split n Split 1 Split 3 Split 4 Split 2 … Output n Output 1 Output 3 Output 4 Output 2 Reduce F(x) Map F(x) Result Map/Reduce
5
HDFS Distributed FS Layer WORM fs –Optimized for Streaming Throughput Exports Replication Process data in place
6
HDFS Invocations: Getting Data In and Out hadoop dfs -ls hadoop dfs -put hadoop dfs -get hadoop dfs -rm hadoop dfs -mkdir hadoop dfs -rmdir
7
Writing Hadoop Programs Wordcount Example: Wordcount.java –Map Class –Reduce Class
8
Compiling javac -cp $HADOOP_HOME/hadoop-core*.jar \ -d WordCount/ WordCount.java
9
Packaging jar -cvf WordCount.jar -C WordCount/.
10
Submitting your Job hadoop \ jar WordCount.jar \ org.myorg.WordCount \ /datasets/compleat.txt \ $MYOUTPUT \ -D mapred.reduce.tasks=2
11
Configuring your Job Submission Mappers and Reducers Java options Other parameters
12
Monitoring Important Ports: –Hearth-00.psc.edu:50030 – Jobtracker (MapReduce Jobs) –Hearth-00.psc.edu:50070 – HDFS (Namenode) –Hearth-03.psc.edu:50060 – Tasktracker (Worker Node) –Hearth-03.psc.edu:50075 – Datanode
13
Hadoop Streaming Write Map/Reduce Jobs in any language Excellent for Fast Prototyping
14
Hadoop Streaming: Bash Example Bash wc and cat hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \ - input /datasets/plays/ \ -output mynewoutputdir \ -mapper '/bin/cat' \ -reducer '/usr/bin/wc -l '
15
Hadoop Streaming Python Example Wordcount in python hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \ -file mapper.py \ -mapper mapper.py \ -file reducer.py \ -reducer reducer.py \ -input /datasets/plays/ \ -output pyout
16
Applications in the Hadoop Ecosystem Hbase (NoSQL database) Hive (Data warehouse with SQL-like language) Pig (SQL-style mapreduce) Mahout (Machine learning via mapreduce) Spark (Caching computation framework)
17
Spark Alternate programming framework using HDFS Optimized for in-memory computation Well supported in Java, Python, Scala
18
Spark Resilient Distributed Dataset (RDD) RDD for short Persistence-enabled data collections Transformations Actions Flexible implementation: memory vs. hybrid vs. disk
19
Spark example lettercount.py
20
Spark Machine Learning Library Clustering (K-Means) Many others, list at http://spark.apache.org/docs/1.0.1/mllib-guide.html
21
K-Means Clustering Randomly seed cluster starting points Test each point with respect to the others in its cluster to find a new mean If the centroids change do it again If the centroids stay the same they've converged and we're done. Awesome visualization: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
22
K-Means Examples spark-submit \ $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \ hdfs://hearth-00.psc.edu:/datasets/kmeans_data.txt 3 spark-submit \ $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \ hdfs://hearth-00.psc.edu:/datasets/archiver.txt 2
23
Questions? Thanks!
24
References and Useful Links HDFS shell commands: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html Writing and running your first program: http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197 http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197 https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program Hadoop Streaming: http://hadoop.apache.org/docs/stable1/streaming.html https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ http://hadoop.apache.org/docs/stable1/streaming.html https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ Hadoop Stable API: http://hadoop.apache.org/docs/r1.2.1/api/ http://hadoop.apache.org/docs/r1.2.1/api/ Hadoop Official Releases: https://hadoop.apache.org/releases.html https://hadoop.apache.org/releases.html Spark Documentation http://spark.apache.org/docs/latest/
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.