Introduction to Hadoop Programming
Bryon Gill, Pittsburgh Supercomputing Center
Hadoop Overview
– Framework for Big Data
– Map/Reduce Platform for Big Data Applications
Map/Reduce
– Apply a Function to all the Data
– Harvest, Sort, and Process the Output
Map/Reduce
[Diagram: Big Data is divided into Split 1 … Split n; a Map F(x) runs on each split, producing Output 1 … Output n; a Reduce F(x) combines the outputs into the final Result.]
© 2010, 2014 Pittsburgh Supercomputing Center
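The map/sort/reduce flow can be sketched in a few lines of plain Python. This is a local illustration only; Hadoop performs the same three phases in parallel across many machines, and the splits and words here are made up for the example:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map phase: emit a (key, value) pair for every word in an input split.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce phase: combine all values that share a key.
    return (key, sum(values))

# Stand-in for "Big Data" already cut into splits.
splits = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: apply map_fn to every split.
mapped = [pair for line in splits for pair in map_fn(line)]

# Shuffle/sort: group the intermediate pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce: one reduce_fn call per distinct key yields the final Result.
result = dict(reduce_fn(k, (v for _, v in g))
              for k, g in groupby(mapped, key=itemgetter(0)))

print(result)
```

The sort between the two function applications is what lets each reduce call see every value for its key at once; that is the "Harvest, Sort" step from the previous slide.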
HDFS
– Distributed FS Layer
– WORM fs
– Optimized for Streaming Throughput
– Exports
– Replication
– Process data in place
HDFS Invocations: Getting Data In and Out
hadoop dfs -ls
hadoop dfs -put
hadoop dfs -get
hadoop dfs -rm
hadoop dfs -mkdir
hadoop dfs -rmdir
Writing Hadoop Programs
Wordcount Example: WordCount.java
– Map Class
– Reduce Class
Compiling
javac -cp $HADOOP_HOME/hadoop-core*.jar \
  -d WordCount/ WordCount.java
Packaging
jar -cvf WordCount.jar -C WordCount/ .
Submitting your Job
hadoop jar WordCount.jar \
  org.myorg.WordCount \
  -D mapred.reduce.tasks=2 \
  /datasets/compleat.txt \
  $MYOUTPUT
(Generic options such as -D must come before the input and output arguments.)
Configuring your Job Submission
– Mappers and Reducers
– Java options
– Other parameters
Monitoring
Important Ports:
– hearth-00.psc.edu:50030 – Jobtracker (MapReduce Jobs)
– hearth-00.psc.edu:50070 – HDFS (Namenode)
– hearth-03.psc.edu:50060 – Tasktracker (Worker Node)
– hearth-03.psc.edu:50075 – Datanode
Hadoop Streaming
– Write Map/Reduce Jobs in any language
– Excellent for Fast Prototyping
Hadoop Streaming: Bash Example
Bash wc and cat
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
  -input /datasets/plays/ \
  -output mynewoutputdir \
  -mapper '/bin/cat' \
  -reducer '/usr/bin/wc -l'
Hadoop Streaming: Python Example
Wordcount in Python
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer.py \
  -input /datasets/plays/ \
  -output pyout
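The command above ships mapper.py and reducer.py to the cluster but the slide does not show their contents. The sketch below is one plausible implementation of a streaming wordcount pair (hypothetical; the files used in the class may differ). Streaming pipes input splits to the mapper on stdin and pipes the mapper's output, sorted by key, to the reducer, so the reducer only has to sum runs of identical keys:

```python
#!/usr/bin/env python
# Hypothetical mapper.py / reducer.py logic for a streaming wordcount.

def mapper(lines):
    # Emit one "word<TAB>1" record per word seen on stdin.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # Streaming sorts mapper output by key before the reducer sees it,
    # so identical words arrive consecutively; sum each run.
    current, count = None, 0
    for line in lines:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

# In the real mapper.py the driver would be:
#     import sys
#     for out in mapper(sys.stdin): print(out)
# and reducer.py would do the same with reducer(sys.stdin).
```

Both scripts must be executable (or invoked as `python mapper.py`), since -mapper and -reducer run them as shell commands on the worker nodes.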
Applications in the Hadoop Ecosystem
– Hbase (NoSQL database)
– Hive (Data warehouse with SQL-like language)
– Pig (High-level dataflow language that compiles to mapreduce)
– Mahout (Machine learning via mapreduce)
– Spark (Caching computation framework)
Spark
– Alternate programming framework using HDFS
– Optimized for in-memory computation
– Well supported in Java, Python, and Scala
Spark
– Resilient Distributed Dataset ("RDD" for short): a persistence-enabled data collection
– Transformations
– Actions
– Flexible implementation: memory vs. hybrid vs. disk
Spark Example
lettercount.py
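The slide names lettercount.py without showing it. Below is a local, plain-Python sketch of the pipeline such a script presumably implements (an assumption; the class's actual file may differ). The comments note the Spark RDD operation each step corresponds to in a real pyspark version:

```python
# Local sketch of a Spark letter-count; RDD equivalents noted inline.
from collections import Counter

lines = ["To be or not to be"]        # sc.textFile("hdfs://.../input.txt")

letters = [ch.lower()                 # .flatMap(lambda line: list(line.lower()))
           for line in lines
           for ch in line
           if ch.isalpha()]           # .filter(str.isalpha)

counts = Counter(letters)             # .map(lambda c: (c, 1)).reduceByKey(add)

print(counts.most_common(3))
```

In the pyspark version each step is a transformation (lazy) and the final count is an action that triggers execution; the intermediate RDDs can be persisted in memory per the previous slide.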
Spark Machine Learning Library
– Clustering (K-Means)
– Many others, list at
K-Means Clustering
– Randomly seed the cluster starting points
– Assign each point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it
– If the centroids changed, do it again
– If the centroids stay the same, they've converged and we're done
Awesome visualization:
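The loop described above can be sketched in plain Python. This is a toy one-dimensional illustration of the algorithm, not MLlib's implementation, and the data points are invented:

```python
import random

def kmeans(points, k, seed=0):
    # Randomly seed the cluster starting points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    while True:
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points.
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        # If the centroids stay the same, they've converged and we're done.
        if new == centroids:
            return sorted(new)
        centroids = new

print(kmeans([1.0, 1.1, 0.9, 9.0, 9.1, 8.9], k=2))
```

Once the point-to-cluster assignments stop changing, the recomputed means are identical to the previous iteration's, which is the convergence test on the slide.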
K-Means Examples
spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://hearth-00.psc.edu:/datasets/kmeans_data.txt 3

spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://hearth-00.psc.edu:/datasets/archiver.txt 2
Questions? Thanks!
References and Useful Links
– HDFS shell commands:
– Writing and running your first program:
– Hadoop Streaming:
– Hadoop Stable API:
– Hadoop Official Releases:
– Spark Documentation: