Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.

Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center

Hadoop Overview Framework for Big Data Map/Reduce Platform for Big Data Applications

Map/Reduce Apply a Function to all the Data Harvest, Sort, and Process the Output

4 © 2010 Pittsburgh Supercomputing Center © 2014 Pittsburgh Supercomputing Center Big Data … Split n Split 1 Split 3 Split 4 Split 2 … Output n Output 1 Output 3 Output 4 Output 2 Reduce F(x) Map F(x) Result Map/Reduce

HDFS Distributed FS Layer WORM fs –Optimized for Streaming Throughput Exports Replication Process data in place

HDFS Invocations: Getting Data In and Out hadoop dfs -ls hadoop dfs -put hadoop dfs -get hadoop dfs -rm hadoop dfs -mkdir hadoop dfs -rmdir

Writing Hadoop Programs Wordcount Example: Wordcount.java –Map Class –Reduce Class

Compiling javac -cp $HADOOP_HOME/hadoop-core*.jar \ -d WordCount/ WordCount.java

Packaging jar -cvf WordCount.jar -C WordCount/.

Submitting your Job hadoop \ jar WordCount.jar \ org.myorg.WordCount \ /datasets/compleat.txt \ $MYOUTPUT \ -D mapred.reduce.tasks=2

Configuring your Job Submission Mappers and Reducers Java options Other parameters

Monitoring Important Ports: –Hearth-00.psc.edu:50030 – Jobtracker (MapReduce Jobs) –Hearth-00.psc.edu:50070 – HDFS (Namenode) –Hearth-03.psc.edu:50060 – Tasktracker (Worker Node) –Hearth-03.psc.edu:50075 – Datanode

Hadoop Streaming Write Map/Reduce Jobs in any language Excellent for Fast Prototyping

Hadoop Streaming: Bash Example Bash wc and cat hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \ - input /datasets/plays/ \ -output mynewoutputdir \ -mapper '/bin/cat' \ -reducer '/usr/bin/wc -l '

Hadoop Streaming Python Example Wordcount in python hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \ -file mapper.py \ -mapper mapper.py \ -file reducer.py \ -reducer reducer.py \ -input /datasets/plays/ \ -output pyout

Applications in the Hadoop Ecosystem Hbase (NoSQL database) Hive (Data warehouse with SQL-like language) Pig (SQL-style mapreduce) Mahout (Machine learning via mapreduce) Spark (Caching computation framework)

Spark Alternate programming framework using HDFS Optimized for in-memory computation Well supported in Java, Python, Scala

Spark Resilient Distributed Dataset (RDD) RDD for short Persistence-enabled data collections Transformations Actions Flexible implementation: memory vs. hybrid vs. disk

Spark example lettercount.py

Spark Machine Learning Library Clustering (K-Means) Many others, list at http://spark.apache.org/docs/1.0.1/mllib-guide.html

K-Means Clustering Randomly seed cluster starting points Test each point with respect to the others in its cluster to find a new mean If the centroids change do it again If the centroids stay the same they've converged and we're done. Awesome visualization: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

K-Means Examples spark-submit \ $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \ hdfs://hearth-00.psc.edu:/datasets/kmeans_data.txt 3 spark-submit \ $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \ hdfs://hearth-00.psc.edu:/datasets/archiver.txt 2

Questions? Thanks!

References and Useful Links HDFS shell commands: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html Writing and running your first program: http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197 http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197 https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program Hadoop Streaming: http://hadoop.apache.org/docs/stable1/streaming.html https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ http://hadoop.apache.org/docs/stable1/streaming.html https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ Hadoop Stable API: http://hadoop.apache.org/docs/r1.2.1/api/ http://hadoop.apache.org/docs/r1.2.1/api/ Hadoop Official Releases: https://hadoop.apache.org/releases.html https://hadoop.apache.org/releases.html Spark Documentation http://spark.apache.org/docs/latest/

Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.

Similar presentations

Presentation on theme: "Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.

Similar presentations

Presentation on theme: "Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center."— Presentation transcript:

Similar presentations

About project

Feedback