Overview of Hadoop for Data Mining
Federal Big Data Group
Confidential
Mark Silverman
Treeminer, Inc.
155 Gibbs Street, Suite 514
Rockville, Maryland
(240)

TREEMINER, INC. CONFIDENTIAL

Agenda
- Introduction to Hadoop
- Developing and testing a Map/Reduce application
- Auto-Clustering in Hadoop and Interworking with Apache Storm

Introduction to Hadoop
Hadoop consists of:
- A clustered, distributed, highly available file system (HDFS)
- An execution framework (Map/Reduce)

Hadoop File System
- “Rack” aware
- Local storage
- Distributed copies (generally 3)
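
To make "rack aware" and "distributed copies (generally 3)" concrete, here is a minimal Python sketch of one commonly described placement scheme: the first copy on the writing node, the second on a node in a different rack, and the third on another node in that same remote rack. The slide does not spell out the policy, so this ordering is an assumption, and the function name `place_replicas`, the topology dictionary, and the node and rack names are all hypothetical. It simulates the idea only and does not call any Hadoop API.

```python
import random


def place_replicas(topology, writer_node, rng=None):
    """Choose three nodes to hold the copies of one block.

    topology: dict mapping a rack name to its list of node names
              (a hypothetical layout, not read from a real cluster).
    writer_node: the node that is writing the block.
    """
    rng = rng or random.Random()

    # Build a reverse lookup from node name to rack name.
    node_rack = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = node_rack[writer_node]

    # Copy 1: the node doing the write (local storage).
    first = writer_node

    # Copy 2: a node on a different rack, so losing one rack cannot lose the block.
    # Assumption: only remote racks with at least two nodes are considered,
    # so that a third copy can also be placed there.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])

    # Copy 3: a different node on the same remote rack as copy 2.
    third = rng.choice([n for n in topology[remote_rack] if n != second])

    return [first, second, third]


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas(cluster, "n1"))
```

Keeping two of the three copies on one remote rack trades a little fault tolerance for less cross-rack traffic during the write; a real cluster may be configured differently.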

Sample Hadoop File System

Hadoop “Eco-System”
- Hive: allows SQL-like querying of data in HDFS
- Pig: basic scripting language for Hadoop
- Databases: HBase, Accumulo, Cassandra, Neo4j

Map / Reduce Parallel Execution Framework

WordCount Example
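
The WordCount flow can be sketched in plain Python to show what each Map/Reduce phase contributes. This is an illustrative single-process simulation, not the Hadoop Java code from the sandbox: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and splitting on whitespace with lowercasing is an assumed tokenization.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the list of counts collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    # Keys are visited in sorted order, mirroring the sorted input a reducer sees.
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    for word, count in word_count(sample).items():
        print(word, count)
```

On a real cluster the mappers run in parallel on separate blocks of the input and the shuffle moves data between machines; here all three phases run in one process so the data flow is easy to follow.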

Getting Started
Cloudera and Hortonworks offer sandboxes that are easy to download; each is a fully self-contained implementation in a VM. Hadoop can also be downloaded directly from Apache.
- sandbox/
- nloads/quickstart_vms/cdh-5-3-x.html

Developing in Map / Reduce
- Standalone mode: Hadoop runs as a single process; best for debugging
- Pseudo-distributed: separate processes on the same server
- Fully distributed: a full-blown cluster

Eclipse Framework
Write code in Eclipse on a PC or on Linux. Options:
- Run Hadoop on Windows
- Run Eclipse in Linux with the plugin
- Run Eclipse in Windows, with remote debugging and profiling
Profiling: YourKit

WordCount
1. Create a project in Eclipse
2. Load the WordCount code (widely available and included in the sandbox downloads)
3. Compile a jar file
4. Execute on Hadoop in standalone mode

    $ hadoop jar path/to/file.jar input output

Monitoring Hadoop Jobs

Resources
- hadoop.apache.org
- orks/tutorial.pdf
- Hadoop: The Definitive Guide, by Tom White

Example: Document AutoClustering using Hadoop and Storm