Sky Agile Horizons Hadoop at Sky
1.01 Overview
What is Hadoop?
- Reliable, Scalable, Distributed
Where did it come from?
- Community + Yahoo!
Where is it now?
- Apache Software Foundation
Why is it called “Hadoop”?
1.02 Who is using it?
To name just a few…
1.03 Is it “production” ready?
This screengrab is from one of the Hadoop clusters at Facebook (May 2010)
1.04 So, what does it give you?
1.05 Just two things...
Distributed Filesystem (HDFS)
- Name Node
- Data Node(s)
Distributed Processing Infrastructure
- Job Tracker
- Task Tracker(s)
1.06 HDFS - Overview
Blocks
- 64MB chunks (configurable)
WORM (Write once, read many)
- NO EDITS
- NO APPENDS
Replication
- 3 copies
- direct
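
The block and replication figures above translate into simple storage arithmetic. The following is a minimal sketch, assuming the 64MB block size and 3 copies from the slide; the function names and the 1000MB file are hypothetical and chosen only for illustration.

```python
import math

BLOCK_SIZE_MB = 64  # block size from the slide (configurable per cluster)
REPLICATION = 3     # number of copies from the slide


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw cluster storage used once every block has been replicated."""
    return file_size_mb * replication


# Hypothetical 1000MB file: 15 full blocks plus one 40MB block.
blocks = block_count(1000)
raw = raw_storage_mb(1000)
print(blocks, raw)
```

The final block only takes up as much space as the data it holds, so raw usage is the file size times the replication factor rather than the block count times 64MB.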
1.07 HDFS - Read
1.08 HDFS - Write
1.09 Distributed Processing
Slots
- X mapper slots, Y reducer slots (per node)
Jobs
- Queued
- Prioritised
Tasks
- Data-aware
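
To make "data-aware" concrete: the idea is to run a task on a node that already stores the block it needs, as long as that node has a free slot. The sketch below illustrates only that idea and is not the Job Tracker's actual scheduling logic, which is more involved. The function name, node names and slot counts are hypothetical.

```python
def assign_task(block_replicas, free_slots):
    """Pick a node for a map task that reads one block.

    block_replicas: list of node names holding a copy of the block.
    free_slots: dict of node name -> number of free mapper slots.
    Returns (node, is_local), or (None, False) when no slot is free.
    """
    # Prefer a node that already stores the block, so no data has to move.
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    # Otherwise fall back to any node with a free slot (remote read).
    for node, slots in free_slots.items():
        if slots > 0:
            free_slots[node] -= 1
            return node, False
    return None, False


# Hypothetical cluster state: node-a is busy, node-b and node-c have room.
slots = {"node-a": 0, "node-b": 1, "node-c": 2}
first = assign_task(["node-a", "node-b"], slots)   # local placement on node-b
second = assign_task(["node-a", "node-b"], slots)  # node-b now full, goes remote
print(first, second)
```

The second call shows the trade-off: once the nodes holding the data are full, the task still runs, but it has to read its block over the network.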
1.10 Distributed Processing
1.11 Implementation
Two modes of operation
1.12 Building upon the basics
1.13 Sub-projects
Map/Reduce – divide & conquer
Pig – SQL-like “Pig Latin”
HBase – column-based database
Hive – data-warehousing (SQL-like queries)
Mahout – distributed algorithms
1.14 Map/Reduce
Java-based
- Key,Value input, Key,Value output(s)
Intended for low-level / bespoke work
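
The key/value flow can be shown without a cluster. Hadoop's Map/Reduce API is Java; the sketch below uses plain Python purely to illustrate the shape of the model (key/value in, key/value out, with a grouping step in between) and does not use any Hadoop API. The function names and the sample lines are hypothetical.

```python
from collections import defaultdict


def map_phase(key, value):
    """Mapper: (line offset, line text) in, (word, 1) pairs out."""
    for word in value.lower().split():
        yield word, 1


def reduce_phase(key, values):
    """Reducer: (word, list of counts) in, (word, total) out."""
    yield key, sum(values)


def run_job(records):
    """Run map, group by key, then reduce, all in a single process."""
    # Shuffle: collect every mapper output value under its key.
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_phase(key, value):
            grouped[out_key].append(out_value)
    # Reduce: one call per distinct key, in sorted key order.
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reduce_phase(key, grouped[key]):
            results[out_key] = out_value
    return results


# Hypothetical input: (byte offset, line) pairs, as a text input would supply.
lines = [(0, "the quick brown fox"), (20, "the lazy dog")]
counts = run_job(lines)
print(counts)
```

On a real cluster the mappers and reducers run as separate tasks on different nodes and the grouping step moves data between them; here everything happens in memory to keep the example short.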
1.15 Hive
SQL-like syntax, Map/Reduce under the hood
Client-only software
1.16 Live Demo
1.17 Lastly, a word of warning...
It’s not a magic bullet…
If the tools you need don’t exist…
Approach is everything…
Hadoop is *just* the framework
1.18 Thank you! Questions?
- Soft-copy of this presentation
- VM image available to download
- Example code is on GitHub