Introduction to Hadoop
Jessica Krosschell
Copyright © 2014 Tiber Solutions, LLC
Agenda
Introduction and History
Framework
Real Life Use Case / Enterprise Architecture
Projects / Demo
Lab
Introduction and History
Open-source, Java-based framework of tools for the storage and large-scale processing of data sets on clusters of commodity hardware.
Characteristics: batch processing; massive full data scans; structured and unstructured data sources; doesn't necessarily replace an RDBMS.
Big Data challenge points: Velocity, Volume, and Variety.
History:
2005: Doug Cutting and Mike Cafarella of Yahoo create Hadoop based on the Google File System white papers. The team names the project "Hadoop" after Cutting's son's toy elephant.
2006: Apache Hadoop project established.
2008: Cloudera founded; Doug Cutting is chief architect.
2011: Hortonworks founded by several key Yahoo Hadoop engineers.
Introduction and History, cont.
Users of Hadoop: Yahoo, Facebook.
Twitter uses Scribe to write logs to Hadoop and Pig to analyze the data sets.
Facebook example: engineering/hadoop/
Examples of applications:
Mining users' behaviors to generate recommendations
Searching for uncommon patterns
Hadoop distributions: companies that build on top of Apache Hadoop, provide enterprise Hadoop, and help simplify the process. The top three:
Cloudera: most established by far; largest user base; proprietary Cloudera Management Suite
Hortonworks: uses 100% open-source Hadoop (no additional proprietary software)
MapR: proprietary file system (MapRFS) rather than HDFS
Framework
Approach: break data and computation down into smaller pieces spread across servers. Data and analysis are co-located, meaning the analysis is moved to the data.
Fault tolerant and highly scalable: scalability is linear and scales out; to double processing power, double the number of computers.
The Apache Hadoop project contains the following modules:
Hadoop Common: libraries and utilities needed by the other modules
Hadoop Distributed File System (HDFS): distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster; self-healing and redundant
Hadoop MapReduce: framework for processing large data jobs across many nodes and combining the results
Hadoop YARN (Yet Another Resource Negotiator): added as part of Hadoop 2.0; application management framework that allows Hadoop to go beyond MapReduce apps
Framework, cont.
Data Services: projects that store, process, and access data in many ways.
Pig: scripting language for Hadoop (Pig Latin) used to analyze large data sets. The infrastructure layer consists of a compiler that produces MapReduce jobs and executes them on the Hadoop cluster. Appeals to developers more familiar with scripting languages and SQL than with Java.
Hive: SQL interface for Hadoop (HiveQL, or HQL). Hortonworks provides a Hive ODBC driver that allows BI tools to connect to Hadoop.
HCatalog: metadata and table management. Enables users to access data as a set of tables without having to worry about how or where the data is stored. Enables data sharing among other tools such as Pig, MapReduce, and Hive.
HBase: non-relational (NoSQL) database for interactive apps. Commonly used for predictions and recommendations (intelligent applications).
Flume: collects log files and events. The primary use case is moving web log files into Hadoop.
Sqoop: moves structured data into or out of a SQL database ("SQL to Hadoop").
Framework, cont.
Operational Services: projects for operations and management.
Ambari: management and monitoring. Makes clusters easy to operate and simplifies provisioning.
Oozie: workflow and scheduling. Coordinates jobs written in multiple languages and allows specifying the order of, and dependencies between, jobs.
There are many additional projects (either Apache or company-specific); the HDP stack is one example.
Enterprise Architecture
Hadoop Core Architecture - HDFS
HDFS: Java-based distributed file system that runs on large clusters of commodity machines, providing rapid data transfer rates and uninterrupted operation in the case of node failures.
NameNode: tracks the locations of data and governs space; stores metadata about each file.
DataNode: pings the NameNode and receives instructions back; manages reads, writes, and replication.
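One way to picture the NameNode/DataNode split is a toy simulation in plain Python. This is a sketch only, not the real HDFS API: the class and method names are hypothetical, and the block size is shrunk to a few characters (real HDFS defaults to 128 MB blocks). The point it illustrates is that the NameNode holds only metadata (which blocks make up a file and where their replicas live), while DataNodes hold the actual bytes.

```python
import random

BLOCK_SIZE = 4    # toy block size in characters; real HDFS defaults to 128 MB
REPLICATION = 3   # default HDFS replication factor

class DataNode:
    """Stores the actual block contents and serves reads/writes."""
    def __init__(self, name):
        self.name, self.blocks = name, {}
    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """Stores metadata only: file -> block ids, block id -> replica locations."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}       # filename -> [block ids]
        self.locations = {}   # block id -> [DataNode replicas]

    def write(self, filename, data):
        chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        self.files[filename] = []
        for n, chunk in enumerate(chunks):
            block_id = f"{filename}#{n}"
            replicas = random.sample(self.datanodes, REPLICATION)
            for dn in replicas:            # replicas let reads survive node failures
                dn.store(block_id, chunk)
            self.files[filename].append(block_id)
            self.locations[block_id] = replicas

    def read(self, filename):
        # any live replica would do; here we simply take the first
        return "".join(self.locations[b][0].blocks[b] for b in self.files[filename])

datanodes = [DataNode(f"dn{i}") for i in range(5)]
nn = NameNode(datanodes)
nn.write("demo.txt", "hello hadoop cluster")
print(nn.read("demo.txt"))  # -> hello hadoop cluster, reassembled from blocks
```

Note that `read` never asks the NameNode for file contents, only for block locations; that separation is what keeps the real NameNode small enough to hold its namespace in memory.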
Hadoop Core Architecture - MapReduce
MapReduce: computational paradigm in which an application is divided into small fragments of work.
Map: function that parcels out work to the different nodes of the distributed cluster. The Map function generates key-value pairs to set up the Reduce function. As a SQL analogy, it is the SELECT clause.
Reduce: function that collates, merges, and/or aggregates the key-value pairs into a single result set. As a SQL analogy, it is the GROUP BY clause.
JobTracker: splits a job into tasks and distributes them based on availability and data location.
TaskTracker: executes tasks in a JVM and sends status back to the JobTracker.
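The Map and Reduce phases above can be sketched in plain Python as a single-machine simulation (not the Hadoop Java API), using the deck's own word-count example. The function names are illustrative; the shuffle step that Hadoop performs between the phases is approximated here by sorting and grouping the key-value pairs.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit one (word, 1) key-value pair per word (the SELECT analogy)."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: group pairs by key and sum the counts (the GROUP BY analogy)."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["Hello Hadoop and goodbye", "Hello and Goodbye"]
print(dict(reduce_phase(map_phase(lines))))
# counts: hello 2, hadoop 1, and 2, goodbye 2
```

In real Hadoop the mapper instances run on the nodes holding each input split, and the framework shuffles pairs so that all values for one key reach the same reducer; the logic per pair, however, is exactly this simple.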
Hadoop Core Architecture - YARN
YARN: cluster resource management, introduced in Hadoop 2.0.
Moves resource management out of MapReduce into YARN. This allows MapReduce to focus on data processing and allows other engines to use YARN and HDFS.
Reuses the framework from MapReduce to help ensure compatibility with existing applications.
Splits the JobTracker's responsibilities into separate daemons: a global ResourceManager and a per-application ApplicationMaster (job scheduling/monitoring).
Demo Outline
Scenario: count the words in two text files, load the results into a table, and report from it.
Execute wordcount: command-line Hadoop; Pig script
Manage files: File Browser (Hue application); HDFS Explorer
Create and load tables: HCatalog; Hive
Report from tables: ODBC - BusinessObjects
Command Line WordCount
Use PuTTY and log into the sandbox as root. Change directories to tiber_demo. Cat both word files.

Create the directory in /user/hue for the input:
  hadoop fs -mkdir /user/hue/wc_input
Upload the two text files to the new directory:
  hadoop fs -put word1.txt word2.txt /user/hue/wc_input
Display the contents of the directory:
  hadoop fs -ls /user/hue/wc_input
Run the wordcount program:
  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/hue/wc_input /user/hue/wc_output
Display the contents of the output directory:
  hadoop fs -ls /user/hue/wc_output
Display the contents of the result file:
  hadoop fs -cat /user/hue/wc_output/part-r-00000

word1.txt: This is a word file with Hello Hadoop and goodbye.
word2.txt: I'm using Hortonworks Sandbox to run my Hadoop demos. Hello and Goodbye.
Hue Application
Browser-based environment that enables interaction with the Hadoop cluster.
Login - User: hue, Password: caMswQlzSF6QPQHYXlws
File Browser and HDFS Explorer
We saw the command-line HDFS interface in the previous example.
Demo: view the wordcount result files using the File Browser and HDFS Explorer.
Apache Pig
Scripting language (Pig Latin) that defines a set of transformations on a data set, such as aggregate, join, and sort. Ideal for ETL, research on raw data, and iterative data processing.
More Info:
Demo: execute the pig_wordcount script and use the File Browser to view the results. If there is time, execute with additional DUMP statements to see how the data is treated.

Script:
text1 = LOAD '/user/hue/wc_input/word1.txt' USING PigStorage();
text2 = LOAD '/user/hue/wc_input/word2.txt' USING PigStorage();
mergetext = UNION text1, text2;
separatewords = FOREACH mergetext GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word;
groupwords = GROUP separatewords BY word;
countwords = FOREACH groupwords GENERATE COUNT(separatewords), group;
STORE countwords INTO '/user/hue/wc_output/pig_wordcount';
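The Pig script's pipeline (LOAD, UNION, TOKENIZE/FLATTEN, GROUP BY, COUNT) maps naturally onto plain Python. A rough local equivalent is sketched below, substituting the slide's two sample sentences for the HDFS files; the variable names mirror the Pig relations, and split() stands in for TOKENIZE (a simplification, since TOKENIZE also splits on some punctuation).

```python
from collections import Counter

# LOAD: in place of reading word1.txt and word2.txt from HDFS, use the sample lines
text1 = ["This is a word file with Hello Hadoop and goodbye."]
text2 = ["I'm using Hortonworks Sandbox to run my Hadoop demos. Hello and Goodbye."]

# UNION text1, text2: merge the two relations
mergetext = text1 + text2

# FOREACH ... GENERATE FLATTEN(TOKENIZE($0)): one row per whitespace-separated word
separatewords = [word for line in mergetext for word in line.split()]

# GROUP separatewords BY word, then COUNT each group
countwords = Counter(separatewords)

print(countwords["Hadoop"])  # -> 2, once in each file
```

Pig compiles its version of this pipeline into MapReduce jobs; the GROUP BY step becomes the shuffle between mappers and reducers.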
Apache HCatalog
Table abstraction (metadata) layer; frees the user from knowing where data is stored in HDFS.
More Info:
Demo: use HCatalog to load the first wordcount results file into the wordcount_out table; browse the data using HCatalog; show the "database" files using HDFS Explorer.
Apache Hive
Run HiveQL, a SQL-like language, to interact with Hadoop.
More Info:
Hive Cheat Sheet:
Hive Language Manual:
Demo: create a table and load the wordcount results from the Pig script into it, then retrieve the data.
BusinessObjects Query
Hortonworks Hive ODBC Driver:
Demo: show a universe and report that include the word count tables.
Labs
Use the sandbox environment and the hue username/password.
Use separate databases, as well as your initials, when saving tables, scripts, etc.
Run through the following Hortonworks tutorials:
Intro to HCatalog and Pig: an-introduction-to-hadoop-hcatalog-hive-and-pig/
HCatalog, Pig, and Hive Commands: hcatalog-basic-pig-hive-commands/
Load Data into Sandbox (can include Excel, if interested in ODBC): tutorial/loading-data-into-the-hortonworks-sandbox/
Labs, cont.
More Hortonworks tutorials:
Pig Grunt Shell: shell/
  Note: you can either download PuTTY or use the Hue shell in the Hue application.
  Instead of the /user/hadoop directory, use /user/hue.
HDFS Shell:
ODBC Driver: hortonworks-odbc-driver-on-windows-7/
HDFS Explorer: hortonworks-sandbox/
More Hive:
More Pig:
Whatever else you find interesting