Introduction to Apache Hadoop
Last updated July 29, 2016, based on Spark 2.0. IBM Analytics © 2016 IBM Corporation
Agenda
- What is Hadoop?
- Hadoop background
- Hadoop details
- Hadoop Cloudera installation
- Questions
Apache Hadoop
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. This presentation uses the Cloudera Apache Hadoop distribution as a VMware image. The main modules/utilities included alongside the Hadoop Common framework are:
- Hadoop Distributed File System (HDFS™): a distributed file system that provides access to application data.
- Hadoop YARN: a framework for scheduling jobs and managing cluster resources.
- Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
- HBase: a scalable, distributed database for storing large volumes of structured data.
HDFS™
The Hadoop Distributed File System (HDFS™) is a distributed file system that provides access to application data. It uses a master/slave architecture:
- Namenode: the master server; manages the file system namespace and regulates client access to files.
- Datanodes: slaves managed by the Namenode, usually one per node in the cluster; each manages the storage attached to its node.
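The master/slave split can be illustrated with a small pure-Python simulation (this is not Hadoop code; the block size, replication factor, and datanode names are made up for illustration): a namenode records which datanodes hold each fixed-size block of a file.

```python
# Conceptual sketch of HDFS block placement, not the Hadoop API.
# A "namenode" maps each block of a file to the "datanodes" holding replicas.

BLOCK_SIZE = 4          # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 2         # copies of each block (real HDFS defaults to 3)
DATANODES = ["dn1", "dn2", "dn3"]

def place_blocks(data: bytes) -> dict:
    """Split data into fixed-size blocks and assign replicas round-robin."""
    namespace = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        replicas = [DATANODES[(idx + r) % len(DATANODES)]
                    for r in range(REPLICATION)]
        namespace[idx] = {"data": block, "datanodes": replicas}
    return namespace

ns = place_blocks(b"hello hdfs!")   # 11 bytes -> 3 blocks of <= 4 bytes
```

The namenode keeps only this block-to-datanode mapping (the metadata); the block contents themselves live on the datanodes, which is why losing the namenode is more serious than losing any single datanode.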
Hadoop YARN
YARN is a framework for scheduling jobs and managing cluster resources:
- Resource Manager: arbitrates all available cluster resources.
- Node Manager: one per node; takes direction from the Resource Manager and manages the resources available on its single node.
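A minimal pure-Python sketch of this arbitration (not the YARN API; node names and memory sizes are hypothetical): the resource manager grants container requests against the per-node capacity that node managers report.

```python
# Conceptual sketch of YARN resource arbitration, not the YARN API.

class ResourceManager:
    def __init__(self, node_capacity_mb: dict):
        # node -> free memory in MB, as each node manager would report
        self.free = dict(node_capacity_mb)

    def allocate(self, mem_mb: int):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= mem_mb:
                self.free[node] -= mem_mb
                return node
        return None  # request must wait until resources free up

rm = ResourceManager({"node1": 2048, "node2": 1024})
```

Real YARN schedulers (capacity, fair) are far more sophisticated, but the core contract is the same: applications ask the Resource Manager for containers, and grants are bounded by what the Node Managers have available.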
MapReduce
Apache Hadoop MapReduce is the open-source implementation of the MapReduce model. Its main parts are:
- The end-user MapReduce API for developing MapReduce applications.
- The MapReduce framework: the runtime implementation of the phases, namely the map phase, the sort/shuffle/merge aggregation, and the reduce phase.
- The MapReduce system: the backend infrastructure required to run MapReduce applications and manage their data.
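The three phases named above can be walked through in plain Python with the classic word-count example (illustrative only; real Hadoop jobs use the Java MapReduce API and run the phases in parallel across the cluster):

```python
# Pure-Python walkthrough of the MapReduce phases, not Hadoop code.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word seen
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle/merge: group all values by key (the framework also sorts keys)
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the grouped values per key
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big cluster"])))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

In Hadoop the map and reduce functions are the user's code, while the shuffle/sort between them is supplied entirely by the framework.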
HBase Architecture
HBase is a distributed database based on Google Bigtable, providing storage of and access to large volumes of data.
HBase Components
Features of HBase include in-memory operation and Bloom filters on a per-column basis, as outlined in the original Bigtable paper. HBase tables can serve as sources for MapReduce jobs. The main components of HBase are:
- ZooKeeper: a distributed coordination service that coordinates data service requests from clients.
- Master: the HBase Master is a lightweight process that manages workloads in the Hadoop cluster, including load balancing.
- Region Server: runs on every node and handles CRUD operations for the database.
- HBase data model: tables and column families; MemStores are used to store semi-structured data and capture MapReduce application data requests.
HBase Data Storage Hierarchy
The high-level data storage hierarchy of the HBase database and its data model:
- Regions / Region Servers
  - Tables: data is identified by row key and timestamp
    - Column families (e.g. Anchor, Contents): data is stored per column family
      - MemStore
        - HDFS files
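The addressing scheme in this hierarchy can be modeled with plain Python dictionaries (this is not the HBase client API; the row key, family names, and values are hypothetical): a cell is reached by row key, then column family:qualifier, then timestamp.

```python
# Illustrative model of HBase cell addressing, not the HBase client API.

table = {}

def put(row, family, qualifier, timestamp, value):
    """Store one versioned cell under row -> family:qualifier -> timestamp."""
    col = f"{family}:{qualifier}"
    table.setdefault(row, {}).setdefault(col, {})[timestamp] = value

def get_latest(row, family, qualifier):
    """Return the newest version of a cell, as HBase does by default."""
    versions = table[row][f"{family}:{qualifier}"]
    return versions[max(versions)]

put("row1", "anchor", "link", 100, "old.example.com")
put("row1", "anchor", "link", 200, "new.example.com")
```

Older versions remain addressable by their timestamp, which is why the hierarchy above lists both row key and timestamp as part of a cell's identity.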
Utilities for Data Analysis
Many utilities and libraries are available for analyzing data in HBase. The following are frequently used by the Hadoop community:
- Hive: a command-line utility for accessing data, including summarizing it and performing analytical functions.
- Hue: a web-based analytical interface for HBase.
- Mahout: a scalable machine-learning library used for analytics; it can be called from Apache Spark and other Java/Scala applications.
Hadoop Installation
Installing Hadoop with all its components is straightforward with the Cloudera QuickStart VM if VMware Player is already installed. From Cloudera's download page, select VMware as the platform to download the zip file. Open the zip file from VMware Player and the complete package is installed automatically. Once installed, increase the virtual machine's memory to 8 GB or more for Hadoop to perform well: go to Player > Manage > Virtual Machine Settings and change the memory allocated to the virtual machine.
Hadoop Installation
After installation and configuration, you should see the "Cloudera Live" page in Firefox by default. To manage and monitor Hadoop, use the Cloudera Manager page; port 7180 is assigned by default. For any issues accessing HDFS or HBase, stop Hue, Hive, HBase, and HDFS in that order, then start them in the reverse order.
Hadoop Installation
Use Hue for data analysis; the Hue editor is reachable in the browser on port 8888 by default. Use hive from a terminal window for command-line HiveQL, with regular SQL-like queries. Hive commands can also be called from Unix shell scripts, or from Perl, Python, or R.
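Calling Hive from Python, as suggested above, can be sketched with a subprocess wrapper around the hive CLI (a hedged sketch: it assumes `hive` is on the PATH of a configured cluster node, and the helper names are made up for illustration):

```python
# Sketch of driving the hive CLI from Python; assumes `hive` is on PATH.
import subprocess

def hive_command(query: str) -> list:
    """Build the argument list for a one-off HiveQL statement (`hive -e`)."""
    return ["hive", "-e", query]

def run_hive(query: str) -> str:
    """Run a HiveQL statement and return its stdout (needs a live cluster)."""
    result = subprocess.run(hive_command(query), capture_output=True, text=True)
    return result.stdout
```

The same pattern (shelling out to `hive -e "..."`) is what the slide's mention of Unix shell, Perl, and R wrappers amounts to.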
Questions?