Hadoop Data Management by Team 5, ISQS: Vivek, Sonali Digwal, Rohit Ramteke, Mrugank Dhone, Shashank Mishra.

Hadoop Data Management by Team 5, ISQS6339-2014: Vivek Shimbulal (@svivekbafna), Sonali Digwal, Rohit Ramteke, Mrugank Dhone, Shashank Mishra

Videos
- The Evolution of the Apache Hadoop Ecosystem | Cloudera. 8’11”. Published on Sep 6, 2013. Hadoop co-founder Doug Cutting explains how the Hadoop ecosystem has expanded and evolved into a much larger Big Data platform with Hadoop at its center. http://www.youtube.com/watch?v=eo1PwSfCXTI
- A Hadoop Ecosystem Overview. 21’54”. Published on Jan 10, 2014. A technical overview of the Hadoop ecosystem, focusing on the HDFS, MapReduce, YARN, Hive, Pig and HBase software components. http://www.youtube.com/watch?v=kRnh3WpcKXo
- Working in the Hadoop Ecosystem. 10’40”. Published on Sep 5, 2013. Mark Grover, a Software Engineer at Cloudera, talks about working in the Hadoop ecosystem. http://www.youtube.com/watch?v=nbUsY9tj-pM

HDFS https://www.youtube.com/watch?v=1_ly9dZnmWc, 8’27”

HDFS OVERVIEW
- Based on Google’s GFS (Google File System)
- Provides redundant storage of massive amounts of data
- Data is distributed across the nodes at load time

HDFS DESIGN
- Runs on commodity hardware
- Assumes high failure rates of individual components
- Works well with lots of large files
- Built around the idea of “write once, read many times”

HDFS ARCHITECTURE
- Operates on top of an existing file system
- Files are stored as “blocks”
- Default block size is 64 MB (the Hadoop 1.x default; Hadoop 2.x raised it to 128 MB)
- Provides reliability through replication
- NameNode stores metadata and manages access
- No data caching, because the datasets are large
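The block arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the replication factor of 3 is an assumption (the slides do not state one), used here to show how replication multiplies raw storage.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would occupy."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left over, it is not padded.
        sizes.append(remainder)
    return sizes


def raw_storage_mb(file_size_mb, replication=3):
    """Raw cluster storage consumed once every block is replicated.

    replication=3 is an assumed value for illustration.
    """
    return file_size_mb * replication


blocks = split_into_blocks(200)
print(blocks)               # [64, 64, 64, 8]
print(raw_storage_mb(200))  # 600
```

A 200 MB file therefore occupies four blocks, and under the assumed replication factor it consumes 600 MB of raw disk across the cluster.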

HDFS ARCHITECTURE DIAGRAM

HDFS FILE STORAGE
NameNode
- Stores all metadata
- Keeps metadata in RAM for fast lookup
- File-system metadata size is limited to the amount of available RAM on the NameNode
DataNode
- Stores file content as blocks
- Different blocks of the same file are stored on different DataNodes
- The same block is replicated across several DataNodes for redundancy
- Periodically sends a report of all existing blocks to the NameNode
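The division of labour between NameNode and DataNodes can be illustrated with a toy in-memory model. This is a sketch, not HDFS code: the class `ToyNameNode`, its method names, and the block and node identifiers are all hypothetical.

```python
from collections import defaultdict


class ToyNameNode:
    """Holds only metadata in memory: which blocks make up a file, and where they live."""

    def __init__(self):
        self.file_blocks = {}                    # filename -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> names of DataNodes holding it

    def add_file(self, filename, block_ids):
        self.file_blocks[filename] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # A report replaces whatever was previously known about this DataNode.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, filename):
        """Return (block id, sorted DataNode names) for each block of the file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[filename]
        ]


namenode = ToyNameNode()
namenode.add_file("/logs/a.log", ["blk_1", "blk_2"])
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_1"])
namenode.receive_block_report("dn3", ["blk_2"])
print(namenode.locate("/logs/a.log"))
```

The NameNode never touches file content here; it only answers "which DataNodes hold block X", which is why its memory bounds the size of the namespace.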

FAILURE AND REPLACEMENT
- DataNode failure and recovery
- NameNode failure and options to avoid it
- Secondary NameNode
- Block placement strategies
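DataNode failure and recovery can be sketched as follows. This is a deliberately simplified model: the function names are hypothetical, the replication factor of 3 is an assumption, and new replicas are placed on the alphabetically first eligible node, whereas real HDFS placement also takes rack topology into account.

```python
def under_replicated(block_locations, live_nodes, replication=3):
    """Return {block id: number of missing copies} for blocks with too few live replicas."""
    missing = {}
    for block_id, holders in block_locations.items():
        live = holders & live_nodes
        if len(live) < replication:
            missing[block_id] = replication - len(live)
    return missing


def re_replicate(block_locations, live_nodes, replication=3):
    """Copy under-replicated blocks onto live nodes that do not hold them yet (toy policy)."""
    repaired = {}
    for block_id, holders in block_locations.items():
        live = set(holders & live_nodes)
        candidates = sorted(live_nodes - live)
        # A copy needs at least one surviving replica to read from.
        while len(live) < replication and candidates and live:
            live.add(candidates.pop(0))
        repaired[block_id] = live
    return repaired


locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live = {"dn1", "dn2", "dn4", "dn5"}  # dn3 has stopped reporting

print(under_replicated(locations, live))
repaired = re_replicate(locations, live)
print(under_replicated(repaired, live))
```

After `dn3` disappears both blocks are one copy short; the repair step restores three live replicas each without involving the failed node.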

Hive vs. HBase
https://www.youtube.com/watch?v=U0r9s4iXwo0, 2’51”
https://www.youtube.com/watch?v=IumVWII3fRQ, 2’50”

Introduction to Hive
- Why Hive? Motivation
- Hive’s Architecture
- Hive’s Principles: Schema on Read
- Hive’s Principles
- DW Stack in Hadoop
- Getting started with Hive

Hive Motivation
- MapReduce development is time consuming
- It requires intimate knowledge of the framework
- Few people have the required expertise
- There is no schema for understanding the data in HDFS

Architecture

DW Stack in Hadoop
Layer | Conventional RDBMS | Hadoop DW Stack
Storage | DB Tables | HDFS Files (Raw and Structured Data)
Metadata | System Tables | HCatalog (Metastore)
Query | SQL Query Engine | Multiple Engines (SQL & Non-SQL)
Conventional RDBMS: all 3 layers are glued together.
Hadoop DW Stack: all the layers are separate and independent.
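The separation of storage, metadata and query, together with Hive's schema-on-read principle, can be illustrated with a small Python sketch. Everything here is hypothetical: `RAW_FILE` stands in for a file in HDFS, `SCHEMA` for a metastore entry, and the two functions for a query engine. Turning unparseable values into `None` is an assumption about how a schema-on-read system might surface mismatches.

```python
import csv
import io

# Storage layer: raw text, written without any validation.
RAW_FILE = "1,alice,34.5\n2,bob,not_a_number\n3,carol,12.0\n"

# Metadata layer: column names and types, kept apart from the data.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Query layer: apply the schema only at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # Bad data is not rejected at load time; it shows up as a null on read.
                row[name] = None
        rows.append(row)
    return rows


def total_amount(rows):
    return sum(row["amount"] for row in rows if row["amount"] is not None)


rows = read_with_schema(RAW_FILE, SCHEMA)
print(rows[1])             # the malformed amount becomes None
print(total_amount(rows))  # 46.5
```

Because the schema lives outside the file, a different engine or a revised schema can be pointed at the same raw data without reloading it.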

HiveQL
Create database: CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path] [WITH DBPROPERTIES (property_name=property_value, ...)];
Drop database: DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Alter database: ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
At the time of this presentation, Hive did NOT support deleting or updating a particular record (row) or a particular set of records (row-level UPDATE and DELETE arrived later, in Hive 0.14, for transactional tables).

Background Row Oriented Storage Structure

Background Column Oriented Storage Structure
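The difference between the two storage structures can be shown with a short Python sketch. The sample records and function names are made up for illustration; real storage engines lay bytes out on disk rather than in Python lists.

```python
records = [
    {"id": 1, "name": "Sue", "age": 26},
    {"id": 2, "name": "Bob", "age": 31},
    {"id": 3, "name": "Ann", "age": 45},
]
columns = ["id", "name", "age"]


def to_row_layout(records, columns):
    """Row oriented: all values of one record are stored next to each other."""
    return [tuple(record[c] for c in columns) for record in records]


def to_column_layout(records, columns):
    """Column oriented: all values of one column are stored next to each other."""
    return {c: [record[c] for record in records] for c in columns}


row_store = to_row_layout(records, columns)
col_store = to_column_layout(records, columns)

# Fetching one whole record is a single lookup in the row layout.
print(row_store[1])

# Aggregating one column only touches that column in the column layout.
average_age = sum(col_store["age"]) / len(col_store["age"])
print(average_age)  # 34.0
```

Row layouts favour reading or writing whole records, while column layouts favour scans and aggregates over a few columns.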

What is HBase?
- HBase is an open-source, non-relational, distributed database.
- Built on top of Apache Hadoop and Apache ZooKeeper.
- HBase is a BigTable-like storage (for Hadoop).
- Key/value column family store.
- Column-oriented, multi-dimensional database.

Key/Value Column Family Structure

HBase Tables
- Tables are sorted by row key in lexicographical order.
- The table schema only defines its column families.
- Each family consists of any number of columns.
- Each column consists of any number of versions.
- Columns only exist when inserted; NULLs are free.
- Columns within a family are sorted and stored together.
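The table properties listed above can be modelled as a nested map: row key, then column family, then column, then version. The class below is a toy sketch under that assumption; `ToyHBaseTable` and its methods are hypothetical and are not the HBase client API.

```python
class ToyHBaseTable:
    """Nested map: row key -> family -> column qualifier -> {timestamp: value}."""

    def __init__(self, families):
        # The schema fixes only the column families, never the columns.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = (
            self.rows.setdefault(row_key, {})
            .setdefault(family, {})
            .setdefault(qualifier, {})
        )
        versions[timestamp] = value

    def get(self, row_key, family, qualifier):
        """Return the newest version, or None if the column was never written."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(qualifier)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Row keys come back in lexicographic order, as plain strings."""
        return sorted(self.rows)


table = ToyHBaseTable(["info"])
table.put("row10", "info", "name", "a", timestamp=1)
table.put("row2", "info", "name", "b", timestamp=1)
table.put("row2", "info", "name", "c", timestamp=2)

print(table.scan())                          # ['row10', 'row2']
print(table.get("row2", "info", "name"))     # c
print(table.get("row10", "info", "email"))   # None
```

Note that "row10" sorts before "row2" because the comparison is lexicographic, and that the never-written `email` column costs nothing to store.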

HBase Components

HBase Vs RDBMS
HBase | RDBMS
Column oriented | Row oriented
No query language | SQL
Flexible schema | Fixed schema
De-normalized data | Normalized data
Good for semi-structured as well as structured data | Good for structured data
Stores sparse data efficiently | Not optimized for sparse tables

NoSQL https://www.youtube.com/watch?v=XPqrY7YEs0A, 5’02”

Database Management System
Storage and retrieval of data.
- Relational Database
- OLAP (Data Warehouse)
- NoSQL

Relational Databases
- Relational data
- Problem: absence of a standard
- In 1970, E. F. Codd proposed the relational model for persisting data.

NoSQL
- Relational databases were created as a standard.
- Problem: they could not handle Big Data.
- NoSQL databases were the answer.

Objectives of NoSQL
- Scalability
- Performance
- High availability

Performance
RDBMS | OLAP | NoSQL, arranged on a scale running between "More Functionality, Less Performance" and "Less Functionality, More Performance".

Storage
System | Data | Stored as
RDBMS | Structured data | Tables
OLAP | Structured data | Cubes
NoSQL | Unstructured or structured data | Collections

Structured and Unstructured data
- Unstructured data: for example media files, blogs written online, and text files.
- Structured data: data that conforms to a particular data model.

Types of NoSQL Databases
- Document Oriented
- Tabular
- Key Value Store

Examples of NoSQL Databases
- Document Oriented: MongoDB, CouchDB, Cloudant
- Tabular: BigTable, HBase, Accumulo
- Key Value Store: Memcached, Coherence, Redis

NoSQL missing features
- No join support (to overcome performance issues)
- No complex transactions (absence of rollback and commit)
- No constraint support (constraints must be implemented at the application level)
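What "implemented at the application level" means in practice can be sketched in Python. The collections, field names and functions below are all hypothetical; they show an application enforcing a foreign-key style constraint and performing a join itself because the datastore is assumed to offer neither.

```python
users = [
    {"_id": 1, "name": "Sue"},
    {"_id": 2, "name": "Bob"},
]
posts = [
    {"_id": 10, "user_id": 1, "text": "hello"},
    {"_id": 11, "user_id": 2, "text": "hi"},
]


def insert_post(posts, users, post):
    """Application-level constraint: refuse posts that reference an unknown user."""
    known_ids = {user["_id"] for user in users}
    if post["user_id"] not in known_ids:
        raise ValueError(f"no user with _id {post['user_id']}")
    posts.append(post)


def join_posts_with_users(posts, users):
    """Application-level join: look each author up in a dictionary built in memory."""
    users_by_id = {user["_id"]: user for user in users}
    return [
        {"text": post["text"], "author": users_by_id[post["user_id"]]["name"]}
        for post in posts
    ]


insert_post(posts, users, {"_id": 12, "user_id": 1, "text": "again"})
print(join_posts_with_users(posts, users))
```

The cost of the missing features is visible here: every application that writes to `posts` has to repeat the check, since nothing in the store enforces it.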

When to Use?
- Great quantities of data need to be stored.
- The structure of the data is not uniform.
- Relationships are not of high importance.
- Fast-running applications.
- Growing lists of data, for example server logs, Twitter posts, blogs.
- Constraints and validations are not part of the implementation.

When not to use?
- Complex transactions must be handled.
- Joins must be handled.
- Validations are a necessity on the database side.

Let's get started
- Download MongoDB: http://www.mongodb.org/downloads
- Command to start MongoDB: mongod.exe --dbpath="E:\Spring2014\Project\data"
- Tutorial: http://docs.mongodb.org/manual/tutorial

Getting started..
Connect to MongoDB: open a command prompt and change to the bin directory, "E:\Spring2014\Project\BI AND DTM\mongodb-win32-x86_64-2008plus-2.4.9\mongodb-win32-x86_64-2008plus-2.4.9\bin"

Some basic operations
- INSERT: for (var i = 0; i <= 25; i++) db.testData.insert({x: i})
- SELECT: db.testData.find()
- The "use" command switches to a database and creates it dynamically. For example, "use tutorial" switches to the tutorial database; the database is created once data is first stored in it, as physical files in the data folder given by --dbpath.

The following compares a MongoDB "insert" statement with the "INSERT" of a relational database.
MongoDB: db.users.insert({name: "Sue", age: 26, status: "complicated"}), where each name: value entry is a field-value pair.
SQL: INSERT INTO users (name, age, status) VALUES ('Sue', 26, 'complicated')

In the MongoDB insert above, users is a collection and the JSON data is called a document.
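The contrast between a row in a table and a document in a collection can be tried out without installing anything, using only the Python standard library. This is a sketch: `sqlite3` plays the relational side, while a plain list of dictionaries is a hypothetical stand-in for a MongoDB collection and is not the MongoDB driver.

```python
import json
import sqlite3

# Relational side: the table and its columns must exist before any insert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, status TEXT)")
conn.execute(
    "INSERT INTO users (name, age, status) VALUES (?, ?, ?)",
    ("Sue", 26, "complicated"),
)
row = conn.execute("SELECT name, age, status FROM users").fetchone()

# Document side: a toy collection, where each document carries its own fields.
collection = []
collection.append({"name": "Sue", "age": 26, "status": "complicated"})
collection.append({"name": "Bob", "tags": ["admin"]})  # different fields, no schema change

# A find-style filter over the toy collection.
matches = [doc for doc in collection if doc.get("age") == 26]

print(row)
print(json.dumps(matches[0], sort_keys=True))
```

The second document has fields the first lacks, which the relational table could not accept without an ALTER TABLE.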

SQL to MongoDB Mapping Chart
SQL terms | MongoDB / NoSQL terms
Database | Database
Table | Collection
Row | Document or BSON document
Column | Field
Index | Index
Table joins | Embedded documents and linking
Primary key | Primary key (_id attribute)

Questions?
