Published by Joan Watson. Modified over 8 years ago.
1
Hadoop Data Management by Team 5, ISQS6339-2014: Vivek Shimbulal (@svivekbafna), Sonali Digwal, Rohit Ramteke, Mrugank Dhone, Shashank Mishra
2
Videos
- The Evolution of the Apache Hadoop Ecosystem | Cloudera. 8’11”. Published on Sep 6, 2013. Hadoop co-founder Doug Cutting explains how the Hadoop ecosystem has expanded and evolved into a much larger Big Data platform with Hadoop at its center. http://www.youtube.com/watch?v=eo1PwSfCXTI
- A Hadoop Ecosystem Overview. 21’54”. Published on Jan 10, 2014. A technical overview of the Hadoop ecosystem, focusing on the HDFS, MapReduce, YARN, Hive, Pig and HBase software components. http://www.youtube.com/watch?v=kRnh3WpcKXo
- Working in the Hadoop Ecosystem. 10’40”. Published on Sep 5, 2013. Mark Grover, a Software Engineer at Cloudera, talks about working in the Hadoop ecosystem. http://www.youtube.com/watch?v=nbUsY9tj-pM
3
HDFS https://www.youtube.com/watch?v=1_ly9dZnmWc, 8’27”
4
HDFS OVERVIEW
- Based on Google’s GFS (Google File System)
- Provides redundant storage of massive amounts of data
- Data is distributed across the nodes of the cluster at load time
5
HDFS DESIGN
- Runs on commodity hardware
- Assumes high component failure rates
- Works well with lots of large files
- Built around the idea of “write once, read many times”
6
HDFS ARCHITECTURE
- Operates on top of an existing file system
- Files are stored as “blocks”
- Default block size is 64 MB (in Hadoop 1.x; later releases default to 128 MB)
- Provides reliability through replication
- NameNode stores metadata and manages access
- No data caching, because the datasets are large
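As a rough illustration of the block and replication figures on this slide, the following Python sketch estimates how many blocks a file occupies and how much raw storage its replicas consume. The 64 MB block size comes from the slide; the replication factor of 3 and the helper name `hdfs_footprint` are assumptions made for this example only.

```python
import math

BLOCK_SIZE_MB = 64        # default block size stated on the slide
REPLICATION_FACTOR = 3    # assumed here; the slide gives no number


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (number of blocks, raw MB stored across the cluster)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies as much space as the data it holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)  # 16 3000
```

A 1000 MB file therefore needs 16 blocks, the last of which is only partly filled.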
7
HDFS ARCHITECTURE DIAGRAM
8
HDFS FILE STORAGE
NameNode
- Stores all metadata
- Keeps metadata in RAM for fast lookup
- File-system metadata size is limited to the amount of available RAM on the NameNode
DataNode
- Stores file content as blocks
- Different blocks of the same file are stored on different DataNodes
- The same block is replicated across several DataNodes for redundancy
- Periodically sends a report of all existing blocks to the NameNode
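The division of labour between NameNode and DataNodes can be sketched in a few lines of Python. This is a toy model, not how HDFS is implemented: the round-robin placement, the function names `place_blocks` and `block_report`, and the node names `dn1` to `dn4` are all assumptions for illustration (real HDFS uses a rack-aware placement policy).

```python
from itertools import cycle


def place_blocks(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    ring = cycle(datanodes)
    block_map = {}
    for block_id in range(num_blocks):
        # Consecutive draws from the ring are distinct because
        # replication <= len(datanodes).
        block_map[block_id] = [next(ring) for _ in range(replication)]
    return block_map


def block_report(block_map, datanode):
    """Blocks a given DataNode would report to the NameNode."""
    return sorted(b for b, nodes in block_map.items() if datanode in nodes)


# The block map plays the role of the NameNode's in-memory metadata.
block_map = place_blocks(4, ["dn1", "dn2", "dn3", "dn4"])
print(block_map[0])                    # ['dn1', 'dn2', 'dn3']
print(block_report(block_map, "dn4"))  # [1, 2, 3]
```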
9
FAILURE AND REPLACEMENT
- DataNode failure and recovery
- NameNode failure and options to avoid it
- Secondary NameNode
- Block placement strategies
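To make DataNode failure and recovery a little more concrete, the sketch below finds the blocks that fall below their target replica count once a node disappears. It is a simplified model under stated assumptions: the function name `under_replicated`, the block and node names, and the target of 3 replicas are hypothetical, and a node is treated as failed simply because it is missing from the set of live nodes.

```python
def under_replicated(block_map, live_nodes, target=3):
    """Return {block_id: missing replica count} given the live DataNodes."""
    missing = {}
    for block_id, nodes in block_map.items():
        alive = [n for n in nodes if n in live_nodes]
        if len(alive) < target:
            missing[block_id] = target - len(alive)
    return missing


block_map = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
# dn1 has failed, so only dn2..dn4 count as live.
print(under_replicated(block_map, {"dn2", "dn3", "dn4"}))  # {'blk_1': 1}
```

Recovery then amounts to copying each listed block from a surviving replica to another live node until the target count is restored.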
10
Hive vs. HBase
https://www.youtube.com/watch?v=U0r9s4iXwo0, 2’51”
https://www.youtube.com/watch?v=IumVWII3fRQ, 2’50”
11
Introduction to Hive
- Why Hive? Motivation
- Hive’s Architecture
- Hive’s Principles: Schema on Read
- Hive’s Principles: DW Stack in Hadoop
- Getting started with Hive
12
Hive Motivation
- MapReduce development is time consuming
- It requires intimate knowledge of the framework
- Few people have the required expertise
- There is no schema to help understand the data in HDFS
13
Architecture
14
DW Stack in Hadoop
Layer | Conventional RDBMS | Hadoop DW stack
Storage | DB tables | HDFS files (raw and structured data)
Metadata | System tables | HCatalog (metastore)
Query | SQL query engine | Multiple engines (SQL and non-SQL)
Coupling | All 3 layers glued together | All layers separate and independent
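Keeping storage and metadata in separate layers is what makes Hive's "schema on read" principle possible: the files carry no schema, and a table definition is applied only when the data is read. The Python sketch below imitates that idea under stated assumptions; the sample data, the `SCHEMA` list and the function `read_with_schema` are hypothetical stand-ins, not Hive code.

```python
import csv
import io

# Raw file content as it might sit in HDFS: no schema is stored with it.
RAW = "1,alice,34\n2,bob,notanumber\n"

# Hypothetical table definition, kept apart from the data (the metastore role).
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(raw_text, schema):
    """Apply the schema at read time; unparseable fields become None."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (column, cast), value in zip(schema, record):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
print(rows[1])  # {'id': 2, 'name': 'bob', 'age': None}
```

Bad data is not rejected when it is loaded; it only shows up as a missing value when a query reads it through the schema.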
15
HiveQL
Create database: CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path] [WITH DBPROPERTIES (property_name=property_value,...)];
Drop database: DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Alter database: ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value,...);
Hive does NOT support deletion or update of a particular record (row) or a particular set of records. (This held for the Hive releases available when the deck was written; row-level UPDATE and DELETE arrived later, in Hive 0.14.)
16
Background Row Oriented Storage Structure
17
Background Column Oriented Storage Structure
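Since the diagrams for these two slides are not reproduced here, a small Python sketch may help show the difference between the two layouts. The three sample records and the column names are made up for illustration.

```python
rows = [
    (1, "alice", 34),
    (2, "bob", 41),
    (3, "carol", 29),
]
columns = ("id", "name", "age")

# Row-oriented layout: all fields of one record sit next to each other.
row_store = [value for row in rows for value in row]

# Column-oriented layout: all values of one column sit next to each other.
column_store = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

# An aggregate over one column touches only that column's values.
average_age = sum(column_store["age"]) / len(column_store["age"])
print(row_store[:3])        # [1, 'alice', 34]
print(column_store["age"])  # [34, 41, 29]
```

Reading one whole record is cheapest in the row layout, while scanning a single attribute across many records is cheapest in the column layout.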
18
What is HBase?
- HBase is an open-source, non-relational, distributed database.
- Built on top of Apache Hadoop and Apache ZooKeeper.
- HBase is a BigTable-like storage (for Hadoop).
- Key/value column-family store.
- Column-oriented, multi-dimensional database.
19
Key/Value Column Family Structure
20
HBase Tables
- Tables are sorted by row key in lexicographical order.
- The table schema only defines its column families.
- Each family consists of any number of columns.
- Each column consists of any number of versions.
- Columns only exist when inserted; NULLs are free.
- Columns within a family are sorted and stored together.
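The table model described on this slide can be pictured as nested maps: row key, then column family, then column, then timestamped versions. The Python sketch below is an in-memory imitation under stated assumptions; `put` and `get_latest` are hypothetical helpers written for this example, not the HBase client API.

```python
from collections import defaultdict

# table[row_key][family][qualifier] -> {timestamp: value}
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(row_key, family, qualifier, value, timestamp):
    """Store one versioned cell; cells that are never written take no space."""
    table[row_key][family][qualifier][timestamp] = value


def get_latest(row_key, family, qualifier):
    """Return the newest version of a cell, or None if it was never written."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    return versions[max(versions)] if versions else None


put("row10", "info", "email", "a@example.com", timestamp=1)
put("row2", "info", "email", "b@example.com", timestamp=1)
put("row2", "info", "email", "b2@example.com", timestamp=2)

print(sorted(table))                        # ['row10', 'row2']
print(get_latest("row2", "info", "email"))  # b2@example.com
print(get_latest("row2", "info", "phone"))  # None
```

Note that "row10" sorts before "row2": row keys are compared lexicographically, not numerically, which is why keys are often zero-padded.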
21
HBase Components
22
HBase vs. RDBMS
HBase | RDBMS
Column oriented | Row oriented
No query language | SQL
Flexible schema | Fixed schema
De-normalized data | Normalized data
Good for semi-structured as well as structured data | Good for structured data
Stores sparse data efficiently | Not optimized for sparse tables
23
NoSQL https://www.youtube.com/watch?v=XPqrY7YEs0A, 5’02”
24
Database Management System: storage and retrieval of data.
- Relational database
- OLAP (data warehouse)
- NoSQL
25
Relational Databases
- Problem: there was no standard way to store relational data.
- In 1970, E. F. Codd proposed the relational model as a way to persist data.
26
NoSQL
- Relational databases were created as a standard.
- Problem: they could not handle Big Data.
- NoSQL databases were the answer.
27
Objectives of NoSQL
- Scalability
- Performance
- High availability
28
Performance
RDBMS, OLAP and NoSQL sit on a spectrum, from RDBMS (more functionality, less performance) through OLAP to NoSQL (less functionality, more performance).
29
Storage
System | Data | Stored as
RDBMS | Structured data | Tables
OLAP | Structured data | Cubes
NoSQL | Unstructured or structured data | Collections
30
Structured and Unstructured Data
- Structured data conforms to a particular data model.
- Examples of unstructured data: media files, blog posts written online, text files, etc.
31
Types of NoSQL Databases
- Document oriented
- Tabular
- Key-value store
32
Examples of NoSQL Databases
- Document oriented: MongoDB, CouchDB, Cloudant
- Tabular: BigTable, HBase, Accumulo
- Key-value store: Memcached, Coherence, Redis
33
NoSQL missing features
- No join support (to avoid performance issues)
- No complex transactions (no commit or rollback)
- No constraint support (constraints must be implemented at the application level)
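The last point, constraints implemented at the application level, might look like the following Python sketch. A plain list stands in for a document collection, and the "required" and "unique" rules as well as the function `insert_user` are hypothetical examples rather than features of any particular NoSQL product.

```python
collection = []  # stand-in for a document collection


def insert_user(document):
    """Enforce 'required' and 'unique' rules in application code."""
    if not document.get("name"):
        raise ValueError("name is required")
    if any(existing.get("email") == document.get("email")
           for existing in collection):
        raise ValueError("email must be unique")
    collection.append(document)


insert_user({"name": "Sue", "email": "sue@example.com"})
try:
    insert_user({"name": "Bob", "email": "sue@example.com"})
except ValueError as error:
    print(error)  # email must be unique
```

Every code path that writes to the collection has to go through such a check, which is the price of not having the database enforce it.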
34
When to use?
- Great quantities of data need to be stored.
- The structure of the data is not uniform.
- Relationships are not of high importance.
- The application must run fast.
- The data is an ever-growing list, for example server logs, Twitter posts, blogs.
- Constraints and validations are not part of the implementation.
35
When not to use?
- Complex transactions must be handled.
- Joins must be handled.
- Validations are a necessity on the database side.
36
Let's get started
Download MongoDB: http://www.mongodb.org/downloads
Command to start MongoDB: mongod.exe --dbpath="E:\Spring2014\Project\data"
Tutorial: http://docs.mongodb.org/manual/tutorial
37
Getting started: connect to MongoDB
Open a command prompt, change to the bin directory, for example "E:\Spring2014\Project\BI AND DTM\mongodb-win32-x86_64-2008plus-2.4.9\mongodb-win32-x86_64-2008plus-2.4.9\bin", and start the mongo shell (mongo.exe) from there.
38
Some basic operations
INSERT: for (var i = 0; i <= 25; i++) db.testData.insert({x: i})
SELECT: db.testData.find()
The "use" command switches to a database, for example "use tutorial". If the database does not exist yet, MongoDB creates it when data is first stored in it, and its files then appear in the data folder given by --dbpath.
39
The following compares a MongoDB "insert" with the INSERT statement of a relational database.
MongoDB: db.users.insert({name: "Sue", age: 26, status: "complicated"}), where each name: value entry is a field-value pair.
SQL: INSERT INTO users (name, age, status) VALUES ('Sue', 26, 'complicated')
40
In the NoSQL query above, users is a collection and the JSON data is called a document.
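The contrast between a row in a table and a document in a collection can be tried out without installing MongoDB. In this sketch the relational side uses Python's built-in `sqlite3` module, and the document side is only imitated with a list of dicts, which is an assumption made for illustration and not the MongoDB API.

```python
import sqlite3

# Relational side: the table and its columns must be declared first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, status TEXT)")
conn.execute(
    "INSERT INTO users (name, age, status) VALUES (?, ?, ?)",
    ("Sue", 26, "complicated"),
)
sql_rows = conn.execute("SELECT name, age, status FROM users").fetchall()

# Document side: a collection is modelled as a list of dicts, and each
# document carries its own field names.
users = []
users.append({"name": "Sue", "age": 26, "status": "complicated"})
users.append({"name": "Al", "nickname": "A"})  # different fields are fine

print(sql_rows)  # [('Sue', 26, 'complicated')]
print(users[1])  # {'name': 'Al', 'nickname': 'A'}
```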
41
SQL to MongoDB Mapping Chart
SQL terms | MongoDB/NoSQL terms
Database | Database
Table | Collection
Row | Document (BSON document)
Column | Field
Index | Index
Table joins | Embedded documents and linking
Primary key | Primary key (_id attribute)
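The "table joins" row of the chart is the least obvious mapping, so here is a small sketch of it. The relational half runs on Python's built-in `sqlite3`; the document half is a plain dict standing in for a MongoDB document. The table names, the `addresses` field and the sample cities are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE addresses (user_id INTEGER, city TEXT);
    INSERT INTO users VALUES (1, 'Sue');
    INSERT INTO addresses VALUES (1, 'Lubbock'), (1, 'Austin');
""")
joined = conn.execute(
    "SELECT u.name, a.city FROM users u "
    "JOIN addresses a ON a.user_id = u.id ORDER BY a.city"
).fetchall()

# Document form: the related rows are embedded, so one lookup by _id
# returns the user together with the addresses.
user_doc = {"_id": 1, "name": "Sue",
            "addresses": [{"city": "Austin"}, {"city": "Lubbock"}]}
embedded = [(user_doc["name"], a["city"]) for a in user_doc["addresses"]]

print(joined == embedded)  # True
```

Both forms hold the same information; the embedded form trades the join for duplication whenever the same address would belong to several users.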
42
Questions?