Published by Eileen Boone. Modified over 9 years ago.
1
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
By: Muhammad Mudassar, MS-IT-8
2
What is going on
Data analysis techniques are changing
  Enterprises are moving to cheaper commodity hardware
  MPP (Massively Parallel Processing) architectures run inside “clouds”
Analytical data is exploding
What technology for data analysis?
  Parallel databases
  MapReduce-based systems
3
The two technologies
Parallel databases
  High performance and efficiency
  Score poorly on fault tolerance and on the ability to run in heterogeneous environments
  Few known deployments over 100 nodes
MapReduce-based systems
  Designed to scale to thousands of nodes
  Fault tolerant and able to run in heterogeneous environments
  The biggest issue with MapReduce is performance
4
HadoopDB
A hybrid system to handle the demands of data-intensive applications
Advantages
  Scalability of MapReduce
  Performance and efficiency of parallel databases
Built completely on open-source, free-to-use components
  PostgreSQL is used as the database layer
  Hadoop MapReduce is used
  Amazon’s EC2 cloud is used
5
Desired Properties
Performance
  A primary characteristic that commercial database systems use to distinguish themselves
Fault tolerance
  Measured differently for analytical and transactional DBMSs
  For an analytical DBMS, restarting a query is to be avoided
Ability to run in a heterogeneous environment
  It is nearly impossible to get homogeneous performance from 100 or 1000 nodes
Flexible query interface
  Allows users to write user-defined functions (UDFs) and queries that are parallelized automatically
6
Architecture of HadoopDB [figure]
7
The Hadoop framework
Hadoop consists of 2 layers
  A data storage layer, the Hadoop Distributed File System (HDFS)
  A data processing layer, the MapReduce framework
HDFS
  Block-structured file system managed by the NameNode
  Data is handled by DataNodes
MapReduce framework
  Master-slave architecture based on JobTracker and TaskTrackers
  The JobTracker manages jobs: task assignment, job tracking, and load balancing
  TaskTrackers perform the Map or Reduce tasks assigned to them
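To make the two layers concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that the MapReduce framework distributes across TaskTrackers. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are illustrative assumptions, and the nested lists merely stand in for HDFS blocks.

```python
from collections import defaultdict


def map_phase(blocks, map_fn):
    """Apply map_fn to every record of every block (one block ~ one Map task)."""
    pairs = []
    for block in blocks:
        for record in block:
            pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group intermediate values by key, as happens between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn once per key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_job(blocks, map_fn, reduce_fn):
    return reduce_phase(shuffle(map_phase(blocks, map_fn)), reduce_fn)


# Hypothetical job: count words across two "blocks" of text lines.
def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(key, values):
    return sum(values)


blocks = [["hadoop db", "hadoop hive"], ["db db"]]
result = run_job(blocks, word_count_map, word_count_reduce)
```

In real Hadoop the JobTracker would hand each block to a TaskTracker and rerun failed tasks; the sketch only shows the data flow.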
8
The HadoopDB’s components
HadoopDB extends the Hadoop framework with four components
1. Database connector
  Interface between the DBMS and the TaskTracker
  Databases are treated as data sources, similar to data blocks in HDFS
2. Catalog
  Maintains information about the databases
  Database location, driver class, and metadata such as replica locations and partitioning properties
3. Data Loader
  Globally partitions the data on a given key
  Breaks each node's data into chunks
  Loads the chunks into the databases
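The Data Loader steps can be sketched roughly as follows. This is an illustration under stated assumptions, not HadoopDB's loader: the helper names (`global_partition`, `local_chunks`, `build_catalog`), the CRC32 hash, the fixed-size chunking, and the catalog fields are all hypothetical choices made to keep the example small.

```python
import zlib


def global_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its partition key (assumed scheme)."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[digest % num_nodes].append(row)
    return nodes


def local_chunks(node_rows, chunk_size):
    """Break one node's rows into chunks of at most chunk_size rows."""
    return [node_rows[i:i + chunk_size]
            for i in range(0, len(node_rows), chunk_size)]


def build_catalog(nodes, chunk_size, key):
    """Record where each chunk lives; the field names are hypothetical."""
    catalog = []
    for node_id, node_rows in enumerate(nodes):
        for chunk_id, chunk in enumerate(local_chunks(node_rows, chunk_size)):
            catalog.append({
                "node": node_id,
                "chunk": chunk_id,
                "partition_key": key,
                "rows": len(chunk),
            })
    return catalog


# Hypothetical input: 50 rows spread over 7 distinct user ids, 3 nodes.
rows = [{"user_id": i % 7, "amount": i} for i in range(50)]
nodes = global_partition(rows, "user_id", 3)
catalog = build_catalog(nodes, 10, "user_id")
```

Because the partitioning is by key, all rows with the same `user_id` land on the same node, which is what lets later queries on that key run locally.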
9
The HadoopDB’s components
4. SQL to MapReduce to SQL (SMS) Planner
  HadoopDB provides a front end to process SQL queries
  The SMS planner extends Hive
  The parser transforms the query into an abstract syntax tree
  Table schema information is fetched from the catalog
  The logical plan generator creates a query plan
  The optimizer breaks the plan up into Map or Reduce phases
  An executable plan is generated for one or more MapReduce jobs
  SMS tries to push as much work as possible to the database layer
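The idea of pushing work into the database layer can be illustrated with a small sketch. It is not the SMS planner: Python's built-in `sqlite3` stands in for the per-node PostgreSQL instances only so the example is self-contained, and the `sales` table, the query, and the function names are hypothetical. Each "Map task" runs the aggregation SQL inside its local database, and the "Reduce task" only merges the partial sums.

```python
import sqlite3
from collections import defaultdict

# SQL that the Map task sends to its local database (assumed example query).
PUSHED_DOWN_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region"


def make_node(rows):
    """Create an in-memory database standing in for one node's DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Run the pushed-down SQL locally and emit (region, partial_sum) pairs."""
    return conn.execute(PUSHED_DOWN_SQL).fetchall()


def reduce_task(partials):
    """Merge the partial sums produced by all Map tasks."""
    totals = defaultdict(int)
    for region, subtotal in partials:
        totals[region] += subtotal
    return dict(totals)


node_dbs = [
    make_node([("east", 10), ("west", 5)]),
    make_node([("east", 7), ("north", 3)]),
]
partials = [pair for conn in node_dbs for pair in map_task(conn)]
totals = reduce_task(partials)
```

The design point is that the databases do the scanning and grouping, so only small partial results cross the network to the Reduce phase.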
10
Evaluating HadoopDB
Compare HadoopDB to
  Hadoop
  Parallel databases (Vertica, DBMS-X)
Features
  Performance
    HadoopDB is expected to approach the performance of parallel databases
  Scalability
    HadoopDB is expected to be scalable
11
Data Load [figure]
12
Query Results [figure]
13
Scalability
HadoopDB and Hadoop take advantage of run-time scheduling by splitting the data into small tasks
Parallel databases restart the entire query on a node failure, or wait for the slowest node
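The difference between the two recovery strategies can be shown with a toy cost model. It is a simplified simulation, not a description of either system's scheduler: the round-robin assignment, the single failing worker that dies on its first task, and the function names are all assumptions, and the code expects at least one healthy worker.

```python
from collections import deque


def schedule_with_reruns(num_tasks, workers, failing_worker):
    """Task-level recovery: only the task on the failed worker is rerun.

    Returns (task_attempts, completed_task_ids).
    """
    pending = deque(range(num_tasks))
    alive = list(workers)
    attempts = 0
    done = set()
    turn = 0
    while pending:
        worker = alive[turn % len(alive)]
        turn += 1
        task = pending.popleft()
        attempts += 1
        if worker == failing_worker:
            # The worker dies: drop it and put its task back in the queue.
            alive.remove(worker)
            pending.append(task)
        else:
            done.add(task)
    return attempts, done


def restart_whole_query(num_tasks, completed_before_failure):
    """Query-level recovery: earlier work is discarded and everything reruns."""
    return completed_before_failure + 1 + num_tasks


attempts, done = schedule_with_reruns(6, ["w0", "w1", "w2"], "w1")
restart_cost = restart_whole_query(6, completed_before_failure=1)
```

Under this model the task-level scheduler pays for one extra task attempt no matter when the failure happens, while the cost of a full restart grows with how much of the query had already finished.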
14
Conclusion
HadoopDB
  Is a hybrid system
  Scales better than parallel databases
  Is fault tolerant
  Approaches the performance of parallel databases
  Is free and open source