1
Hadoop Basics - Venkat Cherukupalli
2
What is Hadoop?
- Open source
- Distributed processing of large data sets across clusters
- Commodity, shared-nothing servers
- Local computation and storage
3
Key Services
- Hadoop Distributed File System (HDFS): reliable data storage
- MapReduce: high-performance parallel data processing
4
HDFS
- Splits user data across servers in a cluster
- Replication: multiple node failures will not cause data loss
- Reliable, scalable, and low-cost storage
- Relies on replication rather than RAID, allowing massive scale
- NameNode and DataNode roles (a short client-API sketch follows)
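Applications reach the NameNode/DataNode architecture through the HDFS FileSystem API. Below is a minimal Java sketch, assuming the Hadoop client libraries are on the classpath and fs.defaultFS in core-site.xml points at the cluster's NameNode; the path /user/demo/hello.txt is only an illustrative placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client for the configured NameNode

        Path file = new Path("/user/demo/hello.txt");  // hypothetical example path

        // Write a small file; HDFS splits it into blocks and replicates each block
        // across DataNodes (replication factor 3 by default).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the NameNode supplies block locations, DataNodes serve the bytes.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}
```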
5
HDFS
6
MapReduce
- Parallel, distributed processing system
- No special programming techniques required
- Existing algorithms work without change
7
MapReduce Framework
- Processes large jobs in parallel across many nodes and combines the results
- Eliminates the bottlenecks imposed by monolithic storage systems
- After each piece has been analyzed, the results are collated and digested into a single output (see the word-count sketch below)
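To make the map-then-combine flow concrete, here is the standard word-count example written against the MapReduce Java API, close to the one in the Hadoop tutorial. The class names and the input/output directories passed on the command line are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split, processed locally on each node.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups map output by key; sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-aggregate map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```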
8
Self-healing
- Shifts work to the remaining nodes
- Creates additional copies of the data from the replicas
- Applies to both storage and computation
- No sysadmin intervention required
9
What is Sqoop?
- Imports individual tables or entire databases into files in HDFS
- Generates Java classes that let you interact with the imported data
- Can import from SQL databases straight into your Hive data warehouse (an import sketch follows)
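Sqoop is normally invoked from the command line ("sqoop import ..."). The sketch below shows the same flags driven from Java via Sqoop 1.x's runTool entry point; the JDBC URL, credentials, table name, and HDFS directory are placeholders, and Sqoop plus the JDBC driver are assumed to be on the classpath.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",   // placeholder JDBC URL
            "--username", "etl_user",                   // placeholder credentials
            "--password", "secret",
            "--table", "orders",                        // source table to import
            "--target-dir", "/user/demo/orders",        // HDFS destination directory
            "--num-mappers", "4"                        // parallel import tasks
        };
        // Adding "--hive-import" would load the table straight into Hive instead.
        int exitCode = Sqoop.runTool(importArgs);       // 0 on success
        System.exit(exitCode);
    }
}
```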
10
Other Concepts
- HBase: an open source, non-relational, distributed database modeled after Google's BigTable and written in Java (put/get sketch below)
- Hive: a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis
- Pig: a platform for creating MapReduce programs used with Hadoop
- ZooKeeper: reliable distributed coordination
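Of these components, HBase is the one applications most often call directly from code. Here is a minimal put/get sketch with the HBase Java client, assuming a running cluster reachable through the ZooKeeper quorum in hbase-site.xml and a pre-created "users" table with an "info" column family (both hypothetical names).

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column info:name = "Ada".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```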