1
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Aleksander Ratzloff Faculty Mentor: Dr. Rahman Tashakkori Department of Computer Science Appalachian State University State of NC Undergraduate Research and Creativity Symposium (SNCURCS) North Carolina State University November 22, 2014
2
Presentation outline What is Hadoop?
What distinguishes structured data from semi-structured data? How does Hadoop work? What is MapReduce? Why is Hadoop useful?
3
Hadoop Who here has heard of Hadoop?
Hadoop is all about “big data”. That is simply a very large quantity of data - think billions, if not trillions, of records in a database. Hadoop is a set of tools used to store that data and execute the MapReduce algorithm over it as quickly and accurately as possible.
4
Hadoop Hadoop is all about large amounts of data.
Semi-structured data versus structured data: structured data conforms to a fixed schema (for example, rows in a relational database table), while semi-structured data carries its structure within the data itself through tags or markers (for example, JSON, XML, or log files).
5
Hadoop Hadoop is also designed to be run across multiple computers, using HDFS and YARN technologies.
6
HDFS HDFS - Hadoop Distributed File System
Used to distribute storage across numerous nodes. The filesystem consists of many data nodes, which store the actual file blocks, and one name node, which keeps the metadata recording where each block lives.
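The split between one name node (metadata only) and many data nodes (block storage) can be sketched conceptually in Python. Everything below is illustrative, not the real HDFS implementation: the class names, the tiny block size, and the round-robin placement policy are all assumptions made for the demo.

```python
# Toy sketch of how HDFS splits a file into blocks and spreads them
# across data nodes while a single name node tracks only metadata.
# Block size and replication factor are demo values, not HDFS defaults
# (real HDFS uses 128 MB blocks and replication 3 by default).

BLOCK_SIZE = 8      # characters per block, tiny for demonstration
REPLICATION = 2     # each block is stored on this many data nodes

class NameNode:
    """Holds only metadata: which data nodes store each block of a file."""
    def __init__(self):
        self.block_map = {}   # filename -> list of (block_id, [node names])

class DataNode:
    """Holds the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}      # (filename, block_id) -> block data

def put_file(name_node, data_nodes, filename, data):
    """Split data into blocks and place replicas round-robin across nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    name_node.block_map[filename] = []
    for block_id, block in enumerate(blocks):
        targets = [data_nodes[(block_id + r) % len(data_nodes)]
                   for r in range(REPLICATION)]
        for node in targets:
            node.blocks[(filename, block_id)] = block
        name_node.block_map[filename].append(
            (block_id, [n.name for n in targets]))

def read_file(name_node, data_nodes, filename):
    """Reassemble a file by asking the name node where each block lives."""
    by_name = {n.name: n for n in data_nodes}
    parts = []
    for block_id, holders in name_node.block_map[filename]:
        # any replica will do; read from the first holder
        parts.append(by_name[holders[0]].blocks[(filename, block_id)])
    return "".join(parts)
```

The key idea this illustrates: losing one data node does not lose data (each block has replicas elsewhere), but the name node is a single point of metadata.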
7
HDFS
8
YARN YARN - Yet Another Resource Negotiator
YARN manages the resources available to a Hadoop cluster: CPU time, RAM, etc.
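A toy sketch of the negotiation YARN performs: applications request containers with CPU and memory requirements, and the resource manager grants them only while the cluster has capacity left. The class and method names here are hypothetical, and real YARN schedulers (capacity, fair) are far more sophisticated.

```python
# Conceptual sketch of a resource negotiator, not the real YARN API.
# Applications ask for containers sized in virtual cores and RAM;
# requests are granted only while free capacity remains.

class ResourceManager:
    def __init__(self, total_vcores, total_ram_mb):
        self.free_vcores = total_vcores
        self.free_ram_mb = total_ram_mb
        self.containers = []   # (app, vcores, ram_mb) for granted containers

    def request_container(self, app, vcores, ram_mb):
        """Grant a container if capacity allows; otherwise refuse."""
        if vcores <= self.free_vcores and ram_mb <= self.free_ram_mb:
            self.free_vcores -= vcores
            self.free_ram_mb -= ram_mb
            self.containers.append((app, vcores, ram_mb))
            return True
        return False

    def release(self, app):
        """Return all of an application's resources to the pool."""
        kept = []
        for entry in self.containers:
            if entry[0] == app:
                self.free_vcores += entry[1]
                self.free_ram_mb += entry[2]
            else:
                kept.append(entry)
        self.containers = kept
```

This captures the core contract: jobs do not grab machines directly; they negotiate with a central manager that tracks what is free.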
9
MapReduce The programming model that Hadoop is centered around
MapReduce is very scalable! There are two stages: Map and Reduce. There are five steps to a MapReduce job: prepare the map() input, execute map(), shuffle the map output to the reduce processors, execute reduce(), and produce the final output.
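The five steps above can be sketched on a single machine in a few lines of Python. Real Hadoop distributes each step across many nodes; the function names here are illustrative, not Hadoop's API.

```python
from collections import defaultdict

# Single-machine sketch of the five MapReduce steps.
def map_reduce(inputs, map_fn, reduce_fn):
    # Step 1: prepare map() input - feed records one at a time.
    # Step 2: execute map() - each record yields (key, value) pairs.
    mapped = []
    for record in inputs:
        mapped.extend(map_fn(record))

    # Step 3: shuffle - group all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Step 4: execute reduce() on each key's group.
    # Step 5: produce the final output.
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

Because map calls are independent and each key's reduce call is independent, both stages parallelize naturally, which is where the scalability comes from.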
10
MapReduce: Map Take the inputs and map each of them to key-value pairs
11
MapReduce: Reduce Take the key-value pairs as input
Do “something” with them to aggregate, classify, or compare the data. Produce an output of the results.
12
MapReduce: Example Word count
How many times does each word appear in a document or collection of documents? The Map stage maps each word to a count via key-value pairs, e.g., (word, 1). The Reduce stage aggregates the values emitted by the Map stage into a total count per word.
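The word-count example can be written as a separate mapper and reducer, in the spirit of a Hadoop Streaming job where each half is its own small program. This is a sketch of the two stages, not a drop-in streaming script.

```python
from itertools import groupby

def mapper(lines):
    """Map stage: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce stage: with pairs sorted by key (the shuffle's job),
    sum the counts for each word."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))
```

In a real cluster the sort-and-group step between the two functions is the shuffle phase, performed by the framework rather than by the reducer itself.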
13
MapReduce: Example
14
Hadoop: Cluster Setup Setting up Hadoop is one of its major pitfalls
There are lots of considerations: open file handle limits, network port forwarding, etc. Networking is a common source of trouble. Documentation is sparse, and setup is not for the faint of heart.
15
Hadoop: Wrap-up So what can Hadoop do? Beyond Hadoop, what is there?
Mahout Hive Pig
Hadoop has the ability to process data very quickly. Any problem that can be broken up into independent parts, with those partial results combined at the end into the final solution, is a good fit for a Hadoop cluster. Hadoop ecosystem technologies include:
Mahout: a machine learning library, natively written in Java.
Hive: a data warehouse for data summarization and analysis. Rather than writing MapReduce jobs by hand, you can query the data with an SQL-like language, as if it were a database.
Pig: a query language designed to generate MapReduce jobs from simple queries over the data set. Similar to Hive, but built around creating MapReduce jobs; operations such as loading, sorting, filtering, grouping, joining, and moving data are all handled by Pig.
16
Acknowledgements Sina Tashakkori Michael Kepple
17
Questions?