Analysis of Structured or Semi-structured Data on a Hadoop Cluster

1 Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Aleksander Ratzloff
Faculty Mentor: Dr. Rahman Tashakkori
Department of Computer Science, Appalachian State University
State of NC Undergraduate Research and Creativity Symposium (SNCURCS)
North Carolina State University, November 22, 2014

2 Presentation outline
What is Hadoop?
What distinguishes structured data from semi-structured data?
How does Hadoop work?
What is MapReduce?
Why is Hadoop useful?

3 Hadoop
Who here has heard of Hadoop?
Hadoop is all about “big data”, which simply means a very large quantity of data: think billions, if not trillions, of records in a database.
Hadoop is a set of tools used to run the MapReduce programming model as quickly and accurately as possible.

4 Hadoop
Hadoop is all about large amounts of data.
Semi-structured data versus structured data: structured data follows a fixed schema, such as rows in a relational table, while semi-structured data, such as JSON, XML, or log files, carries its own tags or keys without a rigid schema.

5 Hadoop Hadoop is also designed to be run across multiple computers, using HDFS and YARN technologies.

6 HDFS
HDFS - Hadoop Distributed File System
Used to distribute storage across numerous nodes.
The filesystem consists of many data nodes and one name node.
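
To make the name node / data node split concrete, here is a minimal Python sketch of the idea. It is a toy model, not the HDFS API: the tiny block size, the replication factor, the round-robin placement, and the node names `dn1` to `dn3` are all assumptions chosen for illustration, and real HDFS uses far larger blocks and a rack-aware placement policy.

```python
from collections import defaultdict

BLOCK_SIZE = 4    # bytes per block; toy value, real HDFS blocks are far larger
REPLICATION = 2   # copies kept of each block; toy value


def place_blocks(data, data_nodes):
    """Split data into blocks and spread replicas over the data nodes.

    Returns (name_node, storage):
      name_node maps block id -> names of the data nodes holding a replica
      storage   maps data node name -> {block id: block bytes}
    Placement is simple round-robin, which is an assumption of this sketch.
    """
    name_node = {}
    storage = defaultdict(dict)
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        holders = [data_nodes[(block_id + r) % len(data_nodes)]
                   for r in range(REPLICATION)]
        name_node[block_id] = holders
        for node in holders:
            storage[node][block_id] = block
    return name_node, storage


def read_file(name_node, storage):
    """Reassemble the file by asking the name node where each block lives."""
    return b"".join(storage[name_node[block_id][0]][block_id]
                    for block_id in sorted(name_node))


if __name__ == "__main__":
    table, disks = place_blocks(b"hello hadoop", ["dn1", "dn2", "dn3"])
    print(table)                    # which data nodes hold which block
    print(read_file(table, disks))  # the original bytes, reassembled
```

The point of the sketch is the division of labour: the name node holds only metadata (which block is where), while the data nodes hold the actual bytes.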

7 HDFS

8 YARN
YARN - Yet Another Resource Negotiator
YARN manages the resources available to a Hadoop cluster: CPU time, RAM, etc.

9 MapReduce
The programming model that Hadoop is centered around
MapReduce is very scalable!
There are two stages: Map and Reduce
There are five steps to a MapReduce job:
  1. Prepare the map() input
  2. Execute map()
  3. Shuffle the map output to the reduce processors
  4. Execute reduce()
  5. Produce the final output
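
The five steps can be sketched in a few lines of single-process Python. This is only an in-memory illustration of the model, not Hadoop's Java API: `run_mapreduce`, `map_temperature`, `reduce_max`, and the "year,temperature" records are hypothetical names and data invented for the example. On a real cluster each step is spread over many machines.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run the five MapReduce steps in memory, on one machine."""
    # Step 1: prepare the map() input (here, just a list of records).
    inputs = list(records)
    # Step 2: execute map(); each record yields zero or more (key, value) pairs.
    mapped = [pair for record in inputs for pair in map_fn(record)]
    # Step 3: shuffle; collect all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Step 4: execute reduce() once per key.
    reduced = {key: reduce_fn(key, values) for key, values in groups.items()}
    # Step 5: produce the final output, sorted by key.
    return sorted(reduced.items())


def map_temperature(line):
    """Turn a 'year,temperature' record into a (year, temperature) pair."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def reduce_max(year, temperatures):
    """Keep the highest temperature seen for a year."""
    return max(temperatures)


if __name__ == "__main__":
    readings = ["2013,21", "2014,30", "2013,25", "2014,18"]
    print(run_mapreduce(readings, map_temperature, reduce_max))
```

Because map() looks at one record at a time and reduce() looks at one key at a time, both steps can be split across many nodes, which is where the scalability comes from.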

10 MapReduce: Map
Take an input
Map the input to key-value pairs

11 MapReduce: Reduce
Take the key-value pairs as input
Do “something” to them to aggregate, classify, or compare the data
Produce an output of the results
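
As a rough illustration of what that “something” can be, the Python sketch below applies three different reducers to the same key-value pairs: one aggregates, one compares, one classifies. The helper `reduce_by_key`, the sample keys, and the threshold of 10 are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def reduce_by_key(pairs, reducer):
    """Group (key, value) pairs by key and apply reducer to each group's values."""
    ordered = sorted(pairs, key=itemgetter(0))  # groupby needs sorted input
    return {key: reducer([value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))}


pairs = [("nc", 3), ("va", 5), ("nc", 7), ("va", 1), ("nc", 2)]

totals = reduce_by_key(pairs, sum)    # aggregate: add up the values per key
largest = reduce_by_key(pairs, max)   # compare: keep the biggest value per key
labels = reduce_by_key(               # classify: label each key by its total
    pairs, lambda values: "high" if sum(values) >= 10 else "low")

if __name__ == "__main__":
    print(totals, largest, labels)
```

The grouping logic never changes; only the function handed to it does, which is why the Reduce stage can serve so many kinds of analysis.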

12 MapReduce: Example
Word count
How many times does each word appear in a document or collection of documents?
The Map stage maps each word to a count, via key-value pairs
The Reduce stage aggregates the values returned by the Map stage
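
A minimal Python sketch of word count follows. It runs in a single process rather than on a cluster, and the function names and the simple regular expression used to split words are assumptions for illustration; a real Hadoop job would express the same map and reduce logic through Hadoop's own interfaces.

```python
import re
from collections import defaultdict


def map_words(document):
    """Map stage: emit a (word, 1) pair for every word in the document."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def reduce_counts(word, counts):
    """Reduce stage: add up the 1s emitted for a single word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, group the pairs by word, then reduce each group."""
    grouped = defaultdict(list)
    for document in documents:
        for word, one in map_words(document):
            grouped[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "The lazy dog and the fox"]))
```

Each document can be mapped on a different node, since map_words() never needs to see more than one document at a time.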

13 MapReduce: Example

14 Hadoop: Cluster Setup
Setup of Hadoop is one of its major pitfalls
Lots of considerations: open file handles, network forwarding, etc.
Networking woes
Documentation is sparse; setup is not for the faint of heart

15 Hadoop: Wrap-up
So what can Hadoop do?
Beyond Hadoop, what is there?
Mahout
Hive
Pig
Hadoop has the ability to process data very quickly. Any problem that can be broken up into multiple parts, with those parts coming together at the end for the final solution, can be run on a Hadoop cluster.
Hadoop technologies:
Mahout: machine learning, natively written in Java.
Hive: data warehouse, data summarization, data analysis. Rather than writing a MapReduce job to check something big, you can essentially query the data as if it were a database, using something like SQL.
Pig: a query language designed to generate MapReduce jobs that are simple queries of the data set. It is similar to Hive but is designed around creating MapReduce jobs; things such as loading, sorting, filtering, grouping, joining, and moving data are all handled by Pig.

16 Acknowledgements Sina Tashakkori Michael Kepple

17 Questions?

