Map-Reduce Big Data, Map-Reduce, Apache Hadoop SoftUni Team Technical Trainers Software University



Table of Contents
1. Big Data
2. Map-Reduce
    What is Map-Reduce?
    How Does It Work?
    Mappers and Reducers
    Examples
3. Apache Hadoop

Big Data
What is Big Data Processing?

Big Data
 Big data == processing very large data sets (terabytes / petabytes)
 Data so large or complex that traditional data processing is inadequate
 Usually stored and processed by distributed databases
 Often related to analysis of very large data sets
 Typical components of big data systems:
    Distributed databases (like Cassandra, HBase and Hive)
    Distributed processing frameworks (like Apache Hadoop)
    Distributed processing models (like Map-Reduce)
    Distributed file systems (like HDFS)

Map-Reduce

What is Map-Reduce?
 Map-Reduce is a distributed processing framework
    A computational model for processing huge data sets (terabytes)
    Uses parallel processing on large clusters (thousands of nodes)
    Relies on distributed infrastructure, like an Apache Hadoop or MongoDB cluster
 The input and output data are stored in a distributed file system (or a distributed database)
 The framework takes care of scheduling, executing and monitoring tasks, and re-executes failed tasks

Map-Reduce: How Does It Work?
1. Split the input data set into independent chunks
2. Process each chunk by "map" tasks in parallel
    The "map" function groups the input data into key-value pairs
    Equal keys are processed by the same "reduce" node
3. The outputs of the "map" tasks are sorted, grouped by key, then sent as input to the "reduce" tasks
4. The "reduce" tasks aggregate the results for each key and produce the final output

The Map-Reduce Process
 The map-reduce process is a sequence of transformations, executed on several nodes in parallel
    Map: groups input chunks of data into key-value pairs
      E.g. splits documents (id, chunk-content) into words (word, count)
    Combine: sorts and groups all values by the same key
      E.g. produces a list of counts for each word
    Reduce: combines (aggregates) all values for a certain key

(input) → map → combine → reduce → (output)
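The pipeline above can be sketched in plain Java, with no Hadoop required. This is a hypothetical in-memory walk-through of the map, combine/shuffle and reduce phases; the input chunks and word-count payload are made up for illustration:

```java
import java.util.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        // Input: two "chunks" of a document, processed independently
        List<String> chunks = Arrays.asList("the cat sat", "the dog sat");

        // Map phase: each chunk emits (word, 1) key-value pairs
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String chunk : chunks)
            for (String word : chunk.split("\\s+"))
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle/combine phase: group all emitted values by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped)
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());

        // Reduce phase: aggregate the list of values for each key
        Map<String, Integer> reduced = new TreeMap<>();
        grouped.forEach((word, counts) ->
            reduced.put(word, counts.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(reduced); // {cat=1, dog=1, sat=2, the=2}
    }
}
```

In a real cluster each phase runs on different nodes and the "shuffle" moves pairs over the network so that equal keys land on the same reducer; here all three phases simply run in sequence in one JVM.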

Example: Counting Words
 We have a very large set of documents (e.g. 200 terabytes)
 We want to count how many times each word occurs
 Input: a set of documents {key + content}
 Mapper:
    Extracts the words from each document (words are used as keys)
    Transforms documents {key + content} → word-count pairs {word, count}
 Reducer:
    Sums the counts for each word
    Transforms {word, list<count>} → word-count pairs {word, count}

Counting Words: Mapper and Reducer

public void map(Object offset, Text docText, Context context)
        throws IOException, InterruptedException {
    String[] words = docText.toString().toLowerCase().split("\\W+");
    for (String word : words)
        context.write(new Text(word), new IntWritable(1));
}

public void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts)
        sum += count.get();
    context.write(word, new IntWritable(sum));
}
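The mapper above tokenizes with split("\\W+"), which splits on any run of non-word characters. A quick plain-Java check of what that regex produces (no Hadoop needed; the sample text is made up):

```java
public class TokenizeCheck {
    public static void main(String[] args) {
        String doc = "Hello, world! Hello... Hadoop?";
        // Same tokenization as the mapper: lowercase, then split on
        // runs of non-word characters (punctuation, whitespace)
        String[] words = doc.toLowerCase().split("\\W+");
        System.out.println(String.join("|", words));
        // → hello|world|hello|hadoop
    }
}
```

One caveat worth knowing: if a line starts with punctuation, split produces an empty first token, so production code typically filters out empty strings before emitting pairs.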

Word Count in Apache Hadoop Live Demo

Example: Extract Data from CSV Report
 We are given a CSV file holding real estate sales data:
    Estate address, city, ZIP code, state, # beds, # baths, square feet, sale date, price and GPS coordinates (latitude + longitude)
 Find all cities that have sales in the price range [ … ]
 As a side effect, find the sum of all sales by city
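Extracting the two fields the job needs can be checked in plain Java. The column order below is an assumption inferred from the mapper on the next slide (city at index 1, price at index 9), and the sample line is made up:

```java
public class CsvFieldCheck {
    public static void main(String[] args) {
        // Hypothetical CSV line in the assumed column order:
        // street,city,zip,state,beds,baths,sq_ft,type,sale_date,price,lat,long
        String line = "3526 HIGH ST,SACRAMENTO,95838,CA,2,1,836,"
                    + "Residential,Wed May 21 2008,59222,38.63,-121.43";
        String[] fields = line.split(",");
        String city = fields[1];                  // second column: city
        int price = Integer.parseInt(fields[9]);  // tenth column: price
        System.out.println(city + " -> " + price); // SACRAMENTO -> 59222
    }
}
```

Note that a naive split(",") breaks if any field itself contains a comma; a real job would use a CSV parser or guarantee the input format.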

Process CSV Report: How Does It Work?

SELECT city, SUM(price)
FROM Sales
GROUP BY city

[Diagram: the input CSV is chunked; "map" emits (city, price) pairs; "reduce" outputs sum(price) per city, e.g. for SACRAMENTO, LINCOLN, RIO LINDA]
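The SQL above can be mirrored in plain Java with a stream groupingBy, as a sketch of what the map-reduce job computes (requires Java 16+ for records; the cities and prices are made up):

```java
import java.util.*;
import java.util.stream.*;

public class GroupBySum {
    record Sale(String city, long price) {}

    public static void main(String[] args) {
        List<Sale> sales = List.of(
            new Sale("SACRAMENTO", 59222), new Sale("LINCOLN", 90000),
            new Sale("SACRAMENTO", 68212), new Sale("RIO LINDA", 100000));

        // Equivalent of: SELECT city, SUM(price) FROM Sales GROUP BY city
        Map<String, Long> sumByCity = sales.stream()
            .collect(Collectors.groupingBy(Sale::city,
                     Collectors.summingLong(Sale::price)));

        System.out.println(sumByCity.get("SACRAMENTO")); // 127434
    }
}
```

The grouping step plays the role of the shuffle (equal keys collected together) and summingLong plays the role of the reducer; map-reduce distributes exactly this computation across many nodes.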

Process CSV Report: Mapper and Reducer

public void map(Object offset, Text inputCSVLine, Context context)
        throws IOException, InterruptedException {
    String[] fields = inputCSVLine.toString().split(",");
    String city = fields[1];
    int price = Integer.parseInt(fields[9]);
    if (price > … && price < …)  // price bounds omitted on the original slide
        context.write(new Text(city), new LongWritable(price));
}

public void reduce(Text city, Iterable<LongWritable> prices, Context context)
        throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable val : prices)
        sum += val.get();
    context.write(city, new LongWritable(sum));
}

Processing CSV Report in Apache Hadoop Live Demo

Apache Hadoop Distributed Processing Framework

Apache Hadoop
 The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing:
    Hadoop Distributed File System (HDFS) – a distributed file system that transparently moves data across Hadoop cluster nodes
    Hadoop MapReduce – the map-reduce framework
    HBase – a scalable, distributed database for large tables
    Hive – SQL-like queries over large data sets
    Pig – a high-level data-flow language for parallel computation
 Hadoop is driven by big players like IBM, Microsoft, Facebook, VMware, LinkedIn, Yahoo, Cloudera, Intel, Twitter, Hortonworks, …

Hadoop Ecosystem
 HDFS Storage
    Redundant (3 copies)
    For large files – large blocks: 64 MB or 128 MB per block
    Can scale to 1000s of nodes
 MapReduce API
    Batch (job) processing
    Distributed and localized to clusters
    Auto-parallelizable for huge amounts of data
    Fault-tolerant (auto retries)
    Adds high availability and more
 Hadoop Libraries
    Pig, Hive, HBase, others

Hadoop Cluster
 HDFS (physical) storage
    One Name Node – contains a web site to view cluster information (Hadoop v2 can use multiple Name Nodes for high availability)
    A Secondary Name Node
    Many Data Nodes (Data Node 1, Data Node 2, Data Node 3, …) – 3 copies of each block by default; block size is 64 or 128 MB
 Work with data in HDFS using common Linux shell commands

Common Hadoop Shell Commands

hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <localsrc> <dst>
hadoop fs -put <localsrc> hdfs://nn.example.com/hadoopfile
sudo hadoop jar <jarfile> [mainClass] <args…>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localdst>

 Tips:
    sudo means "run as administrator" (super user)
    Some distributions use hadoop dfs rather than hadoop fs

Hadoop Shell Commands Live Demo

Hadoop MapReduce
 Apache Hadoop MapReduce
    The world's leading implementation of the map-reduce computational model
    Provides parallelized (scalable) computing
    For processing very large data sets
    Fault-tolerant
    Runs on commodity hardware
 Implemented in many cloud platforms: Amazon EMR, Azure HDInsight, Google Cloud, Cloudera, Rackspace, HP Cloud, …

Hadoop Map-Reduce Pipeline

Hadoop: Getting Started
 Download and install Java and Hadoop
    You will need to install Java first
 Or download a pre-installed Hadoop virtual machine (VM):
    Hortonworks Sandbox
    Cloudera QuickStart VM
 You can also use Hadoop in the cloud or in a local emulator
    E.g. Azure HDInsight Emulator

Playing with Apache Hadoop Live Demo

Summary
 Big data == processing huge data sets that are too big for a single machine
    Uses a cluster of computing nodes
 Map-reduce == a computational paradigm for parallel processing of huge data sets
    Data is chunked, then mapped into groups; the groups are processed and the results are aggregated
    Highly scalable; can process petabytes of data
 Apache Hadoop – the industry's leading map-reduce framework

Questions?
Map-Reduce

License  This course (slides, examples, labs, videos, homework, etc.) is licensed under the "Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International" licenseCreative Commons Attribution- NonCommercial-ShareAlike 4.0 International 28  Attribution: this work may contain portions from  "Fundamentals of Computer Programming with C#" book by Svetlin Nakov & Co. under CC-BY-SA licenseFundamentals of Computer Programming with C#CC-BY-SA  "Data Structures and Algorithms" course by Telerik Academy under CC-BY-NC-SA licenseData Structures and AlgorithmsCC-BY-NC-SA

Free Software University
 Software University Foundation – softuni.org
 Software University – High-Quality Education, Profession and Job for Software Developers – softuni.bg
 Software University @ Facebook – facebook.com/SoftwareUniversity
 Software University @ YouTube – youtube.com/SoftwareUniversity
 Software University Forums – forum.softuni.bg