Leon Kos University of Ljubljana

Presentation transcript:

Introduction to Hadoop Hands-on Training Workshop Big data analysis with RHadoop Leon Kos University of Ljubljana European HPC Summit Week 2018 and PRACEdays18 29 May 2018, Ljubljana

What is BIG DATA? Quintillions of bytes of data are produced every day. Traditional data analysis tools cannot handle such quantities of data. To tackle “big data” we will use the widely adopted Hadoop framework with the Map/Reduce methodology.

What is Hadoop?
See https://wiki.apache.org/hadoop/ProjectDescription
Java-based software that runs on commodity hardware
Map/Reduce programming paradigm
Distributed filesystem

What is Map/Reduce? Map/Reduce is the style in which most programs running on Hadoop are written. In this style, the input is broken into small pieces that are processed independently (the map part). The results of these independent processes are then collated into groups and processed as groups (the reduce part).
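The map, group, and reduce steps can be tried out locally with an ordinary shell pipeline. The following is only a minimal sketch of the idea, not how Hadoop itself runs jobs: the function names `map_words`, `reduce_counts` and `word_count` are made up for this illustration, and `sort` stands in for the grouping step that Hadoop performs between map and reduce.

```shell
# Word count as a local map -> group -> reduce pipeline (analogy only, no Hadoop involved).

# Map: split the input into independent pieces, here one lowercase word per line.
map_words() {
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | grep -v '^$'
}

# Reduce: the input arrives grouped by word; emit "word<TAB>count" for each group.
reduce_counts() {
  uniq -c | awk '{ print $2 "\t" $1 }'
}

# sort collates identical words into adjacent groups before the reduce step.
word_count() {
  map_words | sort | reduce_counts
}

printf 'Hadoop stores data\nHadoop processes data\n' | word_count
```

Each stage reads only its standard input, so the map stage could be applied to separate pieces of the input independently, which is the property Map/Reduce relies on.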

What is HDFS? HDFS stands for Hadoop Distributed File System. This is how the input and output files of Hadoop programs are normally stored. The major advantage of HDFS is that it provides very high input and output speeds. This is critical for the performance of highly parallel programs: as the number of processors working on a problem increases, so do the overall demand for input data and the overall rate at which output is produced. HDFS provides very high bandwidth by storing chunks of files scattered throughout the Hadoop cluster. Because files are stored in multiple places and tasks are scheduled with care, tasks are placed near their input data and output data is largely stored where it is created. An HDFS cluster is built from a NameNode and one or more DataNode instances.
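To get a feeling for how a file is split into chunks and stored in multiple places, the arithmetic can be sketched in a few lines of shell. The block size of 128 MB and the replication factor of 3 are assumptions here (they are common HDFS defaults, but a cluster may be configured differently), and the helper names `blocks_for` and `replicas_for` are invented for this example.

```shell
# Assumed values: 128 MB blocks, 3 replicas per block (a cluster may use other settings).
BLOCK_MB=128
REPLICATION=3

# Number of blocks needed for a file of the given size in MB (rounded up).
blocks_for() {
  local size_mb=$1
  echo $(( (size_mb + BLOCK_MB - 1) / BLOCK_MB ))
}

# Total number of block replicas stored across the DataNodes for that file.
replicas_for() {
  local size_mb=$1
  echo $(( $(blocks_for "$size_mb") * REPLICATION ))
}

echo "1000 MB file: $(blocks_for 1000) blocks, $(replicas_for 1000) stored replicas"
```

Under these assumptions a 1000 MB file occupies 8 blocks and 24 stored replicas, so many tasks can read different parts of it from different machines at the same time.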

What is YARN?

Starting Hadoop, RStudio and RHadoop
Virtual machine login
Username: hduser
Password: Hadoop
Open a terminal from the taskbar

To start the Hadoop file system, first open the terminal by clicking the black icon at the bottom left and type:
$ start-dfs.sh
$ start-yarn.sh
Then open the Firefox browser and enter the following addresses:
NameNode information: http://localhost:50070
DataNode information (not really needed): http://localhost:50075
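Besides the web pages, a quick way to see whether the start scripts worked is to look at the running Java processes with `jps`, a tool shipped with the JDK. The sketch below assumes a single-node setup in which the five daemons listed in `EXPECTED_DAEMONS` should be running; both that list and the helper name `missing_daemons` are assumptions for illustration, not something stated in the slides.

```shell
# Daemons assumed to be running on a single-node setup after
# start-dfs.sh and start-yarn.sh (adjust for the actual configuration).
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# Reads jps-style output ("<pid> <ClassName>" per line) on standard input
# and prints every expected daemon that is not listed.
missing_daemons() {
  local running daemon
  running=$(awk '{ print $2 }')
  for daemon in $EXPECTED_DAEMONS; do
    # -x matches whole lines, so "NameNode" does not match "SecondaryNameNode".
    if ! printf '%s\n' "$running" | grep -qx "$daemon"; then
      echo "$daemon"
    fi
  done
}

# Usage on the virtual machine:
#   jps | missing_daemons
```

If the command prints nothing, all assumed daemons are up; any name it prints points to a service that did not start.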

Using HDFS
The Hadoop file system is different from the local file system of your machine. Whenever you want to work with the Hadoop file system, you have to use the hadoop fs command:
$ hadoop fs
$ hadoop fs -ls /
To create a directory in the Hadoop file system and copy a local file into it, use the following commands:
$ hadoop fs -mkdir examples
$ hadoop fs -copyFromLocal source_path destination_path
For example, try:
$ hadoop fs -copyFromLocal /home/hduser/week2/Term_frequencies_sentence-level_lemmatized_utf8.csv .
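The create, copy, and list steps above can be bundled into one small function. This is a sketch only: `hdfs_upload` is a hypothetical helper name, and it uses `-mkdir -p` rather than plain `-mkdir` so that the call also succeeds when the directory already exists.

```shell
# Hypothetical helper: create an HDFS directory, copy a local file into it,
# and list the directory. Relative HDFS paths resolve under the user's
# HDFS home directory.
hdfs_upload() {
  local local_file=$1 hdfs_dir=$2
  hadoop fs -mkdir -p "$hdfs_dir" &&
    hadoop fs -copyFromLocal "$local_file" "$hdfs_dir" &&
    hadoop fs -ls "$hdfs_dir"
}

# Usage on the workshop virtual machine:
#   hdfs_upload /home/hduser/week2/Term_frequencies_sentence-level_lemmatized_utf8.csv examples
```

The `&&` chaining stops at the first failing step, so the listing is shown only when the copy actually succeeded.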

THANK YOU FOR YOUR ATTENTION for PART 1 of Big Data workshop
https://www.futurelearn.com/courses/big-data-r-hadoop
PRACE Autumn School 2018 - HPC for engineering and Life sciences
https://events.prace-ri.eu/event/as18
www.prace-ri.eu