Leon Kos University of Ljubljana

Presentation transcript:

Introduction to Hadoop. Hands-on Training Workshop: Big data analysis with RHadoop. Leon Kos, University of Ljubljana. European HPC Summit Week 2018 and PRACEdays18, 29 May 2018, Ljubljana.

What is BIG DATA? Quintillions of bytes of data are produced every day. Traditional data analysis tools cannot handle such quantities of data. To tackle “big data” we’ll use the widely adopted Hadoop framework with the Map/Reduce methodology.

What is Hadoop? Hadoop is Java-based software that runs on commodity hardware. It combines the Map/Reduce programming paradigm with a distributed filesystem. See https://wiki.apache.org/hadoop/ProjectDescription

What is Map/Reduce? Map/Reduce is the style in which most programs running on Hadoop are written. In this style, input is broken into tiny pieces which are processed independently (the map part). The results of these independent processes are then collated into groups and processed as groups (the reduce part).
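The map, collate and reduce steps described above can be illustrated with a small word count. This is only a single-machine sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration, and on a real cluster each piece would be handled by a separate task.

```python
from collections import defaultdict


def map_phase(piece):
    """Map: turn one small piece of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in piece.split()]


def shuffle(pairs):
    """Collate: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: process one group as a whole (here: sum the counts)."""
    return key, sum(values)


def word_count(pieces):
    mapped = []
    for piece in pieces:  # pieces are independent of each other
        mapped.extend(map_phase(piece))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["big data with Hadoop", "Hadoop runs Map Reduce", "big clusters"]
    print(word_count(lines))
```

Because every call to map_phase only looks at its own piece, the pieces could be processed in any order or in parallel; only the collate step needs to see the results of all map calls.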

What is HDFS? HDFS stands for Hadoop Distributed File System. This is how the input and output files of Hadoop programs are normally stored. The major advantage of HDFS is that it provides very high input and output speeds. This is critical for the performance of highly parallel programs: as the number of processors working on a problem increases, so do the overall demand for input data and the overall rate at which output is produced. HDFS provides very high bandwidth by storing chunks of files scattered throughout the Hadoop cluster. Because files are stored in multiple places and the location of each task is chosen with this in mind, tasks are placed near their input data, and output data is largely stored where it is created. An HDFS cluster is built from a NameNode and one or more DataNode instances.
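The idea of scattering chunks over the cluster and running tasks near their data can be sketched with a toy model. This is not how HDFS is implemented: the round-robin placement, the tiny block size and all names below (split_into_blocks, place_replicas, schedule_tasks, dn1 to dn3) are assumptions chosen for illustration only.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication):
    """Toy placement: store each block on `replication` distinct nodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def schedule_tasks(placement):
    """Run each task on the least-loaded node that already holds its block."""
    load = {}
    assignment = {}
    for block, nodes in placement.items():
        node = min(nodes, key=lambda n: load.get(n, 0))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", 4)
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
    print(placement)
    print(schedule_tasks(placement))
```

In this model every task reads its block from the node it runs on, so no input data has to cross the network, and the replicas give the scheduler a choice of nodes for each block.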

What is YARN?

Starting Hadoop, RStudio and RHadoop
Virtual machine login:
  Username: hduser
  Password: Hadoop
Open a terminal from the taskbar.

To start the Hadoop file system, first open the terminal by clicking on the black icon on the bottom left and type:
  $ start-dfs.sh
  $ start-yarn.sh
Then open the Firefox browser and enter the following addresses:
  NameNode information: http://localhost:50070
  DataNode information (optional): http://localhost:50075
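Instead of opening the browser, one could also check from a script whether the two web pages respond. The sketch below uses only the Python standard library; the helper name is_up is made up here, and the addresses are the ones given above, which may differ on other Hadoop installations.

```python
import urllib.error
import urllib.request

NAMENODE_UI = "http://localhost:50070"  # address from the workshop slide
DATANODE_UI = "http://localhost:50075"  # address from the workshop slide


def is_up(url, timeout=2.0):
    """Return True if a web server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, unknown host and similar failures.
        return False


if __name__ == "__main__":
    for name, url in (("NameNode", NAMENODE_UI), ("DataNode", DATANODE_UI)):
        print(name, url, "up" if is_up(url) else "not reachable")
```

If the NameNode is reported as not reachable shortly after start-dfs.sh, it may simply still be starting; running the check again after a few seconds is a reasonable first step.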

Using HDFS
The Hadoop file system is separate from the local file system of your machine. Whenever you want to work with the Hadoop file system, you have to use the hadoop fs command:
  $ hadoop fs
  $ hadoop fs -ls /
To create a directory in the Hadoop file system and to copy a local file into the Hadoop file system, use the following commands:
  $ hadoop fs -mkdir examples
  $ hadoop fs -copyFromLocal source_path destination_path
For example, try:
  $ hadoop fs -copyFromLocal /home/hduser/week2/Term_frequencies_sentence-level_lemmatized_utf8.csv .
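The same hadoop fs commands can be issued from a script. The following is a minimal sketch, assuming the hadoop executable is on the PATH of the virtual machine; the helpers hadoop_fs and run are hypothetical names introduced here. By default the commands are only printed (dry run), so the script is safe to try on a machine without Hadoop.

```python
import shlex
import subprocess


def hadoop_fs(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print("$ " + shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if the command fails


if __name__ == "__main__":
    local_csv = "/home/hduser/week2/Term_frequencies_sentence-level_lemmatized_utf8.csv"
    for cmd in (
        hadoop_fs("-ls", "/"),
        hadoop_fs("-mkdir", "examples"),
        hadoop_fs("-copyFromLocal", local_csv, "."),
    ):
        run(cmd)  # pass dry_run=False on the workshop VM to really execute
```

Passing the arguments as a list rather than as one shell string avoids quoting problems with file names that contain spaces.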

THANK YOU FOR YOUR ATTENTION for PART 1 of Big Data workshop https://www.futurelearn.com/courses/big-data-r-hadoop PRACE Autumn School 2018 - HPC for engineering and Life sciences https://events.prace-ri.eu/event/as18 www.prace-ri.eu