VI-SEEM data analysis service

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

MapReduce Simplified Data Processing on Large Clusters
MapReduce.
MapReduce in Action Team 306 Led by Chen Lin College of Information Science and Technology.
Working with pig Cloud computing lecture. Purpose  Get familiar with the pig environment  Advanced features  Walk though some examples.
Other Map-Reduce (ish) Frameworks William Cohen. Y:Y=Hadoop+X or Hadoop~=Y What else are people using? – instead of Hadoop – on top of Hadoop.
Map reduce with Hadoop streaming and/or Hadoop. Hadoop Job Hadoop Mapper Hadoop Reducer Partitioner Hadoop FileSystem Combiner Shuffle Sort Shuffle Sort.
Searching with Lucene Chapter 2. For discussion Information retrieval What is Lucene? Code for indexer using Lucene Pagerank algorithm.
Jian Wang Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das Yahoo! Inc. Bangalore & Apache Software Foundation.
HADOOP ADMIN: Session -2
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Hadoop and HDFS
Cloud Distributed Computing Platform 2 Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
HAMS Technologies 1
Vyassa Baratham, Stony Brook University April 20, 2013, 1:05-2:05pm cSplash 2013.
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.
Alastair Duncan STFC Pre Coffee talk STFC July 2014 The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project.
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication.
Map-Reduce Big Data, Map-Reduce, Apache Hadoop SoftUni Team Technical Trainers Software University
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
Other Map-Reduce (ish) Frameworks William Cohen. Y:Y=Hadoop+X or Hadoop~=Y What else are people using? – instead of Hadoop – on top of Hadoop.
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
MapReduce using Hadoop Jan Krüger … in 30 minutes...
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.
Big Data is a Big Deal!.
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Hadoop Aakash Kag What Why How 1.
How to download, configure and run a mapReduce program In a cloudera VM Presented By: Mehakdeep Singh Amrit Singh Chaggar Ranjodh Singh.
Large-scale file systems and Map-Reduce
Hadoop MapReduce Framework
MapReduce Types, Formats and Features
VI-SEEM biobank implementation
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.
15-826: Multimedia Databases and Data Mining
Introduction to MapReduce and Hadoop
Calculation of stock volatility using Hadoop and map-reduce
Overview of Hadoop MapReduce MapReduce is a soft work framework for easily writing applications which process vast amounts of.
Lecture 3. MapReduce Instructor: Weidong Shi (Larry), PhD
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
Persistent identifiers in VI-SEEM
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
Cloud Distributed Computing Environment Hadoop
CS6604 Digital Libraries IDEAL Webpages Presented by
湖南大学-信息科学与工程学院-计算机与科学系
Hadoop Basics.
Guide To UNIX Using Linux Third Edition
CS110: Discussion about Spark
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
Source Code Repository
Lecture 18 (Hadoop: Programming Examples)
Data processing with Hadoop
CSE 491/891 Lecture 24 (Hive).
Charles Tappert Seidenberg School of CSIS, Pace University
MAPREDUCE TYPES, FORMATS AND FEATURES
CS639: Data Management for Data Science
Bryon Gill Pittsburgh Supercomputing Center
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
MapReduce: Simplified Data Processing on Large Clusters
Map Reduce, Types, Formats and Features
Presentation transcript:

VI-SEEM data analysis service Petar Jovanovic Institute of Physics Belgrade The VI-SEEM project initiative is co-funded by the European Commission under the H2020 Research Infrastructures contract no. 675121

Agenda Apache Hadoop Hadoop basic component HDFS MapReduce Word count example with mvn archetype from VI-SEEM code repo Hadoop streaming More examples VI-SEEM REG CL, 11-13 Oct 2017 2

Apache Hadoop Open-source programming framework that supports processing and storing large data sets in a distributed computing environment. Based on Google’s MapReduce and Google File System papers https://research.google.com/archive/mapreduce.html https://research.google.com/archive/gfs.html Evolved from an effort to build an open source search engine (Nutch) Hadoop ecosystem VI-SEEM REG CL, 11-13 Oct 2017 3

Hadoop: basic components HDFS A distributed file system Cuts files into chunks Stores chunks across multiple machines with multiple copies helps with fault tolerance and access speeds MapReduce A basic approach to programming data analysis workflows in Hadoop. A batch system of execution Two types of nodes: Name node Data node VI-SEEM REG CL, 11-13 Oct 2017 4

HDFS HDFS file system commands (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html) list files in the current directory: hdfs dfs -ls put a file into HDFS: hdfs dfs -put data_file.dat get a file from HDFS: hdfs dfs -get data_file.dat VI-SEEM REG CL, 11-13 Oct 2017 5

MapReduce Works on data as key-value pairs 2 step approach: Map is a function applied to every record in a dataset, which translates it into a key-value pair Reduce is a function that combines key-value pairs in some meaningful way and calculates results on them (e.g. word frequencies in a language corpus) Between map and reduce step, key-value pairs are sorted by key. MapReduce in Unix commands: $ cat input.dat | mapper | sort | reducer > output.dat VI-SEEM REG CL, 11-13 Oct 2017 6

MapReduce - basic VI-SEEM REG CL, 11-13 Oct 2017 7

MapReduce - with combiner VI-SEEM REG CL, 11-13 Oct 2017 8

MapReduce - multiple reducers VI-SEEM REG CL, 11-13 Oct 2017 9

Word count example Mapper Reducer Combiner split input lines into words add 1 as initial count of each word Reducer for each word (key), sum its counts (value) Combiner same as reducer, but runs locally on mapper outputs commutative and associative operation useful when it can reduce the mapper output Maven project archetype in VI-SEEM code repository: https://code.vi-seem.eu/petarj/hadoop-archetype VI-SEEM REG CL, 11-13 Oct 2017 10

Hadoop streaming (1) MapReduce in Unix commands: $ cat input.dat | mapper | sort | reducer > output.dat Hadoop streaming pipes input through standard input to the mapper, whose output is sorted and then piped to reducer. Inputs are processed line by line, and each line is a string the mapper and the reducer are responsible for parsing the input lines slight difference to the Java api where reducer gets a list of values associated with a key VI-SEEM REG CL, 11-13 Oct 2017 11

Hadoop streaming (2) Command to execute: $ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \ -files mapper_executable,reducer_executable[,combiner_executable] \ -mapper mapper_executable \ [-combiner combiner_executable] \ -reducer reducer_executable \ -input input_data \ -output output_data Combiner is optional To specify more than one reducer add parameter: -D mapreduce.job.reduces=42 VI-SEEM REG CL, 11-13 Oct 2017 12

Word count with straming Getting the code $ git clone https://code.vi-seem.eu/petarj/hadoop-training.git Executing $ cd hadoop-training/wordcount $ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \ -files=mapper.py,reducer.py \ -mapper mapper.py \ -combiner reducer.py \ -reducer reducer.py \ -input /user/petarj/train_v2.txt -output wcout git -c http.sslVerify=false clone VI-SEEM REG CL, 11-13 Oct 2017 13

Twitter user ranking (1) Goal: find influential Twitter users Simple PageRank algorithm: PR(u) - page rank of u L(v) - number of out edges of v Bu - set of nodes adjacent to u u & v - any two distinct nodes In our case: L(v) is the number of people v is following, Bu are all the people u is following. VI-SEEM REG CL, 11-13 Oct 2017 14

Twitter user ranking (2) Input data: Text file: /user/petarj/twitter_rv.net (26 GB) Each line contains 2 numbers separated by tabs First column is id of a user, second column is an id of his follower Two map-reduce steps needed to perform the analysis: Convert data from edge list to adjacency list and set initial PageRank Compute the PageRank approximation Running multiple reducers VI-SEEM REG CL, 11-13 Oct 2017 15

Twitter user ranking (3) STEP 1 STEP 2 VI-SEEM REG CL, 11-13 Oct 2017 16

Twitter demo Getting the code $ git clone https://code.vi-seem.eu/petarj/hadoop-training.git Executing $ cd hadoop-training/twitterrank $ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \ -D mapreduce.job.reduces=20 -files=mapper1.py,reducer1.py \ -mapper mapper1.py \ -reducer reducer1.py \ -input /user/petarj/twitter_rv.net -output twrank_step1 $ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \ -D mapreduce.job.reduces=6 -files=mapper2.py,reducer2.py \ -mapper mapper2.py \ -reducer reducer2.py \ -input twrank_step1 -output twrank_step2 VI-SEEM REG CL, 11-13 Oct 2017 17

Thank you for your attention. Questions Thank you for your attention. VI-SEEM REG CL, 11-13 Oct 2017 18