OPENMARU dm4ir.tistory.com psyoblade.egloos.com
"Write once, Run never again" Hadoop user group uk: dumbo
hadoop-mapred.sh
    #!/bin/bash
    hadoop jar hadoop-streaming.jar \
      -input /user/openmaru/input/sample.txt \
      -output /user/openmaru/output \
      -mapper Mapper.py \
      -reducer Reducer.py \
      -file Mapper.py \
      -file Reducer.py

    hadoop dfs -cat output/part* > ./bulk.sql
    mysql -uuser -ppasswd -Ddb < ./bulk.sql

sample.txt
    This distribution includes cryptographic software. The country in
    which you currently reside may have restrictions on the import,
    possession, use, and/or re-export to another country, of …
Diagram: lines stored in HDFS are fed to the map workers, and a single reduce task (reduceNum=1) writes the SQL output with a local write. The sample records are keyed by byte offset, e.g. 160, 228, 293 for successive lines of sample.txt.
Mapper.py
    import sys, re, urllib

    for line in sys.stdin:
        m = re.match(r'^.*\[([0-9]*)/([a-zA-Z]*)/([0-9]*)\].*query=(.*)start.*$', line)
        date = parse(m)   # parse year, month, day into a date string
        query = urllib.unquote(m.group(4).split('&', 1)[0])
        print '%s_%s\t%s' % (date, query, count)

access.log
    [30/Sep/2008:00:10:…] "GET /openmaru.search?query=%EC%98%A4%ED%94%88%EB%A7%88%EB%A3%A8&start=0&count=5&……&groupby=1 HTTP/1.1"
    some queries arrive as raw UTF-8 bytes, e.g. "\xeb\x8f\x84\xec\x8b\x9c"

Reducer.py
    import sys

    _dict = {}
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        _dict[key] = _dict.get(key, 0) + int(value)
    for key, value in _dict.iteritems():
        print key.split('_', 1), value

    (example: repeated map outputs for the query 동방신기 are summed by the reducer to a count of 30)
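For reference, a self-contained version of the mapper that runs as-is under Python 2; the YYYYMMDD date format, the fixed count of 1 per hit, and the skipping of non-matching lines are assumptions for illustration, not taken from the original slides.

    #!/usr/bin/env python
    # Mapper.py -- minimal runnable sketch of the pattern above (Python 2).
    import re
    import sys
    import urllib

    MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05',
              'Jun': '06', 'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10',
              'Nov': '11', 'Dec': '12'}
    PATTERN = re.compile(r'^.*\[([0-9]+)/([A-Za-z]+)/([0-9]+).*query=([^ ]*)')

    for line in sys.stdin:
        m = PATTERN.match(line)
        if not m:
            continue                                         # not a search request, skip it
        date = '%s%s%02d' % (m.group(3), MONTHS.get(m.group(2), '00'), int(m.group(1)))
        query = urllib.unquote(m.group(4).split('&', 1)[0])  # drop &start=... and decode %xx
        print '%s_%s\t%d' % (date, query, 1)                 # key = date_query, value = 1 hit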
Wrapper map/reduce (for streaming): ProcessBuilder

PipeMapRed.java
    ProcessBuilder builder = new ProcessBuilder();
    builder.environment().putAll(…);
    sim = builder.start();
    clientOut_ = new DataOutputStream(…);
    clientIn_  = new DataInputStream(…);
    clientErr_ = new DataInputStream(…);

PipeMapper.java
    write(key);
    clientOut_.write('\t');
    write(value);
    clientOut_.write('\n');
Diagram: each map task forks Mapper.py and each reduce task forks Reducer.py; records go to the child process over its input stream (stdin) and come back over its output stream (stdout).
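The same handshake can be mimicked outside Hadoop; the toy driver below is not Hadoop code, just a Python analogue of the fork-and-pipe mechanism, and it assumes the sample.txt and Mapper.py from the earlier slides sit in the current directory.

    #!/usr/bin/env python
    # Start the user script, write key<TAB>value<NEWLINE> records to its stdin
    # (key = byte offset, value = the line), and collect whatever it prints.
    import subprocess

    child = subprocess.Popen(['python', 'Mapper.py'],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    records, offset = [], 0
    for line in open('sample.txt'):
        records.append('%d\t%s' % (offset, line))
        offset += len(line)

    out, _ = child.communicate(''.join(records))
    print out    # the mapper's stdout becomes the task output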
Diagram: web servers copy apache access logs into the Hadoop cluster by cron (hadoop dfs -copyFromLocal), Map & Reduce produces bulk.sql, and another cron job loads it into the legacy MySQL database (mysql < bulk.sql).

stat_term_count {
    date  char(8)
    term  varchar(512)
    count int
}
hadoop-mapred.sh
    #!/bin/bash
    inputformat="org.apache.hadoop.mapred.TextInputFormat"
    outputformat="org.apache.hadoop.mapred.TextOutputFormat"
    partitioner="org.apache.hadoop.mapred.lib.HashPartitioner"
    rednum="20"
    cmdenv="TYPE=single"

    hadoop jar hadoop-streaming.jar \
      … \
      -numReduceTasks $rednum \
      -inputformat $inputformat \
      -outputformat $outputformat \
      -jobconf $tasknum \
      -cmdenv $cmdenv

    for i in $(seq 0 19); do
      num=`printf "%05d" $i`
      hadoop dfs -cat output/part-$num >> ./bulk.sql
    done
    mysql -uchef -pzjvl -Dtest < bulk.sql
Reducer.py
    import os

    _type = 'single'
    if "TYPE" in os.environ:
        _type = os.environ["TYPE"]

    # …reduce code…

    print 'CREATE TABLE IF NOT EXISTS %s (…);' % (table)
    print 'SET NAMES UTF8;'
    print 'LOCK TABLES %s WRITE;' % (table)
    for key, value in _dict.iteritems():
        date, query = key.split('_', 1)
        count = int(value)
        print "INSERT INTO %s VALUES ('%s', '%s', %d);" % (table, date, query, count)
    print 'UNLOCK TABLES;'
4 datanodes
    reducenum: 1
    logfile: access.log, 3.9 GB
    upload: 1 min (replication 1), ~60 MB/s
    finished in: 4 min 4 sec
    output: 313,543,415 bytes, bulk.sql created

20 datanodes
    reducenum: 1~20
    logfile: access.log, 3.9 GB
    upload: 16 min (replication 3), 3~4 MB/s
    finished in: 2 min 2 sec (1 reducer), 1 min 48 sec (20 reducers)
    output: 313,543,415 bytes, bulk.sql created
Hadoop installation (namenode, jobtracker); debugging
Python quirks: url-encoding, tab characters (.vimrc), SQL filtering, unicode (mysql)
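The Python quirks above mostly surface while turning decoded queries into SQL; a minimal cleanup sketch, where the helper name and the exact filtering rules are illustrative assumptions, not from the original slides:

    # clean_query -- hypothetical helper showing the url-encoding / tab /
    # SQL / unicode handling the issues above call for (Python 2).
    import urllib

    def clean_query(raw):
        query = urllib.unquote(raw)                               # url-encoding: %EC%98%A4... -> raw bytes
        query = query.replace('\t', ' ').replace('\n', ' ')       # tabs/newlines would break key\tvalue framing
        query = query.replace('\\', '\\\\').replace("'", "\\'")   # escape before building INSERT statements
        try:
            query.decode('utf-8')                                 # reject bytes mysql (SET NAMES UTF8) would choke on
        except UnicodeDecodeError:
            return None
        return query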
HADOOP-1328
    Streaming scripts can update counters and status by writing lines of the form
    "reporter:counter:<group>,<counter>,<amount>" and "reporter:status:<message>" to stderr.

HADOOP-3429
    Increased the size of the streaming I/O buffer to 128 KB:
    private final static int BUFFER_SIZE = 128 * 1024;
    clientOut_ = new DataOutputStream(new BufferedOutputStream(…, BUFFER_SIZE));

HADOOP-3379
    "stream.non.zero.exit.status.is.failure" now defaults to "true".
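From Python the reporter lines simply go to stderr; the counter group and names below are made up for illustration.

    import sys

    def incr_counter(group, counter, amount=1):
        # picked up by streaming as a counter update (HADOOP-1328)
        sys.stderr.write('reporter:counter:%s,%s,%d\n' % (group, counter, amount))

    def set_status(message):
        # shown as the task status in the job tracker UI
        sys.stderr.write('reporter:status:%s\n' % message)

    for line in sys.stdin:
        incr_counter('openmaru', 'lines_seen')
        # ... map logic ...
    set_status('mapper finished')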
HowToDebugMapReducePrograms
    run in local mode: mapred.job.tracker, fs.default.name
    LOG.INFO / printStackTrace()
    keep.failed.task.files, then re-run the failed task with
        hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
    kill -QUIT, jps, jstack for thread dumps
    Reporter
    manually: python Mapper.py > map.out, python Reducer.py > reducer.out
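The manual run can be wrapped in a small driver that reproduces the map-sort-reduce contract on one machine; the script below and its file names are an illustration, not part of the original deck.

    #!/usr/bin/env python
    # local_test.py -- run Mapper.py and Reducer.py without Hadoop (Python 2).
    import subprocess

    # 1. map: feed the raw log through Mapper.py
    mapper = subprocess.Popen(['python', 'Mapper.py'],
                              stdin=open('access.log'), stdout=subprocess.PIPE)
    map_out = mapper.communicate()[0]

    # 2. shuffle: Hadoop sorts map output by key between the two phases
    shuffled = ''.join(sorted(map_out.splitlines(True)))

    # 3. reduce: pipe the sorted records into Reducer.py, save the result
    reducer = subprocess.Popen(['python', 'Reducer.py'],
                               stdin=subprocess.PIPE, stdout=open('reduce.out', 'w'))
    reducer.communicate(shuffled)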
WordCount.py (Happy)
    import sys, happy, happy.log

    class WordCount(happy.HappyJob):
        def __init__(self, inputpath, outputpath):
            happy.HappyJob.__init__(self)
            self.inputpaths = inputpath
            self.outputpath = outputpath
            self.inputformat = "text"

        def map(self, records, task):
            for _, value in records:
                for word in value.split():
                    task.collect(word, "1")

        def reduce(self, key, values, task):
            count = 0
            for _ in values:
                count += 1
            task.collect(key, str(count))
            happy.results["words"] = happy.results.setdefault("words", 0) + count
            happy.results["unique"] = happy.results.setdefault("unique", 0) + 1

    if __name__ == "__main__":
        wc = WordCount(sys.argv[1], sys.argv[2])
        results = wc.run()
        print str(sum(results["words"])) + " total words"
        print str(sum(results["unique"])) + " unique words"
WordCount.py (Dumbo)
    class Mapper:
        def __init__(self):
            file = open("excludes.txt", "r")
            self.excludes = set(line.strip() for line in file)
            file.close()

        def map(self, key, value):
            for word in value.split():
                if not (word in self.excludes):
                    yield word, 1

    def reducer(key, values):
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(Mapper, reducer, reducer)
Performance tuning: Partitioner + Combiner (see the sketch below)
Mahout: Clustering, Classification
Hadoop Pipes: Hadoop with C++
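With streaming, a cheap way to get combiner-like behaviour is to pre-aggregate inside the mapper before emitting; a minimal sketch, with the flush threshold chosen arbitrarily:

    #!/usr/bin/env python
    # In-mapper combining (Python 2): counts are kept in a local dict and
    # flushed periodically, so far fewer key\tvalue records hit the shuffle.
    import sys

    counts = {}

    def flush():
        for key, count in counts.iteritems():
            print '%s\t%d' % (key, count)
        counts.clear()

    for line in sys.stdin:
        for word in line.split():        # stand-in for the real date_query key extraction
            counts[word] = counts.get(word, 0) + 1
        if len(counts) > 10000:          # bound the memory used by the dict
            flush()
    flush()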
Q1. How do you use a numeric key type in streaming?
A1. Currently you can't; only bytes and text are supported.
Q2. Why doesn't every reducer INSERT into MySQL directly?
A2. Performance. That is why the job writes bulk.sql and loads it with mysql in one pass.
Q3. Which model is more efficient, Java MapReduce or Python streaming?
A3. More testing is needed.
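A common workaround for Q1, not from the talk: since streaming compares keys as text, zero-padding numeric keys to a fixed width makes the text order match the numeric order.

    # "100" sorts before "42" as text; fixed-width padding fixes the ordering.
    offset = 42
    print '%010d\t%s' % (offset, 'some value')    # emits 0000000042<TAB>some value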
HadoopStreaming - Hadoop Wiki
HowToDebugMapReducePrograms - Hadoop Wiki
from __future__ import dream :: About Hadoop Streaming
Writing An Hadoop MapReduce Program In Python - Michael G. Noll
Skills Matter : Hadoop User Group UK: Dumbo: Hadoop streaming made elegant and easy
Home — dumbo — GitHub
Example programs — dumbo — GitHub
Hadoop + Python = Happy