Download presentation
Presentation is loading. Please wait.
1
15-826: Multimedia Databases and Data Mining
C. Faloutsos 15-826 15-826: Multimedia Databases and Data Mining Lecture #23: intro to hadoop C. Faloutsos
2
Resources Software Map/reduce paper [Dean & Ghemawat]
Map/reduce paper [Dean & Ghemawat] Tutorial: see part1, foils #9-20 from videolectures.net/site/normal_dl/tag=75554/kdd2010_papadimitriou_sun_yan_lsdm_01.pdf videolectures.net/kdd2010_papadimitriou_sun_yan_lsdm/ 15-826 (c) C. Faloutsos
3
Motivation: Scalability
Faloutsos Faloutsos et al 7/5/2018 Motivation: Scalability Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003] Yahoo: 5Pb of data [Fayyad, KDD’07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then? A: map/reduce – hadoop (open-source clone) 15-826 (c) C. Faloutsos 3
4
2’ intro to hadoop details
master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency compute local histograms then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id 15-826 (c) C. Faloutsos
5
2’ intro to hadoop details reduce map
master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency compute local histograms then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id reduce map 15-826 (c) C. Faloutsos
6
details User Program Mapper fork Master assign map assign reduce
Input Data (on HDFS) Output File 0 File 1 write read Reducer local write Split 0 Split 1 Split 2 Reducer remote read, sort By default: 3-way replication; Late/dead machines: ignored, transparently (!) 15-826 (c) C. Faloutsos
7
More details: (thanks to U Kang for the animations) 15-826
(c) C. Faloutsos
8
Background: MapReduce
MapReduce/Hadoop Framework HDFS HDFS: fault tolerant, scalable, distributed storage system Mapper: read data from HDFS, output (k,v) pair Map 0 Map 1 Map 2 This ease of programming is one of the advantage of the mapreduce framework. Shuffle Output sorted by the key Reduce 0 Reduce 1 Reducer: read output from mappers, output a new (k,v) pair to HDFS Programmers need to provide only map() and reduce() functions HDFS 15-826 (c) C. Faloutsos 8
9
Background: MapReduce
Example: histogram of fruit names HDFS map( fruit ) { output(fruit, 1); } Map 0 Map 1 Map 2 Given fruit information. We don’t know the physical location. We only can define mapper and reducer, that’s what we are allowed to do. The only guarantee: HDFS won’t split a line. Human: write mappers() and reducers(). Make human gray. (apple, 1) (strawberry,1) (apple, 1) (orange, 1) Shuffle reduce( fruit, v[1..n] ) { for(i=1; i <=n; i++) sum = sum + v[i]; output(fruit, sum); } Reduce 0 Reduce 1 (orange, 1) (strawberry, 1) (apple, 2) HDFS 15-826 (c) C. Faloutsos 9
10
Conclusions Convenient mechanism to process Tera- and Peta-bytes of data, on >hundreds of machines User provides map() and reduce() functions Hadoop handles the rest (eg., machine failures etc) 15-826 (c) C. Faloutsos
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.