15-826: Multimedia Databases and Data Mining
Lecture #23: intro to hadoop
C. Faloutsos
Resources
Software: http://hadoop.apache.org/
Map/reduce paper [Dean & Ghemawat]: http://research.google.com/archive/mapreduce.html
Tutorial: see part 1, foils #9-20, from videolectures.net/site/normal_dl/tag=75554/kdd2010_papadimitriou_sun_yan_lsdm_01.pdf and videolectures.net/kdd2010_papadimitriou_sun_yan_lsdm/
15-826 (c) 2017 C. Faloutsos
Motivation: Scalability
Google: > 450,000 processors, in clusters of ~2,000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Micro 2003]
Yahoo: 5 Pb of data [Fayyad, KDD’07]
Problem: machine failures, on a daily basis
How to parallelize data mining tasks, then?
A: map/reduce; hadoop is its open-source clone: http://hadoop.apache.org/
2’ intro to hadoop: details
master-slave architecture; n-way replication (default n=3)
the ‘group by’ of SQL, done in a parallel, fault-tolerant way
e.g., find the histogram of word frequencies:
map: compute local histograms
reduce: merge them into the global histogram
SQL analogue:
select course-id, count(*)
from ENROLLMENT
group by course-id
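The SQL ‘group by’ above has a direct map/reduce analogue. A minimal Python sketch of the three phases (a plain simulation with illustrative names and sample data, not actual Hadoop code):

```python
from collections import defaultdict

# sample ENROLLMENT rows: (course_id, student_id) -- illustrative data
enrollment = [(826, "alice"), (826, "bob"), (721, "carol")]

# map phase: each mapper emits (course_id, 1) per enrollment row
mapped = [(course, 1) for course, _student in enrollment]

# shuffle phase: the framework groups the pairs by key
groups = defaultdict(list)
for course, one in mapped:
    groups[course].append(one)

# reduce phase: sum per course -- same answer as
#   SELECT course-id, COUNT(*) FROM ENROLLMENT GROUP BY course-id
counts = {course: sum(ones) for course, ones in groups.items()}
# counts == {826: 2, 721: 1}
```

In a real Hadoop job the shuffle is done by the framework; the programmer writes only the map and reduce steps.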
details
[Diagram: MapReduce execution flow. The User Program forks a Master, which assigns map and reduce tasks. Mappers read input splits (Split 0, 1, 2) of the Input Data on HDFS and write intermediate results locally; Reducers do a remote read and sort, then write the Output (File 0, File 1).]
By default: 3-way replication
Late/dead machines: ignored, transparently (!)
More details: (thanks to U Kang for the animations)
Background: MapReduce
MapReduce/Hadoop framework:
HDFS: fault-tolerant, scalable, distributed storage system
Mapper: reads data from HDFS, outputs (k,v) pairs [Map 0 | Map 1 | Map 2]
Shuffle: output sorted by key
Reducer: reads the mappers’ output, outputs new (k,v) pairs to HDFS [Reduce 0 | Reduce 1]
Programmers need to provide only the map() and reduce() functions; this ease of programming is one of the advantages of the MapReduce framework.
Background: MapReduce
Example: histogram of fruit names
HDFS -> [Map 0 | Map 1 | Map 2] -> Shuffle -> [Reduce 0 | Reduce 1] -> HDFS

map( fruit ) {
  output(fruit, 1);
}
mapper output: (apple, 1) (strawberry, 1) (apple, 1) (orange, 1)

reduce( fruit, v[1..n] ) {
  sum = 0;
  for(i=1; i<=n; i++)
    sum = sum + v[i];
  output(fruit, sum);
}
reducer output: (apple, 2) (orange, 1) (strawberry, 1)

Notes: we don’t know the physical location of the data; all we can do is define the mapper and the reducer. The only guarantee: HDFS won’t split a line. The human writes only map() and reduce().
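The slide’s map()/reduce() pseudocode can be run directly as a small Python simulation; sorting the mapper output by key stands in for Hadoop’s shuffle (illustrative code, not the Hadoop API):

```python
from itertools import groupby

def map_fruit(fruit):
    # mapper: emit (fruit, 1) for each input record
    return (fruit, 1)

def reduce_fruit(fruit, values):
    # reducer: sum the 1's emitted for one fruit name
    return (fruit, sum(values))

records = ["apple", "strawberry", "apple", "orange"]
pairs = sorted(map_fruit(f) for f in records)          # map + sort by key
result = dict(reduce_fruit(k, [v for _, v in grp])
              for k, grp in groupby(pairs, key=lambda p: p[0]))
# result == {'apple': 2, 'orange': 1, 'strawberry': 1}
```

groupby() only merges adjacent equal keys, which is why the pairs must be sorted first, exactly as the shuffle phase sorts by key before the reducers run.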
Conclusions
Convenient mechanism to process Tera- and Peta-bytes of data, on hundreds of machines or more
User provides the map() and reduce() functions
Hadoop handles the rest (e.g., machine failures)