Download presentation
Presentation is loading. Please wait.
Published byCamren Higginbottom Modified over 9 years ago
1
CMU SCS 15-826: Multimedia Databases and Data Mining Extra: intro to hadoop C. Faloutsos
2
CMU SCS Resources Software –http://hadoop.apache.org/http://hadoop.apache.org/ Map/reduce paper [Dean & Ghemawat] –http://labs.google.com/papers/mapreduce.htmlhttp://labs.google.com/papers/mapreduce.html Tutorial: see part1, foils #9-20 –videolectures.net/kdd2010_papadimitriou_sun_yan_lsdm/ 15-826(c) 2012 C. Faloutsos2
3
CMU SCS 15-826(c) 2012 C. Faloutsos3 Motivation: Scalability Google: > 450,000 processors in clusters of ~2000 processors each [ Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003 ] Yahoo: 5Pb of data [Fayyad, KDD’07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then? A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/ http://hadoop.apache.org/
4
CMU SCS 15-826(c) 2012 C. Faloutsos4 2’ intro to hadoop master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency –compute local histograms –then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id details
5
CMU SCS 15-826(c) 2012 C. Faloutsos5 2’ intro to hadoop master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency –compute local histograms –then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id map reduce details
6
CMU SCS 15-826(c) 2012 C. Faloutsos6 User Program Reducer Master Mapper fork assign map assign reduce read local write remote read, sort Output File 0 Output File 1 write Split 0 Split 1 Split 2 Input Data (on HDFS) By default: 3-way replication; Late/dead machines: ignored, transparently (!) details
7
CMU SCS More details: (thanks to U Kang for the animations) 15-826(c) 2012 C. Faloutsos7
8
CMU SCS 8 Background: MapReduce MapReduce/Hadoop Framework Map 0Map 1Map 2 Reduce 0Reduce 1 Shuffle HDFS HDFS: fault tolerant, scalable, distributed storage system Mapper: read data from HDFS, output (k,v) pair Reducer: read output from mappers, output a new (k,v) pair to HDFS Output sorted by the key Programmers need to provide only map() and reduce() functions 15-8268(c) 2012 C. Faloutsos
9
CMU SCS 9 Background: MapReduce Example: histogram of fruit names Map 0Map 1Map 2 Reduce 0Reduce 1 Shuffle (apple, 1) (strawberry,1) (apple, 2) (orange, 1) (strawberry, 1) (orange, 1) HDFS map( fruit ) { output(fruit, 1); } reduce( fruit, v[1..n] ) { for(i=1; i <=n; i++) sum = sum + v[i]; output(fruit, sum); } 15-8269(c) 2012 C. Faloutsos
10
CMU SCS Conclusions Convenient mechanism to process Tera- and Peta-bytes of data, on >hundreds of machines User provides map() and reduce() functions Hadoop handles the rest (eg., machine failures etc) 15-826(c) 2012 C. Faloutsos10
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.