Published by Sarah Manning. Modified over 9 years ago.
1
Frankie Pike
2
2010: 1.2 zettabytes
  = 1.2 trillion gigabytes
  DVDs past the moon (2-way)
  = 6 newspapers every day
  ~58% growth per year
Why care?
4
Google’s capacity = 1 exabyte
24 hours of YouTube > Internet in 2000
4 years of video per day on YouTube
100 trillion words online
5
Common Architecture http://www.adopenstatic.com/images/resources/blog/Kerberos6.jpg
6
Common Architecture
  Single point of failure
  Space constraints
  Multi-tenancy difficulties
  Rewriting of programs or changes to network config
7
MapReduce
8
The Promise
  High reliability: any node can go down
  High scalability: easy to add nodes
  Multi-tenancy
  Cost reduction
  “Cloud-friendly”
  Java, C++, C#, Python, R
  Transparent parallelization
9
The Kryptonite
  Data set needs to be “big enough”
  Consistency mid-processing
10
Two Steps in MapReduce
  Map
  Reduce
11
Mapping
  Input K/V pairs -> intermediate K/V pairs
  Input and intermediate types can be different: (Server Key, Blog Data) -> (Blog Key, Post Count)
  Intermediate pairs are sorted and partitioned for reduction
  Number of maps depends on task and cluster
  10 TB of data with a 128 MB block size = about 82,000 maps
  10-100 maps per node is ideal
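The mapping slide can be illustrated with a minimal Python sketch. It is not Hadoop's API: the function names, the shape of `blog_data` (a dict from blog key to a list of posts), and the CRC-based partitioner are all assumptions chosen for illustration. The last helper reproduces the map-count arithmetic from the slide, assuming binary units (1 TB = 1024^4 bytes).

```python
from collections import defaultdict
from zlib import crc32


def map_blog(server_key, blog_data):
    """Map one (server key, blog data) input pair to (blog key, post count) pairs.

    blog_data is assumed to be a dict: blog key -> list of posts (hypothetical shape).
    """
    for blog_key, posts in blog_data.items():
        yield blog_key, len(posts)


def partition(pairs, num_reducers):
    """Sort intermediate pairs and bucket them so equal keys share a bucket."""
    buckets = defaultdict(list)
    for key, value in sorted(pairs):
        # A deterministic hash of the key picks the reducer bucket.
        buckets[crc32(key.encode("utf-8")) % num_reducers].append((key, value))
    return dict(buckets)


def num_map_tasks(input_bytes, block_bytes):
    """One map task per input block, rounding up (integer ceiling division)."""
    return -(-input_bytes // block_bytes)


TB = 1024 ** 4
MB = 1024 ** 2
print(num_map_tasks(10 * TB, 128 * MB))  # 81920, i.e. roughly 82,000 maps
```

Bucketing by a hash of the key is one common way to guarantee that every value for a given key reaches the same reducer; a real cluster may use a different partitioner.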
12
Reducing
  Intermediate K/V -> intermediate K/V (smaller)
  Matching keys are consolidated: (A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6)
  Number of reductions >= 0
  Hopefully a smaller dataset at each iteration
  Reduce as many times as needed
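The consolidation of matching keys on this slide can be sketched in a few lines of Python. The function name `reduce_counts` is hypothetical, and summation is assumed as the consolidation operation because that is what the slide's example does.

```python
from collections import defaultdict


def reduce_counts(pairs):
    """Consolidate pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


print(reduce_counts([("A", 15), ("B", 6), ("A", 3)]))  # [('A', 18), ('B', 6)]
```

Because the output has the same shape as the input, the function can be applied again to the combined results of earlier reductions, which is what makes "reduce as many times as needed" possible.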
13
An Example
  {
    "type": "post",
    "name": "Raven's Map/Reduce functionality",
    "blog_id": 1342,
    "post_id": 29293921,
    "tags": ["raven", "nosql"],
    "post_content": "... ",
    "comments": [
      {
        "source_ip": "124.2.21.2",
        "author": "martin",
        "text": "..."
      }
    ]
  }
Goal: count of comments per blog
http://ayende.com/blog/4435/map-reduce-a-visual-explanation
14
Step 1: Map to final format http://ayende.com/blog/4435/map-reduce-a-visual-explanation
15
Step 2: Reduce (Partition) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
16
Step 3: Reduce (more) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
17
Step 4: Reduce (most) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
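Steps 1 to 4 can be sketched end to end in Python. This is an illustration under assumptions, not the code from the cited blog post: the function names, the `comments_length` field, and the batch size are hypothetical, and the post documents are assumed to be plain dicts shaped like the example document.

```python
from collections import Counter


def map_post(post):
    """Step 1: map a post document to the shape of the final answer."""
    return {"blog_id": post["blog_id"], "comments_length": len(post["comments"])}


def reduce_batch(rows):
    """Steps 2-4: sum comment counts per blog; output shape equals input shape."""
    totals = Counter()
    for row in rows:
        totals[row["blog_id"]] += row["comments_length"]
    return [{"blog_id": b, "comments_length": n} for b, n in sorted(totals.items())]


def count_comments(posts, batch_size=2):
    """Map every post, reduce each batch separately, then reduce the partial results."""
    rows = [map_post(p) for p in posts]
    partials = []
    for i in range(0, len(rows), batch_size):
        partials.extend(reduce_batch(rows[i:i + batch_size]))
    return reduce_batch(partials)
```

Reducing in batches first and then reducing the partial results gives the same totals as one large reduction, because addition does not care how the rows are grouped.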
18
Single Node http://bc.tech.coop/blog/070520.html
19
Dual Node http://map-reduce.wikispaces.asu.edu/
20
N-Nodes http://www.inventoland.net/img/blog/mapReduce.png
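The step from one node to N nodes can be approximated on a single machine. In the sketch below, threads in one process stand in for cluster nodes, which is an assumption made purely for illustration; word counting and the function names are likewise illustrative choices and do not come from the slides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.split())


def parallel_word_count(chunks, workers):
    """Run the map step on `workers` threads, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Reduce step: add counts for matching words.
    return total
```

The result does not depend on the number of workers, which is the property that lets a cluster add nodes without changing the program.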
21
Dealing with Failure
  Workers
    Masters ping workers occasionally to check in
  Masters
    Data structures get periodic auto-saves and consistency checks
    Can restart from periodic saves
  Bandwidth
    Tasks attempt to pair with local storage
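The worker check-in idea can be sketched as a small bookkeeping class. This is a simplified model, not the mechanism of any particular framework: the `Master` class, its method names, and the timeout value are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class Master:
    """Tracks when each worker last answered a ping and reassigns orphaned tasks."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker -> time of last ping reply
        self.assigned = {}   # worker -> list of task ids

    def assign(self, worker, task, now):
        self.assigned.setdefault(worker, []).append(task)
        self.last_seen.setdefault(worker, now)

    def record_ping_reply(self, worker, now):
        self.last_seen[worker] = now

    def failed_workers(self, now):
        """Workers that have been silent for longer than the timeout."""
        return sorted(w for w, t in self.last_seen.items() if now - t > self.timeout_s)

    def reschedule(self, now):
        """Forget failed workers and return their tasks for reassignment."""
        orphaned = []
        for worker in self.failed_workers(now):
            orphaned.extend(self.assigned.pop(worker, []))
            del self.last_seen[worker]
        return orphaned
```

A usage example: with `Master(timeout_s=10)`, a worker that last replied at time 0 counts as failed at time 15, and `reschedule(now=15)` returns its tasks so they can be handed to a healthy worker.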
22
Has it worked?
  Patented by Google
  Used to regenerate Google's web index
23
Apache Hadoop
  “open source software for reliable, scalable, distributed computing”
  Hadoop Distributed File System (HDFS)
  Hadoop MapReduce
  Cassandra (multi-master database)
  HBase (scalable, distributed, structured database)
  Mahout (data mining and machine learning libs)
  ZooKeeper (coordination service)
24
Sources
  Avankipu & Sdsalvi, Cloud Computing - An Overview. http://map-reduce.wikispaces.asu.edu
  Ayende Rahien, Map/Reduce – A Visual Explanation. http://ayende.com/blog/4435/map-reduce-a-visual-explanation
  http://hadoop.apache.org/
  http://en.wikipedia.org/wiki/MapReduce/