Frankie Pike. 2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year Why care?


1 Frankie Pike

2
  2010: 1.2 zettabytes (1.2 trillion gigabytes)
  DVDs past the moon 2-way = 6 newspapers every day
  ~58% growth per year
  Why care?

3

4
  Google’s capacity = 1 exabyte
  24 hours of YouTube > Internet in 2000
  4 years of video per day on YouTube
  100 trillion words online

5 Common Architecture http://www.adopenstatic.com/images/resources/blog/Kerberos6.jpg

6 Common Architecture
  Single point of failure
  Space constraints
  Multi-tenancy difficulties
  Re-writing of programs or changes to network config

7 MapReduce

8 The Promise
  High reliability: any node can go down
  High scalability: easy to add nodes
  Multi-tenancy
  Cost reduction
  “Cloud-friendly”
  Java, C++, C#, Python, R
  Transparent parallelization

9 The Kryptonite
  Data set needs to be “big enough”
  Consistency mid-processing

10 Two Steps in MapReduce
  Map
  Reduce

11 Mapping
  Input K/V pairs -> intermediate K/V pairs
  Input and intermediate types can be different: (Server Key, Blog Data) -> (Blog Key, Post Count)
  Output is sorted and partitioned for reduction
  Number of maps depends on the task and the cluster
  10 TB of data with a block size of 128 MB = about 82,000 maps
  10-100 maps per node is ideal
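The map count on this slide is plain arithmetic: roughly one map task per input block. A minimal sketch, assuming binary units (1 TB = 1024^4 bytes); the helper name `estimate_map_tasks` is hypothetical and not part of any Hadoop API:

```python
TB = 1024 ** 4  # assumption: binary terabyte
MB = 1024 ** 2  # assumption: binary megabyte


def estimate_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """One map task per input block, rounding up for a partial last block."""
    return (input_bytes + block_bytes - 1) // block_bytes


maps = estimate_map_tasks(10 * TB, 128 * MB)
print(maps)  # 81920, which the slide rounds to 82,000
```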

12 Reducing
  Intermediate K/V -> intermediate K/V (smaller)
  Matching keys are consolidated: (A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6)
  Number of reductions >= 0
  Hopefully a smaller dataset at each iteration
  Reduce as much as needed
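The consolidation step on this slide can be sketched in a few lines. This is an illustrative single-process version, assuming the values are summed per key; `reduce_by_key` is a hypothetical name, and the extra pair `("B", 4)` in the second pass is made up to show that reduce output can be fed back into reduce.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, int]


def reduce_by_key(pairs: Iterable[Pair]) -> List[Pair]:
    """Consolidate pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


first_pass = reduce_by_key([("A", 15), ("B", 6), ("A", 3)])
print(first_pass)  # [('A', 18), ('B', 6)]

# The output has the same shape as the input, so it can be reduced again
# together with newly arrived intermediate pairs (hypothetical data).
second_pass = reduce_by_key(first_pass + [("B", 4)])
print(second_pass)  # [('A', 18), ('B', 10)]
```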

13 An Example
  {
    "type": "post",
    "name": "Raven's Map/Reduce functionality",
    "blog_id": 1342,
    "post_id": 29293921,
    "tags": ["raven", "nosql"],
    "post_content": "... ",
    "comments": [
      { "source_ip": "124.2.21.2", "author": "martin", "text": "..." }
    ]
  }
  Goal: a count of comments for each blog
  http://ayende.com/blog/4435/map-reduce-a-visual-explanation

14 Step 1: Map to final format http://ayende.com/blog/4435/map-reduce-a-visual-explanation

15 Step 2: Reduce (Partition) http://ayende.com/blog/4435/map-reduce-a-visual-explanation

16 Step 3: Reduce (more) http://ayende.com/blog/4435/map-reduce-a-visual-explanation

17 Step 4: Reduce (most) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
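The steps in this example (map each post to a small record, reduce within partitions, then reduce the partial results again) can be sketched as follows. This is a single-process illustration, not the code from the cited post: the post documents other than the first `blog_id`/`post_id` pair, the batch size of 2, and the names `map_post`, `reduce_counts`, and `partition` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical documents shaped like the example post; only the fields used here.
posts = [
    {"blog_id": 1342, "post_id": 29293921, "comments": [{"author": "martin"}]},
    {"blog_id": 1342, "post_id": 29293922, "comments": []},
    {"blog_id": 77, "post_id": 1, "comments": [{"author": "a"}, {"author": "b"}]},
    {"blog_id": 1342, "post_id": 29293923, "comments": [{}, {}, {}]},
]


def map_post(post):
    """Step 1: map each post to the final format (blog_id, comment count)."""
    return (post["blog_id"], len(post["comments"]))


def reduce_counts(pairs):
    """Sum comment counts for pairs that share a blog_id."""
    totals = defaultdict(int)
    for blog_id, count in pairs:
        totals[blog_id] += count
    return list(totals.items())


def partition(items, size):
    """Split the mapped records into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]


mapped = [map_post(p) for p in posts]

# Step 2: reduce each partition independently.
partials = [pair for batch in partition(mapped, 2) for pair in reduce_counts(batch)]

# Steps 3-4: reduce the partial results until one record per blog remains.
final = dict(reduce_counts(partials))
print(final)  # {1342: 4, 77: 2}
```

Because the reduce step takes and returns the same record shape, it can run on each partition independently and then again on the combined partial results.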

18 Single Node http://bc.tech.coop/blog/070520.html

19 Dual Node http://map-reduce.wikispaces.asu.edu/

20 N-Nodes http://www.inventoland.net/img/blog/mapReduce.png

21 Dealing with Failure
  Workers: occasional check-in pings by masters
  Masters: data structures get periodic auto-saves and consistency checks; can restart from periodic saves
  Bandwidth: tasks attempt to pair with local storage
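The worker side of this scheme can be sketched as a master that remembers when each worker last answered a ping and moves tasks away from workers that have gone quiet. This is a toy model under stated assumptions, not how any particular MapReduce implementation does it: the `Master` class, the timeout value, the round-robin reassignment, and the worker and task names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Master:
    timeout: float  # seconds of silence before a worker is presumed failed
    last_seen: Dict[str, float] = field(default_factory=dict)
    assignments: Dict[str, str] = field(default_factory=dict)  # task -> worker

    def record_ping(self, worker: str, now: float) -> None:
        """Note the time at which a worker last answered a ping."""
        self.last_seen[worker] = now

    def assign(self, task: str, worker: str) -> None:
        self.assignments[task] = worker

    def failed_workers(self, now: float) -> Set[str]:
        return {w for w, t in self.last_seen.items() if now - t > self.timeout}

    def reschedule(self, now: float) -> List[str]:
        """Move tasks from failed workers to healthy ones; return moved tasks."""
        failed = self.failed_workers(now)
        healthy = sorted(set(self.last_seen) - failed)
        moved: List[str] = []
        if not healthy:
            return moved  # nothing can be rescheduled yet
        for task, worker in sorted(self.assignments.items()):
            if worker in failed:
                self.assignments[task] = healthy[len(moved) % len(healthy)]
                moved.append(task)
        return moved


master = Master(timeout=10.0)
master.record_ping("w1", now=0.0)
master.record_ping("w2", now=0.0)
master.assign("map-0", "w1")
master.assign("map-1", "w2")
master.record_ping("w2", now=12.0)   # w1 never answers again
moved = master.reschedule(now=15.0)
print(moved, master.assignments)
```

Timestamps are passed in explicitly rather than read from a clock so the behaviour is easy to follow and test.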

22 Has it worked?
  Patented by Google
  Used to regenerate Google's web search index

23 Apache Hadoop
  “open source software for reliable, scalable, distributed computing”
  Hadoop Distributed File System (HDFS)
  Hadoop MapReduce
  Cassandra (multi-master database)
  HBase (scalable, distributed, structured database)
  Mahout (data mining and machine learning libs)
  ZooKeeper (coordination service)

24 Sources
  Avankipu & Sdsalvi, Cloud Computing - An Overview. http://map-reduce.wikispaces.asu.edu
  Ayende Rahien, Map/Reduce – A Visual Explanation. http://ayende.com/blog/4435/map-reduce-a-visual-explanation
  http://hadoop.apache.org/
  http://en.wikipedia.org/wiki/MapReduce/

