Frankie Pike. 2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year Why care?

Frankie Pike

2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year Why care?

 Google’s capacity = 1 exabyte  24 hours of Youtube > Internet in 2000  4 years of video / day on Youtube  100 trillion words online

Common Architecture http://www.adopenstatic.com/images/resources/blog/Kerberos6.jpg

Common Architecture Single point of failure Space-constraints Multi-tenancy difficulties Re-writing of programs or changes to network config

MapReduce

The Promise High reliability any node can go down High scalability easy to add nodes Multi-tenancy Cost Reduction “Cloud-friendly” Java, C++, C#, Python, R Transparent Parallelization

The Kryptonite Data set needs to be “big enough” Consistency mid-processing

Two Steps in MapReduce Map Reduce

Mapping Input K/V pairs -> Intermediate K/V Pairs Input and Intermediate can be different (Server Key, Blog Data) -> (Blog Key, Post Count) Sorted and Partitioned for reduction Number of maps depends on task and cluster 10TB data with blocksize 128MB = 82,000 maps 10-100 maps per node ideal

Reducing Intermediate K/V -> Intermediate K/V (smaller) Matching keys consolidated (A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6) Number of Reductions >= 0 Hopefully smaller dataset at each iteration Reduce as much as needed

An Example { "type": "post", "name": "Raven's Map/Reduce functionality", "blog_id": 1342, "post_id": 29293921, "tags": ["raven", "nosql"], "post_content": "... ", "comments": [ { "source_ip": '124.2.21.2', "author": "martin", "text": "..." } ] }Want count of comments for blog http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 1: Map to final format http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 2: Reduce (Partition) http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 3: Reduce (more) http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Step 4: Reduce (most) http://ayende.com/blog/4435/map-reduce-a-visual-explanation

Single Node http://bc.tech.coop/blog/070520.html

Dual Node http://map-reduce.wikispaces.asu.edu/

N-Nodes http://www.inventoland.net/img/blog/mapReduce.png

Dealing with Failure Workers Occasional check-in pings by masters Masters Data structures get periodic auto-saves and consistency checks. Can restart from periodic saves Bandwidth Tasks attempt to pair with local storage

Has it worked? Patented Regenerated index

Apache Hadoop “open source software for reliable, scalable, distributed computing” Hadoop Distributed File System (HDFS) Hadoop MapReduce Cassandra (multi-master database) HBase (scalable, distributed, structured database) Mahout (data mining and machine learning libs) ZooKeeper (coordination service)

Sources Avankipu & Sdsalvi, Cloud Computing - An Overview. http://map-reduce.wikispaces.asu.edu Ayende Rahien, Map/Reduce – A Visual Explanation. http://ayende.com/blog/4435/map-reduce-a-visual- explanation http://hadoop.apache.org/ http://en.wikipedia.org/wiki/MapReduce/

Frankie Pike. 2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year Why care?

Similar presentations

Presentation on theme: "Frankie Pike. 2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year Why care?"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Frankie Pike. 2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year Why care?

Similar presentations

Presentation on theme: "Frankie Pike. 2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year Why care?"— Presentation transcript:

Similar presentations

About project

Feedback