Key-Value stores simple data model that maps keys to a list of values Easy to achieve Performance Fault tolerance Heterogeneity Availability due to its.

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

Mapreduce and Hadoop Introduce Mapreduce and Hadoop
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid1, Kamil BajdaPawlikowski1, Daniel Abadi1, Avi.
HadoopDB An Architectural Hybrid of Map Reduce and DBMS Technologies for Analytical Workloads Presented By: Wen Zhang and Shawn Holbrook.
Distributed Computations
MapReduce: Simplified Data Processing on Large Clusters Cloud Computing Seminar SEECS, NUST By Dr. Zahid Anwar.
MapReduce Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra.
Distributed Computations MapReduce
7/14/2015EECS 584, Fall MapReduce: Simplied Data Processing on Large Clusters Yunxing Dai, Huan Feng.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
MapReduce.
MapReduce. Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture emerging: – Cluster of.
Map Reduce: Simplified Data Processing On Large Clusters Jeffery Dean and Sanjay Ghemawat (Google Inc.) OSDI 2004 (Operating Systems Design and Implementation)
Introduction to MapReduce Amit K Singh. “The density of transistors on a chip doubles every 18 months, for the same cost” (1965) Do you recognize this.
Jeffrey D. Ullman Stanford University. 2 Chunking Replication Distribution on Racks.
H ADOOP DB: A N A RCHITECTURAL H YBRID OF M AP R EDUCE AND DBMS T ECHNOLOGIES FOR A NALYTICAL W ORKLOADS By: Muhammad Mudassar MS-IT-8 1.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Parallel Programming Models Basic question: what is the “right” way to write parallel programs –And deal with the complexity of finding parallelism, coarsening.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce and Hadoop 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 2: MapReduce and Hadoop Mining Massive.
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
HadoopDB Presenters: Serva rashidyan Somaie shahrokhi Aida parbale Spring 2012 azad university of sanandaj 1.
Map Reduce: Simplified Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Google, Inc. OSDI ’04: 6 th Symposium on Operating Systems Design.
MapReduce M/R slides adapted from those of Jeff Dean’s.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
1 MapReduce: Theory and Implementation CSE 490h – Intro to Distributed Computing, Modified by George Lee Except as otherwise noted, the content of this.
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
MapReduce and the New Software Stack CHAPTER 2 1.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
MapReduce: Simplified Data Processing on Large Clusters Lim JunSeok.
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
1. Efficient Peer-to-Peer Lookup Based on a Distributed Trie 2. Complex Queries in DHT-based Peer-to-Peer Networks Lintao Liu 5/21/2002.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
MapReduce: simplified data processing on large clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Simplified Data Processing on Large Cluster Authors: Jeffrey Dean and Sanjay Ghemawat Presented by: Yang Liu, University of Michigan EECS 582.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
MapReduce: Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc.
Hadoop MapReduce Framework
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.
CHAPTER 3 Architectures for Distributed Systems
Overview of Hadoop MapReduce MapReduce is a soft work framework for easily writing applications which process vast amounts of.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
MapReduce Simplied Data Processing on Large Clusters
CS6604 Digital Libraries IDEAL Webpages Presented by
湖南大学-信息科学与工程学院-计算机与科学系
February 26th – Map/Reduce
Cse 344 May 4th – Map/Reduce.
CS 345A Data Mining MapReduce This presentation has been altered.
MAPREDUCE TYPES, FORMATS AND FEATURES
CS639: Data Management for Data Science
5/7/2019 Map Reduce Map reduce.
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

Key-Value stores simple data model that maps keys to a list of values Easy to achieve Performance Fault tolerance Heterogeneity Availability due to its schema-less data model and fine granularity partitioning of the data

Google BigTable Google use the key-value paradigm to map URLs to multidimensional data, such as: Timestamps/Versions Rank Keywords Links No explicit ordering is needed on keys since a hash function is used

MapReduce Many data distributed over hundreds or thousands of machines These data need to be processed and produce new data (usually the process is a simple task) ex. aggregate functions, filtering, etc.

MapReduce Partition the data according to pre-defined key ex. words, URLs, etc. A master node assigns to workers specific partitions (=keys, =mappings of data to keys) The worker will produce a new list of key–value, corresponding to the new intermediate processed data (Map phase) Workers then will gather all intermediate data belonging to a specific key, and reduce them to the requested output ordered by key (Reduce phase)

MapReduce map (in_key, in_value) -> list(out_key, intermediate_value) reduce (out_key, list(intermediate_value)) -> list(out_value) k, v k1,v1 k2,v2 k3,v3 …... map k', v' k1,v1' k2,v2' k3,v3' …... map k1, v1 v1' …... reduce r1 r2 …... k2, v2 v2' …... reduce r1 r2 …... The dirty little secret of Google that is too obvious is... sort

Hadoop Map/Reduce open source implementation of the map/reduce idea takes care of scheduling tasks, monitoring them and re-executes the failed tasks a single master JobTracker and one slave TaskTracker per cluster-node

HadoopDB

HadoopDB: Database Connector extends Hadoop's InputFormat class connects to a database, executes the SQL query and returns results as key-value pairs “should” support any JDBC-compliant database that resides in the cluster

HadoopDB: Catalog The catalog maintains meta-information about the databases connection parameters such as database location, driver class and credentials metadata such as data sets contained in the cluster, replica locations, and data partitioning properties

HadoopDB: Data Loader responsible for: globally repartitioning data on a given partition key upon loading breaking apart single node data into multiple smaller partitions or chunks and finally bulk-loading the single-node databases with the chunks

The [not so] easy way to do the labwork... HadoopDB with MonetDB instead of PostgreSQL read the HadoopDB paper download HadoopDB hook in MonetDB define a benchmark do experiments write an excellent report!  implementation details+experiments

The [not so] fun way to do the labwork... Combine something of the following: Map/Reduce DHTs - Chord URLs N-gram Strings Strings M5 module

DHTs a traditional key-value store based on a distributed hash table there are no master nodes keys are distributed to nodes according to a hash function values are retrieved with O(logN) messages by employing routing tables

Chord protocol peer1 peer2 peer3 peer4 peer5 peer6 keys are assigned an identifier: hash(key) peers are assigned an identifier: hash(IP) store and retrieve pairs of (key, data): lookup(key) Chord Ring modulo 2^m Each peer maintains a routing table (finger table) to route lookups V+2^0 → peer2 V+2^2 → peer5 V+2^1 → peer4

The [not so] fun way to do the labwork... Design the BAT representation of distributed Key- Value store module Develop a wrapper around it to (simulate) parallel behavior define a benchmark do experiments write an excellent report!  implementation details + experiments

What is the interface of the KV store?

SiteStat Basic event ns_utc= & Time= & type=view& ns_m2=no& name=statistics.basics& Ip= & ns_site_cookie=426EA62C01D400FA& agent=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;.NET CLR )& ns_pageurl= &ns_js=yes& referrer=http%3A// &lang=nl &secure=no &id= & Pv=2& cntry=NL& language=NL

Dealing with formatted key strings

url box [ 0, "urlbox_0", 1, 1 ] [ 1, "urlbox_1", 270, 270 ] [ 2, "urlbox_2", 236, 219 ] [ 3, "urlbox_3", 180, 173 ] [ 4, "urlbox_4", 136, 122 ] [ 5, "urlbox_5", 56, 48 ] [ 6, "urlbox_6", 34, 32 ] [ 7, "urlbox_7", 5, 5 ] http zvon org xxl DTDTutorial General book.html" OidValue 1org 2com 3net OidValue 1xxl Oidvalue 1DTDtutorial 1DOM2reference Insert 1000 urls -> 2 seconds

N-gram indexing Oidvalue 1Tuto 1utor 1tori Oidvalue 1utor Oidvalue 1tori Oidvalue 1Tuto Break the string in overlapping n-grams Use a single table per n-gram + wildcards