NoSQL DBs.


Positives of RDBMS Historical positives of RDBMS: Can represent relationships in data Easy to understand relational model/SQL Disk oriented storage Indexing structures Multi threading to hide latency Locking-based for consistency Recovery (log-based)

DBs today*: Things have changed. Data is no longer just in relational DBs, and there are different constraints on information. For example: placing items in shopping carts, searching for answers in Wikipedia, retrieving Web pages, Facebook info. Large amounts of data!!!

Relational Negatives: RDBMS very complex, strict; want simplicity. RDBMS limited in throughput; want higher throughput. With RDBMS must scale up (expensive servers); want to scale out (wide, cheap servers). With RDBMS, overhead of object-to-relational mapping; want to store data as is. Cannot always partition/distribute from a single DB server; want to distribute data. RDBMS providers slow to move to the cloud; everyone wants to use the cloud.

SQL Negatives Also requires rewrite because not good for: Text Data warehouses Stream processing Scientific and intelligence databases Interactive transactions Direct SQL interfaces are rare

Data Today*: Different types of data: structured, semi-structured, unstructured. Structured: info in databases. Data organized into chunks, similar entities grouped together. Descriptions for entities in groups have the same format, length, etc.

Data Today* Semi-structured – data has certain structure, but not all items identical Schema info may be mixed in with data values Similar entities grouped together – may have different attributes Self-describing data, e.g. XML May be displayed as a graph

Data Today*: Unstructured data: data can be of any type, may have no format or sequence, and cannot be represented by any type of schema. Examples: Web pages in HTML; video, sound, images. Big data: much of it is unstructured.

Big Data - What is it? Massive volumes of rapidly growing data: smartphones broadcasting location (every few secs); chips in cars running diagnostic tests (1000s per sec); cameras recording public/private spaces; RFID tags read as they travel through the supply chain.

Characteristics of Big Data: Unstructured. Heterogeneous. Grows at a fast pace. Diverse. Not formally modeled. Data is valuable (just because it's big, is it important?). Standard databases and data warehouses cannot capture the diversity and heterogeneity, and cannot achieve satisfactory performance.

How to deal with such data NoSQL – do not use a relational structure MapReduce – from Google

What does NoSQL mean? NoSQL used to stand for "NO to SQL" (1998), but now it is "Not Only SQL" (2009).

NoSQL: “NoSQL is not about any one feature of any of the projects. NoSQL is not about scaling, NoSQL is not about performance, NoSQL is not about hating SQL, NoSQL is not about ease of use, …, NoSQL is not about throughput, NoSQL is not about speed, …, NoSQL is not about open standards, NoSQL is not about Open Source and NoSQL is most likely not about whatever else you want NoSQL to be about. NoSQL is about choice.” Lehnardt of CouchDB

NoSQL: Many applications have data structures of low complexity and don't need relational features. NoSQL DBs are designed to store data structures that are simpler than relational data structures, or similar to those of object-oriented programming languages. No expensive object-relational mapping needed.

Types of NoSQL DBs: Classification: column stores, key-value stores, document stores. Examples: Column: HBase, Accumulo, Cassandra. Key-value: Dynamo, Riak, Redis, Cache, Voldemort. Document: MongoDB, CouchDB, SimpleDB.

Column Stores

Column Store: Stores data tables in column order; relational DBs store them in row order.

Row-based storage: A relational table is serialized as rows are appended and flushed to disk. Whole datasets can be read/written in a single I/O operation. Good locality of access (on disk and in cache) of different columns. Operations on columns are expensive; must read extra data.

Column Storage Serializes tables by appending columns and flushing to disk Operations on columns – fast, cheap Operations on rows costly, seeks in many or all columns Good for? aggregations

Column storage with locality groups Like column storage but groups columns expected to be accessed together Store groups together and physically separated from other column groups Google’s Bigtable Started as column families

Figure: Storage layout: (a) row-based, (b) columnar, (c) columnar with locality groups.
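The three layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the column names, and the choice of which columns form a group are all made up, and a real store writes bytes to disk rather than Python lists.

```python
# Toy table: three rows, three columns (all names and values are illustrative).
rows = [
    {"id": 1, "name": "Ann", "amount": 10},
    {"id": 2, "name": "Bob", "amount": 20},
    {"id": 3, "name": "Cy", "amount": 30},
]
columns = ["id", "name", "amount"]


def serialize_row_based(rows, columns):
    """Row order: all values of row 1, then all values of row 2, and so on."""
    return [row[c] for row in rows for c in columns]


def serialize_columnar(rows, columns):
    """Column order: all ids, then all names, then all amounts."""
    return [row[c] for c in columns for row in rows]


def serialize_locality_groups(rows, groups):
    """Each group of columns gets its own segment, laid out row by row."""
    return [[row[c] for row in rows for c in group] for group in groups]


row_layout = serialize_row_based(rows, columns)
col_layout = serialize_columnar(rows, columns)
group_layout = serialize_locality_groups(rows, [["id", "name"], ["amount"]])

# Summing "amount" reads one contiguous slice of the columnar layout...
n = len(rows)
start = columns.index("amount") * n
total_columnar = sum(col_layout[start:start + n])
# ...but has to stride across every row of the row-based layout.
total_row_based = sum(row_layout[columns.index("amount")::len(columns)])
```

The aggregate gives the same answer either way; what differs is how much unrelated data sits between the values that are needed.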

Column Store: Stores data as tables. Advantages for data warehouses and customer relationship management (CRM) systems. More efficient for: aggregates computed over many rows but only a few columns; updating the values of one column across all rows. Easier to compress, since all values in a column have the same type.

HBase: HBase is an open-source, distributed, versioned, non-relational, column-oriented data store. It is an Apache project whose goal is to provide storage for Hadoop distributed computing. Facebook has chosen HBase to implement its new message platform. Data is logically organized into tables, rows, and columns.

Querying Scans and queries can select a subset of available columns, perhaps by using a filter There are three types of lookups: Fast lookup using row key and optional timestamp Full table scan Range scan from region start to end Tables have one primary index: the row key

Operations: Create()/Disable()/Drop(): create, disable, or drop a table. Put(): insert a new record with a new key, or insert a record for an existing key. Get(): select a value from a table by key. Scan(): scan a table with a filter. No Join!

HBase Data Model Each record is divided into Column Families Each row has a Key Each column family consists of one or more Columns

HBase Data Model
Tables are sorted by Row. Diagram labels: Row Key, Column Family, Column, Timestamp, Value.

Row Key        Time Stamp  ColumnFamily contents        ColumnFamily anchor
"com.cnn.www"  t9                                       anchor:cnnsi.com = "CNN"
               t8                                       anchor:my.look.ca = "CNN.com"
               t6          contents:html = "<html>..."
               t5
               t3

Table schema only defines its column families. Each family consists of any number of columns. Each column consists of any number of versions. Columns only exist when inserted; NULLs are free. Columns within a family are sorted and stored together. Everything except table names is byte[]. (Row, Family:Column, Timestamp) -> Value
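The (Row, Family:Column, Timestamp) to Value mapping can be sketched as a small in-memory map. This is a toy under stated assumptions, not the HBase client API: the class name `ToyTable`, its method signatures, and the use of plain integers for the timestamps t9, t8, t6 are all invented for illustration.

```python
class ToyTable:
    """In-memory stand-in for the logical model; not the HBase client API."""

    def __init__(self, families):
        self.families = set(families)  # the schema fixes only the column families
        self.cells = {}                # (row, "family:column", timestamp) -> value

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (latest if None)."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self.cells.items()
            if r == row and c == column and (timestamp is None or ts <= timestamp)
        ]
        if not versions:
            return None  # an absent cell costs nothing to store
        return max(versions, key=lambda pair: pair[0])[1]

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, sorted by row, then column, newest first."""
        for key in sorted(self.cells, key=lambda k: (k[0], k[1], -k[2])):
            if start_row <= key[0] < stop_row:
                yield key, self.cells[key]


table = ToyTable(["contents", "anchor"])
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
table.put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com")
table.put("com.cnn.www", "contents:html", 6, "<html>...")
```

Columns are created simply by writing to them, which mirrors the point that only the families are declared up front.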

HBase Physical Model
Each column family is stored in a separate file. Different sets of column families may have different properties and access patterns. Keys & version numbers are replicated with each column family. Empty cells are not stored.

Row Key        Time Stamp  ColumnFamily contents        ColumnFamily anchor
"com.cnn.www"  t9                                       anchor:cnnsi.com = "CNN"
               t8                                       anchor:my.look.ca = "CNN.com"
               t6          contents:html = "<html>..."
               t5
               t3

Hbase and SQL I looked up Hbase and SQL and found Phoenix: http://www.slideshare.net/Hadoop_Summit/w-145p230-ataylorv2 Check out slide 38

Cassandra: Open source, Apache. Schema optional. Need to design column families to support queries: start with the queries and work back from there. CQL (Cassandra Query Language): Select/From/Where, Insert, Update, Delete, Create ColumnFamily. http://cassandra.apache.org/doc/cql/CQL.html#SELECT Has primary and secondary indexes.

Cassandra Keyspace is container (like DB) Contains column family objects (like tables) Contain columns, set of related columns identified by application supplied row keys Each row does not have to have same set of columns Has PKs, but no FKs Join not supported Each column family has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from your application. http://planetcassandra.org/create-a-keyspace-and-table/ Video around 12:30 at http://cassandra.apache.org/ Creates a “tree of hashes of their data”

Key-Value Store

Key-value store Key–value (k, v) stores allow the application to store its data in a schema-less way Keys – can be anything Values – objects not interpreted by the system v can be an arbitrarily complex structure with its own semantics or a simple word Good for unstructured data Data could be stored in a datatype of a programming language or an object No meta data No need for a fixed data model

Key-Value Stores Simple data model Map/dictionary Put/request values per key Length of keys limited, few limitations on value High scalability over consistency No complex ad-hoc querying and analytics No joins, aggregate operations

Dynamo Amazon’s Dynamo Highly distributed Only store and retrieve data by primary key Simple key/value interface, store values as BLOBs Operations limited to k,v at a time Get(key) returns list of objects and a context Put(key, context, object) no return values Context is metadata, e.g. version number
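The get/put-with-context interface can be sketched as follows. This is a deliberately simplified toy: the class `ToyDynamo` is hypothetical, and the context here is a single version counter chosen for illustration, whereas a real distributed store needs richer version metadata.

```python
class ToyDynamo:
    """Toy key-value store with Dynamo-style get/put; the context is a bare version counter."""

    def __init__(self):
        self._data = {}  # key -> (version, [opaque values])

    def get(self, key):
        """Return (list of values, context); the context carries the version number."""
        version, values = self._data.get(key, (0, []))
        return list(values), {"version": version}

    def put(self, key, context, value):
        """Store an opaque value; returns nothing, as on the slide."""
        version, values = self._data.get(key, (0, []))
        if context["version"] == version:
            # The writer saw the latest version: replace it.
            self._data[key] = (version + 1, [value])
        else:
            # The writer's context is stale: keep both values for the reader to reconcile.
            self._data[key] = (version + 1, values + [value])


store = ToyDynamo()
values, ctx = store.get("cart")          # nothing stored yet
store.put("cart", ctx, b"book")          # first write
store.put("cart", ctx, b"pen")           # second write reuses the old, now stale, context
conflicting, latest_ctx = store.get("cart")
```

This shows why get returns a list: when two writers started from the same context, the store keeps both values instead of silently dropping one.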

DynamoDB Based on Dynamo Can create tables, define attributes, etc. Have 2 APIs to query data Query Scan

DynamoDB - Query: A Query operation searches only primary key attribute values. Can query indexes in the same way as tables. Supports a subset of comparison operators on key attribute values. Returns all of the item's data for the matching primary keys (all of each item's attributes), up to 1 MB of data per query operation. Always returns results, but can return empty results. Query results are always sorted by the range key. http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html

DynamoDB - Scan A Scan operation examines every item in the table User specifies filters to apply to the results to refine the values returned after scan has finished A 1 MB limit on the scan (the limit applies before the results are filtered) Scan can result in no table data meeting the filter criteria. Scan supports a specific set of comparison operators http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html
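The point that the size limit applies before filtering is easy to miss, so here is a minimal sketch of it. Everything in it is an assumption made for illustration: `toy_scan`, its parameters, the one-byte item size, and the integer resume index (standing in for a pagination token) are not part of the DynamoDB API.

```python
def toy_scan(items, predicate, size_of, limit_bytes, start=0):
    """Read items from `start` until the byte budget is spent, then filter.

    Returns (matching items, index to resume from, or None when the table is exhausted).
    """
    read, used, i = [], 0, start
    # Always read at least one item so the scan makes progress.
    while i < len(items) and (not read or used + size_of(items[i]) <= limit_bytes):
        used += size_of(items[i])
        read.append(items[i])
        i += 1
    matches = [item for item in read if predicate(item)]  # filter applied after reading
    return matches, (i if i < len(items) else None)


items = [{"id": n, "status": "A" if n % 2 else "B"} for n in range(6)]
is_active = lambda item: item["status"] == "A"
one_byte = lambda item: 1  # pretend every item is one byte

page1, resume = toy_scan(items, is_active, one_byte, limit_bytes=3)
page2, done = toy_scan(items, is_active, one_byte, limit_bytes=3, start=resume)
```

A page can come back with fewer matches than expected, or none at all, even though more matching items exist further on in the table.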

Sample Query and Scan http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryScanORMModelExample.html

Document Store

Document Store: Notion of a document. Documents encapsulate and encode data in some standard formats or encodings. Encodings include XML, YAML, and JSON, as well as binary forms like BSON, PDF, and Microsoft Office documents.

Document Store Documents can be organized/grouped as: Collections Tags Non-visible Metadata Directory hierarchies

Document Store More functionality than key-value More appropriate for semi-structured data Recognizes structure of objects stored Objects are documents that may have attributes of various types Objects grouped into collections Simple query mechanisms to search collections for attribute values

Document Store: Typically (e.g. MongoDB): collections correspond to tables, documents to records. But not all documents in a collection have the same fields. Documents are addressed in the database via a unique key. Goes beyond the simple key-document (or key-value) lookup: an API or query language allows retrieval of documents based on their contents.

MongoDB Specifics

MongoDB (from "huMONGOus"): MongoDB is document oriented, organized around collections of documents. Each document has an ID (key-value pair). Collections correspond to tables in an RDBMS; documents correspond to rows in an RDBMS. Collections can be created at run-time. Documents' structure is not required to be the same, although it may be.

MongoDB Operations in queries are limited – must implement in a programming language (JavaScript for MongoDB) No Join Many performance optimizations must be implemented by developer MongoDB does have indexes

MongoDB: Can build incrementally without modifying schema (since no schema). Example of hotel info, creating 3 documents:
d1 = {name: "Metro Blu", address: "Chicago, IL", rating: 3.5}
db.hotels.insert(d1)
d2 = {name: "Experiential", rating: 4, type: "New Age"}
db.hotels.insert(d2)
d3 = {name: "Zazu Hotel", address: "San Francisco, CA", rating: 4.5}
db.hotels.insert(d3)

MongoDB DB contains collection called ‘hotels’ with 3 documents To list all hotels: db.hotels.find() Did not have to declare or define the collection Hotels each have a unique key Not every hotel has the same type of information

MongoDB Queries DO NOT look like SQL To query all hotels in CA (searches for regular expression CA in string) db.hotels.find( { address : { $regex : "CA" } } ); To update hotels: db.hotels.update( { name:"Zazu Hotel" }, { $set : {wifi: "free"} } ) db.hotels.update( { name:"Zazu Hotel" }, { $set : {parking: 45} } )

Find() to Query: db.collection.find(<criteria>, <projection>), i.e. db.collection.find({select conditions}, {project columns}). Selection conditions: To match the value of a field: db.collection.find({c1: 5}). Everything for select ops must be inside of { }. Can use other comparators, e.g. $gt, $lt, $regex, etc.: db.collection.find({c1: {$gt: 5}}). If you have more than one condition, connect them with $and or $or and place the conditions inside brackets [ ].

Find() to Query: Projection: use it to specify a subset of columns. 1 to include, 0 to not include (_id:1 is the default). Cannot mix 1s and 0s, except for _id. db.collection.find({Name: "Sue"}, {Name:1, Address:1, _id:0}). If you don't have any select conditions, but want to specify a set of columns: db.collection.find({},{Name:1, Address:1, _id:0})
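How criteria and projection interact can be sketched with plain Python dictionaries. This is a toy model, not MongoDB or a MongoDB driver: `find`, `matches`, and `project` are hypothetical helpers, only `$gt`, `$lt`, and `$regex` are modelled, and the integer `_id` values are assumed (MongoDB would generate its own).

```python
import re

# Only the operators named on the slide; anything else raises KeyError.
OPERATORS = {
    "$gt": lambda value, arg: value is not None and value > arg,
    "$lt": lambda value, arg: value is not None and value < arg,
    "$regex": lambda value, arg: isinstance(value, str) and re.search(arg, value) is not None,
}


def matches(doc, criteria):
    """True if the document satisfies every field condition."""
    for field, cond in criteria.items():
        value = doc.get(field)
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 5}
            if not all(OPERATORS[op](value, arg) for op, arg in cond.items()):
                return False
        elif value != cond:         # plain equality, e.g. {"c1": 5}
            return False
    return True


def project(doc, projection):
    """Apply 1/0 flags; _id is included unless switched off explicitly."""
    if not projection:
        return dict(doc)
    include_id = projection.get("_id", 1) == 1
    flags = {f: v for f, v in projection.items() if f != "_id"}
    if flags and all(v == 1 for v in flags.values()):
        out = {f: doc[f] for f in flags if f in doc}
    elif all(v == 0 for v in flags.values()):
        out = {f: v for f, v in doc.items() if f not in flags and f != "_id"}
    else:
        raise ValueError("cannot mix 1s and 0s, except for _id")
    if include_id and "_id" in doc:
        out = {"_id": doc["_id"], **out}
    return out


def find(collection, criteria=None, projection=None):
    return [project(d, projection) for d in collection if matches(d, criteria or {})]


hotels = [
    {"_id": 1, "name": "Metro Blu", "address": "Chicago, IL", "rating": 3.5},
    {"_id": 2, "name": "Experiential", "rating": 4, "type": "New Age"},
    {"_id": 3, "name": "Zazu Hotel", "address": "San Francisco, CA", "rating": 4.5},
]
in_ca = find(hotels, {"address": {"$regex": "CA"}}, {"name": 1, "_id": 0})
well_rated = find(hotels, {"rating": {"$gt": 3.5}}, {"name": 1, "_id": 0})
```

Note that the hotel without an address is skipped by the regex condition rather than causing an error, which matches the "no schema to check against" behaviour described elsewhere in the deck.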

Documents can have nested fields. Must qualify the name with dot notation. Must use quotes around the qualified name (either double or single quotes). db.collection.find(<criteria>, <projection>), i.e. db.collection.find({select conditions}, {project columns})

> m2= {MOVI: "Gump (1994)", NOVL: {AUTH: "Groom, Winston", TITLE: "Forrest Gump"}}
> db.movie.insert(m2)
> db.movie.find()
{ "_id" : ObjectId("55195f845abf51cf253eb17b"), "MOVI" : "Gump (1994)", "NOVL" : { "AUTH" : "Groom, Winston ", "TITLE" : "Forrest Gump" } }
> db.movie.find().pretty()
"_id" : ObjectId("55195f845abf51cf253eb17b"),
> db.movie.find({},{"novl.title":1})
{ "_id" : ObjectId("55195f845abf51cf253eb17b") }
> db.movie.find({},{"NOVL.TITLE":1})
{ "_id" : ObjectId("55195f845abf51cf253eb17b"), "NOVL" : { "TITLE" : "Forrest Gump" } }
> db.movie.find({},{"NOVL.TITLE":1, _id:0})
{ "NOVL" : { "TITLE" : "Forrest Gump" } }

MongoDB download http://www.mongodb.org/downloads

Create a movie DB http://cs457.cs.ua.edu/2015S/literature.txt http://cs457.cs.ua.edu/2015S/literatureShort.txt http://cs457.cs.ua.edu/2015S/CreateDocuments.txt

Cursor functions: The result of a query (find()) is a cursor object: a pointer to the documents in the collection. Cursor methods apply a function to the result of a query, e.g. limit(), etc. For example, you can execute a find(…) followed by one of these cursor functions: db.collection.find().limit(). Look at the documentation to see which functions are available. You can store the cursor by assigning the result of the find to a variable, e.g. var cursor = db.collection.find()

Cursors Can set a variable equal to a cursor, then use that variable in javascript var c = db.testData.find() Print the full result set by using a while loop to iterate over the c variable: while ( c.hasNext() ) printjson( c.next() )

Aggregation Three ways to perform aggregation Single purpose Pipeline MapReduce

Single Purpose Aggregation: Simple access to aggregation, but lacks the capability of the pipeline. Operations: count, distinct, group. db.collection.distinct("custID") returns the distinct custIDs.

Single Purpose Aggregation: Count example: the following operation will count only the documents where the value of the field a is 1 and return 3: db.records.count( { a: 1 } ) Group seems a lot more complicated, see documentation: db.movie.aggregate([ {$group: {_id: "$MOVI", total: {$max: "$_id"}}}])

Pipeline Aggregation Modeled after data processing pipelines Basic --filters that operate like queries Operations to group and sort documents, arrays or arrays of documents http://mongodb.org/manual/reference/operator/aggregation/

Pipeline Operators Stage operators: $match, $project, $limit, $group, $sort Boolean: $and, $or, $not Set: $setEquals, $setUnion, etc. Comparison: $eq, $gt, etc. Arithmetic: $add, $mod, etc. String: $concat, $substr, etc. Text Search: $meta Array: $size Date, Variable, Literal, Conditional Accumulators: $sum, $max, etc.

Pipeline Aggregation: Assume a collection with 3 fields: CustID, status, amount. db.collection.aggregate([{$match: {status: "A"}}, {$group: {_id: "$CustID", total: {$sum: "$amount"}}}]) Notice you must use $ to get the value of the key.

Sort: Cursor sort or aggregation. If you use cursor sort, you can apply it after a find(). If you use aggregation: db.collection.aggregate([{$sort: {<sort key>: 1}}]), where 1 sorts ascending and -1 descending. The sort runs once the stages ahead of it in the pipeline have completed.
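The match-then-group pipeline on this slide amounts to a filter followed by a per-key sum. A minimal Python sketch of those two stages follows; the helper names and the sample orders are invented for illustration, and this models the idea rather than MongoDB's implementation.

```python
from collections import defaultdict


def match_stage(docs, criteria):
    """Keep only documents whose fields equal the given values."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def group_sum_stage(docs, key_field, sum_field, out_field="total"):
    """Group by key_field and sum sum_field within each group."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key_field]] += d[sum_field]
    return [{"_id": key, out_field: value} for key, value in totals.items()]


orders = [
    {"CustID": "c1", "status": "A", "amount": 50},
    {"CustID": "c1", "status": "A", "amount": 25},
    {"CustID": "c2", "status": "A", "amount": 10},
    {"CustID": "c2", "status": "D", "amount": 99},
]

# Stage 1 filters on status; stage 2 consumes stage 1's output.
result = group_sum_stage(match_stage(orders, {"status": "A"}), "CustID", "amount")
```

Each stage takes the previous stage's output as its input, which is why the order of stages matters: filtering first means the grouping never sees the status "D" order.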

Arrays Arrays are denoted with [ ] Some fields can contain arrays Using a find to query a field that contains an array If a field contains an array and your query has multiple conditional operators, the field as a whole will match if either a single array element meets the conditions or a combination of array elements meet the conditions.
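The array-matching rule is subtle enough to deserve a small sketch. The two helper functions and the sample conditions below are hypothetical; they only model the rule as stated on the slide. MongoDB offers the `$elemMatch` operator when a single element must satisfy all conditions.

```python
def array_field_matches(values, conditions):
    """Rule from the slide: each condition may be met by a different element."""
    return all(any(cond(v) for v in values) for cond in conditions)


def single_element_matches(values, conditions):
    """Stricter alternative: one element must meet all conditions together."""
    return any(all(cond(v) for cond in conditions) for v in values)


# Two conditions on the same field: greater than 15 and less than 20.
conditions = [lambda v: v > 15, lambda v: v < 20]

loose = array_field_matches([5, 25], conditions)      # 25 > 15 and 5 < 20
strict = single_element_matches([5, 25], conditions)  # no single element is in (15, 20)
```

The array [5, 25] matches under the slide's rule even though neither element lies between 15 and 20, which is often not what the query author intended.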

FYI Case sensitive to field names, collection names, e.g. Title will not match title

HW#6 will use GitHub DB – json so easy to import Semi-structured – no ER diagram Lots of nested fields Requires some effort to figure out data

Fields: Id; Type; Actor (Id, login, gravatar_id, url, avatar_url); Repo (name, url); payload (action, push_id, size, distinct_size, ref, head, before, commits); Public; Created at.

What I hate about MongoDB: I am confused by the syntax: too many { }'s. No error messages, or bad error messages. If I list a non-existent field, no message (because there are no schemas to check it with!). Official MongoDB documentation is lacking: not enough examples. Lots of other websites about MongoDB, but mostly people posting questions, and I don't trust the answers people post.

At CAPS use some type of GUI that makes using MongoDB much easier Robomongo Umongo, etc.

MongoDB Hybrid approach Use MongoDB to handle online shopping SQL to handle payment/processing of orders

Further Reading http://blog.mongodb.org/ https://blog.serverdensity.com/mongodb/ http://blog.mongolab.com/ http://docs.mongodb.org/manual/reference/


MongoDB vs DynamoDB (key-value store) When to use one vs. the other MongoDB - if your indexing fields might be altered later MongoDB if you need features of a document database Can query subdocuments, e.g. qualified field names MongoDB if you are going to use Perl, Erlang, or C++ DynamoDB supports Java, JavaScript, Ruby, PHP, Python, and .NET

MongoDB vs DynamoDB: MongoDB if you may exceed the limits of DynamoDB: can only store a 64 kB key in DynamoDB. MongoDB if you are going to have data types other than string, number, and base64-encoded binary, e.g. date, boolean. MongoDB if you are going to query by regular expression, e.g. {"name" => qr/[Jj]ohn/}; this cannot be completed by DynamoDB using one query.

NoSQL Oracle An Oxymoron?

Oracle NoSQL DB Key-value – horizontally scaled Records version # for k,v pairs Hashes keys for good distribution Map from user defined key (string) to opaque (?) data items

Oracle NoSQL DB: CRUD APIs: Create, Retrieve, Update, Delete. Create and Update are provided by put methods. Retrieve data items with get.

CRUD Examples
// Put a new key/value pair in the database, if key not already present.
Key key = Key.createKey("Katana");
String valString = "sword";
store.putIfAbsent(key, Value.createValue(valString.getBytes()));

// Read the value back from the database.
ValueVersion retValue = store.get(key);

// Update this item, only if the current version matches the version I read.
// In conjunction with the previous get, this implements a read-modify-write.
String newvalString = "Really nice sword";
Value newval = Value.createValue(newvalString.getBytes());
store.putIfVersion(key, newval, retValue.getVersion());

// Finally, (unconditionally) delete this key/value pair from the database.
store.delete(key);
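The read-modify-write pattern in the example above relies on a version check at write time. A minimal Python sketch of that idea follows. `ToyVersionedStore` and its method names are hypothetical stand-ins that loosely mirror the Java calls on the slide; the integer versions and the return values are assumptions, not Oracle NoSQL DB behaviour.

```python
class ToyVersionedStore:
    """Toy store where every write bumps a per-key version number."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def put_if_absent(self, key, value):
        """Create only; returns the new version, or None if the key exists."""
        if key in self._data:
            return None
        self._data[key] = (value, 1)
        return 1

    def get(self, key):
        """Returns (value, version), or None if the key is missing."""
        return self._data.get(key)

    def put_if_version(self, key, value, expected_version):
        """Update only if nobody has written since `expected_version` was read."""
        current = self._data.get(key)
        if current is None or current[1] != expected_version:
            return None  # lost the race; the caller should re-read and retry
        self._data[key] = (value, expected_version + 1)
        return expected_version + 1

    def delete(self, key):
        """Unconditional delete; returns True if something was removed."""
        return self._data.pop(key, None) is not None


store = ToyVersionedStore()
created = store.put_if_absent("Katana", b"sword")
value, version = store.get("Katana")
updated = store.put_if_version("Katana", b"Really nice sword", version)
stale = store.put_if_version("Katana", b"Blunt sword", version)  # version is now out of date
```

The second conditional put fails because the version it read has been superseded, which is how the store prevents one writer from overwriting another's update without seeing it.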

NoSQL DBs: Good for business intelligence. Flexible and extensible data model. No fixed schema. Development of queries is more complex. Limits to operations (no join ...), but suited to simple tasks, e.g. storage and retrieval of text files such as tweets. Processing simpler and more affordable. No standard or uniform query language such as SQL.

NoSQL DBs Cont’d Distributed and horizontally scalable (SQL is not) Run on large number of inexpensive (commodity) servers – add more servers as needed Differs from vertical scalability of RDBs where add more power to a central server

But 90% of people using DBs do not have to worry about any of the major scalability problems that can occur within DBs

Criticisms of NoSQL Open source scares business people Lots of hype, little promise If RDBMS works, don’t fix it Questions as to how popular NoSQL is in production today

Future No one size fits all model No more one size fits all language

End of Material

Info about Healthcare.gov: According to Time Magazine 3.10.14: Designers didn't cache the most frequently requested data. Every time a user had to get info from the website's large, it queried the db (on disk?). The practice of awarding high tech, high stakes contracts (e.g. Healthcare.gov) to companies whose primary skill is getting those contracts, not delivering on them, must change.

Healthcare.gov From Time Magazine 3.10.14: Mikey Dickerson who orchestrated the rescue of Healthcare.gov: “It was only when they were desperate that they turned to us… I have no history in government contracting and no future in it … I don’t wear a suit and tie … They have no use for someone who looks and dresses like me. Maybe this will be a lesson for them. Maybe that will change.”

MapReduce

4th paradigm Manipulate, explore, mine massive data Systems must be able to scale Increases in capacity > improvements in bandwidth Parallel processing only way forward

MapReduce: Programming model for distributed computations on massive amounts of data. Execution framework for large-scale data processing on clusters of commodity servers. Developed by Google; built on old principles of parallel and distributed processing. Hadoop: open source implementation of MapReduce. Not the Von Neumann model.

MapReduce (MR) MapReduce Level of abstraction and beneficial division of labor Programming model – powerful abstraction separates what from how of data intensive processing Hide system-level details from application developer Based on functional programming

Functional Programming Roots Lisp, Scheme Map: transformation of dataset do something to everything in a list Fold (Reduce): aggregation operation combine results of a list in some way Output aggregated by another user-specified computation Some aggregations can be applied in parallel

Map/Fold in Action: Simple map example: sum of squares. Map: square each item in a list: [1 2 3 4 5] -> [1 4 9 16 25]. Fold: sum the items in the list: [1 4 9 16 25] -> 55.
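The sum-of-squares example maps directly onto Python's built-in `map` and `functools.reduce` (Python's name for fold). The variable names are illustrative.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Map: apply the same function to every item independently.
squares = list(map(lambda x: x * x, numbers))

# Fold: combine the items into one result, starting from 0.
total = reduce(lambda acc, x: acc + x, squares, 0)
```

Because each square is computed without reference to any other item, the map step is the part that can be spread across machines.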

Map/Reduce: 2 stages: Map and Reduce. Map: a user-specified computation applied over all input; can occur in parallel; returns intermediate output. Reduce: the output from Map is the input; it is aggregated by another user-specified computation; some aggregations can occur in parallel.

Mappers/Reducers Key-value pair (k,v) – basic data structure in MapReduce Keys, values – int, strings, etc., user defined e.g. keys – URLs, values – HTML content e.g. keys – node ids, values – adjacency lists of nodes

Example: unigram (word count): Input is (docid, doc), where doc is text. Mapper tokenizes (docid, doc) and emits (k,v) for every word, e.g. (“book”, 1). Intermediate results are sorted. All values with the same key are sent to the same reducer. Reducer sums all counts (of 1) for a word and writes to one file.
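The word-count flow can be sketched on a single machine in Python. This is a simulation of the three phases only, under stated assumptions: the function names, the whitespace tokenizer, and the two sample documents are made up, and nothing here is distributed or tied to Hadoop's API.

```python
from collections import defaultdict


def mapper(docid, doc):
    """Emit (word, 1) for every token in the document."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and sort the keys, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(word, counts):
    """Sum all the 1s emitted for one word."""
    return word, sum(counts)


def word_count(docs):
    pairs = [kv for docid, doc in docs.items() for kv in mapper(docid, doc)]
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count({"d1": "the book on the table", "d2": "a book"})
```

In a real cluster the shuffle step is what routes every pair with the same key to the same reducer; here it is just a dictionary.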

MapReduce Example Convert the set of written tennis racket reviews to quantitative ratings of certain features. The output is the average of all numeric ratings of the tennis racket feature. Review 1: The X tennis racket is very flexible, with ample power, but provides average control. Review 2: The Y tennis stick provides medium power and outstanding control. Review 3: Using the Y racket gives you great control, but you have to generate most of your power. The frame is not very flexible.

MapReduce Example Map Function parses the text and outputs: map(R1) -> (<X, flexibility>, 9), (<X, power>, 8), (<X, control>, 5) map(R2) -> (<Y, power>, 5), (<Y, control>, 10) map(R3) -> (<Y, control>, 9), (<Y, power>, 3), (<Y, flexibility>, 2)

MapReduce Example Reduce Function result: reduce((<X, flexibility>)) -> (<X, flexibility>, 9) reduce((<X, power>)) -> (<X, power>, 8) reduce((<X, control>)) -> (<X, control>, 5) reduce((<Y, power>)) -> (<Y, power>, 4) reduce((<Y, control>)) -> (<Y, control>, 9.5) reduce((<Y, flexibility>)) -> (<Y, flexibility>, 2)
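The reduce step for the racket reviews is a per-key average. The sketch below starts from the map output listed on the earlier slide and skips the text-parsing step entirely; the function name `reduce_average` and the tuple representation of the keys are illustrative choices.

```python
from collections import defaultdict

# Map output copied from the slide: ((racket, feature), rating).
mapped = [
    (("X", "flexibility"), 9), (("X", "power"), 8), (("X", "control"), 5),  # map(R1)
    (("Y", "power"), 5), (("Y", "control"), 10),                            # map(R2)
    (("Y", "control"), 9), (("Y", "power"), 3), (("Y", "flexibility"), 2),  # map(R3)
]


def reduce_average(pairs):
    """Group ratings by (racket, feature) and average each group."""
    groups = defaultdict(list)
    for key, rating in pairs:
        groups[key].append(rating)
    return {key: sum(ratings) / len(ratings) for key, ratings in groups.items()}


averages = reduce_average(mapped)
```

Racket Y has two ratings each for power and control, so those keys are the only ones whose output differs from a single mapped value.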


MapReduce applied to DB? Implementation of Relational operations in MapReduce