MapReduce and NoSQL
CMSC 461
Michael Wilson
Big data
- The term "big data" has become fairly popular of late
- There is a need to store vast quantities of data and retrieve them in a short amount of time
- Images, movies, and other large files
MapReduce
reduce.html reduce.html
- Concept pioneered by Google
- Performs operations on large volumes of data
- Built from two functions: a map function and a reduce function
Map function
- Receives a set of key-value pairs as input
- Performs some user-defined operation on them
- Produces a set of new key-value pairs
Reduce function
- Receives the intermediate key-value pairs
- Can receive multiple values for the same key
- Merges the values together in some way
- Produces a merged output
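The map and reduce steps described above can be sketched with a word count. This is a minimal single-process illustration, not how Hadoop or Google's system is implemented; the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the grouping step in the middle is an assumption about how intermediate pairs reach the reducer.

```python
from collections import defaultdict


def map_words(key, value):
    """Map: take one (key, value) pair and emit new (key, value) pairs.

    Here the key is a document name (unused) and the value is its text.
    """
    return [(word.lower(), 1) for word in value.split()]


def reduce_counts(key, values):
    """Reduce: merge every value seen for one key into a single output."""
    return (key, sum(values))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical driver: map, group by key, then reduce."""
    # Map phase: every input pair produces intermediate pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Grouping phase: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


documents = [("doc1", "big data is big"), ("doc2", "data is everywhere")]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts)
```

In a real cluster the map calls and the reduce calls would run on different machines; the sketch only shows the shape of the data moving between the two functions.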
When to use MapReduce
- MapReduce doesn't work for all problems
- Problems have to be parallelizable
- An algorithm whose steps depend on state left behind by earlier steps is not necessarily a good candidate for MapReduce
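To illustrate the difference between a parallelizable problem and a stateful one, here is a small sketch. The functions `sum_of_squares`, `split`, and `running_balance` are made up for this example, and a thread pool stands in for a cluster of machines.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Work on one chunk without looking at any other chunk."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Cut a list into at most n_chunks contiguous pieces."""
    size = max(1, (len(data) + n_chunks - 1) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]


# Parallelizable: each chunk is processed independently, and the
# partial results are merged afterwards.
data = list(range(1, 101))
chunks = split(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_of_squares, chunks))
parallel_total = sum(partials)


def running_balance(transactions, limit):
    """Stateful: whether a transaction is accepted depends on the
    balance left by every transaction before it."""
    balance = 0
    for amount in transactions:
        if balance + amount <= limit:
            balance += amount
    return balance


print(parallel_total)
print(running_balance([5, 10, 3], 12), running_balance([10, 5, 3], 12))
```

The sum of squares gives the same answer however the data is chunked. The running balance changes when the transactions are reordered, so it cannot simply be cut into independent pieces.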
Commodity hardware
- MapReduce clusters are built from commodity hardware
- x86 processors, several gigabytes of RAM
- In this day and age, additional computers are cheap
- Rather than beef up the machines, just use more of them
Hadoop
- Hadoop is a Java-based MapReduce implementation
- Very popular
- Has a secondary component, HDFS: the Hadoop Distributed File System
HDFS
- File system spread across a Hadoop MapReduce cluster
- Large block sizes: 64 MB by default (128 MB in Hadoop 2 and later)
- Very popular base for other distributed applications, in particular NoSQL applications
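As a rough feel for what a 64 MB block size means, the arithmetic below counts how many blocks a file of a given size would be split into. It is a back-of-the-envelope sketch only: `blocks_needed` is a hypothetical helper, not an HDFS API, and replication is ignored.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted on the slide


def blocks_needed(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


for size_mb in (1, 100, 1024):
    print(size_mb, "MB ->", blocks_needed(size_mb), "block(s)")
```

A 1 GB file becomes 16 blocks under this assumption, and those blocks can sit on different machines in the cluster.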
NoSQL
- NoSQL is a somewhat nebulous term
- Basically means "not SQL," or "something other than SQL"
- Covers many different approaches
- Key-value stores are a big part of the NoSQL movement, so these slides focus on them
Key-Value?!
- This almost seems like a step backward
- Key-value stores are far less structured
- Can't establish relations between entities in a key-value store
- Can't constrain data very well
- Why is reducing the structure gaining popularity?
Distributable nature
- Many key-value stores can be distributed among many nodes
- By distributing the data across these nodes, searches and operations on vast swaths of data can be performed in a sensible amount of time
- Not all are distributed, however: some are single-server applications that keep their data in RAM
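One simple way to picture a key-value store spread over several nodes is hash partitioning: the key's hash decides which node holds it. The sketch below is an assumption made for illustration, with each node modeled as an in-memory dictionary and a hypothetical `PartitionedStore` class; production systems use more elaborate placement schemes.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store whose keys are spread over several nodes."""

    def __init__(self, n_nodes):
        # Each dict stands in for one machine in the cluster.
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node_for(self, key):
        # A stable hash of the key picks the node, so any client can
        # locate a key without searching every node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)


store = PartitionedStore(n_nodes=4)
for i in range(20):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:7"))
print([len(node) for node in store.nodes])
```

A lookup touches only one node, which is what lets the work scale out as nodes are added. The cost is visible too: there is no join or constraint across keys, only `put` and `get`.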
NoSQL key-value implementations
- HBase
- Accumulo
- Memcached
- Dynamo
- Many, many more