An Introduction to Big Data (With a strong focus on Apache) Nick Burch Senior Developer, Alfresco Software VP ConCom, ASF Member
What we'll be covering ● Map Reduce – a new way to process data in a scalable and fault tolerant manner ● Hadoop – an Apache Map-Reduce implementation – what and how ● The Hadoop Ecosystem ● NoSQL – a whistle-stop introduction ● Some Apache NoSQL projects ● And some notable non-Apache ones
Data is Growing ● Data volumes are increasing rapidly ● The value held in that data is increasing ● But traditional storage models can't scale well to cope with storing + analyzing all of this
Big Data – Storing & Analyzing ● Big Data is a broad term, covering many things ● Covers ways to store lots of data ● Covers scalable ways to store data ● Covers scalable ways to retrieve data ● Covers methods to search, analyze and process large volumes of data ● Covers systems that combine many elements to deliver a data solution ● Not one thing – it's a family of solutions and tools
Map Reduce Scalable, Fault Tolerant data processing
Map Reduce – Google's Solution ● Papers published in 2003 (GFS) and 2004 (MapReduce), based on the systems Google had developed internally ● Provides a fault tolerant, automatically retrying, SPOF-avoiding way to process large quantities of data ● Map step reads in chunks of raw data (either from an external source, or a distributed FS), processes it and outputs keys + values ● Reduce step combines these to get the results ● Map step is normally run data-local
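To make the model concrete, here is a toy, single-process sketch in plain Java (no framework involved): the map step turns raw input chunks into (key, value) pairs, a shuffle groups them by key, and the reduce step combines each group. In a real system the framework does the shuffling, across many machines; the names here are purely illustrative.

    import java.util.*;

    // Toy illustration of the MapReduce flow: map emits (key, value)
    // pairs, a shuffle groups them by key, reduce combines each group.
    public class ToyMapReduce {
        public static void main(String[] args) {
            List<String> chunks = Arrays.asList("the cat", "the dog", "the cat sat");

            // Map step: each chunk of raw input becomes (word, 1) pairs
            List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
            for (String chunk : chunks) {
                for (String word : chunk.split(" ")) {
                    mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }

            // Shuffle: group the values by key (the framework's job)
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> kv : mapped) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }

            // Reduce step: combine each key's values into a final result
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                System.out.println(e.getKey() + " -> " + sum);
            }
        }
    }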
Nutch – An Apache web crawler ● Open Source, web scale crawler system based on the Apache Lucene search technology ● Needs to fetch, process, analyze and compare large amounts of data ● Started hitting scaling problems around the same time as the Google MapReduce and GFS papers were published ● Scaling solution was to implement an Open Source, Java version of MapReduce + GFS, and switch Nutch to being built on this
Hadoop – the birth of the elephant ● The Nutch M/R framework worked! ● But it was useful for more than just Nutch ● The framework was pulled out, and Hadoop was born! ● Started as a Lucene subproject ● Became a top-level project (TLP) in 2008 ● Named after Doug Cutting's son's toy stuffed elephant
What is Hadoop? ● An Apache project ● A software framework for data intensive, distributed, fault tolerant applications ● A distributed, replicating, location aware, automatically re-balancing file system ● A framework for writing your map and reduce steps, in a variety of languages ● An engine that drives the scheduling, tracking and execution of Map Reduce tasks ● An ecosystem of related projects and technologies
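As a flavour of what writing your own map and reduce steps looks like, here is a minimal word-count sketch against Hadoop's org.apache.hadoop.mapreduce Java API – roughly the canonical first example, with the job-submission boilerplate omitted:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: reads a line of text, emits (word, 1) for every word
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }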
Growth of Hadoop ● 2004 – Nutch scale problems identified, M/R + GFS identified as a possible solution ● 2004–2005 – Part-time development work from two developers allowed Nutch to scale to many millions of web pages ● 2006 – Yahoo abandon their in-house M/R code, throw their weight behind Hadoop ● 2006–2008 – Yahoo help drive the development of Hadoop, which hits web scale in production in 2008
Growth of Hadoop ● 2008 – Hadoop wins the Terabyte Sort Benchmark, sorting 1TB in 209 seconds ● Many companies get involved, with lots of new committers working on the codebase ● 2010 – Subprojects graduate to TLPs, the Hadoop Ecosystem grows ● Hadoop 1.0 released in December 2011 ● Today – Scales to 4,000 machines, 20 PB of data and millions of jobs per month on a single cluster
The Hadoop Ecosystem ● Lots of projects around Hadoop and HDFS ● These help it work well in new fields, such as data analysis, easier querying, log processing etc ● Many of these are at Apache, including in the Apache Incubator ● Renewed focus recently on reducing external forks of Hadoop, with patches returning to the core ● A range of companies are involved, including big users of Hadoop, and those offering support
Ecosystem – Data Analysis ● One of the key initial uses of Hadoop was to store and then analyze data ● Various projects now exist to make this easier ● Mahout – scalable machine learning and data mining ● Nutch – web-scale crawling and search ● Giraph (Incubating) ● Graph processing platform built on Hadoop ● Vertices send messages to each other, as in Google's Pregel
Ecosystem – Querying ● Various tools now make querying easier ● Pig ● Hive ● Data warehouse tool built on Hadoop, M/R based ● Used by Facebook, Netflix etc ● Sqoop (Incubating) ● Bulk data transfer tool ● Loads and dumps HDFS to/from SQL databases
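As a rough illustration of how Hive makes querying easier, the sketch below runs a HiveQL query over JDBC using the HiveServer2 driver; the "logs" table and the connection details are assumptions made for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Assumes a HiveServer2 instance on the default port,
    // and a table "logs" already defined in Hive
    public class HiveSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();

            // HiveQL looks like SQL, but Hive compiles it to M/R jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT status, COUNT(*) FROM logs GROUP BY status");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }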
Ecosystem – Logs & Streams ● A common use for Hadoop in Operations is to capture large amounts of log data, which Business Analysts (and monitoring!) later use ● Chukwa (Incubating) ● Captures logs from lots of sources, sends them to HDFS (for analysis) and HBase (for visualising) ● M/R anomaly detection, Hive integration ● Flume (Incubating) ● Rapid log storage to HDFS + Hive + full-text search
NoSQL A new way to store and retrieve data
What is NoSQL? ● Not “No SQL”, more “Not Only SQL” ● NoSQL is a broad class of database systems that differ from the old RDBMS model ● Instead of using SQL to query, alternate systems are used to express what is to be fetched ● Table structure is often flexible ● Often scales much, much better (if needed) ● Often does this by relaxing some of ACID ● Consistent, Available, Partition tolerant – pick two (the CAP theorem)
The main kinds of NoSQL stores ● Different models tackle the problem in different ways, and are suited to different uses ● Column Store (BigTable based) ● Document Store ● Key-Value (KV) Store (often Dynamo based) ● Graph Database ● It's worth reading the Dynamo and BigTable papers ● To learn more, see Emil Eifrem's “Past, Present and Future of NoSQL” talk from ApacheCon
Apache - Column Stores ● Data is grouped by column / column family, rather than by row ● Easy to partition, efficient for OLAP tasks ● Cassandra ● HBase ● Accumulo (Incubating) – Cell level permissioning
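As a taste of the column-store model, here is a small HBase sketch using the classic HTable client API of that era (newer HBase versions use Connection/Table instead); the "users" table and "info" column family are assumed to exist already:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Assumes a running HBase cluster reachable via the default config
    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "users");

            // Write one cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));

            table.close();
        }
    }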
Apache – Document Stores ● Stores a wide range of data for a document ● One document can have a different set of data to another, and this can change over time ● Supports a rich, flexible way to store data ● CouchDB ● Jackrabbit
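Since CouchDB speaks plain HTTP and JSON, a document can be stored with nothing more than a PUT. A minimal Java sketch, assuming a local CouchDB on the default port with a "demo" database already created:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Scanner;

    // Stores one JSON document in CouchDB over its REST API;
    // two documents in the same database can have entirely different fields
    public class CouchDbSketch {
        public static void main(String[] args) throws Exception {
            String json = "{\"type\":\"talk\",\"title\":\"Intro to Big Data\"}";

            URL url = new URL("http://localhost:5984/demo/talk-1");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes("UTF-8"));
            }

            // CouchDB answers with JSON, including the new revision id
            try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
                while (in.hasNextLine()) System.out.println(in.nextLine());
            }
        }
    }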
Apache – Others ● Hive – Hadoop + HDFS powered data warehousing tool ● Data stored in HDFS, local FS or S3 ● Queries expressed in HQL, compiled to M/R jobs ● Giraph – Graph Processing System ● Built on Hadoop and ZooKeeper ● Gora – ORM for Column Stores ● ZooKeeper – Core services for writing distributed, highly reliable applications
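For ZooKeeper, a tiny sketch of the core client API, assuming a server on localhost:2181; the ephemeral node created here is the kind of primitive that locks, leader election and presence lists are built from:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    // Connects, creates an ephemeral node, and reads it back
    public class ZkSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, new Watcher() {
                public void process(WatchedEvent event) {
                    System.out.println("Event: " + event);
                }
            });

            // An ephemeral node vanishes when this session dies –
            // the building block for distributed locks and elections
            zk.create("/demo-lock", new byte[0],
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            byte[] data = zk.getData("/demo-lock", false, null);
            System.out.println("Created node with " + data.length + " bytes");
            zk.close();
        }
    }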
Key non-Apache NoSQL Stores ● There are lots of others outside of Apache! ● They do different things or use different technologies – you should look at them too ● Riak – KV, with M/R querying ● Project Voldemort – KV, fault tolerant ● Redis – In-memory KV, optional durability ● MongoDB – Document Store ● Neo4j – Graph Database
Big Data for Business ● Can solve bigger problems than old style data warehousing solutions can ● Delivers a wider range of options and processing models ● Many systems offer high availability, automated recovery, automated adding of new hardware etc (but there's a tradeoff to be had) ● Support contracts are often cheaper than data warehousing, and you get more control ● Licenses are free, or much much cheaper
Things to think about ● How much automation do you need? ● How much data do you have now? How fast are you adding new data? ● How much do you need to retrieve in one go? ● What data do you retrieve based on? ● How will your data change over time? ● How quickly do you need to retrieve it? ● How much processing does the raw data need?
There's no silver bullet! ● Different projects tackle the big data problem in different ways, with different approaches ● There's no “one correct way” to do it ● You need to think about your problems ● Decide what's important to you ● Decide what isn't important (it can't all be....) ● Review the techniques, find the right one for your problem ● Pick the project(s) to use for this
Questions? Want to know more? ● Berlin Buzzwords – videos of talks online