An Introduction to Big Data (With a strong focus on Apache) Nick Burch Senior Developer, Alfresco Software VP ConCom, ASF Member
What we'll be covering ● Map Reduce – a new way to process data in a scalable and fault tolerant manner ● Hadoop – an Apache Map-Reduce implementation – what and how ● The Hadoop Ecosystem ● NoSQL – a whistle-stop introduction ● Some Apache NoSQL projects ● And some notable non-Apache ones
Data is Growing ● Data volumes are increasing rapidly ● The value held in that data is increasing ● But traditional storage models can't scale well to cope with storing + analyzing all of this
Big Data – Storing & Analyzing ● Big Data is a broad term, covering many things ● Covers ways to store lots of data ● Covers scalable ways to store data ● Covers scalable ways to retrieve data ● Covers methods to search, analyze and process large volumes of data ● Covers systems that combine many elements to deliver a data solution ● Not one thing – it's a family of solutions and tools
Map Reduce Scalable, Fault Tolerant data processing
Map Reduce – Google's Solution ● Papers published in 2003+2004, based on the systems Google had developed ● Provides a fault tolerant, automatically retrying, SPOF-avoiding way to process large quantities of data ● Map step reads in chunks of raw data (either from external source, or distributed FS), processes it and outputs keys + values ● Reduce step combines these to get results ● Map step normally data local
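The flow on this slide can be sketched as a single-process toy – not the distributed, fault tolerant runtime Google or Hadoop provide, just the data flow: map emits keys + values, a shuffle groups them by key, and reduce combines each group into a result.

```python
# Minimal single-process sketch of the Map Reduce data flow.
from collections import defaultdict

def map_step(chunk):
    # Read a chunk of raw input, emit intermediate (key, value) pairs
    for word in chunk.split():
        yield word.lower(), 1

def reduce_step(key, values):
    # Combine all values seen for one key into a final result
    return key, sum(values)

def map_reduce(chunks):
    grouped = defaultdict(list)              # the "shuffle": group by key
    for chunk in chunks:
        for key, value in map_step(chunk):
            grouped[key].append(value)
    return dict(reduce_step(k, vs) for k, vs in grouped.items())

print(map_reduce(["the cat sat", "the dog sat down"]))
```

In a real cluster, each map_step call runs near the data it reads ("data local"), and failed map or reduce tasks are simply re-run on another node.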
Nutch – An Apache web crawler ● Open Source, web scale crawler system based on the Apache Lucene search technology ● Needs to fetch, process, analyze and compare large amounts of data ● Started hitting scaling problems around the same time as the Google MapReduce and GFS papers were published ● Scaling solution was to implement an Open Source, Java version of MapReduce + GFS, and switch Nutch to being built on this
Hadoop – the birth of the elephant ● The Nutch M/R framework worked! ● But it was useful for more than just Nutch ● The framework was pulled out, and Hadoop was born! ● Started as a Lucene subproject ● Became a TLP in 2008 ● Named after Doug Cutting's son's toy stuffed elephant
What is Hadoop? ● An Apache project ● A software framework for data intensive, distributed, fault tolerant applications ● A distributed, replicating, location aware, automatically re-balancing file system ● A framework for writing your map and reduce steps, in a variety of languages ● An engine that drives the scheduling, tracking and execution of Map Reduce tasks ● An ecosystem of related projects and technologies
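The "variety of languages" point works through Hadoop Streaming: a map or reduce step can be any program that reads lines on stdin and writes tab-separated key/value lines to stdout. A word-count mapper sketch (hypothetical helper name, fed a list here instead of stdin):

```python
# Sketch of a word-count mapper in Hadoop Streaming style:
# read lines, write "key<TAB>value" lines.
def map_lines(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# In a real Streaming job, Hadoop pipes input via stdin:
#   for pair in map_lines(sys.stdin): print(pair)
for pair in map_lines(["big data is big"]):
    print(pair)
```

The matching reducer would read these sorted pairs back from stdin and sum the counts per word.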
Growth of Hadoop ● 2004 – Nutch scale problems identified, M/R + GFS identified as a possible solution ● 2004-2006 – Part time development work from two developers, allowed Nutch to scale to 20 nodes @ 100M web pages ● 2006 – Yahoo abandon in-house M/R code, throw their weight behind Hadoop ● 2006-2008 – Yahoo help drive the development of Hadoop, hits web scale in production in 2008
Growth of Hadoop ● 2008 – Hadoop wins Terabyte Sorting Benchmark, sorts 1TB in 209 seconds ● Many companies get involved, lots of new committers working on the codebase ● 2010 – Subprojects graduate to TLP, Hadoop Ecosystem grows ● Hadoop 1.0 released in December 2011 ● Today – Scales to 4,000 machines, 20 PB of data, millions of jobs per month on one cluster
The Hadoop Ecosystem ● Lots of projects have grown up around Hadoop and HDFS ● These help it work well in new fields, such as data analysis, easier querying and log processing ● Many of these are at Apache, including in the Apache Incubator ● Renewed focus recently on reducing external forks of Hadoop, with patches returning to the core ● A range of companies are involved, including big users of Hadoop, and those offering support
Ecosystem – Data Analysis ● One of the key initial uses of Hadoop was to store and then analyze data ● Various projects now exist to make this easier ● Mahout ● Nutch ● Giraph (Incubating) ● Graph processing platform built on Hadoop ● Vertices send messages to each other, like Pregel
Ecosystem – Querying ● Various tools now make querying easier ● Pig ● Hive ● Data Warehouse tool built on Hadoop, M/R based ● Facebook, Netflix etc ● Sqoop (Incubating) ● Bulk data transfer tool ● Load and dump HDFS to/from SQL
Ecosystem – Logs & Streams ● A common use for Hadoop in Operations is to capture large amounts of log data, which Business Analysts (and monitoring!) later use ● Chukwa (Incubating) ● Captures logs from lots of sources, sends to HDFS (analysis) and HBase (visualising) ● M/R anomaly detection, Hive integration ● Flume (Incubating) ● Rapid log store to HDFS + Hive + FTS
NoSQL A new way to store and retrieve data
What is NoSQL? ● Not “No SQL”, more “Not Only SQL” ● NoSQL is a broad class of Database Systems that differ from the old RDBMS model ● Instead of using SQL to query, use alternate systems to express what is to be fetched ● Table structure is often flexible ● Often scales much much better (if needed) ● Often does this by relaxing some of ACID ● Consistent, Partition Tolerant, Available – pick 2
The main kinds of NoSQL stores ● Different models tackle the problem in different ways, and are suited to different uses ● Column Store (BigTable based) ● Document Store ● KV Store (often Dynamo based) ● Graph Database ● It's worth reading Dynamo+BigTable papers ● To learn more, see Emil Eifrem's “Past, Present and Future of NoSQL” talk from ApacheCon
Apache - Column Stores ● Data is grouped by column / column family, rather than by row ● Easy to partition, efficient for OLAP tasks ● Cassandra ● HBase ● Accumulo (Incubating) – Cell level permissioning
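A toy illustration of the layout described above (not any real Cassandra, HBase or Accumulo API): pivoting rows into one list per column means an OLAP-style aggregate only scans the single column it needs.

```python
# Row-oriented data, as an RDBMS would store it
rows = [
    {"user": "alice", "country": "UK", "spend": 120},
    {"user": "bob",   "country": "US", "spend": 80},
    {"user": "carol", "country": "UK", "spend": 200},
]

# Column layout: one list per column name
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating "spend" now touches only that one column
total_spend = sum(columns["spend"])
print(total_spend)  # 400
```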
Apache – Document Stores ● Stores a wide range of data for a document ● One document can have a different set of data to another, and this can change over time ● Supports a rich, flexible way to store data ● CouchDB ● Jackrabbit
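The schema flexibility above can be shown with a toy in-memory "document store" (hypothetical put/get helpers, not the CouchDB or Jackrabbit API): two documents in the same store hold different fields, and a document can gain fields over time with no schema migration.

```python
# Toy document store: documents are free-form dicts keyed by id
store = {}

def put(doc_id, doc):
    store[doc_id] = doc

def get(doc_id):
    return store[doc_id]

put("report-1", {"title": "Q3 Report", "author": "nick"})
put("photo-1", {"title": "Elephant", "width": 800, "height": 600})

# A document's shape can change over time
get("report-1")["tags"] = ["big-data", "hadoop"]
print(get("report-1"))
```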
Apache – Others ● Hive – Hadoop + HDFS powered data warehousing tool ● Data stored into HDFS, Local or S3 ● Queries performed as HQL, compiles to M/R jobs ● Giraph – Graph Processing System ● Built on Hadoop and ZooKeeper ● Gora – ORM for Column Stores ● ZooKeeper – Core services for writing distributed, highly reliable applications
Key non-Apache NoSQL Stores ● There are lots of others outside of Apache! ● They do different things or use different technologies – you should look at them too ● Riak – KV, with M/R query ● Project Voldemort – KV, fault tolerant ● Redis – In-Memory KV, optional durability ● MongoDB – Document Store ● Neo4J – Graph Database
Big Data for Business ● Can solve bigger problems than old style data warehousing solutions can ● Delivers a wider range of options and processing models ● Many systems offer high availability, automated recovery, automated adding of new hardware etc (but there's a tradeoff to be had) ● Support contracts are often cheaper than data warehousing, and you get more control ● Licenses are free, or much much cheaper
Things to think about ● How much automation do you need? ● How much data do you have now? How fast are you adding new data? ● How much do you need to retrieve in one go? ● What data do you retrieve based on? ● How will your data change over time? ● How quickly do you need to retrieve it? ● How much processing does the raw data need?
There's no silver bullet! ● Different projects tackle the big data problem in different ways, with different approaches ● There's no “one correct way” to do it ● You need to think about your problems ● Decide what's important to you ● Decide what isn't important (it can't all be!) ● Review the techniques, find the right one for your problem ● Pick the project(s) to use for this
Questions? Want to know more? ● http://www.apache.org/ ● http://hadoop.apache.org/ ● http://projects.apache.org/ ● Berlin Buzzwords – Videos of talks online ● @TheASF ● @Gagravarr