An Introduction to Big Data (With a strong focus on Apache) Nick Burch – Senior Developer, Alfresco Software; VP ConCom; ASF Member

What we'll be covering ● Map Reduce – a new way to process data in a scalable and fault tolerant manner ● Hadoop – an Apache Map-Reduce implementation – what and how ● The Hadoop Ecosystem ● NoSQL – a whistle-stop introduction ● Some Apache NoSQL projects ● And some notable non-Apache ones

Data is Growing ● Data volumes are increasing rapidly ● The value held in that data is increasing ● But traditional storage models can't scale well to cope with storing + analyzing all of this

Big Data – Storing & Analyzing ● Big Data is a broad term, covering many things ● Covers ways to store lots of data ● Covers scalable ways to store data ● Covers scalable ways to retrieve data ● Covers methods to search, analyze and process large volumes of data ● Covers systems that combine many elements to deliver a data solution ● Not one thing – it's a family of solutions and tools

Map Reduce Scalable, Fault Tolerant data processing

Map Reduce – Google's Solution ● Papers published by Google in 2003–04 (the GFS and MapReduce papers), based on the systems Google had developed ● Provides a fault-tolerant, automatically retrying, SPOF-avoiding way to process large quantities of data ● Map step reads in chunks of raw data (either from an external source, or the distributed FS), processes it and outputs keys + values ● Reduce step combines these to get the results ● Map step normally runs data-local, on the node holding the chunk
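
To make the model concrete, here is word counting expressed as a map step and a reduce step – a minimal plain-Java sketch of the idea, with no framework involved. The framework's job (not shown) is to shuffle all the pairs emitted under one key to the same reduce call:

    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class MapReduceSketch {
        // Map: read one chunk of raw data, emit (key, value) pairs
        static List<Map.Entry<String, Integer>> map(String chunk) {
            List<Map.Entry<String, Integer>> emitted =
                new ArrayList<Map.Entry<String, Integer>>();
            for (String word : chunk.split("\\s+")) {
                emitted.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
            return emitted;
        }

        // Reduce: combine all the values emitted under one key into one result
        static int reduce(String word, List<Integer> counts) {
            int total = 0;
            for (int c : counts) total += c;
            return total;
        }
    }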

Nutch – An Apache web crawler ● Open Source, web scale crawler system based on the Apache Lucene search technology ● Needs to fetch, process, analyze and compare large amounts of data ● Started hitting scaling problems around the same time as the Google MapReduce and GFS papers were published ● Scaling solution was to implement an Open Source, Java version of MapReduce + GFS, and switch Nutch to being built on this

Hadoop – the birth of the elephant ● The Nutch M/R framework worked! ● But it was useful for far more than just Nutch ● The framework was pulled out, and Hadoop was born! ● Started as a Lucene sub-project in 2006 ● Became a top-level Apache project (TLP) in 2008 ● Named after Doug Cutting's son's toy stuffed elephant

What is Hadoop? ● An Apache project ● A software framework for data intensive, distributed, fault tolerant applications ● A distributed, replicating, location aware, automatically re-balancing file system ● A framework for writing your map and reduce steps, in a variety of languages ● An engine that drives the scheduling, tracking and execution of Map Reduce tasks ● An ecosystem of related projects and technologies
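
For instance, the classic word count from the Hadoop tutorials, condensed into a sketch against the Hadoop 1.x-era mapreduce API – a mapper, a reducer, and a driver that submits the job:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));  // emit (word, total)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output on HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }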

Growth of Hadoop ● 2004 – Nutch scaling problems identified; M/R + GFS identified as a possible solution ● 2004–05 – Part-time development work from two developers allowed Nutch to scale to millions of web pages ● 2006 – Yahoo! abandons its in-house M/R code and throws its weight behind Hadoop ● 2006–08 – Yahoo! helps drive the development of Hadoop, which hits web scale in production in 2008

Growth of Hadoop ● 2008 – Hadoop wins Terabyte Sorting Benchmark, sorts 1TB in 209 seconds ● Many companies get involved, lots of new committers working on the codebase ● 2010 – Subprojects graduate to TLP, Hadoop Ecosystem grows ● Hadoop 1.0 released in December 2011 ● Today – Scales to 4,000 machines, 20 PB of data, millions of jobs per month on one cluster

The Hadoop Ecosystem ● Lots of projects have grown up around Hadoop and HDFS ● These help it work well in new fields, such as data analysis, easier querying, log handling etc ● Many of these are at Apache, including in the Apache Incubator ● Renewed focus recently on reducing external forks of Hadoop, with patches returning to the core ● A range of companies are involved, including big users of Hadoop, and those offering support

Ecosystem – Data Analysis ● One of the key initial uses of Hadoop was to store and then analyze data ● Various projects now exist to make this easier ● Mahout – scalable machine learning library ● Nutch – the web crawler where Hadoop began ● Giraph (Incubating) ● Graph processing platform built on Hadoop ● Vertices send messages to each other, as in Google's Pregel (sketched below)
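
Roughly, the Pregel model works like this – an illustrative sketch only, not Giraph's actual class names: in each superstep a vertex receives the messages sent to it in the previous superstep, updates its own value, and sends messages to its neighbours:

    // Illustrative Pregel-style vertex -- not Giraph's real API
    abstract class Vertex<V, M> {
        V value;

        // Called once per superstep, with the messages sent last superstep
        abstract void compute(Iterable<M> messages);

        // Messages are delivered to neighbours at the next superstep
        void sendMessageToAllEdges(M message) { /* provided by the framework */ }

        // An inactive vertex is only woken again by an incoming message
        void voteToHalt() { /* provided by the framework */ }
    }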

Ecosystem – Querying ● Various tools now make querying easier ● Pig – data flow scripting language (Pig Latin) that compiles to M/R jobs ● Hive ● Data warehouse tool built on Hadoop, M/R based ● Used by Facebook, Netflix etc ● Sqoop (Incubating) ● Bulk data transfer tool ● Loads and dumps data between HDFS and SQL databases
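
To give a flavour of the Hive side, queries can be issued from Java through Hive's JDBC driver. A sketch, assuming a Hive server on its default port 10000 and a hypothetical weblogs table – note that the driver class name and URL format vary between Hive versions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Load the Hive JDBC driver (class name varies across Hive versions)
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // This SQL-like HQL query is compiled down to M/R jobs by Hive
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }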

Ecosystem – Logs & Streams ● A common use for Hadoop in Operations is to capture large amounts of log data, which Business Analysts (and monitoring!) later use ● Chukwa (Incubating) ● Captures logs from lots of sources, sends to HDFS (analysis) and HBase (visualising) ● M/R anomaly detection, Hive integration ● Flume (Incubating) ● Rapid log store to HDFS + Hive + FTS

NoSQL A new way to store and retrieve data

What is NoSQL? ● Not “No SQL”, more “Not Only SQL” ● NoSQL is a broad class of database systems that differ from the old RDBMS model ● Instead of using SQL to query, use alternate systems to express what is to be fetched ● Table structure is often flexible ● Often scales much, much better (if needed) ● Often does this by relaxing some of ACID ● Consistent, Available, Partition tolerant – pick 2 (the CAP theorem)

The main kinds of NoSQL stores ● Different models tackle the problem in different ways, and are suited to different uses ● Column Store (BigTable based) ● Document Store ● KV Store (often Dynamo based) ● Graph Database ● It's worth reading Dynamo+BigTable papers ● To learn more, see Emil Eifrem's “Past, Present and Future of NoSQL” talk from ApacheCon

Apache – Column Stores ● Data is grouped by column / column family, rather than by row ● Easy to partition, efficient for OLAP tasks ● Cassandra ● HBase ● Accumulo (Incubating) – cell-level permissions
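
A small sketch of the HBase client API of this era (later releases renamed these classes); the webtable table, row key and column family here are made-up examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "webtable");  // hypothetical table

            // Write one cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the cell back by row key + column
            Get get = new Get(Bytes.toBytes("com.example/index.html"));
            Result result = table.get(get);
            byte[] html = result.getValue(Bytes.toBytes("contents"),
                                          Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
            table.close();
        }
    }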

Apache – Document Stores ● Stores a wide range of data for each document ● One document can have a different set of data to another, and this can change over time ● Supports a rich, flexible way to store data ● CouchDB ● Jackrabbit
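
CouchDB, for instance, is spoken to over plain HTTP with JSON documents, so no client library is strictly needed. A minimal sketch – the articles database (which must already exist) and the document ID are made up; 5984 is CouchDB's default port:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CouchDbPut {
        public static void main(String[] args) throws Exception {
            // Create or update a document with an HTTP PUT of a JSON body
            URL url = new URL("http://localhost:5984/articles/intro-to-big-data");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");

            String doc = "{\"title\": \"An Introduction to Big Data\", "
                       + "\"tags\": [\"hadoop\", \"nosql\"]}";
            OutputStream out = conn.getOutputStream();
            out.write(doc.getBytes("UTF-8"));
            out.close();

            // 201 Created on success; 409 Conflict if the revision is stale
            System.out.println("CouchDB replied: " + conn.getResponseCode());
        }
    }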

Apache – Others ● Hive – Hadoop + HDFS powered data warehousing tool ● Data stored in HDFS, on local disk, or in S3 ● Queries written in HQL (Hive's SQL-like query language), compiled to M/R jobs ● Giraph – graph processing system ● Built on Hadoop and ZooKeeper ● Gora – ORM for column stores ● ZooKeeper – core services for writing distributed, highly reliable applications (sketched below)
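
A small sketch of the ZooKeeper client API – the /workers path is hypothetical, and its parent node must already exist:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble: host:port list plus a session timeout
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
                public void process(WatchedEvent event) { /* no-op watcher */ }
            });

            // Ephemeral nodes vanish when the creating session dies --
            // the building block for liveness tracking and leader election
            zk.create("/workers/worker-1", "host-a".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            System.out.println("Live workers: " + zk.getChildren("/workers", false));
            zk.close();
        }
    }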

Key non-Apache NoSQL Stores ● There are lots of others outside of Apache! ● They do different things, or use different technologies – you should look at them too ● Riak – KV, with M/R querying ● Project Voldemort – KV, fault tolerant ● Redis – in-memory KV, optional durability ● MongoDB – document store ● Neo4j – graph database

Big Data for Business ● Can solve bigger problems than old style data warehousing solutions can ● Delivers a wider range of options and processing models ● Many systems offer high availability, automated recovery, automated adding of new hardware etc (but there's a tradeoff to be had) ● Support contracts are often cheaper than data warehousing, and you get more control ● Licenses are free, or much much cheaper

Things to think about ● How much automation do you need? ● How much data do you have now? How fast are you adding new data? ● How much do you need to retrieve in one go? ● What data do you retrieve based on? ● How will your data change over time? ● How quickly do you need to retrieve it? ● How much processing does the raw data need?

There's no silver bullet! ● Different projects tackle the big data problem in different ways, with different approaches ● There's no “one correct way” to do it ● You need to think about your problems ● Decide what's important to you ● Decide what isn't important (it can't all be....) ● Review the techniques, find the right one for your problem ● Pick the project(s) to use for this

Questions? Want to know more? ● Berlin Buzzwords – videos of talks online