
1 Big Data

2 Definition: Big Data is a term for a collection of datasets so large and complex that they become difficult to process using conventional database management tools or traditional data processing applications.

3 What is ‘Big’ ‘Data’? Is it: too big to be stored on a single server? Too unstructured to fit into a row/column database? Too voluminous or dynamic to fit into a static data warehouse?

4 How big is big? The figures shown were from 2011; by the end of 2012 there were more than 2.4 billion Internet users (http://www.thecultureist.com/2013/05/09/how-many-people-use-the-internet-more-than-2-billion-infographic/). There are now about 7 billion connected devices.

5 Scale of data

6

7 Opportunity for Big Data

8 Sexiest job of the 21st century? http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1

9 Opportunity for Big Data

10 Candidates for Big Data

11 Big Data gap

12 Community usage: users include Adobe, Alibaba, Amazon, AOL, Facebook, Google, and IBM.

13 Conventional approach

14 Issues with the conventional approach: batch-oriented; batches cannot be interrupted or reconfigured on-the-fly; schema management required in multiple places; lots of data being shuffled around; data duplication; turn-around times of hours rather than minutes.

15 Big Data systems (real time): applications (indexing and search, usage analytics, insights and recommendations, CRUD) backed by views over a data backend (data, metadata, attention data, indexes).

16 A data system – example (perhaps): raw data (e.g. tweets) feeds views such as View 1 (e.g. #tweets per URL), View 2 (e.g. influence scores), and View 3 (e.g. trending topics), which are consumed by applications for indexing and search, usage analytics, insights and recommendations, and CRUD.

17 Properties of a data system: robustness; fast reads AND updates/inserts; scalable; generic; extensible; allows ad-hoc analysis; low-cost maintenance; debuggable.

18 Big Data Technologies

19 MongoDB Hadoop

20 MongoDB: horizontally scalable, document oriented, high performance, fully consistent. Example document: { author: "steve", date: new Date(), text: "About MongoDB...", tags: ["tech", "database"] }
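To make the document model concrete, here is a minimal sketch (not part of the original slides) that stores and queries a document like the one above using the pymongo driver; it assumes a MongoDB server on localhost, and the "blog"/"posts" names are purely illustrative:

```python
# Minimal sketch: storing the slide's example document with pymongo.
# Assumes a local MongoDB server and `pip install pymongo`; the
# database/collection names ("blog", "posts") are illustrative.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["blog"]["posts"]

# Insert a schema-less document, equivalent to the slide's example.
posts.insert_one({
    "author": "steve",
    "date": datetime.utcnow(),
    "text": "About MongoDB...",
    "tags": ["tech", "database"],
})

# Query by a field inside the document, including array membership.
for doc in posts.find({"tags": "database"}):
    print(doc["author"], doc["text"])
```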

21 MongoDB philosophy: keep functionality when we can (key/value stores are great, but we need more); non-relational (no joins) makes scaling horizontally practical; document data models are good; database technology should run anywhere: virtualized, cloud, bare metal, etc.

22 Under the hood: written in C++; runs nearly everywhere; data serialized to BSON; extensive use of memory-mapped files, i.e. read-through, write-through memory caching.

23 Database landscape (chart): scalability & performance versus depth of functionality, positioning Memcached, MongoDB, and RDBMSs.

24 MongoDB “MongoDB has the best features of key/value stores, document databases and relational databases in one.” John Nunemaker

25 Relational modelling makes normalized data look like this: separate User (name, email address), Category (name, URL), Article (name, slug, publish date, text), Tag (name, URL), and Comment (date, author) tables.

26 Document databases make the same data look like this: a User document (name, email address) and an Article document (name, slug, publish date, text, author) with embedded Category[], Tag[], and Comment[] (date, author) arrays.
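As an illustration of that shape (a sketch; the field names follow the slide, the values are invented), the whole article collapses into one document:

```python
# Sketch of the slide's document model as plain Python dicts.
# Field names follow the slide; the values are made up for illustration.
user = {"name": "steve", "email_address": "steve@example.com"}

article = {
    "name": "About MongoDB",
    "slug": "about-mongodb",
    "publish_date": "2013-05-09",
    "text": "About MongoDB...",
    "author": "steve",                      # reference to the user by name
    "tags": ["tech", "database"],           # Tag[] embedded as an array
    "categories": ["databases"],            # Category[] embedded as an array
    "comments": [                           # Comment[] embedded as sub-documents
        {"date": "2013-05-10", "author": "anna"},
    ],
}
```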

27 When MongoDB? Online processing; working on small subsets at a time; processing document-store data (like game data) while interacting in-game (items/HP/XP, graphics, stage, etc.).

28 Hadoop. What is Hadoop? A scalable, fault-tolerant, high-performance distributed file system; asynchronous replication; write-once, read-many (WORM); a Hadoop cluster has a minimum of 3 DataNodes; data is divided into 64 MB or 128 MB blocks, with each block replicated 3 times by default; the NameNode holds filesystem metadata; files are broken up and spread over the DataNodes.
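As a rough worked example of those block and replication numbers (a sketch only; exact on-disk overheads vary with configuration and checksums):

```python
# Rough sketch of how HDFS block counts and raw storage add up under the
# defaults quoted on the slide (64 MB blocks, replication factor 3).
import math

def hdfs_footprint(file_size_bytes, block_size=64 * 1024**2, replication=3):
    """Return (number of HDFS blocks, approximate raw bytes stored cluster-wide)."""
    blocks = math.ceil(file_size_bytes / block_size)
    # A partial final block only occupies its actual size on disk, so raw
    # usage is roughly the file size multiplied by the replication factor.
    return blocks, file_size_bytes * replication

# Example: a 1 GB file under the defaults -> 16 blocks, ~3 GB of raw storage.
print(hdfs_footprint(1 * 1024**3))
```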

29 Benefits of Hadoop: runs on cheap commodity hardware; automatically handles data replication and node failure; it does the hard work – you can focus on processing data; cost-saving, efficient, and reliable data processing.

30 Where and when Hadoop? Where: batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling); highly parallel, data-intensive distributed applications; very large production deployments. When: you process lots of unstructured data; your processing can easily be made parallel; running batch jobs is acceptable; you have access to lots of cheap hardware.

31 What is Hadoop used for? Searching, log processing, recommendation systems, analytics, video and image analysis, data retention. Hadoop handles massively scalable data (100 PB of data) and ingest at scale (sensor data, clickstream, online gaming data such as user activity, time online, and activities performed) for later analytics.

32 How does Hadoop work? Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.

33 Hadoop architecture: a distributed file system and MapReduce. HDFS runs on top of the existing file system on each node and is designed to handle large files with streaming data access patterns (sequential data access rather than random access).

34 HDFS, the Hadoop Distributed File System, is the primary storage system for Hadoop applications. Multiple replicas of data blocks are distributed across compute nodes for reliability, and files are stored on multiple boxes for durability and high availability.

35 HDFS is optimized for long sequential reads: data is written once and read multiple times, with no append possible; because files are large and reads are sequential, there is no local caching of data; data is replicated.

36 HDFS Architecture

37 Block-structured file system: a file is divided into blocks and stored; each individual machine in the cluster is a DataNode; the default block size is 64 MB; information about the blocks is kept as metadata, and all this metadata is stored on the machine that acts as the NameNode.

38 MapReduce. Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

39 MapReduce is technology from Google. A MapReduce program contains transformations that can be applied to the data any number of times; a MapReduce job is an executing MapReduce program, with map tasks running in parallel with each other and reduce tasks likewise running in parallel with each other.

40 MapReduce

41 HDFS handles the distributed file system layer; MapReduce is how we process the data. MapReduce daemons: JobTracker and TaskTracker. Goals: distribute the reading and processing of data; localize the processing when possible; share as little data as possible while processing.

42 JobTracker: one per cluster (the "master node"); takes jobs from clients; splits work into "tasks"; distributes tasks to TaskTrackers; monitors progress and deals with failures.

43 TaskTracker: many per cluster (the "slave nodes"); does the actual work and executes the code for the job; talks regularly with the JobTracker; launches a child process when given a task; reports the progress of the running task back to the JobTracker.

44 Anatomy of MapReduce. Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client. Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java.

45 Submitting a MapReduce job

46 Simple Data Flow Example

47 Example of MapReduce: read text files and count how often words occur. The input is text files; the output is a text file with one line per word: word, tab, count. Map: produce pairs of (word, count). Reduce: for each word, sum up the counts.
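One way to express this word count (a sketch, not the presentation's own code) is with Hadoop Streaming, which runs plain scripts as the map and reduce steps, reading from stdin and writing tab-separated lines; the file names mapper.py and reducer.py are illustrative:

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum counts per word. Hadoop sorts the map output by key,
# so identical words arrive on consecutive lines of stdin.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would be submitted with something like `hadoop jar <hadoop-streaming jar> -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs input dir> -output <hdfs output dir>` (paths depend on the installation).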

48 Anatomy of MapReduce. The client submits a job: "I want to count the occurrences of each word" (we assume the data to process is already in HDFS). The JobTracker receives the job and queries the NameNode for the number of blocks in the file; the job is split into tasks, with one map task per block and as many reduce tasks as specified in the job. Each TaskTracker checks in regularly with the JobTracker: "Is there any work for me?" If the JobTracker has a map task whose input block is local to that TaskTracker, the TaskTracker is given that task.

49 Hadoop, pros & cons. Pros: superior availability / scalability / manageability; large block sizes suit large files (giga, peta, ...); extremely scalable due to HDFS; batch-based MapReduce facilitates parallel work. Cons: programmability and metadata; less efficient for smaller files; MapReduce is more complex than traditional SQL queries; MapReduce is batch-based, so there is delay; need to publish data in well-known schemas.

50 Hive turns Hadoop into a data warehouse. Developed at Facebook; declarative language (SQL dialect), HiveQL; schema is non-optional, but data can have many schemas; relationally complete.
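For flavour, here is the kind of declarative HiveQL query the slide is describing, wrapped in a small Python sketch that hands it to the Hive command-line client's -e option (a sketch only: the table and column names, page_views, url, view_date, are invented, and it assumes the hive CLI is installed and on the PATH; analysts would normally just type the query into the Hive shell):

```python
# Sketch: running an illustrative HiveQL aggregation via the Hive CLI.
# The table and columns (page_views, url, view_date) are hypothetical.
import subprocess

query = """
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2013-05-09'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
"""

subprocess.run(["hive", "-e", query], check=True)
```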

51 Data warehousing at Facebook (case study): Hadoop/Hive cluster with 8,400 cores and ~12.5 PB of raw storage capacity; 8 cores + 12 TB per node, 32 GB RAM per node; two-level network topology (1 Gbit/sec from node to rack switch, 4 Gbit/sec to the top-level rack switch); 2 clusters, one for ad-hoc users and one for strict-SLA jobs.

52 Hadoop/Hive usage at Facebook, statistics per day: 12 TB of compressed new data added, 135 TB of compressed data scanned, 7,500+ Hive jobs, 80k computing hours. Hive simplifies Hadoop: new engineers go through a Hive training session; ~200 people/month run jobs on Hadoop/Hive; analysts (non-engineers) use Hadoop through Hive; most jobs are Hive jobs. (Data warehousing at Facebook, case study.)

53 Types of applications (data warehousing at Facebook, case study): reporting, e.g. daily/weekly aggregations of impression/click counts, measures of user engagement, MicroStrategy reports; ad-hoc analysis, e.g. how many group admins broken down by state/country; machine learning (assembling training data), e.g. ad optimization and user engagement as a function of user attributes; many others.

54 Scribe: a service for distributed log file collection, designed to run as a daemon process on every node in the data center and to forward log files from any process running on that machine back to a central pool of aggregators. Scribe is a server for aggregating streaming log data, designed to scale to a very large number of nodes and to be robust to network and node failures.

55 Scribe and Hadoop clusters at Facebook: used to log data from web servers; clusters are collocated with the web servers; the network is the biggest bottleneck; a typical cluster has about 50 nodes.

56 Data Flow Architecture at Facebook

57 OK – what do we know so far? We’ve talked about MongoDB for reasonably large amounts of unstructured data, and processing it quickly. We’ve talked about Hadoop (Hive, Scribe, etc.) for holding really large amounts of unstructured data so we can process and mine it with HiveQL. What if we have a combined need to process large and huge data sets for the same app? What if we need to process good amounts of unstructured data right away, while storing lots more of it for analysis later?

58 Applications have complex needs. Let’s use the best tool for the job; often more than one tool is needed. MongoDB is an ideal operational database and is ideal for BIG data, but it is not really a data processing engine; for heavy processing needs, use a tool designed for that job... Hadoop.

59 MongoDB & Hadoop together. Hadoop: massively scalable data (100 PB of data), ingest at scale (sensor data, clickstream, online gaming data such as user activity, time online, and activities performed), later analytics. MongoDB: online processing, working on small subsets at a time, processing document-store data (like game data) while interacting in-game (items/HP/XP, graphics, stage, etc.).

60 MongoDB MapReduce: MongoDB map-reduce is quite capable... but with limits: JavaScript is not the best language for map-reduce processing; JavaScript is limited in external data-processing libraries; it adds load to the data store; sharded environments do parallel processing.
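As an illustration of what this looks like from a driver, here is a sketch using the pymongo 3.x Collection.map_reduce helper (removed in pymongo 4.0, and the server-side mapReduce command is itself deprecated in recent MongoDB releases). The collection and field names (demo.tweets, url) are invented; note that the map and reduce bodies must be JavaScript strings, which is exactly the limitation above:

```python
# Sketch: counting tweets per URL with MongoDB map-reduce (pymongo 3.x API).
# Collection/field names are illustrative; map and reduce are JavaScript.
from bson.code import Code
from pymongo import MongoClient

tweets = MongoClient()["demo"]["tweets"]

map_js = Code("function () { emit(this.url, 1); }")
reduce_js = Code("function (key, values) { return Array.sum(values); }")

# Writes the results to the "tweets_per_url" collection on the server.
result = tweets.map_reduce(map_js, reduce_js, "tweets_per_url")
for doc in result.find():
    print(doc["_id"], doc["value"])
```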

61 MongoDB Map Reduce

62 MongoDB Aggregation: most uses of MongoDB MapReduce were for aggregation, and the Aggregation Framework is optimized for aggregate queries. It fixes some of the limits of MongoDB MR, can do real-time aggregation similar to SQL GROUP BY, and does parallel processing on sharded clusters.
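For comparison, the same tweets-per-URL count expressed with the aggregation framework through pymongo's aggregate() (a sketch; the same invented demo.tweets collection as above), which runs server-side without any JavaScript:

```python
# Sketch: the tweets-per-URL count as an aggregation pipeline (no JavaScript).
from pymongo import MongoClient

tweets = MongoClient()["demo"]["tweets"]

pipeline = [
    {"$group": {"_id": "$url", "count": {"$sum": 1}}},  # like SQL GROUP BY url
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for doc in tweets.aggregate(pipeline):
    print(doc["_id"], doc["count"])
```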

63 MongoDB Aggregation

64 MongoDB Map Reduce

65 Hadoop Map Reduce

66 MongoDB & Hadoop

67 Some videos: Mongo MapReduce https://www.youtube.com/watch?v=WovfjprPD_I; MySQL & NoSQL at Craigslist https://www.youtube.com/watch?v=a0OvgTfF8Pg; 9 Databases in 45 minutes https://www.youtube.com/watch?v=XfK4aBF7tEI

