
1 Big Data

2 Definition: Big Data is a term for a collection of datasets so large and complex that they become difficult to process using conventional database management tools or traditional data processing applications.

3 What is ‘Big’ ‘Data’? Is it: too big to be stored on a single server? Too unstructured to fit into a row/column database? Too voluminous or dynamic to fit into a static data warehouse?

4 How big is big? The figures shown were from 2011; by the end of 2012 there were more than 2.4 billion Internet users (http://www.thecultureist.com/2013/05/09/how-many-people-use-the-internet-more-than-2-billion-infographic/). There are now about 7 billion connected devices.

5 Scale of data

6

7 Opportunity for Big Data

8 Sexiest job of the 21st century? http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1

9 Opportunity for Big Data

10 Candidates for Big Data

11 Big Data gap

12 Community usage: users include Adobe, Alibaba, Amazon, AOL, Facebook, Google, and IBM.

13 Conventional approach

14 Issues with the conventional approach: batch-oriented; batches cannot be interrupted or reconfigured on-the-fly; schema management required in multiple places; lots of data being shuffled around; data duplication; turn-around times of hours rather than minutes.

15 Big Data systems (real time): applications (indexing and search, usage analytics, insights and recommendations, CRUD) backed by views over a data backend (data, metadata, attention data, indexes).

16 A data system – example (perhaps): raw data (e.g. tweets) feeds views such as View 1 (e.g. #tweets per URL), View 2 (e.g. influence scores), and View 3 (e.g. trending topics), which are consumed by applications for indexing and search, usage analytics, insights and recommendations, and CRUD.

17 Properties of a data system: robustness; fast reads AND updates/inserts; scalable; generic; extensible; allows ad-hoc analysis; low-cost maintenance; debuggable.

18 Big Data Technologies

19 MongoDB Hadoop

20 MongoDB: horizontally scalable, document oriented, high performance, fully consistent. Example document: { author: "steve", date: new Date(), text: "About MongoDB...", tags: ["tech", "database"] }
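To make the document model concrete, here is a minimal sketch (not part of the original slides) that stores and queries a document like the one above using the pymongo driver; it assumes a MongoDB server on localhost, and the "blog"/"posts" names are purely illustrative:

```python
# Minimal sketch: storing the slide's example document with pymongo.
# Assumes a local MongoDB server and `pip install pymongo`; the
# database/collection names ("blog", "posts") are illustrative.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["blog"]["posts"]

# Insert a schema-less document, equivalent to the slide's example.
posts.insert_one({
    "author": "steve",
    "date": datetime.utcnow(),
    "text": "About MongoDB...",
    "tags": ["tech", "database"],
})

# Query by a field inside the document, including array membership.
for doc in posts.find({"tags": "database"}):
    print(doc["author"], doc["text"])
```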

21 MongoDB philosophy: keep functionality when we can (key/value stores are great, but we need more); non-relational (no joins) makes scaling horizontally practical; document data models are good; database technology should run anywhere: virtualized, cloud, bare metal, etc.

22 Under the hood: written in C++; runs nearly everywhere; data serialized to BSON; extensive use of memory-mapped files, i.e. read-through, write-through memory caching.

23 Database landscape (chart): scalability & performance versus depth of functionality, positioning Memcached, MongoDB, and RDBMSs.

24 MongoDB “MongoDB has the best features of key/value stores, document databases and relational databases in one.” John Nunemaker

25 Relational modelling makes normalized data look like this: separate User (name, email address), Category (name, URL), Article (name, slug, publish date, text), Tag (name, URL), and Comment (date, author) tables.

26 Document databases make the same data look like this: a User document (name, email address) and an Article document (name, slug, publish date, text, author) with embedded Category[], Tag[], and Comment[] (date, author) arrays.
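As an illustration of that shape (a sketch; the field names follow the slide, the values are invented), the whole article collapses into one document:

```python
# Sketch of the slide's document model as plain Python dicts.
# Field names follow the slide; the values are made up for illustration.
user = {"name": "steve", "email_address": "steve@example.com"}

article = {
    "name": "About MongoDB",
    "slug": "about-mongodb",
    "publish_date": "2013-05-09",
    "text": "About MongoDB...",
    "author": "steve",                      # reference to the user by name
    "tags": ["tech", "database"],           # Tag[] embedded as an array
    "categories": ["databases"],            # Category[] embedded as an array
    "comments": [                           # Comment[] embedded as sub-documents
        {"date": "2013-05-10", "author": "anna"},
    ],
}
```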

27 When MongoDB? Online processing; working on small subsets at a time; processing document-store data (like game data) while interacting in-game (items/HP/XP, graphics, stage, etc.).

28 Hadoop. What is Hadoop? A scalable, fault-tolerant, high-performance distributed file system; asynchronous replication; write-once, read-many (WORM); a Hadoop cluster has a minimum of 3 DataNodes; data is divided into 64 MB or 128 MB blocks, with each block replicated 3 times by default; the NameNode holds filesystem metadata; files are broken up and spread over the DataNodes.
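As a rough worked example of those block and replication numbers (a sketch only; exact on-disk overheads vary with configuration and checksums):

```python
# Rough sketch of how HDFS block counts and raw storage add up under the
# defaults quoted on the slide (64 MB blocks, replication factor 3).
import math

def hdfs_footprint(file_size_bytes, block_size=64 * 1024**2, replication=3):
    """Return (number of HDFS blocks, approximate raw bytes stored cluster-wide)."""
    blocks = math.ceil(file_size_bytes / block_size)
    # A partial final block only occupies its actual size on disk, so raw
    # usage is roughly the file size multiplied by the replication factor.
    return blocks, file_size_bytes * replication

# Example: a 1 GB file under the defaults -> 16 blocks, ~3 GB of raw storage.
print(hdfs_footprint(1 * 1024**3))
```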

29 Benefits of Hadoop: runs on cheap commodity hardware; automatically handles data replication and node failure; it does the hard work – you can focus on processing data; cost-saving, efficient, and reliable data processing.

30 Where and when Hadoop? Where: batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling); highly parallel, data-intensive distributed applications; very large production deployments. When: you process lots of unstructured data; your processing can easily be made parallel; running batch jobs is acceptable; you have access to lots of cheap hardware.

31 What is Hadoop used for? Searching, log processing, recommendation systems, analytics, video and image analysis, data retention. Hadoop handles massively scalable data (100 PB of data) and ingest at scale (sensor data, clickstream, online gaming data such as user activity, time online, and activities performed) for later analytics.

32 How does Hadoop work? Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.

33 Hadoop architecture: a distributed file system and MapReduce. HDFS runs on top of the existing file system on each node and is designed to handle large files with streaming data access patterns (sequential data access rather than random access).

34 HDFS, the Hadoop Distributed File System, is the primary storage system for Hadoop applications. Multiple replicas of data blocks are distributed across compute nodes for reliability, and files are stored on multiple boxes for durability and high availability.

35 HDFS is optimized for long sequential reads: data is written once and read multiple times, with no append possible; because files are large and reads are sequential, there is no local caching of data; data is replicated.

36 HDFS Architecture

37 Block-structured file system: a file is divided into blocks and stored; each individual machine in the cluster is a DataNode; the default block size is 64 MB; information about the blocks is kept as metadata, and all this metadata is stored on the machine that acts as the NameNode.

38 MapReduce. Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

39 MapReduce is technology from Google. A MapReduce program contains transformations that can be applied to the data any number of times; a MapReduce job is an executing MapReduce program, with map tasks running in parallel with each other and reduce tasks likewise running in parallel with each other.

40 MapReduce

41 HDFS handles the distributed file system layer; MapReduce is how we process the data. MapReduce daemons: JobTracker and TaskTracker. Goals: distribute the reading and processing of data; localize the processing when possible; share as little data as possible while processing.

42 JobTracker: one per cluster (the "master node"); takes jobs from clients; splits work into "tasks"; distributes tasks to TaskTrackers; monitors progress and deals with failures.

43 TaskTracker: many per cluster (the "slave nodes"); does the actual work and executes the code for the job; talks regularly with the JobTracker; launches a child process when given a task; reports the progress of the running task back to the JobTracker.

44 Anatomy of MapReduce. Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client. Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java.

45 Submitting a MapReduce job

46 Simple Data Flow Example

47 Example of MapReduce: read text files and count how often words occur. The input is text files; the output is a text file with one line per word: word, tab, count. Map: produce pairs of (word, count). Reduce: for each word, sum up the counts.
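One way to express this word count (a sketch, not the presentation's own code) is with Hadoop Streaming, which runs plain scripts as the map and reduce steps, reading from stdin and writing tab-separated lines; the file names mapper.py and reducer.py are illustrative:

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum counts per word. Hadoop sorts the map output by key,
# so identical words arrive on consecutive lines of stdin.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would be submitted with something like `hadoop jar <hadoop-streaming jar> -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs input dir> -output <hdfs output dir>` (paths depend on the installation).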

48 Anatomy of MapReduce. The client submits a job: "I want to count the occurrences of each word" (we assume the data to process is already in HDFS). The JobTracker receives the job and queries the NameNode for the number of blocks in the file; the job is split into tasks, with one map task per block and as many reduce tasks as specified in the job. Each TaskTracker checks in regularly with the JobTracker: "Is there any work for me?" If the JobTracker has a map task whose input block is local to that TaskTracker, the TaskTracker is given that task.

49 Hadoop, pros & cons. Pros: superior availability / scalability / manageability; large block sizes suit large files (giga, peta, ...); extremely scalable due to HDFS; batch-based MapReduce facilitates parallel work. Cons: programmability and metadata; less efficient for smaller files; MapReduce is more complex than traditional SQL queries; MapReduce is batch-based, so there is delay; need to publish data in well-known schemas.

50 Hive turns Hadoop into a data warehouse. Developed at Facebook; declarative language (SQL dialect), HiveQL; schema is non-optional, but data can have many schemas; relationally complete.
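For flavour, here is the kind of declarative HiveQL query the slide is describing, wrapped in a small Python sketch that hands it to the Hive command-line client's -e option (a sketch only: the table and column names, page_views, url, view_date, are invented, and it assumes the hive CLI is installed and on the PATH; analysts would normally just type the query into the Hive shell):

```python
# Sketch: running an illustrative HiveQL aggregation via the Hive CLI.
# The table and columns (page_views, url, view_date) are hypothetical.
import subprocess

query = """
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2013-05-09'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
"""

subprocess.run(["hive", "-e", query], check=True)
```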

51 Data warehousing at Facebook (case study): Hadoop/Hive cluster with 8,400 cores and ~12.5 PB of raw storage capacity; 8 cores + 12 TB per node, 32 GB RAM per node; two-level network topology (1 Gbit/sec from node to rack switch, 4 Gbit/sec to the top-level rack switch); 2 clusters, one for ad-hoc users and one for strict-SLA jobs.

52 Hadoop/Hive usage at Facebook, statistics per day: 12 TB of compressed new data added, 135 TB of compressed data scanned, 7,500+ Hive jobs, 80k computing hours. Hive simplifies Hadoop: new engineers go through a Hive training session; ~200 people/month run jobs on Hadoop/Hive; analysts (non-engineers) use Hadoop through Hive; most jobs are Hive jobs. (Data warehousing at Facebook, case study.)

53 Types of applications (data warehousing at Facebook, case study): reporting, e.g. daily/weekly aggregations of impression/click counts, measures of user engagement, MicroStrategy reports; ad-hoc analysis, e.g. how many group admins broken down by state/country; machine learning (assembling training data), e.g. ad optimization and user engagement as a function of user attributes; many others.

54 Scribe: a service for distributed log file collection, designed to run as a daemon process on every node in the data center and to forward log files from any process running on that machine back to a central pool of aggregators. Scribe is a server for aggregating streaming log data, designed to scale to a very large number of nodes and to be robust to network and node failures.

55 Scribe and Hadoop clusters at Facebook: used to log data from web servers; clusters are collocated with the web servers; the network is the biggest bottleneck; a typical cluster has about 50 nodes.

56 Data Flow Architecture at Facebook

57 OK – what do we know so far? We’ve talked about MongoDB for reasonably large amounts of unstructured data, and processing it quickly. We’ve talked about Hadoop (Hive, Scribe, etc.) for holding really large amounts of unstructured data so we can process and mine it with HiveQL. What if we have a combined need to process large and huge data sets for the same app? What if we need to process good amounts of unstructured data right away, while storing lots more of it for analysis later?

58 Applications have complex needs. Let’s use the best tool for the job; often more than one tool is needed. MongoDB is an ideal operational database and is ideal for BIG data, but it is not really a data processing engine; for heavy processing needs, use a tool designed for that job... Hadoop.

59 MongoDB & Hadoop together. Hadoop: massively scalable data (100 PB of data), ingest at scale (sensor data, clickstream, online gaming data such as user activity, time online, and activities performed), later analytics. MongoDB: online processing, working on small subsets at a time, processing document-store data (like game data) while interacting in-game (items/HP/XP, graphics, stage, etc.).

60 MongoDB MapReduce: MongoDB map-reduce is quite capable... but with limits: JavaScript is not the best language for map-reduce processing; JavaScript is limited in external data-processing libraries; it adds load to the data store; sharded environments do parallel processing.
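As an illustration of what this looks like from a driver, here is a sketch using the pymongo 3.x Collection.map_reduce helper (removed in pymongo 4.0, and the server-side mapReduce command is itself deprecated in recent MongoDB releases). The collection and field names (demo.tweets, url) are invented; note that the map and reduce bodies must be JavaScript strings, which is exactly the limitation above:

```python
# Sketch: counting tweets per URL with MongoDB map-reduce (pymongo 3.x API).
# Collection/field names are illustrative; map and reduce are JavaScript.
from bson.code import Code
from pymongo import MongoClient

tweets = MongoClient()["demo"]["tweets"]

map_js = Code("function () { emit(this.url, 1); }")
reduce_js = Code("function (key, values) { return Array.sum(values); }")

# Writes the results to the "tweets_per_url" collection on the server.
result = tweets.map_reduce(map_js, reduce_js, "tweets_per_url")
for doc in result.find():
    print(doc["_id"], doc["value"])
```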

61 MongoDB Map Reduce

62 MongoDB Aggregation: most uses of MongoDB MapReduce were for aggregation, and the Aggregation Framework is optimized for aggregate queries. It fixes some of the limits of MongoDB MR, can do real-time aggregation similar to SQL GROUP BY, and does parallel processing on sharded clusters.
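For comparison, the same tweets-per-URL count expressed with the aggregation framework through pymongo's aggregate() (a sketch; the same invented demo.tweets collection as above), which runs server-side without any JavaScript:

```python
# Sketch: the tweets-per-URL count as an aggregation pipeline (no JavaScript).
from pymongo import MongoClient

tweets = MongoClient()["demo"]["tweets"]

pipeline = [
    {"$group": {"_id": "$url", "count": {"$sum": 1}}},  # like SQL GROUP BY url
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for doc in tweets.aggregate(pipeline):
    print(doc["_id"], doc["count"])
```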

63 MongoDB Aggregation

64 MongoDB Map Reduce

65 Hadoop Map Reduce

66 MongoDB & Hadoop

67 Some videos: Mongo MapReduce https://www.youtube.com/watch?v=WovfjprPD_I; MySQL & NoSQL at Craigslist https://www.youtube.com/watch?v=a0OvgTfF8Pg; 9 Databases in 45 minutes https://www.youtube.com/watch?v=XfK4aBF7tEI

