Big Data Overview.


1 Big Data Overview

2 What is Big Data?
Big Data is a catch-all term for data that doesn’t fit into the usual containers (files and databases). It can be described with three terms: volume (the amount of data), velocity (streaming data), and variety (the forms of data).

3 What is Big Data?
Examples of where big data arises: the Internet, scientific research, banking, retail, mobile networks, security, and government.

4 Big Data: Volume
In traditional data processing, files are stored on a single server. Big data, however, often can’t fit on a single server, or storing it there is impractical. Big data therefore relies on being distributed across multiple systems and needs to be accessed in parallel.

5 Big Data: Volume
In a distributed system, files are split into blocks that are spread across several servers. Examples include the Google File System (Google) and the Hadoop Distributed File System (Yahoo, Facebook, Twitter). These systems often use a fault-tolerance mechanism that repairs data if anything happens to it.

6 Big Data: Volume
Here is a rough overview of Google’s distribution scheme in a data centre.

7 Big Data: Volume
And here’s an example of HDFS, which writes each block multiple times, to at least two different servers.
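The splitting and replication described on slides 5 and 7 can be sketched in a few lines of Python. This is only an illustration of the idea, not how GFS or HDFS is implemented: the block size, the server names, and the round-robin placement are all assumptions made for the example.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(block_count: int, servers: list, replicas: int = 2) -> dict:
    """Assign every block to `replicas` different servers, round-robin.

    Assumes replicas <= len(servers) so the copies land on distinct servers.
    """
    return {
        index: [servers[(index + r) % len(servers)] for r in range(replicas)]
        for index in range(block_count)
    }


def survives_failure(placement: dict, failed_server: str) -> bool:
    """True if every block still has a copy on a server that did not fail."""
    return all(
        any(server != failed_server for server in copies)
        for copies in placement.values()
    )


data = b"big data rarely fits on a single server"
blocks = split_into_blocks(data, block_size=8)
servers = ["server-a", "server-b", "server-c"]  # hypothetical server names
placement = place_blocks(len(blocks), servers)
```

Because each block lives on two servers, losing any one server still leaves a readable copy of every block, which is the property a fault-tolerance system relies on when it repairs data.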

8 Big Data: Volume
Functional programming can make executing programs on big data easier, because it is easier to parallelise program execution. An example is the MapReduce programming model, which takes an input, splits it into smaller parts, executes code on each part, and merges all the results into one.
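The four MapReduce steps listed on this slide can be sketched with a word count, running on one machine. The input lines and the function names are made up for illustration; a real framework would run the map step on many servers.

```python
from collections import Counter
from functools import reduce


def count_words(part: list) -> Counter:
    """Map step: count the words in one part of the input."""
    counts = Counter()
    for line in part:
        counts.update(line.split())
    return counts


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results into one."""
    return left + right


# 1. Take an input.
lines = ["big data big ideas", "data in motion", "data at rest"]
# 2. Split it into smaller parts (two parts here).
parts = [lines[i::2] for i in range(2)]
# 3. Execute the same code on each part.
partial_counts = list(map(count_words, parts))
# 4. Merge all the results into one.
totals = reduce(merge_counts, partial_counts, Counter())
```

Since `count_words` only reads its own part and returns a new value, the parts could be processed in any order or in parallel without changing `totals`.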

9 Big Data: Variety
Data can take many forms. There are three types: structured, semi-structured, and unstructured.
Structured: typical data that can be represented using fields and records; easy to model in relational database management system (RDBMS) tables.
Semi-structured: no formal structure, but still some structure, like XML.
Unstructured: no structure at all, like a paragraph of text.
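As a rough illustration of the three types, the sketch below holds the same information in each form. The field names and sample values are invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fields and records, here as CSV with a header row.
structured = "student_id,year_class\n1,Year 11 D\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tags give some structure, but elements can vary per record.
semi_structured = (
    "<student id='1'><class>Year 11 D</class><note>joined late</note></student>"
)
root = ET.fromstring(semi_structured)

# Unstructured: free text; any structure has to be inferred by analysis.
unstructured = "The student in Year 11 D joined late."
words = unstructured.split()
```

The structured record can be queried by field name directly, the XML needs a parser and a path, and the free text offers nothing more than a sequence of words until it is analysed.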

10 Big Data: Velocity
Data can be in two states.
At rest: saved in a file or a database.
In motion: streamed from a sensor or the Internet, with a frequency.
Data at rest can often be dealt with in batches: the data is broken into chunks, the same process is carried out on each chunk, and no user intervention is needed.
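Batch processing of data at rest might look like the following minimal sketch. The chunk size and the per-chunk process (a simple sum) are placeholders chosen for the example.

```python
def chunks(records: list, size: int):
    """Break the stored records into chunks of at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def process_chunk(chunk: list) -> int:
    """The same process is applied to every chunk; here, a sum."""
    return sum(chunk)


readings = list(range(1, 11))  # stands in for data saved in a file or database
batch_totals = [process_chunk(chunk) for chunk in chunks(readings, size=4)]
grand_total = sum(batch_totals)
```

Once started, the loop runs to completion over all chunks with no user intervention.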

11 Big Data: Velocity
Data in motion is usually streamed in at a certain frequency, for example 1000 events per second. It involves stream processing, where the data is processed as soon as it arrives. For big data, the data arrives at a higher rate, possibly from multiple sources.
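A small sketch of stream processing, assuming two hypothetical sensors that each emit `(timestamp, source, value)` events in time order. The events are merged by timestamp and a running mean is updated as each one arrives, rather than after all the data has been collected.

```python
import heapq


def sensor(name: str, readings: list):
    """Yield (timestamp, source, value) events in time order."""
    for timestamp, value in readings:
        yield (timestamp, name, value)


def running_mean(events):
    """Process each event as soon as it arrives, keeping only a count and a total."""
    count = 0
    total = 0.0
    for timestamp, source, value in events:
        count += 1
        total += value
        yield (timestamp, source, total / count)


# Two sources, made up for the example.
stream_a = sensor("sensor-a", [(0, 10.0), (2, 14.0)])
stream_b = sensor("sensor-b", [(1, 12.0), (3, 20.0)])

# heapq.merge lazily interleaves the already-sorted streams by timestamp.
merged = heapq.merge(stream_a, stream_b)
results = list(running_mean(merged))
```

Everything here is a generator, so no event is held in memory longer than it takes to update the running state.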

12 Machine Learning
The unstructured nature of some big data makes analysing it difficult. Relational database management systems (RDBMSs) can easily be used for quantitative (structured) data: it is simple to query the data to produce results, and queries are quick and accurate even on large datasets. For qualitative (unstructured) data, such as feedback received from a customer, we need to use machine learning.

13 Machine Learning
Qualitative data like feedback requires more time to analyse, and machine learning techniques can automate the process. They need to discern patterns in the data to extract useful information. Machine learning covers everything from pattern recognition to artificial intelligence. Example: finding positive or negative words in feedback, where a program looks for words or phrases to determine the nature of the feedback (positive or negative).
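The feedback example can be sketched as a simple word lookup. Note that this version uses fixed, hand-written word lists (an assumption for the example) rather than anything learned from data, so it shows only the "program looks for words/phrases" step.

```python
import re

# Hypothetical word lists; a real system would use a much larger vocabulary.
POSITIVE_WORDS = {"great", "helpful", "fast"}
NEGATIVE_WORDS = {"slow", "rude", "broken"}


def classify_feedback(feedback: str) -> str:
    """Label feedback by counting positive and negative words."""
    words = re.findall(r"[a-z']+", feedback.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


label = classify_feedback("Great service, very helpful staff")
```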

14 Machine Learning
More advanced techniques let a computer develop its own knowledge based on the data it is manipulating. This is valuable with big data, to work out patterns and correlations that are not obvious. It also includes predictive analysis, which is used in the financial and insurance sectors to predict risk.

15 Machine Learning
> # Observed temperatures and the growth rate measured at each one
> temperature <- c(10,20,30,40,50,60,70,80,90)
> growthrate <- c(20,26,38,49,56,68,72,89,92)
> plot(temperature, growthrate)
> # Fit a linear model of growth rate against temperature and draw the fitted line
> model <- lm(growthrate ~ temperature)
> abline(model)
Use this model in a coded algorithm applied to a stream of temperature data to predict the growth rate and take action if it is too slow.

16 Machine Learning
upload\fico real time fraud 3095IN.pdf
Data at rest is manipulated to generate a predictive model. The model is then applied to streaming data, for example a predictive model algorithm applied to stream data from a temperature sensor.
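The two stages described here can be sketched in Python using the data from slide 15: a straight line is fitted to the data at rest by ordinary least squares (the same kind of fit R's `lm` produces for one predictor), then applied to incoming readings. The threshold `MIN_GROWTH` and the streamed temperatures are assumptions invented for the example.

```python
def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data at rest: the observations from slide 15.
temperature = [10, 20, 30, 40, 50, 60, 70, 80, 90]
growth_rate = [20, 26, 38, 49, 56, 68, 72, 89, 92]
slope, intercept = fit_line(temperature, growth_rate)

# Data in motion: apply the model to each reading as it arrives.
MIN_GROWTH = 40.0              # hypothetical "too slow" threshold
sensor_stream = [15, 35, 55]   # hypothetical incoming temperatures
alerts = []
for reading in sensor_stream:
    predicted = intercept + slope * reading
    if predicted < MIN_GROWTH:
        alerts.append(reading)  # stands in for "take action"
```

The model is built once from stored data and is cheap to evaluate, which is what makes it practical to apply to every event in a stream.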

17 Functional Programming
Functional programming is a programming paradigm that uses mathematical functions to create programs. It supports immutable data structures, statelessness, and higher-order functions.

18 Functional Programming
Immutable data structures: variables in imperative programming are mutable and can change during program execution. Functional programming has no variables; it uses lists and functions instead.
Statelessness: a program stays in the same state. There is no difference between a previous and a current value; the value produced is the result of running a function and does not depend on the state of any variables.
Higher-order functions: passing a function as an argument to another function.
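Python is not a purely functional language, but the three ideas on this slide can still be illustrated in it. The sketch below uses a tuple as an immutable data structure, a function whose result depends only on its argument, and a function that takes another function as an argument. All names are chosen for the example.

```python
from functools import reduce

# Immutable data structure: a tuple cannot be changed after it is created.
readings = (3, 8, 5)


def double(value: int) -> int:
    """Stateless: the result depends only on the argument, not on outside state."""
    return value * 2


def apply_to_all(function, values: tuple) -> tuple:
    """Higher-order function: takes another function as an argument."""
    return tuple(map(function, values))


doubled = apply_to_all(double, readings)                   # a new tuple
total = reduce(lambda acc, value: acc + value, doubled, 0)
```

`readings` is never modified; every step produces a new value instead.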

19 Functional Programming
Immutability, statelessness, and higher-order functions help analyse big data using distributed processing concurrently, with multiple users accessing the data at the same time from different computers. As each user supplies input values to their own functions, the output of those functions is local to their machine.
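A rough sketch of why these properties help with concurrency: because the function below reads only its own immutable input and shares no state, several workers can run it at the same time without coordinating. Threads on one machine stand in for separate computers here, and the data is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_long_words(chunk: tuple) -> int:
    """Pure function: reads only its argument and returns a new value."""
    return sum(1 for word in chunk if len(word) > 4)


# Immutable input, already split into one chunk per worker.
chunks = (
    ("big", "data", "overview"),
    ("volume", "velocity", "variety"),
)

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in the same order as the input chunks.
    partial_counts = list(pool.map(count_long_words, chunks))

long_word_total = sum(partial_counts)
```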

20 Fact-Based Model
We also have fact-based models, which are used for modelling data and differ from the relational model. Examples include Bigtable (Google) and Cassandra (Apple, Facebook, Twitter, Instagram).

21 Fact-Based Model
Here is a list of features.
Raw data is stored as atomic facts: each fact captures a single piece of information.
Facts are immutable and eternally true, using timestamps.
Each fact is made identifiable so that a query can identify duplicates (facts with the same identity).
A nonce (number used once) is used to make identical facts identifiable.

22 Fact-Based Model
Here is an example using students in a school.

StudentId | YearClass | TimeStamp
1         | Year 11 D | 05/01/ :35:46
2         | Year 11 A | 05/01/ :39:25
3         |           | 05/01/ :50:10
          | Year 12 B | 10/02/ :25:36
          | Year 12 E | 10/02/ :45:12

A nonce, in information technology, is a number generated for a specific use, such as session authentication. In this context, "nonce" stands for "number used once" or "number once."
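One way to picture a fact with a timestamp and a nonce is the sketch below. The class name, field names, and the sample values are assumptions for illustration (the values are not taken from the table), and `secrets.token_hex` is just one convenient way to generate a one-use value.

```python
from dataclasses import dataclass, field
from datetime import datetime
import secrets


@dataclass(frozen=True)  # frozen: a fact cannot be changed once created
class Fact:
    student_id: int
    year_class: str
    timestamp: datetime
    # Nonce: a number used once, so otherwise identical facts stay distinguishable.
    nonce: str = field(default_factory=lambda: secrets.token_hex(8))


# Illustrative values only.
recorded_at = datetime(2020, 1, 5, 9, 35, 46)
first = Fact(1, "Year 11 D", recorded_at)
second = Fact(1, "Year 11 D", recorded_at)  # same content, different nonce

store = [first, second]  # facts are appended, never updated in place
```

Without the nonce, `first` and `second` would compare equal and a query could not tell whether the same fact had been recorded twice.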

23 Fact-Based Model
We can also use a graph schema, a method of defining a database using nodes, edges, and properties. It captures the structure of a dataset stored using the fact-based model and describes the types of facts contained in the dataset: the data types and the relationships between entities.

24 Fact-Based Model
Graph schema for students and intranet page access.

25 Fact-Based Model
Here are some useful terms for graph schemas.
Nodes: entities.
Properties: information about nodes (e.g. FirstName).
Edges: relationships between the nodes.
Solid lines: connect nodes.
Dashed lines: connect properties to the corresponding node.
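The terms above can be made concrete with a small schema for the students and intranet page access example. Only `FirstName` comes from the slides; the `Page` node, the `Url` property, and the `Accessed` edge are hypothetical names chosen for the sketch.

```python
# Nodes are entity types, properties describe a node, edges relate two nodes.
schema = {
    "nodes": {"Student", "Page"},
    "properties": {
        "Student": {"FirstName": str},
        "Page": {"Url": str},
    },
    "edges": {"Accessed": ("Student", "Page")},
}


def is_valid_property(schema: dict, node_type: str, name: str, value) -> bool:
    """Check that the property exists on the node and the value has the declared type."""
    expected_type = schema["properties"].get(node_type, {}).get(name)
    return expected_type is not None and isinstance(value, expected_type)


def is_valid_edge(schema: dict, edge_type: str, source: str, target: str) -> bool:
    """Check that the edge type connects exactly these two node types."""
    return schema["edges"].get(edge_type) == (source, target)


ok = is_valid_edge(schema, "Accessed", "Student", "Page")
```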

26 Fact-Based Model
There are a few advantages of using a fact-based model over a relational database.
Simplicity (no indexing).
Append.
Perpetuity.
Historical queries.
It is easy to add a new type of information by defining a node, edge, and property; existing fact types are unaffected as they are atomic.
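A historical query is possible because old facts are kept rather than overwritten. The sketch below stores facts as `(student_id, year_class, timestamp)` tuples and asks which class a student was in at a given moment. The dates and the assignment of student 1 to both classes are invented for illustration.

```python
from datetime import datetime

# Append-only list of facts; illustrative values only.
facts = [
    (1, "Year 11 D", datetime(2020, 1, 5)),
    (1, "Year 12 B", datetime(2021, 2, 10)),
]


def year_class_as_of(facts: list, student_id: int, when: datetime):
    """Return the most recent class recorded for the student at or before `when`."""
    known = [f for f in facts if f[0] == student_id and f[2] <= when]
    if not known:
        return None
    return max(known, key=lambda fact: fact[2])[1]


earlier = year_class_as_of(facts, 1, datetime(2020, 6, 1))
later = year_class_as_of(facts, 1, datetime(2021, 3, 1))
```

In a table that updates the student's row in place, the answer for the earlier date would no longer be available.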
