1
Data-Intensive Applications, Challenges, Techniques and Technologies: Big Data
Current and Future Research Frontiers
2
Big Data
Big Data has drawn huge attention from researchers in information sciences, and from policy and decision makers in governments and enterprises. Big Data is extremely valuable: it boosts productivity in businesses, enables evolutionary breakthroughs in scientific disciplines, and gives us many opportunities to make great progress in many fields. Big Data also arrives with many challenges: difficulties in data capture, data storage, data analysis, and data visualization.
3
Big Data is a set of techniques and technologies
These techniques and technologies require new forms of integration to uncover large hidden values from data sets that are diverse, complex, and of a massive scale.
4
Characteristics of Big Data
Volume refers to the amount of all types of data generated from different sources, which continues to expand. The benefit of gathering large amounts of data is that hidden information and patterns can be uncovered through data analysis. Variety refers to the different types of data collected via sensors, smartphones, or social networks. Such data types include video, image, text, audio, and data logs, in either structured or unstructured format. Most of the data generated from mobile applications are in unstructured format.
5
Characteristics of Big Data
Velocity refers to the speed of data transfer. The contents of data constantly change because of the absorption of complementary data collections, the introduction of previously archived data or legacy collections, and streamed data arriving from multiple sources. Value is the most important aspect of Big Data. It refers to the process of discovering huge hidden values from large datasets of various types that are generated rapidly.
6
Big Data in Commerce and Business
Wal-Mart handled 267 million transactions per day across its 6,000 stores worldwide (in 2014). Wal-Mart collaborated with Hewlett-Packard to establish a data warehouse with the capability to store 4 petabytes (4,000 trillion bytes), tracing every purchase record from their point-of-sale terminals. The company takes advantage of sophisticated machine learning techniques to exploit the knowledge hidden in this huge volume of data; as a result, it has successfully improved the efficiency of its pricing strategies and advertising campaigns. The management of its inventory and supply chains also benefits significantly from the large-scale warehouse.
7
4V Categorization of IBM
8
Extracting Business Value from the 4 V’s of Big Data
9
Big Data Classification
10
Categories of Big Data I- Data Sources
Social media is the source of information generated via URLs to share or exchange information and ideas in virtual communities and networks; examples include collaborative projects, blogs and microblogs, Facebook, and Twitter. Machine-generated data are information automatically generated from hardware or software, such as computers, medical devices, or other machines, without human intervention. Sensing devices measure physical quantities and change them into signals. Transaction data, such as financial and work data, comprise events that involve a time dimension to describe the data. The Internet of Things (IoT) represents a set of objects that are uniquely identifiable as a part of the Internet.
11
IoT as a Big Data Source
The objects of the IoT include smartphones, digital cameras, and tablets. When these devices connect with one another over the Internet, they enable smarter processes and services that support basic, economic, environmental, and health needs. The large number of devices connected to the Internet provides many types of services and produces huge amounts of data and information.
12
Categories of Big Data II - Content Format
Structured data are often managed with SQL, a programming language created for managing and querying data in relational database management systems (RDBMS). Structured data are easy to input, query, store, and analyze. Examples of structured data include numbers, words, and dates. Semi-structured data are data that do not follow a conventional database system. Semi-structured data may be in the form of structured data that are not organized in relational database models, such as tables. Capturing semi-structured data for analysis is different from capturing data in a fixed file format; it requires the use of complex rules that dynamically decide the next process after the data are captured.
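As a hedged illustration (not from the original slides), the following Python sketch contrasts a structured row with fixed fields against a semi-structured JSON record whose fields must be discovered at capture time; the field names are hypothetical.

import json

# Structured record: fixed fields, easy to store in an RDBMS table.
structured_row = ("2024-01-15", "Alice", 42)

# Semi-structured record: self-describing JSON whose fields can vary.
raw = '{"user": "Alice", "logins": 42, "devices": ["phone", "laptop"]}'
record = json.loads(raw)

# A simple capture rule that decides the next processing step from the data itself.
for field, value in record.items():
    if isinstance(value, list):
        print(field, "-> expand into", len(value), "child rows")
    else:
        print(field, "-> store as a column value:", value)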
13
Categories of Big Data II - Content Format
Unstructured data, such as text messages, location information, videos, and social media data, are data that do not follow a specified format. The size of this type of data continues to increase through the use of smartphones. Analyzing and understanding such data has become a challenge.
14
Categories of Big Data III- Data Stores
Document-oriented data stores are mainly designed to store and retrieve collections of documents or information and to support complex data forms in several standard formats, such as JSON, XML, and binary forms (e.g., PDF and MS Word). A document in a document-oriented data store is similar to a record or row in a relational database, but it is more flexible and can be retrieved based on its contents (e.g., MongoDB, SimpleDB, and CouchDB). A column-oriented database stores its content in columns rather than rows, with attribute values belonging to the same column stored contiguously. This is different from classical database systems, which store entire rows one after the other.
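A minimal, hypothetical sketch in plain Python (not MongoDB's actual API) of the idea behind a document store: documents are retrieved by their contents rather than by a fixed row layout.

# Toy document store: retrieve documents by content, not by row position.
documents = [
    {"_id": 1, "title": "Big Data", "tags": ["volume", "variety"]},
    {"_id": 2, "title": "MapReduce", "tags": ["batch", "parallel"]},
]

def find(collection, **criteria):
    """Return documents whose fields match all given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(documents, title="MapReduce"))  # content-based lookup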
15
Categories of Big Data III- Data Stores
A graph database is designed to store and represent data using a graph model with nodes, edges, and properties related to one another through relations; Neo4j is an example. A key-value store is an alternative to relational database systems that stores and accesses data in a way designed to scale to a very large size. Dynamo, a highly available key-value storage system used by Amazon.com in some of its services, is a good example.
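The sketch below is a minimal in-memory key-value store in Python, intended only to show the put/get access pattern that systems such as Dynamo scale across many machines; it is not Dynamo's real interface.

class KeyValueStore:
    """Toy key-value store: all access goes through put(key, value) / get(key)."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("cart:alice", ["book", "laptop"])   # hypothetical shopping-cart entry
print(store.get("cart:alice"))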
16
Categories of Big Data IV- Data staging
Cleaning is the process of identifying incomplete and unreasonable data. Transformation is the process of transforming data into a form suitable for analysis. Normalization is the method of structuring a database schema to minimize redundancy.
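A small hedged example of two of these staging steps in plain Python (the records and field names are hypothetical): cleaning drops incomplete and unreasonable records, and transformation rescales a field into a form suitable for analysis.

raw_records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},   # incomplete: missing age
    {"id": 3, "age": 250, "income": 48000},    # unreasonable: impossible age
    {"id": 4, "age": 45, "income": 40000},
]

# Cleaning: drop incomplete and unreasonable records.
clean = [r for r in raw_records
         if r["age"] is not None and 0 < r["age"] < 120]

# Transformation: rescale income to the 0..1 range for analysis.
incomes = [r["income"] for r in clean]
lo, hi = min(incomes), max(incomes)
for r in clean:
    r["income_scaled"] = (r["income"] - lo) / (hi - lo)

print(clean)   # records 1 and 4, each with an added income_scaled field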
17
Categories of Big Data V- Data processing
Batch: MapReduce-based systems have been adopted by many organizations in the past few years for long-running batch jobs. Such systems allow applications to scale across large clusters of machines comprising thousands of nodes. Real time: one of the powerful real-time big data processing tools is S4 (Simple Scalable Streaming System), a distributed computing platform that processes streams of data. S4 is a scalable, partially fault-tolerant, general-purpose, and pluggable platform.
18
Transforming Big Data Analysis
19
Structured and Unstructured Data Transformation
In the case of structured data, the data are pre-processed before they are stored in relational databases to meet the constraints of schema-on-write. The second step is to retrieve the data for analysis. In the case of unstructured data, the data must first be stored in distributed databases (for example, HBase) before they are processed for analysis. Unstructured data are retrieved from distributed databases after meeting the schema-on-read constraints.
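The hedged Python sketch below contrasts the two approaches: with schema-on-write the record is validated and shaped before it is stored, while with schema-on-read raw data is stored as-is and a schema is only imposed when it is retrieved for analysis. The column names are illustrative.

import json

SCHEMA = ("user", "amount")   # columns expected by the relational table

# Schema-on-write: pre-process and validate before storing.
def store_structured(table, record):
    row = tuple(record[col] for col in SCHEMA)   # fails if a field is missing
    table.append(row)

# Schema-on-read: store the raw data now, impose structure when reading.
def read_unstructured(raw_lines):
    for line in raw_lines:
        record = json.loads(line)
        yield record.get("user"), record.get("amount")

table = []
store_structured(table, {"user": "alice", "amount": 30})

raw_store = ['{"user": "bob", "amount": 12}', '{"user": "carol"}']
print(table, list(read_unstructured(raw_store)))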
20
Unified Architecture
The Apache Hadoop MapReduce framework was initially designed to perform batch processing on large amounts of data. Tools such as Hive and Pig help to execute ad hoc queries on historical data using a query language. Processing with MapReduce and tools such as Pig and Hive is slow due to disk reads and writes during data processing. A newer stack, which contains tools such as HBase and Impala, enables interactive query processing to access data faster. Apache Storm and Kafka handle streaming data and were introduced to fulfill the need for real-time analytics.
21
Batch Data Processing
Batch data processing is an efficient way of processing high volumes of data in which a group of transactions is collected over a period of time. Data are collected, entered, and processed, and then the batch results are produced. Hadoop is focused on batch data processing. Batch processing requires separate programs for input, processing, and output. Examples are payroll and billing systems.
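As a hedged illustration of the batch pattern described above (a toy sketch, not a real payroll system), the Python example collects transactions over a period and then processes the whole group in one run, with separate input, process, and output steps.

# Input: transactions collected over the period (e.g., one pay cycle).
transactions = [
    {"employee": "alice", "hours": 8},
    {"employee": "bob", "hours": 6},
    {"employee": "alice", "hours": 7},
]

# Process: run over the entire accumulated batch at once.
def run_payroll_batch(batch, hourly_rate=20):
    totals = {}
    for t in batch:
        totals[t["employee"]] = totals.get(t["employee"], 0) + t["hours"]
    return {emp: hours * hourly_rate for emp, hours in totals.items()}

# Output: the batch results, produced only after the whole batch is processed.
print(run_payroll_batch(transactions))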
23
Disadvantages of Batch Processing (Apache Hadoop MapReduce)
The limitations of this model are that it is expensive and complex, and it is hard to compute consistent metrics across these stacks. Processing of streaming data is slow in MapReduce because the intermediate results are stored on disk.
24
Real Time Data Processing
Real-time data processing involves a continual input, processing, and output of data. Data must be processed within a small time period (in real time or near real time). Examples include radar systems, customer services, and bank ATMs.
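A minimal sketch in plain Python of the continual input-process-output loop: each event is handled as soon as it arrives instead of waiting for a batch to accumulate. The event source and the approval rule are simulated assumptions, not from the slides.

import time

def event_stream():
    """Simulated continual input, e.g., ATM withdrawal requests."""
    for amount in (50, 20, 500):
        yield {"amount": amount}
        time.sleep(0.1)   # stand-in for the gap between arriving events

# Each event is processed immediately and its output emitted right away.
for event in event_stream():
    status = "approved" if event["amount"] <= 300 else "flagged"
    print(event, "->", status)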
25
Apache Spark
Apache Spark introduced the unified architecture, which combines streaming, interactive, and batch processing components. It is easy to build applications using its powerful APIs in Java, Python, and Scala.
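For instance, a word count can be written in a few lines with Spark's Python API. This is a hedged sketch assuming a local Spark installation and an input file named input.txt (both assumptions, not from the slides).

from pyspark import SparkContext

sc = SparkContext("local", "WordCount")          # assumes a local Spark install
counts = (sc.textFile("input.txt")               # hypothetical input file
            .flatMap(lambda line: line.split())  # split lines into words
            .map(lambda word: (word, 1))         # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # sum the counts per word
print(counts.collect())
sc.stop()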
26
Real Time and Batch Processing Application
We can compare real-time analytics and batch processing applications by comparing Hadoop MapReduce and Spark.
27
Batch and Real Time Data Processing Solutions
28
MapReduce and Hadoop
MapReduce has been used by Google to build scalable applications. MapReduce, a programming model and an implementation for processing and generating large data sets, was created at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat. MapReduce was inspired by the "map" and "reduce" functions in Lisp. MapReduce breaks an application into several small sub-problems, each of which can be executed on any node in a computer cluster. The "map" stage assigns the sub-problems to the nodes, and the "reduce" stage combines the results of all of those sub-problems.
29
MapReduce Etymology
In Lisp, the map function takes a function and a set of values as parameters. That function is then applied to each of the values. For example, (map 'length '(() (a) (a b) (a b c))) applies the length function to each of the four items in the list. Since length returns the length of an item, the result of map is a list containing the length of each item: (0 1 2 3).
30
MapReduce Etymology
The reduce function is given a binary function and a set of values as parameters. It combines all the values using the binary function. If we use the + (add) function to reduce the list (0 1 2 3), (reduce #'+ '(0 1 2 3)), we get 6.
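The same two calls can be mirrored in Python with the built-in map and functools.reduce, which may make the origin of the names easier to see; this is an added illustration, not part of the original slides.

from functools import reduce

# map: apply a function to every item in a list.
lengths = list(map(len, [[], ["a"], ["a", "b"], ["a", "b", "c"]]))
print(lengths)                                # [0, 1, 2, 3]

# reduce: combine all items with a binary function (here, addition).
print(reduce(lambda a, b: a + b, lengths))    # 6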
31
MapReduce Framework for Parallel Computing
Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes). From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.
32
The Operations of MapReduce
Map operation: each application of the function to a value can be performed in parallel (concurrently); there is no dependence of one upon another. The reduce operation can take place only after the map is complete.
33
Map and Reduce Functions
The Map function reads a stream of data and parses it into intermediate (key, value) pairs. The Reduce function is called once for each unique key that was generated by Map, and it is given the key and a list of all values that were generated for that key as parameters. The keys are presented in sorted order.
34
An Example of Using MapReduce -I
The task is counting the number of occurrences of each word in a large collection of documents. The user-written Map function reads the document data and parses out the words. For each word, it writes the (key, value) pair of (word, 1). The word is treated as the key and the associated value of 1 means that we saw the word once.
35
An Example of Using MapReduce -II
This intermediate data is then sorted by MapReduce by key. The user's Reduce function is called for each unique key. Since the only values are counts of 1, Reduce is called with a list of "1"s, one for each occurrence of the word that was parsed from the document. The function simply adds them up to generate a total word count for that word.
36
map(String key, String value):
    // key: document name, value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
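For reference, here is a self-contained Python sketch of the same word-count job, including the sort-and-group (shuffle) step that MapReduce performs between the two user functions. It only imitates the execution model on a single machine; the sample documents are made up.

from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit an intermediate (word, 1) pair for every word in the document."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """Sum all counts emitted for one unique key."""
    return word, sum(counts)

documents = {"doc1": "big data big ideas", "doc2": "big clusters"}

# Map phase, then group the intermediate pairs by key (the shuffle/sort step).
grouped = defaultdict(list)
for name, text in documents.items():
    for word, count in map_fn(name, text):
        grouped[word].append(count)

# Reduce phase: one call per unique key, keys presented in sorted order.
print([reduce_fn(word, grouped[word]) for word in sorted(grouped)])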
37
Comparison of Several Big Data Cloud Platforms