Week 02 Big Data

What is Big Data?
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. (From Wikipedia)
Big Data refers to extremely vast amounts of multi-structured data that typically has been cost prohibitive to store and analyze. (My view)
NOTE: Big data refers only to digital data, not the paper files stored in the basement at FBI headquarters or the piles of magnetic tapes in our data center.

Types of Big Data
In the simplest terms, Big Data can be broken down into:
Structured – Data with a predefined type (fixed schema): relational databases, transactional data such as sales records, and Excel files such as customer information. This type of data can normally be stored in tables with columns and rows.
Unstructured – Data with no pre-defined data model, or not organized in a pre-defined manner: video, audio, images, metadata, books, satellite images, Adobe PDF files, notes in a web form, blogs, text messages, and Word documents. A data lake is where unstructured data is typically stored.
Semi-structured – Structured data embedded with some unstructured data: email, XML and JSON documents, and other markup languages.
NOTE: Semi-structured data falls in the middle between structured and unstructured data. It contains certain aspects that are structured and others that are not.

Why Do We Have Big Data?
Evolution of technology: New technologies generate large volumes of data, such as mobile, cloud, and smart (self-driving) cars.
IoT (Internet of Things): IoT devices also generate huge amounts of data and send it via the Internet (e.g. wind turbines, gas pumps, cargo containers, energy substations, smartphones, wearables, animal trackers, shopping carts, vehicles, smart meters, parking meters, sensors, cameras). We are expecting 50 billion IoT devices by 2020.
Social media: Social media also generates a large amount of data daily (e.g. 204,000,000 emails; 1,736,111 Instagram pictures; Facebook – 4,166,667 likes and 200,000 photos; Twitter – 347,222 tweets; YouTube – 300 hours of video uploaded).
Other factors: Transportation, retail, banking and finance, media and entertainment, healthcare, education, and government also contribute large amounts of data.
Big Data captures, manages, and processes this fast-growing data.

Characteristics of Big Data (The 5 Vs)

Characteristics of Big Data (The 5 Vs)
Volume: Data created by and moving through today's services may describe tens of millions of customers, hundreds of millions of devices, and billions of transactions or statistical records. Such scale requires careful engineering: it becomes necessary to conserve even the number of CPU instructions, operating system events, and network messages per data item. Parallel processing is a powerful tool for coping with scale. MapReduce computing frameworks like Hadoop and storage systems like HBase and Cassandra provide low-cost, practical system foundations. Analysis also requires efficient algorithms, because "data in flight" may only be observed once, so conventional storage-based approaches may not work. Large volumes of data may require a mix of "move the data to the processing" and "move the processing to the data" architectural styles.
By 2020, the accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes (44 trillion gigabytes).

Characteristics of Big Data (The 5 Vs)
Velocity: Timeliness is often critical to the value of Big Data. For example, online customers may expect promotions (coupons) received on a mobile device to reflect their current location, or they may expect recommendations to reflect their most recent purchases or the media they accessed. The business value of some data decays rapidly. Because raw data is often delivered in streams, or in small batches in near real time, the requirement to deliver rapid results can be demanding and does not mesh well with conventional data warehouse technology.
Data is being generated at an alarming rate: from the mainframe days, through client/server and the Internet, to mobile, social media, and cloud today. In every 60 seconds:
100,000+ tweets
695,000+ Facebook status updates
11,000,000+ instant messages
698,445 Google searches
168,000+ emails
1,820 TB of data created
217+ new mobile users

Characteristics of Big Data (The 5 Vs)
Variety: Big Data often means integrating and processing multiple types of data. We can consider most data sources as structured, semi-structured, or unstructured. Structured data refers to records with fixed fields and types. Unstructured data includes text, speech, and other multimedia. Semi-structured data may be a mixture of the above, such as web documents, or sparse records with many variants, such as personal medical records with well-defined but complex types. (CSV, XML, and JSON)
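To make these categories concrete, here is the same record in the three formats the slide names (the customer record itself is a made-up example):

    CSV (structured – fixed columns, one record per row):
        id,name,city
        42,Alice,Dallas
    XML (semi-structured – markup tags describe the fields):
        <customer id="42"><name>Alice</name><city>Dallas</city></customer>
    JSON (semi-structured – nested key/value pairs, no fixed schema enforced):
        {"id": 42, "name": "Alice", "city": "Dallas"}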

Characteristics of Big Data (The 5 Vs)
Veracity: Data sources (even in the same domain) are of widely differing quality, with significant differences in the coverage, accuracy, and timeliness of the data provided. Per IBM's Big Data website, one in three business leaders don't trust the information they use to make decisions. Establishing trust in big data presents a huge challenge as the variety and number of sources grows. Veracity refers to the uncertainty and inconsistencies in the data.

Characteristics of Big Data (The 5 Vs)
Value: The mechanism that brings the correct meaning out of the data – data mining (turning the data into value).
NOTE: Big Data generally includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, from hundreds of terabytes to many petabytes of data in a single data set. Because of this difficulty, a new set of tools has arisen to make sense of these large quantities of data. Big data is difficult to work with using relational databases and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers".

A Big Data Platform must:
Analyze a variety of different types of information – This information by itself could be unrelated, but when paired with other information it can reveal the causes of events that the business can take advantage of.
Analyze information in motion – Various data types will be streaming and will arrive in bursts. Ad hoc analysis needs to be done on the streaming data to search for relevant events.
Cover extremely large volumes of data – Given the proliferation of devices in the network and how they are used, along with customer interactions on smartphones and the web, a cost-efficient process for analyzing petabytes of information is required.

A Big Data Platform must:
Cover varying types of data sources – Data can be streaming, batch, structured, unstructured, or semi-structured, depending on the information type, where it comes from, and its primary use. A Big Data platform must be able to accommodate all of these types of data on a very large scale.
Provide analytics – It must provide the mechanisms for ad hoc queries, data discovery, and experimentation on the large data sets, to effectively correlate various events and data types and build an understanding of the data that is useful and addresses business needs.

Big Data Platform Architecture (data flow)
Input Data: a collection of data from different data stores in different formats – structured, unstructured, and semi-structured.
Load Data: the input data is loaded into a Data Lake (a repository that holds vast amounts of raw data).
Apache Hadoop Platform: extracts useful data from the Data Lake and feeds it to the EDW (enterprise data warehouse).
Data Mining / Analytic tools: run against the EDW.
Output Data: files and online reports.

Problems with Big Data
Problem 1: Storing exponentially growing, huge datasets. By 2020, total digital data will grow to approximately 44 zettabytes, and about 1.7 MB of new information will be created every second for every person.
Problem 2: Processing data with complex structure: structured + unstructured + semi-structured.
Problem 3: Processing data fast. Data is growing at a much faster rate than disk read/write speeds, so bringing huge amounts of data to the computation unit becomes a bottleneck (see the rough numbers below).
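A back-of-the-envelope calculation makes Problem 3 concrete (the ~100 MB/s figure is an assumed, typical sequential read speed for a single spinning disk):

    Reading 1 TB from one disk:      1,000,000 MB / 100 MB/s = 10,000 s  (about 2.8 hours)
    Reading 1 TB across 100 disks:   10,000 MB per disk / 100 MB/s = 100 s

Splitting the data across many machines, and moving the computation to where the data sits, is exactly the trade that Hadoop makes.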

The Solution – Apache Hadoop
Apache Hadoop is an open-source framework that allows us to store and process large data sets in a parallel, distributed fashion.
Hadoop consists of two parts:
HDFS (Hadoop Distributed File System) (storage) – lets us store any kind of data across the cluster.
MapReduce (processing) – allows parallel processing of the data stored in HDFS.

Apache Hadoop Components
HDFS (storage): lets us store any kind of data across the cluster.
MapReduce (processing): allows parallel processing of the data stored in HDFS.

HDFS Components
HDFS (Hadoop Distributed File System): a distributed file system that provides high-throughput access to application data.

HDFS Components
HDFS consists of:
NameNode (master):
The main node; it maintains and manages the DataNodes.
Contains metadata about the stored data (block information such as the locations of blocks, file sizes, permissions, and hierarchy).
Receives block reports from all the DataNodes.
DataNodes (slaves):
Commodity hardware in the distributed environment.
Store the actual data.
Serve read/write requests from the clients.
Secondary NameNode:
Not a backup of the NameNode; its main function is to take checkpoints of the file system metadata present on the NameNode.
Checkpointing periodically applies the edit log records to the FsImage file and refreshes the edit log.
Stores a copy of the FsImage file and the edit log.
If the NameNode fails, the file system metadata can be recovered from the last saved FsImage.
NOTE: The FsImage is a snapshot of the HDFS file system metadata at a certain point in time.
NOTE: The edit log is a transaction log that contains a record of every change that occurs to the file system metadata.
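The NameNode's block metadata is visible to clients through Hadoop's public FileSystem API. A minimal sketch that asks where a file's blocks and replicas live (the path /data/sales.csv is a hypothetical example):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS etc. from the cluster config on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sales.csv");  // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // Answered from the NameNode's metadata: one entry per block,
            // listing the DataNodes that hold its replicas
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
            }
        }
    }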

HDFS Components (Hadoop Cluster) – diagram: one NameNode (master) and a Secondary NameNode, with multiple DataNodes (slaves).

MapReduce Framework
MapReduce is a programming framework that allows us to perform distributed, parallel processing on large data sets in a distributed environment.
Diagram: input data in HDFS → Map() tasks → aggregated intermediate data → Reduce() tasks → output data.
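As a concrete illustration, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (class names follow the standard tutorial pattern; input and output paths are supplied on the command line):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map(): runs in parallel on each block of the input, close to the data
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    context.write(word, ONE);  // emit (word, 1)
                }
            }
        }

        // Reduce(): receives all the counts for one word and aggregates them
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map tasks run in parallel on each HDFS block of the input and emit (word, 1) pairs; the framework groups the pairs by word, and the reduce tasks sum the counts.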

Big Data Problems Solved
Problem 1: Storing exponentially growing, huge datasets.
Solution: Hadoop HDFS.
HDFS is the storage unit of Hadoop.
It is a distributed file system: it divides files (input data) into smaller blocks and stores them across the cluster (e.g. a 512 MB file is divided into four 128 MB blocks spread across the DataNodes; see the configuration sketch below).
Scalable (easy to add DataNodes).
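The block size and replication factor in the example above come from HDFS configuration. A minimal hdfs-site.xml sketch (128 MB blocks and 3 replicas are the common defaults; the exact values here are illustrative):

    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>  <!-- block size in bytes: 128 MB -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>          <!-- each block is stored on 3 DataNodes -->
      </property>
    </configuration>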

Big Data Problems Solved
Problem 2: Storing unstructured data.
Solution: Hadoop HDFS.
Lets us store any kind of data, whether structured, semi-structured, or unstructured.
Follows WORM (Write Once, Read Many).
No schema validation is performed while loading the data.

Big Data Problems Solved
Problem 3: Processing data faster.
Solution: Hadoop MapReduce.
Provides parallel processing of the data present in HDFS.
Processes data locally, i.e. each node works on the part of the data that is stored on it (data locality).

What is the Hadoop Ecosystem?

Hadoop Ecosystem

Big Data Opportunity
Walmart story: Walmart made a lot of money selling "strawberry Pop-Tarts" during hurricanes as a result of Big Data analysis of its sales history.
IBM smart meters: By collecting and analyzing data from smart meters, IBM found that energy demand (and cost) is lower during off-peak hours, and it therefore advises consumers to run heavy machines during off-peak hours to reduce cost and energy.

References
https://www.edureka.co/blog/big-data-tutorial
https://www.ijsr.net/archive/v5i6/NOV164121.pdf
http://hadoop.apache.org/