Introduction to Big Data – James Miller

Some material adapted from Big Data For Dummies and other sources.

What is Big Data? Large Volumes, High Velocity, Wide Variety
Units of data size:
· 8 Bits = 1 Byte
· 1000 Bytes = 1 Kilobyte
· 1000 Kilobytes = 1 Megabyte
· 1000 Megabytes = 1 Gigabyte
· 1000 Gigabytes = 1 Terabyte
· 1000 Terabytes = 1 Petabyte
· 1000 Petabytes = 1 Exabyte
· 1000 Exabytes = 1 Zettabyte
· 1000 Zettabytes = 1 Yottabyte
· 1000 Yottabytes = 1 Brontobyte
· 1000 Brontobytes = 1 Geopbyte
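
As a rough illustration of the scale ladder above (this code is not from the original slides), here is a minimal Python sketch that converts a raw byte count into a human-readable unit using the 1000-based steps listed.

# Minimal sketch: convert a byte count to a human-readable unit,
# assuming the 1000-based (decimal) steps from the slide above.
UNITS = ["Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
         "Petabytes", "Exabytes", "Zettabytes", "Yottabytes"]

def human_readable(num_bytes):
    """Return num_bytes expressed in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} Brontobytes"  # past Yottabytes on the slide's scale

print(human_readable(3_200_000_000_000))  # 3.2 Terabytes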

Waves of Data Management
Wave 1: Relational Databases and Data Warehouses
Wave 2: Web and Content Management – unstructured data, including audio and video
Wave 3: Big Data

Where is data available?

Other major data sources
Social networks, Twitter
Internet of Things (airplanes, cars, electric meters, machine logs)
Email and other text
Satellite images

Why study Big Data / Data Analytics?

Why Big Data Now? In the late 1990s, search engine and Internet companies like Google, Yahoo!, and Amazon.com were able to expand their business models, leveraging inexpensive hardware for computing and storage. Next, these companies needed a new generation of software technologies that would allow them to monetize the huge amounts of data they were capturing from customers. These companies could not wait for results of analytic processing. They needed the capability to process and analyze this data in near real time.

Distributed Computing on commodity hardware

Why Virtual Machines?

Have someone else build and run your Big Data operation
Amazon Elastic Compute Cloud (EC2)
Google big data services
Microsoft Azure

What do you find in Big Data centers
A room full of racks
Racks full of cases of standard sizes
Cases full of blades (a blade is an individual server on a circuit card that fits in a slot)
Thousands of servers
Organized wiring
Specialized data management software
Virtualized resources (machines, storage, networks)
See optional links to Data Center videos in Canvas

How can we use all this hardware? How will all the software fit together? What do we need besides relational databases? How can we put many servers to work on our big data job(s)?

One view of Big Data Architecture

Big Data Management

This looks like trouble

Hadoop – What is it and why is it important?

Map Reduce

Hadoop and Map Reduce
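
The original deck showed only diagrams for the MapReduce slides. As a hedged illustration that is not part of the presentation, the minimal Python sketch below shows the classic word-count pattern in a MapReduce style: a mapper emits (word, 1) pairs, an in-memory shuffle stands in for what Hadoop does between the phases, and a reducer sums the counts per word.

# Hedged sketch of the MapReduce word-count pattern (not from the original deck).
# Hadoop would run map and reduce on many machines; here everything runs locally.
from collections import defaultdict

def map_phase(line):
    """Mapper: emit (word, 1) for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Stand-in for Hadoop's shuffle: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return key, sum(values)

if __name__ == "__main__":
    lines = ["big data needs big clusters", "hadoop splits big jobs"]
    pairs = (pair for line in lines for pair in map_phase(line))
    for word, counts in shuffle(pairs).items():
        print(reduce_phase(word, counts))   # e.g. ('big', 3)

In a real Hadoop Streaming job, the mapper and reducer would be separate scripts reading standard input and writing standard output, and Hadoop would distribute and rerun them across the cluster.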

Hadoop Ecosystem
HBase – stores billions of rows; uses the Hadoop Distributed File System (HDFS)
Hive – batch-oriented data warehousing layer
Pig and Pig Latin – language for loading and retrieving data stored in HDFS
Sqoop – bulk import from relational databases and bulk export; generates map and reduce jobs
ZooKeeper – coordination service used to manage and synchronize distributed jobs across racks

Analytics and Big Data

Analytics Examples (see link to optional reading in Canvas)
Fraud detection – insurance claims, credit card transactions (a brief sketch follows below)
Monetize your data – Google ads
License your data to others
Predict which customers are good targets for upselling when they call in to the call center
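
To make the fraud-detection example concrete, here is a hedged, minimal Python sketch (not from the original slides; the amounts and the threshold are invented) that flags a transaction sitting far outside a customer's history using a simple z-score rule. Real systems combine many signals and models, but the idea of scoring each event against past behaviour is the same.

# Hedged sketch: flag a transaction that is far outside a customer's history.
# The z-score threshold of 3 is an illustrative choice, not a recommendation.
import statistics

def is_suspicious(new_amount, history, threshold=3.0):
    """Return True if new_amount is more than `threshold` std devs from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_amount != mean
    return abs(new_amount - mean) / stdev > threshold

history = [42.0, 38.5, 51.0, 44.2, 39.9, 47.3]   # made-up past card transactions
print(is_suspicious(2500.0, history))  # True: flagged for review
print(is_suspicious(45.0, history))    # False: looks normal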

Data Mining
Analyze large amounts of data to find patterns in that data
Techniques:
Classification trees
Logistic regression
Neural networks
Clustering techniques

Here’s a classification tree example. Consider the situation where a telephone company wants to determine which residential customers are likely to disconnect their service. The telephone company has information consisting of the following attributes: how long the person has had the service, how much he spends on the service, whether he has had problems with the service, whether he has the best calling plan for his needs, where he lives, how old he is, whether he has other services bundled together with his calling plan, competitive information concerning other carriers' plans, and whether he still has the service or has disconnected it. Of course, you can find many more attributes than this. The last attribute is the outcome variable; this is what the software will use to classify the customers into one of two groups, perhaps called stayers and flight risks.
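
As a hedged sketch of how such a tree might be built (this code is not part of the original deck, and the feature names and toy data are made up), the following fits a small classification tree with scikit-learn and prints it as the series of rules the text describes.

# Hedged sketch: fit a classification tree on made-up churn data.
# Features: years_with_service, monthly_spend, had_problems (0/1);
# label: 1 = flight risk (disconnected), 0 = stayer.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [12, 40, 0], [11, 55, 0], [15, 35, 0], [1, 80, 1],
    [2, 95, 1], [1, 70, 1], [8, 45, 0], [3, 85, 1],
]
y = [0, 0, 0, 1, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The fitted tree reads like a series of if/then rules, as described above.
print(export_text(tree, feature_names=["years_with_service",
                                       "monthly_spend",
                                       "had_problems"]))
print(tree.predict([[10, 50, 0]]))  # predicts 0 (stayer) on this toy data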

The data set is broken into a training data set and a test data set. The training data consists of observations described by the attributes, plus an outcome variable (binary in the case of a classification model); in this case, stayer or flight risk. The algorithm is run over the training data and comes up with a tree that can be read like a series of rules. For example, if the customers have been with the company for more than ten years and they are over 55 years old, they are likely to remain loyal customers. These rules are then run over the test data set to determine how good the model is on “new” data. Accuracy measures are provided for the model; a popular technique is the confusion matrix, a table that shows how many cases were correctly versus incorrectly classified. If the model looks good, it can be deployed on other data as it becomes available (that is, used to predict new cases of flight risk). Based on the model, the company might decide, for example, to send special offers to the customers it thinks are flight risks.
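
Continuing the hedged sketch above (again, not from the original slides, and with made-up data), this is how a train/test split and a confusion matrix might look with scikit-learn; correct classifications sit on the diagonal of the matrix.

# Hedged sketch: hold out a test set and evaluate with a confusion matrix
# (rows = actual class, columns = predicted class).
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X = [
    [12, 40, 0], [11, 55, 0], [15, 35, 0], [1, 80, 1],
    [2, 95, 1], [1, 70, 1], [8, 45, 0], [3, 85, 1],
    [9, 60, 0], [2, 75, 1], [14, 30, 0], [1, 90, 1],
]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

# Diagonal cells count correctly classified test cases; the rest are errors.
print(confusion_matrix(y_test, predictions, labels=[0, 1]))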

Text Analytics Tweets, Logs, Emails

Text Analytics

Customized Approaches

Customized Approaches: the R Environment
Effective data-handling and manipulation components
Operators for calculations on arrays and other types of ordered data
Tools specific to a wide variety of data analyses
Advanced visualization capabilities
Based on the S programming language: designed by programmers, for programmers, with familiar concepts such as loops, conditionals, user-defined recursive functions, and many options for input/output

Customized Approaches: avoid 100% custom coding
TA-Lib – library for stock market and individual stock analysis
GeoTools – an open-source geospatial toolkit for manipulating GIS data in many forms

Big Data Analysis Framework
Support for multiple data types, including unstructured data
Handle batch processing and/or real-time data streams
Use what already exists in your environment
Support NoSQL and other newer forms of accessing data
Overcome latency
Provide cheap storage
Integrate with cloud deployments

What data do you need?
Exploratory stage – look for patterns in large amounts of data
  Hadoop and MapReduce – run experiments
  FlumeNG – stream data directly into Hadoop
  May eliminate “uninteresting” data where no patterns are found
Codify – attempt to implement a repeatable process
  Example: a retailer discovers chatter about an upcoming college football event in a particular geographic area near one of its stores
Integration – look for specific customers that should receive offers, increase inventory, and so on
  Draws on customer data and inventory data

Data Transformation
Big data may be “dirty”
Traditional approach: Extract, Transform, Load (ETL)
With big data, look for patterns first without cleaning the data; when patterns useful to the business are found, then apply quality standards
It may be better to ELT (Extract, Load, Transform), as in the sketch below
  Can process faster and in parallel in Hadoop
  Can use Hadoop environment tools – HiveQL, Pig Latin
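
As a hedged, Python-flavoured illustration of the ELT idea (the deck points to HiveQL and Pig Latin instead; the file name and column names below are hypothetical), this sketch loads raw data as-is, explores it for a pattern, and only then applies quality standards to the slice that turned out to be useful.

# Hedged ELT-style sketch: load raw data first, transform only what proves useful.
# "transactions.csv" and its columns (store, region, amount) are hypothetical,
# and amount is assumed to parse as numeric.
import pandas as pd

# Extract + Load: pull the raw file in without any cleaning.
raw = pd.read_csv("transactions.csv")

# Explore: look for a pattern in the raw, uncleaned data.
by_region = raw.groupby("region")["amount"].sum().sort_values(ascending=False)
print(by_region.head())

# Transform: apply quality standards only to the slice that matters.
top_region = by_region.index[0]
useful = raw[raw["region"] == top_region].copy()
useful = useful.dropna(subset=["amount"])                   # drop incomplete records
useful["store"] = useful["store"].str.strip().str.upper()   # normalize keys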

Streaming and Complex Event Processing (CEP)
CEP is triggered by recognition of a pattern in the stream of data
Example: the system detects that a customer has used the loyalty program at checkout, triggers an event to determine whether this is an important customer, and if so, provides a special offer or a gift at checkout (a brief sketch follows below)
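
A hedged, minimal sketch of that checkout rule follows (not from the original deck; the event fields and the lifetime-spend threshold are invented): each event in the stream is checked against a pattern, and a match triggers the follow-up action.

# Hedged CEP-style sketch: watch a stream of checkout events and fire an
# action when a loyalty-card use by an important customer is recognized.
# Event fields and the lifetime-spend threshold are invented for illustration.
LIFETIME_SPEND = {"C001": 12500.0, "C002": 310.0}   # hypothetical customer profile data
IMPORTANT_THRESHOLD = 10000.0

def is_important(customer_id):
    return LIFETIME_SPEND.get(customer_id, 0.0) >= IMPORTANT_THRESHOLD

def process_stream(events):
    for event in events:
        # Pattern: loyalty card used at checkout by an important customer.
        if event.get("type") == "checkout" and event.get("loyalty_card_used"):
            if is_important(event["customer_id"]):
                print(f"Offer a gift to {event['customer_id']} at register "
                      f"{event['register']}")

if __name__ == "__main__":
    stream = [
        {"type": "checkout", "customer_id": "C001", "loyalty_card_used": True,  "register": 4},
        {"type": "checkout", "customer_id": "C002", "loyalty_card_used": True,  "register": 2},
        {"type": "checkout", "customer_id": "C003", "loyalty_card_used": False, "register": 1},
    ]
    process_stream(stream)   # fires only for C001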

Getting started with Big Data

Do and Don’t
Do involve all business units in your big data strategy
Do evaluate all delivery models for big data (cloud)
Do think about your traditional data sources as part of your strategy
Do plan for consistent metadata
Do distribute your data
Don’t rely on a single approach for big data analytics
Don’t go big before you are ready
Don’t overlook the need to integrate data
Don’t forget to manage data securely
Don’t overlook the need to manage the performance of your data