Introduction to Big Data
James Miller
Some material from Big Data for Dummies and other sources.
What is Big Data?
Large Volumes, High Velocity, Wide Variety
· 8 Bits = 1 Byte
· 1000 Bytes = 1 Kilobyte
· 1000 Kilobytes = 1 Megabyte
· 1000 Megabytes = 1 Gigabyte
· 1000 Gigabytes = 1 Terabyte
· 1000 Terabytes = 1 Petabyte
· 1000 Petabytes = 1 Exabyte
· 1000 Exabytes = 1 Zettabyte
· 1000 Zettabytes = 1 Yottabyte
· 1000 Yottabytes = 1 Brontobyte
· 1000 Brontobytes = 1 Geopbyte
Waves of Data Management
1. Relational Databases and Data Warehouses
2. Web and Content Management – unstructured, including audio and video
3. Big Data
Where is data available?
Other major data sources
· Social Networks, Twitter
· Internet of Things (airplanes, cars, electric meters, machine logs)
· Email and other text
· Satellite Images
Why study Big Data / Data Analytics?
Why Big Data Now?
In the late 1990s, search engine and Internet companies like Google, Yahoo!, and Amazon.com were able to expand their business models by leveraging inexpensive hardware for computing and storage. Next, these companies needed a new generation of software technologies that would allow them to monetize the huge amounts of data they were capturing from customers. These companies could not wait for the results of analytic processing; they needed the capability to process and analyze this data in near real time.
Distributed Computing on commodity hardware
Why Virtual Machines?
Have someone else build and run your Big Data operation
· Amazon Elastic Compute Cloud (EC2)
· Google Big Data Services
· Microsoft Azure
What do you find in Big Data Centers?
· A room full of racks
· Racks full of cases of standard sizes
· Cases full of blades (a blade is an individual server on a circuit card that fits in a slot)
· Thousands of servers
· Organized wiring
· Specialized data management software
· Virtualized resources (machines, storage, networks)
· See optional links to Data Center Videos in Canvas
· How can we use all this hardware?
· How will all the software fit together?
· What do we need besides relational databases?
· How can we put many servers to work on our big data job(s)?
One view of Big Data Architecture
Big Data Management
This looks like trouble
Hadoop – What is it and why is it important?
MapReduce
Hadoop and MapReduce
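MapReduce splits a job into a map phase that emits key/value pairs and a reduce phase that combines all values sharing a key. Below is a minimal local sketch of the classic word-count example in plain Python, purely to illustrate the pattern; it is not code that runs on a Hadoop cluster.

```python
# A minimal local sketch of the map/reduce pattern (word count), written in
# plain Python as an illustration only; real jobs run on a Hadoop cluster.
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce: combine all counts that share the same key."""
    return (word, sum(counts))

documents = ["big data is big", "data about data"]

# Shuffle: group the intermediate values by key, as the framework would
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

print([reduce_phase(word, counts) for word, counts in groups.items()])
# [('big', 2), ('data', 3), ('is', 1), ('about', 1)]
```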
Hadoop Ecosystem
· HBase – billions of rows, uses the Hadoop Distributed File System (HDFS)
· Hive – batch-oriented data warehousing layer
· Pig and Pig Latin – language for loading and retrieving data stored in HDFS
· Sqoop – bulk import from relational databases and bulk export; generates map and reduce jobs
· ZooKeeper – coordination service that keeps distributed processes across the cluster in sync
Analytics and Big Data
Analytics Examples
See link to optional reading in Canvas
· Fraud Detection – insurance claims, credit card transactions
· Monetize your data – Google ads, license your data to others
· Predict which customers are good targets for upselling when they call in to the call center
Data Mining
Analyze large amounts of data to find patterns in that data
Techniques:
· Classification Trees
· Logistic Regression
· Neural Networks
· Clustering Techniques
Here's a classification tree example
Consider the situation where a telephone company wants to determine which residential customers are likely to disconnect their service. The telephone company has information consisting of the following attributes: how long the person has had the service, how much he spends on the service, whether he has had problems with the service, whether he has the best calling plan for his needs, where he lives, how old he is, whether he has other services bundled together with his calling plan, competitive information concerning other carriers' plans, and whether he still has the service or has disconnected it. Of course, you can find many more attributes than this. The last attribute is the outcome variable; this is what the software will use to classify the customers into one of two groups, perhaps called stayers and flight risks.
The data set is broken into training data and a test data set
The training data consists of observations, each described by a set of attributes, plus an outcome variable (binary in the case of a classification model) – in this case, stayer or flight risk. The algorithm is run over the training data and comes up with a tree that can be read like a series of rules. For example, if the customers have been with the company for more than ten years and they are over 55 years old, they are likely to remain loyal customers. These rules are then run over the test data set to determine how good this model is on "new data." Accuracy measures are provided for the model. For example, a popular technique is the confusion matrix, a table that shows how many cases were correctly versus incorrectly classified. If the model looks good, it can be deployed on other data as it becomes available (that is, used to predict new cases of flight risk). Based on the model, the company might decide, for example, to send out special offers to those customers it thinks are flight risks.
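As a concrete illustration of this workflow, the sketch below builds a classification tree with scikit-learn; the column names and the tiny data set are hypothetical stand-ins for the phone company's attributes, not real data.

```python
# A minimal sketch of the classification-tree workflow described above.
# The columns and the tiny data set are hypothetical, purely for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

data = pd.DataFrame({
    "years_of_service": [12, 1, 8, 2, 15, 3, 11, 4],
    "monthly_spend":    [60, 25, 55, 30, 70, 20, 65, 35],
    "age":              [58, 24, 61, 30, 67, 27, 56, 33],
    "outcome":          ["stayer", "flight_risk", "stayer", "flight_risk",
                         "stayer", "flight_risk", "stayer", "flight_risk"],
})

X = data.drop(columns="outcome")   # attributes
y = data["outcome"]                # outcome variable

# Split into training data and a test data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the tree on the training data, then check accuracy on the "new" test data
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))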
Text Analytics Tweets, Logs, Emails
Text Analytics
Customized Approaches
Customized Approaches
R Environment
· Effective data-handling and manipulation components
· Operators for calculations on arrays and other types of ordered data
· Tools specific to a wide variety of data analyses
· Advanced visualization capabilities
· Based on the S programming language, designed by programmers for programmers, with familiar concepts such as loops, conditionals, user-defined recursive functions, and many options for input/output
Customized Approaches
Avoid 100% custom coding
· TA-Lib – for stock market and individual stock analysis (see the sketch below)
· GeoTools – an open source geospatial toolkit for manipulating GIS data in many forms
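For example, here is a small sketch using TA-Lib's Python wrapper (an assumption: the wrapper is installed as the talib package); the closing prices are made up purely for illustration.

```python
# A small sketch using the TA-Lib Python wrapper (assumes the "talib"
# package is installed; the closing prices below are made up).
import numpy as np
import talib

close = np.array([44.0, 44.5, 45.1, 44.8, 45.6, 46.2, 45.9, 46.8, 47.1, 47.5])

sma = talib.SMA(close, timeperiod=5)   # 5-period simple moving average
rsi = talib.RSI(close, timeperiod=5)   # 5-period relative strength index
print(sma[-1], rsi[-1])
```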
Big Data Analysis Framework
· Support for multiple data types, including unstructured
· Handle batch processing and/or real-time data streams
· Use what already exists in your environment
· Support NoSQL and other newer forms of accessing data
· Overcome latency
· Provide cheap storage
· Integrate with cloud deployments
What data do you need?
· Exploratory Stage – look for patterns in large amounts of data
  · Hadoop and MapReduce – experiments
  · FlumeNG – stream data directly into Hadoop
  · May eliminate "uninteresting data" where no patterns are found
· Codify – attempt to implement a repeatable process
  · Example – a retailer discovers chatter about an upcoming college football event in a particular geographic area near one of its stores
· Integration – look for specific customers that should receive offers, increase inventory ...
  · Customer data, inventory data
Data Transformation
· Big Data may be "dirty"
· Extract, Transform, Load (ETL)
  · Look for patterns first without cleaning the data
  · When patterns useful to the business are found, then apply quality standards
· May be better to ELT (Extract, Load, Transform) – see the sketch below
  · Can process faster and in parallel in Hadoop
  · Can use Hadoop environment tools – HiveQL, Pig Latin
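A minimal sketch of the ELT idea, in plain Python rather than HiveQL or Pig Latin: raw rows are loaded as-is into a landing file, and quality rules (the hypothetical store_id cleanup below) are applied only afterwards, once useful patterns have been found.

```python
# A minimal ELT sketch in plain Python; the file layout and the cleanup rule
# are hypothetical, purely to illustrate loading before transforming.
import csv
import json

def extract_and_load(src_path, raw_path):
    """Load: copy raw rows into the landing area without cleaning them."""
    with open(src_path, newline="") as src, open(raw_path, "w") as raw:
        for row in csv.DictReader(src):
            raw.write(json.dumps(row) + "\n")

def transform(raw_path, clean_path):
    """Transform: apply quality standards after the data is already loaded."""
    with open(raw_path) as raw, open(clean_path, "w") as out:
        for line in raw:
            rec = json.loads(line)
            rec["store_id"] = rec.get("store_id", "").strip().upper()  # hypothetical rule
            if rec["store_id"]:                                        # drop unusable rows
                out.write(json.dumps(rec) + "\n")
```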
Streaming and Complex Event Processing
In complex event processing (CEP), recognition of a pattern in the stream of data triggers an action.
Example: the system detects that a customer has used the loyalty program at checkout and triggers an event to determine whether this is an important customer; if so, it provides a special offer or a gift at checkout.
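A minimal sketch of that trigger in Python; the event fields and the is_important_customer() lookup are hypothetical.

```python
# A minimal complex event processing sketch: inspect each event as it
# arrives and trigger an action when the pattern matches. The event fields
# and the importance lookup are hypothetical.
def is_important_customer(customer_id):
    return customer_id in {"C-1001", "C-2002"}  # stand-in for a real lookup

def handle_event(event):
    if event.get("type") == "checkout" and event.get("loyalty_card_used"):
        if is_important_customer(event["customer_id"]):
            print(f"Offer a gift to {event['customer_id']} at register {event['register']}")

stream = [
    {"type": "checkout", "loyalty_card_used": True,  "customer_id": "C-1001", "register": 4},
    {"type": "checkout", "loyalty_card_used": False, "customer_id": "C-3003", "register": 2},
]
for evt in stream:
    handle_event(evt)
```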
Getting started with Big Data
Do and Don't
· Do involve all business units in your big data strategy
· Do evaluate all delivery models for big data (cloud)
· Do think about your traditional data sources as part of your strategy
· Do plan for consistent metadata
· Do distribute your data
· Don't rely on a single approach for big data analytics
· Don't go big before you are ready
· Don't overlook the need to integrate data
· Don't forget to manage data securely
· Don't overlook the need to manage the performance of your data