Download presentation
Presentation is loading. Please wait.
1
Big Data and Data Analytics
2
CONTENTS Module 1: Big Data Module 2: Business Intelligence/Analytics
Module 3: Visualization Module 4: Data Mining
3
MODULE 1 What is Big Data?
4
What is Big Data? In the last minute there were ……. What is Big Data
Massive sets of unstructured/semi-structured data from Web traffic, social media, sensors, etc Petabytes, exabytes of data Volumes too great for typical DBMS Information from multiple internal and external sources: Transactions Social media Enterprise content Sensors Mobile devices In the last minute there were ……. 204 million s sent 61,000 hours of music listened to on Pandora 20 million photo views 100,000 tweets 6 million views and 277,000 Facebook Logins 2+ million Google searches 3 million uploads on Flickr
5
What is Big Data? continued
Companies leverage data to adapt products and services to: Meet customer needs Optimize operations Optimize infrastructure Find new sources of revenue Can reveal more patterns and anomalies IBM estimates that by million jobs will be created globally to support big data 1.9 million of these jobs will be in the United States © The KTP Company, 2005
6
Where does Big Data come from?
Transactions Enterprise “Dark Data” Partner, Employee Customer, Supplier Monitoring Contracts Public Commercial Sensor Credit Weather Industry Social Media Population Sentiment Economic Network 6
7
Types of Data
8
Types of Data When collecting or gathering data we collect data from individuals cases on particular variables. A variable is a unit of data collection whose value can vary. Variables can be defined into types according to the level of mathematical scaling that can be carried out on the data. There are four types of data or levels of measurement: When thinking about research we are looking at gathering knowledge through some form of observation. Research in the scientific sense, as talked about in the first lecture, involves the systematic measurement of these observations. When we observe something we understand that observation in terms of its attributes for example, colour or texture, and we look at how that attribute varies across our observations. So, when we collect or gather data we collect information from individual cases (people, organisations, businesses and so on) about particular variables. Level of mathematical scaling? The level of mathematical precision with which the values of a variable can be expressed. 1. Categorical (Nominal) 2. Ordinal 3. Interval 4. Ratio
9
Categorical (Nominal) data
Nominal or categorical data is data that comprises of categories that cannot be rank ordered – each category is just different. The categories available cannot be placed in any order and no judgement can be made about the relative size or distance from one category to another. Categories bear no quantitative relationship to one another Examples: - customer’s location (America, Europe, Asia) - employee classification (manager, supervisor, associate) What does this mean? No mathematical operations can be performed on the data relative to each other. Therefore, nominal data reflect qualitative differences rather than quantitative ones.
10
Nominal data Examples: What is your gender? (please tick) Male Female Did you enjoy the film? (please tick) Yes No Systems for measuring nominal data must ensure that each category is mutually exclusive and the system of measurement needs to be exhaustive. Exhaustive: the system of categories system should have enough categories for all the observations Variables that have only two responses i.e. Yes or No, are known as dichotomies. Mutually exclusive: each observation (person, case, score) cannot fall into more than one category. Exhaustive: the system of categories system should have enough categories for all the observations.
11
Ordinal data No fixed units of measurement Examples:
How satisfied are you with the level of service you have received? (please tick) Very satisfied Somewhat satisfied Neutral Somewhat dissatisfied Very dissatisfied Example: Ordinal data is data that comprises of categories that can be rank ordered. Similarly with nominal data the distance between each category cannot be calculated but the categories can be ranked above or below each other. No fixed units of measurement Examples: - college football rankings - survey responses (poor, average, good, very good, excellent) What does this mean? Can make statistical judgements and perform limited maths. Ordinal data works with the same assumptions as nominal data but is more complex in that here, whilst still comprising of categories, now the categories themselves can be rank-ordered i.e. one category has more or less of some underlying quality than being in another category. What does this mean? We can make statistical judgements about the relative position of one value to another and can perform limited mathematical calculations.
12
Interval and ratio data
Both interval and ratio data are examples of scale data. Scale data: data is in numeric format ($50, $100, $150) data that can be measured on a continuous scale the distance between each can be observed and as a result measured the data can be placed in rank order. Ordinal data can be ranked – similarly interval and ratio data also operate on a mechanism of scale. The key difference is that with interval and ratio data the data is numerical and has numerically equal distances between each category. For example, a 5cm difference in height between 150cm and 155cm is the same as a 5cm difference between 15cm and 20cm. Imagine a 30cm ruler – it shows equal distance between each mark i.e. 10 1mm markers between each cm.
13
Interval data Ordinal data but with constant differences between observations Ratios are not meaningful Examples: Time – moves along a continuous measure or seconds, minutes and so on and is without a zero point of time. Temperature – moves along a continuous measure of degrees and is without a true zero. SAT scores Temperature and true zero – whilst there is a zero degrees this is a measure of temperature!
14
Ratio data Ratios are meaningful Examples: monthly sales
Ratio data measured on a continuous scale and does have a natural zero point. Ratios are meaningful Examples: monthly sales delivery times Weight Height Age
15
Data for Business Analytics
(continued) Classifying Data Elements in a Purchasing Database Figure 1.2 Ratio Ratio Ratio Ratio Interval Interval Categorical Categorical Categorical Categorical If there was field (column) for Supplier Rating (Excellent, Good, Acceptable, Bad), that data would be classified as Ordinal
16
Big Data Characteristics
Growing quantity of data e.g. social media, behavioral, video VOLUME VARIETY Quickening speed of data e.g. smart meters, process monitoring VELOCITY Gartner, Feb 2001 Increase in types of data e.g. app data, unstructured data
17
Which Big Data characteristic is the biggest issue for your organization?
Source: Getting Value from Big Data, Gartner Webinar, May 2012
18
Volume Petabytes, exabytes of data Volumes too great for typical DBMS
19
Volume - Bytes Defined eBay data warehouse (2010) = 10 PB eBay will increase this 2.5 times by 2011 Teradata > 10 PB Megabyte: 220 bytes or, loosely, one million bytes Gigabyte: 230 bytes or, loosely one billion bytes
20
Velocity Velocity Massive amount of streaming data
21
Variety Variety Massive sets of unstructured/semi-structured data from Web traffic, social media, sensors, and so on
22
Which source of data represents the most immediate opportunity?
Source: Getting Value from Big Data, Gartner Webinar, May 2012
23
Big Data Opportunities
Making better informed decisions e.g. strategies, recommendations Discovering hidden insights e.g. anomalies forensics, patterns, trends Automating business processes e.g. complex events, translation
24
Which is the biggest opportunity for Big Data in your organization?
Through 2017: 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage. Business analytics needs will drive 70% of investments in the expansion and modernization of information infrastructure. Source: Getting Value from Big Data, Gartner Webinar, May 2012 © The KTP Company, 2005
25
Identifying Insurance Fraud
Auto Insurance Opportunity Save and make money by reducing fraudulent auto insurance claims Data & Analytics Predictive analytics against years of historical claims and coverage data Text mining adjuster reports for hidden clues, e.g. missing facts, inconsistencies, changed stories Results Improved success rate in pursuing fraudulent claims from 50% to 88%; reduced fraudulent claim investigation time by 95% Marketing to individuals with low propensity for fraud What **“dark data” is just laying around that can transform business processes? **Operational data that is not being used. Consulting and market research company Gartner Inc. describes dark data as "information assets that organizations collect, process and store in the course of their regular business activity, but generally fail to use for other purposes." 25 © The KTP Company, 2005
26
Quality Improvement Opportunity Data & Analytics Result
Move from manual to automated inspection of burger bun production to ensure and improve quality Data & Analytics Photo-analyze over 1000 buns-per-minute for color, shape and seed distribution Continually adjust ovens and process automatically Result Eliminate 1000s of pounds of wasted product per year; speed production; save energy; Reduce manual labor costs Is the company using all of its “senses” to observe, measure and optimize business processes? 26
27
Improving Corporate Image
Opportunity Improve reputation, brand and buzz by tapping social media Data & Analytics Continually scanning twitterverse for mentions of their business Integrating tweeters with their robust customer management system Results Saw tweet from a top customer lamenting late flight—no time to dine at Morton’s Tuxedo-clad waiter waiting for him when he landed with a bag containing his favorite steak, prepared the way he normally likes it with all the fixin’s How can the company listen, analyze and respond in real-time?
28
Big Data, Big Rewards Read the case study “Big Data, Big Rewards”
29
MODULE 2 Business Analytics
30
Business Analytics/Business Intelligence
Business Analytics/Business intelligence (BI) is a broad category of applications, technologies, and processes for: gathering, storing, accessing, and analyzing data to help business users make better decisions.
31
Things Are Getting More Complex
Many companies are performing new kinds of analytics (**sentiment analysis, etc.), to better and more quickly understand and respond to what customers are saying about them and their products. The cloud, and appliances are being used as data stores Advanced analytics are growing in popularity and importance **Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.
32
Uncertainty of Data
33
Analytics Models VALUE Hindsight Insight Foresight DIFFICULTY
How can we make it happen? Prescriptive Analytics What will happen? Predictive Analytics Optimization Information Why did it happen? VALUE Diagnostic Analytics What happened? Descriptive Analytics Hindsight Insight Foresight DIFFICULTY 33
34
Descriptive Analytics
Descriptive analytics, such as reporting/OLAP, dashboards, and data visualization, have been widely used for some time. They are the core of traditional BI. Descriptive analytics, such as reporting/OLAP, dashboards, and data visualization, have been widely used for some time. They are the core of traditional BI. What has occurred? Descriptive analytics, such as data visualization, is important in helping users interpret the output from predictive and predictive analytics.
35
Predictive Analytics What will occur?
Algorithms for predictive analytics, such as regression analysis, machine learning, and neural networks, have also been around for some time. What will occur? Algorithms for predictive analytics, such as regression analysis, machine learning, and neural networks, have also been around for some time. Recently, however, software products have made them much easier to understand and use. They have also been integrated into specific applications, such as for campaign management. Marketing is the target for many predictive analytics applications. Descriptive analytics, such as data visualization, is important in helping users interpret the output from predictive and predictive analytics. Prescriptive analytics are often referred to as advanced analytics. Marketing is the target for many predictive analytics applications. Descriptive analytics, such as data visualization, is important in helping users interpret the output from predictive and prescriptive analytics.
36
Prescriptive Analytics
Prescriptive analytics are often referred to as advanced analytics. Often for the allocation of scarce resources Optimization What should occur? Prescriptive analytics provide an optimal solution, often for the allocation of scarce resources. They, too, have been in academia for a long time but are now finding wider use in practice. For example, the use of mathematical programming for revenue management is common for organizations that have “perishable” goods (e.g., rental cars, hotel rooms, airline seats). Harrah’s has been using revenue management for hotel room pricing for some time. Prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data of external factors such as economic data, population demographic trends and population health trends, to more accurately plan for future capital investments such as new facilities and equipment utilization as well as understand the trade-offs between adding additional beds and expanding an existing facility versus building a new one.
37
Organizational Transformation
Analytics are a competitive requirement For BI-based organizations, the use of BI/analytics is a requirement for successfully competing in the marketplace. TDWI report on Big Data Analytics found that 85% of respondents indicated that their firms would be using advanced analytics within three years IBM/MIT Sloan Management Review research study found that top performing companies in their industry are much more likely to use analytics rather than intuition across the widest range of possible decisions. In some industries, the use of advanced analytics has moved beyond a “nice to have” to being a requirement.
38
Complex Systems Require Analytics
Tackle complex problems and provide individualized solutions Products and services are organized around the needs of individual customers Dollar value of interactions with each customer is high There is high level of interaction with each customer Examples: IBM, World Bank, Halliburton
39
Volume Operations Require Analytics
Serves high-volume markets through standardized products and services Each customer interaction has a low dollar value Customer interactions are generally conducted through technology rather than person-to-person Are likely to be analytics-based Examples: Amazon.com, eBay, Hertz Analytics are critical to high volume companies competing in the marketplace.
40
The Nature of the Industry
Online retailers like Amazon.com and Overstock.com are high volume operations who rely on analytics to compete. When you enter their sites a cookie is placed on your PC and all clicks are recorded. Based on your clicks and any search terms, recommendation engines decide what products to display. After you purchase an item, they have additional information that is used in marketing campaigns. Customer segmentation analysis is used in deciding what promotions to send you. How profitable you are influences how the customer care center treats you. A pricing team helps set prices and decides what prices are needed to clear out merchandise. Forecasting models are used to decide how many items to order for inventory. Dashboards monitor all aspects of organizational performance Online retailers like Amazon.com and Overstock.com are great examples of high volume operations who rely on analytics to compete. As soon as you enter, their sites a cookie is placed on your PC and all clicks are recorded. Based on your clicks and any search terms, recommendation engines decide what products to display. After you purchase an item, they have additional information that is used in marketing campaigns. Customer segmentation analysis is used in deciding what promotions to send you. How profitable you are influences how the customer care center treats you. A pricing team helps set prices and decides what prices are needed to clear out merchandise. Forecasting models are used to decide how many items to order for inventory. Dashboards monitor all aspects of organizational performance.
41
Knowledge Requirements for Advanced Analytics
Business Domain Data Modeling Choosing the right data to include in models is important. Important to have some thoughts as to what variables might be related. Domain knowledge is necessary to understand how they can be used. Role of Business Analyst is crucial Consider the story of the relationship between beer and diapers in the market basket of young males in convenience stores. You still have to decide (or experiment to discover) whether it is better to put them together or spread them across the store (in the hope that other things will be bought while walking the isles). The findings were that men between years in age, shopping between 5pm and 7pm on Fridays, who purchased diapers were most likely to also have beer in their carts. This motivated the grocery store to move the beer isle closer to the diaper isle and instantaneously, a 35% increase in sales of both!
42
MODULE 3 Visualization
43
Visualization: Acquisition of Insight
Many people and institutions possess data that may ‘hide’ fundamental relations Realtors Bankers Air Traffic Controller Fraud investigators Engineers They want to be able to view some graphical representation of that data, maybe interact with it, and then be able to say…….ahha!
44
Example: Fraud Detection
The Serious Fraud Office (SFO) suspected mortgage fraud The SFO provided 12 filing cabinets of data After 12 person years a suspect was identified The suspect was arrested, tried and convicted
45
Example: Fraud Detection continued
The data was supplied in electronic form A visualization tool (Netmap) was used to examine the data After 4 person weeks the same suspect was identified A master criminal behind the fraud was also identified
46
Is Information Visualization Useful?
Drugs and Chips Texas Instruments Manufactures microprocessors on silicon wafers that are routed through 400 steps in many weeks. This process is monitored, gathering 140,000 pieces of information about each wafer. Somewhere in that heap of data can be warnings about things going wrong. Detect a bug early before bad chips are made. TI uses visualization tools to make the detection process easier Eli Lilly Has 1500 scientists using an advanced information visualization tool (Spotfire) for decision making. “With its ability to represent multiple sources of information and interactively change your view, it’s helpful for homing in on specific molecules and deciding whether we should be doing further testing on them” Sheldon Ort of Eli Lilly, speaking to Fortune
47
The Cholera Epidemic, London 1845
Dr. John Snow, medical officer for London, investigated the cholera epidemic of 1845 in Soho. He mapped the deaths and noted that the deaths, indicated by points, tended to occur near the Broad Street pump. Closure of the pump coincided with a reduction in cholera.
48
Challenger Disaster On 28th January 1986 the space shuttle Challenger exploded, and seven astronauts died, because two rubber O-Rings leaked. The previous day, engineers who designed the rocket opposed the launch, concerned that the O-Rings would not seal at the forecast temperature (25 to 29oF). After much discussion, the decision was taken to go ahead. Cause of the accident: An inability to assess the link between cool temperature and O-Ring damage on earlier flights. Many charts poorly presented
49
Visualization Refers to the innovative use of images and interactive technology to explore large, high- density datasets Help users see patterns and relationships that would be difficult to see in text lists Rich graphs, charts Dashboards Maps Increasingly is being used to identify insights into both structured and unstructured data for such areas as operational efficiencies profitability strategic planning Video Tableau
50
Introduction to Information Visualization - Fall 2012
Examples Geo data mapping Demo Introduction to Information Visualization - Fall 2012 50
51
Introduction to Information Visualization - Fall 2012
Examples Treemap Demo Introduction to Information Visualization - Fall 2012 51
52
Introduction to Information Visualization - Fall 2012
Examples Population “Trendalyzer” Demo Introduction to Information Visualization - Fall 2012 52
53
Video on the use of visualization to extract knowledge from data.
Watch Gary Flake on extreme visualization © The KTP Company, 2005
54
MODULE 4 Data Mining
55
What is Data Mining? The process of semi automatically analyzing large databases to find useful patterns (Silberschatz) Areas of Use Sales/ Marketing Diversify target market Identify clients needs to increase response rates Risk Assessment Identify Customers that pose high credit risk Fraud Detection Identify people misusing the system. E.g. People who have two Social Security Numbers Credit Card Fraud Detection Detect significant deviations from normal behavior: Network Intrusion Detection Customer Care Identify customers likely to change providers Identify customer needs Medicine Match patients with similar problems cure
56
Data Mining Techniques...
Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive]
57
Classification Classification is the process of predicting the class of a new item. Categorize the new item and identify to which class it belongs Example: A bank wants to classify its Home Loan Customers into groups according to their response to bank advertisements. The bank might use the classifications “Responds Rarely, Responds Sometimes, Responds Frequently”. The bank will then attempt to find rules about the customers that respond Frequently and Sometimes. The rules could be used to predict needs of potential customers.
58
Technique for Classification
Decision-Tree Classifiers Job Job Doctor Engineer Carpenter Income Income Income >100K <30K >50K <40K >90K <50K Bad Good Bad Good Bad Good Predicting credit risk of a person with the jobs specified.
59
Classification: Application 1
Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. Type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model.
60
Classification: Application 2
Fraud Detection Goal: Predict fraudulent cases in credit card transactions. Approach: Use credit card transactions and the information on its account-holder as attributes. When does a customer buy, what does he buy, how often he pays on time, etc. Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.
61
Classification: Application 3
Customer Attrition/Churn: Goal: To predict whether a customer is likely to be lost to a competitor. Approach: Use detailed record of transactions with each of the past and present customers, to find attributes. How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc. Label the customers as loyal or disloyal. Find a model for loyalty.
62
Clustering Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Example: Insurance company could use clustering to group clients by their age, location and types of insurance purchased. The categories are unspecified and this is referred to as ‘unsupervised learning’
63
Clustering continued Group data into clusters How is this achieved ?
Similar data is grouped in the same cluster Dissimilar data is grouped in the a different cluster How is this achieved ? Hierarchical Group data into t-trees K-Nearest Neighbor A classification method that classifies a point by calculating the distances between the point and points in the training data set. Then it assigns the point to the class that is most common among its k-nearest neighbors (where k is an integer)
64
Clustering: Application 1
Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
65
Clustering: Application 2
Market Segmentation: Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
66
Association Rule: Definition
Given a set of records each of which contain some number of items from a given collection; Produce dependency rules which will predict occurrence of an item based on occurrences of other items. Example: When a customer buys a hammer, then 90% of the time they will buy nails. Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
67
Association Rule Discovery: Application 1
Marketing and Sales Promotion: Let the rule discovered be {Bagels, … } --> {Potato Chips} Potato Chips as consequent => Can be used to determine what should be done to boost its sales. Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels. Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
68
Association Rule Discovery: Application 2
Supermarket shelf management. Goal: To identify items that are bought together by sufficiently many customers. Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. A classic rule -- If a customer buys diaper and milk, then he is very likely to buy beer. So, don’t be surprised if you find six-packs stacked next to diapers!
69
Association Rule Discovery: Application 3
Inventory Management: Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households. Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
70
Sequential Pattern Discovery
Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints. In telecommunications alarm logs, (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm) In point-of-sale transaction sequences, Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk) Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
71
Regression Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics, neural network fields. Examples: Predicting sales amounts of new product based on advertising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices.
72
Using Databases to Improve Business Performance and Decision Making
Web mining Discovery and analysis of useful patterns and information from Web Understand customer behavior Evaluate effectiveness of Web site, and so on Web content mining Mines content of Web pages Web structure mining Analyzes links to and from Web page Web usage mining Mines user interaction data recorded by Web server This slide continues the exploration of mining stored data for valuable information. In addition to text mining, there are several ways to mine information on the Web. The text uses the example of marketers using Google Trends and Google Insights for Search services, which track the popularity of various words and phrases used in Google search queries, to learn what people are interested in and what they are interested in buying. What type of Web mining is this? Why might a marketer be interested in the number of Web pages that link to a site?
73
Using Databases to Improve Business Performance and Decision Making
Text mining Extracts key elements from large unstructured data sets Stored s Call center transcripts Legal cases Patent descriptions Service reports, and so on Sentiment analysis software Mines s, blogs, social media to detect opinions Whereas data mining describes analysis of data in databases, there are other types of information that can be “mined,” such as text mining and Web mining.
74
Big Data, Big Rewards Read the case study “Big Data, Big Rewards”
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.