Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview.


1 Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

2 The Problem
Huge volumes of data overwhelm traditional methods of data analysis such as:
–Spreadsheets
–Ad hoc queries
–Multidimensional analysis tools
–Statistical analysis packages

3 What is Data Mining?
Exploratory data analysis based on a data warehouse
–Knowledge Discovery in Databases (KDD)
Data Mining extracts previously unknown and potentially useful information
–Rules, constraints, correlations, patterns, signatures and irregularities
The goal is to automate the methods for finding these in the data

4 Data Warehouse
A database usually separated from the operational database(s)
Used as a base for decision support systems
–Upper and middle management
–Not used for day-to-day management but for spotting trends and making path decisions
Typically very large and composed of recent copies from the operational database(s)
Data Mining is one of the applications that could use it

5 Goals of Data Mining
Prediction of future behaviors
–Seasonal or non-seasonal trends
–How will consumers respond to discounts?
–Allows the enterprise to be ready
Identification of an item, event or activity
–Intruders may be identified by the files they access or programs they use

6 Goals Again
Classification of categories of users or products
–Shoppers may be categorized as: discount seeking, rush, regular, or attached to certain brand names
–The store may be made more friendly to such shoppers
Optimize the use of time, space, materials and money

7 Knowledge Discovery
There are several types of discoverable knowledge
–Association rules
–Classification hierarchies
–Sequential patterns
–Time series patterns
–Clustering
Each of these needs more information

8 Association Rules
What we are looking for is knowledge of associations that are not obvious
This has gained traction in market basket research
–Very profitable information
If an MRI has characteristics a and b then it often has c
–This is an association rule

9 Market Basket Model
Premise: the items in a checkout transaction are not random
Thus we analyze customer transactions for patterns or association rules
These patterns may guide decisions on
–Sale items
–Shelf arrangement or product placement

10 Retail Example
A young father goes to the store to buy disposable diapers
On his way through the store he sees a Sports Illustrated and buys it
In general, people do not impulse buy disposable diapers, but while buying these, they may buy something else on impulse
Can we examine retail transaction records and perceive the connection?

11 Association Rule
Is of the form: X => Y
–Where both X and Y could be sets of items
The support of this rule is the percent of total transactions that contain both X and Y
The confidence of this rule is the number of transactions that contain both X and Y divided by the number of transactions that contain X
High support and high confidence indicate a rule that business decisions may be based upon
–Put the magazine rack on the route to the diapers
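A minimal sketch of how support and confidence could be computed for the rule diapers => magazine. The tickets are made up for illustration, and the function names are hypothetical. Confidence is computed in the standard way: transactions containing both X and Y divided by transactions containing X.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y.

    Assumes at least one transaction contains X (otherwise this divides by zero).
    """
    both = support(transactions, set(x) | set(y))
    return both / support(transactions, x)


# Made-up checkout tickets, for illustration only.
tickets = [
    {"diapers", "magazine"},
    {"diapers", "magazine", "milk"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"magazine"},
]

rule_support = support(tickets, {"diapers", "magazine"})
rule_confidence = confidence(tickets, {"diapers"}, {"magazine"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here two of the five tickets hold both items, and two of the three diaper tickets also hold a magazine.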

12 Agriculture Example
Landsat satellites are in polar orbits
They record data on all land every 18 days
A pixel is approximately 31 yards on a side
Seven bands from near infrared to ultraviolet are recorded for each pixel
Each band produces a 1-byte value
Can you get this data in a spreadsheet?
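A rough back-of-the-envelope sketch of the data volume behind the spreadsheet question. The pixel size and band count come from the slide. The 1,000 square mile region is a hypothetical figure chosen only for illustration, and the function names are made up.

```python
YARDS_PER_MILE = 1760
PIXEL_SIDE_YARDS = 31   # from the slide
BANDS = 7               # one byte per band, from the slide


def pixels_for_area(square_miles):
    """Approximate number of pixels covering an area of the given size."""
    pixels_per_mile_side = YARDS_PER_MILE / PIXEL_SIDE_YARDS
    return round(square_miles * pixels_per_mile_side ** 2)


def bytes_per_pass(square_miles):
    """Bytes recorded for the area in one satellite pass."""
    return pixels_for_area(square_miles) * BANDS


# Hypothetical region of 1,000 square miles.
region_pixels = pixels_for_area(1000)
region_bytes = bytes_per_pass(1000)
print(region_pixels, region_bytes)
```

With one row per pixel, even this modest region needs over three million rows per pass, which is more than the 1,048,576 rows a current Excel worksheet allows.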

13 Agricultural Rule
In middle summer a near infrared value in the range 48 to 255 and a red value in the range 0 to 31 suggest that the yield will be 128 to 255 bushels per acre
If the support and confidence are high this suggests that the farmer should apply nitrogen to the areas where near infrared was less than 48 and red was greater than 31

14 Computational Difficulties
Consider how many tickets a supermarket or department store might generate
In general, most of these tickets have more than two or three items
The store carries thousands of items
Discovering these association rules becomes computationally taxing
One good reason to keep this off of the operational database
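To give a feel for the size of the search space, the sketch below counts how many candidate itemsets exist for a store. The figure of 5,000 items is a hypothetical stand-in for "thousands of items", and the function name is made up.

```python
from math import comb


def candidate_itemsets(n_items, size):
    """Number of distinct itemsets of a given size drawn from n_items products."""
    return comb(n_items, size)


n_items = 5000  # hypothetical catalog size
pairs = candidate_itemsets(n_items, 2)
triples = candidate_itemsets(n_items, 3)
print(f"{pairs:,} possible pairs, {triples:,} possible triples")
```

Each candidate would have to be checked against every ticket, so a naive search is impractical and pruning is needed.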

15 Algorithm Properties
There are a number of algorithms for finding these rules
These typically exploit two properties:
Downward closure
–Every subset of a large itemset must also have large support
–Removing a few items does not hurt
Antimonotonicity
–Every superset of a small itemset must also have small support
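The two properties can be sketched as a level-wise search in the style of the Apriori algorithm. This is an illustrative sketch, not necessarily the algorithm the author has in mind. The tickets and the 0.4 support threshold are made up, and the function name is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, pruning early."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = supp(itemset)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: if any (k-1)-subset is not frequent, no superset can be.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if supp(c) >= min_support}
    return result


# Made-up checkout tickets, for illustration only.
tickets = [
    {"diapers", "magazine"},
    {"diapers", "magazine", "milk"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"magazine"},
]
frequent = frequent_itemsets(tickets, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Only itemsets built from frequent smaller itemsets are ever counted, which is what keeps the number of candidates manageable.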

16 Classification
Classifying data into predetermined groups
Then we can deal with the groups in different ways
AKA supervised learning
–Developed in the field of Artificial Intelligence
The process of clustering attempts to classify data into groups that are not predetermined

17 Models
The two typical models are decision trees and a set of rules
We look at the data to build the model and then use the model for new data
Consider in the next slide a decision tree for granting a credit card to an applicant

18 Example: Decision Tree
[Figure: decision tree. Nodes: Married (Yes / No), Salary (<25K, >75K), Balance (<5K, >5K), Age (<25: Fair, >25: Good). Leaf labels: Good, Fair, Poor.]
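A decision tree of this kind can be applied to a new applicant as a chain of tests. The sketch below is only an illustration: the figure's layout did not survive in this transcript, so the arrangement of the branches and the leaf labels under Salary and Balance are assumptions. Only the Age leaves (<25: Fair, >25: Good) are taken from the slide. The function name is hypothetical.

```python
def credit_risk(married, salary, balance, age):
    """Walk an assumed version of the credit card decision tree."""
    if married:
        # Assumed: married applicants are judged on salary.
        if salary < 25_000:
            return "Poor"   # assumed leaf label
        if salary > 75_000:
            return "Good"   # assumed leaf label
        return "Fair"       # assumed label for the middle salary range
    # Assumed: unmarried applicants are judged on balance, then age.
    if balance < 5_000:
        return "Poor"       # assumed leaf label
    # Age leaves are from the slide; an age of exactly 25 is treated as Good here.
    return "Fair" if age < 25 else "Good"


print(credit_risk(married=True, salary=90_000, balance=0, age=40))
print(credit_risk(married=False, salary=30_000, balance=10_000, age=22))
```

Once the tree is built from historical data, classifying a new record costs only a few comparisons.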

19 Clustering
AKA unsupervised learning
Classify the data into groups that you are not aware of to begin with
A distance function must be supplied that describes the distance between two points
–The points are often not purely numeric
–They are often not in 2 dimensions or even 3, which makes things interesting
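One possible distance function for points that mix numeric and non-numeric fields is sketched below. It scales each numeric difference by an assumed range, counts a categorical mismatch as 1, and averages over the fields. The shopper records, the field names and the ranges are all made up for illustration.

```python
def mixed_distance(a, b, numeric_ranges):
    """Average per-field distance between two records with the same keys.

    Numeric fields: absolute difference divided by the field's assumed range.
    Other fields: 0 if the values match, 1 if they differ.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical shopper records.
shopper_a = {"age": 25, "visits_per_month": 4, "favorite_brand": "Acme"}
shopper_b = {"age": 45, "visits_per_month": 8, "favorite_brand": "Acme"}
shopper_c = {"age": 27, "visits_per_month": 4, "favorite_brand": "Zenith"}
ranges = {"age": 50, "visits_per_month": 10}  # assumed value ranges

d_ab = mixed_distance(shopper_a, shopper_b, ranges)
d_ac = mixed_distance(shopper_a, shopper_c, ranges)
print(round(d_ab, 3), round(d_ac, 3))
```

The same function works for any number of fields, so the points need not live in two or three dimensions.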

20 Applications
Marketing
–Determine advertising, store placement, segmentation of customers
Finance
–Analysis of performance of securities
Manufacturing
–Optimizing resources, designing the manufacturing process
Health Care
–Discovery of items in X-ray and MRI images

21 Example
Certain diseases switch on genes characteristic of that disease
Drugs often switch off a gene
In 2011 a database of genes and what affected them was mined
The result was that mice infected with small cell lung cancer were treated with an antidepressant, imipramine
–The tumors were reduced

22 Telco Example
A local telephone company mines its connection data for possible marketing opportunities
A phone very busy in the 3PM to 6PM range suggests a teenager
–Pitch a teen phone
A phone busy in the 9AM to 5PM range suggests a home business
–Pitch a business line

23 Social Media
Publicly viewable social media presents a very large quantity of data
However it is:
–Noisy
–Unstructured
–Dynamic
It is of great interest in political campaigns, marketing, health care
–This is where people express things first

24 Finally
Much of the analysis done in data mining has been done for centuries
–What is different now is the amount and types of captured data
There are a number of commercial tools for mining
Many large companies have substantial investment and return on their mining activities

