Data Mining in Action: A Case Study


1 Data Mining in Action: A Case Study
Drew Minkin Archaeus Design Systems SQL Saturday #52 - Colorado

2 Drew Minkin
Past: Analytics Architect at Zilliant; Senior Consultant, Fujitsu; 6+ years at Microsoft Services as Escalation Engineer and Dedicated Field Engineer (“Alliance”); local speaker for SQL and BI; Lecturer, SMU’s Business School BI Grad Cert Program.
Present: Business Intelligence Architect at FiServ; ISV Predictive Analytics Architect, Archaeus Design Systems.

3 Agenda
Data Mining Intro; DM Methodology; Data Concepts; Validating and Testing Models; Applying Output with Scorecards

4 For Future Reference http://archive.ics.uci.edu/ml/

5 Data Mining Intro

6 Data Mining Intro
Methodology; Architecture; Information Flow; Technologies

7 Data Mining in the BI Spectrum

8 Data Mining Information Flow

9 Data Mining Architecture

10 Data Mining Methodology

11 Methodology
Problem Definition; Data Modeling; Data Discovery; Analytics Modeling; Applied Analytics; Model Validation

12 Problem Description
Business case and non-technical details of the predictive analytics inquiry:
business objectives and success criteria; requirements, assumptions and constraints; project plan, risks and contingencies; data mining goals and success criteria; terminology, tools and techniques.

13 Data Discovery
Analysis of source data for structural and content gaps:
data collection report; data description report; data exploration report; data quality report.

14 Data Modeling
Selection and manipulation of source data into a conformed entity input ready for formal exploration:
dataset, dictionary and rationale; data cleansing report; derived attributes; generated, merged and reformatted data.

15 Analytics Modeling
Research and analysis of patterns and creation of data mining models:
model; modeling technique; modeling assumptions; test design; parameter settings; model description.

16 Model Trials
Testing data mining models using different algorithms and validation of statistical significance:
revised parameter settings; model validation plan; model assessment.

17 Applied Analytics
Integration of models with new data:
deployment plan; monitoring and maintenance plan; final report; final presentation; experience documentation.

18 Data Concepts

19 Cases: What We Study
Case: the set of columns you want to analyze, e.g. Age, Gender, Region, Annual Spending.
Case Key: the unique ID of a case.
A column has a Data Type and a Content Type, and optionally a Distribution, Discretization, Related Columns, and Flags (e.g. NOT NULL).

20 Column Data Types
We don’t care about detailed low-level types.
DM only uses Text, Long, Boolean, Double and Date, plus Time and Sequence for some 3rd-party algorithms.

21 Column Content Types
Common: DISCRETE (Red, Blue); CONTINUOUS ($6,511.49); DISCRETIZED (1-5, 6-20, 21+).
Denotes a key: KEY.
For special purposes: KEY SEQUENCE, KEY TIME, ORDERED, CYCLICAL.

22 Column Usage
Some algorithms interpret this in different ways, but in general a column can be an input, predictable, or both:
Input: used for predicting another column.
PREDICT: the column is both predicted and acts as an input for predicting others.
PREDICT_ONLY: the column is predicted but not used as an input.
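The column metadata from slides 19 to 22 can be pictured as a small data structure. The sketch below is illustrative only: the class names, the `customer_case` list and the `CustomerId` key column are hypothetical, and it does not reproduce how any particular data mining engine stores its model definition.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    KEY = "KEY"
    DISCRETE = "DISCRETE"
    CONTINUOUS = "CONTINUOUS"
    DISCRETIZED = "DISCRETIZED"


class Usage(Enum):
    INPUT = "INPUT"                # used only to predict other columns
    PREDICT = "PREDICT"            # predicted, and also an input for others
    PREDICT_ONLY = "PREDICT_ONLY"  # predicted, never used as an input


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str  # one of Text, Long, Boolean, Double, Date
    content_type: ContentType
    usage: Usage = Usage.INPUT

    @property
    def is_input(self) -> bool:
        return self.usage in (Usage.INPUT, Usage.PREDICT)

    @property
    def is_predictable(self) -> bool:
        return self.usage in (Usage.PREDICT, Usage.PREDICT_ONLY)


# Hypothetical case definition built from the example columns on slide 19.
customer_case = [
    Column("CustomerId", "Long", ContentType.KEY),
    Column("Age", "Long", ContentType.CONTINUOUS),
    Column("Gender", "Text", ContentType.DISCRETE),
    Column("Region", "Text", ContentType.DISCRETE, Usage.PREDICT),
    Column("AnnualSpending", "Double", ContentType.CONTINUOUS, Usage.PREDICT_ONLY),
]

# The case key identifies a case; it is left out of the input list here.
inputs = [
    c.name
    for c in customer_case
    if c.is_input and c.content_type is not ContentType.KEY
]
predictable = [c.name for c in customer_case if c.is_predictable]
```

In this sketch `Region` appears in both lists because PREDICT columns are predicted and also feed other predictions, while `AnnualSpending` appears only in `predictable`.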

23 Discretization
When you don’t need to analyze the full continuous range, DM automatically converts the data into buckets (5 by default).
Techniques: AUTOMATIC, CLUSTERS, EQUAL_AREAS, THRESHOLDS.
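To make the bucket idea concrete, here is a minimal sketch of equal-frequency bucketing, which is the general idea behind a technique named EQUAL_AREAS. It is an illustration under that assumption, not the algorithm any specific product runs; the function names are hypothetical and the default of 5 buckets simply mirrors the slide.

```python
import bisect


def equal_area_cuts(values, bucket_count=5):
    """Return bucket_count - 1 cut points that split values into
    groups holding roughly the same number of items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bucket_count] for i in range(1, bucket_count)]


def bucket_index(value, cuts):
    """Map a value to its bucket number (0-based).
    A value equal to a cut point falls into the upper bucket."""
    return bisect.bisect_right(cuts, value)


ages = list(range(1, 101))  # hypothetical continuous column: 1..100
cuts = equal_area_cuts(ages)

# Count how many values land in each of the 5 buckets.
counts = [0] * 5
for age in ages:
    counts[bucket_index(age, cuts)] += 1
```

With evenly spread data the buckets come out the same size; with skewed data the cut points move so that each bucket still holds a similar share of the cases.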

24 Distribution
If you know the distribution of your data (you should), indicate it:
NORMAL: typical Gaussian bell curve.
LOG NORMAL: most values at the “beginning” of the scale.
UNIFORM: flat line, equally likely or perfectly random.
Other distributions can exist, but you cannot indicate them; the algorithm will still work fine.
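One rough way to tell a log-normal column from a normal one is to compare the skew of the raw values with the skew of their logarithms. The sketch below uses synthetic data from the standard library; the variable names and the sample itself are made up for illustration and are not taken from the case study.

```python
import math
import random
import statistics


def skewness(sample):
    """Population skewness: near 0 for symmetric data, positive when
    most values sit at the low end with a long right tail."""
    mean = statistics.fmean(sample)
    sd = statistics.pstdev(sample)
    return sum(((x - mean) / sd) ** 3 for x in sample) / len(sample)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic "spending" column drawn from a log-normal distribution.
spend = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
log_spend = [math.log(x) for x in spend]

raw_skew = skewness(spend)      # strongly positive for log-normal data
log_skew = skewness(log_spend)  # close to zero: the logs look normal
```

If the raw column is heavily right-skewed but its logarithm is roughly symmetric, LOG NORMAL is the more plausible hint; if the raw column is already symmetric, NORMAL fits better.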

25 Distribution
Nested Case: a case containing a table column, e.g. Purchases of a Customer.
Used for analyzing patterns in a relationship.
It has a Nested Key, which is not a “relational” foreign key!
Normally, the Nested Key is a column you want to analyze, e.g. Product Name or Model.
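A nested case can be pictured as a customer record that carries its own table of purchases. The sketch below builds that shape from two flat lists; the table contents, the `Purchases` column name and the `build_nested_cases` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical flat source tables.
customers = [
    {"CustomerId": 1, "Region": "West"},
    {"CustomerId": 2, "Region": "East"},
    {"CustomerId": 3, "Region": "East"},
]
purchases = [
    {"CustomerId": 1, "ProductName": "Road Bike"},
    {"CustomerId": 1, "ProductName": "Helmet"},
    {"CustomerId": 2, "ProductName": "Helmet"},
]


def build_nested_cases(customers, purchases):
    """Attach each customer's purchases as a nested table column.

    CustomerId is the case key. ProductName plays the role of the
    nested key: it identifies a row inside the nested table and is
    the thing being analyzed, not a pointer back to the parent.
    """
    rows_by_customer = defaultdict(list)
    for row in purchases:
        rows_by_customer[row["CustomerId"]].append(
            {"ProductName": row["ProductName"]}
        )
    return [
        {**customer, "Purchases": rows_by_customer.get(customer["CustomerId"], [])}
        for customer in customers
    ]


cases = build_nested_cases(customers, purchases)
```

A customer with no purchases still forms a case, just with an empty nested table.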

26 Testing and Validation

27 Algorithms and Use Cases
Use cases: Classification, Segmentation, Association, Forecasting, Text Analysis, Advanced Data Exploration, Estimation.
Algorithms: Association Rules, Clustering, Decision Trees, Linear Regression, Logistic Regression, Naïve Bayes, Neural Nets, Sequence Clustering, Time Series.

28 Algorithms and Use Cases
Features compared (Yes/No per algorithm): Drillthrough, PMML, DM Dimension.
Algorithms: Association, Clustering, Decision Trees, Linear Regression, Logistic Regression, Naive Bayes, Neural Network, Sequence Clustering, Time Series.

29 Varying Model Input

30 Varying Model Input

31 Fields to Notice
AVGGIFT: average dollar amount of gifts to date.
INCOME: household income.
LASTGIFT: last donation amount.
MAXRAMNT: dollar amount of largest gift to date.
MINRAMNT: dollar amount of smallest gift to date.
RAMNTALL: dollar amount of lifetime gifts to date.
WEALTH: wealth rating.
STATE: state abbreviation (a nominal/symbolic field).

32 Fields to Notice Donor Rank
DOMAIN/Cluster code: a nominal or symbolic field that can be broken down by bytes as explained below.
1st byte = urbanicity level of the donor's neighborhood: U = Urban, C = City, S = Suburban, T = Town, R = Rural.
2nd byte = socio-economic status (SES) of the neighborhood: 1 = Highest SES, 2 = Average SES, 3 = Lowest SES; except for Urban communities, where 1 = Highest SES, 2 = Above average SES, 3 = Below average SES, 4 = Lowest SES.
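Splitting the DOMAIN code into its two bytes gives two separate nominal inputs. The sketch below shows one way to decode it. It assumes the four Urban SES levels are coded 1 to 4 in the order listed on the slide, and the `decode_domain` helper and lookup tables are hypothetical names.

```python
URBANICITY = {
    "U": "Urban",
    "C": "City",
    "S": "Suburban",
    "T": "Town",
    "R": "Rural",
}

# SES scale for every urbanicity level except Urban.
SES_DEFAULT = {"1": "Highest SES", "2": "Average SES", "3": "Lowest SES"}

# Assumption: Urban communities use a four-level scale coded 1 to 4.
SES_URBAN = {
    "1": "Highest SES",
    "2": "Above average SES",
    "3": "Below average SES",
    "4": "Lowest SES",
}


def decode_domain(code):
    """Split a two-character DOMAIN code into (urbanicity, SES) labels."""
    if len(code) != 2 or code[0] not in URBANICITY:
        raise ValueError(f"unrecognized DOMAIN code: {code!r}")
    ses_table = SES_URBAN if code[0] == "U" else SES_DEFAULT
    if code[1] not in ses_table:
        raise ValueError(f"unrecognized SES level in DOMAIN code: {code!r}")
    return URBANICITY[code[0]], ses_table[code[1]]


urban_low = decode_domain("U4")
suburban_mid = decode_domain("S2")
```

Keeping the two parts as separate columns lets a model treat neighborhood type and socio-economic status independently instead of as one opaque code.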

33 Results from Discrete Donation

34 Results from Discretized Donation

35 Scorecarding Demo

36 The Future of SQL Data Mining

37 Acknowledgements
www.crisp-dm.org, www.sqlserverdatamining.com
Masao Okada, Rafal Lukawiecki, Eugene A. Asahara

