
Drew Minkin archaeusdesignsystem@yahoo.com

◦ Past  Analytics Architect at Zilliant  Senior Consultant, Fujitsu  6+ years Microsoft Services  Escalation Engineer  Dedicated Field Engineer (“Alliance”)  Local speaker for SQL and BI  OLAP Lecturer, SMU’s BI Graduate Certificate Program ◦ Present  Business Intelligence Architect at FiServ ISV  Part time data miner for hire

 Data Mining Intro  DM Methodology  Data Concepts  Validating and Testing Models  Applying Output with Scorecards

 http://archive.ics.uci.edu/ml/  http://www.kdnuggets.com/

 Methodology  Architecture  Information Flow  Technologies

 Problem Definition  Data Modeling  Data Discovery  Analytics Modeling  Applied Analytics  Model Validation

 Business case and non-technical details of the predictive analytics inquiry
  ◦ Business objectives and success criteria
  ◦ Requirements, assumptions and constraints
  ◦ Project plan, risks and contingencies
  ◦ Data mining goals and success criteria
  ◦ Terminology, tools and techniques

 Analysis of source data for structural and content gaps
  ◦ Data collection report
  ◦ Data description report
  ◦ Data exploration report
  ◦ Data quality report

 Selection and manipulation of source data into a conformed entity input, ready for formal exploration
  ◦ Dataset, dictionary and rationale
  ◦ Data cleansing report
  ◦ Derived attributes
  ◦ Generated, merged and reformatted data

 Research and analysis of patterns and creation of data mining models
  ◦ Model
  ◦ Modeling technique
  ◦ Modeling assumptions
  ◦ Test design
  ◦ Parameter settings
  ◦ Model description

 Testing data mining models using different algorithms and validation of statistical significance
  ◦ Revised parameter settings
  ◦ Model validation plan
  ◦ Model assessment

 Integration of models with new data
  ◦ Deployment plan
  ◦ Monitoring and maintenance plan
  ◦ Final report
  ◦ Final presentation
  ◦ Experience documentation

 Case – set of columns you want to analyze
  ◦ Age, Gender, Region, Annual Spending
 Case Key – unique ID of a case
 A column has:
  ◦ Data Type
  ◦ Content Type
  ◦ And optionally:
     Distribution
     Discretization
     Related Columns
     Flags (e.g. NOT NULL)

 We don’t care about detailed low-level types
 DM only uses:
  ◦ Text
  ◦ Long
  ◦ Boolean
  ◦ Double
  ◦ Date
  ◦ and, for some third-party algorithms:
     Time and Sequence

 Common:
  ◦ DISCRETE
     Red, Blue
  ◦ CONTINUOUS
     $6,511.49
  ◦ DISCRETIZED
     1-5, 6-20, 21+
 Denotes a key:
  ◦ KEY
 For special purposes:
  ◦ KEY SEQUENCE
  ◦ KEY TIME
  ◦ ORDERED
  ◦ CYCLICAL

 Some algorithms interpret column usage in different ways, but in general, columns are for:
 Input
  ◦ Used for predicting another column
 PREDICT
  ◦ These columns are both predicted and act as inputs for predicting others
 PREDICT_ONLY
  ◦ Predicted, but not used as input
 Columns can be input, predictable, or both
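As a rough illustration of the usage flags on this slide, the sketch below splits a set of columns into model inputs and prediction targets. The function name `split_columns`, the upper-case `INPUT` label, and the sample column `Churned` are assumptions made for this example; they are not part of any product API.

```python
def split_columns(usage):
    """Split a {column: usage flag} mapping into (inputs, targets).

    INPUT        -> input only
    PREDICT      -> predicted and also used as an input for other targets
    PREDICT_ONLY -> predicted, never used as an input
    """
    valid = {"INPUT", "PREDICT", "PREDICT_ONLY"}
    unknown = {flag for flag in usage.values() if flag not in valid}
    if unknown:
        raise ValueError(f"unknown usage flags: {sorted(unknown)}")
    inputs = [col for col, flag in usage.items() if flag in ("INPUT", "PREDICT")]
    targets = [col for col, flag in usage.items() if flag in ("PREDICT", "PREDICT_ONLY")]
    return inputs, targets


# Hypothetical column set for illustration only.
usage = {
    "Age": "INPUT",
    "Region": "INPUT",
    "AnnualSpending": "PREDICT",
    "Churned": "PREDICT_ONLY",
}
inputs, targets = split_columns(usage)
print("inputs: ", inputs)
print("targets:", targets)
```

Note that a PREDICT column shows up in both lists, which matches the "input or predictable or both" rule.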

 When you don’t need to analyze the full continuous range
 DM automatically converts data into buckets
  ◦ By default, into 5
 Techniques:
  ◦ AUTOMATIC
  ◦ CLUSTERS
  ◦ EQUAL_AREAS
  ◦ THRESHOLDS
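To make the bucketing idea concrete, here is a minimal sketch of equal-frequency discretization, in the spirit of what EQUAL_AREAS suggests: each bucket receives roughly the same number of cases. This is an assumption-laden illustration, not the engine's actual implementation; the helper names `equal_area_cutpoints` and `bucket_index` are hypothetical.

```python
from bisect import bisect_right


def equal_area_cutpoints(values, buckets=5):
    """Return buckets-1 cut points that split the values into
    roughly equally populated buckets (equal-frequency binning)."""
    if buckets < 1:
        raise ValueError("buckets must be at least 1")
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // buckets] for i in range(1, buckets)]


def bucket_index(value, cutpoints):
    """Map a continuous value to its bucket number (0-based)."""
    return bisect_right(cutpoints, value)


# Illustrative data: annual spending values 1..100.
spending = list(range(1, 101))
cuts = equal_area_cutpoints(spending, buckets=5)
counts = [0] * 5
for amount in spending:
    counts[bucket_index(amount, cuts)] += 1
print("cut points:", cuts)
print("cases per bucket:", counts)
```

With skewed data the cut points bunch up where the values are dense, which is the main difference from fixed-width ranges.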

 If you know the distribution of your data (you should), indicate it:
  ◦ NORMAL
     Typical Gaussian bell-curve
  ◦ LOG NORMAL
     Most values at the “beginning” of the scale
  ◦ UNIFORM
     Flat line – equally likely or perfectly random
 Other distributions can exist, but you cannot indicate them – the algorithm will still work fine
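One informal way to get a feel for the three shapes is to compare the mean and the median of a column: they roughly coincide for symmetric data (normal, uniform), while the mean is pulled above the median when most values sit at the low end of the scale (log normal). The sketch below uses synthetic samples from Python's standard library; the function names, parameters, and the "skew hint" measure are assumptions for illustration, not a formal distribution test.

```python
import random
import statistics


def sample_distributions(n=10_000, seed=42):
    """Draw synthetic samples for the three distribution hints."""
    rng = random.Random(seed)
    return {
        "NORMAL": [rng.gauss(50.0, 10.0) for _ in range(n)],
        "LOG NORMAL": [rng.lognormvariate(0.0, 1.0) for _ in range(n)],
        "UNIFORM": [rng.uniform(0.0, 100.0) for _ in range(n)],
    }


def skew_hint(values):
    """(mean - median) / standard deviation: near 0 for symmetric data,
    clearly positive when a long right tail drags the mean upward."""
    spread = statistics.pstdev(values)
    return (statistics.mean(values) - statistics.median(values)) / spread


samples = sample_distributions()
for name, values in samples.items():
    print(f"{name:<10} skew hint = {skew_hint(values):+.2f}")
```

A near-zero hint does not distinguish normal from uniform; a histogram is still the quickest check for that.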

 Nested Case – case containing a table column
  ◦ Purchases of a Customer
 Used for analyzing patterns in a relationship
 It has a Nested Key
  ◦ Not a “relational” foreign key!
  ◦ Normally, the Nested Key is a column you want to analyze
     E.g.: Product Name or Model
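The nested case structure can be sketched as follows: flat customer and purchase rows are folded into one case per customer, each carrying a nested table of purchases. The field names (`customer_id`, `product_name`) and the sample rows are hypothetical. Here `customer_id` only links the rows together, while `product_name` plays the role of the Nested Key, the column actually being analyzed.

```python
from collections import defaultdict


def build_nested_cases(customers, purchases):
    """Attach each customer's purchases as a nested table.

    The relational link (customer_id) is dropped from the nested rows;
    what remains is the nested key (product_name).
    """
    by_customer = defaultdict(list)
    for row in purchases:
        by_customer[row["customer_id"]].append({"product_name": row["product_name"]})
    return [
        {**customer, "purchases": by_customer.get(customer["customer_id"], [])}
        for customer in customers
    ]


# Hypothetical source rows for illustration only.
customers = [
    {"customer_id": 1, "age": 34, "region": "West"},
    {"customer_id": 2, "age": 51, "region": "South"},
]
purchases = [
    {"customer_id": 1, "product_name": "Road Bike"},
    {"customer_id": 1, "product_name": "Helmet"},
    {"customer_id": 2, "product_name": "Helmet"},
]
cases = build_nested_cases(customers, purchases)
for case in cases:
    print(case)
```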

[Figure: Algorithms and Use Cases matrix]
 Use cases: Classification, Estimation, Segmentation, Association, Forecasting, Text Analysis, Advanced Data Exploration
 Algorithms: Time Series, Sequence Clustering, Neural Nets, Naïve Bayes, Logistic Regression, Linear Regression, Decision Trees, Clustering, Association Rules

Algorithm             Drillthrough  PMML  DM Dimension
Association           Yes           No    Yes
Clustering            Yes           Yes   Yes
Decision Trees        Yes           Yes   Yes
Linear Regression     Yes           No    No
Logistic Regression   No            No    No
Naive Bayes           Yes           Yes   No
Neural Network        No            No    No
Sequence Clustering   Yes           No    Yes
Time Series           Yes           No    No

 AVGGIFT – Average dollar amount of gifts to date
 INCOME – Household income
 LASTGIFT – Last donation amount
 MAXRAMNT – Dollar amount of largest gift to date
 MINRAMNT – Dollar amount of smallest gift to date
 RAMNTALL – Dollar amount of lifetime gifts to date
 WEALTH1 – Wealth rating
 WEALTH2 – Wealth rating
 STATE – State abbreviation (a nominal/symbolic field)

 Donor Rank
 DOMAIN/Cluster code. A nominal or symbolic field that can be broken down by bytes as explained below.
  ◦ 1st byte = Urbanicity level of the donor's neighborhood
     U = Urban
     C = City
     S = Suburban
     T = Town
     R = Rural
  ◦ 2nd byte = Socio-economic status (SES) of the neighborhood
     1 = Highest SES
     2 = Average SES
     3 = Lowest SES
     Except for Urban communities: 1 = Highest SES, 2 = Above average SES, 3 = Below average SES, 4 = Lowest SES
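The byte-by-byte breakdown of the DOMAIN code can be sketched as a small decoder, which is one way to turn the single symbolic field into two derived attributes. The function name `decode_domain` and the error handling are assumptions for illustration; the lookup tables follow the slide.

```python
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}

# SES scale for non-urban neighborhoods (three levels).
SES_OTHER = {"1": "Highest SES", "2": "Average SES", "3": "Lowest SES"}

# Urban neighborhoods use a four-level scale instead.
SES_URBAN = {
    "1": "Highest SES",
    "2": "Above average SES",
    "3": "Below average SES",
    "4": "Lowest SES",
}


def decode_domain(code):
    """Split a two-character DOMAIN code into (urbanicity, SES) labels."""
    code = code.strip().upper()
    if len(code) != 2:
        raise ValueError(f"expected a 2-character code, got {code!r}")
    area, ses = code[0], code[1]
    if area not in URBANICITY:
        raise ValueError(f"unknown urbanicity byte: {area!r}")
    ses_table = SES_URBAN if area == "U" else SES_OTHER
    if ses not in ses_table:
        raise ValueError(f"unknown SES byte {ses!r} for area {area!r}")
    return URBANICITY[area], ses_table[ses]


for sample in ("U4", "S2", "T3"):
    print(sample, "->", decode_domain(sample))
```

Splitting the code this way lets urbanicity and SES enter a model as separate discrete columns rather than as one combined value.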


 www.crisp-dm.org  www.sqlserverdatamining.com  Masao Okada  Rafal Lukawiecki  Eugene A. Asahara

Data Mining in Action: A Case Study
Drew Minkin (madmanminkin)
Evaluation Links
