Dr. Russell Anderson, Dr. Musa Jafar
West Texas A&M University

- The process of discovering useful information in large data repositories. (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006)
- Discovered information should be:
  - Valid
  - Previously unknown
  - Actionable

- Seven objectives of Lenox and Cuff in 2002 (based on ACM 2001 Ironman Report)
  - Prepare and warehouse data
  - Process data based on set of DM algorithms
  - Analyze results
  - Make predictions
  - Select proper algorithm
  - Make application
  - Motivated to continue graduate studies in DM
- We have added
  - Get to know data using statistical analysis tools
  - Use visualization tools for analysis and review

1. Get to know the data.
2. Select an appropriate data mining algorithm based on the data and the mining objective.
3. Construct a model using the selected algorithm.
4. Analyze the results.
5. Make application.

- How is it structured?
  - Single table/flat-file.
  - Multi-table – relationships
- Number of observations
- Number of dimensions (attributes)
- Compute summary statistics using a tool such as MS-Excel
- Visually evaluate characteristics of the data

- Tools developed:
  - Correlation Matrix
  - Scatter Plot
  - Parallel Coordinate Plot
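As an illustration of what a correlation matrix reports, here is a minimal sketch in plain Python. It is not the tool developed for the course; the function names and the toy columns (`height`, `weight`, `score`) are assumptions made up for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy data, invented for illustration only.
data = {
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises in step with height
    "score": [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as height rises
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(f"{name:>6}", " ".join(f"{row[other]:+.2f}" for other in data))
```

Values near +1 or -1 flag attribute pairs worth inspecting in a scatter plot; values near 0 suggest no linear relationship, although a nonlinear one may still exist.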

- Distributions of data
  - Data ranges of numeric attributes
  - Cardinality of discrete attributes
  - Shape of distribution
    - Skewed
    - Multi-modal
  - Location of outliers
- Identification of possible relationships between attributes
- Identification of subpopulations within the data
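The distribution checks listed above can be sketched with Python's standard library. This is one possible approach under stated assumptions: the 1.5 x IQR outlier fence and the mean-versus-median skew hint are common rules of thumb chosen for this example, not methods named in the presentation, and the sample values are invented.

```python
import statistics
from collections import Counter

def profile_numeric(values):
    """Range, centre, a rough skew hint, and outliers for a numeric attribute."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mean = statistics.fmean(ordered)
    if mean > median:
        skew_hint = "right"   # long tail toward large values
    elif mean < median:
        skew_hint = "left"    # long tail toward small values
    else:
        skew_hint = "none"
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": mean,
        "median": median,
        "skew_hint": skew_hint,
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

def profile_discrete(values):
    """Cardinality and most frequent value of a discrete attribute."""
    counts = Counter(values)
    return {"cardinality": len(counts), "most_common": counts.most_common(1)[0]}

# Hypothetical sample values, invented for illustration only.
numeric_profile = profile_numeric([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
discrete_profile = profile_discrete(["red", "brown", "brown", "white"])
print(numeric_profile)
print(discrete_profile)
```

A histogram is still the better way to spot a multi-modal shape; summary numbers like these mainly tell you where to look.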

- Microsoft Business Intelligence Tools
  - Association Analysis – aka market basket analysis
  - Classification
    - Decision Trees
    - Artificial Neural Network
    - Bayesian Analysis
  - Regression
  - Cluster Analysis
- Custom Tools with Embedded Visual Presentation
  - Artificial neural network for both classification and regression
  - Self-Organizing Map (SOM) for cluster analysis

- Purpose of each methodology
- Steps of underlying algorithm
- Data types supported
- Issues in construction and application
- Parameter settings
- Results interpretation

- Does the model fit the training data too well?
- Need to separate the available data into training and validation subsets.
- A visual view of training progress is valuable.
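The point about fitting the training data too well can be shown with a small holdout experiment. The sketch below is an assumption-laden illustration, not the course software: the data are synthetic (a threshold at 50 with 20% of labels flipped), and `memorizer` is a hypothetical 1-nearest-neighbour model chosen because it reproduces its training set perfectly.

```python
import random

def split_holdout(rows, validation_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (training, validation)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, rows):
    """Fraction of (x, label) rows that the model predicts correctly."""
    return sum(predict(x) == label for x, label in rows) / len(rows)

# Synthetic data (assumption): label is 1 when x > 50, then 20% of labels are flipped.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(0, 100)
    label = int(x > 50)
    if rng.random() < 0.2:
        label = 1 - label
    rows.append((x, label))

training, validation = split_holdout(rows)

def memorizer(x):
    """1-nearest-neighbour: return the label of the closest training row."""
    return min(training, key=lambda row: abs(row[0] - x))[1]

def threshold_rule(x):
    """A simple model that ignores the noise and uses the true threshold."""
    return int(x > 50)

for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(f"{name:>9}: training={accuracy(model, training):.2f} "
          f"validation={accuracy(model, validation):.2f}")
```

The memorizer scores perfectly on the rows it was built from because it has also memorized the flipped labels, so only the validation figure gives an honest estimate of how either model will do on new data.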

Mushroom edibility classifiers

Classifier A
                         Actual Edible   Actual Poisonous
  Predicted Edible            38%               0%
  Predicted Poisonous          8%              54%

Classifier B
                         Actual Edible   Actual Poisonous
  Predicted Edible            44%               1%
  Predicted Poisonous          2%              53%
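The arithmetic behind comparing the two classifiers can be worked through directly from the percentages above. The reading offered here, that overall accuracy alone may not pick the better classifier when one kind of error is far more costly, is our interpretation of the slide rather than a statement taken from it; the function names are made up for this sketch.

```python
# Keys are (predicted, actual); values are percentages of all mushrooms,
# copied from the two confusion matrices above.
classifier_a = {
    ("edible", "edible"): 38, ("edible", "poisonous"): 0,
    ("poisonous", "edible"): 8, ("poisonous", "poisonous"): 54,
}
classifier_b = {
    ("edible", "edible"): 44, ("edible", "poisonous"): 1,
    ("poisonous", "edible"): 2, ("poisonous", "poisonous"): 53,
}

def accuracy(matrix):
    """Percentage of mushrooms whose predicted class matches the actual class."""
    return sum(pct for (predicted, actual), pct in matrix.items() if predicted == actual)

def poisonous_called_edible(matrix):
    """Percentage of mushrooms that are poisonous but predicted edible."""
    return matrix[("edible", "poisonous")]

def edible_called_poisonous(matrix):
    """Percentage of mushrooms that are edible but predicted poisonous."""
    return matrix[("poisonous", "edible")]

for name, matrix in [("A", classifier_a), ("B", classifier_b)]:
    print(f"Classifier {name}: accuracy={accuracy(matrix)}% "
          f"poisonous called edible={poisonous_called_edible(matrix)}% "
          f"edible called poisonous={edible_called_poisonous(matrix)}%")
# Classifier A: accuracy=92% poisonous called edible=0% edible called poisonous=8%
# Classifier B: accuracy=97% poisonous called edible=1% edible called poisonous=2%
```

Classifier B is more accurate overall (97% versus 92%), but Classifier A never labels a poisonous mushroom as edible, which is the error a forager would most want to avoid.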

- Black Box - models built using sophisticated methodologies (ANNs, for example) perform very well, but gaining an understanding of the model itself is difficult.
  - Contribution of individual input attributes
  - Nature of contribution (shape of curve)
  - Interaction between input attributes
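One common way to probe the contribution of an individual input, and the shape of that contribution, is a partial dependence curve: force one attribute to each value on a grid, average the model's predictions over the data set, and see how the average moves. The sketch below assumes this technique for illustration; the presentation does not say which method its tools use, and `black_box` is a made-up stand-in for a trained model.

```python
def black_box(age, income):
    """Hypothetical stand-in for a trained model (invented for illustration)."""
    return 0.5 * age + (income / 10.0) ** 2

def partial_dependence(model, rows, attribute_index, grid):
    """Average prediction when one attribute is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            probe = list(row)
            probe[attribute_index] = value  # override only the probed attribute
            total += model(*probe)
        curve.append((value, total / len(rows)))
    return curve

# Hypothetical observations as (age, income) pairs.
rows = [(20, 10), (30, 20), (40, 30)]

age_curve = partial_dependence(black_box, rows, 0, [20, 30, 40])
income_curve = partial_dependence(black_box, rows, 1, [10, 20, 30])

print("age:   ", [(v, round(p, 2)) for v, p in age_curve])
print("income:", [(v, round(p, 2)) for v, p in income_curve])
```

In this toy case the age curve rises by the same amount at every step (a linear contribution), while the income curve rises faster and faster (a curved contribution). A single-attribute curve like this does not reveal interactions between inputs; that needs a two-attribute grid or another technique.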

- For a detailed presentation of the mechanics of the software deployed, attend our workshop tomorrow morning.
  - Saturday: 8-10 AM
  - Kachina A
- Microsoft SQL Server Business Intelligence Studio
- Visualization Tools
