III 1 Sorin Alexe RUTCOR, Rutgers University, Piscataway, NJ URL: rutcor.rutgers.edu/~salexe Datascope - a new tool.


2 III 1 Sorin Alexe, RUTCOR, Rutgers University, Piscataway, NJ. E-mail: salexe@rutcor.rutgers.edu. URL: rutcor.rutgers.edu/~salexe. Datascope - a new tool for Logical Analysis of Data (LAD). DIMACS Mixer Series, September 19, 2002

3 III 2 LAD - Problem: Dataset; Hidden Function; LAD Approximation

4 III 3 LAD - Patterns: Positive Pattern; Negative Pattern

5 III 4 LAD - Theories, Models, Classifications: Positive Theory; Negative Theory; Model

6 III 5 Datascope Functions: Support Set Identification; Space Discretization; Pattern Detection; Model Construction; Discriminant / Prognostic Index; Classification; Feature Analysis

7 III 6 Datascope Dataflow (diagram labels): Matlab Solver; Internal Solver; Discretization; Significant Features; Cutpoints, Support Set; Feature Analysis; Pattern Space; Diagnosis; Prognosis; Risk Stratification; Pandect Generation; Discriminant Construction; User Excel Model; Pre-Processing; Raw Data; Theories/Models; Pattern Report

8 III 7 1. Support Set Identification Selects Small Subset of Significant Features Preserves Hidden Knowledge Feature Ranking Criteria: Statistical Correlation with Outcome Combinatorial Entropy Distribution Monotonicity Class Separation Envelope Eccentricity E.g., 10 proteins selected out of 15,144
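The slide lists correlation with the outcome as one feature ranking criterion. A minimal Python sketch of that single criterion follows; the function names `pearson` and `rank_features` and the toy data are hypothetical, and this is not Datascope's actual procedure, which the slides do not describe.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(X, y, top):
    """Return indices of the `top` features with the largest |correlation| to y."""
    n_features = len(X[0])
    scores = [(abs(pearson([row[j] for row in X], y)), j) for j in range(n_features)]
    scores.sort(reverse=True)
    return [j for _, j in scores[:top]]

# Toy data (assumed): feature 0 tracks the outcome, feature 1 is constant.
X = [[1.0, 5.0, 0.3], [2.0, 5.0, 0.9], [3.0, 5.0, 0.1], [4.0, 5.0, 0.8]]
y = [0, 0, 1, 1]
selected = rank_features(X, y, top=2)
```

Ranking by absolute correlation treats strongly negative and strongly positive associations alike; the other criteria on the slide would need their own scoring functions.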

9 III 8 Data. Spreadsheet Oriented: OLE (via Clipboard) / Excel Spreadsheet / dBase tables. Training / Test Generation: Bootstrap; k-Folding; Jackknife. New Features. Correlation.

10 III 9 Data: Training/Test

11 III 10 2. Space Discretization. Criteria: Entropy; Correlation with Output; Bins (equipartitioning); Intervals; Clustered; Class Separation. Parameter Choice: User Defined; Minimizing Support Set. Quality Measures: Entropy; Separability.
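As an illustration of the entropy criterion named on this slide, here is a minimal Python sketch that picks one cutpoint for one numeric feature by minimizing the size-weighted class entropy of the two sides. The helper names `entropy` and `best_cutpoint` and the data are assumptions for illustration; the slides do not specify how Datascope computes or combines cutpoints.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def best_cutpoint(values, labels):
    """Return (cutpoint, weighted_entropy) for the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cutpoint fits between equal values
        cut = (lo + hi) / 2  # midpoint between consecutive distinct values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Toy feature (assumed): low values are negative, high values positive.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
cut, score = best_cutpoint(values, labels)
```

On this toy feature the midpoint between 3.0 and 7.0 separates the classes perfectly, so the weighted entropy is zero.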

12 III 11 Entropy; Correlation with Output; Bins; Intervals; Clustered; Class Separation

13 III 12 3. Generation of Maximal Patterns. Pattern Type Selection: Prime; Cones; Intervals; Spanned. Parameter Bound Settings: Prevalence (% of positive observations, % of negative observations); Homogeneity (on positive patterns, on negative patterns); Degree. Post-Generation Filters: By Characteristics; Maximality; Strongness.
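The parameter bounds above can be made concrete with a small Python sketch that scores one candidate positive pattern. It assumes a pattern is a conjunction of interval conditions on features, that prevalence is the share of positive observations the pattern covers, and that homogeneity is the share of covered observations that are positive; these are common LAD usages rather than definitions given on the slide. The names `covers` and `pattern_stats` are hypothetical.

```python
def covers(pattern, obs):
    """True if obs satisfies every interval condition {feature_index: (low, high)}."""
    return all(lo <= obs[j] <= hi for j, (lo, hi) in pattern.items())

def pattern_stats(pattern, X, y):
    """Degree, positive prevalence and homogeneity of a candidate positive pattern."""
    covered = [label for row, label in zip(X, y) if covers(pattern, row)]
    n_pos = sum(y)
    pos_covered = sum(covered)
    return {
        "degree": len(pattern),  # number of conditions in the conjunction
        "positive_prevalence": pos_covered / n_pos if n_pos else 0.0,
        "homogeneity": pos_covered / len(covered) if covered else 0.0,
    }

# Toy data (assumed): three positive and two negative observations.
X = [[0.2, 5.0], [0.4, 6.0], [0.9, 5.5], [0.3, 1.0], [0.8, 2.0]]
y = [1, 1, 1, 0, 0]
pattern = {0: (0.0, 0.5)}  # single condition: feature 0 in [0.0, 0.5]
stats = pattern_stats(pattern, X, y)
```

A generator would keep only candidates whose statistics satisfy the user's bounds, for example a minimum prevalence and homogeneity and a maximum degree.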

14 III 13 Positive Patterns: Pattern Definition; Training Set; Test Set

15 III 14 Negative Patterns: Pattern Definition; Training Set; Test Set

16 III 15 4. Theories and Models. Pandect. Theory Selection via: Greedy Bottleneck Greedy Lexicographic Greedy Set Covering Heuristics. Model Selection: 2 Set-Covering Problems; Quadratic Set-Covering Problem.
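To illustrate the set-covering view of theory selection, the Python sketch below applies the textbook greedy heuristic: repeatedly pick the pattern that covers the most still-uncovered observations. The function `greedy_cover`, the pattern names and the coverage sets are hypothetical; the bottleneck and lexicographic variants named on the slide are not reproduced here.

```python
def greedy_cover(coverage, universe):
    """Greedy set cover.

    coverage: dict mapping pattern name -> set of observation indices it covers.
    universe: iterable of observation indices that should be covered.
    Returns (chosen pattern names, observations left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pattern covering the largest number of still-uncovered observations.
        best = max(coverage, key=lambda p: len(coverage[p] & uncovered), default=None)
        if best is None or not (coverage[best] & uncovered):
            break  # no pattern covers anything that is still uncovered
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen, uncovered

# Toy pandect (assumed): four positive patterns over five positive observations.
coverage = {"P1": {0, 1, 2}, "P2": {2, 3}, "P3": {3, 4}, "P4": {4}}
chosen, left = greedy_cover(coverage, range(5))
```

Here two patterns suffice to cover all five observations. The greedy heuristic gives no guarantee of a minimum-size cover in general.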

17 III 16 4. Example (Model)

18 III 17 5. Example (Classification)

19 III 18

20 III 19

21 III 20

22 III 21 5. Discriminants. Weight Selection Methods. Direct: 1. Prognostic Index; 2. Weighted Prognostic Index. LP-Based: 3. Distance Maximizing Separator (SVM); 4. Cost Minimizing Separator; 5. Expected Value Separator. NLP-Based: 6. Regression in Pattern Space (ANN); 7. Best Correlation with Output (weighted sums of patterns).
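The slide does not define the prognostic index. The Python sketch below uses one formulation found in LAD work, assumed here for illustration: the fraction of positive patterns covering an observation minus the fraction of negative patterns covering it, with equal weights. The patterns are hypothetical callables, and `prognostic_index` and `classify` are illustrative names.

```python
def prognostic_index(obs, pos_patterns, neg_patterns):
    """Equal-weight index in [-1, 1]: share of positive hits minus share of negative hits."""
    pos_hits = sum(1 for p in pos_patterns if p(obs))
    neg_hits = sum(1 for p in neg_patterns if p(obs))
    pos_part = pos_hits / len(pos_patterns) if pos_patterns else 0.0
    neg_part = neg_hits / len(neg_patterns) if neg_patterns else 0.0
    return pos_part - neg_part

def classify(index, threshold=0.0):
    """Map an index to a class; values within the threshold band stay unclassified."""
    if index > threshold:
        return "positive"
    if index < -threshold:
        return "negative"
    return "unclassified"

# Hypothetical patterns, each a predicate on an observation tuple.
pos_patterns = [lambda x: x[0] > 0.5, lambda x: x[1] > 2.0]
neg_patterns = [lambda x: x[0] <= 0.5, lambda x: x[1] <= 2.0]

results = [classify(prognostic_index(obs, pos_patterns, neg_patterns))
           for obs in [(0.9, 3.0), (0.1, 1.0), (0.1, 3.0)]]
```

The third observation is covered by one positive and one negative pattern, so its index is zero and it stays unclassified. A weighted variant would replace the equal shares with per-pattern weights.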

23 III 22 Prognostic Index; Weighted Prognostic Index; Expected Value Separator; Distance Maximizing Separator; Cost Minimizing Separator; Best Correlation with Output

24 III 23 Accuracy Sensitivity Specificity
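For reference, the three measures can be computed from true and predicted binary labels as in the Python sketch below, using the standard definitions: accuracy is the share of correct predictions, sensitivity the share of positives detected, specificity the share of negatives detected. The function name `evaluate` and the labels are made up for illustration.

```python
def evaluate(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n if n else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }

# Made-up labels: 4 positives (3 found), 6 negatives (4 found).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = evaluate(y_true, y_pred)
```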

25 III 24

26 III 25 Reporting: Cutpoints; Discretized Space; Pandect; Coverage of Observations by Patterns; Pattern Report (Compact/Full Versions); Theories/Models; Attribute Analysis; Log File

27 III 26 Pattern Space (figure): Training and Test panels plotting observations against Patterns. Legend: Positive Observations; Unclassified Observations; Negative Observations

28 III 27 Clustered Pattern Space

29 III 28 Validation Procedures (diagram labels): Raw Data; Stratified Random Partition; LAD Model on Training Set; Performance Evaluation (Accuracy, Sensitivity, Specificity); Bootstrap; K-Folding; Jackknife
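A stratified random partition for k-folding can be sketched in Python as follows: indices are shuffled within each class and dealt round-robin into k folds, so every fold keeps roughly the overall class proportions. The function `stratified_folds`, its `seed` parameter and the labels are assumptions for illustration, not Datascope's interface.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split observation indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            # Continue dealing where the previous class stopped to balance fold sizes.
            folds[(offset + j) % k].append(i)
        offset += len(indices)
    return folds

# Assumed labels: 6 positive and 9 negative observations, 3 folds.
labels = [1] * 6 + [0] * 9
folds = stratified_folds(labels, k=3)
```

Each fold in turn serves as the test set while the remaining folds form the training set, and the performance measures are averaged over the k runs.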

30 III 29 Special Features: Generating User Model Generation (Excel Files); Datascope Macro Language; Multiple and Complex Experiments; Interface with Other Applications (Datascope Server)

31 III 30 Performance. Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan Shih, A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms, Machine Learning, 40, 203-229 (2000). http://www.ics.uci.edu/~mlearn/MLRepository.html

32 III 31 LAD Case Studies: Assessing Long-Term Mortality Risk After Exercise Electrocardiography; Ovarian Cancer Detection Using Proteomic Data; Combinatorial Analysis of Breast Cancer Data from Image Cytometry and Gene Expression Microarrays; Cell Proliferation on Medical Implants; Country Risk Rating

