Slide 1: Datascope - a new tool for Logical Analysis of Data (LAD)
Sorin Alexe, RUTCOR, Rutgers University, Piscataway, NJ
e-mail: salexe@rutcor.rutgers.edu, URL: rutcor.rutgers.edu/~salexe
DIMACS Mixer Series, September 19, 2002
Slide 2: LAD - Problem
Given a dataset generated by a hidden function, LAD builds an approximation of that function.
Slide 3: LAD - Patterns
- Positive Pattern
- Negative Pattern
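In LAD, a positive (negative) pattern is a conjunction of conditions on attribute values that covers some positive (negative) observations and none of the opposite class. A minimal Python sketch of this notion (illustrative names, not Datascope code):

```python
# Hypothetical sketch of a LAD pattern as a conjunction of interval
# conditions on attributes; the representation is illustrative.

def covers(pattern, observation):
    """True if the observation satisfies every condition of the pattern.

    A pattern is a dict: attribute index -> (low, high) interval
    (use -inf / +inf for one-sided conditions).
    """
    return all(low <= observation[i] <= high for i, (low, high) in pattern.items())

# Toy data: two positive and two negative observations.
positives = [(5.1, 0.9), (4.8, 1.2)]
negatives = [(2.0, 0.3), (1.5, 1.1)]

# A candidate positive pattern: attribute 0 >= 4.0.
pattern = {0: (4.0, float("inf"))}

# A positive pattern must cover some positives and no negatives.
pos_cov = sum(covers(pattern, o) for o in positives)
neg_cov = sum(covers(pattern, o) for o in negatives)
assert pos_cov == 2 and neg_cov == 0
```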
Slide 4: LAD - Theories, Models, Classifications
- Positive Theory
- Negative Theory
- Model
Slide 5: Datascope Functions
- Support Set Identification
- Space Discretization
- Pattern Detection
- Model Construction
- Discriminant / Prognostic Index
- Classification
- Feature Analysis
Slide 6: Datascope Dataflow
[Flow diagram; components: User, Raw Data, Pre-Processing (Excel model), Discretization (cutpoints, support set, significant features), Feature Analysis, Pattern Space, Pandect Generation, Discriminant Construction (Matlab solver / internal solver), Theories/Models, Pattern Report, Diagnosis / Prognosis / Risk Stratification]
Slide 7: 1. Support Set Identification
Selects a small subset of significant features while preserving the hidden knowledge in the data.
Feature ranking criteria:
- Statistical: correlation with outcome
- Combinatorial: entropy, distribution monotonicity, class separation, envelope eccentricity
E.g., 10 proteins selected out of 15,144.
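The statistical-correlation criterion above can be sketched as follows. This is an illustrative implementation of one ranking criterion only, with hypothetical function names; it is not Datascope's actual selection procedure, and the other criteria (entropy, monotonicity, separation) are not shown:

```python
# Illustrative support-set selection: rank features by absolute Pearson
# correlation with the binary outcome and keep the top k.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def top_k_features(rows, outcome, k):
    """Return indices of the k features most correlated with the outcome."""
    n_feat = len(rows[0])
    scores = [(abs(pearson([r[j] for r in rows], outcome)), j)
              for j in range(n_feat)]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Toy data: features 0 and 1 track the outcome; feature 2 is noise.
rows = [(0.1, 5.0, 3.3), (0.2, 4.9, 7.1), (0.9, 1.2, 3.0), (0.8, 1.0, 6.8)]
outcome = [1, 1, 0, 0]
assert set(top_k_features(rows, outcome, 2)) == {0, 1}
```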
Slide 8: Data
- Spreadsheet oriented: OLE (via Clipboard) / Excel spreadsheets / dBase tables
- Training / test generation: bootstrap, k-folding, jackknife
- New Features
- Correlation
Slide 9: Data: Training/Test
Slide 10: 2. Space Discretization
Criteria: entropy, correlation with output, bins (equipartitioning), intervals, clustered class separation
Parameter choice: user defined, or minimizing the support set
Quality measures: entropy, separability
Slide 11: [Illustrations of the discretization criteria: entropy, correlation with output, bins, intervals, clustered class separation]
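One way the entropy criterion can drive discretization is sketched below, assuming (as an illustration) that the cutpoint minimizing the class-label entropy of the resulting bins is chosen. Datascope's exact procedure is not public here; the code shows only the standard entropy-based cutpoint idea:

```python
# Sketch: pick the single cutpoint on one attribute that minimizes the
# weighted class-label entropy of the two resulting bins.
from math import log2

def entropy(labels):
    """Binary entropy of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def best_cutpoint(values, labels):
    """Midpoint between consecutive values giving the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best, best_h = None, float("inf")
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if h < best_h:
            best, best_h = cut, h
    return best

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
# 6.5 (midpoint of 3.0 and 10.0) separates the two classes perfectly.
assert best_cutpoint(values, labels) == 6.5
```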
Slide 12: 3. Generation of Maximal Patterns
Pattern type selection: prime, cones, intervals, spanned
Parameter bound settings:
- Prevalence: % of positive observations, % of negative observations
- Homogeneity: on positive patterns, on negative patterns
- Degree
Post-generation filters: by characteristics, maximality, strongness
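The prevalence and homogeneity bounds can be illustrated as follows. The definitions used here are assumptions consistent with the standard LAD literature (prevalence: the fraction of same-class observations a pattern covers; homogeneity: the fraction of covered observations belonging to the pattern's class), not Datascope's documented formulas:

```python
# Hedged illustration of the prevalence / homogeneity bounds used to
# filter candidate positive patterns.

def covers(pattern, obs):
    return all(low <= obs[i] <= high for i, (low, high) in pattern.items())

def prevalence(pattern, positives):
    """Fraction of positive observations the pattern covers."""
    return sum(covers(pattern, o) for o in positives) / len(positives)

def homogeneity(pattern, positives, negatives):
    """Fraction of covered observations that are positive."""
    pos = sum(covers(pattern, o) for o in positives)
    neg = sum(covers(pattern, o) for o in negatives)
    return pos / (pos + neg) if pos + neg else 0.0

positives = [(5.0,), (6.0,), (7.0,)]
negatives = [(1.0,), (2.0,), (6.5,)]
pattern = {0: (4.5, float("inf"))}   # covers all 3 positives and 1 negative

assert prevalence(pattern, positives) == 1.0
assert homogeneity(pattern, positives, negatives) == 0.75
```

A generation bound such as "prevalence >= 50%, homogeneity >= 90%" would then simply discard candidates failing these two tests.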
Slide 13: Positive Patterns
[Table: pattern definitions with coverage on the training set and the test set]
Slide 14: Negative Patterns
[Table: pattern definitions with coverage on the training set and the test set]
Slide 15: 4. Theories and Models (Pandect)
Theory selection via: greedy bottleneck, greedy lexicographic, greedy set covering heuristics
Model selection: 2 set-covering problems, or a quadratic set-covering problem
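A minimal sketch of the greedy set-covering heuristic for theory selection: repeatedly pick the pattern covering the most still-uncovered observations. The bottleneck and lexicographic variants named above are not reproduced, and the data structures are illustrative:

```python
# Greedy set cover: a theory is a set of patterns that together cover
# every observation of one class.

def greedy_theory(coverage, universe):
    """coverage: pattern name -> set of observation ids it covers."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = max(coverage, key=lambda p: len(coverage[p] & uncovered))
        if not coverage[best] & uncovered:
            break  # remaining observations cannot be covered
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

coverage = {
    "P1": {1, 2, 3},
    "P2": {3, 4},
    "P3": {4, 5},
    "P4": {1, 5},
}
# P1 covers the most (3 observations), then P3 finishes {4, 5}.
assert greedy_theory(coverage, {1, 2, 3, 4, 5}) == ["P1", "P3"]
```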
Slide 16: 4. Example (Model)
Slide 17: 5. Example (Classification)
Slide 21: 5. Discriminants
Weight selection methods:
- Direct: 1. Prognostic Index; 2. Weighted Prognostic Index
- LP-based: 3. Distance Maximizing Separator (SVM); 4. Cost Minimizing Separator; 5. Expected Value Separator
- NLP-based: 6. Regression in Pattern Space (ANN); 7. Best Correlation with Output (weighted sums of patterns)
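The prognostic index (method 1 above) can be sketched as the difference between the fractions of positive and negative patterns an observation triggers. This definition follows the standard LAD literature and is an assumption here, not Datascope's documented formula:

```python
# Prognostic index: +1 region of pattern space classifies positive,
# -1 region negative, 0 is left unclassified.

def covers(pattern, obs):
    return all(low <= obs[i] <= high for i, (low, high) in pattern.items())

def prognostic_index(obs, pos_patterns, neg_patterns):
    p = sum(covers(q, obs) for q in pos_patterns) / len(pos_patterns)
    n = sum(covers(q, obs) for q in neg_patterns) / len(neg_patterns)
    return p - n   # > 0: classify positive, < 0: negative, 0: unclassified

pos_patterns = [{0: (4.0, float("inf"))}, {1: (2.0, float("inf"))}]
neg_patterns = [{0: (float("-inf"), 2.0)}]

assert prognostic_index((5.0, 3.0), pos_patterns, neg_patterns) == 1.0
assert prognostic_index((1.0, 0.0), pos_patterns, neg_patterns) == -1.0
```

The weighted variant (method 2) replaces the uniform averages with per-pattern weights; the LP- and NLP-based methods instead optimize those weights as a separating hyperplane or regression in pattern space.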
Slide 22: [Comparison plots of the discriminants: Prognostic Index, Weighted Prognostic Index, Expected Value Separator, Distance Maximizing Separator, Cost Minimizing Separator, Best Correlation with Output]
Slide 23: Accuracy, Sensitivity, Specificity
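The three quality measures are computed from a confusion table of true labels versus predictions, e.g.:

```python
# Accuracy, sensitivity (true-positive rate), and specificity
# (true-negative rate) from 0/1 labels and predictions.

def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, sens, spec = metrics(y_true, y_pred)
assert (acc, sens, spec) == (0.75, 2 / 3, 0.8)
```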
Slide 25: Reporting
- Cutpoints
- Discretized space
- Pandect
- Coverage of observations by patterns
- Pattern report (compact/full versions)
- Theories/models
- Attribute analysis
- Log file
Slide 26: Pattern Space
[Scatter plots of observations vs. patterns in pattern space, for the training and test sets; legend: positive, negative, and unclassified observations]
Slide 27: Clustered Pattern Space
Slide 28: Validation Procedures
Raw data -> stratified random partition (bootstrap, k-folding, jackknife) -> LAD model on training set -> performance evaluation (accuracy, sensitivity, specificity).
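The k-folding step with stratified partitioning can be sketched as follows. The split logic is illustrative (positives and negatives are shuffled and sliced separately so each fold preserves the class ratio); any of the LAD models above would plug into the train/evaluate loop:

```python
# Stratified k-fold splitter: yields (train, test) pairs of observation ids.
import random

def stratified_kfold(positives, negatives, k, seed=0):
    rng = random.Random(seed)
    pos, neg = positives[:], negatives[:]
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Slice each class into k interleaved folds, then merge per fold.
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [o for j in range(k) if j != i for o in folds[j]]
        yield train, test

positives = list(range(0, 6))        # ids of positive observations
negatives = list(range(100, 109))    # ids of negative observations
for train, test in stratified_kfold(positives, negatives, 3):
    assert len(train) + len(test) == 15
    assert not set(train) & set(test)   # folds are disjoint
```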
Slide 29: Special Features
- User model generation (Excel files)
- Datascope macro language
- Multiple and complex experiments
- Interface with other applications (Datascope Server)
Slide 30: Performance
Reference: Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms," Machine Learning 40, 203-229 (2000).
Datasets: http://www.ics.uci.edu/~mlearn/MLRepository.html
Slide 31: LAD Case Studies
- Assessing Long-Term Mortality Risk After Exercise Electrocardiography
- Ovarian Cancer Detection Using Proteomic Data
- Combinatorial Analysis of Breast Cancer Data from Image Cytometry and Gene Expression Microarrays
- Cell Proliferation on Medical Implants
- Country Risk Rating