Dept of Biomedical Engineering, Medical Informatics, Linköpings universitet, Linköping, Sweden. A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining

Presentation transcript:

A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining
Amir R Razavi, Hans Gill, Hans Åhlfeldt, Nosrat Shahsavar
Department of Biomedical Engineering, Division of Medical Informatics, Linköpings universitet, Linköping, Sweden

A Data Pre-processing Method in Data Mining
Outline
– Introduction
– Dataset and variables
– Data pre-processing
– Data mining algorithm (DTI)
– Result
– Discussion

Introduction
Abundance of data in medicine and availability of comprehensive registers
Difficulty in analysing huge amounts of data with traditional methods
Need for efficient data mining methods

Introduction
Applying data mining methods to a breast cancer register
Pre-processing is an essential part of knowledge discovery in databases
Finding an efficient pre-processing approach is essential for successful data mining

Methods
Dataset
Data pre-processing
– Data combination and selection
– Cleaning data
– Replacing missing values
– Dimension reduction
Decision Tree Induction (DTI)
Performance comparison

Dataset
3949 female patients, 1986 to 1995, with follow-up to 2003
Data from three registers (regional, tumour marker and death registers), more than 150 variables overall

Variables

Data Pre-processing - Data Selection
After combining data from the different registers, important variables (predictors and outcomes) were selected in consultation with domain experts:
– The number of predictors was reduced from more than 150
– Four important outcomes for breast cancer were chosen

Data Pre-processing - Cleaning Data
Cleaning the data of outliers and errors, for example in:
– Duration between diagnosis of the disease and the recurrence
– Age

Data Pre-processing - Replacing Missing Values
EM (expectation maximization) algorithm
– Dempster et al., 1977
– An iterative approach that estimates the parameters of a model, starting from an initial guess. Each iteration consists of two steps:
    An expectation step that finds the distribution of the missing data, given the known values of the observed variables and the current estimate of the parameters.
    A maximization step that re-estimates the parameters after the missing data are replaced with their expected values.
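A minimal sketch of the two steps, assuming a simplified model that the slides do not specify: two jointly normal variables, with `x` fully observed and `y` partly missing (marked `None`). The function name `em_impute` and the toy data are hypothetical, and this is not the authors' implementation.

```python
from statistics import fmean

def em_impute(x, y, n_iter=50):
    """Fill missing y values (None) by EM under a bivariate normal model.

    Assumes x is fully observed and at least two y values are observed.
    """
    n = len(x)
    observed = [i for i in range(n) if y[i] is not None]
    missing = [i for i in range(n) if y[i] is None]

    # Parameters of x are known exactly because x has no missing values.
    mu_x = fmean(x)
    var_x = fmean([(v - mu_x) ** 2 for v in x])

    # Initial guess for the remaining parameters, from the observed y only.
    mu_y = fmean([y[i] for i in observed])
    var_y = fmean([(y[i] - mu_y) ** 2 for i in observed])
    cov_xy = 0.0

    filled = list(y)
    for _ in range(n_iter):
        # E-step: conditional mean and variance of each missing y given x.
        slope = cov_xy / var_x
        cond_var = var_y - cov_xy * slope
        for i in missing:
            filled[i] = mu_y + slope * (x[i] - mu_x)

        # Expected sufficient statistics of the completed data.
        sum_y = sum(filled)
        sum_yy = sum(v * v for v in filled) + len(missing) * cond_var
        sum_xy = sum(a * b for a, b in zip(x, filled))

        # M-step: re-estimate the parameters from those statistics.
        mu_y = sum_y / n
        var_y = sum_yy / n - mu_y ** 2
        cov_xy = sum_xy / n - mu_x * mu_y

    return filled, {"mu_y": mu_y, "var_y": var_y, "cov_xy": cov_xy}

# Toy data in which every observed y equals 2 * x.
filled, params = em_impute([1, 2, 3, 4, 5, 6], [2, 4, 6, None, 10, None])
print(filled)
```

On this toy data the two imputed values settle close to 8 and 12, the values implied by the linear relation among the complete cases.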

Data Pre-processing - Dimension Reduction
Canonical Correlation Analysis (CCA)
– It investigates the relationship between two sets of variables.
– A canonical correlation is the correlation of two canonical variates, one representing a set of independent variables, the other a set of dependent variables.
– A canonical variate is a linear combination of a set of original variables.

Data Pre-processing - Dimension Reduction
– The aim is to create a number of canonical solutions, each consisting of a linear combination of one set of variables,
    $U_i = a_1 X_1 + a_2 X_2 + \dots + a_m X_m$,
  and a linear combination of the other set of variables,
    $V_i = b_1 Y_1 + b_2 Y_2 + \dots + b_n Y_n$.
– The goal is to determine the coefficients (the $a$'s and $b$'s) that maximize the correlation between the canonical variates $U_i$ and $V_i$.
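The goal stated above can be written as a single optimization problem. The covariance symbols $\Sigma_{XX}$, $\Sigma_{YY}$ and $\Sigma_{XY}$ are notation assumed here and do not appear in the slides; $a = (a_1, \dots, a_m)^\top$ and $b = (b_1, \dots, b_n)^\top$ collect the coefficients of one canonical solution.

```latex
\[
  \rho_i
  \;=\; \max_{a,\,b}\ \operatorname{corr}(U_i, V_i)
  \;=\; \max_{a,\,b}\
  \frac{a^{\top} \Sigma_{XY}\, b}
       {\sqrt{a^{\top} \Sigma_{XX}\, a}\;\sqrt{b^{\top} \Sigma_{YY}\, b}},
  \qquad
  U_i = a^{\top} X, \quad V_i = b^{\top} Y .
\]
```

Because the ratio does not change when $a$ or $b$ is rescaled, the variates are usually normalized to unit variance, $a^{\top} \Sigma_{XX}\, a = b^{\top} \Sigma_{YY}\, b = 1$. Each later pair is found in the same way under the extra condition that it is uncorrelated with the earlier pairs, which gives at most $\min(m, n)$ canonical solutions.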

Data Pre-processing - Dimension Reduction
– To find the important variables in each set (predictors and outcomes), the magnitudes of the loadings were used.
– Variables whose loadings had an absolute value of 0.3 or more were considered important and were passed on to the data mining step.
– A loading shows how much each original variable contributes to each canonical variate.
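The selection rule can be sketched as follows, taking a loading to be the Pearson correlation between an original variable and a canonical variate, which is the usual definition. The canonical coefficients are assumed to have been estimated already; the function names, variable names and toy numbers are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)

def select_by_loading(data, coefficients, threshold=0.3):
    """Return the loadings and the variables with |loading| >= threshold.

    data: dict mapping variable name -> list of values (one per patient)
    coefficients: dict mapping variable name -> canonical coefficient
    """
    names = list(data)
    n = len(data[names[0]])
    # Canonical variate: linear combination of the original variables.
    variate = [sum(coefficients[k] * data[k][i] for k in names) for i in range(n)]
    loadings = {k: pearson(data[k], variate) for k in names}
    selected = [k for k in names if abs(loadings[k]) >= threshold]
    return loadings, selected

# Toy example with made-up values and coefficients.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [1, -1, 0, -1, 1],
    "x3": [5, 4, 3, 2, 1],
}
coefficients = {"x1": 1.0, "x2": 0.0, "x3": 0.0}
loadings, selected = select_by_loading(data, coefficients)
print(selected)
```

Here `x3` is kept even though its coefficient is zero, because its loading is strongly negative. The rule uses the absolute value of the loading, not the coefficient.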

Data Pre-processing - Dimension Reduction
Variables with their loadings

Data Mining Algorithm
Decision Tree Induction (DTI)
– A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.
– Each internal node denotes a test on a variable, each branch stands for an outcome of the test, each leaf node represents an outcome, and the uppermost node in the tree is the root node.
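The structure described above can be sketched as a small data type plus a traversal from the root to a leaf. The class names, the variables `marker_a` and `marker_b`, the thresholds and the labels are all hypothetical; they illustrate the shape of a decision tree and are not the tree induced in the study.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str  # classification stored at a leaf node

@dataclass
class Node:
    variable: str                # variable tested at this internal node
    threshold: float             # test: value <= threshold
    low: Union["Node", Leaf]     # branch taken when the test is true
    high: Union["Node", Leaf]    # branch taken when the test is false

def classify(tree, record):
    """Walk from the root to a leaf and return the leaf's label."""
    node = tree
    while isinstance(node, Node):
        if record[node.variable] <= node.threshold:
            node = node.low
        else:
            node = node.high
    return node.label

# Hypothetical two-level tree; the uppermost Node is the root.
tree = Node(
    variable="marker_a",
    threshold=1.0,
    low=Leaf("no recurrence"),
    high=Node("marker_b", 2.0, Leaf("no recurrence"), Leaf("recurrence")),
)

print(classify(tree, {"marker_a": 2.5, "marker_b": 3.0}))
```

Decision tree induction is the step that learns such a tree from data; the sketch only shows how a finished tree is used to classify one record.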

Resulting Decision Tree

Performance comparison
$\text{Sensitivity} = \dfrac{TP}{TP + FN}$
$\text{Specificity} = \dfrac{TN}{TN + FP}$
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
Number of leaves and tree size
TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively
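The three measures follow their standard definitions in terms of the confusion-matrix counts, and a short sketch makes the arithmetic concrete. The function name `performance` and the counts in the example are made up for illustration and are not results from the study.

```python
def performance(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                 # share of true cases found
        "specificity": tn / (tn + fp),                 # share of non-cases found
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # share of all correct
    }

# Made-up counts: 40 TP, 45 TN, 5 FP, 10 FN.
print(performance(tp=40, tn=45, fp=5, fn=10))
```

With these counts the sensitivity is 40/50, the specificity 45/50 and the accuracy 85/100.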

Performance Comparison
Comparing different approaches

Discussion
Effective data pre-processing is a very important step in knowledge discovery
– Real-world data are usually:
    Incomplete
    Noisy
    Inconsistent
    Not collected for data mining

Discussion
Replacing missing values before dimension reduction
– Providing more information to CCA for dimension reduction
Running CCA prior to DTI
– Reducing the number of variables while increasing accuracy of classification
– Considerable increase in the interpretability of DTI

Discussion
In medical studies, often no pre-processing is done before DTI
Proper pre-processing, including consulting with domain experts, replacing missing values and reducing dimensions, prepares the data for better data mining by DTI
Our approach increases the accuracy and interpretability of DTI

Future Work
Increase the efficiency of knowledge discovery in medical registers.
Validate the result of our methodology (pre-processing prior to data mining) with domain experts for the prediction of recurrence of cancer.
Determine how to use the discovered knowledge and integrate it with clinical workflow.
Improve the quality of registers by adding and completing important predictors.

Thanks for your attention