DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN HEALTH SCIENCES
B. K. Sinha, Ex-Faculty, ISI, Kolkata
April 17, 2012
What is Data Integration?
Integration of Multiple Indicators: several different indicators exist, and the aim is to provide an AGGREGATE or over-all measure through an objective and statistically sound approach.
Multiple Criteria Decision Making [MCDM], advocated by Hwang & Yoon (1981): Multiple Attribute Decision Making: Methods and Applications: A State-of-the-Art Survey. Springer-Verlag, Berlin.
HEALTH ISSUES
Air / Surface / Water Pollution: Different Sources & Their Effects
US EPA: TRI Data Base
Toxics Release Inventory [TRI] Data
EPA’s 33/50 Program
TRI Data for 17 chemicals over many years, for 50 States & DC
TOXICS RELEASE INVENTORY [TRI]: US EPA
Toxic chemicals: BENZENE, CADMIUM, CARBON TETRACHLORIDE, CHLOROFORM, CYANIDE, LEAD, MERCURY, NICKEL, TOLUENE, M-XYLENE, ...
TRI Data are expressed as %: the less the better, the more the worse.
NATURE OF DATA & PROBLEM
States vs Chemicals: TRI Data [Coded]

State   Benzene
I         7%
II       12%
III      17%
IV        9%
V        14%
VI        6%
VII      15%
VIII     16%

Q. Which State is the least hit by Benzene? Ans. VI
Q. And the worst hit? Ans. III
With a single chemical there is NO PROBLEM AT ALL in ranking the States from best to worst.
ADD ONE MORE CHEMICAL...
States vs Chemicals: TRI Data

State   Benzene   Cadmium
I         7%       13%
II       12%        9%
III      17%        4%
IV        9%       11%
V        14%       10%
VI        6%       11%
VII      15%        9%
VIII     16%       11%

Q. Combine the two chemicals: which State is the worst? How should they be combined?
AND ADD MORE...
States \ Chemicals [TRI Data]

State   Be   Cd   Ca   Tr   Ch   Cy   ...
I        7%  13%  21%   2%  34%  21%  ...

CONCEPT OF DATA MATRIX
$X = ((X_{ij})),\ 1 \le i \le K;\ 1 \le j \le N$, with K Locations & N Data Sources
DATA INTEGRATION FOR OVER-ALL TRI INDEX FOR GLOBAL COMPARISON
Data Matrix
States vs Chemicals: TRI Data

State   Be   Cd   Ca   Tr   Ch   Cy   L
I        7%  13%  21%   2%  34%  21%  17%
II      12%   9%  18%   3%  42%  28%  11%
III     17%   4%  23%   7%  22%  19%  23%
IV       9%  11%  17%   5%  25%  23%  19%
V       14%  10%  13%   8%  21%  19%  25%
VI       6%  11%  19%   5%  33%  21%  22%
VII     15%   9%  13%   4%  38%  19%  28%
VIII    16%  11%  10%   5%  33%  20%  25%
Application Areas
Disease Prevalence Statistics
Disease Symptom Statistics
Health Statistics
Demographic Statistics
Human Development Index Statistics
Data integration is a common problem across these areas, and the techniques are quite general.
Nature of Data
Locations versus Features: quantitative data giving the impact of each feature on each location, based on the similarity principle.
Purpose: over-all ranking of the locations based on combined evidence from a pool of features.
Features may or may not have equal importance in the process of ‘combining evidence’.
Aggregate Methods
Some kind of “aggregate”: pool the TRI Data into a single value for each State for over-all comparison.
TRI Data: Total TRI for State I = 115 [over 7 features]; Average TRI for State I = 16.43%.
Compute the average for each State and compare the averages across all States.
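The averaging step can be sketched in a few lines of Python. This is only an illustration using the 8 x 7 data matrix shown on the earlier slide; the names `TRI`, `totals`, `averages` and `ranking` are made up for this sketch and do not come from the slides.

```python
# TRI data matrix from the slides: rows are States I-VIII,
# columns are the chemicals Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# Row total and unweighted arithmetic mean for each State.
totals = {state: sum(row) for state, row in TRI.items()}
averages = {state: sum(row) / len(row) for state, row in TRI.items()}

# Less is better, so sorting in ascending order ranks best to worst.
ranking = sorted(averages, key=averages.get)

for state in ranking:
    print(f"{state:>4}  total={totals[state]:3d}  average={averages[state]:.2f}%")
```

The unweighted average treats all seven chemicals as equally harmful, which is exactly the assumption the following slides question.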
Aggregate Methods
AM, GM, HM: arithmetic, geometric and harmonic means.
Use of the Median as representative of TRI: State I, Median = 17%; State II, Median = 12%; State III, Median = 19%; etc.
Q. ARE ALL CHEMICALS EQUALLY HARMFUL? Ans. Possibly NOT!
Q. Are all features equally important?
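As a rough illustration, the four summaries named above can be computed with Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). The dictionary names `TRI` and `summaries` are assumptions of this sketch.

```python
from statistics import mean, geometric_mean, harmonic_mean, median

# TRI data matrix from the slides (States I-VIII by Be, Cd, Ca, Tr, Ch, Cy, L).
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# One summary record per State; each entry is a candidate over-all index.
summaries = {
    state: {
        "AM": mean(row),
        "GM": geometric_mean(row),
        "HM": harmonic_mean(row),
        "median": median(row),
    }
    for state, row in TRI.items()
}

for state, s in summaries.items():
    print(f"{state:>4}  AM={s['AM']:.2f}  GM={s['GM']:.2f}  "
          f"HM={s['HM']:.2f}  median={s['median']}")
```

For any State with unequal scores, HM < GM < AM, so the choice of mean changes the index values and can change the ranking.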
Concept of Weight
Subject Specialist’s Knowledge: Choice of Weights by Relative Importance
Wts. [TRI]:
Interpretation of weights: Total of Weights = 25.0
Rel. Wts.: 2.0/25 = 8%, 3.5/25 = 14%, etc., for all chemicals
Total of Rel. Wts. = 1 or 100%
Use the Rel. Wts. to compute weighted AM, GM, HM, etc.
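A minimal sketch of the weighting step follows. The slide's full list of weights is not available here, so the list `raw_weights` is hypothetical: only the values 2.0 and 3.5 and the total 25.0 are taken from the slide, and the remaining five values are invented purely so that the total matches.

```python
from math import exp, log

CHEMICALS = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]

# HYPOTHETICAL raw importance weights: only 2.0, 3.5 and the total of 25.0
# appear in the slides; the other values are illustrative placeholders.
raw_weights = [2.0, 3.5, 4.0, 3.0, 4.5, 5.0, 3.0]

total = sum(raw_weights)
rel_weights = [w / total for w in raw_weights]  # relative weights, sum to 1

# TRI scores of State I from the data matrix.
state_I = [7, 13, 21, 2, 34, 21, 17]

# Weighted arithmetic, geometric and harmonic means with relative weights.
weighted_am = sum(w * x for w, x in zip(rel_weights, state_I))
weighted_gm = exp(sum(w * log(x) for w, x in zip(rel_weights, state_I)))
weighted_hm = 1.0 / sum(w / x for w, x in zip(rel_weights, state_I))

for name, w in zip(CHEMICALS, rel_weights):
    print(f"{name}: {w:.0%}")
print(f"weighted AM={weighted_am:.2f}  GM={weighted_gm:.2f}  HM={weighted_hm:.2f}")
```

With equal weights of 1/7 each, these reduce to the ordinary AM, GM and HM.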
Use of Ranks
Convert scores into ranks for each item: the TRI Data Matrix is converted into a Rank Matrix (Benzene, Cadmium, etc.).

State   Benzene TRI score   Rank
I              7%             2
II            12%             4
III           17%             8
IV             9%             3
V             14%             5
VI             6%             1
VII           15%             6
VIII          16%             7
Rank Matrix
States vs Chemicals: TRI Data as ranks

State   Be   Cd   Ca   Tr   Ch   Cy   L
I        2   etc. for each chemical
II       4
III      8
IV       3
V        5
VI       1
VII      6
VIII     7

Then use “aggregate” methods based on ranks.
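The score-to-rank conversion can be sketched as below. The slides do not say how tied scores are handled (Cadmium has several ties), so this sketch assumes average ranks for ties; the function name `average_ranks` is illustrative only.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest score = best).

    ASSUMPTION: tied scores share the average of the ranks they occupy;
    the slides do not specify a tie rule.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


# Benzene and Cadmium columns of the TRI data matrix, States I-VIII.
benzene = [7, 12, 17, 9, 14, 6, 15, 16]
cadmium = [13, 9, 4, 11, 10, 11, 9, 11]

benzene_ranks = average_ranks(benzene)
cadmium_ranks = average_ranks(cadmium)
print(benzene_ranks)
print(cadmium_ranks)
```

Applying the function to every column gives the rank matrix, after which any of the aggregate methods can be applied row by row.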
Why Ranks?
Aggregate methods applied to raw scores are sensitive to outliers, that is, to extreme values that are too high or too low.
Use of a trimmed mean is one remedy.
Ranking is recommended for robust results.
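For illustration, a simple symmetric trimmed mean could look like this. The trimming depth `k` and the function name `trimmed_mean` are assumptions of this sketch; the slides do not state how much is trimmed.

```python
def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and the k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


# TRI scores of State I: the extremes 2 and 34 are dropped when k = 1.
state_I = [7, 13, 21, 2, 34, 21, 17]
print(trimmed_mean(state_I, k=1))
print(trimmed_mean(state_I, k=0))  # k = 0 gives the ordinary mean
```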
Less Known Methods
TOPSIS METHOD
ELECTRE METHOD [computation-intensive]
Concepts of the TOPSIS Method: Features and Locations; Ideal Location; Anti-Ideal Location; Distance from Ideal and from Anti-Ideal; Within-Feature Variation; Composite Index
TOPSIS METHOD
Technique for Order Preference by Similarity to Ideal Solution
Uses the concepts of Ideal & Anti-Ideal Locations
Distance from Ideal & Anti-Ideal Locations
Weight of Features
Sum of Squares for each feature
Philosophy for TOPSIS
TOPSIS (Technique for Order Preference by Similarity to Ideal Solution)
In the absence of a natural course of action for an over-all summary measure and ranking, the next best alternative is to assign the top rank to the location that has the shortest distance from the ideal and the farthest distance from the anti-ideal.
Concepts: Ideal & Anti-Ideal States
States vs Chemicals: TRI Data

State        Be   Cd   Ca   Tr   Ch   Cy   L
I             7%  13%  21%   2%  34%  21%  17%
II           12%   9%  18%   3%  42%  28%  11%
III          17%   4%  23%   7%  22%  19%  23%
IV            9%  11%  17%   5%  25%  23%  19%
V            14%  10%  13%   8%  21%  19%  25%
VI            6%  11%  19%   5%  33%  21%  22%
VII          15%   9%  13%   4%  38%  19%  28%
VIII         16%  11%  10%   5%  33%  20%  25%
---------------------------------------------
Ideal         6%   4%  10%   2%  21%  19%  11%
Anti-Ideal   17%  13%  23%   8%  42%  28%  28%
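Since less is better for TRI data, the Ideal row is the column-wise minimum and the Anti-Ideal row is the column-wise maximum. A short sketch, with illustrative variable names, that reproduces the two rows of the table:

```python
# TRI data matrix from the slides (States I-VIII by Be, Cd, Ca, Tr, Ch, Cy, L).
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# Transpose the rows so that each tuple holds one chemical across all States.
columns = list(zip(*TRI.values()))

ideal = [min(col) for col in columns]       # best (smallest) value per chemical
anti_ideal = [max(col) for col in columns]  # worst (largest) value per chemical

print("Ideal:     ", ideal)
print("Anti-Ideal:", anti_ideal)
```

If some feature were of the more-is-better kind, the roles of `min` and `max` would be swapped for that column.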
Ideal & Anti-Ideal: Hypothetical Locations!
The absolute best and worst States are hypothetical; they set the limits for the others.
Ranking of the others: which States are better placed?
Those closer to the Ideal (distance from Ideal small) AND ALSO far from the Anti-Ideal (distance from Anti-Ideal large).
Concepts of Distance: Euclidean Distance

Ideal         6%   4%  10%   2%  21%  19%  11%
Anti-Ideal   17%  13%  23%   8%  42%  28%  28%

Squared distance between the Ideal and the Anti-Ideal = $(6-17)^2 + (4-13)^2 + \cdots + (11-28)^2$
Computations
Distance between a Location and ID or NID (ID = Ideal, NID = Anti-Ideal)

I     7%  13%  21%   2%  34%  21%  17%
ID    6%   4%  10%   2%  21%  19%  11%
NID  17%  13%  23%   8%  42%  28%  28%

Sq. Dis. [I vs ID]: $(7-6)^2 = 1$, $(13-4)^2 = 81$, ...
Sq. Dis. [I vs NID]: $(7-17)^2 = 100$, $(13-13)^2 = 0$, ...
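The feature-wise squared differences for State I can be checked with a small sketch such as the following; the helper name `squared_diffs` is illustrative and not from the slides.

```python
def squared_diffs(a, b):
    """Feature-wise squared differences between two locations."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


state_I    = [7, 13, 21, 2, 34, 21, 17]
ideal      = [6, 4, 10, 2, 21, 19, 11]    # ID
anti_ideal = [17, 13, 23, 8, 42, 28, 28]  # NID

sq_I_ideal = squared_diffs(state_I, ideal)
sq_I_anti = squared_diffs(state_I, anti_ideal)

# Unweighted squared distance between the two hypothetical locations.
sq_ideal_anti = sum(squared_diffs(ideal, anti_ideal))

print("I vs ID: ", sq_I_ideal)
print("I vs NID:", sq_I_anti)
print("ID vs NID, total:", sq_ideal_anti)
```

These raw squared differences are not yet comparable across features; the later slides weight them and divide by the feature-wise sum of squares.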
Sq. Dist. Comp. vs Ideal [Table: Locations I-VIII by Chemicals]
Sq. Dist. Comp. vs Anti-Ideal [Table: Locations I-VIII by Chemicals]
Choice of Weights
Wts. [TRI]:
Rel. Wts.:
Sum of Squares for each feature over all locations
Be: $7^2 + 12^2 + 17^2 + 9^2 + 14^2 + 6^2 + 15^2 + 16^2 = 1276$
Computation of Feature-wise Sum of Squares
Features: [Be] [Cd] [Ca] [Tr] [Ch] [Cy] [L]
Sum of Squares: ... [L] 3818
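The feature-wise sums of squares follow directly from the data matrix. A minimal sketch, with illustrative names, that computes one value per chemical (it reproduces 3818 for L):

```python
CHEMICALS = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]

# TRI data matrix from the slides, one row per State I-VIII.
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# Sum of squared scores over all locations, separately for each feature.
sum_of_squares = {
    name: sum(x * x for x in column)
    for name, column in zip(CHEMICALS, zip(*TRI))
}

for name, ss in sum_of_squares.items():
    print(f"[{name}] {ss}")
```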
Formation of Composite Indices
Ingredients: Distances, Weights & Sums of Squares
Composite Index [CI]: 2 components, derived from the Ideal & Anti-Ideal locations
For each location, add over all features: Sq. Distance x Wt. of Feature, divided by Sum of Squares of the feature
Two Components…
u’s and v’s….min. & max.
Details of Computations
State I:
$L_2[\mathrm{I}, \mathrm{IDR}] = [(7-6)^2 \times 0.08 / \ldots]^{1/2}$
$L_2[\mathrm{I}, \mathrm{NIDR}] = [(7-17)^2 \times 0.08 / \ldots]^{1/2}$
$CI = \text{Composite Index} = L_2[\mathrm{I}, \mathrm{IDR}] / \{L_2[\mathrm{I}, \mathrm{IDR}] + L_2[\mathrm{I}, \mathrm{NIDR}]\}$
CI is a RATIO between 0 and 1: the smaller the ratio, the better the placement of the State in the over-all comparison across States.
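Putting the ingredients together, the composite index can be sketched as follows. This is a sketch under stated assumptions, not the presenter's code: the relative weights in `WEIGHTS` are hypothetical (only 0.08 for Be appears in the slides), and the function name `composite_indices` is made up. Each distance is the square root of the sum over features of squared difference x weight / sum of squares, as described on the slides.

```python
from math import sqrt

STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]
# HYPOTHETICAL relative weights summing to 1; only 0.08 (Be) is from the slides.
WEIGHTS = [0.08, 0.14, 0.16, 0.12, 0.18, 0.20, 0.12]


def composite_indices(matrix, weights):
    """Return CI = d(ideal) / (d(ideal) + d(anti-ideal)) for each row."""
    columns = list(zip(*matrix))
    ideal = [min(col) for col in columns]  # less is better
    anti = [max(col) for col in columns]
    ss = [sum(x * x for x in col) for col in columns]

    def distance(row, reference):
        # Weighted, sum-of-squares-scaled L2 distance to a reference location.
        return sqrt(sum(w * (x - r) ** 2 / s
                        for x, r, w, s in zip(row, reference, weights, ss)))

    result = []
    for row in matrix:
        d_ideal = distance(row, ideal)
        d_anti = distance(row, anti)
        result.append(d_ideal / (d_ideal + d_anti))
    return result


ci = composite_indices(TRI, WEIGHTS)
order = sorted(range(len(STATES)), key=lambda i: ci[i])  # smaller CI is better
ranks = {STATES[i]: rank for rank, i in enumerate(order, start=1)}

for state, value in zip(STATES, ci):
    print(f"{state:>4}  CI={value:.4f}  rank={ranks[state]}")
```

Because each squared difference is divided by the sum of squares of the same feature, rescaling a feature (for example from % to original units) leaves CI unchanged.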
Computational Details: Ideal [Table: Locations I-VIII by Features, entries = Sq. Distance wrt Ideal x Weight / SS of Feature]
Computational Details: Anti-Ideal [Table: Locations I-VIII by Features, entries = Sq. Distance wrt Anti-Ideal x Weight / SS of Feature]
Final Ranking Table [Table: one row per State I-VIII, with columns $L_2[\cdot, \mathrm{IDR}]$, $L_2[\cdot, \mathrm{NIDR}]$, CI and Rank]
Choice of Weights
Internal & External Importance of Environmental Factors
Use of Shannon’s Entropy Measure:
Define $p_{ij} = X_{ij} / \sum_i X_{ij}$ = proportion
Compute for each item $j$: $e(j) = -\sum_i p_{ij} \ln p_{ij} / \ln K$
Use $w(j) = (1 - e(j)) / \sum_r (1 - e(r))$
Alternatively, use $w(j)$ proportional to $cv^2$ of item $j$, where cv is the coefficient of variation computed from the data matrix.
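Both data-driven weighting schemes can be sketched in plain Python. The function names `entropy_weights` and `cv2_weights` are illustrative. For the coefficient of variation this sketch assumes the population variance; because every column has the same number of locations, using the sample variance instead would give the same normalized weights.

```python
from math import log
from statistics import mean, pvariance

TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def entropy_weights(matrix):
    """Weights proportional to 1 - normalized Shannon entropy of each column."""
    k = len(matrix)  # number of locations
    divergence = []
    for column in zip(*matrix):
        total = sum(column)
        proportions = [x / total for x in column]
        # Terms with p = 0 contribute nothing (limit of p ln p as p -> 0).
        entropy = -sum(p * log(p) for p in proportions if p > 0) / log(k)
        divergence.append(1.0 - entropy)
    scale = sum(divergence)
    return [d / scale for d in divergence]


def cv2_weights(matrix):
    """Weights proportional to the squared coefficient of variation per column."""
    cv2 = [pvariance(column) / mean(column) ** 2 for column in zip(*matrix)]
    scale = sum(cv2)
    return [c / scale for c in cv2]


w_entropy = entropy_weights(TRI)
w_cv2 = cv2_weights(TRI)
print([round(w, 3) for w in w_entropy])
print([round(w, 3) for w in w_cv2])
```

Under both schemes a feature that barely varies across locations receives a small weight, since it does little to discriminate between them.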
Extensions
Ranking depends critically on the choice of distance measure & the choice of weights.
Distance measures: Squared Distance [$L_2$]; Mean Deviation [$L_1$-norm]
Reversal of Roles of Rows & Cols.
Pollution Data on 50 US States
Results of TOPSIS Analysis : Two Sets of Weights [Entropy & CV] & Two Distance Measures [L-1 & L-2]
Questions
Q1. Can the indicators be expressed in their original units of measurement, or do we need percentages only?
Ans. Yes, original units will do: each squared distance is divided by the sum of squares of the same feature, so the computations are unit-free. Also see the US Original Pollution Data at the end.
Q2. What about interdependence among the indicators?
Questions
Ans. The indicators are assumed to be essentially uncorrelated. If there is any functional dependence among them, only the smallest subset of them should be used.
Q3. What about PCA?
Ans. PCA does not lead to a ranking of the locations. It is also difficult to interpret the linear combinations of the indicators that it produces.