Patients.txt

Variable Name   Description                Type       Valid Values
PATNO           Patient Number             Character  Numerals
GENDER          Gender                     Character  'M' or 'F'
VISIT           Visit Date                 MMDDYY10   Any valid date
HR              Heart Rate                 Numeric    40 to 100
SBP             Systolic Blood Pressure    Numeric    80 to 200
DBP             Diastolic Blood Pressure   Numeric    60 to 120
DX              Diagnosis Code             Character  1 to 3 digits
AE              Adverse Event              Character  '0' or '1'
Distribution
Some Invalid Values
Valid ranges:
–HR (Heart Rate): between 40 and 100
–SBP (Systolic Blood Pressure): between 80 and 200
–DBP (Diastolic Blood Pressure): between 60 and 120
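As a minimal sketch of how these range checks could be automated (the pandas usage, the comma delimiter, and the absence of a header row in Patients.txt are assumptions, not part of the slides):

```python
import pandas as pd

# Assumed: Patients.txt is comma-delimited, no header row, with the columns
# from the data dictionary above.
patients = pd.read_csv(
    "Patients.txt",
    names=["PATNO", "GENDER", "VISIT", "HR", "SBP", "DBP", "DX", "AE"],
)

# Valid ranges taken from the data dictionary.
ranges = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}

for col, (lo, hi) in ranges.items():
    bad = patients[patients[col].notna() & ~patients[col].between(lo, hi)]
    print(f"{col}: {len(bad)} out-of-range value(s)")
    print(bad[["PATNO", col]])
```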
Data integration
–combining/merging data from heterogeneous data sources
–the process of combining data residing at different sources (internal and external data sources), providing the user with a unified view of these data
SCHEMA INTEGRATION
Different sources may use different representations or definitions in their schemas while referring to the same information. This is known as the entity identification problem.
For example: how can we identify that customer_id in one data set and customer_no in another refer to the same entity?
Schema matching
Currently, most schema matching is done manually:
–tedious,
–time-consuming,
–error-prone.
We need automated support for schema matching:
–faster,
–error-free,
–less labor-intensive.
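As a toy illustration of such automated support (the library choice and threshold are assumptions; real schema matchers also use data types, instance values, and constraints), column names from two schemas can be paired by string similarity:

```python
from difflib import SequenceMatcher

def best_matches(schema_a, schema_b, threshold=0.6):
    """Pair each attribute in schema_a with its most similar name in schema_b."""
    matches = {}
    for a in schema_a:
        score, b = max(
            (SequenceMatcher(None, a.lower(), b.lower()).ratio(), b) for b in schema_b
        )
        if score >= threshold:
            matches[a] = (b, round(score, 2))
    return matches

# Hypothetical local schemas: customer_id and customer_no refer to the same entity.
print(best_matches(["customer_id", "order_date"], ["customer_no", "date_of_order"]))
```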
A mapping between Global Schema and Local Schema
The architecture for data integration
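A global-to-local mapping of this kind can be expressed very simply; the sketch below uses hypothetical attribute names to show how records from two local sources are renamed into one global schema:

```python
# Hypothetical mapping from local attribute names to global-schema names.
GLOBAL_MAPPING = {
    "source_A": {"customer_id": "cust_id", "fullname": "name"},
    "source_B": {"customer_no": "cust_id", "cust_name": "name"},
}

def to_global(source, record):
    """Rename a local record's attributes to their global-schema names."""
    mapping = GLOBAL_MAPPING[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(to_global("source_B", {"customer_no": 42, "cust_name": "Lee"}))
# -> {'cust_id': 42, 'name': 'Lee'}
```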
Correlation Analysis
Redundancy: we apply correlation analysis to detect redundant attributes.
Correlation Analysis
Given two attributes (X1, X2), measure the correlation of one attribute (X1) with the other attribute (X2).
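The formula itself is not reproduced on the slide; a standard choice (assumed here) is the correlation coefficient for attributes X1 and X2 over n tuples:

$$ r_{X_1,X_2} = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{X}_1)(x_{2i} - \bar{X}_2)}{n\,\sigma_{X_1}\sigma_{X_2}} $$

A value near +1 means the attributes are strongly positively correlated (one is largely redundant given the other); a value near 0 suggests no linear relationship; a negative value indicates negative correlation.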
Correlation Analysis
Table 2 is generated by the following criteria:
–i) For the number of bytes in an attribute: if the total number of bytes is less than or equal to 8, we record it as 1; otherwise 0.
–ii) For the access frequency of an attribute: we sum the frequencies over the attributes, which is (6 + 1 + 2) = 9. The average access frequency = 9 / 3 = 3. Any frequency less than the average is converted to 0; otherwise it is 1.
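A minimal sketch of this conversion (the byte lengths below are illustrative; the access frequencies 6, 1, 2 and the thresholds follow the criteria above):

```python
# Illustrative byte lengths and the access frequencies 6, 1, 2 for attributes X1, X2, X3.
lengths = {"X1": 4, "X2": 8, "X3": 12}
freqs   = {"X1": 6, "X2": 1, "X3": 2}               # total = 6 + 1 + 2 = 9

avg_freq = sum(freqs.values()) / len(freqs)          # 9 / 3 = 3

size_flag = {a: 1 if n <= 8 else 0 for a, n in lengths.items()}     # 1 if <= 8 bytes
freq_flag = {a: 0 if f < avg_freq else 1 for a, f in freqs.items()} # 0 if below average

print(size_flag)   # {'X1': 1, 'X2': 1, 'X3': 0}
print(freq_flag)   # {'X1': 1, 'X2': 0, 'X3': 0}
```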
Correlation Analysis
We apply correlation analysis to find out which pairs of attributes are redundant.
Correlation Analysis
If the resulting value is greater than 0, then X2 and X3 are positively correlated. The higher the value (approaching 1), the more each attribute implies the other. Therefore, X2 (or X3) may be removed as a redundant attribute.
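A sketch of the redundancy check on illustrative 0/1 profiles of X2 and X3 (the vectors below are made up for demonstration; only the decision rule follows the slide):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative 0/1 profiles of X2 and X3.
x2 = [1, 0, 1, 1, 0, 1]
x3 = [1, 0, 1, 0, 0, 1]

r = pearson(x2, x3)
print(round(r, 2))   # ~0.71: positive correlation, so X2 and X3 carry overlapping information
```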
Clustering
To explain how we apply a clustering algorithm to generate clusters, we assume that a relation has 10 attributes involved in query processing. Furthermore, one disk page can hold less than 100 bytes.
Clustering
Table 6.1 shows the length of each attribute. We use a frequent-access table to keep track of the number of times users access a particular relation, as shown in Table 6.1. When users access the relation, the frequent-access table is updated. The frequent-access table also shows the length of each attribute.
Clustering
From Table 6.1, we convert the numeric figures into Y or N values using the following scheme (a sketch of the conversion follows this list):
–For the number of bytes in an attribute: if the total number of bytes is less than one disk-page fetch (100 bytes), we record it as Y; otherwise N.
–For the access frequency of an attribute: we sum the access frequencies over all attributes, which gives 47.
–The average access frequency = 47 / 10 = 4.7.
–Any frequency less than the average is converted to N; otherwise it is Y.
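A sketch of this Y/N conversion (Table 6.1 is not reproduced, so the attribute lengths and access counts below are made-up placeholders that sum to 47 accesses over 10 attributes):

```python
# Made-up (length in bytes, access count) per attribute; access counts sum to 47.
attrs = {
    "A1": (20, 9), "A2": (40, 8), "A3": (120, 7), "A4": (10, 6), "A5": (30, 5),
    "A6": (60, 4), "A7": (200, 3), "A8": (15, 2), "A9": (80, 2), "A10": (25, 1),
}

PAGE_BYTES = 100
avg_access = sum(f for _, f in attrs.values()) / len(attrs)   # 47 / 10 = 4.7

converted = {
    name: ("Y" if length < PAGE_BYTES else "N",    # fits within one 100-byte page fetch?
           "Y" if freq >= avg_access else "N")     # accessed at least as often as average?
    for name, (length, freq) in attrs.items()
}
print(converted)
```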
DATA TRANSFORMATION
A data transformation converts data from a source data format into a destination data format.
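As a small, assumed example tied to the Patients.txt data above: converting the MMDDYY10-style visit dates into ISO format is a simple source-to-destination format conversion:

```python
from datetime import datetime

def mmddyy10_to_iso(value):
    """Convert an 'MM/DD/YYYY' visit date to the ISO 'YYYY-MM-DD' format."""
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")

print(mmddyy10_to_iso("10/21/1946"))   # -> '1946-10-21'
```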