
1 Patients.txt

Variable Name   Description             Type       Valid Values
–PATNO          Patient Number          Character  Numerals
–GENDER         Gender                  Character  'M' or 'F'
–VISIT          Visit Date              MMDDYY10   Any valid date
–HR             Heart Rate              Numeric    40 to 100
–SBP            Systolic Blood Pres.    Numeric    80 to 200
–DBP            Diastolic Blood Pres.   Numeric    60 to 120
–DX             Diagnosis Code          Character  1 to 3 digits
–AE             Adverse Event           Character  '0' or '1'
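The character and date rules in this table can be checked mechanically. The following is a minimal sketch, not the method used in the slides: it assumes each record is a dict of strings keyed by variable name, and that MMDDYY10 means a date written as mm/dd/yyyy. The function name check_record is hypothetical.

```python
import re
from datetime import datetime

def check_record(rec):
    """Return the names of character/date fields that break the table's rules."""
    bad = []
    # PATNO: character field that must contain numerals only.
    if not re.fullmatch(r"\d+", rec.get("PATNO", "")):
        bad.append("PATNO")
    # GENDER: only 'M' or 'F' is valid.
    if rec.get("GENDER") not in ("M", "F"):
        bad.append("GENDER")
    # VISIT: assumed to be mm/dd/yyyy; strptime rejects impossible dates.
    try:
        datetime.strptime(rec.get("VISIT", ""), "%m/%d/%Y")
    except ValueError:
        bad.append("VISIT")
    # DX: character field of 1 to 3 digits.
    if not re.fullmatch(r"\d{1,3}", rec.get("DX", "")):
        bad.append("DX")
    # AE: only '0' or '1' is valid.
    if rec.get("AE") not in ("0", "1"):
        bad.append("AE")
    return bad
```

Calling check_record on each parsed line of Patients.txt would give a per-patient list of suspect fields; the numeric fields (HR, SBP, DBP) need separate range checks.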

2 Patients.txt

4 Distribution

8 Some Invalid Values

9 Valid ranges for the numeric variables:
–HR - Heart Rate (between 40 and 100)
–SBP - Systolic Blood Pressure (between 80 and 200)
–DBP - Diastolic Blood Pressure (between 60 and 120)
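A range check for these three variables can be sketched as follows. This is an illustration under stated assumptions: records are dicts with numeric values, the bounds are inclusive, and a missing value (None) is reported as invalid. The names RANGES and out_of_range are hypothetical.

```python
# Inclusive valid ranges taken from the slide.
RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}

def out_of_range(rec):
    """Return {variable: value} for every value outside its valid range."""
    flagged = {}
    for name, (low, high) in RANGES.items():
        value = rec.get(name)
        # Assumption: a missing value counts as invalid.
        if value is None or not (low <= value <= high):
            flagged[name] = value
    return flagged
```

An empty result means all three readings fall inside their ranges; anything else lists the values worth inspecting by hand.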

11 DBP - Diastolic Blood Pressure (between 60 and 120)

13 DBP - Diastolic Blood Pressure (between 60 and 120)

14 SBP - Systolic Blood Pressure (between 80 and 200)

15 SBP - Systolic Blood Pressure (between 80 and 200)

16 HR - Heart Rate (between 40 and 100) SBP

18 Data integration
Combining or merging data from heterogeneous data sources. It is the process of combining data residing at different sources (internal and external data sources) and providing the user with a unified view of these data.

19 SCHEMA INTEGRATION
Different sources may use different representations or definitions of a schema while referring to, or representing, the same information. This is known as the entity identification problem.

20 For example
How can we identify that customer_id in one data set and customer_no in another refer to the same entity?

21 Schema matching
Currently, most schema matching is done manually, which is:
–tedious,
–time-consuming,
–error-prone.

22 We need automated support for schema matching that is:
–faster,
–error-free and
–less labor-intensive.
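One simple form of automated support is to score candidate column pairs by how alike their names and their values are. The sketch below is an illustrative heuristic, not the matcher described in the slides: it assumes each data set is a dict mapping column names to lists of values, and the equal weighting and the 0.5 threshold are arbitrary choices. All function names are hypothetical.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity of two column names, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(values_a, values_b):
    """Jaccard overlap of the distinct values in two columns."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def match_columns(table_a, table_b, threshold=0.5):
    """Pair each column of table_a with its best-scoring column of table_b."""
    matches = []
    for col_a, vals_a in table_a.items():
        best = None
        for col_b, vals_b in table_b.items():
            # Assumption: names and values are weighted equally.
            score = 0.5 * name_similarity(col_a, col_b) + 0.5 * value_overlap(vals_a, vals_b)
            if best is None or score > best[1]:
                best = (col_b, score)
        if best is not None and best[1] >= threshold:
            matches.append((col_a, best[0], round(best[1], 2)))
    return matches
```

With a column customer_id in one data set and customer_no in another that share most of their values, this heuristic proposes the pair as a match. A proposal like this still needs human review, so it reduces the manual work rather than removing it.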

23 A mapping between Global Schema and Local Schema
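The figure for this slide is not preserved in the transcript. As a rough illustration of the idea, a mapping from local schemas to a global schema can be kept as a lookup table per source. Every source name and column name below is hypothetical.

```python
# Hypothetical mapping: local column name -> global column name, per source.
LOCAL_TO_GLOBAL = {
    "crm": {"customer_id": "customer_key", "full_name": "name"},
    "billing": {"customer_no": "customer_key", "cust_name": "name"},
}

def to_global(source, record):
    """Rename a local record's columns to the global schema.

    Columns with no entry in the mapping are dropped.
    """
    mapping = LOCAL_TO_GLOBAL[source]
    return {mapping[col]: value for col, value in record.items() if col in mapping}
```

Records from both sources then share the keys customer_key and name, which is what lets a query see them as one unified view.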

24 The architecture for data integration

25 Correlation Analysis
To detect redundancy, apply correlation analysis.

26 Correlation Analysis
Given two attributes (X1, X2), measure the correlation of one attribute (X1) to the other attribute (X2).
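The transcript does not preserve the formula used in the slides. Assuming the measure is the standard Pearson correlation coefficient, it can be written for n observations of X1 and X2 as:

```latex
r_{X_1,X_2} =
  \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}
       {\sqrt{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}\,
        \sqrt{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2}}
```

Here x_{1i} and x_{2i} are the i-th values of the two attributes and the barred symbols are their means. The value lies between -1 and 1.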

27 Correlation Analysis

28 Correlation Analysis

29 Correlation Analysis
Table 2 is generated by the following criteria:
–i) For the number of bytes in the attributes: if the total number of bytes is less than or equal to 8 bytes, we record it as 1; otherwise it is 0.
–ii) For how frequently an attribute is accessed: we sum the access frequencies of the attribute, which is (6 + 1 + 2) = 9. The average access frequency is 9 / 3 = 3. Any number less than the average access frequency is converted into 0; otherwise it is 1.
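The two criteria can be sketched in a few lines. The access frequencies 6, 1, 2 come from the slide; the byte lengths used in the usage note are made-up, since Table 2 itself is not in the transcript. The function names are hypothetical.

```python
def binarize_lengths(lengths, limit=8):
    """Criterion i: 1 if the attribute takes at most `limit` bytes, else 0."""
    return [1 if n <= limit else 0 for n in lengths]

def binarize_frequencies(freqs):
    """Criterion ii: 0 if below the average access frequency, else 1."""
    average = sum(freqs) / len(freqs)  # (6 + 1 + 2) / 3 = 3 for the slide's data
    return [0 if f < average else 1 for f in freqs]
```

For the slide's frequencies, binarize_frequencies([6, 1, 2]) gives [1, 0, 0]. For hypothetical lengths of 4, 8 and 20 bytes, binarize_lengths gives [1, 1, 0].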

30 Correlation Analysis

31 Correlation Analysis
We apply correlation analysis to find out which pairs of attributes are redundant.

32 Correlation Analysis

33 Correlation Analysis

34 Correlation Analysis

35 Correlation Analysis
If the resulting value is greater than 0, then X2 and X3 are positively correlated. The higher the value (approaching 1), the more each attribute implies the other. Therefore, it is recommended that either X2 or X3 be removed, since one of them is redundant.
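A small sketch of how such a value could be computed for two 0/1-coded attributes. It assumes the Pearson correlation coefficient, which the transcript does not confirm, and it returns 0.0 for a constant column, where the coefficient is undefined. The function name pearson is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return covariance / sqrt(spread_x * spread_y)
```

Two attributes with identical codes score close to 1 and are candidates for removal; opposite codes score close to -1.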

36 Clustering
To explain how we apply a clustering algorithm to generate clusters, we assume that a relation has 10 attributes involved in query processing. Furthermore, one disk page can hold fewer than 100 bytes.

37 Clustering
Table 6.1 shows the length of each attribute. We use a frequent access table to keep track of the number of times users access each attribute in a particular relation, as shown in Table 6.1. When users access the relation, the frequent access table is updated. The frequent access table also shows the length of each attribute.

38 Clustering

39 Clustering
From Table 6.1, we would like to convert the numeric figures into a Y or N condition based on some criteria. We propose the following conversion scheme:
–For the number of bytes in the attributes: if the total number of bytes is less than one fetch per instruction cycle, say 100 bytes, we record it as Y; otherwise it is N.
–For how frequently an attribute is accessed: we sum the access frequencies of the attributes, which is (7 + 2 + 4 + 3 + 2 + 8 + 5 + 4 + 9 + 3) = 47.
–The average access frequency is 47 / 10 = 4.7.
–Any number less than the average access frequency is converted into N; otherwise it is Y.
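The conversion scheme can be sketched as follows. The ten access frequencies are the ones listed in the slide; the attribute lengths of Table 6.1 are not in the transcript, so length_flags takes them as an argument. The function names are hypothetical.

```python
# Access frequencies of the 10 attributes, as listed in the slide.
FREQUENCIES = [7, 2, 4, 3, 2, 8, 5, 4, 9, 3]

def frequency_flags(freqs):
    """N if an attribute is accessed less often than average, else Y."""
    average = sum(freqs) / len(freqs)  # 47 / 10 = 4.7 for the slide's data
    return ["N" if f < average else "Y" for f in freqs]

def length_flags(lengths, page_size=100):
    """Y if the attribute is shorter than `page_size` bytes, else N."""
    return ["Y" if n < page_size else "N" for n in lengths]
```

With the slide's data, frequency_flags(FREQUENCIES) marks the attributes with frequencies 7, 8, 5 and 9 as Y and the other six as N.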

40 Clustering

41 Clustering

43 DATA TRANSFORMATION
In metadata, a data transformation converts data from a source data format into a destination data format.
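As a small illustration, a visit date can be converted from one format to another. The sketch assumes the source format is mm/dd/yyyy (our reading of MMDDYY10) and picks ISO 8601 as the destination format purely as an example. The function name is hypothetical.

```python
from datetime import datetime

def mmddyy10_to_iso(text):
    """Convert a date written as mm/dd/yyyy into ISO 8601 (yyyy-mm-dd).

    Raises ValueError if the text is not a valid date in the source format.
    """
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()
```

For example, mmddyy10_to_iso("11/11/1998") returns "1998-11-11", while an impossible date such as "02/30/1998" raises ValueError instead of being passed through silently.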

