Health Care Data Analytics

1 Health Care Data Analytics
Unit 2: Working with Data
Welcome to Component 24, Health Care Data Analytics, Unit 2, Working with Data. This unit is a laboratory exercise using publicly available data sets and Microsoft Excel®. The activities focus on preparing data for analysis and performing simple analyses.
This material (Comp 24 Unit 2) was developed by The University of Texas Health Science Center at Houston, funded by the Department of Health and Human Services, Office of the National Coordinator for Health Information Technology under Award Number 90WT0006. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit
Health IT Workforce Curriculum Version 4.0

2 Working with Data Lecture – Learning Objectives
The objectives for this unit, Working with Data, are to:
Describe reasons why data need to be cleaned or modified before analysis
Demonstrate ability to identify and correct basic errors in data
Demonstrate ability to perform descriptive statistics
Demonstrate ability to use pivot tables
Describe the relationship between a database in a health IT system and data analysis tools

3 Technologies and Tools
Common technologies and tools used for data analytics include:
Spreadsheet programs such as Microsoft Excel®
Statistical programs such as R, SAS, SPSS, and Stata
Database management systems such as MySQL and Microsoft SQL Server®
Business intelligence applications such as Tableau®, QlikView® [pronounced Click View], and IBM Cognos
Note that database management systems can perform some basic analysis, but this is not their strength. Typically, data will be retrieved from a database, such as an EHR's database, using a SQL query. The data will then be imported into another application for the statistical analysis.
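
The retrieve-then-analyze workflow described above can be sketched in Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for an EHR database, and the `patients` table, its columns, and its values are all invented.

```python
import sqlite3
import statistics

# Hypothetical stand-in for an EHR database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (mrn TEXT, gender TEXT, weight_lb REAL)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [("A001", "M", 180.0), ("A002", "F", 150.0), ("A003", "M", 210.0)],
)

# Step 1: retrieve the data with a SQL query.
rows = conn.execute("SELECT weight_lb FROM patients ORDER BY mrn").fetchall()
conn.close()

# Step 2: hand the retrieved values to an analysis tool. Here Python's
# statistics module plays the role of the separate analysis application.
weights = [row[0] for row in rows]
mean_weight = statistics.mean(weights)
print(f"n={len(weights)}, mean={mean_weight}")
```

The database only selects and returns rows; the statistics are computed after the data leave it, which mirrors the division of labor described in the notes.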

4 Install the Excel Analysis ToolPak
You must already have Microsoft Office with Excel on your computer.
For the exercises in this unit, you will use Microsoft Excel because it is widely available. The Analysis ToolPak is a Microsoft Excel add-in program that is available when you install Microsoft Office or Excel. To use it in Excel, however, you need to load it first:
Click the File tab, and then click Options.
Click Add-Ins, and then in the Manage box, select Excel Add-ins.
Click Go.
In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.
If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it. If you are prompted that the Analysis ToolPak is not currently installed on your computer, click Yes to install it.
After you load the Analysis ToolPak, the Data Analysis command is available in the Analysis group on the Data tab.
NOTE: To include Visual Basic for Applications (VBA) functions for the Analysis ToolPak, load the Analysis ToolPak-VBA add-in the same way that you load the Analysis ToolPak. In the Add-Ins available box, select the Analysis ToolPak-VBA check box, and then click OK.

5 Cleaning Data
Identify errors: descriptive statistics, categorical data, use of pivot tables
Determine correct values or infer/impute
If uncorrectable, delete the record
Work with a copy of your dataset and log all changes!
Most, if not all, data sets contain errors or empty fields. These can be due to a wide variety of reasons, such as keyboard errors or fields left blank because the patient could not provide the information. Before you start doing any analysis, you must examine your data, identify any problems, and determine how to address them.
The first task is to identify the errors. You can run descriptive statistics, check counts of categorical data, and create pivot tables to examine your data.
If (or rather, when) you find errors or blank entries, you have to decide what to do. Can you determine the correct value from another source? For example, if the patient's height is missing but you have the medical record number, can you look it up on another system? You may also have to infer or impute values. Perhaps the gender field on your patient is blank and you cannot determine the gender from the name, but you see that the patient had a diagnosis of prostate cancer. You retrieve a pathology report and see that this is your patient, and can therefore safely change the blank gender field to Male.
If you cannot correct the record, then you may have to make the decision to delete the record from your dataset.
Good data governance principles are beyond the scope of this unit, but in general, always work with a copy of your original set of data, and keep a record or log of every change you make to your data.
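
The advice to work on a copy and log every change can be sketched in Python as follows. The records, field names, and the `set_value` helper are hypothetical and exist only for illustration; a real project would follow its own data governance rules.

```python
import copy

# Hypothetical records; the field names and values are invented.
original = [
    {"mrn": "A001", "gender": "M", "weight_lb": 180},
    {"mrn": "A002", "gender": "", "weight_lb": 150},
]

# Work on a deep copy so the original data set is never modified.
working = copy.deepcopy(original)
change_log = []

def set_value(records, index, field, new_value, reason):
    """Change one field and record what changed and why."""
    old_value = records[index][field]
    records[index][field] = new_value
    change_log.append(
        {"mrn": records[index]["mrn"], "field": field,
         "old": old_value, "new": new_value, "reason": reason}
    )

# Assumed scenario: the blank gender was confirmed from another source.
set_value(working, 1, "gender", "M", "confirmed from pathology report")

for entry in change_log:
    print(entry)
```

Because every edit goes through one function, the log stays complete and the untouched original can always be compared against the cleaned copy.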

6 Data Cleaning – Continuous Data Descriptive Statistics
To generate descriptive statistics in Excel: Data > Data Analysis > Descriptive Statistics
Now let's take a look at descriptive statistics, which give a basic overview of the data, including the mean (or average), the median (the midpoint), and the mode (the most frequently occurring value), the smallest and largest values, and the total number of records. Running descriptive statistics is an excellent way to quickly check whether the values are reasonable. For example, this set of data of 500 patient weights shows a mean weight of 189 pounds, with a minimum of 89 and a maximum of 565.
2.1 Figure: (Smith, K. 2016)
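
For readers working outside Excel, the same summary measures can be sketched with Python's standard `statistics` module. The weights below are invented sample values, not the 500-patient data set shown on the slide.

```python
import statistics

# Invented sample weights in pounds (not the data set from the slide).
weights = [150, 165, 165, 180, 189, 210, 565]

summary = {
    "count": len(weights),
    "mean": statistics.mean(weights),
    "median": statistics.median(weights),
    "mode": statistics.mode(weights),
    "min": min(weights),
    "max": max(weights),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A maximum far above the median, as with the 565 here, is the kind of value worth checking against the source record before analysis.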

7 Data Cleaning – Categorical Data
COUNTIF function: =COUNTIF(range, criteria)
=COUNTIF($B$1:$B$10, "M") will give 5
=COUNTIF($B$1:$B$10, "F") will give 3
=COUNTIF($B$1:$B$10, "U") will give 1
Can identify some errors
You will probably have data that are not numeric, so counts of how many times certain values occur can be very helpful, not only for understanding the data but also for identifying errors in the data. Take a look at this list of 10 patient genders. If we use the COUNTIF function on the range of cells from B1 to B10, and specify that we want to know how many M's there are, we will get 5; for entries of F for Female, we will get 3, and for U (for Unknown), we will get 1. Now we have a problem: 5 plus 3 plus 1 adds up to only 9, and we have 10 records. In this small example, we can see that there is an entry with a D in it, which may have been a data entry error. But if we were working with thousands or millions of records, we would not be able to scroll through a list and look for the incorrect records. We need additional tools. So let's look at another Excel tool called Filter.
2.2 Figure: (Smith, K. 2016)
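
The counting check above can be sketched in Python with `collections.Counter`. The list mirrors the slide's counts (5 M, 3 F, 1 U, and one stray D), but the order of the values and the set of valid codes are assumptions made for illustration.

```python
from collections import Counter

# Ten gender codes mirroring the slide's counts; the order is invented.
genders = ["M", "F", "M", "M", "U", "F", "D", "M", "F", "M"]
expected_codes = {"M", "F", "U"}  # assumed list of valid codes

counts = Counter(genders)

# Add up only the expected codes, as the three COUNTIF formulas do.
counted = sum(counts[code] for code in expected_codes)

# Anything left over is a value that needs to be investigated.
unexpected = {code: n for code, n in counts.items() if code not in expected_codes}

print(counts["M"], counts["F"], counts["U"])
print(f"{counted} of {len(genders)} records have an expected code")
print("unexpected codes:", unexpected)
```

Comparing the counted total with the number of records works at any scale, so the stray value is found without scrolling through the list.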

8 Filtering Records
Displays only those records that meet certain criteria
Click a cell in the column to be filtered
On the Data tab, click the Filter icon
Excel's Filter function lets you display only records that meet certain criteria. To filter records, click a cell in the column to be filtered (in this case, we are using the Gender column) and then click the Filter icon on the Data tab. Nothing will appear to happen, except that a drop-down arrow will now appear at the right of the column heading (indicated here with a red arrow). Click the drop-down arrow and we'll look at what happens on the next slide.
2.3 Figure: (Smith, K. 2016)

9 Filtering Records, continued
Dialog box displays all the values present in the column
Can check only values you are interested in; Excel will display only those records
If you click the drop-down arrow on the column header, you should now see a dialog box with several options. At the bottom of the dialog box is the list of all the values that were present in the Gender column: the M, F, U, and the D that is an incorrect value. To see only the record(s) with a D in the Gender column, uncheck the F, M, and U and leave only the D checked. You can also specify additional filtering criteria, such as whether the entry in a cell equals a value, does not equal a value, begins or ends with a string of characters, and so on.
2.4 Figure: (Smith, K. 2016)
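
The same filtering idea can be sketched in Python. The records and the `filter_records` helper are hypothetical; they simply mimic checking one value in the filter list and applying a "does not equal" criterion.

```python
# Hypothetical records; patient numbers and genders are invented.
records = [
    {"patient": 1, "gender": "M"},
    {"patient": 2, "gender": "F"},
    {"patient": 3, "gender": "D"},
    {"patient": 4, "gender": "U"},
]

def filter_records(rows, field, keep_values):
    """Return only the rows whose field value is one of keep_values."""
    return [row for row in rows if row[field] in keep_values]

# Equivalent of leaving only "D" checked in the filter list.
suspect = filter_records(records, "gender", {"D"})

# Equivalent of a "does not equal" criterion.
not_male = [row for row in records if row["gender"] != "M"]

print(suspect)
print(len(not_male))
```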

10 Column Graph Column graph shows individual weights
But doesn't show us how many patients are in a particular weight category
Let's take a look at the weights for these 10 patients. We could graph them, but this isn't particularly helpful: individual weights plotted against patient number don't give us any insight into categories of weights, such as a category for 250 to 299 pounds, and so on. That's a job for a frequency histogram, which looks like a column graph but shows distributions. Let's look at frequencies and histograms now.
2.5 Figure: (Smith, K. 2016)

11 Frequencies and Histograms
Frequency: "How many of X and Y are there?"
A frequency calculation gives how many times a particular value occurs
Can be shown as a frequency table or as a histogram (a graph of the number of times values occur in a set of data)
Frequencies answer the question "how many of something are there?" Recall that nominal data are things like names, labels, or categories of data. Examples of nominal data are patient names; genders, such as male, female, and unknown; and marital status (married, single, divorced, widowed). Frequencies are primarily shown in a table format, and also in a histogram. A histogram is a graph of the number of times values occur in a set of data; in other words, the frequency of the values. It looks like a column chart, so don't get confused: a histogram is a visualization of the distribution of the values. Let's look at an example now.

12 Example Frequency Table and Histogram
On the left is a frequency table and on the right is a histogram of the same data. The categories or bins from the frequency table are the numbers along the horizontal X axis of the graph, and the frequencies (or counts) are the numbers along the vertical or Y axis of the graph. 2.6 Figure: (Smith, K. 2016)

13 Example
How many patients are in each of the following weight categories (in pounds)? <
Let's step through the process of creating that table and histogram. We want to know how many patients fall into each category, and then we want to see this in a graph. We'll take our 500 patient weights and set up categories in roughly 50-pound divisions.

14 Set up the category bins
Add a column to your Excel spreadsheet with the bins that you want to use to categorize the patient weights. Note that in this example only a few of the 500 patient weights are shown.
2.7 Figure: (Smith, K. 2016)

15 Creating a Frequency Table and Histogram
To run Frequencies in Microsoft Excel, click the Data tab, then click Data Analysis, and then choose Histogram. 2.8 Figure: (Smith, K. 2016)

16 Creating a Frequency Table and Histogram
In the Input Range field, enter the range of cells that contain the weights (in this case, the data start in H2 and continue through H501).
In the Bin Range field, enter the range of cells that contain the category bins that you created (in this example, the bins are in cells L2 through L12).
Finally, determine what type of output you want. Let's click Chart Output for this example.
2.9 Figure: (Smith, K. 2016)

17 Frequency Table and Histogram Output
Excel will produce a new worksheet in your file with the table and the histogram. Using the Excel Data Analysis ToolPak Frequency function for 500 records. 2.10 Figure: (Smith, K. 2016)
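
The binning step can also be sketched in Python. This sketch assumes that each bin value is an inclusive upper bound and that values above the last bin fall into a final "More" category, which is how the Excel Histogram tool reports its output; the weights and bin values are invented, and `frequency_table` is a hypothetical helper.

```python
from bisect import bisect_left

def frequency_table(values, bins):
    """Count values per bin, treating each bin as an inclusive upper bound."""
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)  # the final slot is the overflow ("More") bin
    for value in values:
        # bisect_left finds the first bin that is >= value.
        counts[bisect_left(bins, value)] += 1
    labels = [str(b) for b in bins] + ["More"]
    return list(zip(labels, counts))

weights = [89, 120, 150, 150, 189, 200, 251, 565]  # invented sample weights
bins = [100, 150, 200, 250]                        # invented bin upper bounds

table = frequency_table(weights, bins)

# Print the frequency table with a crude text histogram beside it.
for label, count in table:
    print(f"{label:>5} {'#' * count} {count}")
```

Printing a row of `#` characters per bin gives a rough text version of the chart output without any plotting library.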

18 Sorted Histogram
Sorting the histogram organizes the categories or bins by how often they occur. The most frequent bins will be to the left of the graph.
2.11 Figure: (Smith, K. 2016)
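
Sorting a frequency table by count can be sketched in one line of Python. The table below is an invented example of bin labels and counts, not output from the slide.

```python
# Invented frequency table: (bin label, count) pairs.
table = [("100", 1), ("150", 3), ("200", 2), ("250", 0), ("More", 2)]

# Sort by count, largest first, so the most frequent bins come first.
# Python's sort is stable, so bins with equal counts keep their original order.
sorted_table = sorted(table, key=lambda item: item[1], reverse=True)

for label, count in sorted_table:
    print(f"{label:>5} {count}")
```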

19 Pivot Tables
Pivot tables are an Excel tool that lets you summarize, analyze, and create different views of your data. You can arrange how the data are displayed. Pivot tables are very useful for identifying trends or relationships among data in large datasets.
Use the laboratory exercise on pivot tables to explore data on hospital-acquired infections.

20 Example Pivot Table
Here is an example of a raw set of data on the left and a pivot table of some of the data on the right, along with a column chart of the data in the pivot table. You can practice creating this pivot table in the laboratory activity associated with this unit.
2.12 Figure: (HealthData.gov, n.d.)
2.13 Figure: (HealthData.gov, n.d.)
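
What a pivot table does can be sketched in plain Python: group the rows by two fields and summarize a third. The rows below are invented and only loosely modeled on an infection data set; the field names and the `pivot_mean` helper are hypothetical.

```python
from collections import defaultdict

# Invented rows; the states, measures, and scores are not real data.
rows = [
    {"state": "TX", "measure": "CLABSI", "score": 0.5},
    {"state": "TX", "measure": "CAUTI", "score": 1.0},
    {"state": "CA", "measure": "CLABSI", "score": 0.75},
    {"state": "CA", "measure": "CLABSI", "score": 0.25},
]

def pivot_mean(data, row_field, col_field, value_field):
    """Average value_field for every (row_field, col_field) combination."""
    cells = defaultdict(list)
    for record in data:
        cells[(record[row_field], record[col_field])].append(record[value_field])
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}

pivot = pivot_mean(rows, "state", "measure", "score")

for (state, measure), mean_score in sorted(pivot.items()):
    print(state, measure, mean_score)
```

Swapping the row and column fields, or replacing the mean with a count or sum, gives the different views of the same data that the notes describe.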

21 Chi-Square Test Are two categorical variables related?
Categorical variable examples: gender; ethnicity; age group (e.g., 50-59); disease stage (I, II, III, IV); presence or absence of a disease
Now think back to the section in this component on types of data. Recall that we discussed nominal, ordinal, interval, and ratio data. Nominal data included things that were names or categories, such as gender, race, religion, smokes (yes/no), and so on. Another name for nominal data is categorical data, and this term is typically used in statistics.
So think about the example of gender. Let's say we have three categories: Male, Female, or Unknown. You can do simple analysis, such as counting how many patients of each gender are in the hospital, but it doesn't make sense to do many kinds of calculations on categorical data. For example, it doesn't make sense to try to calculate the average gender of the patients in the hospital.
However, you may want to know if there is a relationship between two categorical variables. For example, you might want to know if there is a relationship between patient gender (we'll simplify it to just male and female for this example) and smoking status (yes/no). The Chi-Square [pronounced ki or k-eye] test is used to determine whether there is a relationship between two categorical (nominal) variables. In other words, it checks to see if the two variables are independent of each other. You'll see this frequently reported in the literature.
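
As a worked illustration of the gender and smoking example, the chi-square statistic can be computed directly from a table of counts. The counts below are invented, the `chi_square` helper is hypothetical, and the sketch applies no continuity correction. The closed-form p-value used here is valid only for a 2x2 table (one degree of freedom).

```python
import math

# Invented 2x2 table of counts: rows are gender, columns are smoking status.
observed = [
    [30, 70],  # male:   smoker, non-smoker
    [20, 80],  # female: smoker, non-smoker
]

def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi_square(observed)

# With 1 degree of freedom, P(chi-square > stat) equals erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```

A small p-value would suggest that the two variables are not independent; a large one means the data give no evidence of a relationship. For larger tables the degrees of freedom change, and a statistics package would normally be used instead of this closed form.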

22 Unit 2 Summary: Working with Data
This concludes Component 24, Health Care Data Analytics, Unit 2, Working with Data. In summary:
Before you start doing any analysis, you must examine your data, identify any problems, and determine how to address them
We discussed descriptive statistics, pivot tables, histograms, frequency tables, and the chi-square test
Pivot tables describe the relationship between a database in an HIT system and data analysis tools
You can practice some of these data tools in the laboratory activities associated with this unit.

23 Unit 2: Working with Data References
Load the Analysis ToolPak. (n.d.). Retrieved May 03, 2016, from
Charts, Tables, and Figures
2.1 – 2.11 Figure: Smith, K. (2016). Used with permission from author.
2.12 – 2.13 Figure: Healthcare Associated Infections - State | HealthData.gov. (n.d.). Retrieved May 03, 2016, from
Health IT Workforce Curriculum Version 4.0

24 Unit 2: Working with Data References
This material was developed by The University of Texas Health Science Center at Houston, funded by the Department of Health and Human Services, Office of the National Coordinator for Health Information Technology under Award Number 90WT0006.

