Analyze ICD-10 Diagnosis Codes with Stata Dr. Tiankai Wang, Associate Professor and MHIM Program Director Department of Health Information Management, College of Health Professions, Texas State University Introduction On October 1, 2015 the United States transitioned to the 10th revision of the International Classification of Diseases set of diagnoses for coding medical encounters and inpatient procedure codes, known as ICD-10-CM/PCS. ICD-10 impacts almost all providers, health payment systems, and clinical processes. By analyzing ICD-10 diagnosis codes, healthcare professionals can better understand costs, and more accurately plan and select strategic initiatives in business development. The American Hospital Association notes other benefits which include expanded ability to perform public health surveillance, improved ability to measure the quality of health care services, and strengthened capability to distinguish improvements in medicine and associated technology. However, the increased granularity of ICD-10 versus that of ICD-9, such as the addition of laterality-specific codes, presents a challenge. Moreover, there are thousands of new codes, creating enormous complexities and confusion. Proficiency in using statistical software to conduct ICD-10 research would be of tremendous value to healthcare professionals, researchers and epidemiologists. The most popular data analysis software in 2015 is SPSS, R, SAS, and Stata. The literature is silent on data analytics software regarding ICD-10 diagnosis codes analysis in SPSS, SAS, or R, although some software community users developed unofficial packages. Currently Stata is the only data analysis software system that provides specific commands to analyze ICD-10. When data are gathered from multiple sources, they may not be fully standardized. There can also be reporting errors. Stata icd10 commands are designed to address these common challenges with secondary data. Stata icd10 commands include the following data management commands (icd10 check, icd10 clean, and icd10 generate), and interactive utilities (icd10 lookup). The purpose of this poster is to illustrate how to use Stata icd10 commands to analyze ICD-10 diagnosis codes in healthcare research. Usage of Stata icd10 Commands Icd10 check First, verification of ICD-10 diagnosis codes for validity is needed. Icd10 check verifies that a variable contains defined ICD-10 diagnosis codes and provides a summary of any problems encountered. The syntax is “icd10 check varname, generate (newvar) year (201x)”. varname is the variable containing ICD-10 diagnosis codes. newvar is the new variable generated to indicate whether the code in varname is valid or invalid. 0 indicates a valid ICD-10 code, number 1 through 8 indicates for the reason the code is not valid, or a value of 99 for the undefined code. year (2014) indicates that the verification is based on ICD-10 diagnosis codes of 2014 edition. Before conducting further analysis, the researcher will need to drop the observations with the invalid ICD-10 diagnosis codes. Exhibit 1 shows the codes and outputs. Cause is the varname in this dataset. invalid is a new variable named we generated. 2014 indicates that we use ICD-10 2014 codes in Stata v14. Icd10 clean Exhibit 2 displays the original data contents. The variable Cause contains the ICD-10 diagnosis codes indicating the causes of deaths. “1” in the variable Sex stands for males, and “2” stands for females. The variable deaths indicates the frequency of deaths. Notice that the original format of Cause is one letter following by three numbers, e.g., A009. This is not a standard format of ICD-10. The command line “icd10 clean varname, dots” can be used to standardize the format. Exhibit 3 shows the standard ICD-10 format after cleaning. Usage of Stata icd10 Commands (cont’d) Icd10 lookup With the cleaned ICD-10 data, further analysis can be conducted. For example, a researcher may want to identify the most frequent causes of male deaths in the United States in 2013. The researcher can sort deaths in descending order with gsort, then list the first 10 observations specifying Sex equal to 1 (i.e., male). The output is in Exhibit 4. Thus, the researcher can see that the top cause of death is ICD-10 code I25.1 with 89,340 deaths. Accordingly, the researcher would like to know the meaning of I25.1 in the ICD-10 classification codes. This is where Stata icd10 commands are of tremendous value to the researcher. “icd10 lookup” displays descriptions of the diagnosis codes specified on the command line. In this example, the output of the command line “icd10 lookup I25.1” is “I25.1 Atherosclerotic heart disease”. The researcher is now informed that the top cause of male deaths in the U.S. in 2013 is atherosclerotic heart disease. With similar Stata codes specifying different conditions, researchers will construe that the top cause of female deaths in the U.S. in 2013 is “F03 unspecified dementia” with 88,408 deaths, and the top cause of (both male and female) newborn (i.e., Age=0) deaths is “P07.2 Extreme immaturity” with 3,168 deaths. The outputs are in Exhibit 5. The inherent “icd10 lookup” command in Stata improves the efficiency in ICD-10 diagnosis codes lookup and avoids potential errors when using manual processes or merging datasets. Usage of Stata icd10 Commands (cont’d) Icd10 generate Another advantage of Stata icd10 command is the icd10 generate command. For example, the researcher may want to identify all deaths due to respiratory illness which cover 275 different ICD-10 diagnosis codes. A variable would need to be created and set equal to one of the 275 codes. Without icd10 commands, the researcher would need three command lines to generate this variable. Using icd10 generate command with the range( ) option, the researcher can easily generate the variable with one command line. The Stata code is “icd10 generate resp= Cause, range (J10/J898)”. In this process, the observations with ICD-10 codes ranging from J10 to J898 are coded to “1” in the new variable resp. Next, tabulate resp with the frequency weight option “[fweight=deaths]”. The output is in Exhibit 6. The researcher will see that about 10% of all deaths in the U.S. in 2013 were from respiratory illnesses. Exhibit 5. Look up the descriptions of ICD-10 diagnosis codes Exhibit 3. Format ICD-10 codes for standard format Exhibit 1. Check and drop invalid ICD-10 diagnosis codes Exhibit 6. Generate and tabulate of respiratory illnesses Conclusion Using icd10 commands can save researchers time and avoid errors that often happen in manual processes or merging datasets. Furthermore, with the other Stata commands, researchers have the ability to complete many other tasks, such as graphing summary data for data visualization, and conducting regression analysis to identify the potential associations among factors. Data To illustrate the usage of Stata icd10 commands, the most recent year (2013) of WHO mortality data of the United States is utilized. The WHO mortality data can be accessible at http://apps.who.int/healthinfo/statistics/mortality/causeofdeath_query/. The original sample data contain 6,215 observations and 33 variables. Exhibit 2. List first ten original observations Cause, Sex, and Deaths, separating by Cause without the observation numbers Exhibit 4. List the top causes of male deaths in the U.S. in 2013