Dead Man Visiting
Farrokh Alemi, PhD
Narrated by …
This set of slides was organized by Professor Alemi and narrated by Farhat Fazelyar.
Introduction to SQL
This section of the course introduces you to Structured Query Language (SQL) and key commands within it. SQL is a standard language for accessing and manipulating relational databases. It has been an American National Standards Institute (ANSI) standard since 1986, and its core commands are largely the same across vendors. The standard has been revised several times since then, but the core commands have remained stable for decades, which is an unusually long time in computing. This is in part because SQL is well suited to the task of data manipulation.
Data Manipulation Commands
The data manipulation language is designed to add, change, and remove data from a database. In this section, we primarily focus on data manipulation commands. These include commands to retrieve data from a database (SELECT), to insert data into a database (INSERT), to update data already in the database (UPDATE), and to delete data from a database (DELETE).
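As a minimal sketch of the four data manipulation commands, the following creates a small practice table and then inserts, updates, deletes, and retrieves rows. The table name #demo and its values are hypothetical, and the DROP TABLE IF EXISTS line assumes SQL Server 2016 or later.

```sql
-- Hypothetical practice table; the name #demo and its rows are made up.
DROP TABLE IF EXISTS #demo;
CREATE TABLE #demo (id INT, diagnosis VARCHAR(30));

-- INSERT adds rows to the table.
INSERT INTO #demo (id, diagnosis)
VALUES (1, 'diabetes')
     , (2, 'hypertension');

-- UPDATE changes data already in the table.
UPDATE #demo
SET diagnosis = 'type 2 diabetes'
WHERE id = 1;

-- DELETE removes rows from the table.
DELETE FROM #demo
WHERE id = 2;

-- SELECT retrieves what is left: one row, id 1 with 'type 2 diabetes'.
SELECT id
, diagnosis
FROM #demo;
```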
Data Definition Commands
SQL also includes a data definition language. These commands are used to create a database, modify its structure, and destroy it when you no longer need it. We will later discuss how one creates tables or deletes them. There are also different types of tables. There are, for example, temporary tables that are deleted automatically when you close your connection in the SQL data management software.
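The three data definition steps (create, modify, destroy) might look like the sketch below when applied to a temporary table. The table name #visits and its columns are hypothetical and chosen only for illustration.

```sql
-- Create: define a temporary table (the # prefix marks it as temporary).
CREATE TABLE #visits (
    id INT
    , diagnosis VARCHAR(30)
);

-- Modify: change the structure by adding a column.
ALTER TABLE #visits ADD [Age at Dx] INT;

-- Destroy: remove the table when it is no longer needed.
-- A temporary table would also be removed when the connection closes.
DROP TABLE #visits;
```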
Data Control Commands
SQL also includes a data control language. These commands protect the database from unauthorized access, from harmful interaction among multiple database users, and from power failures and equipment malfunctions. We will not cover these commands in this course.
SELECT, INTO, FROM, COUNT, MIN, MAX, WHERE, HAVING, & GROUP BY
The list of SQL commands is short. That is good news: your task is simple. The bad news is that these commands can be used in a variety of ways to accomplish different tasks. In this section of the course we go over the SELECT, INTO, FROM, WHERE, GROUP BY, and HAVING commands, along with the aggregation functions COUNT, MIN, and MAX.
Learn format from the Web
"I learn more out of a web search than I could from asking my instructor."
One usually learns the format for a command through searches on the web. I assume that you can do so on your own. In fact, whenever you run into an error you should search for the error on the web, and you will see many instances of others posting solutions to your problem. Do this first because it is the best way to get your problems solved. Most students of SQL admit that they learned more from web searches than from any instruction. The beauty of such learning is that you learn just enough to solve the problem at hand.
Learn from Examples
Yay! We focus on the concepts.
To make sure that these commands are well understood, we demonstrate them by manipulating the data that you previously imported into a database. If you haven’t imported the data yet, go back and do so. We show you one example of each command and leave it up to you to learn the format and other uses of the command. You can’t learn SQL without practice and you can’t practice without downloading some data.
SELECT id
Here we see the format of the SELECT command, which is usually followed by a field name. SQL commands are usually written in all caps to separate them from other elements in the code. Here the SELECT command asks the software to report on a variable, or field, called id.
SELECT id , diagnosis
If there is more than one field, the fields should be separated by commas. The convention is to start each field name on a new line, preceded by the comma, so that if one wants to delete a field name one can easily do so. Here we are selecting the fields id and diagnosis.
SELECT id , diagnosis FROM #data
We also need to specify where the fields come from, in particular which table in our database holds them. The FROM command does so. This command says that we should select id and diagnosis from a table called data. The hash sign (#) before data says that this is a temporary table that will disappear once we close the connection to the database.
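To try this query without the course data, a small temporary table can stand in for #data. This is only a sketch: the three rows are made up, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical stand-in for the course table; the rows are made up.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (id INT, diagnosis VARCHAR(30));
INSERT INTO #data (id, diagnosis)
VALUES (1, 'diabetes')
     , (1, 'hypertension')
     , (2, 'diabetes');

-- One field per line, each preceded by its comma, so a field
-- can be removed by deleting a single line.
SELECT id
, diagnosis
FROM #data;
-- Returns all three rows with the two selected fields.
```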
SELECT id , diagnosis FROM dbo.data
This FROM command says that the table data is a permanent table inside the database. The prefix dbo (short for database owner) is the default schema in SQL Server, the container inside the database that holds the table. If you import data into a database without naming a schema, you would use dbo to indicate where the table is.
SELECT id , diagnosis FROM diabetes.dbo.data
You would also need to make sure that the software knows which database you have in mind. You can do this in two ways. The first is to add the name of the database as a prefix to the table name. Here we are saying that the database is called diabetes.
USE diabetes SELECT id , diagnosis FROM dbo.data
The second way is the USE command, which sets the database for the rest of your code, so you do not need to repeat the database name every time you read data.
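The two ways of naming the database can be compared side by side. This sketch assumes that a database called diabetes containing the table dbo.data already exists on your server, as in the slides; it will fail otherwise.

```sql
-- Option 1: three-part name, written as database.schema.table.
SELECT id
, diagnosis
FROM diabetes.dbo.data;

-- Option 2: set the database once, then use the shorter two-part name.
USE diabetes;

SELECT id
, diagnosis
FROM dbo.data;
```

Both queries read the same table. The three-part name is handy for a one-off query against another database, while USE keeps longer scripts shorter.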
SELECT id , diagnosis INTO #temp FROM dbo.data
The INTO command writes the manipulated data into a new table. In this code, because the name of the table is preceded by a hash sign, it is a temporary table called temp. The fields id and diagnosis are selected and put into the temporary table called temp.
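A self-contained sketch of INTO follows. It uses a made-up temporary table #data in place of dbo.data, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later. Note that SELECT ... INTO creates the target table, so it fails if a table with that name already exists in the session; dropping the table first lets you rerun the code.

```sql
-- Hypothetical source table standing in for dbo.data; the rows are made up.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (id INT, diagnosis VARCHAR(30));
INSERT INTO #data (id, diagnosis)
VALUES (1, 'diabetes')
     , (2, 'hypertension');

-- SELECT ... INTO creates #temp, so remove any earlier copy first.
DROP TABLE IF EXISTS #temp;

SELECT id
, diagnosis
INTO #temp
FROM #data;

-- INTO itself displays no rows; query the new table to see them.
SELECT id
, diagnosis
FROM #temp;
```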
SELECT id , diagnosis INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx]
The WHERE command filters the data. Here we see a statement that only the rows where age at death is greater than age at diagnosis should be included in the temporary table called #GoodData. Presumably errors in data entry have led to some cases showing that the patient has died but continues to visit later. These rows are excluded, and only the rows where death occurs after the diagnosis are kept.
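The effect of the WHERE filter can be seen on a handful of rows. The sketch below uses a made-up temporary table #data instead of dbo.data and leaves out INTO so the result is displayed directly; DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data; the ages are made up for illustration.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (
    id INT
    , diagnosis VARCHAR(30)
    , [Age at death] INT
    , [Age at Dx] INT
);
INSERT INTO #data (id, diagnosis, [Age at death], [Age at Dx])
VALUES (1, 'diabetes', 70, 60)
     , (1, 'hypertension', 70, 65)
     , (1, 'diabetes', 70, 68)
     , (2, 'diabetes', 55, 50)
     , (2, 'neuropathy', 55, 58)   -- diagnosis recorded after death
     , (3, 'diabetes', NULL, 45);  -- no recorded death

SELECT id
, diagnosis
FROM #data
WHERE [Age at death] > [Age at Dx];
-- Four rows are kept: the three rows for id 1 and the 'diabetes' row for id 2.
```

The row for id 3 is also dropped: a comparison with a missing (NULL) age at death is never true, so this filter keeps only rows for patients with a recorded death.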
SELECT id , diagnosis INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx]
Be careful. You may think that a visit from a dead patient is not possible. In rare situations some visits do occur after death. For instance, the transport of a dead patient from home to hospital may lead to the recording of an encounter. Other scenarios can also exist where the patient has an encounter after death. These are rare and very specific to post-mortality services. The most common reason for encounters with the health care system after death is that the date of death was entered wrong by a clinician or clerk. Check the date entered against the date the computer recorded for the entry. We may still be left with an erroneous date of death. One of the first steps in cleaning the data is to identify patients whose date of death occurs before the date of various outpatient or inpatient encounters.
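Before discarding anything, it can help to list the suspicious records so they can be reviewed. One possible sketch reverses the comparison to show the rows where a diagnosis is recorded at an older age than the age at death. The temporary table #data and its values are made up, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data; the ages are made up for illustration.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (
    id INT
    , diagnosis VARCHAR(30)
    , [Age at death] INT
    , [Age at Dx] INT
);
INSERT INTO #data (id, diagnosis, [Age at death], [Age at Dx])
VALUES (1, 'diabetes', 70, 60)
     , (1, 'hypertension', 70, 65)
     , (1, 'diabetes', 70, 68)
     , (2, 'diabetes', 55, 50)
     , (2, 'neuropathy', 55, 58)
     , (3, 'diabetes', NULL, 45);

-- List the records where the diagnosis comes after the recorded death.
SELECT id
, diagnosis
, [Age at death]
, [Age at Dx]
FROM #data
WHERE [Age at Dx] > [Age at death];
-- Returns one row: id 2, 'neuropathy', age at death 55, age at diagnosis 58.
```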
SELECT id INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx] GROUP BY id
The GROUP BY command is one of the most important commands in SQL and allows reporting of field values for specific subgroups of data. Here the GROUP BY command forces us to keep only one row per id for each person whose data are good to go. Note that we can no longer include the diagnosis field in the SELECT list. Any selected field that is not named in the GROUP BY command must be wrapped in an aggregation function, so that it is clear how the field is summarized within each subgroup.
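A small sketch shows how GROUP BY collapses the filtered rows to one row per id. The temporary table #data and its values are made up, INTO is left out so the result is displayed directly, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data; the ages are made up for illustration.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (
    id INT
    , diagnosis VARCHAR(30)
    , [Age at death] INT
    , [Age at Dx] INT
);
INSERT INTO #data (id, diagnosis, [Age at death], [Age at Dx])
VALUES (1, 'diabetes', 70, 60)
     , (1, 'hypertension', 70, 65)
     , (1, 'diabetes', 70, 68)
     , (2, 'diabetes', 55, 50)
     , (2, 'neuropathy', 55, 58)
     , (3, 'diabetes', NULL, 45);

-- Four rows pass the WHERE filter; GROUP BY reduces them to one row per id.
SELECT id
FROM #data
WHERE [Age at death] > [Age at Dx]
GROUP BY id;
-- Returns two rows: id 1 and id 2.

-- The next query is commented out because it raises an error:
-- diagnosis is neither aggregated nor listed in GROUP BY.
-- SELECT id, diagnosis FROM #data GROUP BY id;
```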
SELECT id , count(diagnosis) AS cDx INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx] GROUP BY id
Here we have included the count of the field diagnosis for each subgroup defined by the GROUP BY statement. The function COUNT is an aggregation function. It will report the number of diagnosis records that each unique person has. Note that these are not necessarily distinct diagnoses: a diagnosis that is recorded several times for the same person is counted each time.
SELECT id , count(distinct diagnosis) AS cDx INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx] GROUP BY id
If we want to know the number of distinct diagnoses, we need to use the qualifier DISTINCT inside the COUNT function.
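Putting the two counts next to each other makes the difference visible. In this sketch the temporary table #data and its values are made up, the alias cDistinctDx is a hypothetical name added for the second count, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data; patient 1 has 'diabetes' recorded twice.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (
    id INT
    , diagnosis VARCHAR(30)
    , [Age at death] INT
    , [Age at Dx] INT
);
INSERT INTO #data (id, diagnosis, [Age at death], [Age at Dx])
VALUES (1, 'diabetes', 70, 60)
     , (1, 'hypertension', 70, 65)
     , (1, 'diabetes', 70, 68)
     , (2, 'diabetes', 55, 50)
     , (2, 'neuropathy', 55, 58)
     , (3, 'diabetes', NULL, 45);

SELECT id
, COUNT(diagnosis) AS cDx                    -- every diagnosis record
, COUNT(DISTINCT diagnosis) AS cDistinctDx   -- each diagnosis counted once
FROM #data
WHERE [Age at death] > [Age at Dx]
GROUP BY id;
-- id 1: cDx = 3, cDistinctDx = 2
-- id 2: cDx = 1, cDistinctDx = 1
```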
All fields must be in the table
SELECT id INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx] GROUP BY ID
To summarize, in this code, the “SELECT id” command tells the system that we are interested in finding the ID of the patients. Note that we do not need to identify the source of the ID field because this query has only one table, so there is no confusion about where the variable comes from. Still, it is always best to specify the source so there is no room for confusion. Thus we should have used “SELECT dbo.data.id”. The INTO command says that we should put these IDs into a temporary table called #GoodData. The “FROM dbo.data” command says that we want to get this information from a permanent table called data, which includes both age at death and age at diagnosis. The “WHERE [Age at death]>[Age at Dx]” command says that age at death must be greater than age at diagnosis. The “GROUP BY id” says that we want to see only one row for each ID no matter how many of the person’s diagnoses occur before death. Also note that the WHERE command is executed before the GROUP BY command, and the SELECT command is executed after the GROUP BY command.
SELECT id INTO #GoodData FROM dbo.data GROUP BY ID HAVING Min([Age at death])>Max([Age at Dx])
The HAVING command filters like WHERE, but it is executed after grouping is done, so it filters groups rather than individual rows. In this code we have dropped the WHERE filter and added a HAVING command. In GROUP BY we are saying that the data should be grouped by unique persons, i.e. unique IDs. Note that now that we are examining the data by person, we can no longer use the fields “age at death” or “age at diagnosis” without aggregation. A person has many diagnoses and we need to clarify for the code how we want the information to be summarized per person. We had previously used the COUNT function; here we are using the MIN and MAX aggregation functions. In particular, we are taking the minimum value of age at death for each patient and comparing it to the maximum reported age across that patient's diagnoses. A patient is kept only if death occurs after every diagnosis.
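The HAVING version can be tried on a few made-up rows. This sketch uses a hypothetical temporary table #data in place of dbo.data, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data; patient 2 has one diagnosis after death.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (
    id INT
    , diagnosis VARCHAR(30)
    , [Age at death] INT
    , [Age at Dx] INT
);
INSERT INTO #data (id, diagnosis, [Age at death], [Age at Dx])
VALUES (1, 'diabetes', 70, 60)
     , (1, 'hypertension', 70, 65)
     , (1, 'diabetes', 70, 68)
     , (2, 'diabetes', 55, 50)
     , (2, 'neuropathy', 55, 58)
     , (3, 'diabetes', NULL, 45);

-- SELECT ... INTO creates #GoodData, so remove any earlier copy first.
DROP TABLE IF EXISTS #GoodData;

SELECT id
INTO #GoodData
FROM #data
GROUP BY id
HAVING MIN([Age at death]) > MAX([Age at Dx]);

SELECT id
FROM #GoodData;
-- Returns one row: id 1 (70 > 68).
-- id 2 fails because 55 is not greater than 58.
-- id 3 fails because it has no recorded age at death.
```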
Try Your Hand at Coding
SELECT id INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx] GROUP BY ID
Now is a good time to try the code on your data and see how far you get.
Try Your Hand at Coding: Check your errors on the web
SELECT id INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx] GROUP BY ID
Remember that in case of an error you need to search the web to figure out the exact format of the commands and the data. Also note that the names of the fields in your file may be different. A lot may go wrong but you never know until you try. Repeated trial and error can get your SQL code to work.
Try Your Hand at Coding: Repeat using HAVING
SELECT id INTO #GoodData FROM dbo.data WHERE [Age at death]>[Age at Dx] GROUP BY ID
Also repeat the code using HAVING to see what changes. See how WHERE and HAVING lead to different selections.
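If you do not have the course data at hand, the comparison can be rehearsed on a made-up table first. In this sketch the temporary table #data, its values, and the aliases minDeath and maxDx are all hypothetical, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data; patient 2 has one diagnosis after death.
DROP TABLE IF EXISTS #data;
CREATE TABLE #data (
    id INT
    , diagnosis VARCHAR(30)
    , [Age at death] INT
    , [Age at Dx] INT
);
INSERT INTO #data (id, diagnosis, [Age at death], [Age at Dx])
VALUES (1, 'diabetes', 70, 60)
     , (1, 'hypertension', 70, 65)
     , (1, 'diabetes', 70, 68)
     , (2, 'diabetes', 55, 50)
     , (2, 'neuropathy', 55, 58)
     , (3, 'diabetes', NULL, 45);

-- WHERE filters rows first: a patient is kept if ANY row passes.
SELECT id
FROM #data
WHERE [Age at death] > [Age at Dx]
GROUP BY id;
-- Returns id 1 and id 2.

-- HAVING filters groups: a patient is kept only if ALL rows pass.
SELECT id
FROM #data
GROUP BY id
HAVING MIN([Age at death]) > MAX([Age at Dx]);
-- Returns id 1 only.

-- The per-patient summaries explain the difference.
SELECT id
, MIN([Age at death]) AS minDeath
, MAX([Age at Dx]) AS maxDx
FROM #data
GROUP BY id;
-- id 1: minDeath 70, maxDx 68
-- id 2: minDeath 55, maxDx 58
-- id 3: minDeath NULL, maxDx 45
```

Patient 2 passes the WHERE version because one diagnosis precedes death, but fails the HAVING version because another diagnosis is recorded after death.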