Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Integration: Excel Data Manipulation Tips & Techniques

Similar presentations


Presentation on theme: "Data Integration: Excel Data Manipulation Tips & Techniques"— Presentation transcript:

1 Data Integration: Excel Data Manipulation Tips & Techniques
Matthew D. Simmonds, Simtech Solutions Inc., President

2 Overview: Data Manipulation Techniques in Excel.
Database conversions are not a “flip of the switch” process and require some work to get the data from the source database into the target database. This presentation will outline tips and techniques used in Microsoft Excel to prepare data for loading into a target database or for performing offline data cleansing.

3 Why use Excel? Microsoft Excel as it is technically the least common denominator and the most ubiquitous database application around. Most database platforms can save files in comma separated variable (CSV) format which can be easily imported in Excel and worked with.

4 General Rules Save your work in versions so you can revert back if necessary. Any new data added to the system post-export may have to be re-cleansed so you might want to freeze your database. Work with the Excel file only in a secured and control environment. Excel does have its limitations rows to be exact. Audit your work.

5 Data manipulation that can be done in Excel:
Identify duplicate records. Remove duplicate records. Manipulate database columns to match a target format. Populate blank data quality codes. Split up one field into several fields. Check for a middle initial. Strip out undesirable characters. Combine data elements that are stored across multiple columns into one column. Translate values to be consistent with the target format. Identify and remove erroneous service records.

6 I. Identify duplicate records.
To use this technique perform the following: Copy the unique fields you wish to concatenate to blank columns at the beginning of the worksheet. These are considered your “key fields” and should be the key identifiers you use to find a record. (HUD recommends name, DOB, gender and social) Insert another blank column to the left of these columns. Enter the concatenate function (i.e. =CONCATENATE(C2,D2,E2)) in the first row of the column you just created. Copy this formula down the column to the last record. Click on Data | Sort and select the column containing the newly created key. Insert another column. Check for duplicates using the EXACT command. If the new key field is in column A and Column B is blank enter the formula in the first row of Column B as follows =EXACT(A2,A3). Copy the formula down to the remaining cells. The EXACT() function returns TRUE if the values it is given are identical, and FALSE otherwise.  Since it may be difficult to locate all of the cells that have a value of TRUE in them you may wish to highlight the TRUE values using conditional formatting. To do this select the test column and click on Format | Conditional Formatting. Arrange the drop down boxes to read “Cell Value Is” “Equal To” “TRUE”. Select the Format button, followed by the Patterns tab, then choose the color to highlight the cell if the function value is true.

7 II. Remove duplicate records.
After identifying redundant records you have to resolve how to handle them. NOTE: If you are working with one flat table and need to break it into separate tables you MUST copy the client ID from the first record to the duplicated records so everything jives. Once you have performed this step you can copy the table over to the other worksheets where they can be manipulated further as needed. Scroll the worksheet for values of “True”. Compare the matching records to determine how to handle the record as follows: A). If the data is purely redundant (elevator button syndrome) then delete any duplicate records from the table with a value in Exact Match?=“True”. B). If there are potentially valid data in multiple records you may choose to merge the data into one record and delete the other(s).

8 IIIA. Manipulate the database columns as needed to match the desired target format.
Basic Manipulation Inserting/Adding Columns – Simply select the column to the right of the column you wish to add, right click, and select “Insert”. Deleting Columns - This process can be handled simply by selecting the column that is not needed, right clicking and selecting “Delete”. Moving Columns – Select the column you wish to move, right click, select “Cut”, select the column to the right of where you want the column to go, right click, and select “Paste”.

9 IIIB. Manipulate the database columns as needed to match the desired target format.
Advanced Manipulation Create a separate worksheet in the same workbook that has the source data. Copy the target format field names into a header record (the first row) on the worksheet. You may have to use Copy | Paste Special and select Transpose in order to copy field names listed vertically to be listed across the worksheet. Enter the equals sign (=) in the first column and first row under the newly copied file header row. Click over to the source file, select the corresponding field in the first row under the source file’s header row and hit Enter. Repeat this process until you have mapped one field for all of the values in your target database. Copy the row containing these new formulas down your spreadsheet to at least as many rows as you have records in your source file. Create a blank worksheet. Copy the entire worksheet using Select All (the box in top left corner of the workbook) and use Paste Special. Select the option to paste Values only. Save the newly created target file.

10 IV. Populate blank data quality codes.
Fields such as last permanent zip code and social security number require a data quality code. If your staff has not been answering this question you can derive whether or not the code is partial or full based on the length of the data entered using the =LEN() function to count the number of characters. To use this technique perform the following: Insert a column next to the field you want to create codes for. Enter =LEN(FieldYouWantToCheck) into Row 2 of a new column and copy the formula down the column. Sort the database by this new length column. Use either Copy | Paste (CTRL C, CTRL V) or Find|Replace (CTRL F) to change values from lengths to HUD data quality code values.

11 Splitting up one field into several fields
Splitting up one field into several fields. (using full name as an example) To use this technique perform the following: Right-click column heading for the full name field and select Insert. If you are splitting the field Select the name field column then click Data in the menu bar and select Text to Columns. Select the option for Delimited and click Next. In the Delimiters section click Space and/or Comma (depending on how your data is stored) and then click Finish. Note if the data is stored as Simmonds, Matthew D. you will have to insert three columns and use both comma and space as delimiters. When prompted, answer OK to replace the contents of the destination cells. Change the text in the column headers to now accurately reflect the change. (i.e. Last Name, First Name, Middle Name, Suffix).

12 VI. Checking for a middle initial.
If you are breaking up one field, such as Full Name, into several you may have to account for smaller fields such as a middle initial field that will sometimes be incorrectly spliced into other fields. You can use the Len() formula to check this as well. To use this technique perform the following: Enter =Len(cell_you_want_to_check_for_the_middle_ initial) in a blank column. Copy this formula down to the last data row. Sort the data by this newly created length field using Data | Sort. Copy & Paste data values as necessary to separate middle initials from last name fields.

13 VII. Stripping out undesirable characters.
For the sake of data consistency it is common to remove extra characters such as $,-/. If you are trying to link a social security number that has dashes with one that does not you will never get a match. You can use the Find|Replace (CTRL F) function to resolve this. To use this technique perform the following: To ensure the Find | Replace works only over the column you wish to work with copy the data column to a blank worksheet. Highlight the column you want to work with and hit CTRL+F. To remove dashes you would enter “–” in the field labeled Find what:, click the Replace tab and replace the dash with blanks by hitting the space bar once in the Replace with: field and click Replace. After performing the Find|Replace copy the properly formatted data back into the original worksheet.

14 VIII. Combining Data Elements Across Multiple Columns to one Column
Race American Indian or Alaskan Native Asian Black or African Hawaiian or Other Pacific Islander White =CONCATENATE(B2,C2,D2,E2,F2) FALSE TRUE To use this technique: Create a blank column and use the =Concatenate formula as shown above to merge the values. Copy this formula down to the last data row. Use Copy | Paste Special to copy the newly created values to another blank column. Use Find | Replace to change the merged values with valid values. (ie replace FALSEFALSEFALSEFALSETRUE with “White”) Repeat this process until all merged values are replaced with valid values.

15 IX. Translating Values to be consistent with a target format.
To use this technique: Create an Excel workbook with three worksheets. The first will contain the values from the source table. The second will be a Field Value Map table to match values used within the source and target databases. The third table will contain the formulas need to translate your table. Map the source values to the target values on the Field Value Map worksheet similar to the table shown. NOTE: You MUST have your source values listed in ascending order (alphabetically or numerically) for this to work properly. Table Field Source Value New Target Value Client ID Race an Alaskan Native ai American Indian A Asian B Black/African American nh Native Hawaiian pi Other Pacific Islander W White O Other Multi-Racial

16 IX. Translating Values to be consistent with a target format
IX. Translating Values to be consistent with a target format. (continued) To use this technique: For any fields that need value translations use the =VLOOKUP function in Excel to translate the source database values to the target values. For values that are not to be translated you can simply use the = technique described in IIIB. Open a blank workbook. Use Copy | Paste Special and select to paste the Values on your target worksheet to the new workbook. Save the newly translated file.

17 X. Removing erroneous service records.
A B C D E Client ID SERVICE NAME AIRS CODE START DATE END DATE 3 108493 Bed BH-180 10/14/2003 10/15/2003 4 4/13/2005 5/23/2005 5 5/25/2005 6 5/26/2005 Quite often it is difficult to figure out when a client has left (especially for programs such as street outreach) so the data records for a recorded service end up having blanks for an exit date. Blanks can be valid if you are still servicing the client. One way to check for erroneous blank conditions is to determine if you have provided the same client with the same service since the record where the blank exit occurs. You can do this using the =AND function as shown.

18 Conclusion Microsoft Excel is filled with powerful functions that can be used to help cleanse your HMIS database. Other functions to look into include filtering out invalid records out using the FILTER command, auditing your data and creating reports using the COUNTIF function and using advanced data validation techniques (Data | Validation) to ensure clean data entry. Performing these tasks and mastering the functionality available within Excel will help ensure the integrity of your HMIS database.


Download ppt "Data Integration: Excel Data Manipulation Tips & Techniques"

Similar presentations


Ads by Google