Troubles with Text Data


Troubles with Text Data
Nicole Ackermann
Spring 2018 Gauss Presentation
Work supported by National Cancer Institute Grant: R01CA19039104-A

Overview of Presentation
Brief overview of project and data source/data structure
Useful SAS® functions and processes discovered along the way
Unresolved issues
Some of the code used for this project is specific to this project and the structure of our data and surveys; however, I tried to pick out the pieces I think could be the most useful across a multitude of projects. This was my first time working with text messaging data.

Brief Overview of Project
Randomized controlled trial in which a portion of the study involves text messaging surveys
A series of four text surveys is sent to each participant
Each survey has about 15-20 outgoing messages, with replies expected for about 12 messages
Participants have 48 hours to complete each text survey once they receive their first question
Reminders are also sent to participants to complete the surveys

Text Data from Twilio
Our team worked with a company that helped set up the text message surveys using the cloud communications platform Twilio
We are able to download logs (as CSV files) of all text messages from Twilio
Data come in a one-row-per-text format
Desired format of data:
One row per participant
Numeric variables for each text survey question containing the participant’s response to that specific question

Starting Data Structure
From | To | Body | Status | SentDate
5555551234 | 5555555678 | Welcome to your Text Survey! | Delivered | 2018-04-23 19:04:00 CDT
 | | This is your first question | | 2018-04-23 19:04:30 CDT
 | | Thanks! Here is my answer: 3 | Received | 2018-04-23 19:04:40 CDT
 | | We are sorry. We do not understand. Please reply using numbers! | | 2018-04-23 19:04:41 CDT
 | | 3 is my answer | | 2018-04-23 19:05:03 CDT
 | | This is your second question | |
 | | 4 ----Signed Participant--- | | 2018-04-23 19:05:45 CDT
From/To are the phone numbers; the Status column shows which messages came in from the participant. SentDate is very important: it highlights the issue of texts being received and sent at the exact same time. This is also a clean example. We are going to have 500 total participants, so the data contain many, many texts coming and going at any given time, especially because surveys are always sent out at the same time of day. Currently, we have about 54,000 texts and are still in the data collection phase of the project.

Ending Data Structure
ID Ques1 Ques2 Ques3 Ques4 … Ques14
10001 4 2 9999 … 25
10002 3 30.5
10003 … 120
10004 1 60
10005 .
10006
One row per participant, with the correct answer assigned to each question they answered (missing if they did not answer the question, 9999 if they answered SKIP to any question)

Process of Data Cleaning
Identify the correct response to each survey question (14-15 questions per survey) for each participant, for each of the four text surveys
Issues:
Timing: the system responds quickly, so sometimes there is no difference in the timing of messages
Error messages are sent when an incorrect response is received
Some validation of message responses: the response must begin with a digit that matches one of the valid responses
Otherwise the participant receives one of four error messages, depending on which question they are responding to
No validation of what follows the digit

Functions Used in Data Cleaning Process
Lag function (with sorting): used to pull the correct answer to survey questions
Find function: used to identify SKIP messages and special cases that required further attention
‘SKIP’ could be sent by a participant to skip any survey question they did not want to answer
Several functions used in coordination with one another to clean text fields:
Anydigit, anyalpha, substr, compress, input

Lag Function

Lag Function
Useful for extracting values from previous rows in a dataset
Important to use caution with the Lag function: it can sometimes produce unexpected results
This is especially true when dealing with missing values and when executing conditionally
More on the lag function:
http://support.sas.com/resources/papers/proceedings09/055-2009.pdf
http://documentation.sas.com/?cdcId=pgmmvacdc&cdcVersion=9.4&docsetId=lefunctionsref&docsetTarget=n0l66p5oqex1f2n1quuopdvtcjqb.htm&locale=en

Lag Function (with sorting)
    proc sort data=example1;
      /* descending status: 'Received' sorts ahead of 'Delivered' when two texts share a timestamp */
      by ID sentdate descending status;
    run;
    data example1;
      set example1;
      Prev_text=lag(body);
    run;
ID was made from the From/To phone number combination. It identifies the participant receiving/sending text messages and is used to separate survey correspondence for each participant.
This is based on the rule that when a valid response to a survey question is received, Twilio then sends the next survey question in the queue. Given this rule, the lag of the body variable on a question row is the answer to the previous survey question.

Lag Function (with sorting)
    proc sort data=example1;
      by ID sentdate descending status;
    run;
    data example1;
      set example1;
      by ID;                             /* needed so that first.ID is defined */
      prev_text=lag(body);
      if first.ID then prev_text=' ';    /* prev_text is character, so blank it rather than using . */
    run;
This alternate way is important if you do not want the first row of a new ID value to carry the preceding value from a different ID.
Because of how our survey was set up, this is not important to us: the first texts are introductions, so we are not looking for an answer to a survey question there.
Also, make sure the if/conditional expression comes AFTER the lag function is used!
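
The caution about conditional execution can be illustrated with a small sketch. The dataset demo and its variables are made up for illustration and are not part of the project data. LAG keeps a queue that is updated only when the function actually executes, so a call placed inside an IF returns the value from the last time it ran, which is not necessarily the previous row.

```sas
/* Hypothetical toy data, already sorted by id */
data demo;
  input id $ value;
  datalines;
A 1
A 2
B 3
B 4
;
run;

data lag_demo;
  set demo;
  by id;
  /* Unconditional call: always returns the previous row's value.
     prev_ok is ., 1, 2, 3 for the four rows. */
  prev_ok = lag(value);
  /* Conditional call: the queue only advances on rows where this line runs.
     prev_cond is ., ., ., 2 (the last row gets 2, not the previous row's 3). */
  if not first.id then prev_cond = lag(value);
run;

proc print data=lag_demo;
run;
```

Calling LAG on every row and blanking the result afterwards, as in the slide's code, avoids this behavior.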

Lag Function (with sorting)
From | To | ID | Body | Status | SentDate | Prev_Text
5555551234 | 5555555678 | 10001 | Welcome to your Text Survey! | Delivered | 2018-04-23 19:04:00 CDT |
 | | | This is your first question | | 2018-04-23 19:04:30 CDT |
 | | | Thanks! Here is my answer: 3 | Received | 2018-04-23 19:04:40 CDT |
 | | | We are sorry. We do not understand. Please reply using numbers! | | 2018-04-23 19:04:41 CDT |
 | | | 3 is my answer | | 2018-04-23 19:05:03 CDT |
 | | | This is your second question | | | 3 is my answer
 | | | 4 ----Signed Participant--- | | 2018-04-23 19:05:45 CDT |
The row with Body = ‘This is your second question’ contains the correct answer (Prev_Text = ‘3 is my answer’) to the previous question in survey order (‘This is your first question’).

Find Function

Find Function
Syntax: FIND(string, substring <, start-position> <, modifier(s)>)
Searches the string for the listed substring and returns the position of the substring
If the substring is not found in the string, the function returns 0
In this example, we are only interested in whether the string contains the substring (not its position), which is why we use ‘ge 1’

Find Function
    data example3;
      set example2;
      array varlist [*] ques1-ques14;
      do i=1 to dim(varlist);
        *Need to flag cases where there is a range entered*;
        if find(varlist[i], '-','i') ge 1 or find(varlist[i],'to','i') ge 1 then PossibleRange=1;
        *Account for SKIP - make those '9999'*;
        if find(varlist[i], 'SKIP','i') ge 1 then varlist[i]='9999';
      end;
      drop i;
    run;
Dataset contains 16 variables, answers to questions
Participants could text SKIP if they wanted to skip a question. We will want to make those special missing values later down the road.

Find Function
FIND(string, substring <, start-position> <, modifier(s)>)
Syntax used in this example:
String: varlist[i]
Substring: looking for ‘-’, ‘to’ or ‘SKIP’ in the previous example
No starting position in this example (this is an optional argument)
Modifiers (also an optional argument): we used the ‘i’ modifier, the one I use most often. This modifier ignores character case.
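
As a quick illustration of the return value and the ‘i’ modifier, here is a minimal sketch; the string and variable names are made up for the example.

```sas
data _null_;
  body = 'Please skip this one';
  pos_plain = find(body, 'SKIP');        /* case-sensitive search: no match, returns 0 */
  pos_icase = find(body, 'SKIP', 'i');   /* case ignored: returns 8, the position of 'skip' */
  put pos_plain= pos_icase=;
run;
```

Because a miss returns 0 and a hit returns a position of 1 or more, testing the result with ge 1 (or gt 0) is enough to check whether the substring is present.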

Find Function
ID Ques1 … Ques14
10001 1 10 - 20
10002 30
10003 SKIP
10004 4 15 mins
10005 4, thks 30 – 45
10006 3 maybe
10007
ID Ques1 … Ques14 PossibleRange
10001 1 10 - 20
10002 30 .
10003 9999 10
10004 4 15 mins
10005 4, thks 30 – 45
10006 3 maybe
10007

Cleaning Text Fields
One function on its own may not seem that useful, but combined with other functions it becomes very powerful
Functions: anydigit, anyalpha, substr, compress, input

Cleaning Text Fields
Goal: pull the first number out of a text field
Functions used:
Anydigit: returns the first position at which a digit is found within a string
Anyalpha: returns the first position at which an alphabetic character is found within a string
Substr: returns a specified substring from within a string variable
Compress: many ways to use this function, but generally it removes certain characters from within a string
Input: converts character values to numeric values
In anydigit and anyalpha you can also specify a starting position.
Compress (by default) REMOVES characters from a string; however, in this example, we are using it to KEEP certain characters.

Cleaning Text Fields - Code
    data example5;
      set example4;
      start = anydigit(ques1);
      end = anyalpha(ques1);
      new = substr(ques1,start, (end-start));
      new2 = compress(new,'.','dk');
      new3 = input(new2, best8.);
    run;
Substr example: searching within the string ques1, starting at position ‘start’ (pulled from anydigit, which is where the first digit appears); the length is end - start.
Compress example: compressing within the string new, with ‘.’ as the character in question. Since we are using the special modifier ‘k’, this tells SAS we want to KEEP this character. The special modifier ‘d’ tells SAS to add digits to the list of characters, and again, since we are using the ‘k’ modifier, SAS KEEPS digits in this example.
Done in intermediary steps to show the processing of the code; this could be made shorter by combining functions (functions within functions).

Cleaning Text Fields - Data
ID ques1
10001 20.5 minutes
10002 4 not 5
10003 60
10004 7d@ys
10005 5 days
1006 1-agree
ID ques1 start end new new2 new3
10001 20.5 minutes 1 6 20.5
10002 4 not 5 3 4 4.0
10003 60 60.0
10004 7d@ys 2 7 7.0
10005 5 days 5 5.0
1006 1-agree 1- 1.0
Describe processing of code here
Intermediary variables are presented so you can see how the code is working
new2 is a character variable; new3 is a numeric variable
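
The note about combining functions could look something like the sketch below. This is not the project's code: the dataset example5_short and the variables piece, stop and first_num are hypothetical, and example4 is rebuilt here from a few of the slide's rows so the step runs on its own. It also uses the optional start position of ANYALPHA and calls SUBSTR without a length when no letter follows the digits, so that values such as ‘60’ do not pass a negative length to SUBSTR.

```sas
/* Toy copy of example4, using a few values from the slide */
data example4;
  infile datalines dlm='|';
  input id ques1 :$20.;
  datalines;
10001|20.5 minutes
10002|4 not 5
10003|60
10004|7d@ys
;
run;

data example5_short;
  set example4;
  length piece $20;
  start = anydigit(ques1);
  if start > 0 then do;
    /* look for the first letter at or after the first digit */
    stop = anyalpha(ques1, start);
    if stop > start then piece = substr(ques1, start, stop - start);
    else piece = substr(ques1, start);   /* no letter follows: take the rest */
    /* keep digits and periods, then convert to numeric in one expression */
    first_num = input(compress(piece, '.', 'dk'), best8.);
  end;
  drop start stop piece;
run;

proc print data=example5_short;
run;
```

For these four rows first_num should come out as 20.5, 4, 60 and 7. Rows with no digit at all are left missing.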

More on the Substr Function
Syntax: SUBSTR(string, position <, length>)
In this example:
String: ques1
Position: using the created ‘start’ variable as the starting position
Length: using (end-start) as the length
This gives us the value from the first digit up to the first alphabetic character
See the SAS documentation for other helpful tips on this function: (http://documentation.sas.com/?docsetId=lefunctionsref&docsetVersion=9.4&docsetTarget=n0n08xougp40i5n1xw7njpcy0a2b.htm&locale=en)

More on the Compress Function
Syntax: COMPRESS(source <, characters> <, modifier(s)>)
In this example:
Source (variable): new
Characters: ‘.’ (we want to specify any instances of a decimal point/period)
Modifiers:
‘d’: adds digits to the list of characters
‘k’: tells SAS we want to KEEP the listed characters, not remove them; all other characters are removed
Other modifiers that might be useful:
‘a’: adds alphabetic characters to the list of characters
‘i’: ignores the case of the characters to be kept or removed
‘n’: adds digits, the underscore character, and English letters to the list of characters
‘s’: adds space characters (blank, horizontal tab, vertical tab, carriage return, line feed, form feed, and NBSP ('A0'x, or 160 decimal ASCII)) to the list of characters
‘t’: trims trailing blanks from the first and second arguments
See the SAS documentation for the full list of modifiers (http://go.documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n0fcshr0ir3h73n1b845c4aq58hz.htm&docsetVersion=3.2&locale=en)
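
A small sketch of how the modifiers change what COMPRESS returns; the input string and variable names are invented for illustration.

```sas
data _null_;
  raw = '30.5 mins (approx)';
  digits_and_dot = compress(raw, '.', 'dk');   /* keep digits and periods: 30.5 */
  letters_only   = compress(raw, , 'ak');      /* keep alphabetic characters: minsapprox */
  no_spaces      = compress(raw);              /* default: remove blanks: 30.5mins(approx) */
  put digits_and_dot= letters_only= no_spaces=;
run;
```

Leaving the second argument empty, as in the letters_only line, lets the modifiers alone define the character list.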

Adding Prefixes and Suffixes to a List of Variables
    proc sql;
      select cats(name,'=',name,'S1')
        into :list separated by ' '
        from dictionary.columns
        where libname = 'WORK' and memname = 'EXAMPLE3' and upcase(name) ne 'ID';
    quit;
    proc datasets library = WORK nolist;
      modify EXAMPLE3;
      rename &list;
    quit;
One of my favorite pieces of code
You can also add the NOPRINT option to the PROC SQL statement if you do not want to print your list. I print mine to double-check that the correct variables are assigned.
You are assigning the renaming code (that we call in PROC DATASETS) to the macro variable ‘list’ using the dictionary tables in PROC SQL
This was used in the project when creating wide datasets, to clarify which variables came from which of the 4 text surveys
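
The slide's code adds a suffix. A prefix version only changes the order of the arguments to CATS, as in the sketch below. The dataset demo_wide, the prefix S1_ and the macro variable prefix_list are hypothetical names chosen for the example.

```sas
/* Toy dataset standing in for a wide survey file */
data work.demo_wide;
  id = 10001;
  ques1 = 4;
  ques2 = 2;
run;

proc sql noprint;
  /* Build "oldname=S1_oldname" pairs for every variable except ID */
  select cats(name, '=', 'S1_', name)
    into :prefix_list separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'DEMO_WIDE' and upcase(name) ne 'ID';
quit;

/* Write the generated rename list to the log to check it before using it */
%put &=prefix_list;

proc datasets library=work nolist;
  modify demo_wide;
  rename &prefix_list;
quit;
```

With NOPRINT in place, the %put line takes over the job of double-checking the list, which here should read ques1=S1_ques1 ques2=S1_ques2.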

Unresolved Issues
A more efficient way to handle emojis:
We did not run into many in our data, so fixing them manually was not time consuming in this instance
Some emojis do not cause an issue when importing via CSV in SAS®, but others do
Discovered in talking with SAS® Help: SAS® was able to process the "heart" emoji, but not the "sad face" emoji
Has anyone in the audience run into a similar issue?

Acknowledgements
Principal Investigator: Erika Waters, PhD, MPH
Project Manager: Julia Maki, PhD
Work supported by National Cancer Institute Grant: R01CA19039104-A
SAS® Help & SAS® 9.4 Online Documentation

Questions? Thank you!