Troubles with Text Data


1 Troubles with Text Data
Nicole Ackermann
Spring 2018 Gauss Presentation
Work supported by National Cancer Institute Grant: R01CA A

2 Overview of Presentation
Brief overview of the project, data source, and data structure
Useful SAS® functions and processes discovered along the way
Unresolved issues
Some of the code used for this project is specific to this project and to the structure of our data and surveys; however, I tried to pick out the pieces I think could be the most useful across a multitude of projects.
This was my first time working with text messaging data.

3 Brief Overview of Project
Randomized controlled trial in which a portion of the study involves text messaging surveys
A series of four text surveys is sent to each participant
Each survey has about outgoing messages with replies expected for about 12 messages
Participants have 48 hours to complete each text survey once they receive their first question
Reminders are also sent to participants to complete the surveys

4 Text Data from Twilio
Our team worked with a company that helped set up the text message surveys using the cloud communications platform Twilio
We are able to download logs (as CSV files) of all text messages from Twilio
Data comes in a one-row-per-text format
Desired format of data: one row per participant, with a numeric variable for each text survey question containing the participant’s response to that specific question

5 Starting Data structure
| From | To | Body | Status | SentDate |
|---|---|---|---|---|
|  |  | Welcome to your Text Survey! | Delivered | :04:00 CDT |
|  |  | This is your first question |  | :04:30 CDT |
|  |  | Thanks! Here is my answer: 3 | Received | :04:40 CDT |
|  |  | We are sorry. We do not understand. Please reply using numbers! |  | :04:41 CDT |
|  |  | 3 is my answer |  | :05:03 CDT |
|  |  | This is your second question |  |  |
|  |  | 4  ----Signed Participant--- |  | :05:45 CDT |

From/To are the phone numbers; the Status column tells you which messages are coming in from the participant.
SentDate is very important: it highlights the issue of texts being received and sent at the exact same time.
This is a clean example. We are going to have 500 total participants, so the data comes with many, many texts coming and going at any given time, especially because surveys are always sent out at the same time of day.
Currently, we have about 54,000 texts and are still in the data collection phase of the project.

6 Ending Data structure
ID Ques1 Ques2 Ques3 Ques4 …. Ques14
10001 4 2 9999 ….. 25
10002 3 30.5
10003 120
10004 1 60
10005 .
10006
One row per participant, with the correct answer assigned to each question they answered (missing if they did not answer the question, 9999 if they answered SKIP to any question)

7 Process of data cleaning
Identify the correct response to each survey question ( questions per survey) for each participant, for each of the four text surveys
Issues:
Timing: the system responds quickly, so sometimes there is no difference in the timing of messages
Error messages are sent when an incorrect response is received
Some validation of message responses: the response must start with a digit that matches one of the valid responses
Otherwise the participant receives one of four error messages, depending on which question they are responding to
No validation of what follows the digit

8 Functions Used in Data Cleaning Process
Lag function (with sorting): used to pull the correct answer to survey questions
Find function: used to identify special cases that required further attention, and SKIP messages
‘SKIP’ could be sent by a participant to skip any survey question they did not want to answer
Several functions used in coordination with one another to clean text fields: anydigit, anyalpha, substr, compress, input

9 Lag Function

10 Lag Function
Useful for extracting values from previous rows in a dataset
Use caution with the Lag function: it can sometimes produce unexpected results
This is especially true when dealing with missing values and when executing conditionally
More on the lag function:

11 Lag function (with sorting)
    proc sort data=example1;
      by ID sentdate descending status;
    run;

    data example1;
      set example1;
      Prev_text=lag(body);
    run;

ID was created from the From/To phone number combination. It identifies the participant receiving/sending text messages and is used to separate the survey correspondence for each participant.
Based on the rule that when a valid response to a survey question is received, Twilio then sends the next survey question in the queue.
Given this rule, the lag of the body variable on a question row is the answer to the previous survey question.

12 Lag function (with sorting)
    proc sort data=example1;
      by ID sentdate descending status;
    run;

    data example1;
      set example1;
      by ID;                            /* needed so that first.ID is defined */
      prev_text=lag(body);
      if first.ID then prev_text=' ';   /* prev_text is character, so blank it rather than use . */
    run;

This alternate version matters if you do not want the first row of a new ID to carry the preceding value from a different ID.
Because of how our survey was set up, this was not important to us: the first texts are an introduction, so we are not looking for an answer to a survey question there.
Also, make sure the IF/conditional expression comes AFTER the lag function is used!
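The warning about conditional execution can be illustrated with a small sketch. The dataset `demo` and its values are made up for illustration and are not from the project. LAG keeps its own queue, which is updated only when the function actually executes, so a call placed inside an IF returns the value from the last row where the condition was true rather than from the previous row.

```sas
/* Hypothetical data, not from the project */
data demo;
   input id x;
   datalines;
1 10
1 20
2 30
2 40
;
run;

data lag_demo;
   set demo;
   /* Unconditional call: the queue is updated on every row */
   prev_x = lag(x);
   /* Conditional call: the queue is updated only on rows where id = 2 */
   if id = 2 then prev_x_cond = lag(x);
run;

proc print data=lag_demo noobs;
run;
```

On the third row `prev_x` is 20, but `prev_x_cond` is missing because that is the first time the conditional LAG executes. Calling LAG unconditionally and resetting the result afterwards avoids this.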

13 Lag function (with sorting)
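| From | To | ID | Body | Status | SentDate | Prev_Text |
|---|---|---|---|---|---|---|
|  |  | 10001 | Welcome to your Text Survey! | Delivered | :04:00 CDT |  |
|  |  | 10001 | This is your first question |  | :04:30 CDT | Welcome to your Text Survey! |
|  |  | 10001 | Thanks! Here is my answer: 3 | Received | :04:40 CDT | This is your first question |
|  |  | 10001 | We are sorry. We do not understand. Please reply using numbers! |  | :04:41 CDT | Thanks! Here is my answer: 3 |
|  |  | 10001 | 3 is my answer |  | :05:03 CDT | We are sorry. We do not understand. Please reply using numbers! |
|  |  | 10001 | This is your second question |  |  | 3 is my answer |
|  |  | 10001 | 4  ----Signed Participant--- |  | :05:45 CDT | This is your second question |

The row with Body = ‘This is your second question’ contains the correct answer (Prev_Text = ‘3 is my answer’) to the previous question in survey order (‘This is your first question’).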
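One way the rule above could be turned into code is sketched here. This is a minimal illustration, not the project's actual program: the dataset names `demo_log` and `demo_answers`, the sample rows, and the assumption that every question body starts with "This is your" are all hypothetical.

```sas
/* Hypothetical miniature of the sorted log; real logs also carry
   From, To, Status, and SentDate */
data demo_log;
   infile datalines dsd truncover;
   input id body :$60.;
   datalines;
10001,This is your first question
10001,3 is my answer
10001,This is your second question
10001,4 ok
10001,This is your third question
;
run;

data demo_answers;
   set demo_log;
   by id;
   length prev_text $60;
   prev_text = lag(body);              /* LAG runs on every row */
   if first.id then prev_text = ' ';   /* do not carry text across participants */
   /* A later question row implies the previous reply was accepted */
   if find(body, 'This is your', 'i') = 1 and not first.id then output;
   keep id body prev_text;
run;
```

Each output row pairs a question with the accepted reply to the question before it, which is the value that still needs cleaning into a number.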

14 Find Function

15 Find Function
Syntax: FIND(string, substring <, start-position> <, modifier(s)>)
Searches the string for the substring and returns the position of the substring
If the substring is not found in the string, the function returns 0
In this example, we only care whether the string contains the substring (not its position), which is why we use ‘ge 1’

16 Find function

    data example3;
      set example2;
      array varlist [*] ques1-ques14;
      do i=1 to dim(varlist);
        /* Need to flag cases where there is a range entered */
        if find(varlist[i], '-','i') ge 1 or find(varlist[i],'to','i') ge 1 then PossibleRange=1;
        /* Account for SKIP: make those '9999' */
        if find(varlist[i], 'SKIP','i') ge 1 then varlist[i]='9999';
      end;
      drop i;
    run;

Dataset contains 16 variables, answers to questions
Participants could text SKIP if they wanted to skip a question; we will want to make those special missing later down the road
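The later step of turning 9999 into a special missing value could look like the sketch below. It assumes the answers have already been converted to numeric variables; the dataset names, the two-question layout, and the choice of `.S` for "skipped" are assumptions made for illustration.

```sas
/* Hypothetical numeric data: 9999 marks a SKIP reply */
data demo_skip;
   input id ques1 ques2;
   datalines;
10001 4 9999
10002 9999 3
;
run;

data demo_skip2;
   set demo_skip;
   array q[*] ques1-ques2;
   do i = 1 to dim(q);
      /* .S is a special missing value, used here to mean "skipped" */
      if q[i] = 9999 then q[i] = .S;
   end;
   drop i;
run;
```

A special missing value keeps skipped questions out of means and counts while still distinguishing them from questions that were never answered (ordinary missing).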

17 Find Function
FIND(string, substring <, start-position> <, modifier(s)>)
Syntax used in this example:
String: varlist[i]
Substring: looking for ‘-’, ‘to’, or ‘SKIP’ in the previous example
Start position: not used in this example (this is an optional argument)
Modifiers (also an optional argument): we used the ‘i’ modifier, the one I use most often, which ignores character case
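A short sketch of what FIND returns with and without the ‘i’ modifier. The variable names and the sample message are made up for illustration.

```sas
data _null_;
   length msg $20;
   msg = 'Skip this one';
   pos_default = find(msg, 'SKIP');        /* case-sensitive: no match, returns 0 */
   pos_ignore  = find(msg, 'SKIP', 'i');   /* case ignored: match at position 1 */
   pos_range   = find('10 - 20', '-');     /* hyphen is at position 4 */
   put pos_default= pos_ignore= pos_range=;
run;
```

Note that a substring such as 'to' also matches inside longer words, so a flag built this way marks rows for review rather than confirming a range.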

18 Find Function
ID Ques1 ….. Ques14
10001 1 10 - 20
10002 30
10003 SKIP
10004 4 15 mins
10005 4, thks 30 – 45
10006 3 maybe
10007

ID Ques1 ….. Ques14 PossibleRange
10001 1
10002 30 .
10003 9999 10
10004 4 15 mins
10005 4, thks 30 – 45
10006 3 maybe
10007

19 Cleaning Text Fields
Sometimes one function on its own may not seem that useful, but combined with other functions it becomes very powerful
Functions: anydigit, anyalpha, substr, compress, input

20 Cleaning text fields
Goal: pull the first number out of a text field
Functions used:
Anydigit: returns the first position at which a digit is found within a string
Anyalpha: returns the first position at which an alphabetic character is found within a string
Substr: returns a specified substring from within a string variable
Compress: many ways to use this function, but generally it removes certain characters from within a string
Input: converts character values to numeric values
In anydigit and anyalpha you can also specify a starting position
Compress (by default) REMOVES characters from a string; however, in this example, we are using it to KEEP certain characters

21 Cleaning Text Fields - Code
    data example5;
      set example4;
      start = anydigit(ques1);
      end = anyalpha(ques1);
      new = substr(ques1,start, (end-start));
      new2 = compress(new,'.','dk');
      new3 = input(new2, best8.);
    run;

Substr example: searching within the string ques1, starting at position ‘start’ (pulled from anydigit, which is where the first digit appears); the length is end - start.
Compress example: compressing within the string new, with ‘.’ as the character in question. Because we use the special modifier ‘k’, SAS KEEPS this character. The special modifier ‘d’ tells SAS to add digits to the list of characters, and again, because of the ‘k’ modifier, SAS KEEPS digits in this example.
Done in intermediary steps to show the processing of the code; this could be made shorter by combining functions (functions within functions).
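The steps can be combined, as the notes suggest. The following is a minimal sketch and not the project's code: the dataset names (`demo_clean`, `demo_clean2`) and the variables `stop`, `piece`, and `first_num` are hypothetical. It also starts the ANYALPHA search at the first digit and handles replies with no letter after the number, where `end - start` would otherwise be negative.

```sas
/* Hypothetical replies modeled on the examples in this deck */
data demo_clean;
   infile datalines truncover;
   input ques1 $20.;
   datalines;
20.5 minutes
4 not 5
60
5 days
1-agree
;
run;

data demo_clean2;
   set demo_clean;
   length piece $20;
   start = anydigit(ques1);
   if start > 0 then do;
      /* First letter at or after the first digit */
      stop = anyalpha(ques1, start);
      /* No letter after the digit: take the rest of the string */
      if stop = 0 then piece = substr(ques1, start);
      else piece = substr(ques1, start, stop - start);
      /* Keep only digits and periods, then convert to numeric */
      first_num = input(compress(piece, '.', 'dk'), 8.);
   end;
   drop start stop piece;
run;
```

For these five rows `first_num` comes out as 20.5, 4, 60, 5, and 1. Replies with no digit at all are left as missing.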

22 Cleaning Text Fields - Data
ID ques1
10001 20.5 minutes
10002 4 not 5
10003 60
10004
10005 5 days
1006 1-agree

ID ques1 start end new new2 new3
10001 20.5 minutes 1 6 20.5
10002 4 not 5 3 4 4.0
10003 60 60.0
10004 2 7 7.0
10005 5 days 5 5.0
1006 1-agree 1- 1.0

Describe processing of code here
Presenting the intermediary variables so you can understand how the code is working
new2 is a character variable; new3 is a numeric variable

23 More on the Substr Function
Syntax: SUBSTR(string, position <, length>)
In this example:
String: ques1
Position: using the created ‘start’ variable as the starting position
Length: using (end-start) as the length
This gives us the value from the first digit up to the first alphabetic character
See the SAS documentation for other helpful tips on this function

24 More on the Compress Function
Syntax: COMPRESS(source <, characters> <, modifier(s)>)
In this example:
Source (variable): new
Characters: ‘.’ (we want to specify any instances of a decimal point/period)
Modifiers:
‘d’: adds digits to the list of characters
‘k’: tells SAS we want to KEEP the characters in the list, not remove them; all other characters are removed
Other modifiers that might be useful:
‘a’: adds alphabetic characters to the list of characters
‘i’: ignores the case of the characters to be kept or removed
‘n’: adds digits, the underscore character, and English letters to the list of characters
‘s’: adds space characters (blank, horizontal tab, vertical tab, carriage return, line feed, form feed, and NBSP ('A0'x, or 160 decimal ASCII)) to the list of characters
‘t’: trims trailing blanks from the first and second arguments
See the SAS documentation for the full list of modifiers
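A brief sketch contrasting the default behavior with a few of the modifiers listed above. The sample string and variable names are made up for illustration.

```sas
data _null_;
   length raw $30;
   raw = 'About 20.5 min!';
   keep_num   = compress(raw, '.', 'dk');  /* keep digits and periods: 20.5 */
   drop_alpha = compress(raw, , 'a');      /* remove letters, keep everything else */
   drop_space = compress(raw);             /* default: remove blanks only */
   put keep_num= drop_alpha= drop_space=;
run;
```

The second argument can be left empty when the modifiers alone describe the characters of interest, as in the `'a'` call.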

25 Adding prefixes and suffixes to list of variables
    proc sql;
      select cats(name,'=',name,'S1')
        into :list separated by ' '
        from dictionary.columns
        where libname = 'WORK' and memname = 'EXAMPLE3' and upcase(name) ne 'ID';
    quit;

    proc datasets library = WORK nolist;
      modify EXAMPLE3;
      rename &list;
    quit;

One of my favorite pieces of code
You can also add the noprint option to the proc sql statement if you do not want to print your list; I print mine to double-check that the correct variables are assigned
You are assigning the renaming code (that we call in PROC DATASETS) to the macro variable ‘list’ using the dictionary tables in PROC SQL
This was used in the project when creating wide datasets, to clarify which variables came from which of the 4 text surveys
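To see what the macro variable holds before it is applied, the list can be written to the log. This is a sketch on a made-up dataset (`demo_rename` and its variables are hypothetical); the suffix `S1` follows the example above.

```sas
/* Hypothetical dataset used only to show what &list resolves to */
data work.demo_rename;
   id = 10001;
   ques1 = 4;
   ques2 = 2;
run;

proc sql noprint;
   select cats(name, '=', name, 'S1')
      into :list separated by ' '
      from dictionary.columns
      where libname = 'WORK'
        and memname = 'DEMO_RENAME'
        and upcase(name) ne 'ID';
quit;

/* Write the generated rename pairs to the log before applying them */
%put list resolves to: &list;

proc datasets library=work nolist;
   modify demo_rename;
   rename &list;
quit;
```

The log line should show one old=new pair per non-ID variable, for example `ques1=ques1S1`. Changing `'S1'` to another suffix, or placing the literal before `name` in the second position, gives the other surveys' suffixes or a prefix instead.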

26 Unresolved issues
A more efficient way to handle emojis:
We did not run into many in our data, so they were not time-consuming to fix manually in this instance
Some emojis do not cause an issue when importing via CSV into SAS®, but others do
Discovered in talking with SAS® Help: SAS® was able to process the "heart" emoji, but not the "sad face" emoji
Has anyone in the audience run into a similar issue?

27 Acknowledgements
Principal Investigator: Erika Waters, PhD, MPH
Project Manager: Julia Maki, PhD
Work supported by National Cancer Institute Grant: R01CA A
SAS® Help & SAS® 9.4 Online Documentation

28 Questions? Thank you!

