Using Voyant to Explore Text Data Connected Health Conference December 14, 2016
Contents: Overview; Data; Default Dashboard; Review of Voyant “Tools” (Word Cloud, Reader, Bubble Lines, Links); Break out into groups and explore tools. Overview: Discuss the increasing use of unstructured data (words on websites, social media, text applications) and the need to understand that data quickly.
Overview Rise in unstructured text data (e.g. websites, blogs/forums, social media, SMS applications) Challenge of understanding what is in all that data without some visualization tools Voyant is a browser-based tool that helps you explore different themes and patterns in your data, pinpointing areas for further exploration and analysis Need input from Jill and Nancy for good global health practitioner applications
Data A sample of ~4,700 tweets that mention the word “zika” from Nov 2015 to Nov 2016. The data is ordered by date, so the beginning of the file corresponds to the earliest tweets. Voyant also accepts other data formats, such as: txt, htm, html, xml, doc, docx, rtf, pdf. What questions might you have about what people are saying about Zika? Brainstorm questions: Which Zika prevention measures are people most familiar with? Mosquito avoidance? Sexual transmission prevention? Did people’s understanding of prevention measures change over time (especially as governmental agencies implemented awareness campaigns)?
Retrieve Dataset and Load into Tool 4,700+ tweets that mention “zika” (11/2015 – 11/2016) Copy/paste text or URL in box (text will be retrieved from specified URLs), or click upload and find your file
Default Dashboard 5 Panels: Word Cloud, Text, Trends, Summary, Context. Word Clouds are a visual representation of word counts for each word in your data. The larger the displayed word, the more frequently it occurs in your data. The Reader (the Text panel) shows the actual text of the document you uploaded. Trends shows the distribution of the top 5 words over the course of the document. The horizontal axis shows a default of 10 equal-sized segments of your data (here, the first segment corresponds to your earliest data and the 10th segment should contain your most recent Nov 2016 data). Summary provides the frequency of the top 5 most common words in the data, as well as the number of documents you uploaded, the total number of words, and the number of unique words. Contexts shows each occurrence of a keyword with a bit of surrounding text (the context). It can be useful for studying more closely how terms are used in different contexts.
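The Summary and Trends figures described above can be approximated outside Voyant. The following Python sketch is illustrative only and is not Voyant's implementation: the tokenizer is a deliberate simplification, and the function names (tokenize, summary, trends) and the sample text are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase runs of letters,
    # digits, and apostrophes. Voyant's own tokenization may differ.
    return re.findall(r"[a-z0-9']+", text.lower())

def summary(tokens, top_n=5):
    # Total words, unique words, and the most frequent terms.
    counts = Counter(tokens)
    return {
        "total_words": len(tokens),
        "unique_words": len(counts),
        "top_terms": counts.most_common(top_n),
    }

def trends(tokens, term, segments=10):
    # Split the token list into nearly equal segments (earliest first)
    # and count how often `term` occurs in each one.
    n = len(tokens)
    bounds = [round(i * n / segments) for i in range(segments + 1)]
    return [tokens[bounds[i]:bounds[i + 1]].count(term)
            for i in range(segments)]

# Made-up sample text standing in for the tweet file.
sample = ("zika cases rise. mosquito bites spread zika. "
          "pregnant women avoid mosquito bites. zika cases fall")
tokens = tokenize(sample)
print(summary(tokens))
print(trends(tokens, "zika", segments=3))
```

Because the file is ordered by date, the first value returned by trends corresponds to the earliest tweets and the last value to the most recent ones.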
Word Cloud – Terms to Display Click and slide the icon in the lower left to choose how many terms to display Hover to left of the “?” in the upper right corner and click options Next to “Stopwords”, click “Edit List” Play with this wordcloud Adjust the number of terms you want to display Click on a word and see how the panels to the right of it (text panel and trend panel) update
Wordcloud – Edit Stopwords Stopwords are words that will be filtered out from analysis. Typically, these words do not provide additional insights when exploring our data. Stopwords to add: t.co https zika rt virus. A good starting point is to see which words are most frequently used in your data. In this case, the lower left summary box on the default dashboard shows our top 5 most used words: t.co, https, zika, rt, virus. Because none of those terms provides insights for us, we can remove them (along with a few others). Remember to type only one word per line (hit enter after each word). When you update the list of stopwords, the change applies to every panel in your dashboard. Term Limits: 105
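The effect of a stopword list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Voyant applies stopwords internally: the helper names (load_stopwords, top_terms) and the sample tokens are hypothetical, and the list uses the same one-word-per-line layout as the stopword editor.

```python
from collections import Counter

# One stopword per line, mirroring the layout of the stopword editor.
STOPWORDS_TEXT = """t.co
https
zika
rt
virus"""

def load_stopwords(text):
    # Ignore blank lines and normalize case.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def top_terms(tokens, stopwords, n=5):
    # Drop stopwords first, then count what remains.
    kept = [t for t in tokens if t.lower() not in stopwords]
    return Counter(kept).most_common(n)

# Made-up tokens standing in for the tweet text.
tokens = "rt zika virus cases rise https t.co zika cases pregnant".split()
stopwords = load_stopwords(STOPWORDS_TEXT)
print(top_terms(tokens, stopwords))
```

With the five stopwords removed, the remaining counts are dominated by content words such as "cases" rather than by Twitter boilerplate.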
Reader The Reader tool provides a view of the text from the data. Hover over a word to see the frequency of that word. Click on a word to reveal its distribution throughout your data. Use the search box to find specific words. See the cursor hover over three words in the Reader panel: Babies (83), Symptoms (22), Pregnant (251). Now click on “symptoms” and watch a line graph populate in the grey ribbon below the panel. The line represents the frequency of the word “symptoms” from start to end of the dataset (again, the leftmost part corresponds to the earlier dates and the rightmost part to the later dates). You can use this feature when looking for specific themes in your data.
Bubble Lines How can bubble lines help us? They help us visualize not only how common a word is, but also where it is located in the document, and compare it to other words. Here we look at mosquit* repel* pregnan* sex*. The * is a wildcard that matches any characters after the stem, so we can capture plurals and other forms of the same word. After entering the words, click the box next to Separate Lines for Terms. What do you observe? Bubble Lines displays the frequency and repetition of a word’s use in a dataset. Each line is broken up into equal parts representing the beginning, middle, and end of a document or dataset. Larger bubbles = higher frequency. The left side of the line corresponds to the earliest tweets, starting in Nov 2015, and the rightmost end represents recent data (Nov 2016).
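The wildcard stems above can be mimicked with regular expressions. The sketch below is an approximation for illustration, not Voyant's query engine: wildcard_to_regex and segment_counts are hypothetical helper names, the sample sentence is made up, and the per-segment counts only stand in for bubble sizes.

```python
import re

def wildcard_to_regex(pattern):
    # Escape the stem, then let each * match any run of word characters.
    return re.compile(re.escape(pattern).replace(r"\*", r"\w*"), re.IGNORECASE)

def segment_counts(tokens, pattern, segments=10):
    # Count tokens matching the wildcard pattern in each equal segment.
    rx = wildcard_to_regex(pattern)
    n = len(tokens)
    bounds = [round(i * n / segments) for i in range(segments + 1)]
    return [sum(1 for t in tokens[bounds[i]:bounds[i + 1]] if rx.fullmatch(t))
            for i in range(segments)]

# Made-up tokens standing in for the tweet text.
tokens = ("mosquito repellent helps pregnant women avoid "
          "mosquitoes and pregnancy risks").split()
for stem in ["mosquit*", "repel*", "pregnan*", "sex*"]:
    print(stem, segment_counts(tokens, stem, segments=2))
```

Each list plays the role of one bubble line: a larger count in a segment corresponds to a larger bubble at that position.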
Links Links shows a network graph of higher frequency terms that appear in proximity. Keywords are shown in green and collocates (words in proximity) are shown in orange. Features include: - hovering over keywords shows their frequency in the corpus - hovering over collocates shows their frequency in proximity (not their total frequency) - double-clicking on any word fetches more results - a search box for queries (hover over the magnifying icon for help with the syntax). Let’s hover over some of the frequently occurring words such as cases (329) and health (???). You’ll notice that words that occur close to cases and health are shown with the red web. Now let’s increase the context so that we include more of the words that occur close to the words we’re interested in.
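The idea of collocates and of widening the context can be sketched in Python. This is a simplified illustration, not the algorithm Voyant uses: the collocates function, its context parameter (number of words on each side of the keyword), and the sample tokens are assumptions made for this example.

```python
from collections import Counter

def collocates(tokens, keyword, context=2):
    # Count words that appear within `context` positions of the keyword.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            lo = max(0, i - context)
            window = tokens[lo:i] + tokens[i + 1:i + 1 + context]
            counts.update(w for w in window if w != keyword)
    return counts

# Made-up tokens standing in for the tweet text.
tokens = ("new cases reported health officials confirm cases "
          "in florida health alert").split()
print(collocates(tokens, "cases", context=1).most_common())
print(collocates(tokens, "cases", context=3).most_common())
```

With a context of 1, "health" never appears next to "cases"; widening the context to 3 brings it in as a collocate. Note that these are counts in proximity, not total frequencies, which matches what hovering over a collocate reports.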
Resources Voyant Documentation http://docs.voyant-tools.org/ Description of Tools http://docs.voyant-tools.org/tools/