Mess with Text: textual analysis using AntConc and TagAnt
Liorah Golomb, Humanities Librarian, University of Oklahoma
ResBaz OU 2017
Find text to mess with
  Project Gutenberg (public domain, mostly pre-1923)
  tvsubtitles.net
  Any HTML that can be saved as text using Notepad or any text editor
  PDF and Word docs, using a tool such as AntFileConverter
  Epubs, Kindle books, etc., using a tool such as Calibre
The tools
  AntConc: a freeware corpus analysis toolkit for concordancing and text analysis.
  TagAnt: a freeware part-of-speech (POS) tagger built on TreeTagger (developed by Helmut Schmid).
  Developer: Laurence Anthony, Waseda University
  Laurenceanthony.net
The corpora
  Common Sense by Thomas Paine (1776). Project Gutenberg Ebook #147.
  The Federalist Papers by Alexander Hamilton, John Jay, and James Madison (1787-1788). Project Gutenberg Etext #1404.
  The Tweets of Donald J. Trump, Jan. 20 to Oct. 3, 7:30 a.m., 2017. Via Trump Twitter Archive.
Basic AntConc features
  Create a word list from one or multiple text files
  Use concordance to find specific words
  Compare one corpus to another using tool preferences
  Find a term in context
  Find the words near a term using collocates
  Find the words surrounding a term using clusters/n-grams
  Create a keyword list by comparing a corpus to a larger standard corpus
  Searches can use truncation (*)
  Export results to a text file
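To make the word list, concordance, and n-gram features more concrete, here is a minimal Python sketch of the same ideas. It is not AntConc's implementation: the function names, the simple letters-and-apostrophes tokenizer, and the sample sentence are all assumptions made for illustration, and AntConc's own token definition will likely give different counts on real texts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into tokens of letters and apostrophes.
    (Assumed token definition; AntConc's settings may differ.)"""
    return re.findall(r"[a-z']+", text.lower())

def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, pattern, width=3):
    """Return keyword-in-context lines for every token matching pattern.
    A trailing * acts as truncation, e.g. 'govern*'."""
    if pattern.endswith("*"):
        stem = pattern[:-1]
        matches = lambda tok: tok.startswith(stem)
    else:
        matches = lambda tok: tok == pattern
    lines = []
    for i, tok in enumerate(tokens):
        if matches(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Made-up sample text, not taken from any of the corpora.
sample = "The government governs. The governed watch the government."
sample_tokens = tokenize(sample)
for line in concordance(sample_tokens, "govern*", width=1):
    print(line)
```

Collocates and keyword lists build on the same token counts, so a sketch like this could be extended in that direction.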
Some comparisons
  Thomas Paine used 3817 distinct words in Common Sense (word count ~21736). 2001 words were used only once.
    Use of word: sad=0, awful=1, bad=4, terrible=0, fake=0
  Hamilton et al. used 8608 distinct words in the Federalist Papers (word count ~129423). 2941 words were used only once.
    Use of word: sad=0, awful=4, bad=14, terrible=0, fake=0
  Trump used 4797 distinct words in his tweets* (word count ~29935). 2632 words were used only once.**
    Use of word: sad=15, awful=0, bad=43, terrible=15, fake=114
  * Stripped of retweets but not of date and time stamps. URLs and components of URLs were also not stripped.
  ** Includes strings of letters that were part of a URL.
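The figures above are three standard measures: total word count (tokens), distinct words (types), and words used only once. A rough Python sketch of how such numbers can be computed follows. The function name, the tokenizer, the optional URL stripping, and the sample text are assumptions for illustration; the counts on the slide came from AntConc and would not necessarily be reproduced exactly by this code.

```python
import re
from collections import Counter

def corpus_stats(text, terms=(), strip_urls=False):
    """Count tokens, distinct words, words used once, and selected terms."""
    if strip_urls:
        # Hypothetical cleanup step: drop anything that looks like a web address.
        text = re.sub(r"https?://\S+", " ", text)
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "used_once": sum(1 for c in counts.values() if c == 1),
        "term_counts": {t: counts[t] for t in terms},
    }

# Made-up sample, not an actual tweet.
sample = "Bad news. Sad! Bad, bad https://t.co/abc"
print(corpus_stats(sample, terms=("bad", "sad", "fake")))
print(corpus_stats(sample, terms=("bad", "sad", "fake"), strip_urls=True))
```

Comparing the two printed results shows how URL fragments can inflate both the distinct-word count and the used-once count, which is the caveat noted for the tweet corpus.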
TagAnt part of speech tagger
  Creates a new file labelled as tagged and places it in the same folder as the corpus examined
  Sort vertically to place into a spreadsheet
  Column A=word, Column B=tag, Column C=lemma
  Interpret the tags using a site such as the Penn Treebank Project or (my preference) Georgetown’s
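For anyone who would rather script the spreadsheet step, here is a small Python sketch that turns vertically sorted tagged output into CSV with the three columns described above. It assumes each line holds a word, a tag, and a lemma separated by tabs; check the actual tagged file first, since the separator may differ with the display options chosen. The function names and the sample lines are illustrative only.

```python
import csv
import io

def tagged_to_rows(tagged_text, sep="\t"):
    """Split each line into (word, tag, lemma); skip lines that do not fit."""
    rows = []
    for line in tagged_text.splitlines():
        parts = line.strip().split(sep)
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows

def rows_to_csv(rows):
    """Return CSV text with a header row: word, tag, lemma."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "tag", "lemma"])
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative sample in the assumed word<TAB>tag<TAB>lemma layout.
sample = "The\tDT\tthe\nwords\tNNS\tword\nwere\tVBD\tbe\n"
print(rows_to_csv(tagged_to_rows(sample)))
```

The resulting text can be saved with a .csv extension and opened in any spreadsheet program.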