Published by Branden Turner. Modified over 6 years ago.
Mess with Text: textual analysis using AntConc and TagAnt
Liorah Golomb, Humanities Librarian, University of Oklahoma
ResBaz OU 2017
Find text to mess with
Project Gutenberg (public domain, mostly pre-1923)
tvsubtitles.net
Any HTML that can be saved as text using Notepad or any text editor
PDF and Word docs, using a tool such as AntFileConverter
Epubs, Kindle books, etc., using a tool such as Calibre
The tools
AntConc: A freeware corpus analysis toolkit for concordancing and text analysis.
TagAnt: A freeware part-of-speech (POS) tagger built on TreeTagger (developed by Helmut Schmid).
Developer: Laurence Anthony, Waseda University
Laurenceanthony.net
The corpora
Common Sense by Thomas Paine (1776). Project Gutenberg Ebook #147.
The Federalist Papers by Alexander Hamilton, John Jay, and James Madison ( ). Project Gutenberg Etext #1404.
The Tweets of Donald J. Trump, Jan. 20-Oct. 3, 7:30 a.m., via Trump Twitter Archive.
Basic AntConc features
Create a word list from one or multiple text files
Use concordance to find specific words
Compare one corpus to another using tool preferences
Find a term in context
Find the words near a term using collocates
Find the words surrounding a term using clusters/n-grams
Create a keyword list by comparing a corpus to a larger standard corpus
Can use truncation (*)
Export results to a text file
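For readers who want to see what a word list and a concordance search amount to, here is a minimal Python sketch. It is not AntConc's implementation; the function names (`tokenize`, `word_list`, `concordance`), the simple letters-and-apostrophes tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(text):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()

def concordance(text, pattern, width=3):
    """Return keyword-in-context lines for every token matching `pattern`.

    A `*` in `pattern` works as truncation: `govern*` matches
    `government`, `governs`, and so on.
    """
    tokens = tokenize(text)
    # Escape the search term, then turn the escaped `*` into "any word characters".
    regex = re.compile(re.escape(pattern.lower()).replace(r"\*", "[a-z']*"))
    lines = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Made-up sample text, not taken from any of the corpora.
sample = "The government governs. A bad government governs badly."

for word, count in word_list(sample):
    print(word, count)

for line in concordance(sample, "govern*", width=1):
    print(line)
```

The `width` parameter plays the role of a context window: it sets how many words appear on each side of the matched term.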
Some comparisons
Thomas Paine used 3817 individual words in Common Sense (word count ~21736) words were used only once.
Use of word: sad=0, awful=1, bad=4, terrible=0, fake=0
Hamilton et al. used 8608 individual words in the Federalist Papers (word count ~129423) words were used only once.
Use of word: sad=0, awful=4, bad=14, terrible=0, fake=0
Trump used 4797 individual words in his tweets* (word count ~29935) words were used only once.**
Use of word: sad=15, awful=0, bad=43, terrible=15, fake=114
* Stripped of retweets but not of date and time stamps, URLs, or components of URLs.
** Includes strings of letters that were part of a URL.
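The figures above are the kind of counts that can be approximated with a few lines of Python. The sketch below is an illustration under stated assumptions, not the method used to produce the slide's numbers: `corpus_stats` is a hypothetical helper, the tokenizer is a simple letters-and-apostrophes pattern, and the sample string is made up. A different tokenizer (for example, one that keeps URL fragments or time stamps) would give different totals.

```python
import re
from collections import Counter

def corpus_stats(text, watch_words=()):
    """Count tokens, distinct words, words used only once, and selected words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    return {
        "tokens": len(tokens),                                # total word count
        "types": len(freq),                                   # individual (distinct) words
        "hapaxes": sum(1 for c in freq.values() if c == 1),   # words used only once
        "watch": {w: freq[w] for w in watch_words},           # counts for chosen words
    }

# Made-up sample text, not taken from any of the corpora.
stats = corpus_stats("Bad news is bad. Sad!", ["sad", "bad", "fake"])
print(stats)
```

To compare corpora, the same function can be run on the full text of each file and the resulting dictionaries set side by side.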
TagAnt part of speech tagger
Creates a new file labelled as tagged and places it in the same folder as the corpus examined
Sort vertically to place into a spreadsheet
Column A=word, Column B=tag, Column C=lemma
Interpret the tags using a site such as the Penn Treebank Project or (my preference) Georgetown’s
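The spreadsheet step can also be scripted. The sketch below assumes the vertical tagged file holds one token per line, with word, tag, and lemma joined by a single delimiter character. The underscore default is an assumption and should be changed to match the actual file, and the sample lines are invented for illustration. The helper names `parse_tagged_lines` and `rows_to_csv` are hypothetical.

```python
import csv
import io

def parse_tagged_lines(lines, delimiter="_"):
    """Split each 'word<delim>tag<delim>lemma' line into a (word, tag, lemma) tuple.

    Splitting from the right keeps any delimiter inside the word itself intact.
    Blank or malformed lines are skipped.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.rsplit(delimiter, 2)
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows

def rows_to_csv(rows):
    """Return CSV text with Column A=word, Column B=tag, Column C=lemma."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "tag", "lemma"])
    writer.writerows(rows)
    return buffer.getvalue()

# Invented sample lines in the assumed format (delimiter is an assumption).
sample_lines = ["These_DT_these", "are_VBP_be", "", "times_NNS_time"]
rows = parse_tagged_lines(sample_lines)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Saving `csv_text` to a `.csv` file gives something a spreadsheet program can open directly with the three columns described above.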