Introduction to Textual Analysis
  Mikal Eckstrom and Gabi Kirilloff
  Digital Humanities Bootcamp 2016
What we are covering
  What is textual analysis
  How it strengthens the humanities
  Its application in the classroom and to your research
  Terminology
  Various online methods
    http://textalyser.net/
    http://docs.voyant-tools.org/tools/links/
    https://books.google.com/ngrams/
Text Analysis
  “It’s not that we no longer read books, but we now have new ways of studying them in their natural habitat.” Matthew Jockers (2013)
  “But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” Noam Chomsky (1969)
  Mikal
What is text analysis?
  Analyzing text(s) computationally, using new methodologies to construct new meaning from an existing written work or set of works.
  Mikal
Text as Science
  We often have a hypothesis, even as close readers
  We have conclusions; even our own worst paper has conclusions
  Now, with text analysis or data mining, we have data, as scientists do
  Like scientists, digital humanists seek to discover new evidence and meaning in texts, whatever the scale of the corpus
  Mikal
Mikal
Terminology
  Sentence: unit of written language
  Utterance: unit of spoken language
  Word form: the inflected form as it actually appears in the corpus
  Lemma: an abstract form shared by word forms that have the same stem, part of speech, and word sense; it stands for the class of words with the same stem
  Types: number of distinct words in a corpus (vocabulary size)
  Tokens: total number of words
  Mikal
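The difference between types and tokens is easiest to see by counting them. The following is a minimal sketch in Python, assuming a deliberately naive tokenizer (lowercase, letters and apostrophes only); the sample sentence and the `tokenize` helper are illustrative and not part of the workshop materials. It counts word forms, not lemmas, so "sat" and "sit" would be separate types.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (naive: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


# Illustrative sample text (an assumption, not taken from a real corpus)
sample = "The cat sat on the mat. The dog sat too."

tokens = tokenize(sample)   # every running word, repeats included
types = set(tokens)         # distinct word forms (the vocabulary)
counts = Counter(tokens)    # frequency of each type

print("Tokens:", len(tokens))
print("Types:", len(types))
print("Type-token ratio:", len(types) / len(tokens))
print("Most frequent:", counts.most_common(3))
```

The type-token ratio printed at the end is one rough indicator of vocabulary richness; it is sensitive to text length, so comparisons work best on samples of similar size.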
What Text Analysis Enables
  What you can do:
    Categorize and cluster documents
    Compare and contrast vocabulary
    Examine syntactical relationships
    Entity recognition
  This can allow you to:
    Examine differences based on metadata
    Examine voice and style
    Create geographic maps and helpful visualizations
  Gabi
Clustering and Examining Similarity
  Context words
  High-frequency words
  Punctuation
  Sentence length
  Gabi
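One common way to turn word frequencies into a similarity score is cosine similarity over relative-frequency vectors. The sketch below is an illustration under stated assumptions, not the method used by any of the tools in this deck: the three short "documents" are invented, and `word_freqs` and `cosine_similarity` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def word_freqs(text):
    """Map each word in the text to its relative frequency."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (1.0 = same profile, 0.0 = no shared words)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example documents (assumptions for illustration only)
doc_a = "the castle stood on the hill and the steps led up to the castle"
doc_b = "the old steps led up the hill to the castle"
doc_c = "popcorn sales have doubled say theatre owners"

freq_a, freq_b, freq_c = word_freqs(doc_a), word_freqs(doc_b), word_freqs(doc_c)

print("a vs b:", cosine_similarity(freq_a, freq_b))
print("a vs c:", cosine_similarity(freq_a, freq_c))
```

Documents with similar scores can then be grouped by a clustering method. The same approach extends to the other features on this slide, such as punctuation counts or average sentence length.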
Exploring Syntactical Relationships “He quickly ran up the old steps to the castle.” Gabi
Word Clouding | Text Analysis
  [Figure: word clouds labeled American Indian Male, Jewish Male, Jewish Female]
  Mikal
Data Collection
  Getting good data is trickier than you think
    Large corpus
    Metadata
    Clean text
  Where to find data
    HathiTrust
    Internet Archive
    Project Gutenberg
    Women Writers Project
  Gabi and Mikal
Martha Ballard’s Diary http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/ Mikal
Textalyser http://textalyser.net/ Mikal
Voyant http://voyant-tools.org/ Gabi
WordSeer http://wordseer.berkeley.edu/ Gabi
Stanford Tools
  NER: http://nlp.stanford.edu:8080/ner/
  DParse: http://nlp.stanford.edu:8080/parser/
  Gabi
N-Grams https://books.google.com/ngrams/ Mikal
Human Word Prediction
  Clearly, at least some of us have the ability to predict future words in an utterance. How?
    Domain knowledge: red house vs. red hat
    Syntactic knowledge: the…<adj|noun>
    Lexical knowledge: baked <steak vs. cake>
  Mikal
Useful Applications for N-Grams
  Why do we want to predict a word, given some preceding words?
  Rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition (ASR)
    Theatre owners say popcorn/unicorn sales have doubled...
  Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation
    The doctor recommended a cat scan.
    El doctor recomendó una exploración del gato.
  Mikal
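The popcorn/unicorn example can be sketched with a tiny bigram model. This is a minimal illustration, assuming an invented four-sentence training corpus and unsmoothed maximum-likelihood estimates, P(word | previous) = count(previous, word) / count(previous). The names `bigram_prob` and `sentence_prob` are hypothetical helpers, and a real speech recognizer would train on far more text.

```python
from collections import Counter

# Invented toy corpus (an assumption for illustration only)
corpus = [
    "theatre owners say popcorn sales have doubled",
    "popcorn sales rose again this year",
    "owners say ticket sales have doubled",
    "the unicorn is a mythical animal",
]

context_counts = Counter()  # how often each word appears as a preceding word
bigram_counts = Counter()   # how often each (previous, word) pair appears

for sentence in corpus:
    # <s> and </s> mark the start and end of each sentence
    words = ["<s>"] + sentence.split() + ["</s>"]
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))


def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(sentence):
    """Probability of a whole sentence as the product of its bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob


popcorn = "owners say popcorn sales have doubled"
unicorn = "owners say unicorn sales have doubled"

print("popcorn:", sentence_prob(popcorn))
print("unicorn:", sentence_prob(unicorn))
```

Because "say unicorn" and "unicorn sales" never occur in the toy corpus, the unicorn hypothesis scores zero and the popcorn hypothesis is ranked higher.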
Coding (and why you might want to consider it)
  Custom questions may call for custom methods
  Understanding the options available to you can make it easier to envision new research questions
  R
    Statistical language
    Works with plain text and XML
    Very easy to create complex visualizations
  Python
  Gabi and Mikal
Limitations and Constraints
  “Flattening” data and obscuring information
  Corpus selection bias
  Imperfect datasets
  Gabi
Summary
  Text analysis can allow us to derive new meaning from text
  It can help us visually understand the relationships between various texts, tokens, and data sets
  N-gram probabilities can be used to estimate the likelihood
    Of a word occurring in a context (the preceding N-1 words)
    Of a sentence occurring at all
  Smoothing techniques deal with the problem of words unseen in the corpus
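The slides do not name a specific smoothing technique. As one simple illustration, the sketch below uses add-one (Laplace) smoothing, which adds 1 to every bigram count so that unseen word pairs receive a small nonzero probability: P(word | previous) = (count(previous, word) + 1) / (count(previous) + V), where V is the vocabulary size. The toy corpus and the helper names `mle_prob` and `laplace_prob` are assumptions made for this example.

```python
from collections import Counter

# Invented toy corpus (an assumption for illustration only)
corpus = [
    "the doctor recommended a scan",
    "the doctor recommended rest",
    "a cat sat on the mat",
]

vocab = set()
context_counts = Counter()  # how often each word appears as a preceding word
bigram_counts = Counter()   # how often each (previous, word) pair appears

for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))

V = len(vocab)  # vocabulary size (number of types)


def mle_prob(prev, word):
    """Unsmoothed estimate: unseen pairs get probability 0."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def laplace_prob(prev, word):
    """Add-one smoothed estimate: every pair gets at least 1 / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)


# "cat scan" never occurs in the toy corpus
print("Unsmoothed P(scan | cat):", mle_prob("cat", "scan"))
print("Smoothed   P(scan | cat):", laplace_prob("cat", "scan"))
```

Add-one smoothing is easy to follow but shifts a lot of probability mass to unseen pairs when the vocabulary is large, so larger projects tend to use more refined techniques.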
Resources
  Stanford Lit Lab Pamphlets: http://litlab.stanford.edu/LiteraryLabPamphlet4.pdf
  Ted Underwood: http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/
  Lincoln Mullen: http://lincolnmullen.com/projects/dh-r/
Example Exercise Split into groups of 3 or 4 people and take 10 minutes to use Voyant to explore your text. Report to the group at least 1 interesting finding.