Digital Text and Data Processing Week 6
Difficulty vs. enjoyment
Course evaluation Final essay (ca. 4,000 words): report of your individual research project (50%) and critical reflection on digital humanities research (50%). Five “Coding Challenges”, each of which needs to be marked as sufficient.
Homework Write a brief text (max. 500 words) about your individual research project. Answer the following questions: Which texts have you selected for your corpus? Which research question do you intend to answer? Which types of analyses will be most useful for your research question? The course syllabus mentions five possible topics. If you want to focus on another topic, also provide a brief description of your theoretical question.
Applications TDM technologies have been used to study: literary genres (e.g. Sarah Allison et al., Quantitative Formalism: An Experiment); literary characters (e.g. Stephen Ramsay, Reading Machines: Toward an Algorithmic Criticism); date of creation (Richard Forsyth, “Stylochronometry with Substrings, or: A Poet Young and Old”); authorship.
Applications (continued): themes (Martha Nell Smith et al., “‘Undiscovered Public Knowledge’: Mining for Patterns of Erotic Language in Emily Dickinson’s Correspondence with Susan Huntington (Gilbert) Dickinson”); lexical repetitions (T. E. Clement, “‘A Thing Not Beginning and Not Ending’”); rhyme and meter; allusions (N. Coffee et al., “The Tesserae Project: Intertextual Analysis of Latin Poetry”).
Feature extraction Most frequent words; type-token ratio; grammatical categories; repetitions of words; sentiments. Genres; texts with specific themes; literary characters; authorship.
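Two of the features listed above, most frequent words and the type-token ratio, can be computed with very little code. The following is a minimal sketch in Python rather than Perl; the function names and the simple regular-expression tokeniser are illustrative assumptions, not part of the course materials.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens.
    Assumption: a token is a run of letters or apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Distinct words (types) divided by the total number of words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def most_frequent(tokens, n=10):
    """Return the n most frequent words with their counts."""
    return Counter(tokens).most_common(n)

sample = ("The Signora had no business to do it, said Miss Bartlett, "
          "no business at all.")
tokens = tokenize(sample)
ttr = type_token_ratio(tokens)
top_two = most_frequent(tokens, 2)

print(f"{len(tokens)} tokens, type-token ratio {ttr:.3f}")
print(top_two)
```

Note that the type-token ratio depends strongly on text length, so it is usually only meaningful when comparing samples of similar size.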
Final three weeks W6: Natural Language Processing, Semantic Tagging W7: Complexity metrics (sentence length, syllables); Topic Modelling W8: Course wrap up; Mapping geographic information Which other analyses may be useful?
Natural Language Processing Making computers understand languages spoken by human beings. Applications: part-of-speech tagging, sentiment analysis, information extraction, machine translation, summarising, paraphrasing.
Part of speech tagging Providing the syntactic category of each word within a sentence: “The Signora had no business to do it,” said Miss Bartlett, “no business at all.” The/DT Signora/NNP had/VBD no/DT business/NN to/TO do/VB it/PRP said/VBD Miss/NNP Bartlett/NNP no/DT business/NN at/IN all/DT This can be done in the SMILE Text Analyzer, among many other tools.
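Once a tagger has produced output in the word/TAG notation shown above, the tags can be counted or filtered like any other feature. The following is a minimal sketch in Python; `parse_tagged` is a hypothetical helper name, and the input is simply the tagged sentence from the slide.

```python
from collections import Counter

tagged = ("The/DT Signora/NNP had/VBD no/DT business/NN to/TO do/VB it/PRP "
          "said/VBD Miss/NNP Bartlett/NNP no/DT business/NN at/IN all/DT")

def parse_tagged(text):
    """Split 'word/TAG' items into (word, tag) pairs.
    rpartition splits on the last slash, so words containing '/' survive."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged)

# Frequency of each grammatical category
tag_counts = Counter(tag for _, tag in pairs)

# All nouns: tags starting with NN (NN = common noun, NNP = proper noun)
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(tag_counts.most_common())
print(nouns)
```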
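Brill’s POS tagger Combination of a lexicon-based and a rule-based approach. A lexicon entry lists a word with its possible tags, the most likely tag first: talk VB NN. Each word initially receives its most likely tag. These initial results are then improved with transformation rules, e.g. VB NN PREVIOUSTAG JJ (change VB to NN when the previous tag is JJ). In “she could re-enter the world of rapid/JJ talk/VB, which was alone familiar to her” the rule corrects talk/VB to talk/NN. In “So she did want to talk/VB about her broken engagement” the rule does not apply and the initial tag stands.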
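The two stages of this approach, lexicon lookup followed by transformation rules, can be illustrated with a toy example. This is a simplified sketch in Python, not Brill's actual implementation: the tiny lexicon, the default tag NN for unknown words, and the single left-to-right pass per rule are all assumptions made for illustration.

```python
# Toy lexicon: each word lists its possible tags, most likely first.
LEXICON = {
    "rapid": ["JJ"],
    "talk": ["VB", "NN"],
    "to": ["TO"],
    "want": ["VB"],
}

# Transformation rules as (from_tag, to_tag, required_previous_tag).
# ("VB", "NN", "JJ") encodes the slide's rule: VB NN PREVIOUSTAG JJ
RULES = [("VB", "NN", "JJ")]

def initial_tags(words, default="NN"):
    """Stage 1: give every word its most likely tag from the lexicon.
    Assumption: unknown words receive the default tag."""
    return [LEXICON.get(word.lower(), [default])[0] for word in words]

def apply_rules(tags, rules):
    """Stage 2: rewrite a tag when the preceding tag matches the rule's context."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

def tag(words):
    """Run both stages and pair each word with its final tag."""
    return list(zip(words, apply_rules(initial_tags(words), RULES)))

noun_case = tag(["rapid", "talk"])       # rule fires: talk becomes NN
verb_case = tag(["want", "to", "talk"])  # rule does not fire: talk stays VB

print(noun_case)
print(verb_case)
```

In a real Brill tagger the rules are not written by hand but learned from a tagged training corpus, by repeatedly picking the transformation that removes the most errors.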
Perl NLP modules Lingua::EN::Tagger (a “trained” POS tagger), Lingua::EN::Fathom (readability measures), Lingua::EN::Sentence. Also: Lingua::DE, Lingua::FR, Lingua::ES, Lingua::Klingon. For Dutch: Frog.
Lemmatisation
Word          POS   Lemma
I             PRP   -
made          VBD   make
my            PRP$  -
song          NN    -
a             DT    -
coat          -     -
covered       -     cover
with          IN    -
embroideries  NNS   embroidery
out           -     -
of            -     -
old           JJ    -
mythologies   -     mythology
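The simplest form of lemmatisation is a lookup table from inflected forms to lemmas. The following is a minimal sketch in Python that uses only the word-lemma pairs given in this slide; returning unlisted words unchanged is an assumption, and real lemmatisers also rely on POS tags and morphological rules.

```python
# Inflected form -> lemma, taken from the example in this slide.
LEMMAS = {
    "made": "make",
    "covered": "cover",
    "embroideries": "embroidery",
    "mythologies": "mythology",
}

def lemmatise(words):
    """Replace each word by its lemma.
    Assumption: words missing from the table are returned unchanged."""
    return [LEMMAS.get(word.lower(), word) for word in words]

line = "I made my song a coat covered with embroideries out of old mythologies"
lemmas = lemmatise(line.split())

print(" ".join(lemmas))
```

Counting lemmas instead of surface forms means that, for example, "made" and "make" are treated as occurrences of the same word.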