Digital Text and Data Processing Week 8
Themes in the debate about DH

□ Is DH a valid scholarly discipline? Can these technologies genuinely enable scholars to generate valuable insights?
□ Adam Kirsch, "Technology Is Taking Over English Departments"
□ Stephen Marche, "Literature Is Not Data: Against Digital Humanities"
□ Melissa Dinsman, "The Digital in the Humanities"
□ From deduction to induction (or abduction)
□ Stanley Fish, "Mind Your P's and B's: The Digital Humanities and Interpretation"
□ Chris Anderson, "The End of Theory"
□ DH as a complementary tool set; different levels of analysis
□ Martin Mueller, "Stanley Fish and the Digital Humanities"
□ Matthew Jockers and Julia Flanders, "A Matter of Scale"
□ Pushing the interpretative paradigm
□ Jerome McGann and Lisa Samuels, "Deformance and Interpretation"
□ Stephen Ramsay, "Algorithmic Criticism"
□ Alan Liu, "The State of the Digital Humanities"
□ Stephen Ramsay, "The Hermeneutics of Screwing Around"
□ Practical work: recognition and theoretical aspects
□ Stephen Ramsay and Geoffrey Rockwell, "Developing Things"
□ Richard Grusin, "The Dark Side of Digital Humanities"
□ The changing nature of "evidence"
□ John Burrows, "Never Say Always Again: Reflections on the Numbers Game"
□ Stephen Ramsay, "Algorithmic Criticism"
Quantitative analyses

□ Vocabulary diversity (type-token ratios; a minimal sketch follows below)
□ Grammatical categories (POS categories)
□ Average number of words per sentence
□ Word frequency: lists or PCA diagrams
□ Distinctive words: tf-idf
□ KWIC lists and collocation
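As an illustration of the first item, a minimal type-token ratio in R (a sketch; the variable text is assumed to hold one full story as a single character string):

text_tokens <- tolower( unlist( strsplit( text, "\\W+" ) ) )   # crude tokenisation on non-word characters
text_tokens <- text_tokens[ text_tokens != "" ]                # drop empty strings produced by the split
ttr <- length( unique(text_tokens) ) / length(text_tokens)     # types divided by tokens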
Case study

□ Eight short stories by Rudyard Kipling (1865-1936) from the collection In Black and White (1888)
□ Four stories have an Indian narrator and four have a British narrator
□ The program analyseTexts.pl creates data about tokens, types and POS categories
□ To analyse differences and similarities, it can be useful to add a facet variable, using, for example, R's ifelse() function:

d$narrator <- ifelse( rownames(d) == "AtHowliThana" |
                      rownames(d) == "DrayWaraYowDee" |
                      rownames(d) == "Gemini" |
                      rownames(d) == "InFloodTime",
                      "Indian", "British" )
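The same assignment can be written more compactly with R's %in% operator (an equivalent formulation, using the story names above):

indian_stories <- c( "AtHowliThana", "DrayWaraYowDee", "Gemini", "InFloodTime" )
d$narrator <- ifelse( rownames(d) %in% indian_stories, "Indian", "British" )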
□ Two-dimensional hashes have two keys.
□ The first key represents the row, and the second indicates the column.
□ Example, incrementing the count for a given word in a given text:

$tdm{ $text }{ $word }++ ;
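For comparison, the same counting step could be done in R with table(); a sketch assuming two hypothetical parallel vectors, text_id and word, with one entry per token in the corpus:

tdm <- as.data.frame.matrix( table( text_id, word ) )   # rows = texts, columns = words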
[Figure: Number of tokens]
[Figure: Type-token ratios]
Saving images

tiff( "image.tif", compression = "lzw" )   # open a TIFF graphics device with LZW compression
print(p)                                   # draw the plot stored in p
dev.off()                                  # close the device and write the file
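If p is a ggplot object, ggplot2's own ggsave() is a common alternative; extra arguments are passed on to the underlying graphics device:

ggsave( "image.tiff", plot = p, compression = "lzw" )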
[Figure: Type-token ratios and average number of words]
□ Sum of the values in different columns, for instance to combine the Penn Treebank adjective tags (JJ: adjective, JJR: comparative, JJS: superlative):

d$adjectives <- d$JJ + d$JJR + d$JJS
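The verbs / tokens ratio plotted on the next slide presupposes a verbs column, which can be built the same way from the Penn Treebank verb tags (assuming these columns are present in the data):

d$verbs <- d$VB + d$VBD + d$VBG + d$VBN + d$VBP + d$VBZ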
□ Graphs can be combined using the "gridExtra" package.
□ Example:

library(gridExtra)

p1 <- ggplot( d, aes( x = rownames(d), y = verbs / tokens, fill = narrator ) ) +
      geom_bar( stat = "identity" )
p2 <- ggplot( d, aes( x = rownames(d), y = adjectives / tokens, fill = narrator ) ) +
      geom_bar( stat = "identity" )
# p3 and p4 are built analogously for further POS categories
grid.arrange( p1, p2, p3, p4, ncol = 1 )
[Figure: Occurrences of POS categories]
□ The program tdm.pl creates data about the 40 most common words in a corpus
□ It is also possible to supply words yourself in the @words array
□ It also creates a term-document matrix of the 40 most distinctive words on the basis of the tf-idf formula
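For reference, a common form of the tf-idf weight (tdm.pl may implement a slightly different variant) is tf-idf(t, d) = tf(t, d) × log( N / df(t) ), where tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.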
Subsetting a dataframe

d[ condition for rows, condition for columns ]

□ Example:

t <- read.csv("data.csv")
i <- t[ rownames(t) == "AtHowliThana" |
        rownames(t) == "DrayWaraYowDee" |
        rownames(t) == "Gemini" |
        rownames(t) == "InFloodTime", ]
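The column slot accepts conditions in the same way. A sketch selecting only the verb-tag columns (assuming Penn Treebank tags are used as column names):

verbs_only <- t[ , grepl( "^VB", colnames(t) ) ]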
The colSums() function

□ Example:

i <- as.data.frame( t( colSums(i) ) )

□ colSums() adds up the values in each column of a dataframe
□ All values must be numeric
□ The result is a named vector; converted to a dataframe directly, the calculated values would end up on separate rows
□ The function t() can be used to transpose a dataframe (convert rows into columns or vice versa), so that the totals form a single row
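Putting subsetting and colSums() together, aggregate counts per narrator could be computed as follows (a sketch, assuming data.csv and the story names used earlier):

t      <- read.csv("data.csv")
indian <- rownames(t) %in% c( "AtHowliThana", "DrayWaraYowDee", "Gemini", "InFloodTime" )
totals <- rbind( Indian  = colSums( t[ indian, ] ),
                 British = colSums( t[ !indian, ] ) )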
Principal Component Analysis

□ Performed on a term-document matrix
□ Frequency counts ought to be normalised first, e.g. by dividing each text's counts by its total number of tokens:

tdm <- read.csv( "tdm.csv", header = TRUE )
md  <- read.csv( "data.csv", header = TRUE )
tdm <- tdm / md$tokens   # assumes both files have one row per text, in the same order

□ The analysis creates new variables (principal components) which account for most of the variability in the data set
[Figure: Principal Component Analysis plot]
pca <- prcomp( tdm, center = TRUE, scale. = TRUE )   # centre and standardise each variable
summary(pca)                                         # proportion of variance explained per component
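The texts can then be plotted in the space of the first two components; a sketch, assuming the narrator variable created earlier lines up with the rows of tdm:

library(ggplot2)
scores          <- as.data.frame( pca$x )   # coordinates of each text on the components
scores$narrator <- d$narrator
ggplot( scores, aes( x = PC1, y = PC2, colour = narrator ) ) +
    geom_point() +
    geom_text( aes( label = rownames(scores) ), vjust = -0.8 )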
loadings <- pca$rotation   # the rotation matrix holds each word's loading on each component
[Output of pca$rotation: the full loadings table]
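To see which words drive a component, the loadings can be sorted; a sketch listing the ten words with the strongest influence on the first component (the rows of the rotation matrix are named after the words in the term-document matrix):

head( sort( abs( pca$rotation[ , "PC1" ] ), decreasing = TRUE ), 10 )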
[Figure: Collocation results for the Indian and British narrators]