1 Digital Text and Data Processing Week 2

2 “The book is a machine to think with” I.A. Richards, Principles of Literary Criticism “The technologising of the word” Walter Ong, Orality and Literacy

3 Today’s class
□ Discussion of the reading
□ Regular expressions
□ Tokenisation
□ Frequency lists
□ Individual research projects

4 Terminology
□ Text analysis
□ Digital Literary Studies
□ Algorithmic Criticism (Stephen Ramsay)
□ Literary informatics (Martin Mueller)

5 Becket, Andrew, A concordance to Shakespeare suited to all the editions, in which the distinguished and parallel passages in the plays of that justly admired writer are methodically arranged. 1787

6 Studies based on vocabulary
□ Segmentation or tokenisation
□ Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th C.)
□ “soft mark up”
Source: Christopher Kelty, Abracadabra: Language, Memory, Representation

7 Frequency lists
□ Token counts reflect the total number of words; types are the unique words in a text
□ ‘Bag of words’ model: original word order is ignored
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”
Tokens: 36
Types: 13
the          6
it           6
of           6
was          6
epoch        2
age          2
times        2
foolishness  1
wisdom       1
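The token and type counts on this slide can be reproduced with a few lines of Perl. The following is a minimal sketch, assuming that a token is any run of letters and that upper and lower case are treated as the same word; the variable names are illustrative only.

```perl
use strict;
use warnings;

# The passage quoted on the slide.
my $text = "It was the best of times, it was the worst of times, "
         . "it was the age of wisdom, it was the age of foolishness, "
         . "it was the epoch of belief, it was the epoch of incredulity";

# Assumption: lower-case everything, then treat each run of letters as a token.
my $lower  = lc($text);
my @tokens = $lower =~ /[a-z]+/g;

# Count how often each type (unique word) occurs.
my %freq;
foreach my $t (@tokens) {
    $freq{$t}++;
}

my $token_count = scalar @tokens;
my $type_count  = scalar keys %freq;

print "Tokens: $token_count\n";   # prints: Tokens: 36
print "Types: $type_count\n";     # prints: Types: 13
```

Because the hash only stores counts per word, the original word order is lost, which is exactly the ‘bag of words’ simplification.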

8 Stylometrics
David Hoover, Textual Analysis
□ Study of style on the basis of quantitative aspects
□ Analyses of differences and similarities between texts in different genres, in different periods, texts by different authors

9 Hugh Craig, Stylistic Analysis and Authorship Studies

10 Common words
□ Zipf’s law: a small number of words have a high frequency, while a large number of words are ‘hapax legomena’ (words that appear only once)
□ Function words and lexical words
□ Common words may be ignored by making use of a list of stop words, e.g. Glasgow stop word list
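Filtering with a stop list could look roughly like this. The four-word stop list below is a made-up stand-in for illustration, not the Glasgow list, which is much longer.

```perl
use strict;
use warnings;

# Hypothetical, tiny stop list for illustration only.
my %stop = map { $_ => 1 } qw(it was the of);

my @tokens = qw(it was the best of times it was the worst of times);

# Keep only the tokens that do not appear in the stop list.
my @content = grep { !$stop{$_} } @tokens;

print join( " ", @content ), "\n";   # prints: best times worst times
```

Storing the stop words as hash keys makes each lookup a single step, however long the list is.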

11 Authorship attribution
John Burrows, Never Say Always Again: Reflections on the Numbers Game
□ Suggesting an author for texts whose authorship is disputed

12 Digital Shakespeare
□ “Secondary query potential” of digital text
□ “non-reading” or scalable reading
□ “The underlying methods (…) are probabilistic and in many ways more compatible with a spirit of tentative inquiry”
□ “The impossibly impoverishing reduction of a text into lists of its constituent parts may let you see some salient differences and resemblances across many texts that you could not as readily see by reading”
□ “Is it an instance of the old joke about the drunk who is looking for his lost car key under a lamp post because that is where the light is?”
□ Digital methods are concerned more with “establishing the ‘fact that’ than with explaining the ‘reason why’”

13 Recapitulation W1
□ Variables begin with a dollar sign. Two types: strings and numbers
□ Statements end in a semi-colon
□ “use strict” has the effect that all variables need to be declared on first use with the “my” keyword
□ “use warnings” means that programmers will be warned when there are errors, even when these are “non-fatal”

14 Reading a file
open ( IN, "shelley.txt" ) ;
while ( <IN> ) {
    print $_ ;
}
close ( IN ) ;
Curly brackets create a “block” of code
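A slightly more defensive variant of the same loop is sketched below. It uses the three-argument form of open with a lexical filehandle and stops with a message if the file cannot be opened. To keep the sketch runnable on its own, it first writes a small temporary file; in class the file would be shelley.txt, and the two sample lines here are only an assumption about its content.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: a temporary two-line file stands in for shelley.txt.
my ( $tmp, $filename ) = tempfile( UNLINK => 1 );
print $tmp "Hail to thee, blithe Spirit!\nBird thou never wert,\n";
close($tmp);

# Three-argument open with an error check.
open( my $in, "<", $filename ) or die "Cannot open $filename: $!";

my $line_count = 0;
while ( my $line = <$in> ) {
    print $line;       # each line still ends in its newline
    $line_count++;
}
close($in);
```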

15 Operators
□ Concatenation of strings with the dot
$string1 = "Hello" ;
$string2 = "World" ;
$string3 = $string1 . " " . $string2 ;
□ Mathematical operators:
$sum = 5 + 1 ;
$sum++ ;
$number = 2 ;
$number += 3 ;

16 Functions
□ Functions “cluster” a number of instructions
□ Examples:
□ length()
my $title = "Ulysses" ;
print length($title) ; # output of this line: 7
□ lc() and uc()
my $title = "Ulysses" ;
print lc($title) ; # output of this line: "ulysses"

17 Regular expressions
□ Text patterns
□ Simplest regular expression: simple sequence of characters
Example: /sun/
Also matches: disunited, sunk, asunder (and Sunday, but only when the match is made case insensitive)
/ sun /
Does NOT match, because “sun” is followed by punctuation rather than a space: […] the gate of the eastern sun, […] gloom beneath the noonday sun.

18 □ \b can be used in regular expressions to represent word boundaries
□ If you place “i” directly after the second forward slash, the comparison will take place in a case insensitive manner.
/\bsun\b/i
Matches:
[…] Points to the unrisen sun! […]
[…] Startles the dreamer, sun-like truth […]
[…] stamped upon the sun; […]
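The effect of the word boundaries can be checked with a short script. This is a minimal sketch: the first three strings are the fragments from the slide, and the last two are added here as counterexamples.

```perl
use strict;
use warnings;

my @lines = (
    "Points to the unrisen sun!",
    "Startles the dreamer, sun-like truth",
    "stamped upon the sun;",
    "Sunday morning",    # counterexample: "Sun" is followed by a letter
    "torn asunder",      # counterexample: "sun" sits inside a longer word
);

# Keep only the lines in which "sun" occurs as a separate word,
# regardless of case.
my @matches = grep { /\bsun\b/i } @lines;

print "$_\n" foreach @matches;
my $match_count = scalar @matches;
```

Punctuation and hyphens count as boundaries, so “sun!”, “sun-like” and “sun;” all match, while the two counterexamples do not.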

19 Character classes
.     Any character (except a newline, by default)
\w    Any alphanumerical character: alphabetical characters, numbers and underscore
\d    Any digit
\s    White space: space, tab, newline
[..]  Any of the characters supplied within square brackets, e.g. [A-Za-z]

20 Quantifiers
{n,m}  Pattern must occur at least n times, at most m times
{n,}   At least n times
{n}    Exactly n times
?      Is the same as {0,1}
+      Is the same as {1,}
*      Is the same as {0,}

21 Examples
/\d{4}/
Matches: 1234, 2013, 1066
/b[aeiou]{1,2}t\w*/
Matches: bit, but, beat, boathouse
Not: beauty, blister, boyhood
/[a-zA-Z]+/
Matches any word that consists of alphabetical characters only
Does not FULLY match: e-mail, catch22, can’t
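These examples can be verified by running the patterns over the word lists from the slide. In the sketch below, the anchors ^ and $ are added to the last pattern as one way of testing for a full match; the word “email” is added here as a positive case.

```perl
use strict;
use warnings;

# Words from the slide: the first four should match, the last three should not.
my @words = qw(bit but beat boathouse beauty blister boyhood);
my @hits  = grep { /b[aeiou]{1,2}t\w*/ } @words;
print join( ", ", @hits ), "\n";    # prints: bit, but, beat, boathouse

# Full-match test: the whole string must consist of letters only.
my @candidates = ( "email", "e-mail", "catch22", "can't" );
my @letters_only = grep { /^[a-zA-Z]+$/ } @candidates;
print join( ", ", @letters_only ), "\n";    # prints: email
```

Without the anchors, /[a-zA-Z]+/ would still find a partial match in “e-mail” (the “e”) and in “catch22” (the “catch”).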

22 Anchors
Do not match characters, but locations within strings.
\b  Word boundaries
^   Start of a line
$   End of a line

23 Match variables
□ Parentheses create substrings within a regular expression
□ In Perl, this substring is stored as variable $1
□ Example:
$keyword = "quick-thinking" ;
if ( $keyword =~ /(\w+)-\w+/ ) {
    print $1 ; # This will print "quick"
}
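When a pattern contains more than one pair of parentheses, Perl numbers the captured substrings from left to right as $1, $2, and so on. A minimal sketch that extends the example with a second group:

```perl
use strict;
use warnings;

my $keyword = "quick-thinking";
my ( $first, $second );

# Two pairs of parentheses: the part before and the part after the hyphen.
if ( $keyword =~ /(\w+)-(\w+)/ ) {
    ( $first, $second ) = ( $1, $2 );
    print "$first\n";     # prints: quick
    print "$second\n";    # prints: thinking
}
```

Copying $1 and $2 into named variables straight away is a common precaution, since the next successful match overwrites them.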

24 □ Regular expressions can be combined with the vertical bar (‘|’)
/\bsun\b|\bstar\b|\bmoon\b/
□ ‘Special characters’ need to be escaped with the backslash (‘\’)
/\?/
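Both points can be tried out on a single line of text. In this sketch the alternatives are grouped inside one pair of parentheses, which matches the same words as the pattern on the slide but also captures them; the sample sentence is invented for illustration.

```perl
use strict;
use warnings;

# Invented sample sentence.
my $line = "The sun and the moon; is that a star?";

# With the g modifier in list context, every captured word is returned.
my @found = $line =~ /\b(sun|star|moon)\b/g;
print join( ", ", @found ), "\n";    # prints: sun, moon, star

# The question mark is a quantifier, so it is escaped to match it literally.
my $is_question = ( $line =~ /\?/ ) ? 1 : 0;
print "$is_question\n";              # prints: 1
```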

25 Three types of variables
□ Scalars: a single value; start with $
□ Arrays: multiple values; start with @
□ Hashes: multiple values which can be referenced with ‘keys’; start with %

26 Basic tokenisation
$line = "If music be the food of love, play on" ;
@array = split(" ", $line ) ;
# $array[0] contains "If"
# $array[4] contains "food"
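Splitting on spaces leaves punctuation attached to the neighbouring word. One possible refinement, shown here as a sketch and not as the only way to tokenise, is to split on runs of non-word characters and to lower-case the line first.

```perl
use strict;
use warnings;

my $line = "If music be the food of love, play on";

# Splitting on white space keeps the comma attached to the word.
my @raw = split( " ", $line );
print $raw[6], "\n";      # prints: love,

# Splitting on runs of non-word characters drops the punctuation.
my @words = split( /\W+/, lc($line) );
print $words[6], "\n";    # prints: love
```

Note that /\W+/ also splits at apostrophes and hyphens, so “can’t” or “sun-like” would each become two tokens; whether that is acceptable depends on the research question.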

27 Looping through an array
foreach my $w ( @words ) {
    print $w ;
}

28 my %freq ;
$freq{"if"}++ ;
$freq{"music"}++ ;
print $freq{"if"} . "\n" ;

29 Calculation of frequencies
my %freq ;
foreach my $w ( @words ) {
    $freq{ $w }++ ;
}

30 Looping through a hash
foreach my $f ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print $f . "\t" . $freq{$f} . "\n" ;
}
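Tokenisation, counting and sorting can be combined into one short script. The sketch below makes two assumptions that are not on the slides: words are lower-cased and split on non-word characters, and words with equal frequency are ordered alphabetically so that the output is the same on every run.

```perl
use strict;
use warnings;

my $text = "It was the best of times, it was the worst of times";

# Tokenise: lower-case, then split on runs of non-word characters.
my @words = split( /\W+/, lc($text) );

# Count the frequency of each word.
my %freq;
foreach my $w (@words) {
    $freq{$w}++;
}

# Sort by descending frequency; <=> compares numbers, cmp compares strings.
my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;

foreach my $f (@ranked) {
    print $f . "\t" . $freq{$f} . "\n";
}
```

Without the alphabetical tie-break, the order of equally frequent words may vary between runs, because Perl does not keep hash keys in a fixed order.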
