Digital Text and Data Processing Week 2
From texts to data
Text mining algorithms may focus on the recognition of:
- Linguistic aspects, e.g. most frequent words, number of words in a sentence, number of nouns or adjectives, type/token ratio
- Semantic aspects, e.g. references to concepts, sentiments, named entities (personal names, organisations, geographic locations)
Research based on vocabulary
- Segmentation or tokenisation
- Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th century)
- “Soft mark up”
Source: Christopher Kelty, Abracadabra: Language, Memory, Representation
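Tokenisation on the basis of spaces can be sketched in a few lines of Perl. This is a minimal illustration, not a complete tokeniser: the sample sentence and the variable names are assumptions made for the example. It shows two options, splitting on white space and extracting runs of letters.

```perl
use strict;
use warnings;

# Hypothetical sample sentence, chosen for this sketch.
my $line = "It was the best of times, it was the worst of times";

# Option 1: split on runs of white space.
# Punctuation stays attached to the neighbouring word.
my @rough_tokens = split /\s+/, $line;

# Option 2: collect every run of letters, which discards punctuation.
my @word_tokens = $line =~ /[A-Za-z]+/g;

print scalar(@rough_tokens), " tokens when splitting on white space\n";
print "Sixth token, split on white space: $rough_tokens[5]\n";
print "Sixth token, letters only: $word_tokens[5]\n";
```

Splitting on white space is the simplest approach, but the sixth token then still carries its comma, which is why word counts usually rely on the second option or on a separate clean-up step.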
Frequency lists
- Token counts reflect the total number of words; types are the unique words in a text
- ‘Bag of words’ model: the original word order is ignored
Example: “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”
Tokens: 36
Types: 13
Frequency list (excerpt):
the 6
it 6
of 6
was 6
epoch 2
age 2
times 2
foolishness 1
wisdom 1
belief 1
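The token and type counts for the Dickens sentence can be reproduced with a short script. This is a sketch under a few assumptions: words are taken to be runs of letters, the text is lower-cased so that "It" and "it" count as one type, and a hash (a structure that maps words to counts) is used even though it may not have been introduced yet.

```perl
use strict;
use warnings;

my $text = "It was the best of times, it was the worst of times, "
         . "it was the age of wisdom, it was the age of foolishness, "
         . "it was the epoch of belief, it was the epoch of incredulity";

my %freq;          # type => number of occurrences
my $tokens = 0;    # running total of words

# Lower-case the text, then visit every run of letters in turn.
foreach my $word ( lc($text) =~ /[a-z]+/g ) {
    $freq{$word}++;
    $tokens++;
}

my $types = scalar keys %freq;
print "Tokens: $tokens\n";
print "Types: $types\n";

# Most frequent types first; ties are ordered alphabetically.
foreach my $word ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    print "$word $freq{$word}\n";
}
```

Because the counts are stored per word and the word order is thrown away, the hash is a direct implementation of the bag of words model.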
Stylometrics
- Study of style on the basis of observable textual aspects
- Analyses of differences and similarities between texts in different genres, in different periods, and by different authors
Source: Hugh Craig, Stylistic Analysis and Authorship Studies
Authorship attribution
- Suggesting an author for texts whose authorship is disputed
Source: John Burrows, Never Say Always Again: Reflections on the Numbers Game
Becket, Andrew, A concordance to Shakespeare suited to all the editions, in which the distinguished and parallel passages in the plays of that justly admired writer are methodically arranged. 1787
“We will encourage you to develop the three great virtues of a programmer: laziness, impatience, and hubris”
Larry Wall, Programming Perl
Recapitulation W1
- Variables begin with a dollar sign. Two types: strings and numbers
- Statements end in a semicolon
- Operators such as +, - and * can be used to do calculations
- Use “perl” + name of program to run a program in the command prompt
Avoiding errors
- “use strict” has the effect that all variables need to be declared on first use with the “my” keyword
- “use warnings” means that programmers will be warned when there are errors, even when these are “non-fatal”
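The effect of the two pragmas can be sketched as follows. The variable names and values are hypothetical. The undeclared variable is placed inside a string that eval compiles at run time, an assumption made only so that the error can be shown without stopping the script; in an ordinary program the same mistake would halt compilation.

```perl
use strict;      # every variable must be declared with "my"
use warnings;    # non-fatal problems are reported as well

# Declared variables: accepted under "use strict".
my $word  = "sun";    # hypothetical example value
my $count = 3;        # hypothetical example value
print "$word occurs $count times\n";

# $undeclared is never introduced with "my", so strict rejects the code.
# eval compiles the string at run time and lets us inspect the error.
my $compiled = eval q{ $undeclared = 5; 1 };
my $error    = $@;
print "Rejected by strict: $error" unless $compiled;
```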
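Operators
Concatenation of strings with the dot:
    my $string1 = "Hello" ;
    my $string2 = "World" ;
    my $string3 = $string1 . " " . $string2 ;    # "Hello World"
Shorthand notation for mathematical operators:
    my $sum = 5 + 1 ;
    $sum++ ;              # same as $sum = $sum + 1 ; ++ works on variables, not on literal numbers
    my $number = 2 ;
    $number += 3 ;        # same as $number = $number + 3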
Control keywords
    if ( <condition> ) {
        <first block of code>
    }
    elsif ( <condition> ) {
        <second block of code>
    }
    else {
        <last block of code ; default option>
    }
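A concrete instance of the if / elsif / else template is sketched below. The subroutine name size_label and the thresholds 100 and 1000 are hypothetical, chosen only to give the three branches something to decide on.

```perl
use strict;
use warnings;

# Hypothetical helper: label a text by its number of tokens.
sub size_label {
    my ($tokens) = @_;
    if ( $tokens < 100 ) {
        return "short";     # first block: runs when the first condition holds
    }
    elsif ( $tokens < 1000 ) {
        return "medium";    # second block: tested only if the first condition failed
    }
    else {
        return "long";      # default option: runs when no condition held
    }
}

print size_label(36), "\n";
print size_label(500), "\n";
print size_label(1000), "\n";
```

The conditions are tested from top to bottom, and only the block belonging to the first true condition is executed.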
Reading a file
    open ( IN , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;
    while ( <IN> ) {
        print $_ ;    # $_ holds the line that was just read
    }
    close ( IN ) ;
Curly brackets create a “block” of code
Regular expressions
- A pattern which represents a specific sequence of characters
- The pattern is given within two forward slashes
- Use the =~ operator to test if a given string contains the regex
Example: $keyword =~ /rain/
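Regular expressions
Simplest regular expression: a simple sequence of characters
Example: /sun/
Also matches: disunited, sunk, asunder (and Sunday, but only when the comparison is case insensitive)
Example: / sun / (with a space on either side)
Does NOT match, because “sun” is followed by punctuation rather than a space:
[…] the gate of the eastern sun,
[…] gloom beneath the noonday sun.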
- \b can be used in regular expressions to represent word boundaries
- If you place “i” directly after the second forward slash, the comparison will take place in a case insensitive manner
Example: /\bsun\b/i
Matches:
[…] Points to the unrisen sun! […]
[…] Startles the dreamer, sun-like truth […]
[…] stamped upon the sun; […]
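The difference between a bare sequence of characters and a pattern with word boundaries can be checked with a small script. The first three lines are the fragments quoted above; "asunder" and "Sunday morning" are extra test strings added for contrast, and the array names are assumptions of this sketch.

```perl
use strict;
use warnings;

my @lines = (
    "Points to the unrisen sun!",
    "Startles the dreamer, sun-like truth",
    "stamped upon the sun;",
    "asunder",           # extra test string: "sun" inside a longer word
    "Sunday morning",    # extra test string: capital S, part of a longer word
);

my @plain;      # lines that contain the characters s-u-n
my @bounded;    # lines that contain "sun" as a separate word, in any case

foreach my $line (@lines) {
    push @plain,   $line if $line =~ /sun/;
    push @bounded, $line if $line =~ /\bsun\b/i;
}

print "/sun/ matched ", scalar(@plain), " lines\n";
print "/\\bsun\\b/i matched ", scalar(@bounded), " lines\n";
```

The plain pattern accepts "asunder" but misses "Sunday morning" because of the capital letter. The bounded pattern rejects both, since in each case "sun" is directly attached to other letters.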
Character classes
.      Any character (except a newline)
\w     Any alphanumerical character: alphabetical characters, numbers and underscore
\d     Any digit
\s     White space: space, tab, newline
[..]   Any of the characters supplied within square brackets, e.g. [A-Za-z]
Quantifiers
{n,m}   Pattern must occur at least n times, at most m times
{n,}    At least n times
{n}     Exactly n times
?       is the same as {0,1}
+       is the same as {1,}
*       is the same as {0,}
Examples
/\d{4}/
Matches: 1234, 2013, 1066
/[a-zA-Z]+/
Matches any word that consists of alphabetical characters only
Does not FULLY match: e-mail, catch22, can’t
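The distinction between a full and a partial match can be made explicit in code. In this sketch the anchors ^ and $ (start and end of the string) force the pattern to cover the whole word; "poetry" and "Hastings, 1066" are hypothetical test strings, and the hash name is an assumption.

```perl
use strict;
use warnings;

my @words = ( "poetry", "e-mail", "catch22", "can't" );

my %fully_alphabetic;    # word => 1 if it consists of letters only, else 0
foreach my $word (@words) {
    # ^ and $ tie the pattern to the start and the end of the string,
    # so one or more letters must account for the entire word.
    $fully_alphabetic{$word} = ( $word =~ /^[a-zA-Z]+$/ ) ? 1 : 0;
    print "$word: ",
        ( $fully_alphabetic{$word} ? "full match" : "partial match only" ),
        "\n";
}

# Without anchors, /\d{4}/ succeeds as soon as four digits occur in a row.
my $year_found = ( "Hastings, 1066" =~ /\d{4}/ ) ? 1 : 0;
print "Four digits found: $year_found\n";
```

Without the anchors, /[a-zA-Z]+/ would still succeed on "e-mail", "catch22" and "can't", because each of them contains at least one run of letters.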
Examples
/b[aeiou]{1,2}t\w*/
Matches: bit, boathouse, but, beat
Does not match: beauty (three vowels before the t), beast, blister, boyhood
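The pattern can be tried against the eight words directly. This is a small sketch; the array names @matches and @misses are assumptions made for the example.

```perl
use strict;
use warnings;

my @words = qw(bit boathouse beauty but beat beast blister boyhood);

my @matches;    # words in which the pattern is found
my @misses;     # words in which it is not found

foreach my $word (@words) {
    # b, then one or two vowels, then t, then any number of word characters
    if ( $word =~ /b[aeiou]{1,2}t\w*/ ) {
        push @matches, $word;
    }
    else {
        push @misses, $word;
    }
}

print "Matches: @matches\n";
print "No match: @misses\n";
```

In "beauty" the b is followed by three vowels, one more than {1,2} allows, and in "beast" the vowels are followed by s instead of t.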
Anchors
Do not match characters, but locations within strings.
\b   Word boundaries
^    Start of a line
$    End of a line
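The three anchors can be illustrated on a single line of text. The sample line and the variable names below are hypothetical; each test stores 1 when the pattern is found and 0 when it is not.

```perl
use strict;
use warnings;

my $line = "The sun sets behind the hills";    # hypothetical sample line

my $starts_with_the = ( $line =~ /^The/ )    ? 1 : 0;    # "The" at the start
my $starts_with_sun = ( $line =~ /^sun/ )    ? 1 : 0;    # "sun" is not at the start
my $ends_with_hills = ( $line =~ /hills$/ )  ? 1 : 0;    # "hills" at the end
my $has_word_set    = ( $line =~ /\bset\b/ ) ? 1 : 0;    # "set" only occurs inside "sets"

print "Starts with 'The': $starts_with_the\n";
print "Starts with 'sun': $starts_with_sun\n";
print "Ends with 'hills': $ends_with_hills\n";
print "Contains the word 'set': $has_word_set\n";
```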
- Regular expressions can be combined with the vertical bar (‘|’)
  Example: /\bsun\b|\bstar\b|\bmoon\b/
- ‘Special characters’ need to be escaped with the backslash (‘\’)
  Example: /\?/
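Both features can be tested together in a short script. The three sample lines and the array names are hypothetical, chosen so that one line contains a listed word, one contains a listed word only as part of a longer word, and one contains a literal question mark.

```perl
use strict;
use warnings;

my @lines = (
    "the moon is bright",    # contains "moon" as a separate word
    "a starling sang",       # "star" occurs only inside "starling"
    "Who is there?",         # contains a literal question mark
);

my @celestial;    # lines that mention sun, star or moon as a whole word
my @questions;    # lines that contain a question mark

foreach my $line (@lines) {
    push @celestial, $line if $line =~ /\bsun\b|\bstar\b|\bmoon\b/;
    push @questions, $line if $line =~ /\?/;
}

print "Celestial: @celestial\n";
print "Questions: @questions\n";
```

Without the backslash, the question mark would be read as a quantifier instead of a character to search for.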
Exercises Try your hand at Perl exercises on regular expressions, numbers 9 and 10