Digital Text and Data Processing


1 Digital Text and Data Processing
Week 2

2 From texts to data

3 Text mining algorithms may focus on the recognition of
Linguistic aspects: e.g. most frequent words, number of words in a sentence, number of nouns or adjectives, type/token ratio
Semantic aspects: e.g. references to concepts, sentiments, named entities (personal names, organisations, geographic locations)

4 Research based on vocabulary
Segmentation or tokenisation
Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th century)
“soft mark up”
Source: Christopher Kelty, Abracadabra: Language, Memory, Representation
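
A minimal sketch of space-based tokenisation in Perl, assuming that white space alone separates words. The helper name tokenise() is hypothetical and used only for illustration; punctuation stays attached to the words here.

```perl
use strict;
use warnings;

# tokenise() is a hypothetical helper, not a Perl built-in.
sub tokenise {
    my ($text) = @_;
    # split ' ' ignores leading white space and splits on any run of white space
    return split ' ', $text;
}

my @tokens = tokenise("It was the best of times");
print scalar(@tokens), " tokens: @tokens\n";
```

A real tokeniser would usually also strip punctuation and normalise case before counting.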

5 Frequency lists
Token counts reflect the total number of words; types are the unique words in a text
‘Bag of words’ model: original word order is ignored
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”
Tokens: 36 Types: 13
the 6
it 6
of 6
was 6
epoch 2
age 2
times 2
foolishness 1
wisdom 1
belief 1
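
The counts on this slide can be reproduced with a short Perl sketch. It assumes that a word is any run of the letters a-z after lower-casing; count_words() is a hypothetical helper name, not something defined in the course.

```perl
use strict;
use warnings;

# count_words() is a hypothetical helper: it returns the number of tokens
# and a reference to a hash that maps each type to its frequency.
sub count_words {
    my ($text) = @_;
    my %freq;
    my $tokens = 0;
    # Lower-case the text and treat every run of letters as one word
    for my $word ( lc($text) =~ /[a-z]+/g ) {
        $freq{$word}++;
        $tokens++;
    }
    return ( $tokens, \%freq );
}

my $text = "It was the best of times, it was the worst of times, "
         . "it was the age of wisdom, it was the age of foolishness, "
         . "it was the epoch of belief, it was the epoch of incredulity";

my ( $tokens, $freq ) = count_words($text);
print "Tokens: $tokens\n";
print "Types: ", scalar( keys %$freq ), "\n";

# Most frequent words first; ties are listed alphabetically
for my $word ( sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq ) {
    print "$word $freq->{$word}\n";
}
```

Because the hash only stores counts, the original word order is lost, which is exactly the ‘bag of words’ idea.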

6 Stylometrics
Study of style on the basis of observable textual aspects
Analyses of differences and similarities between texts in different genres, in different periods, texts by different authors
Source: Hugh Craig, Stylistic Analysis and Authorship Studies

7 Authorship attribution
Suggesting an author for texts whose authorship is disputed
Source: John Burrows, Never Say Always Again: Reflections on the Numbers Game

8 Becket, Andrew, A concordance to Shakespeare suited to all the editions, in which the distinguished and parallel passages in the plays of that justly admired writer are methodically arranged. 1787

9 Larry Wall, Programming Perl
“We will encourage you to develop the three great virtues of a programmer: laziness, impatience and hubris”

10 Recapitulation W1
Variables begin with a dollar sign. Two types: strings and numbers
Statements end in a semi-colon
Operators such as +, - and * can be used to do calculations
Use “perl” + name of program to run a program in the command prompt

11 Avoiding errors
“use strict” has the effect that all variables need to be declared on first use with the “my” keyword
“use warnings” means that programmers will be warned when there are errors, even when these are “non-fatal”
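
A short sketch of what the two pragmas look like at the top of a program. The variable names and values are made up for illustration.

```perl
use strict;
use warnings;

my $title = "Ozymandias";   # declared with "my" on first use
my $lines = 14;

# Under "use strict", an undeclared variable stops the program at compile time:
# $author = "Shelley";      # Global symbol "$author" requires explicit package name

print "$title has $lines lines\n";
```

Note that both pragmas are written in lower case; Perl is case-sensitive.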

12 Operators
Concatenation of strings with the dot:
$string1 = "Hello" ;
$string2 = "World" ;
$string3 = $string1 . " " . $string2 ;   # "Hello World"
Shorthand notation for mathematical operators:
$sum = 5 ;
$sum++ ;          # same as $sum = $sum + 1
$number = 2 ;
$number += 3 ;    # same as $number = $number + 3

13 Control keywords
if ( <condition> ) {
    <first block of code>
} elsif ( <condition> ) {
    <second block of code>
} else {
    <last block of code ; default option>
}
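
The template above could be filled in as follows. This is only a sketch: describe_length() and its thresholds are invented for illustration.

```perl
use strict;
use warnings;

# describe_length() is a hypothetical helper; the cut-off values are arbitrary.
sub describe_length {
    my ($words) = @_;
    if ( $words < 10 ) {
        return "short";
    } elsif ( $words < 30 ) {
        return "medium";
    } else {
        return "long";      # default option
    }
}

print describe_length(36), "\n";
```

The conditions are tested from top to bottom, and only the first block whose condition is true is run.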

14 Reading a file
open ( IN , "shelley.txt") or die "Cannot open shelley.txt: $!" ;
while(<IN>) {
    print $_ ;    # $_ holds the line that was just read
}
close ( IN ) ;
Curly brackets create a “block” of code
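
The same loop can also be written with a declared file handle and the three-argument form of open, which is a common style in newer Perl code. This is a sketch; count_lines() is a hypothetical helper, and "shelley.txt" is assumed to be in the current directory.

```perl
use strict;
use warnings;

# count_lines() is a hypothetical helper that counts the lines in a file.
sub count_lines {
    my ($path) = @_;
    open( my $in, '<', $path ) or die "Cannot open $path: $!";
    my $count = 0;
    while ( my $line = <$in> ) {
        $count++;
    }
    close($in);
    return $count;
}

# Only read the file when it actually exists
if ( -e "shelley.txt" ) {
    print count_lines("shelley.txt"), " lines\n";
}
```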

15 Regular expressions
A pattern which represents a specific sequence of characters
The pattern is given within two forward slashes
Use the =~ operator to test if a given string contains the regex.
Example: $keyword =~ /rain/
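
A minimal sketch of the =~ test inside an if statement. The value of $keyword is made up for illustration.

```perl
use strict;
use warnings;

my $keyword = "rainbow";

# The match is true because "rain" occurs somewhere inside the string
if ( $keyword =~ /rain/ ) {
    print "'$keyword' contains 'rain'\n";
} else {
    print "'$keyword' does not contain 'rain'\n";
}
```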

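16 Regular expressions
Simplest regular expression: simple sequence of characters
Example: /sun/
Also matches: disunited, sunk, asunder (but not Sunday, because matching is case-sensitive by default)
/ sun / (with spaces) does NOT match the following, because the word is followed by punctuation rather than a space:
[…] the gate of the eastern sun,
[…] gloom beneath the noonday sun.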
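
A short sketch that tries the plain pattern on a few words. It also shows that the default comparison is case-sensitive, so a capitalised word such as "Sunday" is not matched by /sun/.

```perl
use strict;
use warnings;

for my $word ( "disunited", "sunk", "asunder", "Sunday" ) {
    my $result = $word =~ /sun/ ? "matches" : "does not match";
    print "$word $result /sun/\n";
}

# Surrounding the word with spaces fails when punctuation follows it
my $line = "gloom beneath the noonday sun.";
print "with spaces: ", ( $line =~ / sun / ? "match" : "no match" ), "\n";
```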

17 \b can be used in regular expressions to represent word boundaries
If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner
/\bsun\b/i
[…] Points to the unrisen sun! […]
[…] Startles the dreamer, sun-like truth […]
[…] stamped upon the sun; […]
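
The pattern above can be tried out as follows. This is a sketch; has_word_sun() is a hypothetical helper, and the last two test strings are made up.

```perl
use strict;
use warnings;

# has_word_sun() is a hypothetical helper: 1 if "sun" occurs as a whole word
sub has_word_sun {
    my ($line) = @_;
    return $line =~ /\bsun\b/i ? 1 : 0;
}

for my $line (
    "Points to the unrisen sun!",
    "Startles the dreamer, sun-like truth",
    "a sunken ship",
    "THE SUN ALSO RISES",
) {
    printf "%-40s %s\n", $line, has_word_sun($line) ? "match" : "no match";
}
```

The hyphen in "sun-like" counts as a word boundary, so that line matches, while "sunken" does not.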

18 Character classes
. Any character (except a newline)
\w Any alphanumerical character: alphabetical characters, numbers and underscore
\d Any digit
\s White space: space, tab, newline
[..] Any of the characters supplied within square brackets, e.g. [A-Za-z]
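
A small sketch that applies some of these classes to one string. The string is invented for illustration.

```perl
use strict;
use warnings;

my $string = "Canto 3, line 42";

print "contains a digit\n"          if $string =~ /\d/;
print "contains white space\n"      if $string =~ /\s/;
print "starts with a capital\n"     if $string =~ /^[A-Z]/;

# With the g flag in list context, a match returns every hit
my @words = $string =~ /\w+/g;      # ("Canto", "3", "line", "42")
print "words: @words\n";
```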

19 Quantifiers
{n,m} Pattern must occur at least n times, at most m times
{n,} At least n times
{n} Exactly n times
? is the same as {0,1}
+ is the same as {1,}
* is the same as {0,}
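
The quantifiers can be compared side by side with a sketch like the one below. The test strings are made up, and the ^ and $ anchors are used so that the whole string has to fit the pattern.

```perl
use strict;
use warnings;

for my $string ( "ct", "cat", "caat", "caaat" ) {
    my @hits;
    push @hits, "a?"     if $string =~ /^ca?t$/;       # zero or one "a"
    push @hits, "a+"     if $string =~ /^ca+t$/;       # one or more
    push @hits, "a*"     if $string =~ /^ca*t$/;       # zero or more
    push @hits, "a{2}"   if $string =~ /^ca{2}t$/;     # exactly two
    push @hits, "a{1,2}" if $string =~ /^ca{1,2}t$/;   # one or two
    print "$string: @hits\n";
}
```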

20 Examples
/\d{4}/
Matches: 1234, 2013, 1066
/[a-zA-Z]+/
Matches any word that consists of alphabetical characters only
Does not FULLY match: , catch22, can’t
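
To test whether a word matches FULLY, the pattern can be wrapped in the ^ and $ anchors. This is a sketch; is_alphabetical() is a hypothetical helper, and a plain ASCII apostrophe is used in the test word.

```perl
use strict;
use warnings;

# is_alphabetical() is a hypothetical helper: 1 if the whole word is letters only
sub is_alphabetical {
    my ($word) = @_;
    return $word =~ /^[a-zA-Z]+$/ ? 1 : 0;
}

for my $word ( "wisdom", "catch22", "can't" ) {
    print "$word: ", ( is_alphabetical($word) ? "full match" : "no full match" ), "\n";
}

# /\d{4}/ finds four digits in a row anywhere in the string
print "year found\n" if "In the year 1066" =~ /\d{4}/;
```

Without the anchors, /[a-zA-Z]+/ would still match the "catch" part of "catch22".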

21 Examples
/b[aeiou]{1,2}t\w*/
Matches: bit, boathouse, but, beat
Does not match: beauty, beast, blister, boyhood
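
The pattern can be checked against the words on this slide with Perl's built-in grep function, which keeps the list items for which the test is true. A minimal sketch:

```perl
use strict;
use warnings;

my @words   = qw(bit boathouse beauty but beat beast blister boyhood);

# Keep only the words that contain b + one or two vowels + t
my @matches = grep { /b[aeiou]{1,2}t\w*/ } @words;

print "Matches: @matches\n";
```

"beauty" fails because three vowels separate the b from the t, and "beast" fails because an s follows the vowels.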

22 Anchors
Do not match characters, but locations within strings.
\b Word boundaries
^ Start of a line
$ End of a line
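
A short sketch of the start and end anchors. The example lines are made up.

```perl
use strict;
use warnings;

my @lines = ( "sun and moon", "the rising sun", "sunset" );

for my $line (@lines) {
    print "starts with 'sun': $line\n" if $line =~ /^sun/;
    print "ends with 'sun':   $line\n" if $line =~ /sun$/;
}
```

Anchors can be combined: /^sun\b/ only accepts lines whose first word is exactly "sun", so "sunset" is rejected.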

23 Regular expressions can be combined with the vertical bar (‘|’)
/\bsun\b|\bstar\b|\bmoon\b/
‘Special characters’ need to be escaped with the backslash (‘\’)
/\?/
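
Both patterns on this slide can be tried out with the sketch below. The helper names mentions_sky() and is_question() are hypothetical, as are the test sentences.

```perl
use strict;
use warnings;

# mentions_sky() is a hypothetical helper: 1 if any of the three words occurs
sub mentions_sky {
    my ($line) = @_;
    return $line =~ /\bsun\b|\bstar\b|\bmoon\b/ ? 1 : 0;
}

# is_question() is a hypothetical helper: 1 if the line contains a literal "?"
sub is_question {
    my ($line) = @_;
    return $line =~ /\?/ ? 1 : 0;
}

for my $line ( "the moon is up", "starlight on the sea", "Who goes there?" ) {
    print "$line | sky: ", mentions_sky($line), " | question: ", is_question($line), "\n";
}
```

Without the backslash, the question mark would be read as a quantifier instead of a literal character.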

24 Exercises Try your hand at Perl exercises on regular expressions, numbers 9 and 10

