Published by Hubert James. Modified over 9 years ago.
1
Digital Text and Data Processing: Tokenisation
2
Today’s class
□ Tokenisation and creation of frequency lists
□ Keyword in context lists
□ Moretti and distant reading
□ Research projects and assignment 1
3
Revision
□ Regular expressions
□ Simple sequences of characters
□ Character classes, e.g. \w, \d or .
□ Quantifiers, e.g. {2,4} or ?, +, *
□ Anchors, e.g. \b, ^, $
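As a quick refresher, the sketch below combines the three ingredients listed above (character classes, quantifiers and anchors) in Perl. The sample string and the variable names are assumptions chosen for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample string, used only to illustrate the patterns.
my $text = "Still page 322?";

# \b is a word boundary (anchor); \d+ is one or more digits.
my $has_number = ( $text =~ /\b\d+\b/ ) ? 1 : 0;

# ^ anchors the match at the start; \w+ is one or more word characters.
my $starts_with_word = ( $text =~ /^\w+/ ) ? 1 : 0;

# {2,4} allows two to four digits; \? is a literal question mark;
# $ anchors the match at the end of the string.
my $ends_with_page_query = ( $text =~ /\b\d{2,4}\?$/ ) ? 1 : 0;

# A bare . matches any single character, so "p.ge" matches "page".
my $dot_matches = ( $text =~ /p.ge/ ) ? 1 : 0;

print "$has_number $starts_with_word $ends_with_page_query $dot_matches\n";
```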
4
Match variables
□ Parentheses create substrings within a regular expression
□ In Perl, the text matched by the first pair of parentheses is stored in the variable $1
□ Example:
    $keyword = "quick-thinking" ;
    if ( $keyword =~ /(\w+)-\w+/ ) {
        print $1 ;   # This will print "quick"
    }
5
Three types of variables
□ Scalars: a single value; start with $
□ Arrays: multiple values; start with @
    @titles = ("Ulysses", "Dubliners", "Finnegans Wake") ;
□ Hashes: multiple values which can be referenced with ‘keys’; start with %
    my %isbn ;
    $isbn{"9782070439713"} = "Ulysses" ;
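The three variable types can be tried out side by side. This is a minimal sketch: the scalar $author is an assumption added for illustration, while the array and the hash follow the slide.

```perl
use strict;
use warnings;

# Scalar: a single value ($author is a hypothetical name for this example).
my $author = "James Joyce";

# Array: an ordered list of values, indexed from 0.
my @titles = ( "Ulysses", "Dubliners", "Finnegans Wake" );

# Hash: values looked up by key.
my %isbn;
$isbn{"9782070439713"} = "Ulysses";

print $author, "\n";
print $titles[1], "\n";                 # a single element is a scalar, hence $
print scalar(@titles), "\n";            # number of elements in the array
print $isbn{"9782070439713"}, "\n";     # value stored under this key
```

Note that a single element of an array or hash is itself a scalar, which is why it is written with $ and not with @ or %.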
6
Basic tokenisation
    $line = "If music be the food of love, play on" ;
    @array = split(" ", $line ) ;
    # $array[0] contains "If"
    # $array[4] contains "food"
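One consequence of splitting on whitespace is that punctuation stays attached to the neighbouring token. The sketch below reuses the line from the slide to show this; the names @tokens and $count are assumptions for illustration.

```perl
use strict;
use warnings;

my $line   = "If music be the food of love, play on";
my @tokens = split( " ", $line );

# Number of whitespace-separated tokens.
my $count = scalar(@tokens);
print $count, "\n";

# The token after "of" still carries its trailing comma.
print $tokens[6], "\n";
```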
7
Looping through an array
    foreach my $w ( @words ) {
        print $w ;
    }
8
Creating a hash
    my %freq ;
Assigning / updating a value
    $freq{"if"}++ ;
    $freq{"music"}++ ;
    print $freq{"if"} . "\n" ;
9
Calculation of frequencies
    my %freq ;
    foreach my $w ( @words ) {
        $freq{ $w }++ ;
    }
10
Looping through a hash
    foreach my $f ( keys %freq ) {
        print $f . "\t" . $freq{$f} . "\n" ;
    }
11
Sorting a hash
    foreach my $f ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
        print $f . "\t" . $freq{$f} . "\n" ;
    }
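Putting the previous steps together, a complete frequency list might be produced as sketched below. The sample sentence is hypothetical, and the alphabetical tie-break (or $a cmp $b) is an addition that is not on the slides; it only makes the order of equally frequent words predictable.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen so that some words repeat.
my $line  = "the cat and the dog and the bird";
my @words = split( " ", $line );

# Count how often each word occurs.
my %freq;
foreach my $w (@words) {
    $freq{$w}++;
}

# Sort by descending frequency; break ties alphabetically.
my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;

foreach my $f (@ranked) {
    print $f . "\t" . $freq{$f} . "\n";
}
```

The numeric comparison operator <=> is used because the hash values are counts; cmp would compare them as strings.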
12
But she returned to the writing-table, observing, as she passed her son, "Still page 322?" Freddy snorted, and turned over two leaves. For a brief space they were silent. Close by, beyond the curtains, the gentle murmur of a long conversation had never ceased.
13
Is it actually a word?
    foreach my $w ( @words ) {
        if ( $w =~ /(\w+)/ ) {
            $freq{ $1 }++ ;
        }
    }
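Applied to part of the sample passage, the word check strips attached punctuation before counting. This is a sketch under two assumptions that are not on the slides: tokens are lowercased with lc so that "For" and "for" would be counted together, and only the first run of word characters in each token is kept.

```perl
use strict;
use warnings;

# Two sentences taken from the sample passage.
my $text = "Freddy snorted, and turned over two leaves. "
         . "For a brief space they were silent.";

my %freq;
foreach my $w ( split( " ", $text ) ) {
    # Keep only the first run of word characters, in lower case.
    if ( lc($w) =~ /(\w+)/ ) {
        $freq{$1}++;
    }
}

foreach my $f ( sort keys %freq ) {
    print $f . "\t" . $freq{$f} . "\n";
}
```

With this pattern a hyphenated token such as "writing-table," would be counted as "writing" only, because the match stops at the hyphen; whether that is acceptable depends on the research question.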