Digital Text and Data Processing

Week 4

Data creation
    Data created using Perl
    CSV as output
Data analysis
    CSV as input
    Analysed using R

Programming languages
Used to give instructions to a computer
There is a gap between human language and machine language
Digital information is information represented as combinations of 1s and 0s, e.g.: A =

First generation programming languages: Assembler, e.g. ADD X1 Y1
Higher-level programming languages: compilers or interpreters
[Diagram labels: Human programmer, Programming language (e.g. Perl), Machine language, Language processor, Computer]

APPEAL:
listen (please, please);
open yourself, wide;
join (you, me),
connect (us,together),
tell me.
do something if distressed;
@dawn, sing;
read (books,$poems,stories) until peaceful;
study if able;
write me if-you-please;

Fragment from a poem in Perl by Sharon Hopkins.

Data mining has many applications: in the financial world, in counter-terrorism and in online marketing (e.g. Pariser, The Filter Bubble)
Publishers and humanities scholars alike are exploring possibilities for texts, reading, literary criticism, etc.
    Predicting bestsellers
    Recommendation services
    Enhancing the reading experience
“Distant Reading” as dominant new methodology in Digital Humanities

Comma Separated Values
    i,you,he
    Emma,160416,3178,1994
    Persuasion,77431,1284,918
    PrideAndPrejudice,121812,2068,1356

N.B. The first row in this particular example has one column fewer than the other rows.
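A minimal sketch of how a Perl program might produce a table like the one above. The values are copied from the example; the names @header, @rows and csv_lines are illustrative and do not come from the course scripts.

```perl
use strict;
use warnings;

# Header and rows copied from the example above. The variable and
# function names are illustrative only.
my @header = ( "i", "you", "he" );
my @rows   = (
    [ "Emma",              160416, 3178, 1994 ],
    [ "Persuasion",        77431,  1284, 918 ],
    [ "PrideAndPrejudice", 121812, 2068, 1356 ],
);

# Turn the header and the rows into a list of CSV lines.
sub csv_lines {
    my ( $header, $rows ) = @_;
    my @lines = ( join( ",", @$header ) );
    foreach my $row (@$rows) {
        push( @lines, join( ",", @$row ) );
    }
    return @lines;
}

# Print the table; redirect the output to a file to save it as CSV.
print "$_\n" foreach csv_lines( \@header, \@rows );
```

When R reads a file whose header has one field fewer than the data rows, read.csv uses the first column as row names, which is probably why the header is written this way.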

R: both a programme and a programming language
“A free software environment for statistical computing and graphics”
Fully open source
Successor of “S”
The capabilities of R can be extended via external “packages”
You can use the “simple” R application and RStudio

Common words
Zipf’s law: a small number of words have a high frequency, while a large number of words are ‘hapax legomena’ (words that appear only once)
Function words and lexical words
Common words may be ignored by making use of a list of stop words, e.g. the Glasgow stop word list
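The distribution described above can be inspected with a short frequency count. This is a sketch under simple assumptions: words are split on non-word characters and lower-cased, and the function names word_frequencies and hapax_legomena are illustrative.

```perl
use strict;
use warnings;

# Count how often each word occurs in a string.
sub word_frequencies {
    my $text = shift;
    my %freq;
    foreach my $word ( split( /\W+/, lc($text) ) ) {
        next if $word eq "";
        $freq{$word}++;
    }
    return %freq;
}

# Return the words that occur exactly once, in alphabetical order.
sub hapax_legomena {
    my %freq  = @_;
    my @hapax = sort { $a cmp $b } grep { $freq{$_} == 1 } keys %freq;
    return @hapax;
}

my %freq = word_frequencies("The cat saw the dog and the dog saw the bird");

# Print the words from most to least frequent.
foreach my $word ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    print "$word\t$freq{$word}\n";
}
print "Hapax legomena: ", join( ", ", hapax_legomena(%freq) ), "\n";
```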

Functions

    sub myFunction($) {
        my $parameter = shift ;
        return $parameter ;
    }

    print &myFunction("parameter") ;

Functions are declared using the keyword “sub”
They can take parameters as input (and these can be read using shift)
Functions return a value
Once the function has been defined, it can be invoked using “&” followed by the name of the function

    print &square(7) ;    ## This will print "49".

    sub square($) {
        my $number = shift ;
        return $number * $number ;
    }

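Removing stopwords
Save the list as “stopwords.txt” and read it into an array

    my @stopwords ;    # array name assumed; it is missing from the transcript
    open ( ST , "stopwords.txt" ) or die "Can't read file!" ;
    while(<ST>) {
        chomp ;    # remove the line ending so it does not end up in the pattern
        if ($_ =~ /\w/) {
            push ( @stopwords , $_ ) ;
        }
    }
    close( ST ) ;

Change the array into a long string, using join()

    my $stopwords = join( "|" , @stopwords ) ;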

Add a function which can test if a word is contained within this long string

    sub isstopword($) {
        my $text = shift ;
        # The word is a stop word if it matches one of the alternatives
        if ( lc($text) =~ /\b($stopwords)\b/) {
            return 1 ;
        } else {
            return 0 ;
        }
    }

Next, this function can be used in the tokenisation program (“!” is a negation)

    if ( !( &isstopword($word) )) {
        $freq{ $word }++ ;
    }
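The steps above can be put together in one small program. This is only a sketch: a five-word inline list stands in for the contents of “stopwords.txt”, the tokenisation simply splits on non-word characters, and count_content_words is a hypothetical name. The call to quotemeta is an addition that is not on the slides; it stops special characters in a stop word from being read as part of the pattern.

```perl
use strict;
use warnings;

# A tiny inline list stands in for the contents of "stopwords.txt";
# these five words are placeholders, not the Glasgow list.
my @stopwords = ( "the", "a", "of", "and", "in" );
my $stopwords = join( "|", map { quotemeta($_) } @stopwords );

# Return 1 if the word matches one of the stop words, 0 otherwise.
sub isstopword {
    my $text = shift;
    if ( lc($text) =~ /\b($stopwords)\b/ ) {
        return 1;
    }
    return 0;
}

# Count every token that is not a stop word.
sub count_content_words {
    my $text = shift;
    my %freq;
    foreach my $word ( split( /\W+/, lc($text) ) ) {
        next if $word eq "";
        if ( !isstopword($word) ) {
            $freq{$word}++;
        }
    }
    return %freq;
}

my %freq = count_content_words("The house of the sister and the house in the park");
foreach my $word ( sort keys %freq ) {
    print "$word: $freq{$word}\n";
}
```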

Collocation
Frequencies of the words that are used near a given search term
In the file “collocation.pl”, the variable $distance specifies the size of the “window”: the context that will be considered
It reads in texts in “paragraph mode”:

    # Paragraph mode for while loop
    $/ = "" ;
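To illustrate the idea of a window, here is a minimal sketch that counts the words within $distance tokens of a search term in a single paragraph. It is not the code of “collocation.pl”, which may work differently; the function name collocates and the sample paragraph are made up for the example.

```perl
use strict;
use warnings;

# Count the words that occur within $distance tokens of $term.
sub collocates {
    my ( $text, $term, $distance ) = @_;
    my @tokens = grep { $_ ne "" } split( /\W+/, lc($text) );
    my %freq;
    for my $i ( 0 .. $#tokens ) {
        next unless $tokens[$i] eq lc($term);

        # Clip the window to the start and the end of the paragraph.
        my $start = $i - $distance;
        $start = 0 if $start < 0;
        my $end = $i + $distance;
        $end = $#tokens if $end > $#tokens;

        for my $j ( $start .. $end ) {
            next if $j == $i;    # skip the occurrence at the centre of the window
            $freq{ $tokens[$j] }++;
        }
    }
    return %freq;
}

my $paragraph = "She loved the sea. The sea was calm, and the sea was wide.";
my %freq = collocates( $paragraph, "sea", 2 );
foreach my $word ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    print "$word\t$freq{$word}\n";
}
```

A larger $distance widens the context and usually brings in more function words, which is one reason to combine collocation counts with a stop word list.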

Textual units
The special variable $/ defines the “textual units” that are read by the program

    $/ = "\n" ;     # Line by line (default option)
    $/ = "" ;       # Paragraph mode
    $/ = undef ;    # Full text (no segmentation)
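The effect of the three settings can be tried out without any files by reading from a string. The sketch below assumes an in-memory filehandle as a stand-in for a real text file; read_units is an illustrative name.

```perl
use strict;
use warnings;

# Read $text through a filehandle and return the units that the
# given value of $/ produces.
sub read_units {
    my ( $text, $separator ) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    open( my $fh, "<", \$text ) or die "Can't open in-memory file!";
    my @units = <$fh>;
    close($fh);
    return @units;
}

my $text = "Line one\nLine two\n\nSecond paragraph\n";

my @lines      = read_units( $text, "\n" );
my @paragraphs = read_units( $text, "" );
my @whole      = read_units( $text, undef );

print "Units in line mode: ",      scalar(@lines),      "\n";
print "Units in paragraph mode: ", scalar(@paragraphs), "\n";
print "Units in full-text mode: ", scalar(@whole),      "\n";
```

Using local inside the function keeps the change to $/ from affecting other parts of a program that still expect line-by-line reading.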

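Reading a corpus
Save all the .txt files in a folder named, for instance, “Corpus”

    my $dir = "Corpus" ;
    my @files ;    # array name assumed; it is missing from the transcript
    opendir(DIR, $dir) or die "Can't open directory!" ;
    while (my $file = readdir(DIR)) {
        # Keep only the names that end in "txt"
        if ( $file =~ /txt$/) {
            push ( @files , $file ) ;
        }
    }
    closedir(DIR) ;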

Exercise
Create a folder within your DTDP working environment, and name it “Corpus”
Save all the files for your own project in this corpus (or use the sample corpus)
Make a program which simply lists all the filenames.

Type-token ratio = Number of types / Number of tokens

Type-token ratio
The higher the number, the higher the vocabulary diversity.
If the number is (relatively) low, there is a high level of repetition
The length of the text has an impact on the type-token ratio
You are advised to “normalise” type-token ratios: count the types and tokens exclusively in, for instance, the first 2000 words
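A normalised ratio of this kind could be computed roughly as follows. The sketch assumes a simple tokenisation (lower-casing and splitting on non-word characters), and type_token_ratio is an illustrative name; the limit of 2000 follows the advice above.

```perl
use strict;
use warnings;

# Type-token ratio over at most the first $limit tokens of a text.
sub type_token_ratio {
    my ( $text, $limit ) = @_;
    my @tokens = grep { $_ ne "" } split( /\W+/, lc($text) );

    # Normalise: look only at the first $limit tokens.
    if ( @tokens > $limit ) {
        @tokens = @tokens[ 0 .. $limit - 1 ];
    }
    return 0 if @tokens == 0;

    # Every distinct token is one type.
    my %types;
    $types{$_}++ foreach @tokens;
    return scalar( keys %types ) / scalar(@tokens);
}

my $sample = "the cat sat on the mat and the dog sat on the rug";
printf( "%.2f\n", type_token_ratio( $sample, 2000 ) );
```

Because every text is cut off at the same number of tokens, the ratios of a short and a long text can be compared more fairly.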
