
1 Digital Text and Data Processing
Week 5

2 Analyses of single texts
Frequency lists, possibly filtered using a list of stopwords
Distribution analysis
Concordance (or Keyword in Context list)
Collocation
N.B. Code is available in the DTDP file repository

3 Removing stopwords
Save the list as "stopwords.txt"

my @stopwords ;
open( ST , "stopwords.txt" ) or die "Can't read file!" ;
while (<ST>) {
    if ( $_ =~ /\w/ ) { chomp ; push( @stopwords , $_ ) ; }
}
close( ST ) ;

4 Change the array into a long string, using join()
my $stopwords = join( "|" , @stopwords ) ;

5 Functions
Functions are declared using the keyword "sub"
They can take parameters as input (and these can be read using shift)
Functions return a value
Once the function has been defined, it can be invoked using "&" followed by the function name

sub myFunction($) {
    my $parameter = shift ;
    return $parameter ;
}
print &myFunction( "parameter" ) ;

6 Add a function which can test if a word is contained within this long string
sub isstopword($) {
    my $text = shift ;
    if ( lc($text) =~ /\b($stopwords)\b/ ) {
        return 1 ;
    } else {
        return 0 ;
    }
}

7 Next, this function can be used in the tokenisation program
"!" is the negation operator

if ( !( &isstopword($word) ) ) {
    $freq{ $word }++ ;
}
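
To make the flow easier to follow, here is a minimal sketch of a complete frequency-counting loop built around &isstopword(); the file name "text.txt" and the split-based tokenisation are assumptions made for illustration, not the course's actual tokenisation program.

# Sketch only: assumes @stopwords, $stopwords and &isstopword from the previous slides
my %freq ;
open( IN , "text.txt" ) or die "Can't read file!" ;   # "text.txt" is an assumed file name
while (<IN>) {
    # crude tokenisation: lowercase the line and split on non-word characters
    foreach my $word ( split( /\W+/ , lc($_) ) ) {
        next if $word eq "" ;
        if ( !( &isstopword($word) ) ) {
            $freq{ $word }++ ;
        }
    }
}
close( IN ) ;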

8 Collocation
Frequencies of the words that are used near a given search term
In the file "collocation.pl", the variable $distance specifies the size of the "window": the context that will be considered
It reads in texts in "paragraph mode":

# Paragraph mode for while loop
$/ = "" ;
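
Purely as an illustration of the window idea (this is not the actual collocation.pl from the repository), a window-based collocation count might look roughly like this; the search term "love", the file name "text.txt" and the split-based tokenisation are assumptions.

# Illustrative sketch, not the course's collocation.pl
my $searchTerm = "love" ;    # assumed search term
my $distance   = 5 ;         # size of the "window" on each side of the search term
my %colloc ;

$/ = "" ;                    # paragraph mode
open( IN , "text.txt" ) or die "Can't read file!" ;
while (<IN>) {
    my @words = split( /\W+/ , lc($_) ) ;
    for my $i ( 0 .. $#words ) {
        next unless $words[$i] eq $searchTerm ;
        # count every word within $distance positions of the search term
        my $start = $i - $distance ; $start = 0 if $start < 0 ;
        my $end   = $i + $distance ; $end = $#words if $end > $#words ;
        for my $j ( $start .. $end ) {
            next if $j == $i ;
            $colloc{ $words[$j] }++ ;
        }
    }
}
close( IN ) ;

# print the collocates, most frequent first
foreach my $word ( sort { $colloc{$b} <=> $colloc{$a} } keys %colloc ) {
    print "$word\t$colloc{$word}\n" ;
}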

9 Textual units
The special variable $/ defines the "textual units" that are read by the program:

$/ = "\n" ;     # line by line (default option)
$/ = "" ;       # paragraph mode
$/ = undef ;    # full text (no segmentation)

11 Reading a corpus
Save all the .txt files in a folder named, for instance, "Corpus"

my @files ;
my $dir = "Corpus" ;
opendir( DIR , $dir ) or die "Can't open directory!" ;
while ( my $file = readdir(DIR) ) {
    if ( $file =~ /txt$/ ) {
        push( @files , $file ) ;
    }
}
closedir( DIR ) ;

12 Two-dimensional hashes
Hashes with two keys
Often useful for collecting data about different texts

$freq{ $text }{ $word }
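
A small sketch of how such a two-dimensional hash can be filled and read back; it assumes the @files and $dir variables from the "Reading a corpus" slide and a crude split-based tokenisation.

# Sketch: word frequencies per text, stored in a two-dimensional hash
my %freq ;
foreach my $file ( @files ) {
    open( IN , "$dir/$file" ) or die "Can't read $file!" ;
    while (<IN>) {
        foreach my $word ( split( /\W+/ , lc($_) ) ) {
            next if $word eq "" ;
            $freq{ $file }{ $word }++ ;    # first key: text, second key: word
        }
    }
    close( IN ) ;
}

# Reading the data back: frequency of the (arbitrarily chosen) word "river" in each text
foreach my $text ( keys %freq ) {
    print "$text\t" , ( $freq{ $text }{ "river" } || 0 ) , "\n" ;
}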

13 Lexical variety
Number of types (distinct word forms)
Number of tokens (running words)

14 Type-token ratio
The type-token ratio is the number of types divided by the number of tokens
The higher the number, the higher the vocabulary diversity; if the number is (relatively) low, there is a high level of repetition
The length of the text has an impact on the type-token ratio

15 This is a sentence in which all the words are unique.
11 types / 11 tokens = 1
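
As a small illustration, the type-token ratio can be computed from a frequency hash; the file name "text.txt" and the split-based tokenisation are again assumptions.

# Sketch: compute the type-token ratio for a single text
my %freq ;
my $tokens = 0 ;
open( IN , "text.txt" ) or die "Can't read file!" ;
while (<IN>) {
    foreach my $word ( split( /\W+/ , lc($_) ) ) {
        next if $word eq "" ;
        $freq{ $word }++ ;
        $tokens++ ;
    }
}
close( IN ) ;

my $types = scalar( keys %freq ) ;    # number of distinct word forms
print "Type-token ratio: " , $types / $tokens , "\n" if $tokens > 0 ;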

18 Variables in R
Any combination of alphanumerical characters, underscores and dots (but the name cannot begin with a digit or an underscore, and a leading dot may not be followed by a digit)
Unlike Perl, they do not begin with a $
The assignment operator in R is <-

n <- 5

19 Vectors
A collection of indexed values (similar to an array in Perl)
Can be created using the c() function, or by supplying a range
Examples:

x <- c( 4, 5, 3, 7) ;
y <- 1:30 ;

20 Data frame
A collection of vectors, all of the same length
Each column of the table is stored in R as a vector.

      V1   V2   V3
R1     3    4    5
R2     1   21    8
R3    23    5    6

21 CSV file
type,token
ARoomWithaView.txt,6925,66445
ATaleofTwoCities.txt,10188,135584
HeartofDarkness.txt,5512,37896
Ivanhoe.txt,12859,175069
MobyDick.txt,18582,211806
PrideandPrejudice.txt,6454,121781

22 Reading data in R
Use the read.csv() function
The CSV file will be represented as a data frame
The values on the first line will be used as column names; the first value of each subsequent line will be used as its row name

data <- read.csv( "data.csv" , header = TRUE ) ;
colnames(data)

23 Data frame columns Can be accessed using the $ operator
data <- read.csv( "data.csv" , header = TRUE) ; data$year

24 Calculations
max(), min(), mean(), sd()

y <- data$year ;
max(y) ;
sd(y) ;

25 Subsetting
Selecting rows with specific characteristics; the result is a filtered data frame.
You need to give criteria both for rows and for columns (but one of these can also remain empty)

d2 <- d[ d$year > 2000 , ]

26 Basic R Exercises

27 Visualisation
Distinction between "scientific visualisation" and "information visualisation"
Lev Manovich: "a mapping between discrete data and a visual representation" (p. 2)
Conversion to a graphic modality

28 Data Visualisation
Jacques Bertin created a classification of "graphical primitives" in Semiology of Graphics
Similar to Lev Manovich's description of the "graphical primitives such as points, straight lines, curves, and simple geometric shapes" which "stand in for objects and relations between them"

29 Ggplot
A package for the creation of visuals (next to the base plotting system)
A technical implementation of Leland Wilkinson's book The Grammar of Graphics
Images express meaning via position, size, shape, rotation, colour, saturation, orientation and blur (p. 118)

30 Grammar of Graphics
Graphs consist of (1) aesthetic attributes and (2) geometric objects
Data values are expressed through aesthetic attributes; these are provided in the aes() function in ggplot
The ggplot() function specifies the basic data set that is used in the visualisation; additional functions (which begin with geom_) define the geometric objects (the "graphical primitives").

31 Bar Chart
library(ggplot2) ;
e <- read.csv( "e.csv" , header = TRUE ) ;
p <- ggplot( e, aes( x = subject1 ) ) + geom_bar( ) ;

32 Flipping X and Y axes p <- ggplot( e, aes(x=subject1) ) + geom_bar( ) + coord_flip()

33 Scatter plot
p <- ggplot( d, aes( x = avgWords , y = ratio , col = author , shape = century , label = rownames(d) ) ) +
  geom_point( size = 6 ) +
  geom_text( col = "black" , hjust = -0.2 , size = 4 )

34 Documentation

