1
Digital Text and Data Processing
Week 1
2
Course background
Future of reading
Understanding “Machine reading”:
Text analysis tools
Visualisation tools
Differences between machine reading and human reading
Images taken from textarc.org and from Google App store, Javelin for Android
3
Scale
4
Text Mining
“a collection of methods used to find patterns and create intelligence from unstructured text data” (1)
Related to data mining
Information is found “not among formalised database records, but in the unstructured textual data” (2)
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51
(2) Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1
5
One thing was certain, that the WHITE kitten had had nothing to do with it:--it was the black kitten's fault entirely. For the white kitten had been having its face washed by the old cat for the last quarter of an hour (and bearing it pretty well, considering). [..] And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, 'Do cats eat bats? Do cats eat bats?' In a Wonderland they lie, Dreaming as the days go by, Dreaming as the summers die: Ever drifting down the stream, Lingering in the golden gleam. Life, what is it but a dream? [..] This piece of rudeness was more than Alice could bear: she got up in great disgust, and walked off; the Dormouse fell asleep instantly.
6
Difficulties of natural language
Semantic categories are generally implicit
Inflections: conjugations and declensions
Homonyms and synonyms: the meaning of polysemous words is context-specific
Spelling changes over time and may vary across regions
7
I trod on grass made green by summer's rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart refrain
were found where the rainbow quenches its points upon the earth
Rain rain rains rain’s Rain’s Rain. rain. Rain! ‘rain’
8
Two stages in text mining
Data creation
Data analysis
9
Weekly Programme
Cluster 1: Data creation
W1: Introduction to the course and introduction to the Perl programming language
W2: Regular expressions, word segmentation, frequency lists, types and tokens
W3: Natural language processing: Part of Speech tagging, lemmatisation
W4: Sentence segmentation, complexity metrics
10
Weekly Programme
Cluster 2: Data analysis
W5: Introduction to R package
W6: Ggplot: creating visualisations
W7: Topic Modelling, Multivariate analysis: Principal Component Analysis, Clustering techniques
W7: Geographic information
11
Individual Research project
Techniques taught in DTDP generally enable you to study formal differences and similarities between texts, e.g. vocabulary, sentence length, grammatical structure
Create a corpus consisting of ten different texts at a minimum, all of ca words or more; you can copy texts from existing corpora
You can apply the techniques which are explained in this class to your own corpus
Formulate your own research question
12
Course evaluation
Final essay (ca. 4,000 words)
Report of your individual research project (50%)
Critical reflection on digital humanities research (50%)
How does this type of research relate to traditional scholarship?
Is programming a legitimate scholarly activity in the humanities?
Can visualisations of texts function as independent scholarly resources?
Five “Coding Challenges” which need to be marked as sufficient
13
Course syllabus can be found at www.bookandbyte.org/DTDP
Weekly programme (including the homework for each week)
Course organisation
Exercises and coding challenges
Links to software tools and tutorials
Bibliography
List of text corpora
Practical organisation of classes
14
Getting started
Download and install Perl
Create a working directory on your computer
Open a code editor and type the following line:
    print "It works!" ;
Save the file, with extension .pl
Use the .bat file that is provided. In the command prompt, type in: “[name of file].pl”
15
Variables
Scalar variables are always preceded by a dollar sign
    $keyword
Variables can be assigned a value with a specific data type (‘string’ or ‘number’)
    $keyword = "time" ;
    $number = 10 ;
Three types of variables: scalar, array, hash
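The three variable types can be illustrated with a minimal sketch. The names and values below (@keywords, %frequency and their contents) are made up for illustration, and "use strict" with "my" declarations is a common convention rather than something the slides require.

```perl
use strict;
use warnings;

# Scalar: holds a single value (a string or a number), written with $.
my $keyword = "time";
my $number  = 10;

# Array: an ordered list of scalars, written with @.
my @keywords = ("fire", "rain", "moon");

# Hash: a set of key/value pairs, written with %.
my %frequency = (rain => 3, moon => 1);

print "$keyword\n";             # the scalar itself
print "$keywords[1]\n";         # second element of the array (counting starts at 0)
print "$frequency{rain}\n";     # value stored under the key "rain"
print scalar(@keywords), "\n";  # number of elements in the array
```

Note that a single element of an array or hash is itself a scalar, which is why it is accessed with a dollar sign.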
16
Strings
Can be created with single quotes and with double quotes
In the case of double quotes, the contents of the string will be interpreted. You can then use “escape characters” in your string to add basic formatting:
    "\n"  new line
    "\t"  tab
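The difference between the two kinds of quotes can be seen in a short sketch. The variable names $single and $double are chosen here for illustration only.

```perl
use strict;
use warnings;

my $keyword = "time";

# Single quotes: the contents are kept literally,
# so $keyword and \n are NOT interpreted.
my $single = 'Keyword: $keyword\n';

# Double quotes: variables and escape characters are interpreted.
my $double = "Keyword: $keyword\n";

print $single, "\n";
print $double;
print "first\tsecond\n";   # \t inserts a tab between the two words
```

With single quotes the characters $keyword and \n appear in the output exactly as typed; with double quotes the value of the variable and a real line break are printed instead.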
17
Statements
Perl statements can be compared to sentences.
Perl statements end in a semi-colon!
    print "This is a statement!" ;
18
Writing to file
Done as follows:
    open( OUT , ">out.txt" ) ;
    print OUT "Text in out.txt" ;
    close( OUT ) ;
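A slightly more defensive variant of the same steps is sketched below. It is not the form used on the slide: it assumes the three-argument form of open with a lexical filehandle ($out, a name chosen here), and it stops with an error message if the file cannot be opened.

```perl
use strict;
use warnings;

my $file = "out.txt";

# ">" opens the file for writing and overwrites any existing contents.
# $! holds the system error message if opening fails.
open(my $out, ">", $file) or die "Cannot open $file: $!";

print $out "Text in out.txt\n";

close($out) or die "Cannot close $file: $!";
```

Checking the result of open is useful because a failed open is otherwise silent and the print statement simply writes nothing.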
19
Exercise
Create a new file named “data.txt”, and print the following lines to it:
This is the first line.
This is the second line.
This line contains a tab.
20
Operators
=  Assignment, e.g. $a = 5 ;
Arithmetic operators
+  Addition
-  Subtraction
*  Multiplication
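The operators can be tried out in a small sketch. The variable names and numbers below are hypothetical (two invented token counts), not values taken from the course material.

```perl
use strict;
use warnings;

# Hypothetical token counts for two texts.
my $tokens_a = 120;
my $tokens_b = 80;

my $total      = $tokens_a + $tokens_b;   # addition
my $difference = $tokens_a - $tokens_b;   # subtraction
my $doubled    = $tokens_a * 2;           # multiplication

print "Total: $total\n";
print "Difference: $difference\n";
print "Doubled: $doubled\n";
```

The names $a and $b are avoided in this sketch because Perl reserves them for use inside sort, which can lead to confusing behaviour in larger programs.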
21
Exercise
Create two variables, and assign a numerical value to both of them
Print their sum, their difference and their product.
22
Reading a file
Use the following code:
    open ( IN , "shelley.txt" ) ;
    while ( <IN> ) {
        print $_ ;
    }
    close ( IN ) ;
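The reading loop can also be written with an explicit line variable and error checking, as in the sketch below. So that it runs on its own, it first writes a small stand-in file; the file name "sample.txt" and the array @lines are assumptions made for this example, and in class the input would be "shelley.txt".

```perl
use strict;
use warnings;

my $file = "sample.txt";   # hypothetical stand-in for shelley.txt

# Create a small input file so that the sketch is self-contained.
open(my $out, ">", $file) or die "Cannot write $file: $!";
print $out "I trod on grass made green by summer's rain,\n";
print $out "And suddenly my brain became as sand\n";
close($out);

my @lines;
open(my $in, "<", $file) or die "Cannot read $file: $!";
while (my $line = <$in>) {
    chomp $line;                    # remove the trailing newline
    push @lines, $line;             # keep the line for later use
    print $., ": ", $line, "\n";    # $. holds the current line number
}
close($in);
```

Each pass through the while loop reads one line, so the file never has to fit into memory all at once unless the lines are deliberately stored, as they are here in @lines.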
23
Exercise
Create a Perl application which can read the text file “shelley.txt” and which copies all the lines to a new file.
24
Control keywords
    if ( <condition> ) {
        <first block of code>
    } elsif ( <condition> ) {
        <second block of code>
    } else {
        <last block of code ; default option>
    }
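A concrete instance of this template is sketched below. The thresholds (20 and 40 characters) and the category labels are arbitrary choices made for illustration.

```perl
use strict;
use warnings;

my $line   = "Life, what is it but a dream?";
my $length = length($line);   # number of characters in the line

my $category;
if ( $length < 20 ) {
    $category = "short";
} elsif ( $length < 40 ) {
    $category = "medium";
} else {
    $category = "long";       # default option
}

print "$length characters: $category\n";
```

The conditions are tested from top to bottom, and only the block belonging to the first true condition is executed.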
25
Regular expressions
A pattern which represents a particular sequence of characters
The pattern is given within two forward slashes
Use the =~ operator to test if a given string matches the regular expression.
Example:
    $keyword =~ /rain/
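How the =~ test behaves on real lines can be sketched as follows. The array names @lines and @matches are chosen for this example; the sample lines are taken from the earlier slides.

```perl
use strict;
use warnings;

my @lines = (
    "I trod on grass made green by summer's rain,",
    "And suddenly my brain became as sand",
    "Life, what is it but a dream?",
);

my @matches;
foreach my $line (@lines) {
    # True if the characters r-a-i-n occur anywhere in the line.
    if ( $line =~ /rain/ ) {
        push @matches, $line;
        print "$line\n";
    }
}
```

The second line is also selected, because "brain" contains the character sequence "rain". A plain pattern tests for a sequence of characters, not for a word.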
26
Exercise
Create an application in Perl which can read a machine-readable version of Shelley’s Collected Poems (file is provided) and which can print all lines that contain a given keyword. (suggestions: “fire”, “rain”, “moon”, “storm”, “time”)
27
Regular expressions (2)
If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner.
\b can be used in regular expressions to represent word boundaries
    if ( $keyword =~ /\btime\b/i ) { }
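The effect of \b and of the "i" modifier can be compared with a plain pattern in a short sketch. The array names @plain and @word are assumptions made for this example; the sample lines come from the earlier slide on the difficulties of natural language.

```perl
use strict;
use warnings;

my @lines = (
    "Through the fast-falling rain and high-wrought sea",
    "And suddenly my brain became as sand",
    "She mixed; some impulse made my heart refrain",
    "were found where the rainbow quenches its points upon the earth",
    "Rain!",
);

my ( @plain, @word );
foreach my $line (@lines) {
    # Plain pattern: matches "rain" inside any longer word, case-sensitive.
    push @plain, $line if $line =~ /rain/;

    # Word boundaries plus /i: matches only the separate word, in any case.
    push @word, $line if $line =~ /\brain\b/i;
}

print "Plain pattern: ", scalar(@plain), " lines\n";
print "Whole word:    ", scalar(@word), " lines\n";
```

The plain pattern selects "brain", "refrain" and "rainbow" but misses "Rain!" because of the capital letter, whereas the second pattern selects only the two lines in which "rain" occurs as a word of its own.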
28
Coding challenge
Create a program that can count the total number of lines in the file “shelley.txt”
Create a program that can calculate the length of each line, using the length() function
    length( $line ) ;
Calculate the average line length (in characters) for the entire file.