Digital Text and Data Processing Week 1
Course background
□ Future of reading?
□ Understanding “Machine reading”:
   □ Text analysis tools
   □ Visualisation tools
□ Differences between machine reading and human reading
Images taken from textarc.org and from the Google App store (Javelin for Android)
Scale
Text Mining
□ “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)
□ Information is found “not among formalised database records, but in the unstructured textual data” (2)
□ Related to data mining
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51.
(2) Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1.
Difficulties of natural language
□ Information is often implicit
□ Homonyms and synonyms
□ Computers do not have access to the meaning of the text
□ Spelling changes over time or may vary according to region
I trod on grass made green by summer's rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart refrain
were found where the rainbow quenches its points upon the earth
Rain rain rains rain’s Rain’s Rain. rain. Rain! ‘rain’
The outworn creeds again believed,
Hatred, despair, and fear and vain belief
Because I am a Priest do you believe
imagine, while asserting what it believes to be true …
The pleasure of believing what we see
long-believing courage, and the systematic efforts of generations of
Two stages in text mining
□ Data creation
□ Data analysis
Weekly Programme
Cluster 1: Data creation
□ W1: Introduction to the course and introduction to the Perl programming language
□ W2: Regular expressions, word segmentation, frequency lists, types and tokens
□ W3: Natural language processing: Part of Speech tagging, lemmatisation
□ W4: Exploration of existing text mining tools
Weekly Programme
Cluster 2: Data analysis
□ W5: Introduction to the R package
□ W6: Multivariate analysis: Principal Component Analysis, clustering techniques
□ W7: Visualisation
□ W8: Conclusion: What type of knowledge can we create?
Course evaluation
□ 5 assignments (2 points to be earned for each)
□ Final essay (ca. 3,000 words)
   □ Report of your individual research project
   □ Critical reflection on the merits of text mining:
      □ What sort of knowledge can be produced?
      □ How does this type of research relate to traditional scholarship?
      □ Main obstacles or challenges?
      □ Is the creation of a text analysis tool a legitimate scholarly activity in the humanities?
Introduction to programming
□ Programming languages: used to give instructions to a computer
□ There is a gap between human language and machine language
□ Digital information is information represented as combinations of 1s and 0s, e.g. the letter A
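To make the point about 1s and 0s concrete, the short sketch below prints the bit pattern a computer stores for the letter A. It assumes the standard ASCII code for A (65) and uses Perl's built-in ord and sprintf functions; the variable names are chosen for this illustration only.

```perl
use strict ;
use warnings ;

my $char = "A" ;

# ord() returns the numeric code of a character (65 for "A" in ASCII).
my $code = ord( $char ) ;

# %08b formats the number in binary, padded with zeros to 8 digits.
my $bits = sprintf( "%08b", $code ) ;

print "$char = $code = $bits\n" ;   # prints: A = 65 = 01000001
```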
□ Low-level programming languages: assembly language, e.g. ADD X1 Y1
□ Higher-level programming languages: translated by a compiler or an interpreter
Human programmer → programming language (e.g. Perl) → language processor → machine language → computer
The Perl programming language
□ Open source
□ Developed by the linguist Larry Wall
□ Easy to learn; code is often easy to read
□ Developed specifically for text processing
Getting started
1. Create a working directory on your computer
2. Open a code editor and type the following lines:
   use strict ;
   use warnings ;
   print "It works!" ;
3. Modify the .bat file that is provided
Today’s exercise
Create an application in Perl which can read a machine-readable version of Shelley’s Collected Poems (the file is provided) and which can print all lines that contain a given keyword (suggestions: “fire”, “rain”, “moon”, “storm”, “time”).
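Variables
□ Scalar variables are always preceded by a dollar sign:
   $keyword
□ Variables can be assigned a value with a specific data type (‘string’ or ‘number’). Because the programs start with “use strict”, declare each variable with “my” the first time it is used:
   my $keyword = "time" ;
   my $number = 10 ;
□ Three types of variables: scalar ($), array (@), hash (%)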
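The three types of variables can be sketched as follows. This is a minimal illustration only: the names and the counts in the hash are invented for the example and do not come from the Shelley file.

```perl
use strict ;
use warnings ;

# Scalar: holds a single value (a string or a number).
my $keyword = "time" ;
my $number  = 10 ;

# Array: an ordered list of values, addressed by position (counting from 0).
my @keywords = ( "fire", "rain", "moon" ) ;

# Hash: pairs of keys and values, addressed by key.
# The counts are invented for illustration.
my %count = ( "fire" => 3, "rain" => 5 ) ;

print "Scalars: $keyword and $number\n" ;
print "Second element of the array: $keywords[1]\n" ;
print "Value stored under the key rain: $count{rain}\n" ;
```

Note that a single element of an array or a hash is itself a scalar, which is why it is written with a dollar sign.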
Strings
□ Can be created with single quotes and with double quotes
□ In the case of double quotes, the contents of the string will be interpreted.
□ For instance, you can then use “escape characters” in your string:
   "\n"  new line
   "\t"  tab
   "\a"  alarm bell
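The difference between the two kinds of quotes can be seen in a small sketch like the one below. The variable names are made up for the example.

```perl
use strict ;
use warnings ;

my $keyword = "rain" ;

# Double quotes: the variable and the escape characters are interpreted.
my $double = "Keyword:\t$keyword\n" ;

# Single quotes: the contents are kept exactly as typed.
my $single = 'Keyword:\t$keyword\n' ;

print $double ;          # the word Keyword:, a tab, the word rain, a new line
print $single . "\n" ;   # the characters \t, $keyword and \n appear literally
```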
Statements
□ Perl statements can be compared to sentences.
□ Perl statements end in a semicolon!
   print "Now this makes a statement!" ;
Exercise
Print a string that looks as follows:
   This is the first line.
   This is the second line.
   This line contains a        tab.
Also try to use the "\a" escape character in your string.
Reading a file
This is done as follows:
   open ( IN, "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;
   while ( <IN> ) {
      # each line that is read is placed in $_
      print $_ ;
   }
   close ( IN ) ;
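A slightly more defensive variant of the same loop is sketched below. It is not the version from the slide: it uses a filehandle stored in a variable and a named loop variable instead of $_. So that the sketch runs without the course file, it reads from a short string in memory (an assumption made for this example); to read the real file, put "shelley.txt" in place of \$sample.

```perl
use strict ;
use warnings ;

# Stand-in for the contents of "shelley.txt": two lines quoted
# earlier in these slides.
my $sample = "I trod on grass made green by summer's rain,\n"
           . "And suddenly my brain became as sand\n" ;

# Opening a reference to a string reads from memory instead of from disk.
open( my $in, "<", \$sample ) or die "Cannot open the text: $!" ;

my $lines_read = 0 ;
while ( my $line = <$in> ) {
    print $line ;
    $lines_read++ ;
}
close( $in ) ;
```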
Exercise
Create a Perl application which can read the text file “shelley.txt” and which can print all the lines.
Control keywords
   if ( <condition> ) {
      <first block of code>
   } elsif ( <second condition> ) {
      <second block of code>
   } else {
      <last block of code ; default option>
   }
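A filled-in version of this pattern might look like the sketch below. The thresholds and labels are arbitrary and serve only to show which block is executed.

```perl
use strict ;
use warnings ;

my $number = 10 ;
my $message ;

if ( $number < 5 ) {
    $message = "small" ;
} elsif ( $number < 20 ) {
    # Tested only because the first condition was false.
    $message = "medium" ;
} else {
    # Default option: runs only if no condition above was true.
    $message = "large" ;
}

print "$number is $message\n" ;   # prints: 10 is medium
```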
Regular expressions
□ The pattern is given within two forward slashes
□ Use the =~ operator to test whether a given string contains a match for the pattern
□ Example:
   $keyword =~ /rain/
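The sketch below applies the =~ operator to three lines quoted earlier in these slides. It also shows a side effect of plain pattern matching: the pattern is found anywhere in the string, so “brain” counts as a match for /rain/. The variable names are chosen for this example.

```perl
use strict ;
use warnings ;

my $line1 = "Through the fast-falling rain and high-wrought sea" ;
my $line2 = "And suddenly my brain became as sand" ;
my $line3 = "The outworn creeds again believed," ;

my $hits = 0 ;

# Each test is true when the letters r-a-i-n occur anywhere in the line.
if ( $line1 =~ /rain/ ) { $hits++ ; }
if ( $line2 =~ /rain/ ) { $hits++ ; }   # true: "brain" contains "rain"
if ( $line3 =~ /rain/ ) { $hits++ ; }   # false: no "rain" in this line

print "Number of matching lines: $hits\n" ;   # prints: Number of matching lines: 2
```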
Exercise
You should now be able to do the exercise that was discussed earlier.
Regular expressions (2)
□ If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner.
□ \b can be used in regular expressions to represent word boundaries:
   if ( $keyword =~ /\btime\b/i ) { }
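The effect of “i” and \b can be tried out with a sketch like the following. The lines are taken from the examples earlier in these slides and stand in for the file; the array, the foreach loop and the variable names are assumptions made for this illustration.

```perl
use strict ;
use warnings ;

# Sample lines standing in for the contents of the file.
my @lines = (
    "Through the fast-falling rain and high-wrought sea",
    "And suddenly my brain became as sand",
    "Rain. rain. Rain!",
) ;

my $keyword = "rain" ;
my @matches ;

foreach my $line ( @lines ) {
    # \b ... \b : match the keyword only as a whole word
    # i         : ignore the difference between upper and lower case
    if ( $line =~ /\b$keyword\b/i ) {
        push( @matches, $line ) ;
    }
}

# Prints the first and the third line; "brain" is no longer a match.
foreach my $match ( @matches ) {
    print "$match\n" ;
}
```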
Additional exercises
□ Create a program that can count the total number of lines in the file “shelley.txt”
□ Create a program that can calculate the length of each line, using the length() function:
   length( $line ) ;
□ Calculate the average line length (in characters) for the entire file.
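As a starting point, the sketch below shows how length() and a counter can be combined. It is one possible approach, not a model answer. It reads two sample lines from memory rather than from “shelley.txt” (an assumption so that the sketch runs on its own), and it uses chomp() to remove the new line character before measuring each line.

```perl
use strict ;
use warnings ;

# Stand-in for "shelley.txt": two lines quoted earlier in these slides.
my $sample = "And suddenly my brain became as sand\n"
           . "The pleasure of believing what we see\n" ;

open( my $in, "<", \$sample ) or die "Cannot open the sample text: $!" ;

my $line_count  = 0 ;
my $total_chars = 0 ;

while ( my $line = <$in> ) {
    chomp( $line ) ;                    # remove the trailing new line
    $line_count++ ;
    $total_chars += length( $line ) ;   # length of this line in characters
}
close( $in ) ;

# Avoid dividing by zero when the text is empty.
if ( $line_count > 0 ) {
    my $average = $total_chars / $line_count ;
    print "Lines: $line_count ; average length: $average\n" ;
}
```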