Published by Christiana Allison. Modified over 9 years ago.
1
Digital Text and Data Processing Week 1
2
Course background
□ Future of reading?
□ Understanding “machine reading”:
  □ Text analysis tools
  □ Visualisation tools
□ Differences between machine reading and human reading
Images taken from textarc.org and from the Google App store (Javelin for Android)
3
Scale
4
Text Mining
□ “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)
□ Information is found “not among formalised database records, but in the unstructured textual data” (2)
□ Related to data mining
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51
(2) Feldman, Ronen, and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1
5
Difficulties of natural language
□ Information is often implicit
□ Homonyms and synonyms
□ Computers do not have access to the meaning of the text
□ Spelling changes over time and may vary according to region
6
I trod on grass made green by summer's rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart refrain
were found where the rainbow quenches its points upon the earth
Rain rain rains rain’s Rain’s Rain. rain. Rain! ‘rain’
7
The outworn creeds again believed,
Hatred, despair, and fear and vain belief
Because I am a Priest do you believe
imagine, while asserting what it believes to be true …
The pleasure of believing what we see
long-believing courage, and the systematic efforts of generations of
8
Two stages in text mining
□ Data creation
□ Data analysis
9
Weekly Programme
Cluster 1: Data creation
□ W1: Introduction to the course and introduction to the Perl programming language
□ W2: Regular expressions, word segmentation, frequency lists, types and tokens
□ W3: Natural language processing: Part of Speech tagging, lemmatisation
□ W4: Exploration of existing text mining tools
10
Weekly Programme
Cluster 2: Data analysis
□ W5: Introduction to the R package
□ W6: Multivariate analysis: Principal Component Analysis, Clustering techniques
□ W7: Visualisation
□ W8: Conclusion: What type of knowledge can we create?
11
Course evaluation
□ 5 assignments (2 points to be earned for each)
□ Final essay (ca. 3,000 words)
  □ Report of your individual research project
  □ Critical reflection on the merits of text mining:
    □ What sort of knowledge can be produced?
    □ How does this type of research relate to traditional scholarship?
    □ Main obstacles or challenges?
    □ Is the creation of a text analysis tool a legitimate scholarly activity in the humanities?
12
Introduction to programming
□ Programming languages: used to give instructions to a computer
□ There is a gap between human language and machine language
□ Digital information is information represented as combinations of 1s and 0s, e.g. in ASCII: a = 01100001
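The mapping between characters and bit patterns can be checked directly in Perl. The following is a minimal sketch, assuming ASCII-range characters; the helper name to_bits is hypothetical and only used for illustration. In ASCII, 01100001 is the code for the lowercase letter a, while the uppercase A has a different pattern.

```perl
use strict ;
use warnings ;

# Hypothetical helper: ord() returns the numeric code of a character,
# and sprintf with "%08b" formats that number as eight binary digits.
sub to_bits {
    my ( $char ) = @_ ;
    return sprintf( "%08b", ord( $char ) ) ;
}

print "a = ", to_bits( "a" ), "\n" ;   # a = 01100001
print "A = ", to_bits( "A" ), "\n" ;   # A = 01000001
```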
13
□ Low-level programming languages: Assembler, e.g. ADD X1 Y1
□ Higher-level programming languages: translated by a compiler or an interpreter
Diagram: Human programmer → Language processor → Computer
Programming language (e.g. Perl) → Machine language (e.g. 0101100101010)
14
The Perl programming language
□ Open source
□ Developed by the linguist Larry Wall
□ Easy to learn; code is often easy to read
□ Developed specifically for text processing
15
Getting started
1. Create a working directory on your computer
2. Open a code editor and type the following lines:
   use strict ;
   use warnings ;
   print "It works!" ;
3. Modify the .bat file that is provided
16
Today’s exercise
Create an application in Perl which can read a machine-readable version of Shelley’s Collected Poems (file is provided) and which can print all lines that contain a given keyword (suggestions: “fire”, “rain”, “moon”, “storm”, “time”).
17
Variables
□ Scalar variables are always preceded by a dollar sign: $keyword
□ Variables can be assigned a value with a specific data type (‘string’ or ‘number’):
   my $keyword = "time" ;
   my $number = 10 ;
□ Three types of variables: scalar ($), array (@), hash (%)
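As a rough illustration of the three variable types, the sketch below declares one of each. The variable names and the counts in the hash are invented for the example and are not taken from the course material.

```perl
use strict ;
use warnings ;

my $keyword  = "time" ;                      # scalar: a single value
my @keywords = ( "fire", "rain", "moon" ) ;  # array: an ordered list of values
my %count    = ( rain => 3, moon => 1 ) ;    # hash: keys mapped to values (invented counts)

print "Keyword: $keyword\n" ;
print "Second keyword in the list: $keywords[1]\n" ;   # array indices start at 0
print "Count stored for rain: $count{rain}\n" ;
```

Note that a single element of an array or hash is itself a scalar, which is why it is written with a dollar sign.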
18
Strings
□ Can be created with single quotes and with double quotes
□ In the case of double quotes, the contents of the string will be interpreted: variables and escape characters are replaced by their values.
□ For instance, you can then use “escape characters” in your string:
   "\n" new line
   "\t" tab
   "\a" alarm bell
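The difference between the two kinds of quotes is easiest to see side by side. This is a small sketch with an illustrative variable name; the single-quoted string is printed exactly as typed, while the double-quoted string has its escape character and variable replaced.

```perl
use strict ;
use warnings ;

my $keyword = "rain" ;

my $single = 'Keyword:\t$keyword' ;   # kept literally: backslash, t and $keyword
my $double = "Keyword:\t$keyword" ;   # \t becomes a tab, $keyword becomes rain

print $single, "\n" ;
print $double, "\n" ;
```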
19
Statements
□ Perl statements can be compared to sentences.
□ Perl statements end in a semi-colon!
   print "Now this makes a statement!" ;
20
Exercise
Print a string that looks as follows:
   This is the first line.
   This is the second line.
   This line contains a	tab.
Also try to use the "\a" escape character in your string.
21
Reading a file
Is done as follows:
   open ( IN, "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;
   while ( <IN> ) {
      print $_ ;
   }
   close ( IN ) ;
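To try the read loop without the course file, the sketch below reads from a string held in memory instead. This is an assumption made for illustration only: the two lines stand in for the contents of a text file, and the lexical filehandle $in is a common alternative to the bareword IN, not necessarily the form used in the course.

```perl
use strict ;
use warnings ;

# Stand-in for the contents of a text file (two lines quoted earlier in the slides).
my $text = "I trod on grass made green by summer's rain,\n"
         . "And suddenly my brain became as sand\n" ;

# Opening a reference to a string gives a filehandle that reads from memory.
open( my $in, "<", \$text ) or die "Cannot open in-memory file: $!" ;

my $lines_read = 0 ;
while ( my $line = <$in> ) {
    print $line ;        # each line still ends in its own newline
    $lines_read++ ;
}
close( $in ) ;

print "Lines read: $lines_read\n" ;
```

For a real file, the same loop applies once the string reference is replaced by the file name.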
22
Exercise
Create a Perl application which can read the text file “shelley.txt” and which can print all the lines.
23
Control keywords
   if ( <condition> ) {
      <first block of code>
   } elsif ( <second condition> ) {
      <second block of code>
   } else {
      <last block of code ; default option>
   }
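A concrete instance of this pattern might look as follows. The helper describe_length and the threshold of 40 characters are hypothetical, chosen only to give each branch something to do.

```perl
use strict ;
use warnings ;

# Hypothetical helper: classify a line by its length in characters.
sub describe_length {
    my ( $line ) = @_ ;
    my $length = length( $line ) ;
    if ( $length == 0 ) {
        return "empty" ;
    } elsif ( $length < 40 ) {
        return "short" ;
    } else {
        return "long" ;      # default option when no condition matched
    }
}

print describe_length( "" ), "\n" ;
print describe_length( "rain" ), "\n" ;
print describe_length( "x" x 60 ), "\n" ;
```

The conditions are tested from top to bottom, and only the block of the first condition that holds is executed.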
24
Regular expressions
□ The pattern is given within two forward slashes
□ Use the =~ operator to test if a given string matches the regex.
□ Example: $keyword =~ /rain/
27
Exercise
You should now be able to do the exercise that was discussed earlier.
28
Regular expressions (2)
□ If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner.
□ \b can be used in regular expressions to represent word boundaries
   if ( $keyword =~ /\btime\b/i ) { }
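The effect of \b and the i flag can be seen by running both patterns over a few of the example lines shown earlier in these slides. This is a sketch under that assumption; the array names are illustrative.

```perl
use strict ;
use warnings ;

# Sample lines and tokens taken from the examples earlier in the slides.
my @lines = (
    "I trod on grass made green by summer's rain,",
    "'Tis like a wondrous strain that sweeps",
    "And suddenly my brain became as sand",
    "Rain!",
) ;

# Plain substring match: also finds "strain" and "brain", but misses "Rain!".
my @substring_hits = grep { /rain/ } @lines ;

# Whole word, case-insensitive: finds "rain," and "Rain!" only.
my @word_hits = grep { /\brain\b/i } @lines ;

print "Substring matches: ", scalar( @substring_hits ), "\n" ;   # 3
print "Whole-word matches: ", scalar( @word_hits ), "\n" ;       # 2
```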
29
Additional exercises
□ Create a program that can count the total number of lines in the file “shelley.txt”
□ Create a program that can calculate the length of each line, using the length() function:
   length( $line ) ;
□ Calculate the average line length (in characters) for the entire file.
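One possible shape for these calculations is sketched below on a small in-memory list rather than on the file itself, so the three strings are placeholders and not lines from “shelley.txt”. When the lines come from a file, each one still carries its trailing newline, which chomp() can remove before measuring.

```perl
use strict ;
use warnings ;

# Placeholder data standing in for the lines of a file.
my @lines = ( "rain", "storm", "moon" ) ;

my $line_count   = 0 ;
my $total_length = 0 ;
foreach my $line ( @lines ) {
    $line_count++ ;
    $total_length += length( $line ) ;   # length in characters
}

# Guard against division by zero for an empty input.
my $average = $line_count > 0 ? $total_length / $line_count : 0 ;

printf( "%d lines, average length %.2f characters\n", $line_count, $average ) ;
```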