Digital Text and Data Processing

Digital Text and Data Processing Week 1

Course background
- Future of reading
- Understanding “Machine reading”:
  - Text analysis tools
  - Visualisation tools
- Differences between machine reading and human reading
Images taken from textarc.org and from Google App store, Javelin for Android

Scale

Text Mining
- “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)
- Related to data mining
- Information is found “not among formalised database records, but in the unstructured textual data” (2)
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51
(2) Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1

One thing was certain, that the WHITE kitten had had nothing to do with it:--it was the black kitten's fault entirely. For the white kitten had been having its face washed by the old cat for the last quarter of an hour (and bearing it pretty well, considering).
[..]
And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, 'Do cats eat bats? Do cats eat bats?'
In a Wonderland they lie,
Dreaming as the days go by,
Dreaming as the summers die:
Ever drifting down the stream,
Lingering in the golden gleam.
Life, what is it but a dream?
[..]
This piece of rudeness was more than Alice could bear: she got up in great disgust, and walked off; the Dormouse fell asleep instantly.

Difficulties of natural language
- Semantic categories are generally implicit
- Inflections: conjugations and declensions
- Homonyms and synonyms: the meanings of polysemous words are context-specific
- Spelling changes over time or may vary across regions

I trod on grass made green by summer's rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart refrain
were found where the rainbow quenches its points upon the earth

Rain   rain   rains   rain’s   Rain’s   Rain.   rain.   Rain!   ‘rain’

Two stages in text mining
- Data creation
- Data analysis

Weekly Programme
Cluster 1: Data creation
- W1: Introduction to the course and introduction to the Perl programming language
- W2: Regular expressions, word segmentation, frequency lists, types and tokens
- W3: Natural language processing: Part of Speech tagging, lemmatisation
- W4: Sentence segmentation, complexity metrics

Weekly Programme
Cluster 2: Data analysis
- W5: Introduction to the R package
- W6: ggplot: creating visualisations
- W7: Topic Modelling, Multivariate analysis: Principal Component Analysis, Clustering techniques
- W8: Geographic information

Individual Research project
- Techniques taught in DTDP generally enable you to study formal differences and similarities between texts, e.g. vocabulary, sentence length, grammatical structure
- Create a corpus consisting of ten different texts at a minimum, all of ca. 5000 words or more; you can copy texts from existing corpora
- You can apply the techniques which are explained in this class to your own corpus
- Formulate your own research question

Course evaluation
- Final essay (ca. 4,000 words)
  - Report of your individual research project (50%)
  - Critical reflection on digital humanities research (50%)
    - How does this type of research relate to traditional scholarship?
    - Is programming a legitimate scholarly activity in the humanities?
    - Can visualisations of texts function as independent scholarly resources?
- Five “Coding Challenges” which need to be marked as sufficient

Course syllabus
Can be found at www.bookandbyte.org/DTDP
- Weekly programme (including the homework for each week)
- Course organisation
- Exercises and coding challenges
- Links to software tools and tutorials
- Bibliography
- List of text corpora
- Practical organisation of classes

Getting started
- Download and install Perl
- Create a working directory on your computer
- Open a code editor and type the following line:
    print "It works!" ;
- Save the file, with extension .pl
- Use the .bat file that is provided.
- In the command prompt, type in: “[name of file].pl”
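A minimal sketch of what the first script could look like as a complete file. The lines `use strict;` and `use warnings;` and the keyword `my` are not part of the slides; they are added here as a common safety habit and can be left out. The file name `hello.pl` used below is only an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Store the text in a variable first, so it can be reused later.
my $message = "It works!";

# The \n ends the line, so the command prompt starts on a fresh line afterwards.
print "$message\n";
```

If Perl is installed and the file is saved as `hello.pl`, typing `perl hello.pl` in the command prompt from the working directory should print the message.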

Variables
- Scalar variables are always preceded by a dollar sign: $keyword
- Variables can be assigned a value with a specific data type (‘string’ or ‘number’):
    $keyword = "time" ;
    $number = 10 ;
- Three types of variables: scalar ($), array (@), hash (%)
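The three types of variables can be sketched side by side as follows. The keywords and counts are made-up values for illustration, and `my`, `use strict` and `use warnings` are additions that the slides do not cover.

```perl
use strict;
use warnings;

# Scalar: holds a single value (a string or a number).
my $keyword = "time";
my $number  = 10;

# Array: an ordered list of values, written with @.
my @keywords = ("fire", "rain", "moon");

# Hash: a set of key/value pairs, written with %.
my %count = ("rain" => 3, "moon" => 1);

# A single element is itself a scalar, so it is accessed with $.
print "$keyword $number\n";
print "Second keyword: $keywords[1]\n";
print "Count for rain: $count{rain}\n";
```

Array positions start at 0, which is why `$keywords[1]` refers to the second item in the list.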

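Strings
- Can be created with single quotes and with double quotes
- In the case of double quotes, the contents of the string will be interpreted: variables are replaced by their values and “escape characters” are converted. In the case of single quotes, the contents are taken literally.
- You can use “escape characters” in a double-quoted string to add basic formatting:
    "\n" new line
    "\t" tab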
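The difference between the two kinds of quotes may be easiest to see when the same text is written both ways. This is a small sketch; the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $keyword = "rain";

# Double quotes: the variable and the escape characters are interpreted.
my $double = "Keyword:\t$keyword\n";

# Single quotes: the text is kept literally, character for character.
my $single = 'Keyword:\t$keyword\n';

print $double;
print $single, "\n";
```

The first print shows the word "rain" after a tab; the second shows the backslashes and the dollar sign exactly as they were typed.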

Statements
- Perl statements can be compared to sentences.
- Perl statements end in a semicolon!
    print "This is a statement!" ;

Writing to file
Done as follows:
    open( OUT , ">out.txt" ) ;
    print OUT "Text in out.txt" ;
    close( OUT ) ;
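A slightly more defensive variant of the same steps is sketched below. It is not the form used in the slides: it assumes the three-argument form of `open` with a filehandle stored in a variable, and it stops with an error message if the file cannot be opened.

```perl
use strict;
use warnings;

my $file = "out.txt";

# Three-argument open: the mode ">" (write, replacing any old contents)
# is given separately from the file name.
# "or die" stops the program and reports the system error in $! on failure.
open(my $out, ">", $file) or die "Cannot open $file: $!";

print $out "Text in out.txt\n";

close($out) or die "Cannot close $file: $!";
```

Without the check, a failed `open` goes unnoticed and the later `print` silently writes nothing, which can be hard to track down.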

Exercise
Create a new file named “data.txt” and print the following lines to it:
    This is the first line.
    This is the second line.
    This line contains a tab.

Operators
- Assignment
    =   Assignment, e.g. $a = 5 ;
- Arithmetic operators
    +   Addition
    -   Subtraction
    *   Multiplication
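The operators can be tried out with a short sketch such as the one below. The values are arbitrary. The names `$x` and `$y` are used instead of `$a` and `$b` because Perl's `sort` function gives those two names a special role; that choice is a precaution, not something the slides require.

```perl
use strict;
use warnings;

my $x = 12;
my $y = 5;

# Each result is stored in its own variable before printing.
my $sum        = $x + $y;
my $difference = $x - $y;
my $product    = $x * $y;

print "$x + $y = $sum\n";
print "$x - $y = $difference\n";
print "$x * $y = $product\n";
```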

Exercise Create two variables, and assign a numerical value to both of them Print their sum, their difference and their product.

Reading a file
Use the following code:
    open ( IN , "shelley.txt" ) ;
    while ( <IN> ) {
        print $_ ;
    }
    close ( IN ) ;
Each pass through the while loop reads one line of the file into the special variable $_ .
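The reading loop can also be written with a named variable for each line, which some people find easier to follow than `$_`. The sketch below first creates a small file called `sample.txt` (a hypothetical name and contents, used so that the example runs on its own) and then reads it back.

```perl
use strict;
use warnings;

my $file = "sample.txt";

# Create a small sample file first, so that the example is self-contained.
open(my $out, ">", $file) or die "Cannot write $file: $!";
print $out "first line\nsecond line\n";
close($out);

# Open the file for reading and walk through it line by line.
open(my $in, "<", $file) or die "Cannot read $file: $!";
my $lines_read = 0;
while (my $line = <$in>) {
    print $line;        # each line still ends in its newline character
    $lines_read++;
}
close($in);
```

The loop ends by itself when there are no more lines to read.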

Exercise
Create a Perl application which can read the text file “shelley.txt” and which copies all the lines to a new file.

Control keywords
    if ( <condition> ) {
        <first block of code>
    } elsif ( <condition> ) {
        <second block of code>
    } else {
        <last block of code ; default option>
    }
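Filled in with concrete conditions, the template might look like the sketch below. The number and the labels are made up for illustration; `==` and `<` are Perl's numeric comparison operators.

```perl
use strict;
use warnings;

my $number = 10;
my $label;

# Only the first block whose condition is true gets executed.
if ($number < 0) {
    $label = "negative";
} elsif ($number == 0) {
    $label = "zero";
} else {
    $label = "positive";    # default option when no condition was true
}

print "$number is $label\n";
```

Note that the comparison uses `==`, while a single `=` assigns a value.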

Regular expressions
- A pattern which represents a particular sequence of characters
- The pattern is given within two forward slashes
- Use the =~ operator to test whether a given string contains the pattern.
Example:
    $keyword =~ /rain/
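Combined with an `if` statement, the match operator can be used as in the following sketch. The two strings are taken from the sample texts earlier in the slides; the counter `$hits` is an illustrative addition.

```perl
use strict;
use warnings;

my $first  = "I trod on grass made green by summer's rain,";
my $second = "Life, what is it but a dream?";
my $hits   = 0;

# The condition is true when the string contains the characters r-a-i-n.
if ($first =~ /rain/) {
    print "Match: $first\n";
    $hits++;
}

if ($second =~ /rain/) {
    print "Match: $second\n";
    $hits++;
} else {
    print "No match: $second\n";
}
```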

Exercise
Create an application in Perl which can read a machine-readable version of Shelley’s Collected Poems (file is provided) and which can print all lines that contain a given keyword. Suggestions: “fire”, “rain”, “moon”, “storm”, “time”.

Regular expressions (2)
- If you place “i” directly after the second forward slash, the comparison will be case-insensitive.
- \b can be used in regular expressions to represent word boundaries:
    if ( $keyword =~ /\btime\b/i ) { }
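The effect of `\b` and `i` can be sketched with words that all contain the letters "rain", similar to the sample lines shown earlier. The word list and the array names are assumptions made for this illustration, and `foreach` and `push` are not covered in the slides.

```perl
use strict;
use warnings;

my @words = ("rain", "Rain", "strain", "brain", "refrain", "rainbow");

my @substring;   # words matched by the plain pattern
my @whole;       # words matched with word boundaries, ignoring case

foreach my $word (@words) {
    # Plain pattern: matches anywhere inside the word, case-sensitive.
    if ($word =~ /rain/) {
        push(@substring, $word);
    }
    # \b on both sides: only the complete word; i: upper or lower case.
    if ($word =~ /\brain\b/i) {
        push(@whole, $word);
    }
}

print "Plain pattern: @substring\n";
print "Whole word:    @whole\n";
```

The plain pattern also accepts "strain", "brain", "refrain" and "rainbow" but misses "Rain", whereas the second pattern keeps only "rain" and "Rain".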

Coding challenge
- Create a program that can count the total number of lines in the file “shelley.txt”
- Create a program that can calculate the length of each line, using the length() function:
    length( $line ) ;
- Calculate the average line length (in characters) for the entire file.
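As a starting point, the counting and averaging can be sketched on a few lines held in an array rather than on the file itself. The three sample lines are made up, and removing the newline with `chomp` before measuring is an assumption: the challenge does not say whether the line ending should count as a character.

```perl
use strict;
use warnings;

# Stand-in for lines read from a file; each ends in a newline.
my @lines = ("Rain\n", "The moon\n", "Fire and ice\n");

my $count = 0;   # number of lines seen so far
my $total = 0;   # sum of all line lengths

foreach my $line (@lines) {
    chomp($line);                 # remove the trailing newline before measuring
    $total += length($line);
    $count++;
}

my $average = $total / $count;
print "Lines: $count\n";
print "Average length: $average\n";
```

For the real challenge the same counters could be updated inside the `while` loop that reads “shelley.txt”. An empty file would make `$count` zero, so the division would need a check in that case.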