Digital Text and Data Processing


Digital Text and Data Processing Week 2

Computation and literary studies
Statistical analysis as a “blunt hermeneutic instrument” (Trumpener)
“Does the digital component of digital humanities give us new ways to think, or only ways to illustrate what we already know?” (Kirsch)
“It’s like hitting a painting with a fish – why would you?” (Kennedy)

Algorithmic Criticism
There is a need for a “criticism derived from algorithmic manipulation of text” (Ramsay)
Digital methods ought to “assist the critic in the unfolding of interpretative possibilities”
Cf. “Literary informatics” (Martin Mueller)

“Secondary query potential” of digital text (Mueller)
From the “conduit model” to “transformation” and to “object manipulation” (Bradley)
“Performative” and “deformative criticism” (McGann)

Moretti dismisses close reading as a “theological exercise” and as a “very solemn treatment of very few texts taken very seriously”
Literary research as “a patchwork of other people’s research, without a single direct textual reading”
Chronological and geographical developments in “devices, themes, tropes — or genres and systems”
Literary research which uses the analogy of science
A method resting “solidly on facts”
Concepts and visual models from the natural sciences

From “The Slaughterhouse of Literature”

Effects on the research agenda
Martin Mueller:
“The underlying methods (…) are probabilistic and in many ways more compatible with a spirit of tentative inquiry”
“Is it an instance of the old joke about the drunk who is looking for his lost car key under a lamp post because that is where the light is?”
Digital methods are concerned more with “establishing the ‘fact that’ than with explaining the ‘reason why’”
Shawna Ross:
Digital humanities needs to focus on “the conditional and the subjunctive, rather than inside absolutes and interdictions”

Source Criticism
Which edition was digitised, precisely?
Does this edition have authority, or any historical importance?
Which organisation has digitised the text?
Does this organisation have sufficient expertise in digitisation projects?
Which measures have been taken to avoid errors?
Has the digitised text been appraised or checked?
Did the digitisation process introduce changes to the text? If yes, has this editorial process been documented accurately?
Which organisation has published these sources?
Are you allowed to perform text mining on these sources?

IPR and licences
Possibilities to mine recent texts depend on Intellectual Property Rights (IPR) and on licence agreements with publishers
The National Library assumes that texts published before 1873 (2x70 years) are in the public domain
Texts from the period between 1873 and 1940 can be made available because of agreements with organisations such as LIRA and Pictoright

Study commissioned by the EC, led by Prof. Ian Hargreaves
The right to read does not imply the right to mine

The Hague Declaration
“A lack of clarity around the legality of TDM is inhibiting TDM-based research in Europe”
“The solutions offered by publishers are insufficient to meet the needs of researchers and are placing European researchers at a disadvantage”
“The introduction of a mandatory copyright exception to allow anyone to use computers to analyse anything to which they have legal access is essential”

Regular expressions
Components of text patterns:
Character classes, e.g. \w , \d or .
Quantifiers, e.g. {2,4} or ? , + , *
Anchors, e.g. \b , ^ , $
Patterns need to be written between forward slashes

/\bthe (\w+ ){0,2}light\b/
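
A minimal sketch of how this pattern behaves. The test phrases and variable names are made up for illustration; only the pattern itself comes from the slide.

```perl
use strict;
use warnings;

# Hypothetical test phrases; only the pattern is taken from the slide.
my @phrases = (
    "the light",                    # zero words in between
    "the green light",              # one word in between
    "the single green light",       # two words in between
    "the one single green light",   # three words in between: too many
);

my @matched;
foreach my $p (@phrases) {
    # \bthe      the word "the" at a word boundary, followed by a space
    # (\w+ )     one word followed by a space ...
    # {0,2}      ... repeated zero, one or two times
    # light\b    the word "light", ending at a word boundary
    if ( $p =~ /\bthe (\w+ ){0,2}light\b/ ) {
        push @matched, $p;
    }
}

print "$_\n" foreach @matched;
```

The first three phrases match; the fourth does not, because the quantifier {0,2} allows at most two words between "the" and "light".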

Match variables
Parentheses create substrings within a regular expression
Perl stores the text matched by the first pair of parentheses in the variable $1
Example:
    $keyword = "well-known" ;
    if ( $keyword =~ /(\w+)-\w+/ ) {
        print $1 ;   # This will print "well"
    }

Three types of variables
Scalars: a single value; start with $
Arrays: multiple values; start with @
Hashes: multiple values which can be referenced with ‘keys’; start with %
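
As a small sketch of the three types side by side (the names and values below are illustrative assumptions, not part of the course material):

```perl
use strict;
use warnings;

# Illustrative values only.
my $title  = "Heart of Darkness";              # scalar: one value
my @words  = ( "the", "horror", "the" );       # array: an ordered list of values
my %counts = ( "the" => 2, "horror" => 1 );    # hash: values looked up by key

print $title . "\n";
print scalar(@words) . "\n";    # number of elements in the array
print $words[1] . "\n";         # second element of the array
print $counts{"the"} . "\n";    # value stored under the key "the"
```

Note that a single element is written with $ (as in $words[1] or $counts{"the"}), because one element of an array or hash is itself a scalar.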

@potter = (
    "The Philosopher's Stone",
    "The Chamber of Secrets",
    "The Prisoner of Azkaban",
    "The Goblet of Fire",
    "The Order of the Phoenix",
    "The Half-Blood Prince",
    "The Deathly Hallows"
) ;
$potter[0]    # The Philosopher's Stone
$potter[4]    # The Order of the Phoenix
$potter[-1]   # The Deathly Hallows

Looping through an array
    foreach my $book ( @potter ) {
        print $book ;
    }

A hash
Can be thought of as an array in which you specify the keys yourself
    my %capitals = (
        "Italy"   => "Rome",
        "Belgium" => "Brussels"
    ) ;
    print $capitals{"Italy"} ;   ## Rome

key        value
Belgium    Brussels
Italy      Rome
France     Paris
…

Looping through a hash
    foreach my $c ( keys %capitals ) {
        print $c . ': ' . $capitals{$c} ;
    }

Sorting a hash
    foreach my $f ( sort keys %hash ) {
        print $f ;
    }
Sorting, by default, is done alphabetically, by key, in ascending order

Ways of sorting
Numerically by key: sort { $a <=> $b }
Numerically by value: sort { $hash{$a} <=> $hash{$b} }
Alphabetically by value: sort { $hash{$a} cmp $hash{$b} }
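
A short sketch of these sort orders applied to one hash. The hash %freq and its counts are made up for illustration.

```perl
use strict;
use warnings;

# Made-up word counts for illustration.
my %freq = ( "music" => 3, "love" => 7, "food" => 1, "play" => 5 );

# Alphabetically by key (the default behaviour of sort)
my @by_key = sort keys %freq;

# Numerically by value, ascending
my @by_value = sort { $freq{$a} <=> $freq{$b} } keys %freq;

# Numerically by value, descending: swap $a and $b in the comparison
my @by_value_desc = sort { $freq{$b} <=> $freq{$a} } keys %freq;

print join( " ", @by_key ) . "\n";           # food love music play
print join( " ", @by_value ) . "\n";         # food music play love
print join( " ", @by_value_desc ) . "\n";    # love play music food
```

In every case sort receives the list of keys; the block only decides what is compared, either the keys themselves ($a, $b) or the values stored under them ($freq{$a}, $freq{$b}).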

Exercises 13 and 14

Finding words
    $line = "If music be the food of love, play on" ;
    @array = split( /\s/ , $line ) ;
    # $array[0] contains "If"
    # $array[4] contains "food"

Tokenisation
    @words = split( /\s+/ , $line ) ;
    foreach my $w ( @words ) {
        print $w ;
    }
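
The difference between splitting on /\s/ and on /\s+/ only shows up when the text contains runs of whitespace. A minimal sketch, assuming a hypothetical input line with a double space after the comma:

```perl
use strict;
use warnings;

# Hypothetical line with two spaces after the comma.
my $line = "If music be the food of love,  play on";

my @single = split( /\s/,  $line );    # splits at every single whitespace character
my @multi  = split( /\s+/, $line );    # treats a run of whitespace as one separator

print scalar(@single) . "\n";    # 10: includes one empty string between the two spaces
print scalar(@multi) . "\n";     # 9
```

With /\s/ the double space produces an empty token, which would later be counted as a "word"; /\s+/ avoids this.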

Frequency lists
Frequency list for Heart of Darkness produced using TaporWare

Creating a hash
    my %freq ;
Assigning / updating a value
    $freq{"if"}++ ;
    $freq{"music"}++ ;
    print $freq{"if"} . "\n" ;

N.B. $a = $a + 1 ; is the same as $a++ ;

Calculation of frequencies
    my %freq ;
    @words = split( /\s+/ , $line ) ;
    foreach my $w ( @words ) {
        $freq{$w}++ ;
    }
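
Combining the frequency count with sorting by value gives a simple frequency list. This is a sketch under stated assumptions: the input line and the tie-breaking rule (alphabetical order for equal counts) are illustrative choices, not taken from the slides.

```perl
use strict;
use warnings;

# Illustrative input; any line of text would do.
my $line = "the cat and the dog and the bird";

my %freq;
foreach my $w ( split( /\s+/, $line ) ) {
    $freq{$w}++;    # creates the key on first sight, then increments it
}

# Most frequent first; words with equal counts are ordered alphabetically.
my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;

foreach my $w (@ranked) {
    print $w . "\t" . $freq{$w} . "\n";
}
```

For this input the list starts with "the" (3) and "and" (2), followed by "bird", "cat" and "dog" (1 each).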

But she returned to the writing-table, observing, as she passed her son, "Still page 322?" Freddy snorted, and turned over two leaves. For a brief space they were silent. Close by, beyond the curtains, the gentle murmur of a long conversation had never ceased.

Actually a “word”?
    foreach my $w ( @words ) {
        if ( $w =~ /(\w+)/ ) {
            $freq{ $1 }++ ;
        }
    }
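
A minimal sketch of this check applied to one sentence from the passage above, assuming the capture is meant to take a whole run of word characters (\w+). The lower-casing with lc is an added assumption, so that "Freddy" and "freddy" would be counted together.

```perl
use strict;
use warnings;

# One sentence from the passage quoted above.
my $line = 'Freddy snorted, and turned over two leaves.';

my %freq;
foreach my $w ( split( /\s+/, $line ) ) {
    # Keep the first run of word characters, so "snorted," is counted as "snorted".
    if ( $w =~ /(\w+)/ ) {
        $freq{ lc $1 }++;    # lc is an assumption: fold case before counting
    }
}

print join( " ", sort keys %freq ) . "\n";
```

This simple pattern has limits: a hyphenated token such as "writing-table" is reduced to "writing", because the hyphen is not a word character. Whether that is acceptable depends on what should count as a word in the analysis.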