Digital Text and Data Processing


Digital Text and Data Processing Week 5

Analyses of single texts
- Frequency lists, possibly filtered using a list of stopwords
- Distribution analysis
- Concordance (or Keyword in Context list)
- Collocation
N.B. Code is available in the DTDP file repository.

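Removing stopwords
Save the list as "stopwords.txt", then read it into an array:

    my @stopwords ;
    open ( ST , "stopwords.txt" ) or die "Can't read file!" ;
    while (<ST>) {
        chomp ;    # strip the newline, so that the words can later be joined into one pattern
        if ( $_ =~ /\w/ ) { push( @stopwords , $_ ) ; }
    }
    close( ST ) ;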

Change the array into a long string, using join():

    my $stopwords = join( "|" , @stopwords ) ;

Functions
- Functions are declared using the keyword "sub"
- They can take parameters as input (these can be read using shift)
- Functions return a value
- Once a function has been defined, it can be invoked using "&" followed by the name of the function

    sub myFunction($) {
        my $parameter = shift ;
        return $parameter ;
    }
    print &myFunction("parameter") ;

Add a function which can test if a word is contained within this long string:

    sub isstopword($) {
        my $text = shift ;
        if ( lc($text) =~ /\b($stopwords)\b/ ) {
            return 1 ;
        } else {
            return 0 ;
        }
    }

Next, this function can be used in the tokenisation program. "!" is a negation:

    if ( !( &isstopword($word) ) ) {
        $freq{ $word }++ ;
    }
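
The stopword steps above can be put together in one small sketch. It is written in Python rather than the Perl used on the slides, and the stopword list and sample sentence are invented for illustration; the slides read their list from "stopwords.txt". Unlike the Perl pattern match, this version requires the whole token to equal a stopword.

```python
import re
from collections import Counter

# Hypothetical stopword list (the slides load theirs from "stopwords.txt").
stopwords = ["the", "a", "of", "and", "is"]

# Join the words into one alternation pattern, as join("|", @stopwords) does.
# re.escape guards against stopwords that contain regex metacharacters.
stopword_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in stopwords) + r")\b"
)

def is_stopword(word):
    """Return True if the lowercased word is one of the stopwords."""
    return stopword_pattern.fullmatch(word.lower()) is not None

def frequencies(text):
    """Count every token that is not a stopword."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if not is_stopword(t))

freq = frequencies("The whale and the sea: the whale is a symbol of the sea.")
```

Here freq holds counts only for "whale", "sea" and "symbol"; the stopwords never reach the frequency list.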

Collocation
- Frequencies of the words that are used near a given search term
- In the file "collocation.pl", the variable $distance specifies the size of the "window": the context that will be considered
- It reads in texts in "paragraph mode":

    # Paragraph mode for while loop
    $/ = "" ;
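
A minimal sketch of the window idea, in Python rather than Perl. It is not the code from "collocation.pl": the function name, the sample text and the choice to treat the whole text as one unit (instead of reading paragraph by paragraph) are assumptions made for illustration.

```python
import re
from collections import Counter

def collocates(text, search_term, distance=3):
    """Count the words found within `distance` tokens of search_term."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == search_term:
            start = max(0, i - distance)
            # Tokens before and after the hit, excluding the hit itself
            window = tokens[start:i] + tokens[i + 1:i + 1 + distance]
            counts.update(window)
    return counts

sample = "The white whale swam on. Ahab hated the white whale."
near_whale = collocates(sample, "whale", distance=2)
```

A larger distance widens the window and so pulls in more, but less closely related, neighbouring words.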

Textual units
The special variable $/ defines the "textual units" that are read by the program:

    $/ = "\n" ;     # line by line (default option)
    $/ = "" ;       # paragraph mode
    $/ = undef ;    # full text (no segmentation)
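
The three kinds of unit can be illustrated on a small invented string. This Python sketch only approximates what Perl does when it reads a file with each setting of $/; in particular, the units here do not keep their trailing newlines.

```python
import re

text = "First line.\nSecond line.\n\nNew paragraph.\n"

# Line by line, comparable to $/ = "\n"
lines = text.splitlines()

# Paragraph mode, comparable to $/ = "": runs of blank lines separate the units
paragraphs = [p for p in re.split(r"\n{2,}", text.strip()) if p]

# Full text as a single unit, comparable to $/ = undef
whole = text
```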

Reading a corpus
Save all the .txt files in a folder named, for instance, "Corpus":

    my $dir = "Corpus" ;
    my @texts ;
    opendir(DIR, $dir) or die "Can't open directory!" ;
    while (my $file = readdir(DIR)) {
        # keep only file names that end in ".txt"
        if ( $file =~ /\.txt$/ ) { push ( @texts, $file ) ; }
    }
    closedir(DIR) ;
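
The same selection step as a Python sketch. The function name list_texts is hypothetical, and the folder name "Corpus" in the usage comment is simply the example name from the slide.

```python
import os

def list_texts(directory):
    """Return the names of all files in `directory` that end in .txt."""
    return sorted(f for f in os.listdir(directory) if f.endswith(".txt"))

# Usage, assuming a folder named "Corpus" exists next to the script:
# texts = list_texts("Corpus")
```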

Two-dimensional hashes
- Hashes with two keys
- Often useful for collecting data about different texts

    $freq{ $text }{ $word }
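
A two-key structure of this kind can be sketched in Python as a dictionary of dictionaries. The two short "texts" below are invented; in practice the content would come from the files in the corpus folder.

```python
import re
from collections import defaultdict

# Hypothetical in-memory corpus: file name -> content
corpus = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog barked",
}

# Outer key: name of the text; inner key: word.
# This mirrors $freq{ $text }{ $word } on the slide.
freq = defaultdict(lambda: defaultdict(int))
for text_name, content in corpus.items():
    for word in re.findall(r"\w+", content.lower()):
        freq[text_name][word] += 1
```

Keeping the text name as the first key makes it easy to compare the frequency of one word across all texts.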

Lexical variety
- Number of types (distinct word forms)
- Number of tokens (all running words)

Type-token ratio
- The number of types divided by the number of tokens
- The higher the number, the higher the vocabulary diversity
- If the number is (relatively) low, there is a high level of repetition
- The length of the text has an impact on the type-token ratio

Example: "This is a sentence in which all the words are unique."
11 types / 11 tokens = 1
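
The calculation can be sketched as follows, in Python rather than Perl. The tokenisation rule (lowercased runs of word characters) is an assumption; a different rule would give slightly different counts.

```python
import re

def type_token_ratio(text):
    """Number of distinct words divided by the total number of words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

unique = type_token_ratio("This is a sentence in which all the words are unique.")
repetitive = type_token_ratio("the cat and the dog and the bird")
```

The first sentence has 11 types and 11 tokens, so its ratio is 1; the second has 5 types and 8 tokens, which gives 0.625.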

Variables in R
- Names may combine alphanumerical characters, underscores and dots; they must start with a letter, or with a dot that is not followed by a digit
- Unlike in Perl, they do not begin with a $
- The assignment operator in R is <-

    n <- 5

Vectors
- A collection of indexed values (similar to an array in Perl)
- Can be created using the c() function, or by supplying a range
Examples:

    x <- c( 4, 5, 3, 7 ) ;
    y <- 1:30 ;

Data frame
- A collection of vectors, all of the same length
- Each column of the table is stored in R as a vector

          V1   V2   V3
    R1     3    4    5
    R2     1   21    8
    R3    23    5    6

CSV file

    type,token
    ARoomWithaView.txt,6925,66445
    ATaleofTwoCities.txt,10188,135584
    HeartofDarkness.txt,5512,37896
    Ivanhoe.txt,12859,175069
    MobyDick.txt,18582,211806
    PrideandPrejudice.txt,6454,121781

Reading data in R
- Use the read.csv function
- The CSV file will be represented as a data frame
- The values on the first line will be used as column names; because that line has one field fewer than the lines that follow, the first value of each subsequent line will be used as a row name

    data <- read.csv( "data.csv" , header = TRUE ) ;
    colnames(data)
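
To make the row-name and column-name behaviour concrete, the sketch below parses the CSV content from the slides in Python, treating the first value of each data line as the row name. It illustrates the resulting structure only and is not how R implements read.csv; the variable names are assumptions.

```python
import csv
import io

# The CSV content from the slides, inlined so that the sketch runs on its own.
csv_text = """type,token
ARoomWithaView.txt,6925,66445
ATaleofTwoCities.txt,10188,135584
HeartofDarkness.txt,5512,37896
Ivanhoe.txt,12859,175069
MobyDick.txt,18582,211806
PrideandPrejudice.txt,6454,121781
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # column names: ["type", "token"]

data = {}
for row in reader:
    row_name, values = row[0], row[1:]  # first value acts as the row name
    data[row_name] = dict(zip(header, (int(v) for v in values)))

# Type-token ratio per text, computed from the two columns
ratio = {name: row["type"] / row["token"] for name, row in data.items()}
```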

Data frame columns
Can be accessed using the $ operator:

    data <- read.csv( "data.csv" , header = TRUE ) ;
    data$year

Calculations
max(), min(), mean(), sd()

    y <- data$year ;
    max(y) ;
    sd(y) ;
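
For comparison, the same four summary measures in a Python sketch. The numbers are the token counts from the CSV example on the slides, used here in place of the year column, whose values are not shown.

```python
import statistics

# Token counts taken from the CSV example on the slides
tokens = [66445, 135584, 37896, 175069, 211806, 121781]

largest = max(tokens)
smallest = min(tokens)
average = statistics.mean(tokens)
spread = statistics.stdev(tokens)  # sample standard deviation, as R's sd() computes
```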

Subsetting
- Selecting rows with specific characteristics; the result is a filtered data frame
- Inside the square brackets, the criterion for rows comes before the comma and the criterion for columns after it (one of these can remain empty)

    d2 <- d[ d$year > 2000 , ]
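
The row and column criteria can be mimicked in Python with a list of records. The rows below are invented; the only thing taken from the slide is that the data frame d has a "year" column.

```python
# Hypothetical rows standing in for the data frame d
d = [
    {"title": "A", "year": 1995},
    {"title": "B", "year": 2004},
    {"title": "C", "year": 2011},
]

# Row criterion only, all columns kept: comparable to d[ d$year > 2000 , ]
d2 = [row for row in d if row["year"] > 2000]

# Row and column criteria together: comparable to d[ d$year > 2000 , "title" ]
titles = [row["title"] for row in d if row["year"] > 2000]
```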

Basic R Exercises

Visualisation
- Distinction between "scientific visualisation" and "information visualisation"
- Lev Manovich: "a mapping between discrete data and a visual representation" (p. 2)
- Conversion to a graphic modality

Data Visualisation
- Jacques Bertin created a classification of "graphical primitives" in Semiology of Graphics
- Similar to Lev Manovich's description of the "graphical primitives such as points, straight lines, curves, and simple geometric shapes" which "stand in for objects and relations between them"

Ggplot
- A package for the creation of visuals (an alternative to R's base plotting system)
- A technical implementation of Leland Wilkinson's book The Grammar of Graphics
- Images express meaning via position, size, shape, rotation, colour, saturation, orientation and blur (p. 118)

Grammar of Graphics
- Graphs consist of (1) aesthetic attributes and (2) geometric objects
- Data values are expressed through aesthetic attributes; they are provided in the aes() function in ggplot
- The ggplot() function specifies the basic data set that is used in the visualisation; additional functions (which begin with geom_) define the geometric objects (the "graphic primitives")

Bar Chart

    library(ggplot2) ;    # load the package before calling ggplot()
    e <- read.csv( "e.csv" , header = TRUE ) ;
    p <- ggplot( e , aes( x = subject1 ) ) + geom_bar() ;
    print(p) ;            # draw the chart

Flipping X and Y axes

    p <- ggplot( e , aes( x = subject1 ) ) + geom_bar() + coord_flip()

Scatter plot

    p <- ggplot( d , aes( x = avgWords , y = ratio , col = author ,
                          shape = century , label = rownames(d) ) ) +
         geom_point( size = 6 ) +
         geom_text( col = "black" , hjust = -0.2 , size = 4 )

Documentation