Digital Text and Data Processing Week 1


Course background
□ Future of reading?
□ Understanding “machine reading”:
  □ Text analysis tools
  □ Visualisation tools
□ Differences between machine reading and human reading
Images taken from textarc.org and from the Google App store (Javelin for Android)

Scale

Text Mining
□ “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)
□ Information is found “not among formalised database records, but in the unstructured textual data” (2)
□ Related to data mining
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51
(2) Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1

Difficulties of natural language
□ Information is often implicit
□ Homonyms and synonyms
□ Computers do not have access to the meaning of the text
□ Spelling changes over time or may vary according to region

I trod on grass made green by summer's rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart refrain
were found where the rainbow quenches its points upon the earth

Rain   rain   rains   rain’s   Rain’s   Rain.   rain.   Rain!   ‘rain’

The outworn creeds again believed,
Hatred, despair, and fear and vain belief
Because I am a Priest do you believe
imagine, while asserting what it believes to be true …
The pleasure of believing what we see
long-believing courage, and the systematic efforts of generations of

Two stages in text mining
□ Data creation
□ Data analysis

Weekly Programme
Cluster 1: Data creation
□ W1: Introduction to the course and introduction to the Perl programming language
□ W2: Regular expressions, word segmentation, frequency lists, types and tokens
□ W3: Natural language processing: Part of Speech tagging, lemmatisation
□ W4: Exploration of existing text mining tools

Weekly Programme
Cluster 2: Data analysis
□ W5: Introduction to R package
□ W6: Multivariate analysis: Principal Component Analysis, Clustering techniques
□ W7: Visualisation
□ W8: Conclusion: What type of knowledge can we create?

Course evaluation
□ 5 assignments (2 points to be earned for each)
□ Final essay (ca. 3,000 words)
  □ Report of your individual research project
  □ Critical reflection on the merits of text mining:
    □ What sort of knowledge can be produced?
    □ How does this type of research relate to traditional scholarship?
    □ Main obstacles or challenges?
    □ Is the creation of a text analysis tool a legitimate scholarly activity in the humanities?

Introduction to programming
□ Programming languages: used to give instructions to a computer
□ There is a gap between human language and machine language
□ Digital information is information represented as combinations of 1s and 0s, e.g.: A = A =
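The binary values that accompanied the example on this slide did not survive in the transcript. As an illustration only, and assuming the common ASCII encoding, the following sketch prints the bit pattern of a few characters. The helper name `char_to_bits` is hypothetical and not part of the course material.

```perl
use strict;
use warnings;

# Return the 8-bit binary representation of a single ASCII character.
# (char_to_bits is a hypothetical helper name used for illustration.)
sub char_to_bits {
    my ($char) = @_;
    # ord() gives the numeric character code; %08b formats it as 8 binary digits.
    return sprintf( "%08b", ord($char) );
}

for my $char ( "A", "B", "a" ) {
    print $char, " = ", char_to_bits($char), "\n";
}
```

Each character is stored as a number, and that number is in turn stored as a combination of 1s and 0s.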

□ First generation programming languages: Assembler, e.g. ADD X1 Y1
□ Higher-level programming languages: translated by a compiler or an interpreter
Human programmer → Language processor → Computer
Programming language, e.g. Perl → Machine language

The Perl programming language
□ Open source
□ Developed by the linguist Larry Wall
□ Easy to learn; code is often easy to read
□ Developed specifically for text processing

Getting started
1. Create a working directory on your computer
2. Open a code editor and type the following lines:
    use strict ;
    use warnings ;
    print "It works!" ;
3. Modify the .bat file that is provided

Today’s exercise
Create an application in Perl which can read a machine-readable version of Shelley’s Collected Poems (file is provided) and which can print all lines that contain a given keyword (suggestions: “fire”, “rain”, “moon”, “storm”, “time”).

Variables
□ Scalar variables are always preceded by a dollar sign
    $keyword
□ Variables can be assigned a value with a specific data type (‘string’ or ‘number’)
    $keyword = "time" ;
    $number = 10 ;
□ Three types of variables: scalar, array, hash
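A minimal sketch of the three variable types follows. It assumes the `use strict` setting from the "Getting started" slide, under which each variable has to be declared with `my` on first use. The array contents come from the suggested keywords, and the counts in the hash are made-up sample values.

```perl
use strict;
use warnings;

# Scalar: holds a single value (a string or a number).
my $keyword = "time";
my $number  = 10;

# Array: an ordered list of scalars, written with @.
my @keywords = ( "fire", "rain", "moon", "storm", "time" );

# Hash: a set of key/value pairs, written with %.
# The counts are made-up sample data.
my %count = ( rain => 3, moon => 1 );

print "Keyword: $keyword\n";
print "Number plus one: ", $number + 1, "\n";
print "Second keyword: $keywords[1]\n";    # array indexes start at 0
print "Number of keywords: ", scalar(@keywords), "\n";
print "Count for rain: $count{rain}\n";
```

A single element of an array or hash is itself a scalar, which is why it is accessed with a dollar sign (`$keywords[1]`, `$count{rain}`).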

Strings
□ Can be created with single quotes and with double quotes
□ In the case of double quotes, the contents of the string will be interpreted (variables and escape characters are expanded).
□ For instance, you can then use “escape characters” in your string:
    "\n"    new line
    "\t"    tab
    "\a"    alarm bell
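The difference between the two kinds of quotes can be sketched as follows. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $keyword = "rain";

# Double quotes: the variable and the escape characters are interpreted.
my $double = "Keyword:\t$keyword\n";

# Single quotes: the contents are taken literally, backslashes included.
my $single = 'Keyword:\t$keyword\n';

print $double;
print $single, "\n";
```

The first `print` shows the word "rain" after a tab, while the second shows the characters `\t$keyword\n` exactly as typed.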

Statements
□ Perl statements can be compared to sentences.
□ Perl statements end in a semi-colon!
    print "Now this makes a statement!" ;

Exercise
Print a string that looks as follows:
    This is the first line.
    This is the second line.
    This line contains a        tab.
Also try to use the "\a" escape character in your string.

Reading a file
Reading a file is done as follows:
    open ( IN, "shelley.txt" ) ;
    while ( <IN> ) {
        print $_ ;
    }
    close ( IN ) ;
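A slightly more defensive variant of the same loop is sketched below. It is not the form used on the slide: it assumes a lexical filehandle, the three-argument form of `open`, and an explicit error message if the file cannot be opened. The sub name `read_lines` is hypothetical, and the script writes a small temporary sample file so that it can run without "shelley.txt".

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a text file line by line and return the lines as a list.
# (read_lines is a hypothetical helper name used for illustration.)
sub read_lines {
    my ($path) = @_;
    open( my $in, '<', $path ) or die "Cannot open $path: $!";
    my @lines;
    while ( my $line = <$in> ) {
        push @lines, $line;
    }
    close($in);
    return @lines;
}

# Create a small sample file so that the sketch runs on its own;
# in the course you would pass "shelley.txt" instead.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "First sample line\n", "Second sample line\n";
close($tmp);

print $_ for read_lines($path);
```

Checking the result of `open` matters in practice: if the file name is misspelled, the loop on the slide simply prints nothing, whereas this version stops with a message.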

Exercise
Create a Perl application which can read the text file "shelley.txt" and which can print all the lines.

Control keywords
    if ( <condition> ) {
        <block of code>
    } elsif ( <condition> ) {
        <block of code>
    } else {
        <last block of code ; default option>
    }
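As a concrete instance of this pattern, the sketch below classifies a line by its length. The sub name `describe_length` and the threshold of 20 characters are arbitrary choices made for illustration.

```perl
use strict;
use warnings;

# Classify a line by its length in characters.
# (describe_length and the threshold of 20 are arbitrary, illustrative choices.)
sub describe_length {
    my ($line) = @_;
    if ( length($line) == 0 ) {
        return "empty";
    }
    elsif ( length($line) < 20 ) {
        return "short";
    }
    else {
        # Default option: reached only if no earlier condition was true.
        return "long";
    }
}

print describe_length("Rain!"), "\n";
print describe_length("I trod on grass made green by summer's rain,"), "\n";
```

The conditions are tested from top to bottom, and only the block belonging to the first true condition is executed.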

Regular expressions
□ The pattern is given within two forward slashes
□ Use the =~ operator to test if a given string contains the regex.
□ Example:
    $keyword =~ /rain/
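The following sketch applies the pattern `/rain/` to a few of the lines quoted earlier in these slides. The sub name `contains_rain` is hypothetical.

```perl
use strict;
use warnings;

# Return 1 if the string contains the character sequence "rain", else 0.
# (contains_rain is a hypothetical helper name used for illustration.)
sub contains_rain {
    my ($string) = @_;
    if ( $string =~ /rain/ ) {
        return 1;
    }
    return 0;
}

my @sample = (
    "Through the fast-falling rain and high-wrought sea",
    "And suddenly my brain became as sand",
    "The pleasure of believing what we see",
);

for my $line (@sample) {
    print contains_rain($line) ? "match:    " : "no match: ", $line, "\n";
}
```

Note that the plain pattern also matches "brain", because the regex only looks for the character sequence and knows nothing about word boundaries or meaning.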

Exercise
You should now be able to complete the exercise that was discussed earlier.

Regular expressions (2)
□ If you place “i” directly after the second forward slash, the comparison will take place in a case insensitive manner.
□ \b can be used in regular expressions to represent word boundaries
    if ( $keyword =~ /\btime\b/i ) { }
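A sketch of how the two options combine, using lines quoted earlier in the slides as sample data. The sub name `matching_lines` is hypothetical. The `\Q...\E` escape is an addition that is not on the slide: it makes sure that a keyword held in a variable is matched literally.

```perl
use strict;
use warnings;

# Return the lines that contain $keyword as a whole word, ignoring case.
# (matching_lines is a hypothetical helper name used for illustration.)
sub matching_lines {
    my ( $keyword, @lines ) = @_;
    my @matches;
    for my $line (@lines) {
        # \b marks word boundaries, /i ignores case,
        # \Q...\E treats the keyword as literal text.
        if ( $line =~ /\b\Q$keyword\E\b/i ) {
            push @matches, $line;
        }
    }
    return @matches;
}

my @sample = (
    "I trod on grass made green by summer's rain,",
    "'Tis like a wondrous strain that sweeps",
    "And suddenly my brain became as sand",
    "were found where the rainbow quenches its points upon the earth",
    "Rain!",
);

print "$_\n" for matching_lines( "rain", @sample );
```

With the word boundaries in place, "strain", "brain" and "rainbow" are no longer reported, while "Rain!" is still found thanks to the case-insensitive comparison.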

Additional exercises
□ Create a program that can count the total number of lines in the file "shelley.txt"
□ Create a program that can calculate the length of each line, using the length() function
    length( $line ) ;
□ Calculate the average line length (in characters) for the entire file.
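One possible starting point for the last two exercises is sketched below. It works on a short in-memory list instead of "shelley.txt", it assumes that the newline at the end of each line should not be counted, and the sub name `average_length` is hypothetical.

```perl
use strict;
use warnings;

# Return the average length (in characters) of the given lines,
# not counting the newline at the end of each line.
# (average_length is a hypothetical helper name used for illustration.)
sub average_length {
    my (@lines) = @_;
    return 0 if @lines == 0;    # avoid dividing by zero for an empty list
    my $total = 0;
    for my $line (@lines) {
        chomp($line);           # remove the trailing newline, if any
        $total += length($line);
    }
    return $total / scalar(@lines);
}

my @sample = ( "Rain!\n", "And suddenly my brain became as sand\n" );

print "Number of lines: ", scalar(@sample), "\n";
print "Average length:  ", average_length(@sample), "\n";
```

To apply this to the whole file, the sample list would be replaced by the lines read from "shelley.txt".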