Digital Text and Data Processing


Digital Text and Data Processing Week 2

From texts to data

Text mining algorithms may focus on the recognition of:
Linguistic aspects, e.g. most frequent words, number of words in a sentence, number of nouns or adjectives, type/token ratio
Semantic aspects, e.g. references to concepts, sentiments, named entities (personal names, organisations, geographic locations)

Research based on vocabulary
Segmentation or tokenisation
Often based on the fact that there are spaces between words (at least since scriptura continua was abandoned in the late 9th century)
“Soft mark-up”
Source: Christopher Kelty, Abracadabra: Language, Memory, Representation

Frequency lists
Token counts reflect the total number of words; types are the unique words in a text
‘Bag of words’ model: original word order is ignored
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”
Tokens: 36
Types: 13
Top of the frequency list:
the          6
it           6
of           6
was          6
epoch        2
age          2
times        2
foolishness  1
wisdom       1
belief       1
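
The counts on this slide can be reproduced with a few lines of Perl. The following is a minimal sketch, not the course's own solution: it assumes that tokens are runs of letters and that case is ignored, and it uses an array and a hash, which the recapitulation slide does not cover. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "It was the best of times, it was the worst of times, "
         . "it was the age of wisdom, it was the age of foolishness, "
         . "it was the epoch of belief, it was the epoch of incredulity";

# Tokenise: lower-case the text, then collect every run of letters.
my @tokens = lc($text) =~ /[a-z]+/g;

# Count how often each type (unique word) occurs.
my %freq;
$freq{$_}++ for @tokens;

my $token_count = scalar @tokens;
my $type_count  = scalar keys %freq;
print "Tokens: $token_count\n";
print "Types: $type_count\n";

# Print the frequency list, most frequent first (ties in alphabetical order).
for my $word ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    print "$word $freq{$word}\n";
}
```

Because the hash only stores counts per word, the original word order is lost, which is exactly the ‘bag of words’ simplification.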

Stylometrics
Study of style on the basis of observable textual aspects
Analyses of differences and similarities between texts in different genres, in different periods, and by different authors
Source: Hugh Craig, Stylistic Analysis and Authorship Studies

Authorship attribution
Suggesting an author for texts whose authorship is disputed
Source: John Burrows, Never Say Always Again: Reflections on the Numbers Game

Becket, Andrew, A concordance to Shakespeare suited to all the editions, in which the distinguished and parallel passages in the plays of that justly admired writer are methodically arranged. 1787

“We will encourage you to develop the three great virtues of a programmer: laziness, impatience, and hubris”
Larry Wall, Programming Perl

Recapitulation W1
Scalar variables begin with a dollar sign. Two types of value: strings and numbers
Statements end in a semicolon
Operators such as +, - and * can be used to do calculations
To run a program, type “perl” followed by the name of the program at the command prompt

Avoiding errors
“use strict” has the effect that all variables need to be declared on first use with the “my” keyword
“use warnings” means that programmers will be warned when there are errors, even when these are “non-fatal”
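
As an illustration, a program that follows both recommendations might start as sketched below. The variable names and values are made up for the example.

```perl
use strict;     # undeclared variables become compile-time errors
use warnings;   # report non-fatal problems, e.g. using an undefined value

my $word  = "sun";   # declared with "my" on first use
my $count = 3;       # hypothetical value, for illustration only
print "$word occurs $count times\n";

# With "use strict" in force, a misspelt name such as $wrod below would stop
# the program before it runs (Global symbol "$wrod" requires explicit package name):
# print "$wrod\n";
```

Without “use strict”, the misspelt variable would silently be treated as a new, empty variable, which is much harder to track down.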

Operators
Concatenation of strings with the dot:
    $string1 = "Hello" ;
    $string2 = "World" ;
    $string3 = $string1 . " " . $string2 ;
Shorthand notation for mathematical operators:
    $sum = 5 + 1 ;
    $sum++ ;          # same as $sum = $sum + 1 ; ++ needs a variable, so 5++ is an error
    $number = 2 ;
    $number += 3 ;    # same as $number = $number + 3

Control keywords
    if ( <condition> ) {
        <first block of code>
    } elsif ( <condition> ) {
        <second block of code>
    } else {
        <last block of code ; default option>
    }
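
A concrete instance of this skeleton might look as follows. This is only a sketch: the thresholds and labels are invented for the example and do not come from the course.

```perl
use strict;
use warnings;

my $count = 12;   # hypothetical word count
my $label;

if ( $count > 100 ) {
    $label = "frequent";
} elsif ( $count > 10 ) {
    $label = "common";
} else {
    $label = "rare";    # default option when no condition holds
}
print "$label\n";   # prints "common"
```

The conditions are tested from top to bottom, and only the block belonging to the first true condition is executed.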

Reading a file
    open ( my $in , "<" , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;
    while ( <$in> ) {
        print $_ ;    # $_ holds the line that was just read
    }
    close ( $in ) ;
Curly brackets create a “block” of code

Regular expressions
A pattern which represents a specific sequence of characters
The pattern is given within two forward slashes
Use the =~ operator to test whether a given string contains the regex
Example: $keyword =~ /rain/
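
In practice the test usually appears as the condition of an if statement. A minimal sketch, with made-up keywords:

```perl
use strict;
use warnings;

my $keyword = "rainbow";   # example value, chosen for illustration
my $found   = 0;

# =~ is true when the pattern occurs anywhere inside the string.
if ( $keyword =~ /rain/ ) {
    $found = 1;
    print "'$keyword' contains 'rain'\n";
}

# A string without the sequence r-a-i-n does not match.
my $other = ( "snow" =~ /rain/ ) ? 1 : 0;
print "snow: $other\n";
```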

Regular expressions
Simplest regular expression: a simple sequence of characters
Example: /sun/
Also matches: disunited, sunk, asunder (and Sunday only if case is ignored, since matching is case-sensitive by default)
/ sun / (with a space on either side) does NOT match:
[…] the gate of the eastern sun,
[…] gloom beneath the noonday sun.

\b can be used in regular expressions to represent word boundaries
If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner
Example: /\bsun\b/i matches:
[…] Points to the unrisen sun! […]
[…] Startles the dreamer, sun-like truth […]
[…] stamped upon the sun; […]
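
The effect of \b and the i modifier can be checked with a short loop. In this sketch the first three lines are the fragments quoted on the slide; the last three are added test cases that are not in the original.

```perl
use strict;
use warnings;

# The first three lines come from the slide; the last three are added test cases.
my @lines = (
    "Points to the unrisen sun!",
    "Startles the dreamer, sun-like truth",
    "stamped upon the sun;",
    "The Sun rose",
    "disunited",
    "Sunday",
);

my $hits = 0;
for my $line (@lines) {
    # "sun" must stand as a separate word; upper or lower case is accepted.
    if ( $line =~ /\bsun\b/i ) {
        print "match:    $line\n";
        $hits++;
    } else {
        print "no match: $line\n";
    }
}
print "$hits lines matched\n";
```

Punctuation and hyphens count as word boundaries, so “sun!”, “sun-like” and “sun;” all match, while “disunited” and “Sunday” do not because letters follow or precede “sun” directly.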

Character classes
.     Any character (except a newline, by default)
\w    Any alphanumerical character: alphabetical characters, numbers and underscore
\d    Any digit
\s    White space: space, tab, newline
[..]  Any of the characters supplied within square brackets, e.g. [A-Za-z]

Quantifiers
{n,m}  Pattern must occur at least n times, at most m times
{n,}   At least n times
{n}    Exactly n times
?      Same as {0,1}
+      Same as {1,}
*      Same as {0,}

Examples
/\d{4}/ matches: 1234, 2013, 1066
/[a-zA-Z]+/ matches any word that consists of alphabetical characters only
Does not FULLY match: e-mail, catch22, can’t
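
The difference between a full and a partial match can be made explicit with the anchors ^ and $. The sketch below adds “email” as an extra test word and writes “can't” with a plain apostrophe; both choices are assumptions made for the example.

```perl
use strict;
use warnings;

# The first word is an added example; the others come from the slide
# (with a plain apostrophe in "can't").
my %full;
for my $word ( "email", "e-mail", "catch22", "can't" ) {
    # ^ and $ anchor the pattern to the start and end of the string,
    # so the letters must account for the whole word.
    $full{$word} = ( $word =~ /^[a-zA-Z]+$/ ) ? 1 : 0;
    print "$word: ", ( $full{$word} ? "full match" : "partial match only" ), "\n";
}

# Without anchors, /\d{4}/ succeeds on any string containing four digits in a row.
my $year_like = ( "12345" =~ /\d{4}/ ) ? 1 : 0;   # 1: "1234" is found inside
my $too_short = ( "123"   =~ /\d{4}/ ) ? 1 : 0;   # 0: only three digits
print "$year_like $too_short\n";
```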

Examples
/b[aeiou]{1,2}t\w*/
Matches: bit, boathouse, but, beat
Does not match: beauty, beast, blister, boyhood
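
Which of these words the pattern accepts can be verified by running it over the list. A minimal sketch (the array and the push function go slightly beyond what the recapitulation slide covers):

```perl
use strict;
use warnings;

my @words = qw(bit boathouse beauty but beat beast blister boyhood);

my @matches;
for my $word (@words) {
    # "b", then one or two vowels, then "t", then any further word characters.
    if ( $word =~ /b[aeiou]{1,2}t\w*/ ) {
        push @matches, $word;
    }
}
print "@matches\n";   # prints "bit boathouse but beat"
```

“beauty” fails because three vowels separate the “b” from the “t”; “beast”, “blister” and “boyhood” fail because a consonant follows the vowels or the “b” before any “t” is reached.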

Anchors
Do not match characters, but locations within strings.
\b  Word boundaries
^   Start of a line
$   End of a line

Regular expressions can be combined with the vertical bar (‘|’):
/\bsun\b|\bstar\b|\bmoon\b/
‘Special characters’ need to be escaped with the backslash (‘\’):
/\?/
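
Both patterns can be tried out on a handful of lines. The test lines in this sketch are made up for illustration.

```perl
use strict;
use warnings;

# Made-up test lines, for illustration only.
my @lines = ( "the moon is up", "sunny starlight", "Who is there?", "a lone star" );

my $sky       = 0;   # lines naming the sun, a star or the moon as a whole word
my $questions = 0;   # lines containing a literal question mark

for my $line (@lines) {
    $sky++       if $line =~ /\bsun\b|\bstar\b|\bmoon\b/;
    $questions++ if $line =~ /\?/;
}
print "sky: $sky, questions: $questions\n";   # prints "sky: 2, questions: 1"
```

“sunny starlight” is not counted, because the word boundaries rule out “sun” and “star” as parts of longer words. Without the backslash, the question mark would be read as the quantifier {0,1} rather than as a literal character.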

Exercises Try your hand at Perl exercises on regular expressions, numbers 9 and 10