Globalisation & Computer systems, Week 7
  Text processes and globalisation, part 1:
    Sorting strings: collation
    Searching strings and regular expressions
  Practical: regular expressions in UNIX

Text processes
  Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language”
  Text processes operate over text elements

Text processes: text elements
  The objects of a text
  Depends on perspective
  Different text processes operate over different objects

Sorting
  Sorting (collation): “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)

Sorting
  Language-specific sort order
  Phonetically based sort
  Graphically based sort
  Sort element

Sorting: levels of comparison
  Level 1 (primary difference)
  Levels 2 and 3 (similar)
  Level 4 (exact match)

Sorting: levels of comparison
  Level 4: exact match
    Match in code value
    Character equivalence
    resumes : resumes

Sorting: levels of comparison
  Level 1 (primary difference: alphabetic)
    resume < resumes
  Level 2 (similar: no accent < accent)
    resume < résumé
    resumes < résumés
  Level 3 (similar: lower case < upper case)
    résumé < Résumé
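
The levels above can be sketched in Python as three successive keys. This is a simplified illustration, not the Unicode Collation Algorithm: the helper names (`base_letters`, `accent_marks`, `case_pattern`, `first_difference`) are hypothetical, and accents are detected by decomposing characters with the standard `unicodedata` module.

```python
import unicodedata


def decompose(text):
    """Split accented characters into base letter + combining marks (NFD)."""
    return unicodedata.normalize("NFD", text)


def base_letters(text):
    """Level 1 key: base letters only, ignoring accents and case."""
    return "".join(
        ch for ch in decompose(text).casefold() if not unicodedata.combining(ch)
    )


def accent_marks(text):
    """Level 2 key: the combining marks attached to each base letter ("" if none)."""
    marks = []
    for ch in decompose(text):
        if unicodedata.combining(ch):
            if marks:
                marks[-1] += ch
        else:
            marks.append("")
    return tuple(marks)


def case_pattern(text):
    """Level 3 key: 0 for a lower-case base letter, 1 for an upper-case one."""
    return tuple(
        1 if ch.isupper() else 0
        for ch in decompose(text)
        if not unicodedata.combining(ch)
    )


def first_difference(a, b):
    """Return the first level (1 to 3) at which a and b differ, else 4."""
    for level, key in enumerate((base_letters, accent_marks, case_pattern), start=1):
        if key(a) != key(b):
            return level
    return 4  # no difference found at levels 1-3


print(first_difference("resume", "resumes"))   # 1
print(first_difference("resume", "résumé"))    # 2
print(first_difference("résumé", "Résumé"))    # 3
print(first_difference("resumes", "resumes"))  # 4
```

Because an empty mark string sorts before any accent and 0 sorts before 1, comparing the keys in order also reproduces "no accent < accent" and "lower case < upper case".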

Sorting: forward and backward sequence sort
  Forward sequence: start comparison from beginning of string
  Backward sequence: start comparison from end of string
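
A minimal sketch of how the direction changes the result. It assumes the direction is applied to accent weights, as in the commonly cited French example (cote, côte, coté, côté), which is not taken from the slides. The function names are hypothetical.

```python
import unicodedata


def accent_weights(word):
    """One weight per base letter: 0 if unaccented, 1 if it carries an accent."""
    weights = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if weights:
                weights[-1] = 1
        else:
            weights.append(0)
    return weights


def forward_key(word):
    # Forward sequence: compare weights starting from the beginning of the string.
    return tuple(accent_weights(word))


def backward_key(word):
    # Backward sequence: compare weights starting from the end of the string.
    return tuple(reversed(accent_weights(word)))


# All four words share the same base letters, so only the accents decide the order.
words = ["côté", "cote", "coté", "côte"]
forward_order = sorted(words, key=forward_key)
backward_order = sorted(words, key=backward_key)

print(forward_order)   # ['cote', 'coté', 'côte', 'côté']
print(backward_order)  # ['cote', 'côte', 'coté', 'côté']
```

In the forward sort the first accent from the left decides; in the backward sort the last accent decides, which is why côte moves ahead of coté.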

Sorting: implementation
  Sort keys
    Assign a set of weights to each character in the string
    Compare substrings according to weighting
    Switch weightings on / off
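
One way to picture a sort key with switchable weightings is sketched below. The function `sort_key` and its `use_accents` / `use_case` switches are hypothetical names, and the three weights per character (base letter, accent flag, case flag) are a simplification of what real collation tables hold.

```python
import unicodedata


def sort_key(text, use_accents=True, use_case=True):
    """Build a key: primary weights first, then optional accent and case weights."""
    primary, accents, case = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] = 1  # flag the preceding base letter as accented
            continue
        primary.append(ch.casefold())
        accents.append(0)
        case.append(1 if ch.isupper() else 0)
    key = [tuple(primary)]
    if use_accents:
        key.append(tuple(accents))
    if use_case:
        key.append(tuple(case))
    return tuple(key)


words = ["Résumé", "resumes", "résumé", "resume"]
print(sorted(words, key=sort_key))
# ['resume', 'résumé', 'Résumé', 'resumes']

# With both secondary weightings switched off, the three variants compare equal.
print(sort_key("Résumé", False, False) == sort_key("resume", False, False))  # True
```

Keeping each weighting in its own tuple is what makes it easy to switch on or off: a level that is dropped from the key simply stops influencing the comparison.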

Searching: text elements
  The objects of a text
  Depends on perspective
  Different text processes operate over different objects

Regular Expressions
  Basis of many web-based and word-processor-based searches
  Definition 1. An algebraic notation for describing a set of strings
  Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

Regular Expressions
  A search needs a regular expression and a text corpus
  Regular expression algebra has variants: Perl, Unix tools
  Unix tools: egrep, sed, awk

Regular Expressions
  Find occurrences of /Nokia/ in the text:
    egrep -n 'Nokia' nokia_corpus.txt
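
For readers without the corpus to hand, the behaviour of `egrep -n` (print each matching line with its line number) can be approximated in Python. This is a rough sketch: `grep_n` is a hypothetical helper, the sample lines are invented stand-ins for nokia_corpus.txt, and Python's `re` syntax is not identical to egrep's, although the operators in these slides behave the same way in both.

```python
import re


def grep_n(pattern, lines):
    """Return (line_number, line) pairs for lines containing a match."""
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]


# Invented sample text standing in for nokia_corpus.txt.
sample = [
    "Nokia shares fell on Tuesday.",
    "Analysts had expected a profit warning.",
    "A nokia spokesman declined to comment.",
]

for number, line in grep_n("Nokia", sample):
    print(f"{number}:{line}")
# 1:Nokia shares fell on Tuesday.
```

The search is case-sensitive, so the third sample line is not reported.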

Regular Expressions: set operator
  egrep -n '[Nn]okia' nokia_corpus.txt

Regular Expressions: optional operator
  egrep -n 'shares?' nokia_corpus.txt
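
The set and optional operators can be tried out on single words with Python's `re` module. The word list is invented for illustration, and `re.fullmatch` is used so that each pattern has to account for the whole word.

```python
import re

words = ["Nokia", "nokia", "share", "shares", "shared"]

# Set operator: [Nn] matches exactly one character, either N or n.
set_matches = [w for w in words if re.fullmatch(r"[Nn]okia", w)]

# Optional operator: s? matches zero or one s.
optional_matches = [w for w in words if re.fullmatch(r"shares?", w)]

print(set_matches)       # ['Nokia', 'nokia']
print(optional_matches)  # ['share', 'shares']

# egrep searches inside lines, so it would also report a line containing
# "shared": the substring "share" already satisfies the pattern.
print(bool(re.search(r"shares?", "shared")))  # True
```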

Regular Expressions: Kleene operators
  /string*/ “zero or more occurrences of the previous character”
  /string+/ “one or more occurrences of the previous character”
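
A short check of the two Kleene operators, using Python's `re` module on an invented list of candidates. Note that in both patterns the operator applies only to the final g, not to the whole word.

```python
import re

candidates = ["strin", "string", "stringgg", "strings"]

# g* : zero or more g after "strin"
star = [c for c in candidates if re.fullmatch(r"string*", c)]

# g+ : one or more g after "strin"
plus = [c for c in candidates if re.fullmatch(r"string+", c)]

print(star)  # ['strin', 'string', 'stringgg']
print(plus)  # ['string', 'stringgg']
```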

Regular Expressions: wildcard operator
  /string./ “any single character after the previous character”
  Combine wildcard and Kleene:
    /string.*/ “zero or more instances of any character after the previous character”
    /string.+/ “one or more instances of any character after the previous character”

Regular Expressions
  egrep -n 'profit.*' nokia_corpus.txt
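
To see what the wildcard combinations actually consume, the matched text can be printed with Python's `re` module. The sample line is invented; the patterns are the ones discussed above.

```python
import re

line = "Nokia issued a profit warning today"

one_char = re.search(r"profit.", line).group()   # "profit" plus one character
to_end = re.search(r"profit.*", line).group()    # "profit" plus the rest of the line
needs_more = re.search(r"profit.+", "net profit")  # nothing follows "profit"

print(repr(one_char))  # 'profit '
print(repr(to_end))    # 'profit warning today'
print(needs_more)      # None
```

Since egrep reports whole lines, 'profit.*' selects the same lines as 'profit' on its own; the trailing .* only changes how much of the line counts as the match.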

Regular Expressions: anchors
  Beginning of line operator: ^
    egrep '^said' nokia_corpus.txt
  End of line operator: $
    egrep 'said$' nokia_corpus.txt
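
The two anchors can be compared side by side in Python. The sample lines are invented. The beginning-of-line anchor is written before the text it anchors, and the end-of-line anchor after it.

```python
import re

lines = ["said the chairman", "the chairman said", "he said nothing"]

# ^said : "said" must be the first thing on the line
starts = [line for line in lines if re.search(r"^said", line)]

# said$ : "said" must be the last thing on the line
ends = [line for line in lines if re.search(r"said$", line)]

print(starts)  # ['said the chairman']
print(ends)    # ['the chairman said']
```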

Regular Expressions: disjunction
  Set operator: /[Ss]tring/ “a string which begins with either S or s”
  Range: /[A-Z]tring/ “a string beginning with a capital letter”
  Pipe |: /string1|string2/ “either string 1 or string 2”

Regular Expressions: disjunction
  egrep -n 'weak|warning|drop' nokia_corpus.txt
  egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
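
A sketch of the pipe operator in Python, run over invented sample lines rather than the actual corpus. It also shows what the second command adds: the alternatives are shorter stems, and .* extends each match to the end of the line.

```python
import re

lines = [
    "Nokia issued a profit warning",
    "shares drop sharply",
    "demand remained weak",
    "sales were strong",
]

plain = re.compile(r"weak|warning|drop")
extended = re.compile(r"weak.*|warn.*|drop.*")

plain_hits = [line for line in lines if plain.search(line)]
extended_hits = [line for line in lines if extended.search(line)]

print(plain_hits == extended_hits)       # True: same three lines on this sample
print(extended.search(lines[1]).group())  # drop sharply
```

On other text the second pattern can select more lines, because the stem "warn" also occurs in words such as "warned" or "warns".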

Regular Expressions: negation
  /[^a-z]tring/ “any string that does not begin with a lower-case letter”
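
The range and its negation can be contrasted in Python on a few invented test words. A negated set still has to match one character, so a word that starts directly with "tring" is not accepted.

```python
import re

words = ["String", "string", "1tring", "tring"]

# [A-Z] : first character must be an upper-case letter
upper = [w for w in words if re.fullmatch(r"[A-Z]tring", w)]

# [^a-z] : first character may be anything except a lower-case letter
not_lower = [w for w in words if re.fullmatch(r"[^a-z]tring", w)]

print(upper)      # ['String']
print(not_lower)  # ['String', '1tring']
```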

Regular Expressions: precedence
  1. Parentheses
  2. Kleene and optional operators: * + ?
  3. Anchors and sequences
  4. Disjunction operator: |
  (a) /supply|iers/ matches /supply/ or /iers/
  (b) /suppl(y|iers)/ matches /supply/ or /suppliers/
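
The effect of precedence on the two patterns can be checked in Python with an invented word list. Because sequences bind more tightly than the pipe, the unparenthesised pattern splits into the two whole alternatives "supply" and "iers".

```python
import re

words = ["supply", "suppliers", "iers", "suppl"]

# (a) No parentheses: the alternatives are "supply" and "iers".
without_parens = [w for w in words if re.fullmatch(r"supply|iers", w)]

# (b) Parentheses limit the disjunction to the ending: "suppl" + ("y" or "iers").
with_parens = [w for w in words if re.fullmatch(r"suppl(y|iers)", w)]

print(without_parens)  # ['supply', 'iers']
print(with_parens)     # ['supply', 'suppliers']
```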