Globalisation & Computer systems Week 7 Text processes and globalisation part 1: Sorting strings: collation Searching strings and regular expressions Practical: regular expressions in UNIX
Text processes Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language” Text processes operate over text elements
Text processes Text elements The objects of a text Depends on perspective Different text processes operate over different objects
Sorting Sorting (collation) “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)
Sorting Language specific sort order phonetically based sort graphically based sort sort element
Sorting Levels of comparison Level 1 (primary difference) Levels 2 and 3 (similar) Level 4 (exact match)
Sorting Levels of comparison Level 4: exact match (match in code value; character equivalence) resumes = resumes
Sorting Levels of comparison Level 1 (primary difference: alphabetic)
Sorting Levels of comparison Level 1 (primary difference): resume < resumes. Level 2 (similar: no accent < accent): resume < résumé, resumes < résumés. Level 3 (similar: lower case < upper case): résumé < Résumé
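The levels above can be sketched in Python. This is a minimal illustration, not a full collation algorithm: the function names are hypothetical, accents are detected with the standard `unicodedata` module, and "exact match" is taken to mean equality after NFC normalisation.

```python
import unicodedata

def strip_accents(s):
    # Decompose each character, then drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def first_difference_level(a, b):
    """Return the level (1-3) at which a and b first differ, or 4 for an exact match."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if strip_accents(a).lower() != strip_accents(b).lower():
        return 1  # primary difference: different base letters
    if a.lower() != b.lower():
        return 2  # same letters, different accents
    if a != b:
        return 3  # same letters and accents, different case
    return 4      # exact match

print(first_difference_level("resume", "resumes"))
print(first_difference_level("resume", "résumé"))
print(first_difference_level("résumé", "Résumé"))
print(first_difference_level("resumes", "resumes"))
```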
Sorting Forward and backward sequence sort Forward sequence Start comparison from beginning of string Backward sequence Start comparison from end of string
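A small sketch of the difference between the two sequences. It assumes, as in traditional French dictionary ordering, that the backward sequence is applied to the accent level only; the slide does not say which level is reversed, so treat that choice, the word list, and the function names as illustrative assumptions.

```python
import unicodedata

def accent_weights(word):
    # One weight per letter: 0 if unaccented, 1 if it carries a combining mark.
    weights = []
    for ch in unicodedata.normalize("NFC", word):
        weights.append(1 if len(unicodedata.normalize("NFD", ch)) > 1 else 0)
    return weights

def accent_sort_key(word, backward=False):
    nfd = unicodedata.normalize("NFD", word)
    base = "".join(c for c in nfd if not unicodedata.combining(c)).lower()
    accents = accent_weights(word)
    if backward:
        accents = accents[::-1]  # start the accent comparison from the end
    return (base, accents)

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=accent_sort_key)
backward = sorted(words, key=lambda w: accent_sort_key(w, backward=True))
print(forward)   # accent nearest the start of the word decides
print(backward)  # accent nearest the end of the word decides
```

With the forward sequence the accent on the second letter is compared before the accent on the last letter; with the backward sequence the final accent is compared first, so the two orderings of the accented words differ.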
Sorting Implementation Sort keys assign set of weights to each character in the string compare substrings according to weighting switch weightings on / off
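The sort-key idea can be sketched as follows. This is a simplified illustration, not the Unicode Collation Algorithm: each character gets a primary weight (its base letter), a secondary weight (accent or no accent) and a tertiary weight (case), and the two keyword arguments, which are hypothetical names, switch the lower levels on or off.

```python
import unicodedata

def collation_key(s, use_accents=True, use_case=True):
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # Attach the accent weight to the preceding base letter.
            if secondary:
                secondary[-1] = 1
            continue
        primary.append(ch.lower())
        secondary.append(0)
        tertiary.append(1 if ch.isupper() else 0)
    key = [primary]
    if use_accents:
        key.append(secondary)
    if use_case:
        key.append(tertiary)
    return tuple(key)

words = ["Résumé", "resumes", "résumé", "résumés", "resume"]
ordered = sorted(words, key=collation_key)
print(ordered)

# With levels 2 and 3 switched off, accent and case variants compare equal.
same = collation_key("Résumé", False, False) == collation_key("resume", False, False)
print(same)
```

Because the key is a tuple of weight lists, all primary weights are compared before any secondary weight, which is what keeps every spelling of "resume" ahead of "resumes".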
Searching Text elements The objects of a text Depends on perspective Different text processes operate over different objects
Regular Expressions Basis of all web-based and word-processor-based searches Definition 1. An algebraic notation for describing a string Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)
Regular Expressions A search needs two things: a regular expression and a text corpus. Regular expression algebra has variants: Perl, Unix tools. Unix tools: egrep, sed, awk
Regular Expressions Find occurrences of /Nokia/ in the text: egrep -n 'Nokia' nokia_corpus.txt
Regular Expressions set operator: egrep -n '[Nn]okia' nokia_corpus.txt
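For readers without a Unix shell to hand, the same two searches can be approximated in Python. This is a sketch only: `grep_n` is a hypothetical helper that imitates `egrep -n`, Python's `re` module stands in for egrep (the patterns used here behave the same in both), and the three sample lines are invented, not taken from nokia_corpus.txt.

```python
import re

def grep_n(pattern, lines):
    """Imitate `egrep -n`: return (line_number, line) for every line containing a match."""
    return [(n, line) for n, line in enumerate(lines, start=1)
            if re.search(pattern, line)]

sample = [
    "Nokia shares fell on Thursday.",
    "A nokia spokesman declined to comment.",
    "Analysts expect weak demand.",
]

literal_hits = grep_n(r"Nokia", sample)   # capital N only
set_hits = grep_n(r"[Nn]okia", sample)    # either N or n
print(literal_hits)
print(set_hits)
```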
Regular Expressions optional operator: egrep -n 'shares?' nokia_corpus.txt
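A quick way to see what the optional operator does is to look at the matched text rather than the matched lines. The sketch below uses Python's `re` module as a stand-in for egrep, on an invented sentence.

```python
import re

text = "Nokia shares fell; each share lost value, and shareholders complained."

# `s?` makes the final s optional, so both "share" and "shares" match.
found = re.findall(r"shares?", text)
print(found)
```

Note that the pattern is not tied to word boundaries: the "share" inside "shareholders" is also found.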
Regular Expressions Kleene operators: /string*/ “zero or more occurrences of previous character” /string+/ “1 or more occurrences of previous character”
Regular Expressions Wildcard operator: /string./ “string followed by any single character” Combine wildcard and Kleene: /string.*/ “string followed by zero or more characters of any kind” /string.+/ “string followed by one or more characters of any kind”
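The Kleene and wildcard operators can be checked against a few candidate strings. In this sketch, `matches` is a hypothetical helper that uses Python's `re.fullmatch`, so a result is True only when the whole candidate fits the pattern.

```python
import re

def matches(pattern, text):
    # True only if the entire text matches the pattern.
    return re.fullmatch(pattern, text) is not None

print(matches(r"string*", "strin"))     # * allows zero g's
print(matches(r"string*", "stringgg"))  # ... or several
print(matches(r"string+", "strin"))     # + needs at least one g
print(matches(r"string.", "strings"))   # . consumes exactly one character
print(matches(r"string.", "string"))    # nothing left for . to consume
print(matches(r"string.*", "string"))   # .* may consume nothing
print(matches(r"string.+", "string"))   # .+ needs at least one character
```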
Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
Regular Expressions Anchors Beginning of line operator: ^ egrep '^said' nokia_corpus.txt End of line operator: $ egrep 'said$' nokia_corpus.txt
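The two anchors can be tried out on a few invented lines. As before, Python's `re` module is used here as a stand-in for egrep, and the sample text is an assumption rather than part of the corpus.

```python
import re

sample = [
    "said the analyst on Monday",
    "Profits will fall, the company said",
    "He said nothing further.",
]

# ^ ties the match to the start of the line, $ to the end.
at_start = [line for line in sample if re.search(r"^said", line)]
at_end = [line for line in sample if re.search(r"said$", line)]
print(at_start)
print(at_end)
```

The third line contains "said" but matches neither pattern, because the word is in the middle of the line.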
Regular Expressions Disjunction: set operator /[Ss]tring/ “a string which begins with either S or s” Range /[A-Z]tring/ “a string beginning with a capital letter” pipe | /string1|string2/ “either string 1 or string 2”
Regular Expressions Disjunction egrep -n 'weak|warning|drop' nokia_corpus.txt egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
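The two disjunction patterns select lines in a similar way but differ in how much of the line they match. A sketch in Python (standing in for egrep, with an invented sentence):

```python
import re

line = "Nokia issued a profit warning on Tuesday"

# Plain alternatives match just the word itself.
short = re.search(r"weak|warning|drop", line).group()

# With .* each alternative runs on to the end of the line,
# and "warn" also catches related forms such as "warned".
long = re.search(r"weak.*|warn.*|drop.*", line).group()

print(short)
print(long)
```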
Regular Expressions Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
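A short check of the negated set, again using Python's `re` module as a stand-in. One detail worth seeing in practice: `[^a-z]` still has to match one character, and that character can be anything that is not a lowercase letter, including a digit.

```python
import re

def matches(pattern, text):
    # True only if the entire text matches the pattern.
    return re.fullmatch(pattern, text) is not None

print(matches(r"[^a-z]tring", "String"))  # capital letter is allowed
print(matches(r"[^a-z]tring", "9tring"))  # so is a digit
print(matches(r"[^a-z]tring", "string"))  # lowercase first letter is rejected
```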
Regular Expressions Precedence 1. Parentheses 2. Kleene and optional operators * + ? 3. Anchors and sequences 4. Disjunction operator | (a) /supply|iers/ matches /supply/ or /iers/ (b) /suppl(y|iers)/ matches /supply/ or /suppliers/
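The effect of precedence in examples (a) and (b) can be confirmed directly. The sketch below uses Python's `re.fullmatch` with a hypothetical `matches` helper, so each result says whether the whole word fits the pattern.

```python
import re

def matches(pattern, text):
    # True only if the entire text matches the pattern.
    return re.fullmatch(pattern, text) is not None

# (a) | has the lowest precedence, so the alternatives are "supply" and "iers".
print(matches(r"supply|iers", "supply"))
print(matches(r"supply|iers", "iers"))
print(matches(r"supply|iers", "suppliers"))

# (b) Parentheses limit the disjunction to the ending.
print(matches(r"suppl(y|iers)", "supply"))
print(matches(r"suppl(y|iers)", "suppliers"))
```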