Globalisation & Computer systems Week 7 Text processes and globalisation part 1: Sorting strings: collation Searching strings and regular expressions Practical:


1 Globalisation & Computer systems Week 7 Text processes and globalisation part 1: Sorting strings: collation Searching strings and regular expressions Practical: regular expressions in UNIX

2 Text processes Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language” Text processes operate over text elements

3 Text processes Text elements The objects of a text Depends on perspective Different text processes operate over different objects

4 Sorting Sorting (collation) “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)

5 Sorting Language specific sort order phonetically based sort graphically based sort sort element

6 Sorting Levels of comparison Level 1 (primary difference) Levels 2 and 3 (similar) Level 4 (exact match)

7 Sorting Levels of comparison Level 4: exact match match in code value character equivalence resumes : resumes

8 Sorting Levels of comparison Level 1 (primary difference: alphabetic)

9 Sorting Levels of comparison Level 1 (primary difference) resume < resumes

10 Sorting Levels of comparison Level 1 (primary difference) resume < resumes Level 2 (similar: no accent < accent) resume < résumé resumes < résumés Level 3 (similar: lower case < upper case) résumé < Résumé
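
The levels on this slide can be sketched in Python. This is a minimal illustration, not a full collation algorithm: the helper names `level_keys` and `difference_level` are hypothetical, accents are detected through Unicode decomposition, and level 4 is simply taken to mean that no difference was found at levels 1 to 3.

```python
import unicodedata

def level_keys(text):
    """Return (primary, secondary, tertiary) keys for a string (sketch only)."""
    decomposed = unicodedata.normalize("NFD", text)
    primary, secondary, tertiary = [], [], []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # A combining mark flags the preceding base letter as accented.
            if secondary:
                secondary[-1] = 1
        else:
            primary.append(ch.lower())                 # level 1: base letter
            secondary.append(0)                        # level 2: no accent yet
            tertiary.append(1 if ch.isupper() else 0)  # level 3: case
    return primary, secondary, tertiary

def difference_level(a, b):
    """Return the first level (1-3) at which a and b differ, or 4 if none does."""
    for level, (x, y) in enumerate(zip(level_keys(a), level_keys(b)), start=1):
        if x != y:
            return level
    return 4

print(difference_level("resume", "resumes"))   # 1
print(difference_level("resume", "résumé"))    # 2
print(difference_level("résumé", "Résumé"))    # 3
print(difference_level("resumes", "resumes"))  # 4
```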

11 Sorting Forward and backward sequence sort Forward sequence Start comparison from beginning of string Backward sequence Start comparison from end of string
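
A small Python sketch of the difference between the two directions. It assumes, as is commonly done for French dictionary order, that only the accent level is compared backward while the base letters are still compared forward; the function name `accent_key` and the word list are illustrative choices, not part of the lecture.

```python
import unicodedata

def accent_key(text, backward=False):
    """Return (letters, accents); accents are reversed for a backward sort."""
    decomposed = unicodedata.normalize("NFD", text)
    letters, accents = [], []
    for ch in decomposed:
        if unicodedata.combining(ch):
            if accents:
                accents[-1] = 1   # mark the preceding letter as accented
        else:
            letters.append(ch.lower())
            accents.append(0)
    if backward:
        accents.reverse()         # start the accent comparison from the end
    return letters, accents

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=accent_key)
backward = sorted(words, key=lambda w: accent_key(w, backward=True))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

In the forward sort the first accent in the word decides the order; in the backward sort the last accent decides it.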

12 Sorting Implementation Sort keys assign set of weights to each character in the string compare substrings according to weighting switch weightings on / off
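
The sort-key idea can be sketched as follows. The weights here are invented for illustration (primary weight is the code point of the lowercased base letter, secondary is 1 for an accented letter, tertiary is 1 for upper case); a real implementation would take them from a collation table. The `strength` parameter is a hypothetical way of switching the higher levels on and off.

```python
import unicodedata

def weights(ch):
    """Return illustrative (primary, secondary, tertiary) weights for one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    primary = ord(base.lower())
    secondary = 1 if len(decomposed) > 1 else 0   # accented or not
    tertiary = 1 if base.isupper() else 0         # upper case or not
    return primary, secondary, tertiary

def sort_key(text, strength=3):
    """Build a key of all level-1 weights, then level 2, then level 3.

    Levels above `strength` are switched off (left out of the key).
    """
    text = unicodedata.normalize("NFC", text)
    triples = [weights(ch) for ch in text]
    return tuple(tuple(t[level] for t in triples) for level in range(strength))

words = ["Résumé", "résumés", "resumes", "résumé", "resume"]
print(sorted(words, key=sort_key))
# ['resume', 'résumé', 'Résumé', 'resumes', 'résumés']

# With only level 1 switched on, accents and case are ignored.
print(sort_key("résumé", strength=1) == sort_key("resume", strength=1))  # True
```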

13 Searching Text elements The objects of a text Depends on perspective Different text processes operate over different objects

14 Regular Expressions Basis of all web-based and word-processor-based searches Definition 1. An algebraic notation for describing a string Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

15 Regular Expressions regular expression, text corpus regular expression algebra has variants: Perl, Unix tools Unix tools: egrep, sed, awk

16 Regular Expressions Find occurrences of /Nokia/ in the text egrep -n 'Nokia' nokia_corpus.txt

17 Regular Expressions egrep -n 'Nokia' nokia_corpus.txt

18 Regular Expressions set operator egrep -n '[Nn]okia' nokia_corpus.txt

19 Regular Expressions optional operator egrep -n 'shares?' nokia_corpus.txt

20 Regular Expressions egrep -n 'shares?' nokia_corpus.txt
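
The three searches above can be tried without the corpus file. The sketch below imitates `egrep -n` in Python with the standard `re` module; the sample lines are made up for illustration and are not taken from nokia_corpus.txt, and `matching_lines` is a hypothetical helper. For these simple patterns Python and egrep syntax agree.

```python
import re

# Invented stand-in for a few lines of the corpus.
SAMPLE = [
    "Nokia shares fell on Tuesday.",
    "Analysts said nokia may recover.",
    "Each share lost two euros.",
    "Profits were flat.",
]

def matching_lines(pattern, lines=SAMPLE):
    """Return the 1-based numbers of the lines that contain a match."""
    return [n for n, line in enumerate(lines, start=1) if re.search(pattern, line)]

print(matching_lines("Nokia"))      # [1]     literal string
print(matching_lines("[Nn]okia"))   # [1, 2]  set operator: N or n
print(matching_lines("shares?"))    # [1, 3]  optional operator: share or shares
```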

21 Regular Expressions Kleene operators: /string*/ “zero or more occurrences of previous character” /string+/ “1 or more occurrences of previous character”

22 Regular Expressions Wildcard operator: /string./ “any character after the previous character”

23 Regular Expressions Wildcard operator: /string./ “any character after the previous character” Combine wildcard and Kleene: /string.*/ “zero or more instances of any character after the previous character” /string.+/ “one or more instances of any character after the previous character”

24 Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
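
A short Python check of the Kleene and wildcard operators, using `re.fullmatch` so that the whole test string has to match. The test strings are invented for illustration, and `matches` is a hypothetical helper.

```python
import re

def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches("string*", "strin"))     # True:  zero occurrences of g
print(matches("string*", "stringgg"))  # True:  several occurrences of g
print(matches("string+", "strin"))     # False: + needs at least one g
print(matches("string.", "strings"))   # True:  . stands for the final s
print(matches("string.", "string"))    # False: . needs one character
print(matches("string.*", "string"))   # True:  .* may match nothing
print(matches("string.+", "string"))   # False: .+ needs at least one character

# In a search, .* extends the match to the end of the line.
print(re.search("profit.*", "Net profit fell sharply").group())
# profit fell sharply
```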

25 Regular Expressions Anchors Beginning of line operator: ^ egrep '^said' nokia_corpus.txt End of line operator: $ egrep 'said$' nokia_corpus.txt
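
The anchors can be checked in Python as a quick sketch. The three lines are invented examples; note that the end-of-line anchor goes after the string (`said$`), since a pattern with `$` placed before the string cannot match within a single line.

```python
import re

lines = [
    "said the chairman",
    "the chairman said",
    "he said nothing",
]

starts = [l for l in lines if re.search("^said", l)]   # said at start of line
ends = [l for l in lines if re.search("said$", l)]     # said at end of line
never = [l for l in lines if re.search("$said", l)]    # $ before the string

print(starts)  # ['said the chairman']
print(ends)    # ['the chairman said']
print(never)   # []
```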

26 Regular Expressions Disjunction: set operator /[Ss]tring/ “a string which begins with either S or s” Range /[A-Z]tring/ “a string beginning with a capital letter” pipe | /string1|string2/ “either string 1 or string 2”

27 Regular Expressions Disjunction egrep -n 'weak|warning|drop' nokia_corpus.txt egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt

28 Regular Expressions Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
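
A Python sketch of the disjunction and negation operators from the last three slides. The test strings are invented, and `matches` is a hypothetical helper. One point worth noticing is that a negated set still consumes one character, so /[^a-z]tring/ also accepts a digit or punctuation mark in first position.

```python
import re

def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Set operator: S or s.
print(matches("[Ss]tring", "String"), matches("[Ss]tring", "string"))    # True True

# Range: any capital letter A to Z.
print(matches("[A-Z]tring", "String"), matches("[A-Z]tring", "string"))  # True False

# Negation: any single character except a lowercase letter.
print(matches("[^a-z]tring", "String"))  # True
print(matches("[^a-z]tring", "9tring"))  # True
print(matches("[^a-z]tring", "string"))  # False

# Pipe: any one of the alternatives.
print(re.search("weak|warning|drop", "profit warning issued").group())   # warning
```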

29 Regular Expressions Precedence 1. Parentheses 2. Kleene and optional operators * + ? 3. Anchors and sequences 4. Disjunction operator | (a) /supply|iers/

30 Regular Expressions Precedence 1. Parentheses 2. Kleene and optional operators * + ? 3. Anchors and sequences 4. Disjunction operator | (a) /supply|iers/ matches /supply/ or /iers/ (b) /suppl(y|iers)/ matches /supply/ or /suppliers/
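
The effect of precedence in examples (a) and (b) can be confirmed with a short Python sketch. Because sequences bind more tightly than |, pattern (a) is read as "supply" or "iers"; the parentheses in (b) limit the disjunction to the ending. The helper `matches` is a hypothetical name.

```python
import re

def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (a) without parentheses: supply OR iers
print(matches("supply|iers", "supply"))       # True
print(matches("supply|iers", "iers"))         # True
print(matches("supply|iers", "suppliers"))    # False

# (b) with parentheses: suppl followed by y OR iers
print(matches("suppl(y|iers)", "supply"))     # True
print(matches("suppl(y|iers)", "suppliers"))  # True
print(matches("suppl(y|iers)", "iers"))       # False
```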

