Published by Oscar Carson. Modified over 9 years ago.
1
Globalisation & Computer systems Week 7 Text processes and globalisation part 1: Sorting strings: collation Searching strings and regular expressions Practical: regular expressions in UNIX
2
Text processes Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language” Text processes operate over text elements
3
Text processes Text elements The objects of a text Depends on perspective Different text processes operate over different objects
4
Sorting Sorting (collation) “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)
5
Sorting Language-specific sort order: phonetically based sort; graphically based sort; sort element
6
Sorting Levels of comparison Level 1 (primary difference) Levels 2 and 3 (similar) Level 4 (exact match)
7
Sorting Levels of comparison Level 4 (exact match): match in code value; character equivalence. Example: resumes : resumes
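The difference between a match in code value and character equivalence can be sketched in Python. This is a minimal illustration, assuming Unicode canonical equivalence (NFC normalisation) is the kind of equivalence meant; the function names are invented for this example and the word "résumé" is borrowed from the later slides.

```python
import unicodedata

composed = "r\u00e9sum\u00e9"        # é stored as one code point (U+00E9)
decomposed = "re\u0301sume\u0301"    # e followed by a combining acute accent (U+0301)

def same_code_values(a: str, b: str) -> bool:
    """Exact match: the two strings contain identical code values."""
    return a == b

def canonically_equivalent(a: str, b: str) -> bool:
    """Match after normalising both strings to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_code_values(composed, decomposed))        # the code values differ
print(canonically_equivalent(composed, decomposed))  # the characters are equivalent
```

Both strings display as the same word, so a comparison on raw code values alone would treat visually identical text as different.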
8
Sorting Levels of comparison Level 1 (primary difference: alphabetic)
9
Sorting Levels of comparison Level 1 (primary difference) resume < resumes
10
Sorting Levels of comparison Level 1 (primary difference): resume < resumes. Level 2 (similar: no accent < accent): resume < résumé, resumes < résumés. Level 3 (similar: lower case < upper case): résumé < Résumé.
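The three levels can be sketched as a tuple of keys that Python compares in order. This is a simplified illustration, not the Unicode Collation Algorithm: it assumes accents can be separated from letters by NFD decomposition, and `level_keys` and `first_difference_level` are names invented here.

```python
import unicodedata

def level_keys(word):
    """Return one comparison key per level: (letters, accents, case)."""
    letters = []   # base characters with accents removed
    accents = []   # combining marks attached to each base character
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)
        else:
            letters.append(ch)
            accents.append(())
    level1 = "".join(letters).casefold()            # alphabetic difference
    level2 = tuple(accents)                         # no accent < accent
    level3 = tuple(ch.isupper() for ch in letters)  # lower case < upper case
    return level1, level2, level3

def first_difference_level(a, b):
    """Return the first level (1 to 3) at which a and b differ, or None."""
    for level, (key_a, key_b) in enumerate(zip(level_keys(a), level_keys(b)), start=1):
        if key_a != key_b:
            return level
    return None

words = ["Résumé", "résumés", "resumes", "résumé", "resume"]
ordered = sorted(words, key=level_keys)
print(ordered)
```

Because tuples are compared element by element, level 2 is consulted only when level 1 is equal, and level 3 only when levels 1 and 2 are equal.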
11
Sorting Forward and backward sequence sort. Forward sequence: start comparison from the beginning of the string. Backward sequence: start comparison from the end of the string.
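A commonly cited case for backward comparison is French dictionary ordering, where accent differences are traditionally resolved from the end of the word. That example is not in the slides; the sketch below assumes it, and the helper names are invented for illustration.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, per-letter tuples of combining marks)."""
    letters, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)
        else:
            letters.append(ch)
            accents.append(())
    return "".join(letters), tuple(accents)

def forward_key(word):
    letters, accents = split_accents(word)
    return letters, accents          # accents compared from the start of the word

def backward_key(word):
    letters, accents = split_accents(word)
    return letters, accents[::-1]    # accents compared from the end of the word

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=forward_key)
backward = sorted(words, key=backward_key)
print(forward)
print(backward)
```

The base letters are identical in all four words, so only the direction of the accent comparison decides whether "coté" or "côte" comes first.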
12
Sorting Implementation: sort keys. Assign a set of weights to each character in the string; compare substrings according to weighting; switch weightings on / off.
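One way to picture sort keys with switchable weightings is sketched below. It assumes precomposed Latin text and a three-weight scheme (letter, accent, case) chosen for illustration; `char_weights`, `sort_key` and the `levels` parameter are hypothetical names, not part of any library.

```python
import unicodedata

def char_weights(ch):
    """Return (primary, secondary, tertiary) weights for one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    primary = base.casefold()            # which letter it is
    secondary = decomposed[1:]           # combining accents, empty if none
    tertiary = 1 if base.isupper() else 0
    return primary, secondary, tertiary

def sort_key(text, levels=(1, 2, 3)):
    """Build a sort key from the selected levels only."""
    weights = [char_weights(ch) for ch in unicodedata.normalize("NFC", text)]
    return tuple(tuple(w[level - 1] for w in weights) for level in levels)

# With every level switched on, accents and case break ties.
print(sort_key("resume") < sort_key("résumé") < sort_key("Résumé"))

# With only level 1 switched on, the same words compare as equal.
print(sort_key("resume", levels=(1,)) == sort_key("Résumé", levels=(1,)))
```

Switching levels off is what allows a loose search to treat "resume" and "Résumé" as the same word while a full sort still orders them.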
13
Searching Text elements The objects of a text Depends on perspective Different text processes operate over different objects
14
Regular Expressions Basis of all web-based and word-processor-based searches. Definition 1: an algebraic notation for describing a string. Definition 2: a set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.).
15
Regular Expressions regular expression, text corpus; regular expression algebra has variants: Perl, Unix tools; Unix tools: egrep, sed, awk
16
Regular Expressions Find occurrences of /Nokia/ in the text: egrep -n 'Nokia' nokia_corpus.txt
17
Regular Expressions egrep -n 'Nokia' nokia_corpus.txt
18
Regular Expressions set operator: egrep -n '[Nn]okia' nokia_corpus.txt
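The literal search and the set-operator search can be tried without the corpus file. This is a minimal sketch that uses Python's `re` module as a stand-in for egrep (the operators used here behave the same way in both); `SAMPLE` is invented text, not the contents of nokia_corpus.txt, and `grep_n` is a hypothetical helper.

```python
import re

# Invented sample lines standing in for nokia_corpus.txt.
SAMPLE = [
    "Nokia said profits fell in the third quarter.",
    "Analysts expect the shares to recover.",
    "Shares in nokia dropped after the warning.",
]

def grep_n(pattern, lines):
    """Return (line number, line) pairs for lines containing a match, like egrep -n."""
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]

for number, line in grep_n("[Nn]okia", SAMPLE):
    print(f"{number}:{line}")
```

The pattern `Nokia` finds only the capitalised spelling, while `[Nn]okia` accepts either `N` or `n` in the first position.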
19
Regular Expressions optional operator: egrep -n 'shares?' nokia_corpus.txt
20
Regular Expressions egrep -n 'shares?' nokia_corpus.txt
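The effect of the optional operator is easiest to see on the matched substrings rather than on whole lines. A small sketch, assuming Python's `re` module in place of egrep and an invented sentence as input:

```python
import re

# Invented sentence; egrep would print the whole line, findall shows each match.
text = "One share rose while most shares fell; shareholders were cautious."

found = re.findall(r"shares?", text)
print(found)
```

The `?` makes the final `s` optional, so the pattern matches both "share" and "shares". It also matches the first five letters of "shareholders", because nothing in the pattern requires the match to end at a word boundary.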
21
Regular Expressions Kleene operators: /string*/ “zero or more occurrences of previous character” /string+/ “1 or more occurrences of previous character”
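The two Kleene operators apply only to the character immediately before them, here the final `g`. A minimal sketch, assuming Python's `re` module and invented test strings; `fullmatch` is used so that the whole string must fit the pattern:

```python
import re

star = re.compile(r"string*")   # "strin" followed by zero or more "g"
plus = re.compile(r"string+")   # "strin" followed by one or more "g"

candidates = ["strin", "string", "stringgg"]

star_matches = [c for c in candidates if star.fullmatch(c)]
plus_matches = [c for c in candidates if plus.fullmatch(c)]
print(star_matches)
print(plus_matches)
```

Only `*` accepts "strin", because it allows the `g` to be absent altogether.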
22
Regular Expressions Wildcard operator: /string./ “any character after the previous character”
23
Regular Expressions Wildcard operator: /string./ “any character after the previous character” Combine wildcard and kleene: /string.*/ “zero or more instances of any character after the previous character” /string.+/ “one or more instances of any character after the previous character”
24
Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
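What the wildcard and Kleene combination actually consumes can be checked on a single line. This sketch assumes Python's `re` module in place of egrep and uses invented sentences:

```python
import re

line = "Nokia said profits fell sharply."

one_char = re.search(r"profit.", line).group()    # exactly one more character
rest = re.search(r"profit.*", line).group()       # everything up to the end of the line
print(one_char)
print(rest)

# ".+" needs at least one character after "profit"; ".*" does not.
ends_with_profit = "a fall in profit"
print(re.search(r"profit.+", ends_with_profit))
print(re.search(r"profit.*", ends_with_profit).group())
```

Since egrep prints every line that contains a match, `profit.*` selects the same lines as plain `profit`; the difference is only in how much of each line counts as the match.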
25
Regular Expressions Anchors Beginning of line operator: ^ egrep '^said' nokia_corpus.txt End of line operator: $ egrep 'said$' nokia_corpus.txt
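The position of each anchor relative to the text matters: `^` goes before the text it anchors and `$` goes after it. A minimal sketch, assuming Python's `re` module in place of egrep and invented lines:

```python
import re

lines = [
    "said the chairman on Monday",
    "the chairman said",
    "Nokia said it would cut costs",
]

starts_with = [line for line in lines if re.search(r"^said", line)]
ends_with = [line for line in lines if re.search(r"said$", line)]

# With "$" placed first, the pattern asks for text after the end of the line,
# so it cannot match any of these lines.
dollar_first = [line for line in lines if re.search(r"$said", line)]

print(starts_with)
print(ends_with)
print(dollar_first)
```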
26
Regular Expressions Disjunction: set operator /[Ss]tring/ “a string which begins with either S or s” Range /[A-Z]tring/ “a string beginning with a capital letter” pipe | /string1|string2/ “either string 1 or string 2”
27
Regular Expressions Disjunction egrep -n 'weak|warning|drop' nokia_corpus.txt egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
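The pipe operator can be tried on a few invented lines. This sketch assumes Python's `re` module in place of egrep; the sample text is not taken from nokia_corpus.txt.

```python
import re

lines = [
    "Nokia issued a profit warning.",
    "Demand remained weak in Europe.",
    "Shares dropped sharply.",
    "Sales rose in Asia.",
]

pattern = re.compile(r"weak|warning|drop")
hits = [n for n, line in enumerate(lines, start=1) if pattern.search(line)]
print(hits)

# Adding ".*" to each alternative extends the match to the end of the line.
extended = re.search(r"weak.*|warn.*|drop.*", lines[2]).group()
print(extended)
```

A line is selected as soon as any one alternative is found in it, and "drop" also matches inside "dropped".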
28
Regular Expressions Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
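A negated set still has to match one character; it only restricts which character that may be. A minimal sketch, assuming Python's `re` module and invented test strings:

```python
import re

pattern = re.compile(r"[^a-z]tring")

candidates = ["String", "string", "9tring", " tring"]
matched = [c for c in candidates if pattern.fullmatch(c)]
print(matched)
```

Only "string" is rejected. A digit or a space in first position is accepted, because `[^a-z]` means "any single character other than a lowercase letter", not "a capital letter".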
29
Regular Expressions Precedence 1. Parentheses 2. Kleene and optional operators *, +, ? 3. Anchors and sequences 4. Disjunction operator | (a) /supply|iers/
30
Regular Expressions Precedence 1. Parentheses 2. Kleene and optional operators *, +, ? 3. Anchors and sequences 4. Disjunction operator | (a) /supply|iers/ matches /supply/ or /iers/ (b) /suppl(y|iers)/ matches /supply/ or /suppliers/
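The precedence rules explain why the two patterns accept different words. A small sketch, assuming Python's `re` module (which follows the same ordering for these operators) and an invented word list; `fullmatch` requires the whole word to fit the pattern:

```python
import re

words = ["supply", "suppliers", "iers", "supplyiers"]

without_parens = re.compile(r"supply|iers")     # sequence binds tighter than |
with_parens = re.compile(r"suppl(y|iers)")      # parentheses limit the scope of |

case_a = [w for w in words if without_parens.fullmatch(w)]
case_b = [w for w in words if with_parens.fullmatch(w)]
print(case_a)
print(case_b)
```

Without parentheses the disjunction splits the whole pattern in two, so "suppliers" is not matched. With parentheses only the ending varies.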