1
CSE 467/567 Computational Linguistics
Carl Alphonce
cse-467-alphonce@cse.buffalo.edu
Computer Science & Engineering, University at Buffalo
2
Fall 2006 CSE 467/567
Levels of processing
  phonetics/phonology – sounds
  morphology – word structure
  syntax – sentence structure
  semantics – meaning
  pragmatics – goals of language use
  discourse – utterances in context
3
Words: the building blocks of sentences
4
Words have internal structure
  readable = read + able
  readability = read + able + ity
The structure of words can be described using a regular grammar.
5
Chomsky hierarchy
  regular languages
  context-free languages
  context-sensitive languages
  unrestricted languages
6
Problem
I often need to find an e-mail, but I have thousands of e-mails in my various folders.
Suppose I want to find an e-mail about geese. The e-mail may mention “geese” or “goose”; also, if the word appears at the start of a sentence, its initial letter will be capitalized.
So the search needs to match “goose”, “geese”, “Goose” or “Geese”.
7
Regular expressions (in Perl)
“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]
Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.
8
Sequences of characters
A sequence of characters is matched by writing it between slashes: /…/
Examples:
  /a/ matches the character ‘a’
  /fred/ matches the string ‘fred’
Note: /fred/ does not match the string ‘Fred’! In other words, patterns are case-sensitive.
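A minimal sketch of the case-sensitivity point above. It uses Python's re module as a stand-in for Perl's match operator, which is an assumption made for illustration since the slides use Perl. The helper name matches is hypothetical.

```python
import re

# Hypothetical helper: re.search plays the role of Perl's /.../ match,
# which succeeds if the pattern occurs anywhere in the text.
def matches(pattern: str, text: str) -> bool:
    return re.search(pattern, text) is not None

lower_hit = matches(r"fred", "fred")     # same characters, same case
upper_hit = matches(r"fred", "Fred")     # 'F' is not 'f', so no match
inside_hit = matches(r"fred", "alfred")  # the match may start mid-string
```

Under these assumptions, only the second call fails, because the capital ‘F’ is a different character from ‘f’.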
9
Character disjunction (character classes)
Square brackets are used to indicate disjunction of characters.
Examples:
  /[Ff]/ matches either ‘f’ or ‘F’
  /[Ff]red/ matches either ‘fred’ or ‘Fred’
This form of disjunction applies only at the character level.
A set of characters in square brackets is sometimes referred to as a character class.
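The character-class examples can be checked with a short sketch. Python's re module is assumed here as a stand-in for Perl, and the test words are made up for illustration.

```python
import re

pattern = re.compile(r"[Ff]red")

# Illustrative inputs (not from the slides). The class [Ff] covers one
# character position only; the rest of the pattern is still case-sensitive.
words = ["fred", "Fred", "FRED", "dred"]
results = {w: bool(pattern.search(w)) for w in words}
```

Here ‘FRED’ fails because only the first position is covered by the class; ‘RED’ still has to match ‘red’ exactly.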
10
Ranges
Sometimes it is useful to specify “any digit” or “any letter”.
“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.
An alternative is to use a special range notation: /[0-9]/
Any letter can be specified as /[A-Za-z]/
Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
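The claim that range notation adds convenience but no power can be spot-checked. The sketch below, in Python's re module (an assumption; the slides use Perl), compares the explicit and ranged digit classes over every printable ASCII character. The sample string is illustrative.

```python
import re
import string

explicit = re.compile(r"[0123456789]")
ranged = re.compile(r"[0-9]")

# The two classes should accept exactly the same single characters.
same = all(
    bool(explicit.fullmatch(c)) == bool(ranged.fullmatch(c))
    for c in string.printable
)

# Pulling digits and letters out of an illustrative string.
digits = re.findall(r"[0-9]", "CSE 467/567")
letters = re.findall(r"[A-Za-z]", "CSE 467/567")
```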
11
Complementing character classes
To search for a character that is not in a character class, place a caret (^) as the first character inside the square brackets.
Examples:
  /[^a]/ matches any single character except ‘a’
  /[^0-9]/ matches any single character except a digit
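A small sketch of complemented classes, written with Python's re module as an assumed stand-in for Perl. The input strings are invented for illustration. The last line shows that the caret only negates when it comes first inside the brackets; anywhere else in a class it is an ordinary character.

```python
import re

# Every character that is NOT a digit.
non_digits = re.findall(r"[^0-9]", "a1b22c")

# Every character that is NOT 'a'.
not_a = re.findall(r"[^a]", "banana")

# Caret in a non-initial position: the class means 'a' or a literal '^'.
literal_caret = re.findall(r"[a^]", "a^b")
```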
12
Matching 0 or 1 occurrence
The ‘?’ matches zero or one occurrence of the preceding expression.
Examples:
  /a?/ matches ‘a’ or ‘’ (nothing)
  /cats?/ matches ‘cat’ or ‘cats’
Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
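The optional-character example can be sketched as follows, assuming Python's re module in place of Perl. fullmatch is used so that the whole word must be consumed, which makes the effect of ‘?’ easier to see. The test words are illustrative.

```python
import re

pattern = re.compile(r"cats?")

# fullmatch requires the pattern to cover the entire string,
# so only 'cat' followed by at most one 's' is accepted.
words = ["cat", "cats", "catss", "ca"]
full = {w: bool(pattern.fullmatch(w)) for w in words}
```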
13
The Kleene star and plus
The Kleene star (*) matches zero or more occurrences of the preceding expression.
Examples:
  /a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.
  /[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
The Kleene plus (+) matches one or more occurrences of the preceding expression.
+ is a convenience rather than a necessity: /[ab]+/ is equivalent to /[ab][ab]*/
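The equivalence between plus and "one copy followed by star" can be checked on a few samples. This is a sketch using Python's re module as an assumed stand-in for Perl; the sample list is illustrative and a handful of strings is of course not a proof.

```python
import re

star = re.compile(r"[ab]*")
plus = re.compile(r"[ab]+")
expanded = re.compile(r"[ab][ab]*")  # plus written out with star

samples = ["", "a", "b", "ab", "ba", "bb", "abc"]

# fullmatch: the whole sample must consist of a's and b's.
star_hits = [s for s in samples if star.fullmatch(s)]
plus_hits = [s for s in samples if plus.fullmatch(s)]
expanded_hits = [s for s in samples if expanded.fullmatch(s)]
```

The only difference between the star and plus results is the empty string, and the expanded form agrees with plus on every sample.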
14
Wildcard
The period (.) matches any single character except the newline (\n).
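A brief sketch of the wildcard, assuming Python's re module (whose default behaviour for the period is taken here to mirror the Perl behaviour described above). The pattern b.t and the inputs are invented for illustration.

```python
import re

pattern = re.compile(r"b.t")

# The period stands for exactly one character, and that character
# may be anything except a newline.
inputs = ["bat", "b t", "b\nt", "bt", "boot"]
hits = {s: bool(pattern.fullmatch(s)) for s in inputs}
```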
15
Anchors
Anchors are used to restrict a match to a particular position within a string.
  ^ anchors to the start of a string
  $ anchors to the end of a string
Examples:
  /[Ff]red/ matches both ‘Fred’ and ‘Fred is home’
  /^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’
  \b anchors to a word boundary
  \B anchors to a non-boundary
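The anchor examples can be sketched as below, with Python's re module assumed as a stand-in for Perl. The ‘alfred’ inputs are illustrative additions used to show the word-boundary anchors.

```python
import re

unanchored = re.compile(r"[Ff]red")
anchored = re.compile(r"^[Ff]red$")

# Without anchors the pattern may match a part of the string.
loose_sentence = bool(unanchored.search("Fred is home"))
# With ^ and $ the pattern must account for the whole string.
strict_sentence = bool(anchored.search("Fred is home"))
strict_word = bool(anchored.search("Fred"))

# \b requires a word boundary; \B requires the absence of one.
boundary_ok = bool(re.search(r"\bfred\b", "fred is home"))
boundary_inside = bool(re.search(r"\bfred\b", "alfred"))
nonboundary_inside = bool(re.search(r"\Bfred", "alfred"))
nonboundary_start = bool(re.search(r"\Bfred", "fred"))
```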
16
Conjunction
Two regular expressions are conjoined by juxtaposition (placing the expressions side by side).
Examples:
  /a/ matches ‘a’
  /m/ matches ‘m’
  /am/ matches ‘am’ but not ‘a’ or ‘m’ alone
17
Disjunction
We have already seen disjunction of characters using the square bracket notation.
General disjunction is expressed using the vertical bar (|), also called the pipe symbol.
With this form the alternatives can be whole patterns, not just single characters as in the [ ] form.
Example:
  /goose|geese/ matches ‘goose’ or ‘geese’
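A short sketch of general disjunction, using Python's re module as an assumed stand-in for Perl. The sentence is invented for illustration.

```python
import re

# Each side of the vertical bar is a complete alternative pattern.
pattern = re.compile(r"goose|geese")

found = pattern.findall("One goose, two geese, no moose.")
```

‘moose’ is not reported, since neither alternative occurs in it.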
18
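Grouping
Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.
Ex: /[Gg](ee|oo)se/ matches ‘goose’, ‘geese’, ‘Goose’ or ‘Geese’
The parentheses limit the vertical bar to the choice between ‘ee’ and ‘oo’; without them, the bar would split the whole pattern into two alternatives.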
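Grouping matters because the vertical bar binds more loosely than juxtaposition. The sketch below, in Python's re module (assumed to follow the same precedence rules as Perl for these constructs), contrasts a grouped pattern with an ungrouped one that reads as "[Gg]ee, or else oose". The word list is illustrative.

```python
import re

grouped = re.compile(r"[Gg](ee|oo)se")  # choice limited to 'ee' vs 'oo'
ungrouped = re.compile(r"[Gg]ee|oose")  # choice between '[Gg]ee' and 'oose'

words = ["goose", "geese", "Goose", "Geese", "gee", "moose"]

grouped_hits = [w for w in words if grouped.search(w)]
ungrouped_hits = [w for w in words if ungrouped.search(w)]
```

The ungrouped pattern also accepts ‘gee’ and ‘moose’, which is not what the e-mail search needs.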
19
Replacement
In addition to matching, we can do replacements when a match is found.
Example: to replace the British spelling ‘colour’ with the American spelling ‘color’, we can write:
  s/colour/color/
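A sketch of the same substitution in Python's re module, assumed here as a stand-in for Perl. Perl's s/colour/color/ replaces only the first match unless the g modifier is added, so count=1 is used to mirror it; omitting count corresponds to s/colour/color/g. The sample sentence is illustrative.

```python
import re

text = "The colour of the sky; the colour of the sea."

# Mirrors s/colour/color/ : only the first occurrence is replaced.
first_only = re.sub(r"colour", "color", text, count=1)

# Mirrors s/colour/color/g : every occurrence is replaced.
everywhere = re.sub(r"colour", "color", text)
```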
20
Registers – saving matches
To save the text matched by part of a pattern for reuse later on, Perl provides registers. Each parenthesized subpattern saves its match in a register.
Registers are named \#, where # is the number of the register.
Ex.
  DE DO DO DO DE DA DA DA
  IS ALL I WANT TO SAY TO YOU
/(D[AEO].)*/ will match the first line
/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically
This pattern also matches strings like DA DE DE DE DA DO DO DO
\s matches a whitespace character
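The register example can be traced with a short sketch in Python's re module, assumed here to treat \1, \2, \3 and \s the same way Perl does. The pattern is written with no literal spaces between its parts, since each space in a pattern would itself have to match a space in the input. The third input is an invented counterexample.

```python
import re

pattern = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

line = "DE DO DO DO DE DA DA DA"
m = pattern.fullmatch(line)
groups = m.groups() if m else None  # contents of registers 1, 2, 3

# Same shape with different syllables.
other = pattern.fullmatch("DA DE DE DE DA DO DO DO")

# Breaks the repetition: the third syllable of the first half differs.
mismatch = pattern.fullmatch("DE DO DO DA DE DA DA DA")
```

Registers 2 and 3 include the leading separator character, because the period sits inside the parentheses.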
21
For more information
  PERL Regular Expression TUTorial – http://perldoc.perl.org/perlretut.html
  PERL Regular Expression reference page – http://perldoc.perl.org/perlre.html