Published by Peter Martin. Modified over 9 years ago.
1
ICS 482 Natural Language Processing
Regular Expression and Finite Automata
Dr. Muhammed Al-Mulhem
March 1, 2009
2
Regular Expressions
– Regular expression (RE): a formula for specifying a set of strings.
– String: a sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).
3
Regular Expression Patterns
RE           String matched
woodchucks   “interesting links to woodchucks and lemurs”
a            “Sarah Ali stopped by Mona’s”
Ali says,    “My gift please,” Ali says,
book         “all our pretty books”
!            “Leave him behind!” said Sami
4
Specify Options and Range using [ ] and -
RE        Match
[wW]ood   Wood or wood
[abc]     “a”, “b”, or “c”
[A-Z]     an uppercase letter
[a-z]     a lowercase letter
[0-9]     a single digit
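The bracket patterns above can be tried directly. The following is a minimal sketch using Python's re module, whose syntax for these patterns follows the Perl style the slides use. The sample sentences and the helper name first_match are made up for illustration.

```python
import re

# Each pair is (pattern from the slide, sample text invented for illustration).
examples = [
    (r"[wW]ood", "Wood floors and wood stoves"),
    (r"[abc]", "in uomini, in soldati"),
    (r"[A-Z]", "we should call it 'Drenched Blossoms'"),
    (r"[a-z]", "my beans were impatient to be hoed!"),
    (r"[0-9]", "Chapter 1: Down the Rabbit Hole"),
]

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

for pattern, text in examples:
    print(pattern, "->", first_match(pattern, text))
```

A bracket expression always consumes exactly one character, so [wW]ood finds both spellings but [abc] returns only the single first letter it meets.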
5
RE Operators
RE       Description
a*       Zero or more a’s
a+       One or more a’s
a?       Zero or one a’s
[ab]*    Zero or more a’s or b’s. Matches aaa.., ababab.., bbbb..
[0-9]+   Sequence of one or more digits
.        Wildcard expression: matches any single character
\b       Matches a word boundary: /\bthe\b/ matches “the” but not “other”
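A short sketch of how these operators behave, again in Python's re module. The helper full and the sample strings are assumptions chosen for illustration; re.fullmatch is used so that each test checks the whole string rather than a substring.

```python
import re

def full(pattern, text):
    """True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"a*", ""))          # a* also accepts the empty string
print(full(r"a+", ""))          # a+ needs at least one a
print(full(r"a?", "aa"))        # a? allows at most one a
print(full(r"[ab]*", "ababab"))

# Search-style matching inside longer text.
numbers = re.findall(r"[0-9]+", "rooms 12 and 305")
wildcard = re.findall(r"beg.n", "begin, begun, beg n, bean")
articles = re.findall(r"\bthe\b", "the other one is the best")
print(numbers, wildcard, articles)
```

Note that the wildcard matches the space in "beg n" as readily as a letter, and that \b keeps the pattern from firing inside "other".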
6
Sidebar: Errors
Find all instances of the word “the” in a text.
– /the/ What about ‘The’?
– /[tT]he/ What about ‘Theater’, ‘Another’?
7
Sidebar: Errors
The process we just went through was based on fixing two kinds of errors:
– Matching strings that we should not have matched (there, then, other): false positives
– Not matching things that we should have matched (The): false negatives
8
Sidebar: Errors
Reducing the error rate for an application often involves two efforts:
– Increasing accuracy (minimizing false positives)
– Increasing coverage (minimizing false negatives)
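The two error types can be counted mechanically once the correct answers are known. The sketch below assumes a made-up sentence and a hand-labelled set of offsets (gold) where the article really begins; neither comes from the slides.

```python
import re

text = "The other day the theater showed the film."
gold = {0, 14, 33}  # hand-labelled start offsets of the article in `text`

def errors(pattern):
    """Return (false_positives, false_negatives) as sets of start offsets."""
    found = {m.start() for m in re.finditer(pattern, text)}
    return found - gold, gold - found

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    fp, fn = errors(pattern)
    print(f"{pattern!r}: {len(fp)} false positives, {len(fn)} false negatives")
```

On this sample, widening the pattern to [tT]he removes the false negative but leaves the matches inside "other" and "theater"; adding \b removes those as well.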
9
Regular expressions
Basic regular expression patterns, in Perl-based syntax (slightly different from other notations for regular expressions):
– Disjunctions [abc]
– Ranges [A-Z]
– Negations [^Ss]
– Optional and repeated characters ?, + and *
– Wild cards .
– Anchors \b and \B
– Disjunction, grouping, and precedence |
10
Preceding character or nothing using ?
11
Wildcard
12
Negation using ^
13
Writing correct expressions
Exercise: write a regular expression to match the English article “the”:
– /the/ misses ‘The’ and also matches the ‘the’ in ‘others’
– /[tT]he/ still matches the ‘the’ in ‘others’
– /\b[tT]he\b/ misses ‘the25’ and ‘the_’
– /[^a-zA-Z][tT]he[^a-zA-Z]/ misses ‘The’ at the beginning of a line
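One possible way to finish the exercise is to let the start or end of the line stand in for the missing non-letter. The pattern named anchored below is an assumed repair, not one given on the slide, and the sample strings are invented.

```python
import re

word_boundary = re.compile(r"\b[tT]he\b")
no_letters = re.compile(r"[^a-zA-Z][tT]he[^a-zA-Z]")
# Assumed repair: accept start of line on the left and end of line on the right.
anchored = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

def hits(sample):
    """Report which of the three patterns find something in sample."""
    return (bool(word_boundary.search(sample)),
            bool(no_letters.search(sample)),
            bool(anchored.search(sample)))

for sample in ["the25 results", "The cat sat", "see the_ note", "other"]:
    print(sample, hits(sample))
```

Because [^a-zA-Z] consumes a character, the anchored pattern returns the neighbouring characters along with the article; a real application would extract the word with a capturing group.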
14
A more complex example
Exercise: Write a Perl regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:
15
Example
Price
– /\$[0-9]+/ # whole dollars
– /\$[0-9]+\.[0-9][0-9]/ # dollars and cents
– /\$[0-9]+(\.[0-9][0-9])?/ # cents optional
– /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/ # delimited on both sides (\b cannot match between a space and $)
Specifications for processor speed
– /\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
Memory size
– /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
– /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
Vendors
– /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
– /\b(Mac|Macintosh|Apple)\b/
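The patterns can be exercised on a sample advertisement. This is a sketch under stated assumptions: the ad text is invented, the groups are written as non-capturing (?:...) so that findall returns whole matches, the dollar sign is escaped because a bare $ means end of string, and the leading \b on the price is omitted because there is no word boundary between a space and a dollar sign.

```python
import re

ad = "Apple Macintosh, 600 MHz, 64 Mb RAM, 40 Gb disk, only $999.99"  # invented ad

PATTERNS = {
    "price": re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?\b"),
    "speed": re.compile(r"\b[0-9]+ *(?:MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b"),
    "memory": re.compile(r"\b[0-9]+ *(?:Mb|[Mm]egabytes?)\b"),
    "disk": re.compile(r"\b[0-9]+(?:\.[0-9]+)? *(?:Gb|[Gg]igabytes?)\b"),
    "vendor": re.compile(r"\b(?:Mac|Macintosh|Apple)\b"),
}

found = {name: pattern.findall(ad) for name, pattern in PATTERNS.items()}
print(found)

# Two pitfalls: an unescaped $ and a \b placed before the dollar sign.
unescaped = re.search(r"$[0-9]+", "$999")
boundary_first = re.search(r"\b\$[0-9]+", "only $999")
print(unescaped, boundary_first)

# "Less than $1000" is a numeric condition, checked outside the regex.
dollars = float(found["price"][0].lstrip("$"))
print(dollars < 1000)
```

The regular expressions only locate the fields; comparing the extracted numbers against 500 MHz, 32 Gb, or $1000 is ordinary arithmetic on the matched text.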
16
Advanced Operators – Aliases for common ranges
Underscore: correct Figure 2.6
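The slide's table of aliases did not survive extraction. As a hedged illustration, the usual Perl-style aliases are \d, \D, \w, \W, \s and \S, and the sketch below shows a few of them in Python with a made-up sample string. The underscore belongs to \w, which is presumably the point of the slide's note.

```python
import re

text = "Order_7 costs 25 USD\tin 2009"  # made-up sample text

digits = re.findall(r"\d+", text)          # \d ~ [0-9]
words = re.findall(r"\w+", text)           # \w ~ [a-zA-Z0-9_], underscore included
spaces = re.findall(r"\s", text)           # \s ~ whitespace (space, tab, ...)
non_digits = re.findall(r"\D+", "a1b22c")  # \D ~ [^0-9]

# Python 3 caveat: on str patterns the aliases are Unicode-aware by default,
# so \w also accepts letters such as "é" unless re.ASCII is passed.
unicode_w = re.fullmatch(r"\w", "é") is not None
ascii_w = re.fullmatch(r"\w", "é", flags=re.ASCII) is not None

print(digits, words, spaces, non_digits, unicode_w, ascii_w)
```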
17
\ to Reference special characters
18
Operators for counting
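The slide's table of counting operators was an image and is not recoverable. Assuming it listed the usual brace forms {n}, {n,m} and {n,}, a small Python sketch of their behaviour follows; the helper full and the test strings are illustrative only.

```python
import re

def full(pattern, text):
    """True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"a{3}", "aaa"))       # exactly three a's
print(full(r"a{2,4}", "aaaaa"))   # between two and four a's: five is too many
print(full(r"a{2,}", "aaaaaa"))   # at least two a's

# The sheep-language pattern /baa+!/ used later in the deck can be written
# with a counter instead: b, then two or more a's, then !.
samples = ["ba!", "baa!", "baaaa!", "baa", "aba!b"]
agree = all(full(r"baa+!", s) == full(r"ba{2,}!", s) for s in samples)
print(agree)
```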
19
Finite State Automata
– An FSA recognizes the regular languages represented by regular expressions
– SheepTalk: /baa+!/
– Directed graph with labeled nodes and arc transitions
– Five states: q0 the start state, q4 the final state, 5 transitions
[Diagram: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus q3 -a-> q3]
20
Formally
An FSA is a 5-tuple consisting of
– Q: a set of states {q0, q1, q2, q3, q4}
– Σ: an alphabet of symbols {a, b, !}
– q0: a start state
– F: a set of final states in Q {q4}
– δ(q, i): a transition function mapping Q × Σ to Q
21
An FSA recognizes (accepts) strings of a regular language
– baa!
– baaa!
– baaaa!
– …
A rejected input: aba!b
22
State Transition Table
An FSA can be represented with a state transition table:
        Input
State   b   a   !
0       1   Ø   Ø
1       Ø   2   Ø
2       Ø   3   Ø
3       Ø   3   4
4       Ø   Ø   Ø
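The table translates almost line for line into a table-driven recognizer. The sketch below is one simple way to do it; the names TABLE, START, FINAL and recognize are chosen here for illustration, and a missing dictionary entry plays the role of Ø.

```python
# Transition table from the slide: outer keys are states, inner keys are
# input symbols. A missing entry means there is no legal move (the empty set).
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START = 0
FINAL = {4}

def recognize(tape):
    """Return True if the automaton accepts the whole input string."""
    state = START
    for symbol in tape:
        next_state = TABLE[state].get(symbol)
        if next_state is None:      # no transition: reject immediately
            return False
        state = next_state
    return state in FINAL           # accept only if we stop in a final state

for tape in ["baa!", "baaaa!", "ba!", "aba!b"]:
    print(tape, "accept" if recognize(tape) else "reject")
```

The input aba!b is rejected on its first symbol, since state 0 has no move on a.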
23
Non-Deterministic FSAs for SheepTalk
[Two state diagrams over states q0 to q4. Arc labels of the first: b, a, a, a, !. Arc labels of the second: b, a, a, !]
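The diagrams themselves are lost, so the machine below is an assumption: a common non-deterministic version of the sheep language puts the a self-loop on q2, so that reading a in q2 may either stay in q2 or move on to q3. The sketch simulates it by tracking the set of all states the machine could be in, which is one of several ways to recognize with a non-deterministic FSA (backtracking search is another).

```python
# Assumed non-deterministic machine: (state, symbol) -> set of possible next states.
NFA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the non-deterministic choice
    (3, "!"): {4},
}
START = 0
FINAL = {4}

def nd_recognize(tape):
    """Accept if at least one path through the machine ends in a final state."""
    current = {START}
    for symbol in tape:
        current = {nxt
                   for state in current
                   for nxt in NFA.get((state, symbol), set())}
        if not current:             # every path has died
            return False
    return bool(current & FINAL)

for tape in ["baa!", "baaaa!", "ba!", "baa"]:
    print(tape, "accept" if nd_recognize(tape) else "reject")
```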
24
Languages
– A language is a set of strings
– String: a sequence of letters
25
Tracing FSA - Initial Configuration
Input String
26
Reading the Input
30
Output: “accept”
31
Rejection
35
Output: “reject”
36
Another Example
40
Output: “accept”
41
Rejection
45
Output: “reject”
46
Formalities
Deterministic Finite Accepter (DFA): M = (Q, Σ, δ, q0, F)
– Q: set of states
– Σ: input alphabet
– δ: transition function
– q0: initial state
– F: set of final states
47
About Alphabets
– An alphabet means we need a finite set of symbols in the input.
– These symbols can and will stand for bigger objects that can have internal structure.
48
Input Alphabet
49
Set of States
50
Initial State
51
Set of Final States
52
Transition Function
56
Transition Function
57
Extended Transition Function (Reads the entire string)
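The formulas on this slide did not survive extraction. A standard definition of the extended transition function, given here as an assumption about what the slide showed, extends δ from single symbols to whole strings by recursion on the input. The symbol λ for the empty string is also an assumption.

```latex
\delta^{*} : Q \times \Sigma^{*} \to Q

\delta^{*}(q, \lambda) = q

\delta^{*}(q, wa) = \delta\bigl(\delta^{*}(q, w),\, a\bigr),
\qquad w \in \Sigma^{*},\ a \in \Sigma
```

For the sheep-language automaton this gives, for example, δ*(q0, baa) = δ(δ*(q0, ba), a) = δ(q2, a) = q3.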
61
Observation: there is a walk from state q to state q′ with label w
62
Example
[Diagram with an accept state]
63
Another Example
[Diagram with an accept state]
64
More Examples
[Diagram with an accept state and a trap state]
65
L(M) = { all strings with prefix … }
[Diagram with an accept state]
66
L(M) = { all strings without substring … }
67
Regular Languages
– A language L is regular if there is a DFA M such that L = L(M)
– All regular languages form a language family
68
Example
The language is regular:
69
Finite State Automata
Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
70
More Formally
You can specify an FSA by enumerating the following things.
– The set of states: Q
– A finite alphabet: Σ
– A start state
– A set of accept/final states
– A transition function that maps Q × Σ to Q
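The five items can be written down as a small data structure. The sketch below is one possible encoding, not a standard library API: the class name DFA, its field names, and the extra state called trap are all assumptions. The trap state is added so that the transition function is defined for every pair in Q × Σ, as the definition requires.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DFA:
    states: frozenset     # Q
    alphabet: frozenset   # Σ
    start: str            # start state
    finals: frozenset     # accept/final states, a subset of Q
    delta: dict           # transition function: (state, symbol) -> state

    def __post_init__(self):
        if self.start not in self.states or not self.finals <= self.states:
            raise ValueError("start and final states must belong to Q")
        # delta must map every pair in Q x Σ to a state in Q
        for q in self.states:
            for a in self.alphabet:
                if self.delta.get((q, a)) not in self.states:
                    raise ValueError(f"delta is undefined for {(q, a)}")

    def accepts(self, word):
        """Run the automaton over word and report whether it ends in F."""
        state = self.start
        for symbol in word:
            if symbol not in self.alphabet:
                return False
            state = self.delta[(state, symbol)]
        return state in self.finals

# The sheep-language automaton, completed with a trap state.
Q = frozenset({"q0", "q1", "q2", "q3", "q4", "trap"})
SIGMA = frozenset({"a", "b", "!"})
moves = {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
         ("q3", "a"): "q3", ("q3", "!"): "q4"}
delta = {(q, a): moves.get((q, a), "trap") for q in Q for a in SIGMA}
sheep = DFA(Q, SIGMA, "q0", frozenset({"q4"}), delta)

print(sheep.accepts("baaa!"), sheep.accepts("aba!b"))
```

Passing the partial dictionary moves instead of delta raises ValueError, which is the check that the function really maps all of Q × Σ into Q.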