1 ICS 482 Natural Language Processing: Regular Expression and Finite Automata. Dr. Muhammed Al-Mulhem, March 1, 2009

2 Regular Expressions
Regular expression (RE): A formula for specifying a set of strings.
String: A sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).

3 Regular Expression Patterns
RE            String matched
/woodchucks/  “interesting links to woodchucks and lemurs”
/a/           “Sarah Ali stopped by Mona’s”
/Ali says,/   “My gift please,” Ali says,”
/book/        “all our pretty books”
/!/           “Leave him behind!” said Sami

4 Specify Options and Range using [ ] and -
RE        Match
[wW]ood   Wood or wood
[abc]     “a”, “b”, or “c”
[A-Z]     an uppercase letter
[a-z]     a lowercase letter
[0-9]     a single digit
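The bracket patterns in this table can be tried directly. The following is a minimal sketch using Python's `re` module rather than Perl; the helper name `first_match` and the sample strings are illustrative assumptions, not part of the slides.

```python
import re
from typing import Optional

def first_match(pattern: str, text: str) -> Optional[str]:
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Sample strings are made up for illustration.
print(first_match(r"[wW]ood", "Woodchucks eat wood"))  # Wood
print(first_match(r"[abc]", "xyzzy b"))                # b
print(first_match(r"[A-Z]", "we saw Ali"))             # A
print(first_match(r"[a-z]", "ICS 482 course"))         # c
print(first_match(r"[0-9]", "ICS 482"))                # 4
```

Each bracket expression matches exactly one character, so the result is always a single character unless the pattern contains more material outside the brackets, as in `[wW]ood`.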

5 RE Operators
RE       Description
a*       Zero or more a’s
a+       One or more a’s
a?       Zero or one a
[ab]*    Zero or more a’s or b’s. Matches aaa.., ababab.., bbbb..
[0-9]+   Sequence of one or more digits
.        Wildcard expression: matches any single character
\b       Matches a word boundary: /\bthe\b/ matches “the” but not “other”
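A short sketch of these operators in Python's `re` module (whose syntax for them agrees with Perl's) may help. The helper `matches_whole` and the test strings are assumptions made for illustration.

```python
import re

def matches_whole(pattern: str, s: str) -> bool:
    """True if the entire string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches_whole(r"a*", ""))         # True: zero a's is allowed
print(matches_whole(r"a+", ""))         # False: at least one a is required
print(matches_whole(r"a?", "aa"))       # False: at most one a
print(matches_whole(r"[ab]*", "abab"))  # True

# Counters and the wildcard inside a longer text.
print(re.findall(r"[0-9]+", "room 482, March 1, 2009"))  # ['482', '1', '2009']
print(re.findall(r"b.g", "beg big b g bag"))             # ['beg', 'big', 'b g', 'bag']

# \b keeps "the" from matching inside "other" or "theology".
print(re.findall(r"\bthe\b", "the other theology, the"))  # ['the', 'the']
```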

6 Sidebar: Errors
Find all instances of the word “the” in a text.
–/the/ What about ‘The’?
–/[tT]he/ What about ‘Theater’, ‘Another’?

7 Sidebar: Errors
The process we just went through was based on fixing two kinds of errors:
–Matching strings that we should not have matched (there, then, other): false positives
–Not matching things that we should have matched (The): false negatives

8 Sidebar: Errors
Reducing the error rate for an application often involves two efforts:
–Increasing accuracy (minimizing false positives)
–Increasing coverage (minimizing false negatives)

9 Regular expressions
Basic regular expression patterns, in Perl-based syntax (slightly different from other notations for regular expressions):
–Disjunctions [abc]
–Ranges [A-Z]
–Negations [^Ss]
–Optional and repeated characters ?, + and *
–Wild cards .
–Anchors \b and \B
–Disjunction, grouping, and precedence | and ( )

10 Preceding character or nothing using ?

11 Wildcard

12 Negation using ^

13 Writing correct expressions
Exercise: write a regular expression to match the English article “the”:
–/the/ Missed ‘The’; included ‘the’ in ‘others’
–/[tT]he/ Still included ‘the’ in ‘others’
–/\b[tT]he\b/ Missed ‘the25’, ‘the_’
–/[^a-zA-Z][tT]he[^a-zA-Z]/ Missed ‘The’ at the beginning of a line
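The successive attempts can be compared on one test sentence. This is a sketch in Python rather than Perl; the sentence, the helper `count_matches`, and the last pattern (one common way to also accept a line start or line end) are assumptions added for illustration and do not appear on the slide. The sentence contains four intended occurrences of the article: "The", "the_", "the25" and "the".

```python
import re

# Made-up test sentence with four intended occurrences of the article.
TEXT = "The other day the_ cat saw the25 dogs near the theater."

PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",  # assumed final fix, not on the slide
]

def count_matches(pattern: str, text: str) -> int:
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

for p in PATTERNS:
    print(f"{p:40} {count_matches(p, TEXT)}")
```

On this sentence the counts are 5, 6, 2, 3 and 4. The first two patterns over-match because of "other" and "theater" (false positives), the third and fourth miss intended occurrences (false negatives), and only the last pattern finds exactly the four intended ones.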

14 A more complex example
Exercise: Write a Perl regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:

15 Example
Price (in Perl, a literal dollar sign must be escaped as \$, since a bare $ is the end-of-line anchor)
–/\$[0-9]+/ # whole dollars
–/\$[0-9]+\.[0-9][0-9]/ # dollars and cents
–/\$[0-9]+(\.[0-9][0-9])?/ # cents optional
–/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/ # word boundaries; a leading \b would require a word character right before the $
Specifications for processor speed
–/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
Memory size
–/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
–/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
Vendors
–/\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
–/\b(Mac|Macintosh|Apple)\b/
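As a rough sketch, the price, speed and disk patterns can be applied to an advertisement string in Python. The dollar sign is escaped here because an unescaped `$` is the end-of-string anchor; the sample text `AD`, the constant names and the helper `find` are hypothetical.

```python
import re
from typing import Optional

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

# Made-up advertisement text for illustration.
AD = "Pentium PC, 733 MHz, 40 Gb disk, only $899.99"

def find(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the first full match of a compiled pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(find(PRICE, AD))  # $899.99
print(find(SPEED, AD))  # 733 MHz
print(find(DISK, AD))   # 40 Gb

# An unescaped $ anchors at the end of the string, so nothing can follow it.
print(re.search(r"$[0-9]+", "$899"))  # None
```

The patterns only locate the figures; deciding whether a machine is "more than 500 MHz" or "less than $1000" would still require converting the matched digits to numbers and comparing them.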

16 Advanced Operators – Aliases for common ranges Underscore: Correct figure 2.6

17 \ to Reference special characters

18 Operators for counting

19 Finite State Automata
FSA recognizes the regular languages represented by regular expressions
–SheepTalk: /baa+!/
Directed graph with labeled nodes and arc transitions
Five states: q0 the start state, q4 the final state, 5 transitions
[Figure: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an a-loop on q3]

20 Formally
FSA is a 5-tuple consisting of
–Q: set of states {q0,q1,q2,q3,q4}
–Σ: an alphabet of symbols {a,b,!}
–q0: a start state
–F: a set of final states in Q {q4}
–δ(q,i): a transition function mapping Q x Σ to Q

21 FSA recognizes (accepts) strings of a regular language
–baa!
–baaa!
–baaaa!
–…
A rejected input: aba!b

22 State Transition Table
FSA can be represented with State Transition Table (Ø marks a missing transition)
State   Input
        b   a   !
0       1   Ø   Ø
1       Ø   2   Ø
2       Ø   3   Ø
3       Ø   3   4
4       Ø   Ø   Ø
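The transition table translates almost line by line into code. The following is a minimal Python sketch of a table-driven recognizer for SheepTalk; the names `TRANSITIONS`, `START`, `FINAL` and `recognize` are hypothetical, and a missing dictionary entry plays the role of Ø.

```python
# state -> {input symbol: next state}; an absent symbol means no transition (Ø)
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START = 0
FINAL = {4}

def recognize(tape: str) -> bool:
    """Return True if the automaton accepts the whole input string."""
    state = START
    for symbol in tape:
        next_state = TRANSITIONS[state].get(symbol)
        if next_state is None:  # no arc for this symbol: reject
            return False
        state = next_state
    return state in FINAL       # accept only if we end in a final state

for s in ["baa!", "baaaa!", "ba!", "aba!b"]:
    print(s, recognize(s))
```

Here `baa!` and `baaaa!` are accepted, while `ba!` and the rejected input `aba!b` fail at the first symbol that has no entry in the table.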

23 Non-Deterministic FSAs for SheepTalk
[Two figures: non-deterministic automata over states q0 to q4; the second includes an ε-transition]
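A non-deterministic automaton can be simulated by tracking the set of states it could be in after each symbol. The sketch below assumes one particular NFA for SheepTalk, with the a-loop placed on q2 so that reading a in q2 may either stay in q2 or move on to q3; this placement is an assumption, since the figures are not preserved in the transcript, and the names `NFA` and `nd_recognize` are hypothetical. The ε-transition variant is not covered.

```python
# (state, symbol) -> set of possible next states.
# Assumed NFA: in state 2, reading "a" may stay in 2 or advance to 3.
NFA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
NFA_START = 0
NFA_FINAL = {4}

def nd_recognize(tape: str) -> bool:
    """Accept if at least one path through the NFA ends in a final state."""
    current = {NFA_START}
    for symbol in tape:
        # Follow every possible arc from every state we might be in.
        current = set().union(*(NFA.get((s, symbol), set()) for s in current))
        if not current:  # every path is dead
            return False
    return bool(current & NFA_FINAL)

for s in ["baa!", "baaa!", "ba!", "aba!b"]:
    print(s, nd_recognize(s))
```

This state-set approach avoids backtracking; under the stated assumption it accepts the same strings as the deterministic table for /baa+!/.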

24 Languages
A language is a set of strings
String: A sequence of letters

25 Tracing FSA - Initial Configuration Input String

26 Reading the Input

30 Output: “accept”

31 Rejection

35 Output: “reject”

36 Another Example

40 Output: “accept”

41 Rejection

45 Output: “reject”

46 Formalities
Deterministic Finite Accepter (DFA): M = (Q, Σ, δ, q0, F)
–Q: set of states
–Σ: input alphabet
–δ: transition function
–q0: initial state
–F: set of final states

47 About Alphabets
Having an alphabet means we need a finite set of symbols in the input. These symbols can and will stand for bigger objects that can have internal structure.

48 Input Alphabet

49 Set of States

50 Initial State

51 Set of Final States

52 Transition Function

56 Transition Function

57 Extended Transition Function (Reads the entire string)

61 Observation: There is a walk from q to q′ with label w, where δ*(q, w) = q′
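The extended transition function can be written recursively: δ*(q, empty string) = q, and δ*(q, wa) = δ(δ*(q, w), a). The following Python sketch applies that definition to the SheepTalk automaton; the names `DELTA` and `delta_star` are hypothetical, and returning `None` for an undefined transition is an assumption made here because the table has Ø entries (a total DFA would use a trap state instead).

```python
from typing import Optional

# One-symbol transition function of the SheepTalk automaton.
DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def delta_star(q: int, w: str) -> Optional[int]:
    """State reached from q after reading all of w, or None if undefined."""
    if w == "":
        return q                       # base case: the empty string moves nowhere
    prev = delta_star(q, w[:-1])       # read all but the last symbol first
    if prev is None:
        return None
    return DELTA.get((prev, w[-1]))    # then take one step on the last symbol

print(delta_star(0, "baa"))   # 3
print(delta_star(0, "baa!"))  # 4
print(delta_star(0, "ab"))    # None
```

A string w is accepted exactly when `delta_star` applied to the start state and w returns a final state, which corresponds to a walk labeled w from the start state to that final state.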

62 Example accept

63 Another Example accept

64 More Examples accept trap state

65 = { all substrings with prefix } accept

66 = { all strings without substring }

67 Regular Languages
A language L is regular if there is a DFA M such that L = L(M). All regular languages form a language family.

68 Example The language is regular:

69 Finite State Automata
Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.

70 More Formally
You can specify an FSA by enumerating the following things.
–The set of states: Q
–A finite alphabet: Σ
–A start state
–A set of accept/final states
–A transition function that maps QxΣ to Q

