Korea Maritime and Ocean University NLP Jung Tae LEE

1 Korea Maritime and Ocean University NLP Jung Tae LEE inverse90@nate.com

3 1. Regular Expression
What is a regular expression (RE)?
- A formula in a special language that specifies simple classes of strings (a string is a sequence of symbols).
- An algebraic notation for characterizing a set of strings.
- A language for specifying text search strings.
REs are therefore an important theoretical tool throughout computer science and linguistics.

4 Basic Regular Expression Patterns
- An RE search requires a pattern that we want to search for and a corpus of texts to search through.
- The simplest kind of regular expression is a sequence of simple characters. For example, to search for green, we type /green/ (recall that we are assuming a search application that returns entire lines).
- Regular expressions are case sensitive: lower case is distinct from upper case.

RE                Example patterns matched
/interested/      "We are interested in NLP"
/DOROTHY/         "SURRENDER DOROTHY"
/!/               "I'm in danger now!"
/Claire says,/    '"Dagmar, my gift plz," Claire says,'
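
The line-returning search assumed on this slide can be sketched in Python as follows. This is a minimal sketch: the three-line `corpus` and the helper name `search_lines` are assumptions made for illustration, and Python's `re` module stands in for the slide's /.../ notation.

```python
import re

# Hypothetical corpus: one string per line of text.
corpus = [
    "We are interested in NLP",
    "SURRENDER DOROTHY",
    "I'm in danger now!",
]

def search_lines(pattern, lines):
    """Return every line that contains a match for the pattern."""
    return [line for line in lines if re.search(pattern, line)]

interested = search_lines(r"interested", corpus)
upper = search_lines(r"DOROTHY", corpus)
lower = search_lines(r"dorothy", corpus)  # case sensitive, so nothing matches
```

Here `lower` comes back empty because the pattern is matched case-sensitively by default.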

5 Basic Regular Expression Patterns
- The case-sensitivity problem is solved with the square brackets [ and ]: the characters inside the brackets specify a disjunction of characters to match.

RE                Match                   Example patterns matched
/[bB]lue/         Blue or blue            "deep blue sea"
/[abc]/           'a', 'b', or 'c'        "algebra"
/[1234567890]/    Any digit               "plenty of 7 to 5"
Use of the brackets [ ] to specify a disjunction of characters.

- The brackets can be used with the dash (-) to specify any one character in a range.

RE                Match                   Example patterns matched
/[A-Z]/           An upper case letter    "we are INFINITY"
/[a-z]/           A lower case letter     "not enough to love"
/[0-9]/           A single digit          "chapter 2 : RE"
Use of the brackets [ ] plus the dash - to specify a range.
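
The bracket patterns from the two tables can be tried out with Python's `re` module. This is only a sketch; the list name `examples` is an assumption, and `re.search` returns the first match in each text.

```python
import re

# (pattern, text) pairs taken from the two tables on this slide.
examples = [
    (r"[bB]lue", "deep blue sea"),
    (r"[abc]", "algebra"),
    (r"[1234567890]", "plenty of 7 to 5"),
    (r"[A-Z]", "we are INFINITY"),
    (r"[a-z]", "not enough to love"),
    (r"[0-9]", "chapter 2 : RE"),
]

# First substring matched by each pattern in its text.
first_matches = [re.search(pattern, text).group() for pattern, text in examples]
```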

6 Basic Regular Expression Patterns
- Brackets can also be used to specify what a single character cannot be, by use of the caret ^. The caret negates only when it is the first symbol after the open bracket; anywhere else it simply stands for a caret.

RE          Match                       Example patterns matched
[^A-Z]      Not an upper case letter    "Lee jung tae"
a^b         The pattern 'a^b'           "look up a^b now"
[e^]        Either 'e' or '^'           "Kleene star"
Use of the caret ^ for negation or just to mean ^.

- The question mark ? means "the preceding character or nothing".

RE          Match               Example patterns matched
means?      mean or means       "mean"
colou?r     color or colour     "colour"
The question mark ? marks optionality of the previous expression.
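
A short Python sketch of the caret and question-mark patterns follows. One difference from the slide's notation is worth flagging: in Python's `re` syntax a caret outside brackets is a start-of-line anchor, so the literal pattern a^b has to be written with an escaped caret. The sample sentences other than the slide's own are made up for illustration.

```python
import re

# Negated class: the first character that is not an upper case letter.
not_upper = re.search(r"[^A-Z]", "Lee jung tae").group()

# A literal caret outside brackets must be escaped in Python.
literal_caret = re.search(r"a\^b", "look up a^b now").group()

# Inside brackets, a caret that is not first is just a caret.
e_or_caret = re.search(r"[e^]", "Kleene star").group()

# The question mark makes the preceding character optional.
colours = re.findall(r"colou?r", "color and colour")
means = re.findall(r"means?", "mean means")
```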

7 Basic Regular Expression Patterns
- Sometimes we need regular expressions that allow repetition, for example: ba! baa! baaa! baaaa! ba…..a!
- Such patterns are based on the asterisk *, commonly called the Kleene star. The Kleene star means "zero or more occurrences of the immediately previous character or regular expression".
- Sometimes there is a shorter way to specify "at least one" of some character. This is the Kleene +, which means "one or more of the previous character".

RE               Match                                          Example patterns matched
/[0-9]*/         A string of digits, or nothing                 "123.45$"
/[0-9]+/         Same as /[0-9][0-9]*/                          ".123$"
/beg.n/          Any character between beg and n                begin, beg'n, begun
/^The dog\.$/    ^ matches start of line, $ matches end of line "The dog."
Use of the Kleene operators, the period, and the anchors.
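
The repetition operators, the period, and the anchors can be checked with a few lines of Python. This is a sketch using the standard `re` module; the variable names are illustrative only.

```python
import re

# Kleene star: /baa*!/ needs a 'b', one 'a', then zero or more further 'a's.
star = [bool(re.fullmatch(r"baa*!", s)) for s in ["b!", "ba!", "baa!", "baaaa!"]]

# Kleene plus: /baa+!/ needs at least two 'a's in total.
plus = [bool(re.fullmatch(r"baa+!", s)) for s in ["ba!", "baa!", "baaaa!"]]

# /[0-9]*/ may match nothing at all; /[0-9]+/ needs at least one digit.
maybe_empty = re.search(r"[0-9]*", ".123$").group()
at_least_one = re.search(r"[0-9]+", ".123$").group()

# The period matches any single character.
wildcard = re.findall(r"beg.n", "begin beg'n begun")

# Anchors tie the pattern to the start and end of the line.
anchored = [bool(re.search(r"^The dog\.$", s)) for s in ["The dog.", "The dog. Woof"]]
```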

8 Disjunction, Grouping, and Precedence
- So far we cannot express a choice between whole strings such as cat or dog. For this we need a new operator, the disjunction operator, also called the pipe symbol |. The pattern /cat|dog/ matches either cat or dog.
- To make the disjunction operator apply only to a specific part of a pattern, we use the parenthesis operators ( and ). For example, /guppy|ies/ matches only the string guppy or the string ies, because sequences take precedence over disjunction. We want guppy or guppies, and the pattern /gupp(y|ies)/ specifies that.

Operator precedence hierarchy (highest to lowest):
Operator                 Regular expression
Parenthesis              ( )
Counters                 * + ? { }
Sequences and anchors    the ^my end$
Disjunction              |

Note: REs always match the largest string they can. Patterns are greedy!
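
The precedence point about guppy and guppies, and the greedy behaviour noted above, can be illustrated in Python. This is a sketch; the sentence used for the greedy example is an assumption chosen for illustration.

```python
import re

# Without grouping, the pipe splits the whole pattern: "guppy" or "ies".
ungrouped = [bool(re.fullmatch(r"guppy|ies", s)) for s in ["guppy", "ies", "guppies"]]

# With parentheses, the disjunction applies only to the suffix.
grouped = [bool(re.fullmatch(r"gupp(y|ies)", s)) for s in ["guppy", "ies", "guppies"]]

# Greedy matching: [a-z]* could match nothing, but takes the longest run it can.
greedy = re.search(r"[a-z]*", "once upon a time").group()
```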

9 Advanced Operators
- There are more useful operators.

RE    Expansion        Match                          Examples
\d    [0-9]            Any digit                      Party of 5
\D    [^0-9]           Any non-digit                  Blue moon
\w    [a-zA-Z0-9_]     Any alphanumeric/underscore    Daiyu
\W    [^\w]            A non-alphanumeric             !!!!!
\s    [ \r\t\n\f]      Whitespace (space, tab)
\S    [^\s]            Non-whitespace                 In Concord
Aliases for common sets of characters.

RE       Match
{n}      n occurrences of the previous char or expression
{n,m}    From n to m occurrences of the previous char or expression
Regular expression operators for counting.
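
The aliases and counters behave as follows in Python's `re` module. This is a sketch; note that for Python `str` patterns `\w`, `\d`, and `\s` are Unicode-aware by default, so they are somewhat broader than the ASCII expansions in the table unless the `re.ASCII` flag is passed.

```python
import re

digits = re.findall(r"\d", "Party of 5")
non_digit = re.search(r"\D", "Blue moon").group()
word = re.findall(r"\w+", "Daiyu")
non_word = re.findall(r"\W", "!!!!!")
spaces = re.findall(r"\s", "In Concord")

# Counters: {n} means exactly n, {n,m} means from n to m occurrences.
exactly_three = [bool(re.fullmatch(r"a{3}", s)) for s in ["aa", "aaa", "aaaa"]]
two_to_four = [bool(re.fullmatch(r"\d{2,4}", s)) for s in ["1", "12", "1234", "12345"]]
```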

10 Regular Expression Substitution, Memory
- Ex) The Perl substitution operator s/regexp1/pattern/ allows a string characterized by a regular expression to be replaced by another string.
- To refer back to part of the matched string, we put parentheses ( and ) around that part of the pattern and use a number operator such as \1 in the replacement.
- The numbered memories used this way are called registers.

Example            RE                                           Replaced string
35 boxes           s/([0-9]+)/<\1>/                             <35> boxes
The Xer is Ying    s/The (.*)er is (.*)ing/The \1er will \2/    The Xer will Y
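
The same idea can be sketched with Python's `re.sub`, which also supports the `\1` and `\2` register references. This is not Perl; it is an equivalent under the assumption that Python is used. The angle brackets in the first replacement and the sentence "The singer is singing" are illustrative choices.

```python
import re

# Wrap every run of digits in angle brackets, reusing register \1.
boxed = re.sub(r"([0-9]+)", r"<\1>", "35 boxes")

# Two registers: \1 holds the text before "er", \2 the text before "ing".
future = re.sub(r"The (.*)er is (.*)ing", r"The \1er will \2", "The singer is singing")
```

In the second call both registers capture "sing", so the rewritten sentence reuses the same captured text twice.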

11 Reference
- http://www.codejs.co.kr/%EC%A0%95%EA%B7%9C%EC%8B%9D-regular-expression/ : a page with information about meta-characters, written in Korean.
- http://gskinner.com/RegExr/ : a useful site for working with regular expressions.

12 2. Finite-State Automata
What is an FSA?
- Like regular expressions, finite-state automata can be used to describe regular languages.
- They are a good theoretical foundation for a great deal of computational work.
[Figure: three equivalent ways of describing regular languages: regular expressions, finite automata, and regular grammars (except REs that use the memory feature).]

13 Use of an FSA to Recognize a Regular Language
- An automaton can model a regular expression.
- It recognizes a set of strings.
Here is how the automaton for /baa+!/ looks:
[Figure: FSA with states 0 to 4; arcs 0 -b-> 1 -a-> 2 -a-> 3 -!-> 4, plus a self-loop on a at state 3.]
- State 0 is (generally) the start state.
- A final state, or accepting state, is represented by a double circle, like state 4.

14 Use of an FSA to Recognize a Regular Language
- The FSA can be used for recognizing (we also say accepting) strings in the following way.

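15 Use of an FSA to Recognize a Regular Language
- An automaton can be represented with a state-transition table. Here ∅ marks a missing (illegal) transition and the colon marks the final state.

          Input
State     b    a    !
0         1    ∅    ∅
1         ∅    2    ∅
2         ∅    3    ∅
3         ∅    3    4
4:        ∅    ∅    ∅

- Formally, a finite automaton is defined by the following five parameters:
  - Q: a finite set of states
  - ∑: a finite input alphabet of symbols
  - q0: the start state
  - F: the set of final states, a subset of Q
  - δ(q, i): the transition function, which returns the new state for a state q and an input symbol i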
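
Recognition with a state-transition table can be sketched in a few lines of Python. This is a minimal sketch of table-driven recognition for /baa+!/; the dictionary layout, the names `TABLE`, `START`, `FINAL`, and the function name `d_recognize` are assumptions for illustration, and a missing dictionary entry plays the role of an empty table cell.

```python
# Transition table for /baa+!/: TABLE[state][symbol] gives the next state.
# A symbol that is absent from a row means there is no legal transition.
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START = 0
FINAL = {4}

def d_recognize(tape, table=TABLE, start=START, final=FINAL):
    """Return True if the deterministic automaton accepts the whole tape."""
    state = start
    for symbol in tape:
        state = table[state].get(symbol)
        if state is None:      # fell off the table: reject
            return False
    return state in final      # accept only if we stop in a final state

results = {s: d_recognize(s) for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]}
```

Because the machine is deterministic, the recognizer never has to back up: each input symbol leads to at most one next state.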

16 Formal Languages
- A model that can both generate and recognize all and only the strings of a formal language acts as a definition of that formal language.
- A formal language is a set of strings, each string composed of symbols from a finite symbol set called an alphabet.
- The previous language has the alphabet ∑ = {a, b, !}.
- Given a model m (such as a particular FSA), we can use L(m) to mean "the formal language characterized by m".
[Figure: the FSA for /baa+!/.]
L(m) = { baa!, baaa!, baaaa!, baaaaa!, baaaaaa!, … }
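
Since L(m) is infinite, a program can only enumerate it lazily. The following Python generator is a small sketch of the "generate" side of the definition for the /baa+!/ language; the function name `sheep_language` is an assumption for illustration.

```python
from itertools import count, islice

def sheep_language():
    """Yield the strings of L(m) for /baa+!/, shortest first."""
    for n in count(2):              # the language needs at least two a's
        yield "b" + "a" * n + "!"

first_four = list(islice(sheep_language(), 4))
```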

17 Non-Deterministic FSAs
- Compare the previous automaton with the next figure: the self-loop is on state 2 instead of state 3.
[Figure: FSA with arcs b, a, a, ! and a self-loop on a at state 2.]
- When we get to state 2 and see an a, we do not know whether to remain in state 2 or go on to state 3. Automata with decision points like this are called non-deterministic FSAs (NFSAs or NFAs).
- Another kind of NFA has arcs with no symbols on them, called ε-transitions (or λ-transitions).
[Figure: FSA with arcs b, a, a, ! and an arc labeled ε or λ.]

18 Use of an NFSA to Accept Strings
- Since there is more than one choice at some point, we might follow the wrong arc and reject a string when we should have accepted it.
- There are three standard solutions to this problem:
  - Backup: whenever we come to a choice point, we put a marker to record where we were in the input and what state the automaton was in. Then, if it turns out that we took the wrong choice, we can back up and try another path.
  - Look-ahead: we look ahead in the input to help us decide which path to take.
  - Parallelism: whenever we come to a choice point, we look at every alternative path in parallel.

19 Recognition as Search
- If the search yields a path ending in an accept state, ND-RECOGNIZE accepts the string.
- Otherwise, it rejects the string.
- Algorithms like this one, which operate by searching for solutions, are known as state-space search algorithms.
[Figure: depth-first trace of ND-RECOGNIZE on the input baaa!, steps 1 to 8.]
* Depth-first search is implemented with a stack.

20 Recognition as Search
[Figure: breadth-first trace of ND-RECOGNIZE on the input baaa!, steps 1 to 6.]
* Breadth-first search is implemented with a queue.
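
The agenda-based search behind ND-RECOGNIZE can be sketched in Python. This is a simplified sketch rather than the original pseudocode: it assumes the non-deterministic /baa+!/ machine with the self-loop on state 2, has no ε-transitions, and uses hypothetical names (`NFSA`, `FINAL`, `nd_recognize`). The only difference between the two strategies is which end of the agenda the next search state is taken from.

```python
from collections import deque

# Non-deterministic machine for /baa+!/: state 2 has two arcs labeled 'a'.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the choice point
    (3, "!"): {4},
}
FINAL = {4}

def nd_recognize(tape, depth_first=True, start=0):
    """Search the space of (machine state, tape index) pairs."""
    agenda = deque([(start, 0)])
    while agenda:
        # Stack (LIFO) gives depth-first search, queue (FIFO) breadth-first.
        state, index = agenda.pop() if depth_first else agenda.popleft()
        if index == len(tape):
            if state in FINAL:
                return True
            continue                      # dead end: try another search state
        for next_state in NFSA.get((state, tape[index]), ()):
            agenda.append((next_state, index + 1))
    return False                          # agenda exhausted: reject

dfs = [nd_recognize(s, depth_first=True) for s in ["baaa!", "baa!", "ba!", "baaa"]]
bfs = [nd_recognize(s, depth_first=False) for s in ["baaa!", "baa!", "ba!", "baaa"]]
```

Both strategies give the same answers here; they differ only in the order in which the alternatives are explored. This sketch always terminates because every step consumes one input symbol.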

21 Use of an NFSA to Accept Strings
- Like DFS, BFS has its pitfalls. As with depth-first search, if the state-space is infinite, the search may never terminate.
- Moreover, because the agenda grows quickly, breadth-first search may require an impractically large amount of memory if the state-space is even moderately large.
- For larger problems, more complex search techniques such as dynamic programming or A* must be used. These are discussed in another chapter.
* Following van Santen and Sproat (1998), Kaplan and Kay (1994), and Lewis and Papadimitriou (1988).

22 Relation of NFA and DFA
- For any NFA, there is an exactly equivalent DFA.
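
One standard way to obtain the equivalent DFA is the subset construction, in which each DFA state stands for a set of NFA states. The slide does not say which method it has in mind, so the following Python sketch is an assumption: it handles NFAs without ε-transitions only, and all names (`nfa_to_dfa`, `dfa_accepts`, `SHEEP_NFA`) are made up for illustration.

```python
def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction: every DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue                      # already expanded
        dfa[current] = {}
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            if target:                    # skip symbols with no transition
                dfa[current][symbol] = target
                todo.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_finals = {q for q in dfa if q & finals}
    return dfa, start_set, dfa_finals

def dfa_accepts(dfa, start, finals, tape):
    """Run the deterministic machine produced by nfa_to_dfa."""
    state = start
    for symbol in tape:
        state = dfa[state].get(symbol)
        if state is None:
            return False
    return state in finals

# Non-deterministic /baa+!/ machine with the self-loop on state 2.
SHEEP_NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
dfa, dfa_start, dfa_finals = nfa_to_dfa(SHEEP_NFA, 0, {4}, "ab!")
```

For this machine the construction yields five DFA states; the choice point at state 2 becomes the single combined state {2, 3}.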

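23 3. Regular Languages and FSAs
What are regular languages?
- The class of languages that are definable by regular expressions.
- Equivalently, the class of languages that are characterizable by finite-state automata.
- The class of regular languages over ∑ is formally defined as follows:
  1. Ø is a regular language.
  2. For every a in ∑ ∪ {λ}, {a} is a regular language.
  3. If L1 and L2 are regular languages, then so are:
     (a) L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
     (b) L1 ∪ L2, the union of L1 and L2
     (c) L1*, the Kleene closure of L1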

24 Operations
- Regular languages are closed under the following operations (such as a regular expression):

25 REs are equivalent to FSAs.
- We start with three base cases: (a) r = λ, (b) r = Ø, (c) r = a.
[Figure: automata for the base case (no operators) for the induction showing that any regular expression can be turned into an equivalent automaton.]
- For the inductive step, we show that each of the primitive operations of a regular expression (concatenation, union, closure) can be imitated by an automaton.

26 REs are equivalent to FSAs.
- Concatenation
[Figure: FSA 1 and FSA 2 connected by a λ-transition.]
- Closure
[Figure: FSA 1 with added λ-transitions.]

27 REs are equivalent to FSAs.
- Union
[Figure: FSA 1 and FSA 2 side by side, joined by λ-transitions.]
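
The three inductive steps can be sketched in Python by gluing small machines together with λ-transitions. The slide figures are not recoverable from this transcript, so the exact wiring below is an assumption (a common textbook-style construction), and every name (`symbol`, `concat`, `union`, `closure`, `accepts`, `EPS`) is hypothetical. A machine is a triple of start state, single final state, and a list of arcs.

```python
from itertools import count

_ids = count()      # source of fresh state numbers
EPS = None          # label used for a λ-transition

def symbol(a):
    """Base case r = a: two states joined by one arc labeled a."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, a, f)]

def concat(m1, m2):
    """λ-arc from the final state of m1 to the start state of m2."""
    s1, f1, t1 = m1
    s2, f2, t2 = m2
    return s1, f2, t1 + t2 + [(f1, EPS, s2)]

def union(m1, m2):
    """New start and final states with λ-arcs into and out of both machines."""
    s1, f1, t1 = m1
    s2, f2, t2 = m2
    s, f = next(_ids), next(_ids)
    return s, f, t1 + t2 + [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]

def closure(m):
    """Kleene star: λ-arcs allow skipping the machine or looping through it."""
    s1, f1, t1 = m
    s, f = next(_ids), next(_ids)
    return s, f, t1 + [(s, EPS, s1), (f1, EPS, f), (s, EPS, f), (f1, EPS, s1)]

def accepts(m, tape):
    """Follow all paths in parallel; λ-arcs are taken without reading input."""
    start, final, arcs = m

    def eps_closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for p, label, r in arcs:
                if p == q and label is EPS and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = eps_closure({start})
    for ch in tape:
        moved = {r for p, label, r in arcs if p in current and label == ch}
        current = eps_closure(moved)
    return final in current

# /baa+!/ written as b a a a* ! and built from the pieces above.
sheep = concat(
    concat(concat(symbol("b"), symbol("a")), concat(symbol("a"), closure(symbol("a")))),
    symbol("!"),
)
c_or_d = union(symbol("c"), symbol("d"))
```

The `accepts` function uses the parallelism strategy, so no backup is needed even though the composed machines are non-deterministic.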

28 4. Summary
- We introduced the most important fundamental concept in language processing, the finite automaton.
- The regular expression language is a powerful tool for pattern matching.
- Basic operations in REs include concatenation of symbols, disjunction of symbols, counters, anchors, and precedence operators.
- The behavior of a deterministic automaton is fully determined by the state it is in.
- Any RE can be realized as an FSA.
- Memory is an advanced operation that is often considered part of regular expressions but cannot be realized as a finite automaton.
- Any NFA can be converted to a DFA.
- Recognition with an NFA is a search problem: the agenda can be explored depth-first (with a stack) or breadth-first (with a queue).


