CS 4705 Lecture 2 Regular Expressions and Automata in Language Analysis.

1 CS 4705 Lecture 2 Regular Expressions and Automata in Language Analysis

2 Statistical vs. Symbolic (Knowledge Rich) Techniques
How much linguistic knowledge do our representations and algorithms need to have to do ‘successful’ NLP?
–Bill hit John.
–John, Bill hit.
80/20 Rule: when do we need to worry about the other 20%?

3 Today
Review some of the simple representations and ask ourselves how we might use them to do interesting and useful things
–Regular Expressions
–Finite State Automata
Think about the limits of these simple approaches: when do we need more?

4 Uses of Regular Expressions in NLP
Simple but powerful tools for large corpus analysis (‘shallow’ processing)
–What word is most likely to begin a sentence?
–What word is most likely to begin a question?
–How often do people end sentences with prepositions?
Combined with other simple statistical tools, they allow us to
–Obtain word frequency and co-occurrence statistics
–Build simple interactive applications (e.g. Eliza)
–Authorship: Who wrote Shakespeare’s plays? The Federalist papers? The Unabomber letters?

5 Review
RE              Matches               Possible use
/./             Any character         A non-blank line
/\./, /\?/      A ‘.’, a ‘?’          A statement, a question
/[bckmsr]/      Any of these chars    Rhyme: /[bckmrs]ite/
/[a-z]/         Any l.c. letter       Rhyme: /[a-z]ite/
/[A-Z]/         Any u.c. letter       /[A-Z][a-z]*/
/[^A-Z]/        Any non-u.c. char     /[^A-Z][a-z]*/

6
RE              Description                            Uses?
/a*/            Zero or more a’s                       /(very[ ])*/
/a+/            One or more a’s                        /(very[ ])+/
/a?/            Optional single a                      /(very[ ])?/
/cat|dog/       ‘cat’ or ‘dog’                         /[a-z]* (cat|dog)/
/^[Nn]o$/       A line with only ‘No’ or ‘no’ in it
/\bun\B/        Prefixes                               Words prefixed by ‘un’ (nb. union)
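A few of the patterns in the table can be tried directly. The following is a minimal sketch using Python's `re` module (the slides themselves use Perl-style notation); the word and line lists are made up for illustration.

```python
import re

# \b is a word boundary and \B a non-boundary, so this finds 'un'
# at the start of a longer word.
un_prefix = re.compile(r"\bun\B")

# Hypothetical word list for illustration.
words = ["unhappy", "undo", "union", "sun", "un"]
un_words = [w for w in words if un_prefix.search(w)]
print(un_words)  # 'union' is included although 'un' is not a prefix there

# /(very[ ])+/ : one or more repetitions of 'very '
very_run = re.search(r"(very[ ])+", "a very very good idea").group(0)
print(repr(very_run))

# /^[Nn]o$/ : a line containing only 'No' or 'no'
lines = ["No", "no", "Nope", "no way"]
only_no = [ln for ln in lines if re.search(r"^[Nn]o$", ln)]
print(only_no)
```

The `union` case is the point of the "nb." in the table: the pattern checks only the shape of the string, not whether `un` is really a morphological prefix.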

7
RE plus                 E.G.
/kitt(y|ies)/           Morphological variants of ‘kitty’
/ (.+ier) and \1 /      Patterns: happier and happier, fuzzier and fuzzier, classifier and classifier
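The back-reference `\1` must match the same text that the first parenthesized group captured. A small sketch in Python's `re` syntax, with made-up test phrases:

```python
import re

# (.+ier) captures a string ending in 'ier'; \1 must repeat it exactly.
repeated_ier = re.compile(r"(.+ier) and \1")

# Hypothetical phrases for illustration.
phrases = [
    "happier and happier",
    "fuzzier and fuzzier",
    "classifier and classifier",
    "happier and fuzzier",
]
repeated = [p for p in phrases if repeated_ier.search(p)]
print(repeated)

# /kitt(y|ies)/ as a whole-word test
kitty = re.compile(r"kitt(y|ies)")
variants = [w for w in ["kitty", "kitties", "kitten"] if kitty.fullmatch(w)]
print(variants)
```

Note that `classifier and classifier` is accepted even though `classifier` is not a comparative: the pattern sees only the `ier` ending.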

8 Substitutions (Transductions)
E.g. unix sed or ‘s’ operator in Perl
–s/regexp1/pattern/
–s/I am feeling (.+)/You are feeling \1 ?/
–s/I gave (.+) to (.+)/Why would you give \2 \1 ?/
–s/You are (.+)[.]*/Why would you say that I am \1?/
–s/([1]?[0-9]) o’clock ([AaPp][. ]*[Mm][. ]*)/\1:00 \2/
How would you convert to 24-hour clock?
–s/[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/
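The substitutions above carry over almost unchanged to Python's `re.sub`, which also uses `\1`, `\2` in the replacement. The sketch below is an illustration only: the helper names `respond` and `to_24h` are invented here, a straight apostrophe replaces the typographic one in `o’clock`, and the 24-hour conversion is just one possible answer to the question on the slide.

```python
import re

# Eliza-style rules copied from the slide: (pattern, replacement).
RULES = [
    (r"I am feeling (.+)", r"You are feeling \1 ?"),
    (r"I gave (.+) to (.+)", r"Why would you give \2 \1 ?"),
    (r"You are (.+)[.]*", r"Why would you say that I am \1?"),
]

def respond(utterance):
    """Apply the first rule that matches; otherwise echo the input."""
    for pattern, replacement in RULES:
        result, count = re.subn(pattern, replacement, utterance)
        if count:
            return result
    return utterance

OCLOCK = r"([1]?[0-9]) o'clock ([AaPp][. ]*[Mm][. ]*)"

def to_24h(text):
    """One possible 24-hour conversion, using a function as the replacement."""
    def convert(match):
        hour = int(match.group(1)) % 12      # 12 AM -> 0, 12 PM -> 0 (+12 below)
        if match.group(2)[0] in "Pp":
            hour += 12
        return f"{hour:02d}:00"
    return re.sub(OCLOCK, convert, text)

print(respond("I am feeling sad"))
print(respond("I gave the book to Mary"))
print(re.sub(OCLOCK, r"\1:00 \2", "Call at 3 o'clock PM"))
print(to_24h("Call at 3 o'clock PM"))
```

Because `.+` is greedy, a trailing period in the input ends up inside the captured group rather than being absorbed by `[.]*`, so the third rule behaves best on input without final punctuation.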

9 Examples
Predictions from a news corpus:
–Which candidate for President is mentioned most often in the news? Is going to win?
–What stock should you buy?
–Which White House advisers have the most power?
Language use:
–Which form of comparative is more frequent: ‘Xer’ or ‘more X’?
–Which pronouns occur most often in subject position?
–How often do sentences end with infinitival ‘to’?
–What words most often begin and end sentences?
–What are the 20 most common words in your email? In the news? In Shakespeare’s plays?
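Questions such as "what are the most common words?" or "what words begin and end sentences?" reduce to a regular expression plus a counter. The following is a rough sketch on a tiny made-up text; a real corpus would need more careful tokenization and sentence splitting (abbreviations, quotes, and so on).

```python
import re
from collections import Counter

# Tiny hypothetical 'corpus' for illustration.
text = ("The cat sat. The dog barked at the cat. "
        "Did the dog sit? The cat did not.")

# Word frequencies, ignoring case.
word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

# First word of each sentence: at the start of the text,
# or after sentence-final punctuation plus whitespace.
first_words = Counter(re.findall(r"(?:^|[.?!]\s+)([A-Za-z']+)", text))

# Last word of each sentence: the word right before . ? or !
last_words = re.findall(r"([A-Za-z']+)[.?!]", text)

print(word_counts.most_common(2))
print(first_words.most_common(1))
print(last_words)
```

Replacing `most_common(2)` with `most_common(20)` gives the "20 most common words" query from the slide.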

10 Emotional language:
–What words indicate what emotions?
  Happiness
  Anger
  Confidence
  Despair
–How can we identify emotions automatically?

11 Finite State Automata
FSAs recognize the regular languages represented by regular expressions
–SheepTalk: /baa+!/
[Diagram: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a self-loop on q3 labeled a]
Directed graph with labeled nodes and arc transitions
Five states: q0 the start state, q4 the final state, 5 transitions

12 Formally
An FSA is a 5-tuple consisting of
–Q: set of states {q0,q1,q2,q3,q4}
–Σ: an alphabet of symbols {a,b,!}
–q0: a start state in Q
–F: a set of final states in Q {q4}
–δ(q,i): a transition function mapping Q x Σ to Q
[Diagram: the SheepTalk FSA, states q0 to q4 with arcs labeled b, a, a, a, !]

13 An FSA recognizes (accepts) strings of a regular language
–baa!
–baaa!
–baaaa!
–…
Tape metaphor: will this input be accepted?
Tape: a b a ! b

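14 State Transition Table for SheepTalk
         Input
State    b    a    !
0        1    0    0
1        0    2    0
2        0    3    0
3        0    3    4
4        0    0    0
(A table entry of 0 marks a missing transition: only five transitions exist, so the input is rejected at that point.)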
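A transition table like this one can drive a recognizer directly. The sketch below is a minimal table-driven recognizer, assuming that the 0 entries in the table mean "no transition" (they are simply left out of the dictionary); the names `TRANSITIONS` and `d_recognize` are chosen here for illustration.

```python
# Transition table for SheepTalk /baa+!/ : state -> {input symbol: next state}.
# Missing entries mean there is no transition, so the input is rejected.
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START = 0
FINAL = {4}

def d_recognize(tape):
    """Return True if the deterministic FSA accepts the whole tape."""
    state = START
    for symbol in tape:
        next_state = TRANSITIONS[state].get(symbol)
        if next_state is None:      # no arc for this symbol: reject
            return False
        state = next_state
    return state in FINAL           # accept only if we end in a final state

for tape in ["baa!", "baaaa!", "ba!", "aba!b"]:
    print(tape, d_recognize(tape))
```

The recognizer reads the tape once, left to right, and never backs up, which is what makes the deterministic case simple.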

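15 Non-Deterministic FSAs for SheepTalk
[Two diagrams over states q0 to q4: one with arcs labeled b, a, a, a, ! and one with arcs labeled b, a, a, ! plus one arc whose label did not survive extraction (presumably ε)]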

16 Problems of Non-Determinism
At any choice point, we may follow the wrong arc
Potential solutions:
–Save backup states at each choice point
–Look-ahead in the input before making choice
–Pursue alternatives in parallel
–Determinize our NFSAs (and then minimize)
FSAs can be useful tools for recognizing (and generating) subsets of natural language
–But they cannot represent all NL phenomena (center embedding: The mouse the cat chased died.)
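Of the solutions listed, "pursue alternatives in parallel" is perhaps the easiest to sketch: instead of one current state, keep the set of all states the machine could be in. The automaton below is an assumption made for illustration (a non-deterministic SheepTalk machine in which q2 has two arcs labeled a, one looping back to q2 and one going on to q3); the slides' own diagrams are not reproduced here.

```python
# Assumed non-deterministic SheepTalk machine:
# (state, symbol) -> set of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the choice point: stay in q2 or move on to q3
    (3, "!"): {4},
}

def nd_recognize(tape, start=0, finals=frozenset({4})):
    """Track every state reachable so far; accept if any path ends in a final state."""
    current = {start}
    for symbol in tape:
        current = {
            target
            for state in current
            for target in NFSA.get((state, symbol), set())
        }
        if not current:             # every alternative has died out
            return False
    return bool(current & finals)

for tape in ["baa!", "baaa!", "ba!", "aba!b"]:
    print(tape, nd_recognize(tape))
```

Tracking sets of states in this way is closely related to determinization: each distinct set that can arise corresponds to one state of an equivalent deterministic machine.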

17
–Simple vs. linguistically rich representations…
–How do we decide what we need?

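18 FSAs as Grammars for Natural Language
[Diagram: FSA over states q0 to q6 with arcs labeled the, rev, mr, dr, hon, ms, mrs, pat, l., robinson, plus one arc whose label did not survive extraction]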

19 If we want to extract all the proper names in the news, will this work?
–What will it miss?
–Will it accept something that is not a proper name?
–How would you change it to accept all proper names without false positives?
–Precision vs. recall…

20 Summing Up
Regular expressions and FSAs can represent subsets of natural language as well as regular languages
–Both representations may be impossible for humans to understand for any real subset of a language
–But they are relatively easy to use for small subsets
–Can be hard to scale up: when many choices at any point (e.g. surnames)
Next time: Read Ch 3

