Regular Expressions and Automata in Language Analysis

Slides:



Advertisements
Similar presentations
4b Lexical analysis Finite Automata
Advertisements

1 Regular Expressions and Automata September Lecture #2.
C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
Finite-state automata 2 Day 13 LING Computational Linguistics Harry Howard Tulane University.
Regular Expressions (RE) Used for specifying text search strings. Standarized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)
1 Regular Expressions and Automata September Lecture #2-2.
CS5371 Theory of Computation
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
Finite Automata Finite-state machine with no output. FA consists of States, Transitions between states FA is a 5-tuple Example! A string x is recognized.
CS 4705 Lecture 2 Regular Expressions and Automata in Language Analysis.
Computational Language Finite State Machines and Regular Expressions.
1 Languages and Finite Automata or how to talk to machines...
January 14, 2015CS21 Lecture 51 CS21 Decidability and Tractability Lecture 5 January 14, 2015.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
Finite Automata Chapter 5. Formal Language Definitions Why need formal definitions of language –Define a precise, unambiguous and uniform interpretation.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Grammars, Languages and Finite-state automata Languages are described by grammars We need an algorithm that takes as input grammar sentence And gives a.
CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar.
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
Chapter 2. Regular Expressions and Automata From: Chapter 2 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,
March 1, 2009 Dr. Muhammed Al-mulhem 1 ICS 482 Natural Language Processing Regular Expression and Finite Automata Muhammed Al-Mulhem March 1, 2009.
Natural Language Processing Lecture 6 : Revision.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
1 Regular Expressions and Automata CPE 641 Natural Language Processing from Kathy McCoy’s slides, CISC 882 Introduction to NLP
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
LING 388: Language and Computers Sandiway Fong 9/27 Lecture 10.
Copyright © Curt Hill Finite State Automata Again This Time No Output.
Finite-state automata Day 12 LING Computational Linguistics Harry Howard Tulane University.
1 Regular Expressions and Automata August Lecture #2.
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
Natural Language Processing Lecture 4 : Regular Expressions and Automata.
CS 4705 Lecture 2 Regular Expressions and Automata.
Modeling Computation: Finite State Machines without Output
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
using Deterministic Finite Automata & Nondeterministic Finite Automata
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
BİL711 Natural Language Processing1 Regular Expressions & FSAs Any regular expression can be realized as a finite state automaton (FSA) There are two kinds.
Deterministic Finite Automata Nondeterministic Finite Automata.
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
Finite Automata.
Ruby Regular Expressions
Chapter 3 Lexical Analysis.
Finite-State Machines (FSMs)
Lexical analysis Finite Automata
Non Deterministic Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
Finite-State Machines (FSMs)
Two issues in lexical analysis
Recognizer for a Language
Lexical Analysis Why separate lexical and syntax analyses?
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
Deterministic Finite Automata And Regular Languages Prof. Busch - LSU.
CSC NLP - Regex, Finite State Automata
CSE322 Definition and description of finite Automata
Finite Automata.
Finite Automata.
4b Lexical analysis Finite Automata
Regular Expressions
CS21 Decidability and Tractability
Compiler Construction
4b Lexical analysis Finite Automata
CPSC 503 Computational Linguistics
NFAs and Transition Graphs
Chapter # 5 by Cohen (Cont…)
Natural Language Processing (NLP)
Statistical NLP Winter 2009
Presentation transcript:

Regular Expressions and Automata in Language Analysis Lecture 2 Regular Expressions and Automata in Language Analysis CS 4705

Statistical vs. Symbolic (Knowledge Rich) Techniques How much linguistic knowledge do our representations and algorithms need to have to do ‘successful’ NLP? Bill hit John. John, Bill hit. 80/20 Rule: when do we need to worry about the other 20%?

Today Review some of the simple representations and ask ourselves how we might use them to do interesting and useful things Regular Expressions Finite State Automata Think about the limits of these simple approaches: when do we need more?

Uses of Regular Expressions in NLP As grep, perl: Simple but powerful tools for large corpus analysis and ‘shallow’ processing What word is most likely to begin a sentence? What word is most likely to begin a question? How often do people end sentences with prepositions? With other unix tools, allow us to Obtain word frequency and co-occurrence statistics Build simple interactive applications (e.g. Eliza) Authorship: Who wrote Shakespeare’s plays? The Federalist papers? The Unibomber letters?

A Quick Review A non-blank line Any character /./ /[^A-Z][a-z]*/ Any non-u.c. char /[^A-Z]/ /[A-Z][a-z]*/ Any u.c. letter /[A-Z]/ Any l.c. letter /[a-z]/ /[bckmrs]ite/ Any of these chars /[bckmsr]/ Any ‘?’ /?/ Possible use Matches RE A question /[a-z]ite/

RE Description Uses? /a*/ Zero or more a’s /(very[ ])*/ /a+/ One or more a’s /(very[ ])+/ /a?/ Zero or one a’s /(very[ ])?/ /cat|dog/ ‘cat’ or ‘dog’ /[a-z]* (cat|dog)/ /^no$/ A line with only ‘no’ in it /\bun\B/ Prefixes Words prefixed by ‘un’

RE E.G. /pupp(y|ies)/ Morphological variants of ‘puppy’ / (.+)ier and \1ier / happier and happier, fuzzier and fuzzier

Substitutions (Transductions) Sed or ‘s’ operator in Perl s/regexp1/pattern/ s/I am feeling (.+)/You are feeling \1?/ s/I gave (.+) to (.+)/Why would you give \2 \1?/

Examples Predictions from a news corpus: Language use: Which candidate for Governor is mentioned most often in the news? Is going to win? What stock should you buy? Which White House advisers have the most power? Language use: Which form of comparative is more frequent: ‘Xer’ or ‘more X’? Which pronouns occur most often in subject position? How often do sentences end with infinitival ‘to’? What words most often begin and end sentences? What are the 20 most common words in your email? In the news? In Shakespeare’s plays?

Emotional language: What words indicate what emotions? Happiness Anger Confidence Despair How can we identify emotions automatically?

Finite State Automata FSAs recognize the regular languages represented by regular expressions SheepTalk: /baa+!/ q0 q4 q1 q2 q3 b a ! Directed graph with labeled nodes and arc transitions Five states: q0 the start state, q4 the final state, 5 transitions

Formally FSA is a 5-tuple consisting of Q: set of states {q0,q1,q2,q3,q4} : an alphabet of symbols {a,b,!} q0: a start state in Q F: a set of final states in Q {q4} (q,i): a transition function mapping Q x  to Q q0 q4 q1 q2 q3 b a !

a b ! FSA recognizes (accepts) strings of a regular language baa! baaa! baaaa! … Tape metaphor: will this input be accepted? a b !

State Transition Table for SheepTalk Input b a ! 1 2 3 4

Non-Deterministic FSAs for SheepTalk q0 q4 q1 q2 q3 b a ! b a a ! q0 q1 q2 q3 q4 

Problems of Non-Determinism At any choice point, we may follow the wrong arc Potential solutions: Save backup states at each choice point Look-ahead in the input before making choice Pursue alternatives in parallel Determinize our NFSAs (and then minimize) FSAs can be useful tools for recognizing – and generating – subsets of natural language But they cannot represent all NL phenomena (center embedding: The mouse the cat chased died.)

Simple vs. linguistically rich representations…. How do we decide what we need?

FSAs as Grammars for Natural Language dr the rev mr pat l. robinson q0 q1 q2 q3 q4 q5 q6 ms hon  mrs 

If we want to extract all the proper names in the news, will this work? What will it miss? Will it accept something that is not a proper name? How would you change it to accept all proper names without false positives? Precision vs. recall….

Summing Up Regular expressions and FSAs can represent subsets of natural language as well as regular languages Both representations may be impossible for humans to understand for any real subset of a language But they are relatively easy to use for small subsets Can be hard to scale up: when many choices at any point (e.g. surnames) Next time: Read Ch 3 Class participation opportunity: Do the experiment at http://pi2354.uvt.nl/wwstim/gudrun/html/ and print a copy of the final page of the experiment to show you’ve done it