CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.

Slides:



Advertisements
Similar presentations
4b Lexical analysis Finite Automata
Advertisements

1 Regular Expressions and Automata September Lecture #2.
Regular Expressions and DFAs COP 3402 (Summer 2014)
C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
Finite Automata CPSC 388 Ellen Walker Hiram College.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
Using Rules Chapter 6. 2 Logic Programming (Definite) logic program: A  B 1, B 2, …, B m program clause (Horn) head body  A 1, A 2, …, A n goal clause.
Finite-state automata 2 Day 13 LING Computational Linguistics Harry Howard Tulane University.
Regular Expressions (RE) Used for specifying text search strings. Standarized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)
1 Regular Expressions and Automata September Lecture #2-2.
CS5371 Theory of Computation
Finite Automata Finite-state machine with no output. FA consists of States, Transitions between states FA is a 5-tuple Example! A string x is recognized.
CS147 - Terry Winograd - 1 Lecture 14 – Agents and Natural Language Terry Winograd CS147 - Introduction to Human-Computer Interaction Design Computer Science.
CS 4705 Lecture 2 Regular Expressions and Automata in Language Analysis.
Computational Language Finite State Machines and Regular Expressions.
CMSC 723 / LING 645: Intro to Computational Linguistics September 8, 2004: Monz Regular Expressions and Finite State Automata (J&M 2) Prof. Bonnie J. Dorr.
January 14, 2015CS21 Lecture 51 CS21 Decidability and Tractability Lecture 5 January 14, 2015.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
CS120: Lecture 17 MP Johnson Hunter
1 Foundations of Software Design Lecture 22: Regular Expressions and Finite Automata Marti Hearst Fall 2002.
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
Finite Automata Chapter 5. Formal Language Definitions Why need formal definitions of language –Define a precise, unambiguous and uniform interpretation.
Turing Machines CS 105: Introduction to Computer Science.
CSC 361Finite Automata1. CSC 361Finite Automata2 Formal Specification of Languages Generators Grammars Context-free Regular Regular Expressions Recognizers.
Introduction to Finite Automata Adapted from the slides of Stanford CS154.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Grammars, Languages and Finite-state automata Languages are described by grammars We need an algorithm that takes as input grammar sentence And gives a.
CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar.
Finite-State Machines with No Output
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
CS490 Presentation: Automata & Language Theory Thong Lam Ran Shi.
Chapter 2. Regular Expressions and Automata From: Chapter 2 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,
Theory of Computation 1 Theory of Computation Peer Instruction Lecture Slides by Dr. Cynthia Lee, UCSD are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.
March 1, 2009 Dr. Muhammed Al-mulhem 1 ICS 482 Natural Language Processing Regular Expression and Finite Automata Muhammed Al-Mulhem March 1, 2009.
Great Theoretical Ideas in Computer Science.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Natural Language Processing Lecture 2—1/15/2015 Susan W. Brown.
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
1 Regular Expressions and Automata CPE 641 Natural Language Processing from Kathy McCoy’s slides, CISC 882 Introduction to NLP
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
Copyright © Curt Hill Finite State Automata Again This Time No Output.
Finite-state automata Day 12 LING Computational Linguistics Harry Howard Tulane University.
1 LING 6932 Spring 2007 LING 6932 Topics in Computational Linguistics Hana Filip Lecture 2: Regular Expressions, Finite State Automata.
1 Regular Expressions and Automata August Lecture #2.
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
Natural Language Processing Lecture 4 : Regular Expressions and Automata.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
CS 4705 Lecture 2 Regular Expressions and Automata.
TM Design Macro Language D and SD MA/CSSE 474 Theory of Computation.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
BİL711 Natural Language Processing1 Regular Expressions & FSAs Any regular expression can be realized as a finite state automaton (FSA) There are two kinds.
Conversions Regular Expression to FA FA to Regular Expression.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Theory of Computation Automata Theory Dr. Ayman Srour.
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
Lexical analysis Finite Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
CSC NLP - Regex, Finite State Automata
4b Lexical analysis Finite Automata
Regular Expressions and Automata in Language Analysis
4b Lexical analysis Finite Automata
CPSC 503 Computational Linguistics
A User study on Conversational Software
Statistical NLP Winter 2009
Presentation transcript:

CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky

 Some simple problems: ◦ How much is Google worth? ◦ How much is the Empire State Building worth? ◦ How much is Columbia University worth? ◦ How much is the United States worth? ◦ How much is a college education worth?  How much knowledge of language do our algorithms need to do useful NLP? ◦ 80/20 Rule:  Claim: 80% of NLP can be done with simple methods  When should we worry about the other 20%?

 Review some simple representations of language and see how far they will take us ◦ Regular Expressions ◦ Finite State Automata  Think about the limits of these simple approaches ◦ When are simple methods good enough? ◦ When do we need something more?

 Simple but powerful tools for ‘shallow’ processing, e.g. of very large corpora ◦ What word is begins a sentence? ◦ What word is most likely to begin a question? ◦ How often do people end sentences with prepositions?  With other simple statistical tools, allow us to ◦ Obtain word frequency and co-occurrence statistics  What is this document ‘about’?  What words typically modify other words? ◦ Build simple interactive applications (e.g. Eliza)Eliza ◦ Recognize Named Entities (NE): people names, company names ◦ Deception detection: Statement Analysis

A non-blank lineAny character/./ Not an NELower case letter/ [^A-Z]/ Possible NECapitalized letter/ [A-Z]/ Any l.c. letter/[a-z]/ Rhyme:/[bckmrs ]ite/ Any char in set/[bckmsr]/ A ‘.’, a ‘?’/\./, /\?/ UsesMatchesRE Rhyme: /[a-z]ite/ A statement, a question

REDescriptionUses? /a*/Zero or more a’s/(very[ ])*/ /a+/One or more a’s/(very[ ])+/ /a?/Optional single a/(very[ ])?/ /cat|dog/‘cat’ or ‘dog’/[A-Z,a-z]* (cat|dog)/ /^[Nn]o$/A line with only ‘No’ or ‘no’ in it /\bun\B/PrefixesWords prefixed by ‘un’ (nb. union)

Patterns: happier and happier, fuzzier and fuzzier, classifier and classifier / (.+ier) and \1 / Morphological variants of ‘kitty’/kitt(y|ies|en|ens )/ E.G.RE plus

 Men are all alike.  IN WHAT WAY?  They're always bugging us about something or other.  CAN YOU THINK OF A SPECIFIC EXAMPLE?  Well, my boyfriend made me come here.  YOUR BOYFRIEND MADE YOU COME HERE  He says I'm depressed much of the time.  I AM SORRY TO HEAR YOU ARE DEPRESSED  It's true. I am unhappy  DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY  I need some help, that much seems certain.  WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP  Perhaps I could learn to get along with my mother.  TELL ME MORE ABOUT YOUR FAMILY  My mother takes care of me.  WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU  My father.  YOUR FATHER  You are like my father in some ways.

s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/ s/.* all.*/IN WHAT WAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ Step 1: replace first person with second person references s/\bI(’m| am)\b /YOU ARE/g s/\bmy\b /YOUR/g S/\bmine\b /YOURS/g Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations Slide from Dorr/Monz

 E.g. unix sed or ‘s’ operator in Perl (s/regexpr/pattern/) ◦ Transform time formats:  s/([1]?[0-9]) o’clock ([AaPp][Mm])/\1:00 \2/  How would you convert to 24-hour clock? ◦ What does this do?  s/[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0- 9][0-9][0-9]/ /

 Predictions from a news corpus: ◦ Which candidate for President is mentioned most often in the news? Is going to win? ◦ Which White House advisors have the most power?  Language usage: ◦ Which form of comparative is more common: ‘Xer’ or ‘more X’? ◦ Which pronouns occur most often in subject position? ◦ How often do sentences end with infinitival ‘to’? ◦ What words typically begin and end sentences?

 Three equivalent formal ways to look at what we’re up to Regular Expressions Regular Languages Finite State Automata Regular Grammars

/^baa+!$/ q0q0 q1q1 q2q2 q3q3 q4q4 baa! a state transition final state baa! baaa! baaaa! baaaaa!... Slide from Dorr/Monz

 FSA is a 5-tuple consisting of ◦ Q: set of states {q0,q1,q2,q3,q4} ◦ : an alphabet of symbols {a,b,!} ◦ q0: a start state in Q ◦ F: a set of final states in Q {q4} ◦ (q,i): a transition function mapping Q x  to Q q0 q4 q1q2q3 ba a a!

 State-transition table

 Recognition is the process of determining if a string should be accepted by a machine  Or… it’s the process of determining if a string is in the language we’re defining with the machine  Or… it’s the process of determining if a regular expression matches a string

 Traditionally, (Turing’s idea) this process is depicted with a tape.

 Start in the start state  Examine the current input  Consult the table  Go to a new state and update the tape pointer.  Until you run out of tape.

aba!b q0q baa ! a REJECT Slide from Dorr/Monz

baaa q0q0 q1q1 q2q2 q3q3 q3q3 q4q4 ! baa ! a ACCEPT Slide from Dorr/Monz

 Deterministic means that at each point in processing there is always one unique thing to do (no choices).  D-recognize is a simple table-driven interpreter  The algorithm is universal for all unambiguous languages. ◦ To change the machine, you change the table. Slide from Jurafsky

q0 q4 q1q2q3 ba a a! q0 q4 q1q2q3 baa! 

 At any choice point, we may follow the wrong arc  Potential solutions: ◦ Save backup states at each choice point ◦ Look-ahead in the input before making choice ◦ Pursue alternatives in parallel ◦ Determinize our NFSAs (and then minimize)  FSAs can be useful tools for recognizing – and generating – subsets of natural language ◦ But they cannot represent all NL phenomena (e.g. center embedding: The mouse the cat chased died.)

◦ Simple vs. linguistically rich representations…. ◦ How do we decide what we need?

q2q4q5q0q3q1q6 therev mr dr hon patl.robinson ms mrs 

 If we want to extract all the proper names in the news, will this work? ◦ What will it miss? ◦ Will it accept something that is not a proper name? ◦ How would you change it to accept all proper names without false positives? ◦ Precision vs. recall….

 Regular expressions and FSAs can represent subsets of natural language as well as regular languages ◦ Both representations may be difficult for humans to understand for any real subset of a language ◦ Can be hard to scale up: e.g., when many choices at any point (e.g. surnames) ◦ But quick, powerful and easy to use for small problems  Next class: ◦ Read Ch 3.1