Regular Expressions (RE): used for specifying text search strings. Standardized and widely used (UNIX: vi, perl, grep; Microsoft Word and other text editors…)

Regular Expressions (RE)
Used for specifying text search strings.
Standardized and widely used (UNIX: vi, perl, grep; Microsoft Word and other text editors…).
A RE is a notation for characterizing a set of strings.
Formally, a language is defined as a (possibly infinite) set of strings over a given alphabet.
A regular expression search consists of a search pattern and a text to search through.

Basic RE Patterns
E.g. /woodchuck/
Case sensitive: /Woodchuck/ is not the same as /woodchuck/
Disjunction: /[Ww]oodchuck/ : Woodchuck or woodchuck
Ranges
–/[A-Z]/ : [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
–/[0-9]/ : [0123456789]
Negation
–[^a] : any character that is not an “a”
–[^A-Z] : any character that is not an uppercase letter
–But: [a^b] : “a”, “^” or “b” (the caret negates only when it is the first character inside the brackets)

Basic RE Patterns
Optional characters
–/woodchucks?/ : woodchuck or woodchucks
Zero or more instances (Kleene star)
–/baa*!/ : ba! or baa! or baaa! or baaaa! …
–/c[ab]*c/ : cabababc or caaaac or cc …
–Note: /a*/ also matches the empty string, so it finds a match in any text.
One or more instances
–/ba+!/ : ba! or baa! or baaa! or baaaa! …
–/[0-9]+/ : a string of digits.
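The character classes and repetition operators above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the helper name `matches_whole` is hypothetical and only serves the illustration.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

# Disjunction and optional characters
print(matches_whole(r"[Ww]oodchuck", "Woodchuck"))   # True
print(matches_whole(r"woodchucks?", "woodchucks"))   # True

# Kleene star: zero or more of the preceding item
print(matches_whole(r"baa*!", "ba!"))                # True
print(matches_whole(r"c[ab]*c", "cc"))               # True

# Plus: one or more of the preceding item
print(matches_whole(r"ba+!", "b!"))                  # False

# Negated class: every character that is not an uppercase letter
print(re.findall(r"[^A-Z]", "aBc"))                  # ['a', 'c']

# A caret that is not first inside the brackets is an ordinary character
print(matches_whole(r"[a^b]", "^"))                  # True

# /a*/ succeeds on any text because it can match the empty string
print(repr(re.search(r"a*", "xyz").group()))         # ''
```

`re.fullmatch` is used so that each pattern has to account for the entire string; `re.search` would also report matches that cover only part of the text.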

Basic RE Patterns
Wildcards: /./ matches any character
–/beg.n/ : begin, begun, beg_n…
Anchors:
–Pattern at beginning of string: /^the car/ matches “the car I drive” but not “I drive the car”
–Pattern at end of string: /the car$/ matches “I drive the car” but not “the car I drive”
–\b matches a word boundary: /\bthe\b/ matches “the” but not “other”

Basic RE Patterns
Parentheses: /(abc)+/ matches abc, abcabc, abcabcabc...
Disjunction: /cit(y|ies)/ matches city or cities
Repetitions: /(abc){3}/ matches abcabcabc
Backslash: used for escaping special characters.
–\*, \+, \., \?...
Aliases
–\n: newline, \t: tab, \d: [0-9], \w: [a-zA-Z0-9_]
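As a quick check of the wildcard, anchor, grouping and alias notation, here is a small sketch using Python's `re` module. The helper name `found` is hypothetical. Note that for Python text strings `\d` and `\w` also accept non-ASCII digits and letters unless the `re.ASCII` flag is given.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches anywhere in `text`."""
    return re.search(pattern, text) is not None

# Wildcard: "." stands for any single character (except a newline by default)
print(re.findall(r"beg.n", "begin begun beg_n"))   # ['begin', 'begun', 'beg_n']

# Anchors
print(found(r"^the car", "I drive the car"))        # False
print(found(r"the car$", "I drive the car"))        # True
print(found(r"\bthe\b", "other"))                   # False

# Grouping, disjunction and counted repetition
print(found(r"^cit(y|ies)$", "cities"))             # True
print(found(r"^(abc){3}$", "abcabc"))               # False

# Escaping a special character and using aliases
print(found(r"\.", "no full stop here"))            # False
print(re.findall(r"\d+", "the 35 boxes"))           # ['35']
print(found(r"^\w+$", "snake_case9"))               # True
```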

RE Substitution
s/regexp1/regexp2/
E.g. s/colour/color/
Back references: \1, \2, \3 … refer to the text matched by the first, second, third … parenthesized group.
–s/([0-9]+)/<\1>/ : the 35 boxes -> the <35> boxes
–s/^\s*(\w+)\W+(\w+)/\2 \1/ : reverses the first two words of a sentence.
–Also used in search REs: /A ([a-z]+) is a \1/ matches “A car is a car”.
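The s/…/…/ notation is Perl and sed syntax. A rough Python equivalent, shown here only as a sketch, is `re.sub(pattern, replacement, text)`. The angle-bracket replacement in the second example is an assumption about what the slide intended, and the search pattern in the last example needs a parenthesized group for `\1` to refer to.

```python
import re

# s/colour/color/
print(re.sub(r"colour", "color", "my favourite colour"))
# my favourite color

# Back reference in the replacement: wrap every number in angle brackets
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))
# the <35> boxes

# Swap the first two words of a sentence
print(re.sub(r"^\s*(\w+)\W+(\w+)", r"\2 \1", "hello world again"))
# world hello again

# Back reference inside a search pattern: the same word must occur twice
same_word = r"A ([a-z]+) is a \1"
print(re.search(same_word, "A car is a car") is not None)   # True
print(re.search(same_word, "A car is a bus") is not None)   # False
```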

ELIZA
Simulated the responses of a psychologist based on simple pattern substitution.
Initially it cascades through a set of RE substitutions that change, for example, s/I’m/YOU ARE/, s/my/YOUR/...
Then it runs the input through RE substitutions looking for relevant patterns and produces the appropriate output, e.g.
–s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR THAT YOU ARE \1/
–s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1\?/
–s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
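The two-stage cascade can be sketched in a few lines of Python. This is only an illustration of the idea, not the original ELIZA program: the rule lists, the function name `respond` and the fallback reply "PLEASE GO ON" are assumptions, and only one of the two depressed/sad responses is kept so that the result is deterministic.

```python
import re

# Stage 1: rewrite first-person forms as second-person forms.
PRONOUN_RULES = [
    (r"\bI['’]m\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
]

# Stage 2: look for a relevant pattern and build the reply from it.
RESPONSE_RULES = [
    (r".*\bYOU ARE (depressed|sad)\b.*", r"I AM SORRY TO HEAR THAT YOU ARE \1"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance: str) -> str:
    """Apply the pronoun rules, then return the first matching response."""
    text = utterance
    for pattern, replacement in PRONOUN_RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    for pattern, template in RESPONSE_RULES:
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE)
        if match:
            # expand() fills back references such as \1 into the template
            return match.expand(template)
    return "PLEASE GO ON"  # assumed fallback, not taken from the slide

print(respond("I'm depressed"))
# I AM SORRY TO HEAR THAT YOU ARE depressed
print(respond("my brother is always late"))
# CAN YOU THINK OF A SPECIFIC EXAMPLE
```

The order of the rules matters: the first response pattern that matches wins, so more specific patterns should come before general ones.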

Finite State Automata (FSA)
REs (that don’t use back-references) can be implemented as finite-state automata.
Conversely, any FSA can be described by a regular expression.
A RE or a FSA describes a language belonging to the class of Regular Languages (RL).

Finite State Automata
A FSA is represented as a graph with a finite set of nodes (called states) and directed arcs between pairs of states (called transitions), labeled with symbols from the alphabet.
One state is the start state, represented by an incoming arrow.
Some states are final (accepting) states, represented by a double circle.

FSA Example Sheeptalk: baa! baaa! baaaa! baaaaa! … Equivalent to RE: /baaa*!/

FSA Recognition
Examples:
–baaa! : succeeds
–aba!b : fails

FSA State Transition Table Alternative representation for FSA

FSA Example

Formal FSA Definition
Q: a finite set of states (q0, q1, q2, …)
Σ: a finite input alphabet of symbols
q0: the start state
F: the set of final states (a subset of Q)
δ(q,i): the transition function from states and inputs to states. Given a state q and an input i, it returns a new state q’.
Deterministic FSA (DFSA): the recognition of a string has no choice points.
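A deterministic recognizer follows directly from this definition: δ becomes a lookup table and recognition is a single pass over the input. The sketch below encodes the sheeptalk language /baaa*!/; the state numbering and the names `TRANSITIONS` and `d_recognize` are assumptions, since the state diagram itself is not part of this transcript.

```python
# Transition table δ for sheeptalk: (state, symbol) -> next state.
# A missing entry means there is no legal move, so the input is rejected.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop: any number of further a's
    (3, "!"): 4,
}
START_STATE = 0
FINAL_STATES = {4}

def d_recognize(tape: str) -> bool:
    """Return True if the DFSA accepts `tape`."""
    state = START_STATE
    for symbol in tape:
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:
            return False          # no arc for this symbol: reject
        state = next_state
    return state in FINAL_STATES   # accept only if we end in a final state

print(d_recognize("baaa!"))   # True
print(d_recognize("aba!b"))   # False
```

Because there is at most one entry per (state, symbol) pair, the recognizer never has to choose between paths, which is exactly what makes it deterministic.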

Non Deterministic FSA (NFSA) When in state q2 with input a, the FSA has the choice to move to state q3 or remain in state q2.

Empty Arcs From state q3 the FSA can move to state q2, without looking at the input (without advancing the tape).

NFSA Transition Tables An extra ε column is added. The transitions are now sets of states (instead of single states)

Accepting Strings with NFSA
Since there is a choice of which arc to follow, it is possible to take the wrong path and reject a string that should be accepted.
All possible paths should be followed, and if even one reaches a final state at the end of the input then the string is accepted.
Computational approaches
–Backup: at each choice point we store the current search-state (the state of the FSA and the position of the tape); when we reach a dead end we back up to that search-state and try another path from there.
–Lookahead: we look ahead in the input to decide which path to take.
–Parallelism: alternative paths are explored in parallel.

NFSA Recognition as Search
NFSA recognition can be seen as a search through a space of search-states. This space consists of all the possible pairings of FSA-states and tape positions.
The order in which these search-states are visited (i.e. the decision about which possible path to follow) is important for performance.
–Depth-first or breadth-first search.
–For larger search spaces it may be necessary to use more complex search techniques (e.g. dynamic programming or A*).
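The search view can be sketched with an agenda of (state, tape position) pairs. Taking items from the end of the agenda gives depth-first search (the backup strategy), taking them from the front gives breadth-first search. The automata below are assumed encodings of nondeterministic sheeptalk machines, with the empty string standing for an ε arc; all names are hypothetical.

```python
from collections import deque

EPSILON = ""  # label used for empty arcs

# In state 2, reading "a" may either stay in state 2 or move on to state 3.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}

# Same language, written with an empty arc from state 3 back to state 2.
NFSA_EPSILON = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPSILON): {2},
    (3, "!"): {4},
}

def nd_recognize(tape, transitions, start=0, final=frozenset({4}),
                 depth_first=True):
    """Search the space of (state, position) pairs for an accepting path."""
    agenda = deque([(start, 0)])
    visited = set()
    while agenda:
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if (state, pos) in visited:
            continue              # already explored, avoids epsilon loops
        visited.add((state, pos))
        if pos == len(tape) and state in final:
            return True
        # Empty arcs change the state without advancing the tape.
        for nxt in transitions.get((state, EPSILON), ()):
            agenda.append((nxt, pos))
        # Ordinary arcs consume one input symbol.
        if pos < len(tape):
            for nxt in transitions.get((state, tape[pos]), ()):
                agenda.append((nxt, pos + 1))
    return False

print(nd_recognize("baaa!", NFSA))                      # True
print(nd_recognize("baaa!", NFSA, depth_first=False))   # True
print(nd_recognize("ba!", NFSA_EPSILON))                # False
```

The `visited` set bounds the work by the number of distinct search-states, that is, the number of FSA states times the number of tape positions.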

Relating DFSA and NFSA
For every NFSA there exists an equivalent DFSA (i.e. one that accepts exactly the same set of strings).
The idea behind the proof is based on converting a NFSA to an equivalent DFSA.
The resulting DFSA may have many more states than the original NFSA (up to 2^N states for a NFSA with N states).
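The conversion is usually done with the subset construction: each DFSA state stands for the set of NFSA states that could be active at that point, which is where the bound of 2^N comes from. The following sketch assumes the NFSA is given as a dictionary from (state, symbol) to a set of states, with the empty string marking ε arcs; the function names are hypothetical.

```python
def epsilon_closure(states, transitions):
    """All states reachable from `states` through empty arcs alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in transitions.get((state, ""), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)

def determinize(transitions, start, final):
    """Subset construction: return (dfsa_table, dfsa_start, dfsa_finals)."""
    alphabet = {symbol for (_, symbol) in transitions if symbol != ""}
    dfsa_start = epsilon_closure({start}, transitions)
    dfsa_table = {}
    seen = {dfsa_start}
    todo = [dfsa_start]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= transitions.get((state, symbol), set())
            if not moved:
                continue              # no arc: leave the entry undefined
            target = epsilon_closure(moved, transitions)
            dfsa_table[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfsa_finals = {subset for subset in seen if subset & set(final)}
    return dfsa_table, dfsa_start, dfsa_finals

def accepts(tape, dfsa_table, dfsa_start, dfsa_finals):
    """Run the constructed DFSA over `tape`."""
    state = dfsa_start
    for symbol in tape:
        state = dfsa_table.get((state, symbol))
        if state is None:
            return False
    return state in dfsa_finals

# Nondeterministic sheeptalk automaton (assumed encoding).
NFSA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
table, start_set, finals = determinize(NFSA, 0, {4})
print(accepts("baaa!", table, start_set, finals))   # True
print(accepts("ba!", table, start_set, finals))     # False
```

Only subsets that are actually reachable are built, so in practice the result is often far smaller than the worst case.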

Morphological Parsing and Recognition
Morphological recognition: accepts and rejects forms:
–Accept: geese
–Reject: gooses
Morphological parsing: produces a morphological analysis (stem followed by morphological features)
–geese: goose +N +PL
–cats: cat +N +PL
–ground: ground +N +SG, grind +V +PPart

Morphological Parsing
A morphological parser is composed of
–lexicon: the list of stems and affixes in a language, together with basic information about them.
–morphotactics: a model of morpheme ordering that defines which morpheme classes may follow other classes.
–orthographic rules: spelling rules used to model changes that occur when morphemes combine (e.g. city+s -> cities)

Lexicon
A repository of words: a, AAA, AA, Aachen, aardvark, aardwolf...
Ideally every possible word should be in the lexicon, including abbreviations and proper names.
It is not practical to list every word in the language, and it is impossible for some languages (e.g. Finnish, Turkish...).
Usually only the stems and the affixes are listed.
Often, along with stems in the lexicon, we keep information about stem classes.
–e.g. dog: reg-noun, goose: irreg-sg-noun,
–geese: irreg-pl-noun, -s: plural-suffix

Morphotactics Commonly represented as a FSA. e.g. Simple FSA for plural formation in English
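A lexicon with stem classes and a morphotactics FSA are already enough for a simple morphological recognizer. The sketch below is an illustration under stated assumptions: the arcs of the plural FSA are a guess at the missing figure (regular nouns may take the plural suffix, irregular forms go straight to a final state), orthographic rules are ignored, and all names are hypothetical.

```python
# Lexicon: stem -> stem class (entries taken from the slides' examples).
LEXICON = {
    "dog": "reg-noun",
    "cat": "reg-noun",
    "goose": "irreg-sg-noun",
    "geese": "irreg-pl-noun",
}
PLURAL_SUFFIX = "s"

# Morphotactics FSA over morpheme classes (assumed arcs).
MORPHOTACTICS = {
    (0, "reg-noun"): 1,
    (1, "plural-suffix"): 2,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
}
FINAL_STATES = {1, 2}

def segmentations(word):
    """Yield the possible sequences of morpheme classes for `word`."""
    if word in LEXICON:
        yield [LEXICON[word]]
    stem = word[: -len(PLURAL_SUFFIX)]
    if word.endswith(PLURAL_SUFFIX) and stem in LEXICON:
        yield [LEXICON[stem], "plural-suffix"]

def recognize(word: str) -> bool:
    """Accept `word` if some segmentation is licensed by the morphotactics."""
    for classes in segmentations(word):
        state = 0
        for morpheme_class in classes:
            state = MORPHOTACTICS.get((state, morpheme_class))
            if state is None:
                break             # this class may not follow the previous one
        else:
            if state in FINAL_STATES:
                return True
    return False

print(recognize("geese"))    # True
print(recognize("gooses"))   # False
print(recognize("dogs"))     # True
```

A form such as cities would additionally need the orthographic rule city+s -> cities, which this sketch deliberately leaves out.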

Morphotactics
In cases where a morphological process is more complicated, or not fully productive (unhappy, unreal but *unbig, *unred), the morphotactics FSA may become quite complicated and many different stem classes may be necessary.