Computational Lexicology, Morphology and Syntax Diana Trandabăț Academic year 2014-2015.



Today’s Topics: Finite State Technology; Regular Languages and Relations; Review of Set Theory. Understand the mathematical operations that can be performed on such languages. Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.

What is Finite State Technology? Finite State Technology refers to a collection of techniques for application of Finite State Automata (FSA) to a range of linguistically motivated problems. Such Techniques include – Design of user languages for specifying FSA – Compilation of such languages into efficient transition networks. – Development environments and runtime systems

What is Finite-State Technology Good For? Finite-state techniques cannot handle center embedding –"the man the dog the cat bit followed ate". They are well suited to “lower-level” natural language processing such as –Tokenization: what is the next word? –Spelling error detection: does the next word belong to a list? –Morphological/phonological analysis/generation –Shallow syntactic parsing and “chunking”

Languages, Notations and Machines LANGUAGE (set of strings) NOTATION MACHINE

Languages, Notations and Machines FINITE STATE LANGUAGE FINITE STATE NOTATION FINITE STATE AUTOMATON

FINITE STATE AUTOMATA: preliminary definition A finite state automaton includes: A finite set of states A finite set of labelled transitions between states

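Physical Machines with Finite States The Lightswitch Machine [Diagram: two states, OFF and ON, with transitions labelled UP and DOWN.]

Physical Machines with Finite States The Lightswitch Toggle Machine [Diagram: two states, OFF and ON, with transitions labelled PUSH.]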

The Five Cent Machine Problem: Assume you have one, two, and five cent pieces. Design a finite state automaton which accepts exactly 5 cents.

The Cola Machine Need to enter 25 cents (USA) to get a drink Accepts the following coins: –Nickel = 5 cents –Dime = 10 cents –Quarter = 25 cents For simplicity, our machine needs exact change We will model only the coin-accepting mechanism

Physical Machines with Finite States The Cola Machine [Diagram: coin-accepting automaton with a Start State (0) and a Final State; transitions labelled N, D and Q.]

The Cola Machine Language List of all the sequences of coins accepted: –{ Q, DDN, DND, NDD, DNNN, NDNN, NNDN, NNND, NNNNN } Think of the coins as SYMBOLS or CHARACTERS. The set of symbols accepted is the ALPHABET of the machine. Think of sequences of coins as WORDS or “strings”. The set of words accepted by the machine is its LANGUAGE.
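The coin-accepting mechanism can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each state is taken to be the number of cents entered so far, and the names `COIN_VALUE`, `step`, `accepts` and `language` are hypothetical, not part of the slides.

```python
from itertools import product

COIN_VALUE = {"N": 5, "D": 10, "Q": 25}  # the alphabet: nickel, dime, quarter
TARGET = 25                              # exact change: the single final state

def step(state, coin):
    """Follow the arc labelled `coin`; return None if no such arc exists."""
    total = state + COIN_VALUE[coin]
    return total if total <= TARGET else None

def accepts(coins):
    """True if the coin sequence leads from state 0 to the final state."""
    state = 0
    for coin in coins:
        state = step(state, coin)
        if state is None:      # the coin would overshoot 25 cents
            return False
    return state == TARGET

def language():
    """Enumerate every accepted word (at most five coins are ever needed)."""
    words = ["".join(c) for n in range(1, 6) for c in product("NDQ", repeat=n)]
    return [w for w in words if accepts(w)]

print(language())
```

Because the machine has no cycles, its language is finite, and the enumeration yields exactly nine words.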

FINITE STATE AUTOMATA: better definition A finite state automaton includes: A finite set of states – Initial State – Final State(s) A finite set of labelled transitions between states Labels are symbols from an alphabet Recognises a language Generates a language as well!

A Network that Accepts a One Word Language [Diagram: a chain of arcs spelling the word canto.]

A Network that Accepts a Three Word Language [Diagram: a network with one path for each of the words canto, tigre and mesa.]

Scaling Up the Network Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language. We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language. This is the basis for a Spanish spelling error detector.

Looking Up a Word [Diagram: the word mesa is “applied” to the canto/tigre/mesa network and follows the path m-e-s-a.]

Lookup Failure Lookup succeeds when all input is consumed and a final state is reached. Lookup can fail because: not all input can be consumed ("libro", "tigra"); input is fully consumed but the state is not final ("cant"); a final state is reached but there is still unconsumed input ("mesas").
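The success and failure cases can be tried out on a small letter network. The sketch below assumes the three-word language canto, tigre, mesa and represents the network as nested dictionaries (a trie); `FINAL`, `build_network` and `lookup` are hypothetical names chosen for illustration.

```python
FINAL = None  # dictionary key used to mark a state as final

def build_network(words):
    """Build a trie: each state is a dict mapping a symbol to the next state."""
    start = {}
    for word in words:
        state = start
        for symbol in word:
            state = state.setdefault(symbol, {})
        state[FINAL] = True
    return start

def lookup(network, word):
    """Apply `word` to the network and report success or the kind of failure."""
    state = network
    for position, symbol in enumerate(word):
        if symbol not in state:
            return f"fail: input not consumed at position {position}"
        state = state[symbol]
    if FINAL not in state:
        return "fail: input consumed but state is not final"
    return "ok"

network = build_network(["canto", "tigre", "mesa"])
for word in ["mesa", "libro", "tigra", "cant", "mesas"]:
    print(word, "->", lookup(network, word))
```

In this sketch "mesas" is reported as unconsumed input at position 4: the network reaches the final state after "mesa" and has no arc for the trailing "s".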

Shared Structure [Diagram: a network in which several words share states and arcs; arc labels c, l, e, a, e, v, r, e.]

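Transducers [Diagram: a transducer pairing the surface string mesas with the lexical string mesa+Noun+Fem+Pl, aligned as mesa00s against mesa+Noun+Fem+Pl. “Lookup” maps mesas to mesa+Noun+Fem+Pl; “Lookdown” maps mesa+Noun+Fem+Pl to mesas.]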

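A Morphological Analyzer Transducer [Diagram: the transducer relates the surface form dogs to the lexical form dog +n +pl.]

A Morphological Analyzer Transducer [Diagram: the transducer sits between the Surface Language and the Lexical Language.]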
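Bidirectional lookup can be imitated with a toy Python sketch. It is not a real transducer implementation: it simply lists each path as a sequence of lexical:surface symbol pairs and compares one side against the input. The alignments, the `EPS` marker and the names `PATHS`, `side`, `lookup` and `lookdown` are assumptions made for illustration.

```python
EPS = "0"  # empty symbol, written 0 as on the slides

# Each path is a list of (lexical symbol, surface symbol) pairs.
PATHS = [
    [("m", "m"), ("e", "e"), ("s", "s"), ("a", "a"),
     ("+Noun", EPS), ("+Fem", EPS), ("+Pl", "s")],
    [("d", "d"), ("o", "o"), ("g", "g"), ("+n", EPS), ("+pl", "s")],
]

LEXICAL, SURFACE = 0, 1

def side(path, index):
    """Concatenate one side of a path, skipping empty symbols."""
    return "".join(pair[index] for pair in path if pair[index] != EPS)

def lookup(surface_word):
    """Analysis: surface string to lexical string(s)."""
    return [side(p, LEXICAL) for p in PATHS if side(p, SURFACE) == surface_word]

def lookdown(lexical_word):
    """Generation: lexical string to surface string(s)."""
    return [side(p, SURFACE) for p in PATHS if side(p, LEXICAL) == lexical_word]

print(lookup("mesas"))       # ['mesa+Noun+Fem+Pl']
print(lookdown("dog+n+pl"))  # ['dogs']
```

The same table of paths serves both directions, which is the point of the “Lookup”/“Lookdown” pair.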

A Quick Review of Set Theory A set is a collection of objects. We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }. There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.

Uniqueness of Elements You cannot have two or more ‘A’ elements in the same set. { A, A, D, B, E } is just a redundant specification of the set { A, D, B, E }.

Cardinality of Sets The Empty Set: { } A Finite Set: { Norway, Denmark, Sweden } An Infinite Set: e.g. the set of all positive integers

Simple Operations on Sets: Union Set 1 = { A, B, C }, Set 2 = { D, E }. Union of Set 1 and Set 2 = { A, B, C, D, E }.

Simple Operations on Sets (2): Union Set 1 = { A, B, C }, Set 2 = { C, D }. Union of Set 1 and Set 2 = { A, B, C, D }.

Simple Operations on Sets (3): Intersection Set 1 = { A, B, C }, Set 2 = { C, D }. Intersection of Set 1 and Set 2 = { C }.

Simple Operations on Sets (4): Subtraction Set 1 = { A, B, C }, Set 2 = { C, D }. Set 1 minus Set 2 = { A, B }.
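These operations map directly onto Python's built-in `set` type. The sketch assumes Set 1 = { A, B, C } and Set 2 = { C, D }, as the diagrams suggest; the variable names are illustrative.

```python
set1 = {"A", "B", "C"}
set2 = {"C", "D"}

# Order is not significant, and repeated elements collapse into one.
print({"A", "D", "B", "E"} == {"E", "A", "D", "B"})       # True
print({"A", "A", "D", "B", "E"} == {"A", "D", "B", "E"})  # True

union = set1 | set2
intersection = set1 & set2
difference = set1 - set2

# sorted() is used only to print the elements in a stable order.
print(sorted(union))         # ['A', 'B', 'C', 'D']
print(sorted(intersection))  # ['C']
print(sorted(difference))    # ['A', 'B']
print(len(set()))            # 0, the cardinality of the empty set
```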

Formal Languages Very Important Concept in Formal Language Theory: A Language is just a Set of Words. We use the terms “word” and “string” interchangeably. A Language can be empty, have finite cardinality, or be infinite in size. You can union, intersect and subtract languages, just like any other sets.

Union of Languages (Sets) Language 1 = { dog, cat, rat }, Language 2 = { elephant, mouse }. Union of Language 1 and Language 2 = { dog, cat, rat, elephant, mouse }.

Intersection of Languages (Sets) Language 1 = { dog, cat, rat }, Language 2 = { elephant, mouse }. Intersection of Language 1 and Language 2 = { } (the empty language).

Intersection of Languages (Sets) Language 1 = { dog, cat, rat }, Language 2 = { rat, mouse }. Intersection of Language 1 and Language 2 = { rat }.

Subtraction of Languages (Sets) Language 1 = { dog, cat, rat }, Language 2 = { rat, mouse }. Language 1 minus Language 2 = { dog, cat }.

Languages A language is a set of words (=strings). Words (strings) are composed of symbols (letters) that are “concatenated” together. At another level, words are composed of “morphemes”. In most natural languages, we concatenate morphemes together to form whole words. For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.

Concatenation of Languages Root Language = { work, talk, walk }. Suffix Language = { 0, ing, ed, s }, where 0 is the empty string. The concatenation of the Suffix language after the Root language = { work, working, worked, works, talk, talking, talked, talks, walk, walking, walked, walks }.
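Concatenation of two languages pairs every word of the first with every word of the second. A minimal Python sketch, assuming the 0 of the Suffix Language stands for the empty string (written `""` below) and using the hypothetical helper name `concatenate`:

```python
def concatenate(first, second):
    """Every word of `first` followed by every word of `second`."""
    return {x + y for x in first for y in second}

roots = {"work", "talk", "walk"}
suffixes = {"", "ing", "ed", "s"}  # "" plays the role of 0

words = concatenate(roots, suffixes)
print(sorted(words))
print(len(words))  # 12
```

Because the empty string is in the Suffix Language, the bare roots work, talk and walk are themselves members of the result.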

Languages and Networks [Diagram: Network/Language 1 (paths for walk, work, talk) and Network/Language 2 (paths for 0, s, ed, ing), followed by the network that results from the concatenation of Network 1 and Network 2.]

Why is “Finite State” Computing so Interesting? Finite-state systems are mathematically elegant, easily manipulated and modified, computationally efficient, and usually very compact. The programming that linguists do is declarative: we describe the facts of our natural language, i.e. we write grammars. We do not hack ad hoc code. The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent. Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.

Languages, Notations and Machines FINITE STATE LANGUAGE FINITE STATE NOTATION FINITE STATE MACHINE

Regular Expressions Abstract Definition ∅ is a regular expression. ε is a regular expression. If α ∈ Σ is a letter then α is a regular expression. If Ψ and Φ are regular expressions then so are (Ψ + Φ) and (Ψ · Φ). If Φ is a regular expression then so is (Φ)*. Nothing else is a regular expression.

Searching Text Analysis of written texts often involves searching for (and subsequent processing of): – a particular word – a particular phrase – a particular pattern of words involving gaps. How can we specify the things we are searching for?

Regular Expressions and Matching A Regular Expression is a special notation used to specify the things we want to search for. We use regular expressions to define patterns. We then implement a matching operation m(pattern, text) which tries to match the pattern against the text.

Simple Regular Expressions Most ordinary characters match themselves. For example, the pattern sing exactly matches the string sing. In addition, regular expressions provide us with a set of special characters.

The Wildcard Symbol The “.” symbol is called a wildcard: it matches any single character. For example, the expression s.ng matches sang, sing, song, and sung. Note that "." will match not only alphabetic characters, but also numeric and whitespace characters. Consequently, s.ng will also match non-words such as s3ng.

Assignment Draw the FSM which corresponds to s.ng

Repeated Wildcards We can also use the wildcard symbol for counting characters. For instance, ....zy matches six-character strings that end in zy. The pattern t... will match the words "that" and "term". It will also match the word sequence "to a" (since the second "." in the pattern can match the space character).
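The wildcard examples can be checked with Python's standard `re` module, used here only as one convenient regular expression engine. The helper name `matching` and the candidate word lists are assumptions for illustration. Note that in Python "." matches any character except a newline unless the `re.DOTALL` flag is given.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

single = matching(r"s.ng", ["sang", "sing", "song", "sung", "s3ng", "sng"])
six_chars = matching(r"....zy", ["snazzy", "frenzy", "lazy"])
four_chars = matching(r"t...", ["that", "term", "to a", "the"])

print(single)      # ['sang', 'sing', 'song', 'sung', 's3ng']
print(six_chars)   # ['snazzy', 'frenzy']
print(four_chars)  # ['that', 'term', 'to a']
```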

Optionality The “?” symbol indicates that the immediately preceding regular expression is optional. The regular expression colou?r matches both British and American spellings, colour and color.

Repetition The "+" symbol indicates that the immediately preceding expression must occur at least once and may be repeated. For example, the regular expression "coo+l" matches cool, coool, and so on. This symbol is particularly effective when combined with the "." symbol. For example, f.+f matches all strings of length greater than two that begin and end with the letter f (e.g. foolproof).

Repetition 2 The “*” symbol indicates that the immediately preceding expression is both optional and repeatable. For example, .*gnt.* matches all strings that contain gnt.
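The three operators ?, + and * can be compared side by side. The sketch again uses Python's `re` module; `matching` is a hypothetical helper and the candidate strings are made up for illustration.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional = matching(r"colou?r", ["color", "colour", "colouur"])
at_least_once = matching(r"coo+l", ["col", "cool", "coool"])
framed_by_f = matching(r"f.+f", ["ff", "faf", "foolproof", "fool"])
contains_gnt = matching(r".*gnt.*", ["sovereignty", "signal"])

print(optional)       # ['color', 'colour']
print(at_least_once)  # ['cool', 'coool']
print(framed_by_f)    # ['faf', 'foolproof']
print(contains_gnt)   # ['sovereignty']
```

The string ff is rejected by f.+f because ".+" needs at least one character between the two f's.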

Character Class The [ ] notation, which enumerates the set of characters to be matched, is called a character class. For example, we can match any English vowel, but no consonant, using [aeiou]. We can combine the [ ] notation with our notation for repeatability. For example, the expression p[aeiou]+t matches peat, poet, and pout.

The Choice Operator Often the choices we want to describe cannot be expressed at the level of individual characters. In such cases we use the choice operator "|" to indicate the alternate choices. The operands can be any expression. For instance, jack | gill will match either jack or gill.

Choice Operator 2 Note that the choice operator has wide scope, so that abc|def is a choice between abc and def, and not between abcef and abdef. The latter choice must be written using parentheses: ab(c|d)ef
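The scope of the choice operator is easy to verify in Python's `re` module. One assumption to note: in Python the spaces in "jack | gill" would be treated as literal characters, so the pattern is written without them here. The helper name `matching` is hypothetical.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

names = matching(r"jack|gill", ["jack", "gill", "jill"])

candidates = ["abc", "def", "abcef", "abdef"]
wide_scope = matching(r"abc|def", candidates)       # choice between abc and def
narrow_scope = matching(r"ab(c|d)ef", candidates)   # choice between c and d only

print(names)         # ['jack', 'gill']
print(wide_scope)    # ['abc', 'def']
print(narrow_scope)  # ['abcef', 'abdef']
```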

Ranges The [ ] notation is used to express a set of choices between individual characters. Instead of listing each character, it is also possible to express a range of characters, using the - operator. For example, [a-z] matches any lowercase letter.

Exercise Write regular expressions matching: all 1-digit numbers; all 2-digit numbers; all date expressions such as 12/12/1950.

Ranges II Ranges can be combined with other operators. For example, [A-Z][a-z]* matches words that have an initial capital letter followed by any number of lowercase letters. Ranges can be combined as in [A-Za-z], which matches any alphabetical character.
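Character classes and ranges behave the same way in Python's `re` module. In the sketch below the helper `matching` and the candidate strings are assumptions for illustration; note that a range such as [a-z] covers only the unaccented ASCII letters.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

vowels = matching(r"[aeiou]", ["a", "e", "b", "A"])
p_words = matching(r"p[aeiou]+t", ["peat", "poet", "pout", "pt", "part"])
lowercase = matching(r"[a-z]", ["q", "Q", "7"])
capitalised = matching(r"[A-Z][a-z]*", ["Hello", "H", "hello", "HELLO"])
letters = matching(r"[A-Za-z]", ["x", "X", "_"])

print(vowels)       # ['a', 'e']
print(p_words)      # ['peat', 'poet', 'pout']
print(lowercase)    # ['q']
print(capitalised)  # ['Hello', 'H']
print(letters)      # ['x', 'X']
```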

Assignment What does the following expression match? [b-df-hj-np-tv-z]+

Complementation The character class [b-df-hj-np-tv-z] allows us to match consonants. However, this expression is quite cumbersome. A better alternative is to say: let’s match anything which isn’t a vowel. To do this, we need a way of expressing complementation.

Complementation 2 We do this using the symbol “^” as the first character within the class expression [ ]. [^aeiou] is just like our earlier character class, except now the set of vowels is preceded by ^. The expression as a whole is interpreted as matching anything which fails to match [aeiou]. In other words, it matches all lowercase consonants (plus all uppercase letters and non-alphabetic characters).

Complementation 3 As another example, suppose we want to match any string which is enclosed by the HTML tags for boldface, namely <B> and </B>. We might try something like this: <B>.*</B>. This would successfully match <B>important</B>, but would also match <B>important</B> and <B>urgent</B> as a single string, since the .* subpattern will happily match all the characters from the start of important to the end of urgent.

Complementation 4 One way of ensuring that we only look at matched pairs of tags would be to use the expression <B>[^<]*</B>, where the character class matches anything other than a left angle bracket. Finally, note that character class complementation also works with ranges. Thus [^a-z] matches anything other than the lower case alphabetic characters a through z.
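The difference between the greedy pattern and the complemented character class shows up clearly when both are run over the same text. This sketch uses Python's `re.findall` and assumes the boldface tags are written <B> and </B>; the sample strings are made up for illustration.

```python
import re

text = "<B>important</B> and <B>urgent</B>"

# .* is greedy: it runs on to the last </B> in the text.
greedy = re.findall(r"<B>.*</B>", text)

# [^<]* cannot cross a left angle bracket, so each tag pair is matched alone.
paired = re.findall(r"<B>[^<]*</B>", text)

# A complemented class matches anything that is not a lowercase vowel.
non_vowels = re.findall(r"[^aeiou]", "Hi, you!")

# Complementation also works with ranges.
non_lowercase = re.findall(r"[^a-z]", "abC 9z")

print(greedy)         # ['<B>important</B> and <B>urgent</B>']
print(paired)         # ['<B>important</B>', '<B>urgent</B>']
print(non_vowels)     # ['H', ',', ' ', 'y', '!']
print(non_lowercase)  # ['C', ' ', '9']
```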

Other Special Symbols Two other important symbols are “^” and “$”, which are used to anchor matches to the beginnings or ends of lines in a file. Note: “^” has two quite distinct uses: it is interpreted as complementation when it occurs as the first symbol within a character class, and as matching the beginning of lines when it occurs elsewhere in a pattern.

Special Symbols 2 As an example, [a-z]*s$ will match words ending in s that occur at the end of a line. Finally, consider the pattern ^$; this matches strings where no character occurs between the beginning and the end of a line, in other words, empty lines.

Special Symbols 3 Special characters like “.”, “*”, “+” and “$” give us powerful means to generalise over character strings. But suppose we wanted to match against a string which itself contains one or more special characters. An example would be the arithmetic statement $5.00 * ($ $0.85). In this case, we need to resort to the so-called escape character “\” (“backslash”). For example, to match a dollar amount, we might use \$[1-9][0-9]*\.[0-9][0-9]
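Anchors and escapes can be tried out with Python's `re` module. One Python-specific assumption: “^” and “$” match at every line boundary only when the `re.MULTILINE` flag is passed; without it they anchor to the whole string. The sample texts are made up for illustration.

```python
import re

text = "the cats\nsleeps here\n\nlast words"

# Words ending in s at the end of a line.
line_final = re.findall(r"[a-z]*s$", text, flags=re.MULTILINE)

# ^$ matches once per empty line.
empty_lines = len(re.findall(r"^$", text, flags=re.MULTILINE))

# Escaped $ and . are matched literally.
prices = "Total: $5.00, tax $0.85, fee $12.50"
amounts = re.findall(r"\$[1-9][0-9]*\.[0-9][0-9]", prices)

print(line_final)   # ['cats', 'words']
print(empty_lines)  # 1
print(amounts)      # ['$5.00', '$12.50']
```

The amount $0.85 is not found because the pattern requires the first digit after the dollar sign to be in the range 1-9.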

Summary Regular Expressions are a special notation. Regular expressions describe patterns which can be matched in text. A particular regular expression E stands for a set of strings. We can thus say that E describes a language.

Great! See you next time!