1 Chapter 20 Understanding Language

2 Chapter 20 Contents (1)
• Natural Language Processing
• Morphologic Analysis
• BNF
• Rewrite Rules
• Regular Languages
• Context-Free Grammars
• Context-Sensitive Grammars
• Recursively Enumerable Grammars
• Parsing
• Transition Networks
• Augmented Transition Networks

3 Chapter 20 Contents (2)
• Chart Parsing
• Semantic Analysis
• Ambiguity and Pragmatic Analysis
• Machine Translation
• Language Identification
• Information Retrieval
• Stemming
• Precision and Recall

4 Natural Language Processing
• Natural languages are human languages such as English and Chinese, as opposed to formal languages such as C++ and Prolog.
• NLP enables computer systems to understand written or spoken utterances made in human languages.

5 Morphologic Analysis
• The first analysis stage in an NLP system.
• Morphology concerns the components that make up words.
  – These components often have grammatical significance, such as “-es”, “-ed” and “-ing”.
• Morphologic analysis can be useful in identifying which part of speech (noun, verb, etc.) a word is.
• This is vital for syntactic analysis.
• Parts of speech can be identified by checking against a list of standard endings (such as “-ly” for adverbs).
• This works well for regular words, but will not work for irregular ones such as the verb “go”.
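The suffix-based approach described above can be sketched as follows. The suffix table and the list of irregular forms are small illustrative assumptions, not a complete morphological analyser:

```python
# Suffixes checked in order; each maps a word ending to a likely part
# of speech. This table is illustrative, not exhaustive.
SUFFIX_TAGS = [
    ("ly", "adverb"),
    ("ing", "verb"),
    ("ed", "verb"),
    ("es", "verb"),
    ("s", "noun"),
]

# Irregular forms must be listed explicitly; suffix rules cannot handle them.
IRREGULAR = {"go": "verb", "went": "verb", "children": "noun"}

def guess_pos(word):
    """Guess a part of speech from the word's ending."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return "unknown"
```

Note that the rules fire in order, so a more specific suffix such as “-ing” must be listed before the bare “-s”.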

6 BNF (1)
• A grammar defines the syntactic rules for a language.
• Backus-Naur Form (also called Backus Normal Form) is used to define a grammar in terms of:
  – Terminal symbols
  – Non-terminal symbols
  – The start symbol
  – Rewrite rules

7 BNF (2)
• Terminal symbols: the symbols (or words) that are used in the language. In English, for example, these are the letters of the Roman alphabet.
• Non-terminal symbols: symbols such as noun and verb that are used to define parts of the language.
• The start symbol represents a complete sentence.
• Rewrite rules define the structure of the grammar.

8 Rewrite Rules
• For example:
  Sentence → NounPhrase VerbPhrase
• The rule states that the item on the left can be rewritten in the form on the right.
• This rule says that one valid form for a sentence is a noun phrase followed by a verb phrase.
• Two more complex examples:
  NounPhrase → Noun | Article Noun | Adjective Noun | Article Adjective Noun
  VerbPhrase → Verb | Verb NounPhrase | Adverb Verb NounPhrase
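These rewrite rules can be written down directly as data and applied by repeatedly expanding symbols. A minimal sketch, in which the terminal word lists are illustrative assumptions:

```python
import random

# The rewrite rules above as a dictionary: each non-terminal maps to
# its list of alternatives (the "|" cases). Symbols not in the
# dictionary are terminals. The word lists are illustrative.
GRAMMAR = {
    "Sentence":   [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Noun"], ["Article", "Noun"],
                   ["Adjective", "Noun"], ["Article", "Adjective", "Noun"]],
    "VerbPhrase": [["Verb"], ["Verb", "NounPhrase"],
                   ["Adverb", "Verb", "NounPhrase"]],
    "Noun":      [["cat"], ["road"]],
    "Article":   [["the"]],
    "Adjective": [["black"]],
    "Adverb":    [["slowly"]],
    "Verb":      [["crossed"]],
}

def generate(symbol, rng):
    """Expand a symbol by repeatedly applying rewrite rules."""
    if symbol not in GRAMMAR:        # terminal: emit the word itself
        return [symbol]
    expansion = rng.choice(GRAMMAR[symbol])
    words = []
    for sym in expansion:
        words.extend(generate(sym, rng))
    return words

print(" ".join(generate("Sentence", random.Random(0))))
```

Every sentence this produces is, by construction, valid under the grammar, even if it is semantically odd.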

9 Regular Languages
• The simplest type of grammar in Chomsky’s hierarchy.
• Regular languages can be described by finite state automata (FSAs).
• A regular expression is a sentence defined by a regular language.
• Regular languages are of interest to computer scientists, but are of no use for NLP, as they cannot describe even simple formal languages, let alone human languages.
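An FSA for a regular language is just a transition table plus a set of accepting states. A minimal sketch, using the illustrative regular language a b* c (an “a”, any number of “b”s, then a “c”); the states and alphabet are assumptions:

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    ("S1", "a"): "S2",
    ("S2", "b"): "S2",   # loop: any number of b's
    ("S2", "c"): "S3",
}
ACCEPTING = {"S3"}

def accepts(string):
    """Run the FSA; accept only if we end in an accepting state."""
    state = "S1"
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:       # no arc for this symbol: reject
            return False
    return state in ACCEPTING
```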

10 Context-Free Grammars
• The rewrite rules we saw above define a context-free grammar.
• They define which words can be used together, but do not take context into account.
• They allow sentences that are not grammatically correct, such as: Chickens eats dog.
• A context-free grammar can have only a single (non-terminal) symbol on the left-hand side of each of its rewrite rules.

11 Context-Sensitive Grammars
• A context-sensitive grammar can have more than one symbol on the left-hand side of its rewrite rules.
• This allows the rules to specify context – such as case, gender and number. E.g.:
  A X B → A Y B
• This says that in the context of A and B, X can be rewritten as Y.

12 Recursively Enumerable Grammars
• The most complex grammars in Chomsky’s hierarchy.
• There are no restrictions on the rewrite rules of these grammars.
• Also known as unrestricted grammars.
• Not useful for NLP.

13 Parsing
• Parsing involves determining the syntactic structure of a sentence.
• Parsing first tells us whether a sentence is valid or not.
• A parsed sentence is usually represented as a parse tree.
• The tree shown here is for the sentence “the black cat crossed the road”.
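A parse tree of this kind can be represented as nested tuples of (label, children…). The sketch below assumes the conventional labels S, NP, VP, DET, ADJ, N and V for the tree on the slide:

```python
# The parse tree for "the black cat crossed the road", written as
# nested (label, children...) tuples. Leaf children are plain strings.
tree = ("S",
        ("NP", ("DET", "the"), ("ADJ", "black"), ("N", "cat")),
        ("VP", ("V", "crossed"),
               ("NP", ("DET", "the"), ("N", "road"))))

def leaves(node):
    """Read the sentence back off the tree's leaf nodes, left to right."""
    if isinstance(node, str):
        return [node]
    label, *children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

print(" ".join(leaves(tree)))   # the black cat crossed the road
```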

14 Transition Networks
• Transition networks are FSAs used to represent grammars.
• A transition network parser uses transition networks to parse sentences.
• In the following examples, S1 is the start state; the accepting state has a heavy border.
• When a word matches an arc from the current state, the arc is followed to the new state.
• If no match is found, a different transition network must be used.

15 Transition Networks – examples (1)

16 Transition Networks – examples (2)
• These transition networks represent rewrite rules with terminal symbols:
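A transition network differs from the plain FSA shown earlier in that its arcs are labelled with word categories rather than individual symbols, so a lexicon is consulted for each word. A minimal sketch for a NounPhrase network, assuming the shape S1 --Article--> S2 --Noun--> S3 with an optional Adjective loop; the network shape and lexicon are illustrative assumptions:

```python
# Illustrative lexicon mapping words to their categories.
LEXICON = {"the": "Article", "black": "Adjective",
           "cat": "Noun", "road": "Noun"}

# Arcs labelled with categories: (state, category) -> next state.
ARCS = {
    ("S1", "Article"): "S2",
    ("S2", "Adjective"): "S2",   # allow any number of adjectives
    ("S2", "Noun"): "S3",
}
ACCEPT = {"S3"}

def matches_noun_phrase(words):
    """Follow arcs whose category labels match each word in turn."""
    state = "S1"
    for word in words:
        category = LEXICON.get(word)
        state = ARCS.get((state, category))
        if state is None:
            return False
    return state in ACCEPT
```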

17 Augmented Transition Networks
• ATNs are transition networks with the ability to apply conditions (such as tests for gender, number or case) to arcs.
• Each arc has one or more procedures attached to it that check these conditions.
• These procedures can also build up a parse tree while the network is applied to a sentence.

18 Inefficiency
• Using transition networks to parse sentences such as the following can be inefficient – it involves backtracking if the wrong interpretation is tried first:
  Have all the fish been fed?
  Have all the fish.

19 Chart Parsing (1)
• Chart parsing avoids the backtracking problem.
• At most, a chart parser will examine a sentence of n words in O(n³) time.
• Chart parsing involves manipulating charts, which consist of edges, vertices and words:

20 Chart Parsing (2)
• The edge notation is as follows:
  [x, y, A → B ● C]
• This edge connects nodes x and y. It says that to create an A, we need a B and a C; the dot shows that we have already found the B and still need to find a C.

21 Chart Parsing (3)
• To start with, a chart is as shown above.
• The chart parser can add edges to the chart according to these rules:
  1) If we have an edge [x, y, A → B ● C], an edge can be added that supplies that C – i.e. the empty edge [y, y, C → ● E], where C → E is a rule of the grammar.
  2) If we have [x, y, A → B ● C D] and [y, z, C → E ●], then we can form the edge [x, z, A → B C ● D].
  3) If we have an edge [x, y, A → B ● C] and the word at y is of type C, then we can add [x, y+1, A → B C ●].
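The three chart rules (predict a needed constituent, complete a waiting edge, scan a word) can be sketched as an Earley-style recognizer. Edges are tuples (start, end, head, found, remaining); the toy grammar and lexicon below are illustrative assumptions:

```python
# Toy grammar and lexicon for the sketch.
RULES = {
    "S":  [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP")],
}
LEXICON = {"the": "Det", "cat": "N", "road": "N", "crossed": "V"}

def recognise(words):
    """Add edges until no rule produces anything new (a fixpoint)."""
    edges = {(0, 0, "GAMMA", (), ("S",))}     # dummy start edge wanting an S
    while True:
        new = set()
        for x, y, head, found, rest in edges:
            if rest:
                needed = rest[0]
                # Rule 1 (predict): empty edge at y that can supply `needed`.
                for rhs in RULES.get(needed, []):
                    new.add((y, y, needed, (), rhs))
                # Rule 3 (scan): consume the word at y if its category matches.
                if y < len(words) and LEXICON.get(words[y]) == needed:
                    new.add((x, y + 1, head, found + (needed,), rest[1:]))
            else:
                # Rule 2 (complete): extend edges ending at x that want `head`.
                for x2, y2, h2, f2, r2 in edges:
                    if y2 == x and r2 and r2[0] == head:
                        new.add((x2, y, h2, f2 + (head,), r2[1:]))
        if new <= edges:
            break
        edges |= new
    # Accept if the dummy edge was completed across the whole sentence.
    return (0, len(words), "GAMMA", ("S",), ()) in edges
```

Because completed edges are stored and reused rather than re-derived, no work is repeated when an interpretation fails, which is how chart parsing avoids backtracking.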

22 Semantic Analysis
• After determining the syntactic structure of a sentence, we need to determine its meaning: its semantics.
• We can use semantic nets to represent the various components of a sentence and the relationships between them. E.g.:

23 Ambiguity and Pragmatic Analysis
• Unlike formal languages, human languages contain a great deal of ambiguity. E.g.:
  – “General flies back to front”
  – “Fruit flies like a bat”
• These sentences are ambiguous, but we can disambiguate them using world knowledge (fruit doesn’t fly).
• NLP systems need to use a number of approaches to disambiguate sentences.

25 Syntactic Ambiguity
• This is where a sentence has two (or more) correct ways to parse it:

26 Semantic Ambiguity
• This is where a sentence has more than one possible meaning.
• It often arises as a result of syntactic ambiguity.

27 Referential Ambiguity
• Occurs as a result of the use of anaphoric expressions:
  John gave the ball to the dog. It wagged its tail.
• Was it John, the ball or the dog that wagged?
• Of course, humans know the answer; an NLP system needs world knowledge to disambiguate.

28 Disambiguation
• Probabilistic approach:
  – The word “bat” is usually used to refer to the sporting implement.
  – The word “bat”, when used in a scientific article, usually means the winged mammal.
• Context is also useful:
  – “I went into the cave. It was full of bats.”
  – “I looked in the locker. It was full of bats.”
• A good, relevant world model with knowledge about the universe of discourse is vital.

29 Machine Translation
• One of the earliest goals of NLP – indeed, one of the early goals of AI.
• Translating entire sentences from one human language to another is extremely difficult to automate.
• Ambiguities in one language may not be ambiguous in another (e.g. “bat”).
• Syntax and semantics are usually not enough – world knowledge is also needed.
• Machine translation systems exist (e.g. Babelfish), but none achieve 100% accuracy.

30 Language Identification
• A similar problem to machine translation – but much easier.
• Particularly useful for Internet documents, which usually carry no indication of the language they are written in.
• The Acquaintance algorithm uses n-grams.
• An n-gram is a sequence of n letters.
• Statistics exist that correlate 3-grams with languages.
• E.g. “ing”, “and”, “ent” and “the” occur very often in English but less frequently in other languages.
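The 3-gram idea can be sketched by counting a text's trigrams and scoring them against a per-language profile. The tiny profiles below are illustrative assumptions, not real corpus statistics:

```python
from collections import Counter

# Illustrative trigram profiles; a real system would use frequency
# statistics gathered from large corpora.
PROFILES = {
    "english": {"the", "ing", "and", "ent"},
    "french":  {"les", "ent", "que", "ion"},
}

def trigrams(text):
    """Count all overlapping 3-letter sequences in the text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def identify(text):
    """Pick the language whose profile trigrams occur most often."""
    counts = trigrams(text)
    scores = {lang: sum(counts[g] for g in profile)
              for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)
```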

31 Information Retrieval
• Matching queries to a corpus of documents (such as the Internet).
• One approach uses Bayes’ theorem:
  – This assumes that the most important words in a query are the least common ones.
  – E.g. if “elephants in New York” is submitted as a query to the New York Times, the word “elephants” is the one that carries the most information.
• Stop words are ignored – “the”, “and”, “if”, “not”, etc.

32 TF-IDF (1)
• Term Frequency – Inverse Document Frequency.
• The Inverse Document Frequency for a word W is:
  IDF(W) = log( |D| / DF(W) )
  – |D| is the number of documents in the corpus.
  – DF(W) is the number of documents in the corpus that contain the word W.
• TF(W, D) is the number of times word W occurs in document D.

33 TF-IDF (2)
• The TF-IDF value for a word and document is:
  TF-IDF(D, Wi) = TF(Wi, D) × IDF(Wi)
• This is calculated as a vector for each document, using the words in the query.
• It gives high priority to words that occur infrequently in the corpus, but frequently in a particular document.
• The document whose TF-IDF vector has the greatest magnitude is considered the most relevant to the query.
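The scoring above can be sketched over a toy corpus, using IDF(W) = log(|D| / DF(W)) and ranking documents by the magnitude of their TF-IDF vectors over the query words. The documents and query here are illustrative:

```python
import math

# A toy corpus of three "documents" (lists of words).
corpus = {
    "doc1": "elephants are large animals".split(),
    "doc2": "new york is a large city".split(),
    "doc3": "the zoo in new york has elephants".split(),
}

def idf(word):
    """Inverse document frequency: log(|D| / DF(W))."""
    df = sum(1 for doc in corpus.values() if word in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf(word, doc):
    """Term frequency: occurrences of the word in the document."""
    return corpus[doc].count(word)

def score(query_words, doc):
    """Magnitude of the document's TF-IDF vector over the query words."""
    return math.sqrt(sum((tf(w, doc) * idf(w)) ** 2 for w in query_words))

query = ["elephants", "new", "york"]
best = max(corpus, key=lambda d: score(query, d))
print(best)   # doc3 matches all three query words
```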

34 Stemming
• Removing common suffixes from words (such as “-ing”, “-ed”, “-es”).
• This means that the query “swims” will match “swimming” and “swimmers”.
• Porter’s algorithm is the one most commonly used.
• Stemming has been shown to give some improvement in the performance of information retrieval systems.
• It is less successful when applied to names – e.g. “Ted Turner” might match “Ted turned”.
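A crude suffix-stripping sketch in the spirit of (but far simpler than) Porter's algorithm; the suffix list, including the doubled-consonant hacks “ming” and “mers”, is an illustrative assumption:

```python
# Suffixes checked longest-first; the minimum-stem-length check stops
# short words like "as" or "is" from being mangled.
SUFFIXES = ["ming", "mers", "ing", "ers", "ed", "es", "s"]

def stem(word):
    """Strip the first matching suffix, leaving a stem of 3+ letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

With this sketch, “swims”, “swimming” and “swimmers” all reduce to the same stem, so they match as the slide describes; the real Porter algorithm has several ordered rule phases and measure conditions instead of ad hoc entries.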

35 Precision and Recall
• 100% precision means no false positives:
  – All returned documents are relevant.
• 100% recall means no false negatives:
  – All relevant documents are returned.
• In practice, high precision tends to mean low recall, and vice versa.
• The holy grail of IR is to achieve both high precision and high recall.
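Both measures follow directly from the sets of returned and relevant documents. A minimal sketch with illustrative document IDs:

```python
def precision(returned, relevant):
    """Fraction of returned documents that are relevant."""
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of relevant documents that were returned."""
    return len(returned & relevant) / len(relevant)

returned = {"d1", "d2", "d3", "d4"}   # what the system gave back
relevant = {"d2", "d4", "d5"}         # what it should have given back

print(precision(returned, relevant))  # 2 of 4 returned are relevant: 0.5
print(recall(returned, relevant))     # 2 of 3 relevant were returned
```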