Language Modeling
Speech recognition is enhanced if the application can verify the grammatical structure of the speech. This requires an understanding of formal language theory. Formal language theory covers much of the same ground as the CS subject of Theory of Computation, but was developed independently (Chomsky).

Formal Grammars (Chomsky 1950)
Formal grammar definition: G = (N, T, S0, P, F)
o N is a set of non-terminal symbols (or states)
o T is the set of terminal symbols (N ∩ T = {})
o S0 is the start symbol
o P is a set of production rules
o F (a subset of N) is a set of final symbols
Right regular grammar productions have the forms B → a, B → aC, or B → ε, where B, C ∈ N and a ∈ T.
Context-free (programming language) productions have the form B → w, where B ∈ N and w is a possibly empty string over N ∪ T.
Context-sensitive (natural language) productions have the forms αAβ → αγβ or αAβ → ε, where A ∈ N, α, γ, β ∈ (N ∪ T)*, and |αAβ| ≤ |αγβ|.
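As a concrete illustration of the tuple definition, a grammar can be written down directly as data. This is a minimal sketch written for this write-up; the toy symbols and productions are not from the slides.

```python
# A minimal sketch of a formal grammar as plain Python data (illustrative only).
N = {"S", "NP", "VP", "Det", "Noun"}          # non-terminal symbols
T = {"I", "prefer", "a", "flight"}            # terminal symbols; N and T are disjoint
S0 = "S"                                      # start symbol
P = {                                         # productions: left-hand side -> right-hand sides
    "S":    [["NP", "VP"]],
    "NP":   [["I"], ["Det", "Noun"]],
    "VP":   [["prefer", "NP"]],
    "Det":  [["a"]],
    "Noun": [["flight"]],
}

def is_context_free(productions, nonterminals):
    # Context free: every left-hand side is a single non-terminal symbol.
    return all(lhs in nonterminals for lhs in productions)

print(is_context_free(P, N))   # True
```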

Chomsky Language Hierarchy

Example Grammar (L0)

Classifying the Chomsky Grammars
Regular: The left-hand side contains one non-terminal; the right-hand side contains at most one non-terminal. Regular expressions and FSAs fit this category.
Context free: The left-hand side contains one non-terminal; the right-hand side mixes terminals and non-terminals. Can be parsed with a tree-based algorithm.
Context sensitive: The left-hand side has both terminals and non-terminals. The only restriction is that the length of the left side is no greater than the length of the right side. Parsing algorithms become difficult.
Turing equivalent: All rules are fair game. These languages have the computational power of a digital computer.

Context Free Grammars
Capture constituents and ordering
o Regular grammars are too limited to represent this structure
Context Free Grammars consist of
o A set of non-terminal symbols N
o A finite alphabet of terminals Σ
o A set of productions A → α such that A ∈ N and α is a string in (Σ ∪ N)*
o A designated start symbol
Characteristics
o Used for programming language syntax
o Adequate for basic natural language grammatical syntax
o Too restrictive to capture all of the nuances of typical speech
Chomsky (1956), Backus (1959)

Context Free Grammar Example G = (N, T, S0, P, F)

Lexicon for L0: rule-based languages

Top Down Parsing
Driven by the grammar, working down to the words.
Parse of "I prefer a morning flight":
[S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]
Rules used: S → NP VP, NP → Pro, Pro → I, VP → V NP, V → prefer, NP → Det Nom, Det → a, Nom → Noun Nom, Noun → morning, Noun → flight
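A small sketch of how such a top-down parse might be computed: a naive recursive-descent parser with backtracking written for this write-up, not the slides' algorithm. The rule Nom → Noun is an assumed addition so the example sentence parses.

```python
# Naive top-down parsing (recursive descent with backtracking) over the slide's
# grammar; Nom -> Noun is an assumed extra rule so the recursion can finish.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["Pro"], ["Det", "Nom"]],
    "VP":   [["V", "NP"]],
    "Nom":  [["Noun", "Nom"], ["Noun"]],
    "Pro":  [["I"]],
    "V":    [["prefer"]],
    "Det":  [["a"]],
    "Noun": [["morning"], ["flight"]],
}

def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way symbol can derive words starting at pos."""
    if symbol not in GRAMMAR:                         # terminal symbol: must match the word
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:                       # expand each production top-down
        partials = [([], pos)]                        # partial child lists with positions
        for child in rhs:
            partials = [(children + [tree], new_pos)
                        for children, p in partials
                        for tree, new_pos in parse(child, words, p)]
        for children, end in partials:
            yield [symbol] + children, end

words = "I prefer a morning flight".split()
full_parses = [tree for tree, end in parse("S", words, 0) if end == len(words)]
print(full_parses[0])   # nested-list parse tree covering the whole sentence
```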

Bottom Up Parsing
Driven by the words, working up.
The grammar:
0) S → E $
1) E → E + T | E - T | T
2) T → T * F | T / F | F
3) F → num | id
The bottom-up parse:
1) id - num * id
2) F - num * id
3) T - num * id
4) E - num * id
5) E - F * id
6) E - T * id
7) E - T * F
8) E - T
9) E
10) S  (a correct sentence)
Note: If there is no rule that applies, backtracking is necessary.
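For comparison, here is a small sketch (not from the slides) of bottom-up parsing by repeated reduction with backtracking, using the expression grammar above; the end-of-input marker $ is included in the sentence so the S → E $ rule can fire.

```python
# A minimal sketch of bottom-up parsing by repeated reduction, with backtracking
# when no sequence of reductions leads to the start symbol (illustrative only).
RULES = [                    # (left-hand side, right-hand side) of the slide's grammar
    ("S", ["E", "$"]),
    ("E", ["E", "+", "T"]), ("E", ["E", "-", "T"]), ("E", ["T"]),
    ("T", ["T", "*", "F"]), ("T", ["T", "/", "F"]), ("T", ["F"]),
    ("F", ["num"]), ("F", ["id"]),
]

def reduce_to_start(symbols, trace):
    """Try every possible reduction; succeed if the sentence reduces to S."""
    if symbols == ["S"]:
        return trace
    for lhs, rhs in RULES:
        for i in range(len(symbols) - len(rhs) + 1):
            if symbols[i:i + len(rhs)] == rhs:             # found a handle to reduce
                reduced = symbols[:i] + [lhs] + symbols[i + len(rhs):]
                result = reduce_to_start(reduced, trace + [" ".join(reduced)])
                if result is not None:
                    return result
    return None                                            # dead end: backtrack

sentence = ["id", "-", "num", "*", "id", "$"]
for step in reduce_to_start(sentence, [" ".join(sentence)]):
    print(step)
```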

Top-Down and Bottom-Up
Top-down
o Advantage: Searches only trees that are legal
o Disadvantage: Tries trees that don't match the words
Bottom-up
o Advantage: Only forms trees matching the words
o Disadvantage: Tries trees that make no sense globally
Efficient combined algorithms
o Link top-down expectations with bottom-up data
o Example: Top-down parsing with bottom-up filtering

Stochastic Language Models
A probabilistic view of language modeling.
Problems
o A language model cannot cover all grammatical rules
o Spoken language is often ungrammatical
Possible solutions
o Constrain the search space by emphasizing likely word sequences
o Enhance the grammar to recognize intended sentences even when the sequence doesn't quite satisfy the rules

Probabilistic Context-Free Grammars (PCFG)
Definition: G = (VN, VT, S, R, p)
o VN = set of non-terminal symbols
o VT = set of terminal symbols
o S = start symbol
o R = set of rules
o p = set of rule probabilities
P(S → W | G): the probability that the start symbol S derives the word string W in grammar G.
Training the grammar: count rule occurrences in a training corpus, P(A → β | G) = Count(A → β) / Σγ Count(A → γ).
Goal: Assist in discriminating among competing choices.
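As a sketch of the training step, rule probabilities can be estimated by relative frequency of rule expansions; the counts below are made up for illustration.

```python
# A small sketch of estimating PCFG rule probabilities by relative frequency:
# P(A -> beta | G) = Count(A -> beta) / sum over gamma of Count(A -> gamma).
from collections import Counter, defaultdict

observed_rules = [                          # (LHS, RHS) occurrences from a toy "treebank"
    ("S", "NP VP"), ("S", "NP VP"), ("S", "VP"),
    ("NP", "Det Noun"), ("NP", "Pro"), ("NP", "Det Noun"),
]

counts = Counter(observed_rules)
lhs_totals = defaultdict(int)
for (lhs, _), c in counts.items():
    lhs_totals[lhs] += c

probabilities = {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}
for (lhs, rhs), p in sorted(probabilities.items()):
    print(f"P({lhs} -> {rhs}) = {p:.2f}")
# P(NP -> Det Noun) = 0.67, P(NP -> Pro) = 0.33, P(S -> NP VP) = 0.67, P(S -> VP) = 0.33
```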

Phoneme Marking
To apply the concepts of formal language theory, it is helpful to mark phoneme boundaries and parts of speech.
Goal: Mark the start and end of each phoneme.
Research
o Unsupervised, text- and language-independent algorithms have been proposed
o Accuracy: 75% to 80%, which is 5-10% lower than supervised algorithms that make assumptions about the language
If successful, a database of phonemes can be used in conjunction with dynamic time warping to simplify the speech recognition problem.

Phonological Grammars
Phonology: the study of sound combinations.
Sound patterns
o English: 13 binary features, giving 8192 possible combinations
o Complete descriptive grammar
o Rule based, meaning a formal grammar can represent the valid sound combinations in a language
o Unfortunately, these rules are language-specific
Recent research
o Trend towards context-sensitive descriptions
o Little thought given to computational feasibility
o Human listeners likely don't perceive meaning with thousands of rules encoded in their brains

Part of Speech Tagging
Importance
o Resolving ambiguities by assigning lower probabilities to words that don't fit
o Applying the language's grammatical rules to parse the meanings of sentences and phrases

Part of Speech Tagging
Determine a word's lexical class based on its context.

Approaches to POS Tagging
Initialize and maintain tagging criteria
o Supervised: uses pre-tagged corpora
o Unsupervised: automatically induces classes using probability and learning algorithms
o Partially supervised: combines the above approaches
Algorithms
o Rule based: use pre-defined grammatical rules
o Stochastic: use HMMs and other probabilistic algorithms
o Neural: use neural nets to learn the probabilities

Example: "The man ate the fish on the boat in the morning"
Word      Tag
The       Determiner
man       Noun
ate       Verb
the       Determiner
fish      Noun
on        Preposition
the       Determiner
boat      Noun
in        Preposition
the       Determiner
morning   Noun

Word Class Categories
Note: Personal pronouns are usually tagged PRP, possessive pronouns PRP$.

Word Classes
o Open (classes that frequently spawn new words): common nouns, verbs, adjectives, adverbs
o Closed (classes that don't often spawn new words):
prepositions: on, under, over, ...
particles: up, down, on, off, ...
determiners: a, an, the, ...
pronouns: she, he, I, who, ...
conjunctions: and, but, or, ...
auxiliary verbs: can, may, should, ...
numerals: one, two, three, third, ...
Particle: an uninflected item with a grammatical function but not clearly belonging to a major part of speech. Example: He looked up the word.

The Linguistics Problem
Words are often in multiple classes. Example: "this"
o This is a nice day = pronoun
o This day is nice = determiner
o You can go this far = adverb
Accuracy
o 96-97% is a baseline for new algorithms
o 100% is impossible even for human annotators
Tag ambiguity of word types (DeRose, 1988):
Unambiguous (1 tag): 35,340
2 tags: 3,760
3 tags: 264
4 tags: 61
5 tags: 12
6 tags: 2
7 tags: 1

Rule-Based Tagging
Basic idea:
o Assign all possible tags to each word
o Remove tags according to a set of rules, for example:
IF word+1 is an adjective, adverb, or quantifier ending the sentence
AND word-1 is not a verb like "consider"
THEN eliminate non-adverb tags
ELSE eliminate the adverb tag
o English has more than 1,000 such hand-written rules

Rule Based Tagging
First stage: For each word, a morphological analysis algorithm itemizes all possible parts of speech.
Example:
She        PRP
promised   VBD, VBN
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
Second stage: Apply rules to remove possibilities.
Example rule: IF VBD is an option and VBN|VBD follows "PRP" THEN eliminate VBN
After applying the rule:
She        PRP
promised   VBD
to         TO
back       VB, JJ, RB, NN
the        DT
bill       NN, VB
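A small sketch (function and variable names are assumptions, not the slides' code) of the second stage, applying that elimination rule to the candidate tag sets above:

```python
# Candidate tag sets per word, as produced by the first (morphological) stage.
candidates = [("She", {"PRP"}), ("promised", {"VBD", "VBN"}), ("to", {"TO"}),
              ("back", {"VB", "JJ", "RB", "NN"}), ("the", {"DT"}), ("bill", {"NN", "VB"})]

def eliminate_vbn_after_prp(tagged):
    result = []
    for i, (word, tags) in enumerate(tagged):
        # the slide's rule: if VBD is possible and the previous word can be PRP, drop VBN
        if "VBD" in tags and "VBN" in tags and i > 0 and "PRP" in tagged[i - 1][1]:
            tags = tags - {"VBN"}
        result.append((word, tags))
    return result

for word, tags in eliminate_vbn_after_prp(candidates):
    print(word, sorted(tags))
```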

Stochastic Tagging
Use the probability of a certain tag occurring, given the various possibilities.
Requires a training corpus.
Problems to overcome
o How do we assign tags to words not in the corpus?
Naive method
o Choose the most frequent tag in the training text for each word
o Result: 90% accuracy
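A sketch of the naive most-frequent-tag baseline on a tiny made-up corpus; the fallback tag for unknown words is an assumption made here.

```python
# Naive baseline: tag every word with its most frequent tag in a (tiny,
# illustrative) tagged training corpus; unknown words fall back to NN.
from collections import Counter, defaultdict

training = [("the", "DT"), ("man", "NN"), ("saw", "VBD"), ("the", "DT"),
            ("saw", "VBD"), ("saw", "NN"), ("the", "DT"), ("man", "NN")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

most_frequent = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def tag(sentence, fallback="NN"):
    return [(w, most_frequent.get(w, fallback)) for w in sentence]

print(tag(["the", "man", "saw", "the", "hammer"]))
# [('the', 'DT'), ('man', 'NN'), ('saw', 'VBD'), ('the', 'DT'), ('hammer', 'NN')]
```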

HMM Stochastic Tagging
Intuition: Pick the most likely tag based on context.
Maximize the formula using an HMM:
o P(word|tag) × P(tag|previous n tags)
Observed: W = w1, w2, ..., wn
Hidden: T = t1, t2, ..., tn
Goal: Find the tag sequence that most likely generated the sequence of words.
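A compact sketch of this idea with a bigram HMM and the Viterbi algorithm; the tag set and probability tables below are invented for illustration, not trained values.

```python
# Bigram HMM tagging sketch: choose the tag sequence maximizing the product of
# P(word|tag) * P(tag|previous tag), computed in log space with Viterbi.
import math

TAGS = ["DT", "NN", "VB"]
emit = {("the", "DT"): 0.9, ("dog", "NN"): 0.5, ("barks", "VB"): 0.6, ("barks", "NN"): 0.1}
trans = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.3, ("<s>", "VB"): 0.1,
         ("DT", "NN"): 0.8, ("DT", "VB"): 0.1, ("DT", "DT"): 0.1,
         ("NN", "VB"): 0.6, ("NN", "NN"): 0.3, ("NN", "DT"): 0.1,
         ("VB", "DT"): 0.5, ("VB", "NN"): 0.3, ("VB", "VB"): 0.2}

def viterbi(words):
    # best[tag] = (log probability, tag path) of the best tagging ending in tag
    best = {t: (math.log(trans[("<s>", t)]) + math.log(emit.get((words[0], t), 1e-9)), [t])
            for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            new_best[t] = max(
                (p + math.log(trans[(prev, t)]) + math.log(emit.get((word, t), 1e-9)), path + [t])
                for prev, (p, path) in best.items())
        best = new_best
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))   # ['DT', 'NN', 'VB']
```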

Transformation-Based Tagging (TBL, Brill Tagging)
Combines rule-based and stochastic tagging approaches
o Uses rules to guess at tags; the rules are machine-learned using a tagged corpus as input
Basic idea: later rules correct errors made by earlier rules
o Set the most probable tag for each word as a starting value
o Change tags according to rules of the type: IF word-1 is a determiner and word is a verb THEN change the tag to noun
Training uses a tagged corpus
o Step 1: Write a set of rule templates
o Step 2: Order the rules based on corpus accuracy

TBL: The Algorithm
Step 1: Use the dictionary to label every word with its most likely tag
Step 2: Select the transformation rule that most improves the tagging
Step 3: Re-tag the corpus by applying the rule
Repeat steps 2-3 until accuracy reaches a threshold
RESULT: An ordered sequence of transformation rules
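A high-level sketch of that loop; the function, rule, and data names here are hypothetical, and the stopping condition is a simple error-reduction threshold.

```python
# Sketch of the transformation-based learning loop: start from initial tags,
# then greedily pick the rule that most reduces errors against the gold corpus.
def tbl_train(corpus_words, gold_tags, initial_tags, candidate_rules, threshold=1):
    """candidate_rules: callables mapping (words, tags) to a new tag sequence."""
    current = list(initial_tags)
    learned = []
    while True:
        errors = sum(c != g for c, g in zip(current, gold_tags))
        best_rule, best_tags, best_gain = None, None, 0
        for rule in candidate_rules:                 # rule giving the largest error reduction
            new_tags = rule(corpus_words, current)
            gain = errors - sum(n != g for n, g in zip(new_tags, gold_tags))
            if gain > best_gain:
                best_rule, best_tags, best_gain = rule, new_tags, gain
        if best_rule is None or best_gain < threshold:
            break                                    # no rule improves accuracy enough: stop
        learned.append(best_rule)                    # result: an ordered sequence of rules
        current = best_tags                          # re-tag the corpus before the next pass
    return learned

# One rule instance of the template "change VB to NN after a determiner":
def dt_vb_to_nn(words, tags):
    return [("NN" if t == "VB" and i > 0 and tags[i - 1] == "DT" else t)
            for i, t in enumerate(tags)]

rules = tbl_train(["the", "run"], ["DT", "NN"], ["DT", "VB"], [dt_vb_to_nn])
print([r.__name__ for r in rules])   # ['dt_vb_to_nn']
```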

TBL: Problems
Problems
o Infinite loops are possible and rules may interact
o Training and execution are slower than for an HMM
Advantages
o The set of transformations can be constrained with templates: IF tag Z or word W is in position *-k THEN replace tag X with tag Y
o Learns a small number of simple, non-stochastic rules
o Speed optimizations are possible using finite state transducers
o TBL is the best performing algorithm on unknown words
o The rules are compact and can be inspected by humans
Accuracy
o The first 100 rules achieve 96.8% accuracy; the first 200 rules achieve 97.0% accuracy

Neural Networks
HMM-based algorithms dominate the field of natural language processing. Unfortunately, HMMs have a number of disadvantages:
o Due to their Markovian nature, HMMs do not take into account the sequence of states leading into any given state
o Due to their Markovian nature, the time spent in a given state is not captured explicitly
o They require annotated data, which may not be readily available
o Dependencies between non-adjacent states cannot be represented
o The computational and memory cost to evaluate and train them is significant
Neural networks present a possible stochastic alternative.

Neural Network Digital approximation of biological neurons

Digital Neuron
[Figure: inputs, each multiplied by a weight W, are summed (Σ) and passed through an activation function f(n) to produce the neuron's output.]
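A minimal sketch of that neuron as code, assuming a sigmoid activation; the weights and inputs are illustrative values.

```python
# One digital neuron: weighted sum of inputs passed through an activation f(n).
import math

def neuron(inputs, weights, bias=0.0, activation=lambda n: 1.0 / (1.0 + math.exp(-n))):
    n = sum(w * x for w, x in zip(weights, inputs)) + bias   # net input n = sum(w_i * x_i) + b
    return activation(n)                                     # output = f(n)

print(neuron([0.5, -1.0, 2.0], [0.4, 0.3, 0.1]))   # ~0.525 with the sigmoid activation
```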

Transfer Functions
[Figure: a transfer (activation) function mapping the neuron's net input to an output between 0 and 1.]

Networks without Feedback
[Figures: a single-layer network with multiple inputs, and a multi-layer network with multiple inputs.]

Feedback (Recurrent Networks)
[Figure: a network with feedback (recurrent) connections.]

Supervised Learning
[Figure: inputs from the environment feed both the neural network and the actual system; the difference between the expected output and the actual output forms the error signal.]
Training: Run a set of training data through the network and compare the outputs to the expected results. Back-propagate the errors to update the network weights until the outputs match what is expected.

Multilayer Perceptron
Definition: A network of neurons in which the outputs of some neurons are connected through weighted connections to the inputs of other neurons.
[Figure: inputs feed a first hidden layer, then a second hidden layer, then the output layer.]
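A minimal sketch of a forward pass through such a network, assuming sigmoid neurons and randomly chosen illustrative weights (layer sizes are not from the slides).

```python
# Forward pass of a small multilayer perceptron: two hidden layers and an
# output layer, each neuron applying a sigmoid to its weighted inputs.
import math, random

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))

def layer_forward(inputs, weight_matrix):
    """weight_matrix[j] holds the weights of neuron j over all of its inputs."""
    return [sigmoid(sum(w * x for w, x in zip(weights, inputs))) for weights in weight_matrix]

def mlp_forward(inputs, layers):
    for weight_matrix in layers:            # outputs of one layer feed the next layer
        inputs = layer_forward(inputs, weight_matrix)
    return inputs

random.seed(0)
layers = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)],   # first hidden layer
          [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)],   # second hidden layer
          [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]]   # output layer
print(mlp_forward([0.2, 0.7, -0.1], layers))   # two output activations
```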

Backpropagation of Errors
[Figure: function signals flow forward through the network while error signals flow backward.]
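As a sketch of the backward error signal for a single sigmoid output neuron (a full MLP repeats this layer by layer), assuming a squared-error objective; all values are illustrative.

```python
# One gradient step for a single sigmoid neuron: forward function signal,
# backward error signal (delta), then a weight update (delta rule).
import math

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))

def train_step(inputs, weights, target, learning_rate=0.5):
    # forward pass: function signal
    n = sum(w * x for w, x in zip(weights, inputs))
    output = sigmoid(n)
    # backward pass: error signal delta = (target - output) * f'(n), with f'(n) = f(n)(1 - f(n))
    delta = (target - output) * output * (1.0 - output)
    return [w + learning_rate * delta * x for w, x in zip(weights, inputs)]

weights = [0.1, -0.2, 0.3]
for _ in range(1000):
    weights = train_step([1.0, 0.5, -1.0], weights, target=0.9)
print(weights)   # weights adjusted so the neuron's output approaches 0.9
```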