Natural Language Understanding
Raivydas Simenas
Overview
  History
  Speech recognition
  Natural language understanding
    Statistical methods to resolve ambiguities
  Current situation
History
  Roots in teaching the deaf to speak using “visible speech”
  1874: Alexander Graham Bell’s invention of the harmonic telegraph
    Different frequency harmonics of an electrical signal could be separated
    Multiple messages could be sent over the same wire at the same time
  1940s: separating the speech signal into different frequency components using the spectrogram
  1950s: the beginning of computer use for automatic speech recognition
The Nature of Speech
  Phoneme: a basic sound, e.g. a vowel
  The complexity of the human vocal apparatus: about 18 phonemes per second
  Speech viewed as a sound wave
  Identifying sounds: analyzing the sound wave into its frequency components
The Spectrogram I
  A visual representation of speech which contains all the salient information
  Plots the amount of energy at different frequencies against time
  Discontinuous speech (pausing after each word) is easier to recognize on the spectrogram
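The idea of plotting energy at different frequencies against time can be sketched in a few lines. This is a minimal illustration, not a production analyzer: the function name `spectrogram`, the frame size, the hop size, and the Hann window are all assumptions made for the example, and a naive discrete Fourier transform is used instead of an FFT to keep the code self-contained.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of per-frequency energies for each time frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window reduces leakage between neighbouring frequency bins.
        windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
                    for n, s in enumerate(frame)]
        energies = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT coefficient for frequency bin k.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(windowed))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames

# Hypothetical test signal: a pure 1000 Hz tone sampled at 8000 Hz.
rate, freq = 8000, 1000
signal = [math.sin(2 * math.pi * freq * t / rate) for t in range(256)]
spec = spectrogram(signal)
# Index of the strongest frequency bin in every frame.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
```

Each row of `spec` is one column of the spectrogram picture. For the test tone the strongest bin is the same in every frame, which is what a horizontal band on a spectrogram corresponds to.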
The Spectrogram II
  The same word uttered twice (especially by different speakers: the speaker independence problem) might look radically different on a spectrogram
  The need to recognize invariant features in a spectrogram
    Formants: resonant frequencies sustained for a short time period in pronouncing a vowel
  Normalization: distinguishing between relevant and irrelevant information
  Nonlinear time compression: compensating for variations in speaking rate
  Matching a spoken word to a template
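One common way to combine nonlinear time compression with template matching is dynamic time warping (DTW). The slides do not name a specific algorithm, so DTW is an assumption here, as are the function names `dtw_distance` and `recognize` and the toy templates. Real systems compare feature vectors per frame; single numbers are used below only to keep the sketch short.

```python
def dtw_distance(a, b):
    """Cost of the best nonlinear time alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Either sequence may be stretched, or both advance together.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, templates):
    """Return the word whose template aligns with the utterance most cheaply."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical one-dimensional "feature tracks" for two word templates.
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}
# The same shape as "yes", spoken at half the speed.
slow_yes = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]
best_word = recognize(slow_yes, templates)
```

Because the alignment may repeat template frames, the slowed-down utterance still matches its template with zero cost.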
Robust Speech Recognition
  Need to maintain accuracy when the quality of the input speech is degraded or when the speech characteristics differ because the environment or the speaker has changed
  Dynamic parameter adaptation: alter either the input signal or the internally stored representations
    Optimal parameter estimation: based on a statistical model characterizing the differences between training and test sets
    Empirical feature comparison: based on comparison between high-quality speech and the same speech recorded under degraded conditions
Stochastic Methods in Speech Recognition
  Generating the sequence of word hypotheses for an acoustic signal is most often done using statistics
  The process:
    A sequence of acoustic signals is represented using a collection of vectors
    Such collections are used to build acoustic word models, which consist of probabilities of certain sequences of vectors representing a word
    Acoustic word models utilize Markov chains
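To make the idea of an acoustic word model concrete, here is a rough sketch that scores an observation sequence against a small Markov model using the forward algorithm. Several things are assumptions not stated on the slide: the acoustic vectors are taken to be already quantized into discrete symbols ("a", "b"), the model has hidden states, and every probability below is invented for illustration.

```python
def sequence_likelihood(observations, start, trans, emit):
    """Forward algorithm: probability that the word model produces the observations."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: start[s] * emit[s][observations[0]] for s in start}
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[r] * trans[r][s] for r in alpha)
                 for s in start}
    return sum(alpha.values())

# Hypothetical two-state left-to-right word model.
start = {"s1": 1.0, "s2": 0.0}
trans = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.0, "s2": 1.0}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

likelihood_ab = sequence_likelihood(["a", "b"], start, trans, emit)
likelihood_ba = sequence_likelihood(["b", "a"], start, trans, emit)
```

A recognizer built this way would keep one such model per vocabulary word and propose the word whose model gives the observed sequence the highest likelihood.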
Representing Sentences
  Syntactic form: indicates the way the words are related to each other in a sentence
  Logical form: identifies the semantic relationships between words based solely on knowledge of the language (independently of the situation)
  Final meaning representation: maps the information from the syntactic and logical forms into the knowledge representation
    The system uses the knowledge representation to represent and reason about its application domain
Parsing a Sentence
  Parsing: determining the structure of the sentence according to the grammar
  Tree representation of a sentence
  Transition network grammars
    Start with the initial node
    An arc can be traversed only if it is labeled with an appropriate category
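The traversal rule for a transition network can be illustrated with a tiny recognizer. The network below (a noun phrase made of an article, any number of adjectives, and a noun), the lexicon, and the names `NETWORK`, `FINAL`, `LEXICON`, and `accepts` are all hypothetical and chosen only for this sketch.

```python
# Each node maps to its outgoing arcs: (category label, target node).
NETWORK = {
    "start": [("ART", "n1")],
    "n1": [("ADJ", "n1"), ("N", "done")],
}
FINAL = {"done"}

# A word may belong to more than one lexical category.
LEXICON = {
    "the": {"ART"},
    "old": {"ADJ", "N"},
    "man": {"N", "V"},
}

def accepts(words, node="start"):
    """True if the words can be consumed along arcs from node to a final node."""
    if not words:
        return node in FINAL
    categories = LEXICON.get(words[0], set())
    # An arc may be traversed only if its label is a category of the next word.
    return any(accepts(words[1:], target)
               for label, target in NETWORK.get(node, [])
               if label in categories)

ok = accepts(["the", "old", "man"])
bad = accepts(["old", "the"])
```

The recursion tries every arc whose label fits, so an ambiguous word such as "old" is explored under each of its categories.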
Stochastic Methods for Ambiguity Resolution I
  Some sentences can be parsed in many different ways, e.g. "time flies like an arrow"
  The most popular method for resolving such ambiguity is based on statistics
  Some facts from probability theory
    The concept of a random variable, e.g. the lexical category of “flies”
    A probability function assigns a probability to every possible value of the random variable, e.g. 0.3 for “flies” being a noun, 0.7 for its being a verb
    Conditional probability functions (Pr(A|B)), e.g. the probability of a verb occurring given that a noun has already occurred
Stochastic Methods for Ambiguity Resolution II
  Probabilities are used to predict future events given some data about the past
  Maximum likelihood estimator (MLE)
    Probability of X happening in the future = number of cases of X happening in the past / total number of events in the past
    Works well only if X occurred often; not very useful for low-frequency events
  Expected likelihood estimator (ELE)
    Probability of X happening in the future = f(number of cases of X happening in the past) / sum over all events E of f(number of cases of E happening in the past)
    E.g. if f(x) = x + 0.5 and the observed values for X and Y are 0.4 and 0.6, then ELE(X) = (0.4 + 0.5)/(0.4 + 0.5 + 0.6 + 0.5) = 0.45
    MLE is a special case of ELE, i.e. for MLE f(x) = x
  Given a large amount of text, one can use MLE or ELE to determine the lexical category of an ambiguous word, e.g. the word "flies"
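The two estimators can be compared side by side in code. This is a small sketch: the function names `mle` and `ele` and the counts for "flies" are invented for the example, with the counts chosen so that MLE gives the 0.3 and 0.7 values quoted for "flies" earlier.

```python
def mle(counts):
    """Maximum likelihood estimate: relative frequency of each event."""
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}

def ele(counts, offset=0.5):
    """Expected likelihood estimate with f(x) = x + offset."""
    total = sum(c + offset for c in counts.values())
    return {event: (c + offset) / total for event, c in counts.items()}

# Hypothetical corpus counts for the lexical category of "flies".
flies_counts = {"N": 3, "V": 7, "ADJ": 0}
flies_mle = mle(flies_counts)
flies_ele = ele(flies_counts)

# The worked example from the slide: values 0.4 and 0.6, offset 0.5.
slide_example = ele({"X": 0.4, "Y": 0.6})
```

The category that was never observed gets probability zero under MLE but a small positive probability under ELE, which is the reason ELE behaves better for low-frequency events.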
Stochastic Methods for Ambiguity Resolution III
  Always choosing the interpretation that occurs most frequently in the training set achieves about a 90% success rate on average (not good enough)
  Some of the local context should be used to determine the lexical category of a word
  Ideally, for a sequence of words w_1, w_2, …, w_n we want the lexical category sequence c_1, c_2, …, c_n that is most probable given the words
  In practice, approximations of such probabilities are made
Stochastic Methods for Ambiguity Resolution IV
  n-gram models
    Look at the probability of a lexical category c_i following the sequence of lexical categories c_{i-n+1}, …, c_{i-2}, c_{i-1}
    The probability of c_1, c_2, …, c_k occurring is approximately the product of the n-gram probabilities for each word, e.g. the probability of the sequence ART, N, V is 0.71 * 1 * 0.43 = 0.3053
    In practice, bigram or trigram models are used most often
  Models that capture this idea are called hidden Markov models (HMMs)
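The ART, N, V example can be reproduced with a bigram table. The slide gives only the three factors, so reading them as Pr(ART | start), Pr(N | ART), and Pr(V | N) is an assumption, as are the names `BIGRAM` and `sequence_probability` and the `<start>` marker.

```python
# Bigram probabilities Pr(category | previous category), taken from the slide's
# example under the assumed reading described above.
BIGRAM = {
    ("<start>", "ART"): 0.71,
    ("ART", "N"): 1.0,
    ("N", "V"): 0.43,
}

def sequence_probability(categories, bigram, start="<start>"):
    """Approximate Pr(c_1, ..., c_k) as a product of bigram probabilities."""
    prob = 1.0
    prev = start
    for cat in categories:
        # Bigrams missing from the table are treated as impossible.
        prob *= bigram.get((prev, cat), 0.0)
        prev = cat
    return prob

p_art_n_v = sequence_probability(["ART", "N", "V"], BIGRAM)
p_v_art = sequence_probability(["V", "ART"], BIGRAM)
```

Treating unseen bigrams as zero is the MLE behavior; a smoothed estimator would be a natural substitute when the training text is small.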
Stochastic Methods for Ambiguity Resolution V
  To determine the most likely interpretation of a given sequence of n words w_1, …, w_n, we want to maximize, over category sequences c_1, …, c_n, the value of the product over i = 1, …, n of Pr(c_i | c_{i-1}) * Pr(w_i | c_i) (the bigram approximation, with c_0 a start marker)
  The Viterbi algorithm
    Given k lexical categories, the total number of possibilities to consider for a sequence of n words is k^n
    The Viterbi algorithm reduces this number to const * n * k^2
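A compact sketch of the Viterbi algorithm for category tagging follows. It assumes the bigram formulation, in which each category sequence is scored by the product of Pr(c_i | c_{i-1}) and Pr(w_i | c_i) over all words. The function name `viterbi`, the two-category setup, and every probability in the tables are made up for illustration.

```python
def viterbi(words, categories, trans, emit, start="<start>"):
    """Return (probability, category sequence) of the best tagging of words."""
    # best[c] = (probability, path) of the best sequence so far that ends in c.
    best = {c: (trans.get((start, c), 0.0) * emit.get((words[0], c), 0.0), [c])
            for c in categories}
    for word in words[1:]:
        new_best = {}
        for c in categories:
            # Only the best path into each previous category needs extending,
            # which gives roughly n * k^2 steps instead of k^n.
            prob, path = max(
                ((best[p][0] * trans.get((p, c), 0.0), best[p][1]) for p in categories),
                key=lambda item: item[0])
            new_best[c] = (prob * emit.get((word, c), 0.0), path + [c])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical model: trans holds Pr(c | previous c), emit holds Pr(word | c).
categories = ["N", "V"]
trans = {("<start>", "N"): 0.8, ("<start>", "V"): 0.2,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("time", "N"): 0.02, ("time", "V"): 0.01,
        ("flies", "N"): 0.01, ("flies", "V"): 0.03}

best_prob, best_path = viterbi(["time", "flies"], categories, trans, emit)
```

With these invented numbers the best tagging of "time flies" is noun followed by verb. Storing whole paths in the table keeps the sketch short; an implementation concerned with memory would store back-pointers instead.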
Logical Form
  Although interpreting a sentence often requires knowledge of the context, some interpretation can be done independently of it
    Basic semantic properties of a word, its different senses, etc.
  Ontology
    Each word has one or more senses in which it can be used, e.g. "go" has about 40 senses
    The different senses of all the words of a natural language are organized into classes of objects, such as events, actions, etc.
    The set of such classes is called an ontology
  The logical form of an utterance can be viewed as a function that maps the current discourse situation into a new one resulting from the occurrence of the utterance
Current Situation
  Inexpensive software for speech recognition
  The issues: large vocabulary, continuous speech and speaker independence
  Automated speech recognition for restricted domains
  The speed of serial processes in a computer vs. the number of parallel processes in the human brain
References
  Ronald A. Cole (ed.). Survey of the State of the Art in Human Language Technology, 1996
  James Allen. Natural Language Understanding, 1995
  Raymond Kurzweil. When Will HAL Understand What We Are Saying? Computer Speech Recognition and Understanding. In HAL’s Legacy, 1996