Natural Language Processing (NLP)
I. Introduction
II. Issues in NLP
III. Statistical NLP: Corpus-based Approach
I. Introduction
Language: a medium for the transfer of information.
Natural language: any language used by humans, as opposed to artificial languages such as computer languages.
Two basic questions in linguistics:
Q1 (Syntax): What kinds of things do people say?
Q2 (Semantics): What do these things say about the world?
Natural Language Processing (NLP): As a branch of computer science, its goal is to use computers to process natural language.
Computational Linguistics (CL): As an interdisciplinary field between linguistics and computer science, it is concerned with the computational aspects (e.g., theory building & testing) of human language.
NLP is an applied component of CL.
Uses of Natural Language Processing:
- Speech Recognition (convert a continuous stream of sound waves into discrete words) – phonetics & signal processing
- Language Understanding (extract ‘meaning’ from the identified words) – syntax & semantics
- Language Generation / Speech Synthesis: generate appropriate, meaningful NL responses to NL inputs
  - Turing Test (some humans fail this test!)
  - ELIZA (Weizenbaum, 1966): rule-based keyword matching
  - Loebner Prize (since 1991): $100,000; so far no program has achieved above 50% success
- Automatic Machine Translation (translate one NL into another) – hard problem
- Automatic Knowledge Acquisition (computer programs that read books and listen to human conversations to extract knowledge) – very hard problem
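The rule-based keyword matching behind ELIZA can be sketched in a few lines; the patterns and canned replies below are hypothetical illustrations, not Weizenbaum's original script.

```python
import re

# Hypothetical ELIZA-style rules: each maps a keyword pattern to a reply
# template; unmatched input falls through to a generic prompt.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Tell me more about feeling {0}."),
    (re.compile(r"\b(?:mother|father|family)\b", re.IGNORECASE),
     "Tell me about your family."),
]

def respond(utterance: str) -> str:
    """Return the reply of the first matching rule, else a stock prompt."""
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            # Fill the template with any captured text.
            return template.format(*m.groups()) if m.groups() else template
    return "Please go on."
```

For example, `respond("I am sad")` triggers the first rule and echoes the captured phrase back as a question, which is all the "understanding" ELIZA actually did.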
II. Issues in NLP
1. Rational vs. Empiricist Approach
2. Role of Nonlinguistic Knowledge
1. Rational vs. Empiricist Approach
Rational Approach to Language (1960–1985): Most of the human language faculty is hardwired in the brain at birth and genetically inherited.
- Universal Grammar (Chomsky, 1957)
- Explains why children can learn something as complex as a natural language from limited input in such a short time (2–3 years)
- Poverty of the Stimulus (Chomsky, 1986): there are simply not enough inputs for children to learn key parts of language
Empiricist Approach (1920–60, 1985–present): The baby's brain starts with some general rules of language operation, but its detailed structure must be learned from external input (e.g., N–V–O vs. N–O–V word order).
- Values of Parameters: a general language model is predetermined, but the values of its parameters must be fine-tuned
  (e.g.) Y = aX + b (a, b: parameters)
- M/I Home: basic floor plan plus custom options
2. Role of Nonlinguistic Knowledge
Grammatical Parsing (GP) View of NLP: grammatical principles and rules play the primary role in language processing.
Extreme GP view: every grammatically well-formed sentence is meaningfully interpretable, and vice versa.
- An unrealistic view! (“All grammars leak” (Sapir, 1921))
(e.g.) Colorless green ideas sleep furiously (grammatically correct but semantically strange)
The horse raced past the barn fell (seems ungrammatical, yet it is an interpretable garden-path sentence: The horse that was raced (by someone) past the barn fell)
-- Recovering such readings requires nonlinguistic knowledge.
Examples of lexically ambiguous (but semantically acceptable) sentences:
The astronomer married the star.
The sailor ate a submarine.
Time flies like an arrow.
Our company is training workers.
Clearly, language processing requires more than grammatical information.
Integrated Knowledge View of NLP: language processing requires both grammatical knowledge (grammaticality) and general world knowledge (conventionality).
(e.g.) John wanted money. He got a gun and walked into a liquor store. He told the owner he wanted some money. The owner gave John the money and John left.
This explains how difficult the NLP problem is and why no one has yet succeeded in developing a reliable NLP system.
III. Statistical NLP: Corpus-based Approach
Rather than studying language by observing its use in actual situations, researchers use a pre-collected body of texts called a corpus.
Brown Corpus (1960s): one million words put together at Brown University from fiction, newspaper articles, scientific text, legal text, etc.
Susanne Corpus: 130,000 words; a subset of the Brown Corpus; syntactically annotated; freely available.
Canadian Hansards: bilingual corpus; fully translated between English and French.
Example of the Corpus-based Approach to NLP
Mark Twain’s Tom Sawyer:
- 71,370 words total (tokens)
- 8,018 different words (types)

Word   Freq     Word   Freq
the    3332     in      906
and    2972     that    877
a      1775     he      877
to     1725     I       783
of     1440     his     772
was    1161     you     686
it     1027     Tom     679

Q: Why are the words not equally frequent? What does this tell us about language?
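Token and type counts like those above can be reproduced with a few lines of Python; the tokenizer below is a deliberately crude, assumed one (lowercasing plus a simple regex), and the sample sentence is invented for illustration.

```python
from collections import Counter
import re

def word_stats(text: str):
    """Count tokens (running words) and types (distinct words)."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer
    freq = Counter(tokens)
    return len(tokens), len(freq), freq

sample = "the cat sat on the mat and the dog sat too"
n_tokens, n_types, freq = word_stats(sample)
# 11 tokens, 8 types; 'the' is the most frequent word, just as in Tom Sawyer
```

Running the same function over the full text of Tom Sawyer would yield the 71,370 / 8,018 figures quoted above (modulo tokenization choices, which do affect the exact counts).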
- Out of the 8,018 word types, 3,993 (50%) occur only once, 1,292 (16%) twice, 664 (8%) three times, …
- Over 90% of the word types occur 10 times or fewer.
- Each of the 12 most common words occurs over 700 times (about 1% each); together they account for over 12% of all tokens.

Occurrence Count   No. of Word Types
1                  3993
2                  1292
3                   664
4                   410
…
10                   91
11–50               540
51–100               99
>100                102
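The occurrence-count table above (a "frequency of frequencies" spectrum) can be computed directly from a word-frequency table; the tiny sample text here is invented for illustration.

```python
from collections import Counter

def freq_spectrum(freq: Counter) -> Counter:
    """Map occurrence count -> number of word types with that count."""
    return Counter(freq.values())

freq = Counter("the cat sat on the mat and the dog sat too".split())
spectrum = freq_spectrum(freq)
# spectrum[1] = 6 hapaxes (cat, on, mat, and, dog, too),
# spectrum[2] = 1 (sat), spectrum[3] = 1 (the)
```

Even in this toy sample, most types are hapaxes (words occurring once), mirroring the roughly 50% hapax rate in Tom Sawyer.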
Zipf’s Law and the Principle of Least Effort
- Empirical law uncovered by Zipf in 1929: f (word-type frequency) × r (rank) = k (a constant)
The more frequent a word type, the higher it ranks in the frequency list (rank 1 = most frequent). According to Zipf’s law, a word’s actual frequency count should be inversely proportional to its rank.
Principle of Least Effort: a unifying principle proposed by Zipf to account for the law: both the speaker and the hearer try to minimize their effort. The speaker’s effort is conserved by using a small vocabulary of common words (i.e., larger f), whereas the hearer’s effort is conserved by having a large vocabulary of rarer words (i.e., smaller f) so that messages are less ambiguous. Zipf’s law represents an optimal compromise between these opposing efforts.
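A quick way to see how well f × r = k holds is to multiply the Tom Sawyer frequencies quoted earlier by their ranks:

```python
# Sanity-check f * r ~= k on the Tom Sawyer counts quoted above.
counts = [("the", 3332), ("and", 2972), ("a", 1775), ("to", 1725),
          ("of", 1440), ("was", 1161), ("it", 1027), ("in", 906),
          ("that", 877), ("he", 877), ("I", 783), ("his", 772),
          ("you", 686), ("Tom", 679)]
products = [f * r for r, (_, f) in enumerate(counts, start=1)]
# 3332, 5944, 5325, 6900, 7200, 6966, 7189, 7248, 7893, 8770, ...
# After the very top ranks the product stays within a factor of ~2,
# i.e. Zipf's law is a rough approximation, not an exact identity.
```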
Zipf’s law on a log-log plot: f = k / r
or, log(f) = −log(r) + log(k), i.e., a straight line with slope −1.
“More exact” Zipf’s Law (Mandelbrot, 1954): Mandelbrot derived a more general form of Zipf’s law from theoretical principles:
f = k / (r + a)^b
which reduces to Zipf’s law for a = 0 and b = 1.
Mandelbrot’s fit: log(f) = −b·log(r + a) + log(k)
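A minimal sketch of fitting Mandelbrot’s form by linear regression in log-log space, assuming the offset a is held fixed; the data are synthetic, generated to obey exact Zipf (b = 1), so the fit should recover the known parameters.

```python
import math

def fit_mandelbrot(freqs, a=0.0):
    """Fit b and k in log f = log k - b*log(r + a) by least squares,
    with the offset a held fixed (a real fit would optimize a too)."""
    xs = [math.log(r + a) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope                      # slope in log-log space is -b
    k = math.exp(my + b * mx)       # intercept gives log k
    return b, k

# Synthetic data obeying exact Zipf (b = 1): f = 1000 / r
freqs = [1000 / r for r in range(1, 51)]
b, k = fit_mandelbrot(freqs)
# recovers b ~= 1.0 and k ~= 1000
```

On real corpus counts the fitted b typically differs from 1 and a from 0, which is exactly where Mandelbrot’s form improves on plain Zipf.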
What does Zipf’s law tell us about language?
ANS: Not much (virtually nothing!)
A Zipf-like law can be obtained under the assumption that text is produced at random by independently choosing one of N letters, each with equal probability r, and the space with probability (1 − Nr). Moreover, Zipf’s law concerns the distribution of words, whereas language, especially semantics, is about the interrelations between words.
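The random-typing argument can be simulated directly: generate characters independently, split on spaces, and inspect the rank-frequency curve. All parameters below (alphabet size, space probability, seed) are arbitrary choices for illustration.

```python
import random
from collections import Counter

def random_text(n_chars: int, n_letters: int = 4, p_space: float = 0.2,
                seed: int = 0) -> list:
    """'Monkey typing': each character is a space with probability
    p_space, otherwise one of n_letters equiprobable letters."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"[:n_letters]
    chars = [" " if rng.random() < p_space else rng.choice(letters)
             for _ in range(n_chars)]
    return "".join(chars).split()

words = random_text(200_000)
freq = Counter(words)
ranked = [f for _, f in freq.most_common()]
# Short "words" are exponentially more likely than long ones, so the
# rank-frequency curve falls off roughly like a power law even though
# the "text" carries no linguistic structure whatsoever.
```

That a meaningless character stream reproduces the law is precisely why the law itself says so little about language.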
In short, Zipf’s law does not reflect some deep underlying process of language. Laws of this kind are typical of many stochastic processes and are unrelated to the characteristic features of any particular process: they are a phenomenon of universal occurrence that carries no specific information about the underlying mechanism. Language-specific information is hidden in the deviations from the law, not in the law itself.