Natural Language Processing (NLP) I. Introduction II. Issues in NLP III. Statistical NLP: Corpus-based Approach.

I. Introduction
Language: a medium for the transfer of information.
Natural language: any language used by humans, as opposed to artificial languages such as programming languages.
Two basic questions in linguistics:
- Q1 (Syntax): What kinds of things do people say?
- Q2 (Semantics): What do these things say about the world?

Natural Language Processing (NLP): As a branch of computer science, its goal is to use computers to process natural language.
Computational Linguistics (CL): An interdisciplinary field between linguistics and computer science, concerned with the computational aspects (e.g., theory building and testing) of human language.
NLP is an applied component of CL.

Uses of Natural Language Processing:
- Speech Recognition (convert a continuous stream of sound waves into discrete words) – phonetics & signal processing
- Language Understanding (extract 'meaning' from identified words) – syntax & semantics
- Language Generation / Speech Synthesis: generate appropriate, meaningful NL responses to NL inputs
  - Turing Test (some humans fail this test!)
  - ELIZA (Weizenbaum, 1966): rule-based keyword matching
  - Loebner Prize (since 1991): $100,000; so far no entrant has succeeded with above 50% of judges
- Automatic Machine Translation (translate one NL into another) – hard problem
- Automatic Knowledge Acquisition (computer programs that read books and listen to human conversations to extract knowledge) – very hard problem
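The rule-based keyword matching behind ELIZA can be sketched in a few lines. This is an illustrative toy, not Weizenbaum's original script: the patterns and canned responses below are invented for the example.

```python
import re

# ELIZA-style responder: scan the input for a keyword pattern and
# emit a canned reflection built from the matched text. The rules
# here are hypothetical stand-ins for Weizenbaum's script.
RULES = [
    (re.compile(r"\bI am (.*)", re.I), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.*)", re.I), "Tell me more about feeling {0}."),
    (re.compile(r"\b(mother|father|family)\b", re.I), "Tell me about your family."),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "Please go on."  # default when no keyword matches

print(respond("I am sad about my job"))  # Why do you say you are sad about my job?
print(respond("Nice weather today"))     # Please go on.
```

The sketch shows why ELIZA gives an illusion of understanding with no semantics at all: it only transforms surface strings.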

II. Issues in NLP
1. Rationalist vs. Empiricist Approach
2. Role of Nonlinguistic Knowledge

1. Rationalist vs. Empiricist Approach
Rationalist approach to language (c. 1960–1985): most of the human language faculty is hardwired in the brain at birth and genetically inherited.
- Universal Grammar (Chomsky, 1957)
- Explains why children can learn something as complex as a natural language from limited input in such a short time (2–3 years)
- Poverty of the Stimulus (Chomsky, 1986): there is simply not enough input for children to learn key parts of language from experience alone

Empiricist approach (1920–60, 1985–present): the baby's brain starts with some general principles of language operation, but its detailed structure must be learned from external input (e.g., N–V–O vs. N–O–V word order).
- Values of parameters: a general language model is predetermined, but the values of its parameters must be fine-tuned from data, e.g.:
  - Y = aX + b (a, b: parameters)
  - M/I Homes: a basic floor plan plus custom options
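The Y = aX + b example above can be made concrete: the model family is fixed in advance, and only the two parameters are estimated from observed data, here by ordinary least squares with the standard library.

```python
# The model family Y = a*X + b is fixed in advance; only the
# parameters a and b are tuned from the observed (x, y) pairs.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

# Data generated by Y = 2X + 1; the fit recovers a = 2, b = 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
print(fit_line(xs, ys))  # (2.0, 1.0)
```

The analogy to language acquisition: the functional form plays the role of the innate general model, the fitted parameters play the role of what is learned from input.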

2. Role of Nonlinguistic Knowledge
Grammatical Parsing (GP) view of NLP: grammatical principles and rules play the primary role in language processing.
Extreme GP view: every grammatically well-formed sentence is meaningfully interpretable, and vice versa.
- An unrealistic view! ("All grammars leak." – Sapir, 1921)
(e.g.)
- Colorless green ideas sleep furiously. (grammatically correct but semantically strange)
- The horse raced past the barn fell. (looks ungrammatical but is in fact well-formed: a garden-path sentence meaning "The horse that was raced (by someone) past the barn fell.")
-- interpreting such sentences draws on nonlinguistic knowledge

Examples of lexically ambiguous words (each sentence grammatical and semantically acceptable):
- The astronomer married the star.
- The sailor ate a submarine.
- Time flies like an arrow.
- Our company is training workers.
Clearly, language processing requires more than grammatical information.
Integrated Knowledge view of NLP: language processing requires both grammatical knowledge (grammaticality) and general world knowledge (conventionality).
(e.g.) John wanted money. He got a gun and walked into a liquor store. He told the owner he wanted some money. The owner gave John the money and John left.
This explains how difficult the NLP problem is and why no one has yet succeeded in developing a fully reliable NLP system.

III. Statistical NLP: Corpus-based Approach
Rather than studying language by observing language use in actual situations, researchers use a pre-collected body of texts called a corpus.
- Brown Corpus (1960s): one million words put together at Brown University from fiction, newspaper articles, scientific text, legal text, etc.
- Susanne Corpus: 130,000 words; a subset of the Brown Corpus; syntactically annotated; freely available.
- Canadian Hansards: bilingual corpus; fully translated between English and French.

Example of the corpus-based approach to NLP
Mark Twain's Tom Sawyer:
- 71,370 words in total (tokens)
- 8,018 different words (types)

  Word   Freq     Word   Freq
  the    3332     in      906
  and    2972     that    877
  a      1775     he      877
  to     1725     I       783
  of     1440     his     772
  was    1161     you     686
  it     1027     Tom     679

Q: Why are the words not equally frequent? What does this tell us about language?
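Token and type counts like those above are straightforward to compute. A minimal sketch using a toy text (the Tom Sawyer text itself is not bundled here; any plain-text file would do in its place):

```python
from collections import Counter

# Toy stand-in for a corpus file; real work would read the novel's text.
text = """the quick brown fox jumps over the lazy dog
the dog barks and the fox runs"""

tokens = text.lower().split()  # crude whitespace tokenization
counts = Counter(tokens)

print("tokens:", len(tokens))   # total word occurrences
print("types:", len(counts))    # distinct words
print(counts.most_common(3))    # the frequency list, most frequent first
```

Note that `str.split` is a deliberately crude tokenizer; real corpus work must also handle punctuation, case, hyphenation, and so on.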

- Out of the 8,018 word types, 3,993 (50%) occur only once, 1,292 (16%) twice, 664 (8%) three times, …
- Over 90% of the word types occur 10 times or less.
- Each of the 12 most common words occurs over 700 times (about 1% each), together accounting for over 12% of all tokens.

Zipf's Law and the Principle of Least Effort
- Empirical law uncovered by Zipf in 1929: f (word type frequency) × r (rank) = k (constant), where the most frequent word type has rank 1 in the frequency list.
- According to Zipf's law, a word type's frequency should be inversely proportional to its rank.
- Principle of Least Effort: a unifying principle proposed by Zipf to account for the law: both the speaker and the hearer are trying to minimize their effort. The speaker's effort is conserved by having a small vocabulary of common words (i.e., larger f), whereas the hearer's effort is conserved by having a large vocabulary of rarer words (i.e., smaller f) so that messages are less ambiguous. Zipf's law represents an optimal compromise between these opposing efforts.
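The relation f × r = k is easy to check mechanically. With real text the product is only roughly constant; the counts below are hand-made to follow the law exactly with k = 1200, just to show the computation.

```python
from collections import Counter

# Hand-made frequency list obeying Zipf's law exactly with k = 1200.
freqs = Counter({"the": 1200, "of": 600, "and": 400, "to": 300, "a": 240})

# For each word type, frequency * rank should stay near the constant k.
for rank, (word, f) in enumerate(freqs.most_common(), start=1):
    print(word, rank, f, f * rank)  # last column is ~constant
```

Running the same loop over counts from a real corpus shows the product drifting, which is what motivates Mandelbrot's refinement below.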

Zipf's law on a log-log plot: freq = k / rank, or equivalently log(freq) = −log(rank) + log(k), i.e., a straight line with slope −1.

"More exact" Zipf's law (Mandelbrot, 1954): Mandelbrot derived a more general form of Zipf's law from theoretical principles:

  freq = k / (rank + a)^b

which reduces to Zipf's law for a = 0 and b = 1.

Mandelbrot's fit on a log-log plot: log(freq) = −b · log(rank + a) + log(k)
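Because the relation is linear in log-log space, b and log(k) can be estimated for a fixed a by ordinary least squares. A minimal stdlib-only sketch, tested on frequencies generated exactly from f = 1000 / r (so the fit should recover b = 1, k = 1000):

```python
import math

# Fit log(freq) = -b * log(rank + a) + log(k) by least squares
# for a fixed offset a (a itself could be chosen by grid search).
def fit_mandelbrot(freqs, a=0.0):
    xs = [math.log(rank + a) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope
    log_k = my + b * mx
    return b, math.exp(log_k)

# Frequencies generated from f = 1000 / r, i.e. b = 1, k = 1000.
freqs = [1000 / r for r in range(1, 11)]
b, k = fit_mandelbrot(freqs)
print(round(b, 3), round(k, 1))  # 1.0 1000.0
```

On real corpus counts the recovered b is typically close to, but not exactly, 1, and nonzero a improves the fit at the high-frequency end.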

What does Zipf's law tell us about language? Answer: not much (virtually nothing!).
A Zipf-like law can be obtained under the assumption that text is produced by randomly and independently choosing each of N letters with equal probability p and the space with probability 1 − Np.
Thus, Zipf's law is about the distribution of words, whereas language, especially semantics, is about the interrelations between words.
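The random-typing argument is easy to simulate: generate characters independently, split on spaces, and inspect the resulting "word" frequencies. The alphabet size and probabilities below are arbitrary choices for the demonstration.

```python
import random
from collections import Counter

# "Random typing" model: each character is drawn independently,
# letters with equal probability and the space somewhat more often.
# No linguistic structure is involved, yet the resulting word
# frequencies still fall off sharply with rank, Zipf-style.
random.seed(0)  # fixed seed for reproducibility
letters = "abcde"
chars = random.choices(letters + " ", weights=[1] * 5 + [2], k=200_000)
words = "".join(chars).split()

counts = Counter(words)
for rank, (word, f) in enumerate(counts.most_common(5), start=1):
    print(rank, word, f)  # short "words" dominate the top ranks
```

The top ranks are occupied by the single-letter "words", with frequency dropping steeply thereafter: a Zipf-like curve produced by a process with no semantics at all, which is exactly the point of the slide above.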

In short, Zipf's law does not reflect some deep underlying process of language. Zipf-like laws are typical of many stochastic processes and are largely unrelated to the characteristic features of any particular one: they represent a phenomenon of universal occurrence that carries no specific information about the underlying process. Rather, language-specific information is hidden in the deviations from the law, not in the law itself.