COMP 791A: Statistical Language Processing

COMP 791A: Statistical Language Processing
Introduction (Chap. 1)

Course information
Prof: Leila Kosseim
Office: LB 903-7
Email: kosseim@cs.concordia.ca
Office hours: TBA

Goal of NLP
Develop techniques and tools to build practical and robust systems that can communicate with users in one or more natural languages.

            Natural lang.                Artificial lang.
Lexicon     >100,000 words               ~100 words
Syntax      complex                      simple
Semantics   1 word --> several meanings  1 word --> 1 meaning

References
- Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, MIT Press, 1999.
- Speech and Language Processing, Daniel Jurafsky & James H. Martin, Prentice Hall, 2000.
- Current literature available on the Web. See the course Web page: www.cs.concordia.ca/~kosseim/Teaching/COMP791-W04/

Other References
Proceedings of major conferences:
- ACL: Association for Computational Linguistics
- EACL: European Chapter of the ACL
- ANLP: Applied Natural Language Processing
- COLING: International Conference on Computational Linguistics
- TREC: Text REtrieval Conference

Who studies languages?
- Linguist: What constrains the possible meanings of a sentence? Uses mathematical models (ex. formal grammars).
- Psycholinguist: How do people produce a discourse from an idea? Uses experimental observations with human subjects.
- Philosopher: What is meaning anyways? How do words identify objects in the world? Uses argumentation, examples and counter-examples.
- Computational Linguist (NLP): How can we identify the structure of sentences automatically? Uses data structures, algorithms, AI techniques (search, knowledge representation, machine learning, ...).

Why study NLP?
It is necessary for many useful applications: information retrieval, information extraction, filtering, spelling and grammar checking, automatic text summarization, understanding and generation of natural language, machine translation, ...

Who needs NLP?
- Too many texts to manipulate: on the Internet, in e-mails, in various corporate documentation.
- Too many languages: ~39,000 languages and dialects.

Languages on the Internet
In fact, native English speakers are already a minority on the web. Chinese speakers are projected to pass English speakers on the web in 2007, Chinese being the biggest language in the world. At the other end of the spectrum, there are hundreds of little-bitty languages. They don't show up on this chart, but they sure come in handy when you're a U.N. peacekeeper, a tourist, a pen pal, a businessperson or a websurfer. Source: Global Reach (www.glreach.com)

[globalization ...] And they're not all doing it in English, contrary to popular belief! Source: Global Reach (www.glreach.com)

Applications of NLP
- Text-based: processing of written texts (ex. newspaper articles, e-mails, Web pages, ...)
  - Text understanding/analysis (NLU): IR, IE, MT, ...
  - Text generation (NLG)
- Dialog-based systems (human-machine communication). Ex: QA, tutoring systems, ...

Brief history of NLP
1940s-1950s: Foundational insights
- Automata, finite-state machines & formal languages (Turing, Chomsky, Backus & Naur)
- Probability and information theory (Shannon)
- Noisy channel and decoding (Shannon)
1960s-1970s: Two camps
- Symbolic: linguists & computer scientists
  - Transformational grammars (Chomsky, Harris)
  - Artificial Intelligence (Minsky, McCarthy)
  - Theorem proving, heuristics, general problem solver (Newell & Simon)
- Stochastic: statisticians & electrical engineers
  - Bayesian reasoning for character recognition
  - Authorship attribution
  - Corpus work

Brief history of NLP (con't)
1970s-1980s: Four paradigms
- Stochastic approaches
- Logic-based / rule-based approaches
- Scripts and plans for NL understanding of "toy worlds"
- Discourse modeling (discourse structures & coreference resolution)
Late 1980s-1990s: Rise of probabilistic models
- Data-driven probabilistic approaches (more robust)
- Engineering practical solutions using automatic learning
- Strict evaluation of work

Why study NLP Statistically?
Until about 10 years ago, NLP was mainly investigated using a rule-based approach. But:
- Rules are often too strict to characterize people's use of language (people tend to stretch and bend rules in order to meet their communicative needs).
- Rules need (expert) people to develop them (the knowledge acquisition bottleneck).
Statistical methods are more flexible and more robust.

Tools and Resources Needed
- Probability/statistical theory: statistical distributions, Bayesian decision theory.
- Linguistic knowledge: morphology, syntax, semantics, pragmatics, ...
- Corpora: bodies of marked-up or raw text to which statistical methods and current linguistic knowledge can be applied, in order to discover novel linguistic theories or interesting and useful knowledge for building applications.

The Alphabet Soup
NLP: Natural Language Processing
CL: Computational Linguistics
NLE: Natural Language Engineering
HLT: Human Language Technology
IE: Information Extraction
IR: Information Retrieval
MT: Machine Translation
QA: Question Answering
POS: Part-of-Speech
NLG: Natural Language Generation
NLU: Natural Language Understanding

Why is NLP difficult?
Because natural language is highly ambiguous.
Syntactic ambiguity:
- "I made her duck." has 2 parses (i.e., syntactic analyses):
  (S (NP I) (VP (V made) (NP (PRO her) (N duck))))       [I cooked a duck for her]
  (S (NP I) (VP (V made) (NP (PRO her)) (VP (V duck))))  [I caused her to duck]
- "The president spoke to the nation about the problem of drug use in the schools from one coast to the other." has 720 parses. Ex: "to the other" can attach to any of the previous NPs (ex. "the problem") or to the head verb (6 places); "from one coast" has 5 places to attach; ...

Why is NLP difficult? (con't)
- Word category ambiguity: book --> verb? or noun?
- Word sense ambiguity: bank --> financial institution? building? or river side?
- Words can mean more than the sum of their parts: make up a story
- Fictitious worlds: People on Mars can fly.
- Defining scope: People like ice-cream. Does this mean that all (or only some?) people like ice cream?
- Language is changing and evolving: I'll email you my answer. This new S.U.V. has a compartment for your mobile phone.

Methods that do not work well
Hand-coded rules:
- produce a knowledge acquisition bottleneck
- perform poorly on naturally occurring text
Ex: hand-coded syntactic constraints and preference rules
Ex: selectional restrictions: animate being --> swallow --> physical object
But: "I swallowed his story / line." "The supernova swallowed the planet."

What Statistical NLP can do
- It seeks to solve the acquisition bottleneck by automatically learning preferences from corpora (ex. lexical or syntactic preferences).
- It offers a solution to the problem of ambiguity and "real" data, because statistical models:
  - are robust
  - generalize well
  - behave gracefully in the presence of errors and new data

Some standard corpora
- Brown corpus: ~1 million words; tagged corpus (POS); balanced (a representative sample of American English of the 1960s-1970s, across different genres).
- Lancaster-Oslo-Bergen (LOB) corpus: British replication of the Brown corpus.
- Susanne corpus: free subset of the Brown corpus (130,000 words), with syntactic structure.
- Penn Treebank: articles from the Wall Street Journal.
- Canadian Hansard: bilingual corpus of parallel texts.

What to do with text corpora? Count words
Count words to find:
- the most common words in the text
- how many words there are in the text (word tokens vs. word types)
- the average frequency of each word in the text

What's a word anyways?
"I have a can opener; but I can't open these cans." How many words?
- Word form: the inflected form as it appears in the text. can and cans are different word forms.
- Lemma: a set of lexical forms having the same stem, the same POS and the same meaning. can and cans share the same lemma.
- Word token: an occurrence of a word. The sentence above has 11 word tokens (not counting punctuation).
- Word type: a distinct realization of a word. The sentence above has 10 word types (not counting punctuation; I occurs twice).
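
A minimal sketch of these counts on the example sentence (pure Python; the naive regex tokenizer, and treating "can't" as a single token, are assumptions of the sketch, not the textbook's tokenization):

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase, keep alphabetic words with an optional
    # internal apostrophe (so "can't" stays one token), drop punctuation.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

sentence = "I have a can opener; but I can't open these cans."
tokens = tokenize(sentence)
types = set(tokens)

print(len(tokens))   # 11 word tokens (punctuation not counted)
print(len(types))    # 10 word types ("i" occurs twice)
print(Counter(tokens).most_common(3))
```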

An example
- Mark Twain's Tom Sawyer: 71,370 word tokens; 8,018 word types; token/type ratio = 8.9.
- The complete works of Shakespeare: 884,647 word tokens; 29,066 word types; token/type ratio = 30.4.
(The token/type ratio is one indication of a text's complexity.)

Common words in Tom Sawyer
[table of the most frequent words omitted] ... but words in natural language have an uneven distribution ...

Frequency of frequencies
- Most words are rare: 3,993 (50%) of the word types appear only once; they are called hapax legomena ("read only once").
- But common words are very common: 100 words account for 51% of all tokens (of all text).
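
A sketch of the same frequency-of-frequencies computation ('tom_sawyer.txt' is a placeholder path, not a file provided with the course; any large plain-text file will do):

```python
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as fh:  # placeholder path
    tokens = re.findall(r"[a-z]+", fh.read().lower())

type_freqs = Counter(tokens)                  # frequency of each word type
freq_of_freqs = Counter(type_freqs.values())  # how many types occur once, twice, ...

print(f"hapax legomena: {freq_of_freqs[1]} of {len(type_freqs)} word types")

# Fraction of all tokens covered by the 100 most frequent words.
top100 = sum(count for _, count in type_freqs.most_common(100))
print(f"top 100 words cover {top100 / len(tokens):.0%} of all tokens")
```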

Word counts are interesting...
- as an indication of a text's style
- as an indication of a text's author
But, because most words appear very infrequently, it is hard to predict much about the behavior of words (if they do not occur often in a corpus). --> Zipf's Law

Zipf's Law
Count the frequency of each word type in a large corpus, and list the word types in order of decreasing frequency. Let:
  f = frequency of a word type
  r = its rank in the list
Zipf's Law says: f ∝ 1/r. In other words, there exists a constant k such that f × r = k.
Ex: the 50th most common word should occur with 3 times the frequency of the 150th most common word (f(50) × 50 = f(150) × 150, so f(50) = 3 × f(150)).
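
An empirical check of f × r ≈ k takes only a few lines of code (a sketch; the file name is again a placeholder and the sampled ranks are arbitrary):

```python
import re
from collections import Counter

def zipf_check(text, ranks=(10, 50, 100, 500, 1000)):
    # Sort word frequencies in decreasing order and report f * r at a few
    # sample ranks; under Zipf's Law the product stays roughly constant.
    tokens = re.findall(r"[a-z]+", text.lower())
    freqs = sorted(Counter(tokens).values(), reverse=True)
    for r in ranks:
        if r <= len(freqs):
            f = freqs[r - 1]
            print(f"rank {r:5d}   freq {f:6d}   f*r = {f * r}")

with open("tom_sawyer.txt", encoding="utf-8") as fh:  # placeholder path
    zipf_check(fh.read())
```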

Zipf's Law on Tom Sawyer
Empirically, k ≈ 8,000-9,000, except for:
- the 3 most frequent words
- words of frequency ≈ 100

Plot of Zipf's Law
On chap. 1-3 of Tom Sawyer (numbers differ from those on p. 25 & 26): f × r = k [plot omitted]

Plot of Zipf's Law (con't)
On chap. 1-3 of Tom Sawyer: f × r = k ==> log(f × r) = log(k) ==> log(f) + log(r) = log(k), so on a log-log plot, frequency vs. rank is a straight line with slope -1. [plot omitted]

Zipf's Law, so what?
Significance of Zipf's Law for us:
- There are a few very common words, a medium number of medium-frequency words, and a large number of infrequent words.
- So for most words, our data about their use will be very sparse; only for a few words will we have lots of examples.
One proposed explanation, the Principle of Least Effort: a tradeoff between the speaker's and the hearer's effort. The speaker communicates with a small vocabulary of common words (less effort for the speaker); the hearer disambiguates messages through a large vocabulary of rarer, more specific words (less effort for the hearer).

Another Zipf law on language
The number of meanings of a word is correlated with its frequency: the more frequent a word, the more senses it can have.
Ex: words at rank 2,000 have 4.6 meanings on average; words at rank 5,000 have 3 meanings; words at rank 10,000 have 2.1 meanings.
Ex: verb senses in WordNet: serve has 13 senses, but most verbs have only 1 sense.
With f = frequency of a word, m = its number of senses, and r = its rank, the law says m ∝ √f, and since f ∝ 1/r, m ∝ 1/√r.
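
The three data points quoted above are enough to check the m ∝ 1/√r form numerically; a quick sketch:

```python
from math import sqrt

# (rank, average number of meanings) pairs as quoted on the slide
data = [(2000, 4.6), (5000, 3.0), (10000, 2.1)]

# If m is proportional to r**-0.5, then m * sqrt(r) should be constant.
for r, m in data:
    print(f"rank {r:6d}: m * sqrt(r) = {m * sqrt(r):.1f}")
# Prints ~205.7, ~212.1, ~210.0 -- roughly constant, as the law predicts.
```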

Yet another Zipf law on language
Content words tend to "clump" together. If we take a text and count the distance between identical words (tokens), the frequency of intervals of size s between identical tokens is inversely proportional to a power of the size: f ∝ 1/s^p, with p varying between about 1 and 1.3.
(f = frequency of intervals of size s; s = size of the interval; p = the exponent)
That is, there are a large number of small intervals and a small number of large intervals --> most content words occur near each other.

What to do with text corpora? Find collocations
Collocation: a phrase where the whole expression is perceived as having an existence beyond the sum of its parts (disk drive, make up, bacon and eggs, ...).
Important for machine translation: strong tea --> thé fort, but strong argument --> ?argument fort (convaincant).
Collocations can be extracted from a text by finding the most common bigrams. However, since the most common bigrams are often insignificant (ex. "at the", "of a"), they must be filtered, as in the sketch below.
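
A minimal sketch of the count-then-filter idea (the tiny stopword list is illustrative only; real systems filter on part-of-speech patterns instead, e.g. keeping adjective-noun and noun-noun pairs):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "at", "in", "on", "to", "and", "but", "is"}

def bigram_collocations(text, n=10):
    tokens = re.findall(r"[a-z]+", text.lower())
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Raw counts are dominated by function-word pairs like "at the" and
    # "of a"; dropping bigrams that contain a stopword leaves candidate
    # collocations such as "disk drive" or "strong tea".
    filtered = Counter({bg: c for bg, c in bigrams.items()
                        if not set(bg) & STOPWORDS})
    return filtered.most_common(n)

with open("tom_sawyer.txt", encoding="utf-8") as fh:  # placeholder path
    print(bigram_collocations(fh.read()))
```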

Collocations
[tables of raw bigrams vs. filtered bigrams omitted]

What to do with text corpora? Concordances
Find the different contexts in which a word occurs, using a Key Word In Context (KWIC) concordancing program.

Concordances are useful for:
- finding the syntactic frames of verbs (transitive? intransitive?)
- building dictionaries for learners of foreign languages
- guiding statistical parsers
A toy KWIC concordancer is sketched below.
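
A toy sketch along those lines (window width and output format are arbitrary choices, not those of any particular concordancing tool):

```python
import re

def kwic(text, keyword, width=30):
    # Print every occurrence of keyword aligned in a fixed-width window,
    # so the keyword lines up vertically across all matches.
    text = " ".join(text.split())  # normalize whitespace for alignment
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width]
        print(f"{left} {m.group(0)} {right}")

kwic("He can open the can; the can opener is on the table.", "can")
```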