Natural Language Processing Spring 2007 V. “Juggy” Jagannathan.


Presentation transcript:

Natural Language Processing Spring 2007 V. “Juggy” Jagannathan

Course Book: Foundations of Statistical Natural Language Processing, by Christopher Manning & Hinrich Schütze

Chapter 1: Introduction. January 8, 2007

Linguistic vs. Statistical: Rationale for a statistical approach. Linguistic approaches that attempt to parse language based on grammar alone have failed. Edward Sapir’s famous quote: “All grammars leak.” Statistical approaches have been shown to be practical because they ask instead: “What are the common patterns that occur in language use?”

Rationalist vs. Empiricist: Roughly the difference between “nature” and “nurture.” Rationalist: human intelligence is innate (inherited), and hence a computational system must be loaded with prior knowledge to be effective. Empiricist: a lot can be learned by examining actual use of language, and hence statistical approaches that learn from a “corpus” are germane. Corpus: a body of text. Corpora: the plural of corpus, that is, a collection of texts.

Scientific content: Questions that linguistics should answer. What kinds of things do people say? What do these things say, ask, or request about the world? The traditional linguistic approach relies on a competence grammar and on grammaticality judgments. But this is hard: it means trying to determine whether sentences are grammatical or not. Some examples from page 10 appear on the next slide.

Some examples of sentences

Non-categorical phenomena in language: Language usage changes with time. Some words defy categorization into rigid linguistic categories. Example: “near,” which can be an adjective, a preposition, or show properties of both simultaneously. Example of change: “kind of” and “sort of.” Changes in language usage can be tracked better using statistical NLP approaches.

Language and cognition as probabilistic phenomena: One view of the world, the Chomskyan line of thinking, is that probability and statistics are inappropriate for determining “grammaticality” and understanding the “meaning” of sentences. The viewpoint of statistical NLP is that “grammar” is not necessarily relevant to understanding language and developing practical solutions.

Some parses of the sentence: “Our company is training workers”

The ambiguity of language: why NLP is hard. Linguists like to parse sentences to determine things like who did what to whom. Parsing sentences is hard: there are 455 parses for the sentence “List the sales of the products produced in 1973 with the products produced in 1972.” AI approaches to understanding meaning have failed and have been shown to be brittle and non-scalable.

Dirty Hands: A variety of corpora are available for statistical NLP research. Example used here: Tom Sawyer.

Common Words in Tom Sawyer

Word Counts: Some statistics from Tom Sawyer. Number of word tokens: 71,370. Number of word types (unique words): 8,018. Average frequency: 71,370/8,018 ≈ 8.9. Some words are very common! 12 words appear more than 700 times each. The 100 most common words account for 50.9% of the text. 49.8% of “word types” appear only once in the corpus! These are “hapax legomena,” Greek for “read only once.” How can statistics help us understand the meaning of sentences if half the word types appear only once?
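The counts on this slide can be reproduced in spirit with a few lines of Python. This is a minimal sketch, not the method used to produce the slide's figures: the regex tokenizer, the function name word_stats, and the toy sentence are all assumptions, and a different tokenizer would give different totals for Tom Sawyer.

```python
import re
from collections import Counter

def word_stats(text):
    """Return token count, type count, average frequency, and hapax share."""
    # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)          # word type -> number of occurrences
    n_tokens = len(tokens)            # running words
    n_types = len(counts)             # distinct words
    n_hapax = sum(1 for c in counts.values() if c == 1)  # types seen once
    return {
        "tokens": n_tokens,
        "types": n_types,
        "avg_frequency": n_tokens / n_types,
        "hapax_share": n_hapax / n_types,
    }

# Toy input, invented for illustration (not from Tom Sawyer).
sample = "the cat sat on the mat and the dog sat on the log"
stats = word_stats(sample)
print(stats)
```

Note that the function divides by the number of types, so it assumes non-empty input. To try it on a real corpus, pass the full text of a plain-text file instead of the toy sentence.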

Frequency of frequencies (table). Total number of word types: 8,018.
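A frequency-of-frequencies table answers "how many word types occur exactly n times?" The sketch below shows one way to build it. The function name and the toy sentence are assumptions made for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def frequency_of_frequencies(tokens):
    """Map each frequency n to the number of word types occurring n times."""
    word_counts = Counter(tokens)            # word type -> frequency
    return Counter(word_counts.values())     # frequency -> number of types

# Toy input, invented for illustration.
tokens = "the cat sat on the mat and the dog sat on the log".split()
fof = frequency_of_frequencies(tokens)

for freq in sorted(fof):
    print(freq, fof[freq])
```

The values of the table always sum to the total number of word types, which is the check the slide's total corresponds to.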

Zipf’s Law: Empirical evaluation of Zipf’s law for Tom Sawyer.

Basic Insight from Power Laws: What makes frequency-based approaches to language hard is that almost all words are rare. Zipf’s law is a good way to encapsulate this insight.
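Zipf's law states that a word's frequency is roughly inversely proportional to its rank in the frequency list, so rank × frequency should stay roughly constant. The sketch below computes that product for each word. The function name is an assumption, and the toy token list is constructed by hand to follow the law exactly; real text only approximates it.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Hand-built toy data with frequencies 12, 6, 4, 3 (exactly proportional to 1/rank).
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
rows = rank_frequency(tokens)

for row in rows:
    print(row)
```

On a real corpus the last column drifts rather than staying fixed, and most rows sit at the low-frequency end of the list, which is the "almost all words are rare" point made above.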

Collocations: Collocations in the New York Times corpus, with and without filtering.
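The simplest way to look for collocations is to count adjacent word pairs (bigrams), then filter out pairs that are frequent only because they contain function words. The sketch below is an illustration under assumptions: the slide does not say which filter was used, so a small hand-picked stopword list is swapped in here, and the names STOPWORDS and bigram_counts and the toy sentence are invented.

```python
from collections import Counter

# Hypothetical stopword list, chosen only for this toy example.
STOPWORDS = {"the", "of", "in", "a", "to", "and", "is"}

def bigram_counts(tokens, stopwords=None):
    """Count adjacent word pairs; optionally drop pairs containing a stopword."""
    pairs = zip(tokens, tokens[1:])
    if stopwords is not None:
        pairs = (p for p in pairs
                 if p[0] not in stopwords and p[1] not in stopwords)
    return Counter(pairs)

# Toy input, invented for illustration.
tokens = ("the mayor of new york met the governor of new york "
          "in the city of new york").split()

raw = bigram_counts(tokens)                  # without filtering
filtered = bigram_counts(tokens, STOPWORDS)  # with filtering

print(raw.most_common(3))
print(filtered.most_common(3))
```

Without the filter, "of new" is as frequent as "new york"; with it, only the content-word pair survives at the top, which is the contrast the slide's two tables draw.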

Concordances: Key Word In Context (KWIC).
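A KWIC concordance lists every occurrence of a keyword together with a few words of context on each side. The following is a minimal sketch: the function name kwic, the width parameter, and the example sentence are assumptions made for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])     # up to `width` words before
            right = " ".join(tokens[i + 1:i + 1 + width])    # up to `width` words after
            hits.append((left, tok, right))
    return hits

# Toy input, invented for illustration.
tokens = "He showed the fence to Ben and Ben showed the brush to Tom".split()
hits = kwic(tokens, "showed", width=2)

# Right-align the left context so the keyword lines up in one column.
for left, word, right in hits:
    print(f"{left:>12}  {word}  {right}")
```

Aligning the keyword in a single column is what makes patterns in the surrounding words easy to scan by eye.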