Published by Wilfred Garrison. Modified over 8 years ago.
1
Natural Language Processing Spring 2007 V. “Juggy” Jagannathan
2
Course Book
Foundations of Statistical Natural Language Processing, by Christopher Manning & Hinrich Schütze
3
Chapter 1 Introduction January 8, 2007
4
Linguistic vs. Statistical
Rationale for a statistical approach
–Linguistic approaches that attempt to parse language based on grammar alone have largely failed
–Edward Sapir’s famous quote: “All grammars leak”
–Statistical approaches have been shown to be practical: they look at “What are the common patterns that occur in language use?”
5
Rationalist vs. Empiricist
Sort of the difference between “nature” and “nurture”
–Rationalist: human intelligence is largely innate, and hence a computational system must be loaded with prior knowledge to be effective
–Empiricist: a lot can be learned by examining actual use of language, and hence statistical approaches that learn from a “corpus” are germane
Corpus – a body of text
Corpora – plural of corpus, i.e., several such bodies of text
6
Scientific content: Questions that linguistics should answer
What kinds of things do people say?
What do these things say/ask/request about the world?
Traditional linguistic approach
–Competence grammar and grammaticality determination
–But this is hard: it means deciding whether each sentence is grammatical or not
–Some examples from page 10 of the course book are on the next slide
7
Some examples of sentences
8
Non-categorical phenomena in language
Language usage changes with time
Some words defy categorization into rigid linguistic categories
–Example: “near”, which can be an adjective, a preposition, or both simultaneously
–Example of change: “kind of” and “sort of”, which have come to be used as degree modifiers
Changes in language usage can be tracked better using statistical NLP approaches
9
Language and cognition as probabilistic phenomena
One view of the world, the Chomskyan line of thinking, is that probability and statistics are inappropriate for determining “grammaticality” and understanding the “meaning” of sentences.
The statistical NLP viewpoint is that “grammar” is not necessarily relevant to understanding language and developing practical solutions.
10
Some parses of the sentence: “Our company is training workers”
11
The ambiguity of language: why NLP is hard
Linguists like to parse sentences to determine things like: who did what to whom
Parsing sentences is hard
455 parses for the sentence:
–“List the sales of the products produced in 1973 with the products produced in 1972”
AI approaches to understanding meaning have failed: they have been shown to be brittle and non-scalable
12
Dirty Hands
A variety of corpora are available for statistical NLP research
Tom Sawyer example
13
Common Words in Tom Sawyer
14
Word Counts
Some statistics from Tom Sawyer
–Number of word tokens: 71,370
–Number of word types (unique words): 8,018
–Average frequency: 71,370/8,018 = 8.9
Some words are very common!
–12 words appear more than 700 times each
–The 100 most common words account for 50.9% of the text
49.8% of “word types” appear only once in the corpus! These are “hapax legomena”, Greek for “read only once”
How can statistics help us understand the meaning of sentences if half the word types appear only once?
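The counts on this slide can be reproduced for any text with a few lines of Python. This is a minimal sketch, not the procedure used for the book's figures: the function name `word_stats`, the crude regular-expression tokenizer, and the sample sentence are all assumptions made for illustration. A different tokenizer will give somewhat different numbers.

```python
import re
from collections import Counter

def word_stats(text):
    """Return token count, type count, average frequency and hapax share."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)          # word type -> number of occurrences
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapax legomena: word types that occur exactly once.
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "avg_frequency": n_tokens / n_types if n_types else 0.0,
        "hapax_share": n_hapax / n_types if n_types else 0.0,
    }

# Made-up sample sentence, not taken from Tom Sawyer.
sample = "the cat sat on the mat and the dog sat too"
stats = word_stats(sample)
print(stats)
```

To apply it to a real corpus, read the file into a string and pass it to `word_stats`.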
15
Frequency of frequencies
[Table: number of word types at each frequency; total number of word types: 8,018]
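A frequency-of-frequencies table answers the question "how many word types occur exactly n times?". The sketch below shows one way to build it; the function name `frequency_of_frequencies` and the toy token list are assumptions for illustration only.

```python
from collections import Counter

def frequency_of_frequencies(tokens):
    """Map each frequency n to the number of word types occurring n times."""
    word_counts = Counter(tokens)            # word type -> frequency
    return Counter(word_counts.values())     # frequency -> number of types

# Toy data (made up): "a" occurs 3 times, "b" twice, "c", "d", "e" once each.
tokens = "a a a b b c d e".split()
fof = frequency_of_frequencies(tokens)

for freq in sorted(fof):
    print(freq, fof[freq])
```

Two sanity checks follow from the definition: the table entries sum to the number of word types, and the sum of frequency times entry gives the number of tokens.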
16
Zipf’s Law
Empirical evaluation of Zipf’s law for Tom Sawyer
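Zipf's law says that a word's frequency f is roughly inversely proportional to its rank r in the frequency list, so the product f × r should stay roughly constant. An empirical check like the one on this slide can be sketched as follows. The function name `zipf_table` is an assumption, and the token list is synthetic, built so that the law holds exactly; real text only follows it approximately.

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Return (word, frequency, rank, frequency * rank) for the top words."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        rows.append((word, freq, rank, freq * rank))
    return rows

# Synthetic data (an assumption): frequencies 12, 6, 4, 3 for ranks 1 to 4.
tokens = []
for rank, word in enumerate(["the", "and", "a", "to"], start=1):
    tokens += [word] * (12 // rank)

table = zipf_table(tokens)
for row in table:
    print(row)
```

On real text the last column drifts, and the interesting question is how far it departs from a constant at the highest and lowest ranks.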
20
Basic Insight from Power Laws
What makes frequency-based approaches to language hard is that almost all words are rare.
Zipf’s law is a good way to encapsulate this insight.
21
Collocations
Collocations in New York Times corpus with and without filtering
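The simplest way to look for collocations is to count adjacent word pairs (bigrams). Unfiltered, the top of the list tends to be dominated by function-word pairs, which is why a filter helps. The sketch below uses a tiny stop-word list as a stand-in filter; this is an assumption made to keep the example self-contained and is not the filter behind the slide's table. The names `bigram_counts` and `STOP_WORDS` and the sample text are also hypothetical.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, not the filter used on the slide).
STOP_WORDS = {"of", "the", "in", "to", "a", "and", "is"}

def bigram_counts(tokens, filtered=False):
    """Count adjacent word pairs, optionally dropping pairs with a stop word."""
    pairs = zip(tokens, tokens[1:])
    if filtered:
        pairs = ((a, b) for a, b in pairs
                 if a not in STOP_WORDS and b not in STOP_WORDS)
    return Counter(pairs)

# Made-up sample text.
tokens = ("in the city of new york and in the state of new york "
          "in the fall of the year").split()

raw = bigram_counts(tokens)
kept = bigram_counts(tokens, filtered=True)

print(raw.most_common(3))    # dominated by function-word pairs
print(kept.most_common(3))   # content-word pairs only
```

Without the filter the most frequent pair is a function-word pair; with it, only the content-word pair survives.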
22
Concordances
Key Word In Context (KWIC)
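A KWIC display lists every occurrence of a keyword together with a fixed window of words on either side. A minimal sketch is shown below; the function name `kwic`, the `width` parameter, and the sample sentence are assumptions for illustration, and the sentence is made up rather than quoted from Tom Sawyer.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Made-up sample sentence.
text = "Tom showed off his new tooth and Ben showed him the kite"
rows = kwic(text.split(), "showed", width=2)

for left, key, right in rows:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>12}  {key}  {right}")
```

Lining up the keyword in a single column makes it easy to scan which words tend to precede and follow it.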
23