1 Ch-1: Introduction (1.3 & 1.4 & 1.5) Prepared by Qaiser Abbas (07-0906)

2 1.3: Ambiguity of Language. An NLP system must determine the structure of text, e.g. "who did what to whom?" Conventional parsing systems answer this question only syntactically, and even then in a limited way, e.g. "Our company is training workers." The three different parses are represented as in (1.11): one groups "is training" as a verb group, while in the others "is" is the main verb and "training" is either an adjectival participle modifying "workers" or a present participle followed by a noun (a gerund reading of "training workers").
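A minimal sketch of this kind of syntactic ambiguity (assuming NLTK is installed; the toy grammar below is our own illustration, not the grammar from the text) that produces two competing parses for the example sentence:

```python
# Two competing parses for "Our company is training workers" under a toy grammar:
# one groups "is training" as a verb group, the other makes "is" the main verb
# with "training workers" as a gerund-plus-noun phrase.
import nltk

grammar = nltk.CFG.fromstring("""
    S    -> NP VP
    NP   -> Det N | N | Ger N
    VP   -> Aux Vprp NP | V NP
    Det  -> 'our'
    N    -> 'company' | 'workers'
    Aux  -> 'is'
    V    -> 'is'
    Vprp -> 'training'
    Ger  -> 'training'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("our company is training workers".split()):
    tree.pretty_print()             # each tree is one reading of the sentence
```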

3 The last two parses, (b) and (c), are anomalous. As sentences get longer and grammars get more comprehensive, such ambiguities lead to a terrible multiplication of parses: Martin (1987) reports 455 parses for the following sentence (1.12): "List the sales of the products produced in 1973 with the products produced in 1972." A practical NLP system must therefore be good at making disambiguation decisions about word sense, word category, syntactic structure and semantic scope. The goal is to maximize coverage while minimizing ambiguity, but maximizing coverage also increases the number of undesired parses, and vice versa. AI approaches to parsing and disambiguation have shown that hand-coded syntactic constraints and preference rules are time-consuming to build, do not scale up well, and are brittle and easily broken (Lakoff 1987).

4 Selectional restrictions: e.g. the verb "swallow" requires an animate subject and a physical object as its object. Such restrictions disallow common and simple extensions of the usage of swallow, as in (1.13): a. I swallowed (believed) his story, hook, line, and sinker. b. The supernova swallowed (engulfed) the planet. Statistical NLP approaches solve these problems automatically by learning lexical and structural preferences from corpora, exploiting the information that lies in the relationships between words. Statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data. Moreover, the parameters of statistical NLP models can often be estimated automatically from corpora. Automatic learning reduces human effort and raises interesting scientific issues.

5 1.4: Dirty Hands. 1.4.1: Lexical Resources. Machine-readable text, dictionaries, thesauri, and tools for processing them. Brown Corpus: widely known, an approximately one-million-word corpus of American English; you pay to use it; includes press reportage, fiction, scientific text, legal text, and many other genres. The Lancaster-Oslo-Bergen (LOB) corpus is a British English replication of the Brown Corpus. Susanne Corpus: 130,000 words, freely available, a subset of the Brown Corpus, contains information on the syntactic structure of sentences. Penn Treebank: text from the Wall Street Journal, widely used, not free. Canadian Hansards: proceedings of the Canadian parliament, a bilingual corpus, not freely available; such parallel text corpora are important for statistical machine translation and other cross-lingual NLP work. WordNet electronic dictionary: hierarchical, includes synsets (sets of words with identical meaning) and meronym or part-whole relations between words; free and downloadable from the internet. Further details in Ch-4.
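A small illustrative sketch (assuming NLTK and its WordNet data are installed; the word "tree" is just an arbitrary example) of looking up synsets, the hypernym hierarchy, and part-whole relations:

```python
# Looking up a synset, its hypernyms (the hierarchy) and part-whole relations.
# Assumes NLTK is installed and the WordNet data has been downloaded.
from nltk.corpus import wordnet as wn

tree = wn.synsets("tree")[0]        # the first synset containing "tree"
print(tree.lemma_names())           # words sharing this identical meaning (the synset)
print(tree.hypernyms())             # more general concepts up the hierarchy
print(tree.part_meronyms())         # part-whole relations, e.g. trunk and limb
```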

6 1.4.2: Word Counts. Question: what are the most common words in the text? Table 1.1 lists the common words, known as function words, from Mark Twain's Tom Sawyer corpus, e.g. determiners, prepositions and complementizers. The frequency of "Tom" in the corpus reflects the material from which it was constructed. Question: how many words are there in the text? The corpus contains 71,370 word tokens (a very small corpus, less than half a MB of online text) and 8,018 word types (different words), while a newswire sample of the same size contains about 11,000 word types. The ratio of tokens to types, 71,370 / 8,018 = 8.9, is the average frequency with which each type is used. Table 1.2 shows how many word types occur with a given frequency. Words in the corpus occur "on average" about 9 times each. Some common words occur over 700 times and individually account for over 1% of the tokens, e.g. 3,332 × 100 / 71,370 = 4.67% and 772 × 100 / 71,370 = 1.08%. Overall, the most common 100 words account for over half (50.9%) of the word tokens in the text. At the other extreme, almost half (49.8%, i.e. 3,993) of the word types occur only once in the corpus; these are known as hapax legomena ("read only once"). The vast majority of word types occur extremely infrequently: over 90% of word types occur 10 times or less (summing the counts in Table 1.2 gives 7,277 out of 8,018 word types). Rare words also make up a considerable proportion of the text: about 12% of the text consists of words that occur 3 times or less (3,993 + 2 × 1,292 + 3 × 664 = 8,569 out of 71,370 tokens).
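A rough sketch of how these counts can be reproduced (the filename tom_sawyer.txt is hypothetical, and the exact figures depend on how the text is tokenized):

```python
# A rough reproduction of the slide-6 counts; "tom_sawyer.txt" is a hypothetical
# filename and the exact numbers depend on the tokenizer used.
from collections import Counter
import re

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())     # crude word tokenizer

freq = Counter(tokens)
n_tokens, n_types = len(tokens), len(freq)
print(n_tokens, n_types, n_tokens / n_types)              # roughly 71,370  8,018  8.9

hapax = sum(1 for c in freq.values() if c == 1)           # word types "read only once"
print(hapax / n_types)                                     # close to half of all types

rare_tokens = sum(c for c in freq.values() if c <= 3)     # tokens of words with f <= 3
print(rare_tokens / n_tokens)                              # around 12% of the text
```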

7 1.4.3: Zipf's Law. The Principle of Least Effort: people will act so as to minimize their probable average rate of work. Zipf saw certain empirical laws as evidence of this principle. Count how often each word type occurs in a large corpus and list the words in order of their frequency; we can then explore the relationship between the frequency of a word f and its position in the list, known as its rank r. The law states f ∝ 1/r (1.14), or in other words f · r = k, where k is a constant. This equation says, e.g., that the 50th most common word should occur with three times the frequency of the 150th most common word. The idea was first introduced by Estoup (1916) but widely publicized by Zipf. Zipf's law holds approximately for Table 1.3, except for the three highest-frequency words, and the product f · r bulges for words of rank around 100. The law gives a rough picture of the frequency distribution in human languages: a few very common words, a middling number of medium-frequency words, and many low-frequency words. The validity of, and possible derivations for, Zipf's law were studied by Mandelbrot (1954), who found that Zipf's law sometimes matches a large corpus fairly closely and captures the general shape of the curve, but is poor at reflecting the details. Figure 1.1 is a rank-frequency plot on doubly logarithmic axes. Zipf's law predicts that this graph should be a straight line with slope −1, but Mandelbrot showed that it is a bad fit, especially at the extremes: for low ranks (the most common words) the slope −1 line is too low, and for high ranks (greater than 10,000) the line is too high.
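Continuing from the freq counter in the previous sketch, a quick check of whether the product f · r stays roughly constant across ranks:

```python
# Zipf's law predicts that f * r is roughly constant; spot-check a few ranks.
ranked = freq.most_common()                     # [(word, f), ...] sorted by decreasing f
for r, (word, f) in enumerate(ranked, start=1):
    if r in (10, 50, 100, 500, 1000, 5000):
        print(r, word, f, r * f)                # the products should be of similar size
```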

8 Mandelbrot derived the following relationship to achieve a closer fit: f = P(r + ρ)^(−B), or log f = log P − B log(r + ρ), where P, B and ρ are parameters of a text that collectively measure the richness of the text's use of words. This is still a hyperbolic distribution, as with Zipf's law, and for large values of r it closely approximates a straight line descending with slope −B, just as Zipf's law descends with slope −1. By appropriate settings of the parameters, one can model a curve where the frequency of the most common words is lower; the slight bulge in the upper left corner and the larger slope model the lowest and highest ranks better than Zipf's law, with a straight line at the end. The graph in Fig. 1.2 shows that the Mandelbrot formula is a better fit than Zipf's law for the given corpus. Other laws: Zipf proposed a number of other empirical laws relating to language, two of which are important SNLP concerns. First, "the number of meanings of a word is correlated with its frequency": m ∝ √f, where m is the number of meanings and f is the frequency, or equivalently m ∝ 1/√r. Zipf gives empirical support in his study: words of frequency rank about 10,000 average about 2.1 meanings, rank 5,000 about 3 meanings, and rank 2,000 about 4.6 meanings. Second, one can measure the number of lines or pages between each occurrence of a word in a text and then calculate the frequency F of different interval sizes I; for words of frequency at most 24 in a 260,000-word corpus, Zipf found F ∝ I^(−p), where p varied between 1 and 1.3 in Zipf's studies. In short, most of the time content words occur near another occurrence of the same word (details in Ch-7 and 15.3). Another of Zipf's laws says that there is an inverse relationship between the frequency of words and their length. The significance of power laws: read yourself.
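A sketch (using NumPy, and continuing from the freq counter above) of fitting Mandelbrot's formula in log space for one trial value of ρ; the trial value and variable names are our own choices, not from the text:

```python
# Fit  log f = log P - B log(r + rho)  by least squares for one trial rho.
# Uses the `freq` counter from the earlier sketch; rho = 2.0 is an arbitrary trial value.
import numpy as np

f = np.array(sorted(freq.values(), reverse=True), dtype=float)   # frequencies by rank
r = np.arange(1, len(f) + 1, dtype=float)
rho = 2.0

slope, intercept = np.polyfit(np.log(r + rho), np.log(f), 1)     # slope is -B, intercept is log P
print("B =", -slope, "P =", np.exp(intercept))
```

In practice ρ would itself be tuned, e.g. by trying several values and keeping the one that minimizes the squared error of the fit.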

9 1.4.4: Collocations. Collocations include compound words (disk drive), phrasal verbs (make up) and other stock phrases (bacon and eggs). They often have a specialized meaning and are idiomatic, but they need not be, e.g. international best practice; any frequently used fixed expression is a candidate collocation. Collocations are important in many areas of SNLP, e.g. machine translation (Ch-13) and information retrieval (Ch-15). Lexicographers are also interested in collocations, so as to put them in dictionaries, because they are frequent ways of using words, act as multiword units, and have an independent existence. The practice of collocation de-emphasizes the Chomskyan focus on the creativity of language use, and fits the Hallidayan view that language is inseparable from its pragmatic context (words with special meaning with respect to their use) and social context. Collocations may be several words long or discontinuous (make [something] up). Common bigram collocations from the New York Times are given in Table 1.4. Problem: the counts are not normalized for the frequency of the individual words; the most common word sequences, such as "of the" and "in the", only show that a determiner commonly follows a preposition, but these are not collocations. One solution is to take the frequency of each word into account. Another approach is to filter the candidate collocations first, removing those whose part-of-speech or syntactic patterns are rarely associated with collocations; the two most frequent patterns are adjective-noun and noun-noun, as shown in Table 1.5.
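A sketch (continuing from the token list above and assuming NLTK's default POS tagger data is available) contrasting raw bigram counts with a simple adjective-noun / noun-noun filter:

```python
# Raw bigram counts vs. a part-of-speech filter that keeps only adjective-noun
# and noun-noun bigrams. Uses the `tokens` list from the word-count sketch and
# assumes NLTK's default POS tagger data has been downloaded.
from collections import Counter
import nltk

raw = Counter(zip(tokens, tokens[1:]))              # dominated by pairs like "of the"
tagged = nltk.pos_tag(tokens)                       # [(word, tag), ...]
filtered = Counter(
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if (t1[:2], t2[:2]) in {("JJ", "NN"), ("NN", "NN")}
)
print(raw.most_common(5))
print(filtered.most_common(5))                      # far more collocation-like pairs
```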

10 1.4.5: Concordances. A key word in context (KWIC) concordancing program produces displays of data as in Fig. 1.3, here for uses of "showed off". Some uses appear in double quotes, either because the expression was a neologism (new word) or slang at that time. In the first group of examples the uses are intransitive (the verb has a subject but no object), although some take prepositional phrase modifiers with "in" and "with". Sentences (6, 8, 12, 15) use the verb transitively (an object is required), and (16) uses it ditransitively (with direct and indirect objects). In (13, 15) the object is an NP or a that-clause, (7) takes a non-finite complement clause, and (10) a finite question-form complement clause. (9, 14) have an NP object followed by a PP, but are quite idiomatic; in both cases the object noun is modified to make a more complex NP. We could systematize the patterns as in Fig. 1.4. Collecting information about the patterns of occurrence of verbs like this is useful for building dictionaries for foreign-language learners, guiding statistical parsers, and other tasks.
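A minimal KWIC concordancer sketch over the same token list (the window size and column widths are arbitrary formatting choices):

```python
# A minimal key-word-in-context display over the token list from the earlier sketch.
def kwic(tokens, keyword, window=5):
    """Print every occurrence of `keyword` with `window` tokens of context on each side."""
    for i, w in enumerate(tokens):
        if w == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40} [{w}] {right}")

kwic(tokens, "showed")      # inspect e.g. the "showed off" uses
```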

11 Further Readings: see the references in the text book. Questions, discussion and comments are welcome.