Automatic Morphology and Minimum Description Length John Goldsmith Department of Linguistics.

Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context... Comparison with Early Generative Grammar
4 Situate MDL within a broader intellectual context
5 More substantive description of Automorphology’s design
6 The broader perspective

Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context...
4 Comparison with Early Generative Grammar
5 Situate MDL within a broader intellectual context
6 More substantive description of Automorphology’s design
7 Consequences for anthropology

WinAutomorphology 1
• A version available on the web at goldsmith
• A C++ Windows program that accepts data as input and provides a morphological analysis....

Automorphology

• What do you have to put into a program like that? How much do you have to put into a program like that?
• That is, does it have to have a lot of innate knowledge? Does it help for it to have a lot of innate knowledge?
• If you build such a program, how do you know if it does it the same way as a child?

What do we want? If you give the program a computer file containing Tom Sawyer, it should tell you that the language has a category of words that take the suffixes ing, s, ed, and NULL, and another category that takes the suffixes 's, s, and NULL. If you give it Jules Verne, it should tell you there's a category with the suffixes a, aient, ait, ant (chanta, chantaient, chantait, chantant).

• And it should tell you about irregular stem allomorphy if your language contains it.

That's what AutoMorphology does. How much data do you need?
• You get reasonable results fast, with 5,000 words, but results are much better with 50,000, and much better still with 500,000 words (length of corpus).

Unsupervised learning...
• No prepared corpus; no tagging; just the facts.
• The goal is to reconstruct the logic of linguistics in a quantitative fashion (to the extent that is necessary).

Unsupervised learning
• A fully explicit linguistic hypothesis.
• A device (an algorithm) with immediate practical uses.
• Arguably the embodiment of linguistic theory: the explicit and quantifiable specification of the relationship between data and analysis (grammar).

• For the purposes of version 1 of AutoMorphology, I will restrict myself to Indo-European languages, and in general languages in which the average number of suffixes per word is not greater than 2. (We drop this requirement in AutoMorphology 2.)

Minimum Description Length (Jorma Rissanen, 1989)
Data → Analyzer → Analysis
Select the analyzer and analysis such that the sum of their lengths is a minimum.

[Diagram: one body of Data, with many alternative Analyzer/Analysis pairs. Etc...]

The challenge is to find a means of quantifying
• the length of an analyzer, and
• the length of an analysis
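Once both lengths can be measured in bits, the selection step itself is a plain minimization. A minimal sketch, assuming hypothetical candidates and made-up bit counts (the `Candidate` class, `select_mdl`, and every number below are illustrative and not taken from AutoMorphology):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One (analyzer, analysis) pair with hypothetical lengths in bits."""
    name: str
    analyzer_bits: float   # length of the analyzer (the model)
    analysis_bits: float   # length of the data as described by that analyzer

    @property
    def total_bits(self) -> float:
        return self.analyzer_bits + self.analysis_bits


def select_mdl(candidates):
    """Return the candidate whose analyzer + analysis length is smallest."""
    return min(candidates, key=lambda c: c.total_bits)


# Made-up numbers, chosen only to show the trade-off.
candidates = [
    Candidate("word list only", analyzer_bits=1_000.0, analysis_bits=90_000.0),
    Candidate("stems + suffixes", analyzer_bits=6_000.0, analysis_bits=70_000.0),
    Candidate("one rule per word", analyzer_bits=60_000.0, analysis_bits=40_000.0),
]

best = select_mdl(candidates)
print(best.name, best.total_bits)
```

Neither the smallest analyzer nor the shortest analysis wins on its own; only the sum is compared.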

“Compressed form of data?”
Think of data as a dense, rich, detailed description (evidence), and think of the compressed form as
• a description in a high-level language, plus
• a description of the particulars of the event in question (a.k.a. boundary conditions, etc.)...

“Analyzer” is the set of statements that allows translation between high-level and low-level descriptions.

Minimizing sum of length of Analyzer + Compressed form of data = Aim for conciseness in high-level description + Principles of analysis

Don’t overlook the fact... …that the goal of MDL analysis is nothing less than the solution of the problem of induction. How do we justify generalization, given evidence?

The problem of induction
Speech     child / linguistic theory   grammar
Data       scientist                   theory
Sense      brain                       thought/percept
Evidence   mind                        belief

Data → Morphological analyser → Morphological analysis of that corpus “signature”

Very simply put...
• Just state “ed” “s” “ing” “heit” “ité” once in the grammar;
• pay for each occurrence (how many bits does it take to pay for those few letters?) just once;
• then make repeated reference (use pointers) to those entries.

References, pointers...
• Are not free.
• Information theory tells us exactly what they cost. The fundamental measure is Shannon’s: a pointer to an item of reference frequency P out of a universe of N possibilities is of length log (N/P), in bits when the logarithm is taken to base 2.

Summing over all items, and weighting by count gives us the famous formula:
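The formula itself did not survive in this transcript. Assuming it is the count-weighted sum of pointer lengths, that is, the sum over items of P × log2(N/P) with N taken to be the total number of references, it can be sketched as follows (the function names and the toy suffix counts are hypothetical):

```python
import math
from collections import Counter


def pointer_length(count: int, total: int) -> float:
    """Bits needed for a pointer to an item referenced `count` times out of `total`."""
    return math.log2(total / count)


def total_pointer_cost(counts: Counter) -> float:
    """Sum of pointer lengths over all items, each weighted by its count."""
    total = sum(counts.values())
    return sum(c * pointer_length(c, total) for c in counts.values())


# Toy reference counts for four suffixes (made up for illustration).
suffix_counts = Counter({"NULL": 8, "s": 4, "ed": 2, "ing": 2})

for suffix, count in suffix_counts.items():
    print(suffix, pointer_length(count, sum(suffix_counts.values())))
print("total bits:", total_pointer_cost(suffix_counts))
```

Frequent items get short pointers and rare items get long ones, so the total is small when a few entries account for most of the references.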

A probabilistic morphology:
• Assigns a probability to all words that it can generate; and these probabilities must add up to 1.0.
• A word is three choices:
– choice of signature
– choice of stem within signature
– choice of suffix within signature

• Each of those is assigned a probability, based on counts.
• Probability of a signature

Similarly, the probability of a stem is the number of its occurrences divided by the number of occurrences of that signature in the corpus.

Likewise for the suffixes… If the analysis is wrong, the numbers will be much worse than if it’s right. “The numbers”: a model of the frequencies of words.
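The three choices can be multiplied out directly. A minimal sketch with made-up counts; the signature probability is assumed here to be the signature's token count divided by the corpus token count, since that formula is missing from the transcript, and the data layout and `word_probability` are hypothetical:

```python
from fractions import Fraction

# Hypothetical toy counts: token counts of stems and suffixes per signature.
# Within a signature, stem counts and suffix counts sum to the same total.
signatures = {
    "NULL.ed.ing": {
        "stems": {"look": 6, "kill": 2},
        "suffixes": {"NULL": 4, "ed": 3, "ing": 1},
    },
    "NULL.s": {
        "stems": {"boy": 3, "town": 1},
        "suffixes": {"NULL": 3, "s": 1},
    },
}


def word_probability(sig, stem, suffix, signatures):
    """P(word) = P(signature) * P(stem | signature) * P(suffix | signature)."""
    total_tokens = sum(sum(s["stems"].values()) for s in signatures.values())
    entry = signatures[sig]
    sig_tokens = sum(entry["stems"].values())
    p_sig = Fraction(sig_tokens, total_tokens)          # assumed definition
    p_stem = Fraction(entry["stems"][stem], sig_tokens)
    p_suffix = Fraction(entry["suffixes"][suffix], sig_tokens)
    return p_sig * p_stem * p_suffix


p_looked = word_probability("NULL.ed.ing", "look", "ed", signatures)

# Probabilities of all generable words must add up to 1.
total = sum(
    word_probability(sig, stem, suffix, signatures)
    for sig, entry in signatures.items()
    for stem in entry["stems"]
    for suffix in entry["suffixes"]
)
print(p_looked, total)
```

Exact fractions are used so that the sum-to-one requirement can be checked without rounding error.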

Maximum Likelihood
• The best morphology is the one that assigns the highest probability to the observed data.
• …known in the biz as Maximum Likelihood.
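As a sketch of what "highest probability to the observed data" means in practice, the snippet below scores a tiny corpus under two hypothetical models and keeps the better one. The models are stand-in word-probability tables with made-up values, not real morphologies. The negated base-2 log-likelihood is also the number of bits an ideal code based on the model would need for the corpus.

```python
import math


def log2_likelihood(corpus, model):
    """Sum of log2 probabilities the model assigns to each observed word."""
    return sum(math.log2(model[word]) for word in corpus)


corpus = ["looked", "looked", "looks", "looking"]

# Two hypothetical models; each assigns probabilities that sum to 1.
model_a = {"looked": 0.5, "looks": 0.25, "looking": 0.25}
model_b = {"looked": 0.25, "looks": 0.5, "looking": 0.25}

scores = {
    "model_a": log2_likelihood(corpus, model_a),
    "model_b": log2_likelihood(corpus, model_b),
}
best = max(scores, key=scores.get)
print(scores, best)
print("bits under best model:", -scores[best])
```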

Compare with Early Generative Grammar (EGG)
Data, Analysis 1, Analysis 2 → Linguistic Theory → Preference: A1/A2

Data → Linguistic theory → Analysis
Data, Analysis → Linguistic theory → Yes/No
Data, Analysis 1, Analysis 2 → Linguistic theory → 1 is better / 2 is better

Implicit in EGG was the notion... that the best Linguistic Theory (LT) could be selected by: getting a set of n candidate LTs; submitting to each a set of corpora; searching (using unknown heuristics) for the best analysis of each corpus within each LT. The LT wins for which the sum total of all of the analyses is the smallest.

No cost to UG
• In EGG, there was no cost associated with the size of UG -- in effect, no plausibility measure.

In MDL, in contrast….
• we can argue for a grammar for a given corpus.
• We can also argue at the Linguistic Theory level if we so choose...

• Select n corpora, and evaluate each LT on the basis of the LT’s length plus the length of all of the grammars derived from it, plus the lengths of the compressed corpora derived from those grammars.
• Pick the LT with the shortest sum total length.
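The comparison at the Linguistic Theory level can be sketched the same way. All names and bit counts below are hypothetical; the point is only that the LT's own length enters the total alongside the grammars and the compressed corpora:

```python
def total_length(lt_bits, per_corpus):
    """LT length + all grammar lengths + all compressed-corpus lengths (bits).

    `per_corpus` is a list of (grammar_bits, compressed_corpus_bits) pairs,
    one pair per corpus.
    """
    return lt_bits + sum(grammar + corpus for grammar, corpus in per_corpus)


# Two hypothetical theories evaluated on the same two corpora (made-up numbers).
theories = {
    "LT-A": (500.0, [(2_000.0, 30_000.0), (2_500.0, 41_000.0)]),
    "LT-B": (5_000.0, [(1_000.0, 28_000.0), (1_200.0, 39_000.0)]),
}

totals = {name: total_length(*spec) for name, spec in theories.items()}
best = min(totals, key=totals.get)
print(totals, best)
```

With these numbers the larger theory wins because it pays for itself in shorter grammars and shorter compressed corpora; with no cost attached to the theory, that trade-off could not be stated at all.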

Distinction between heuristics and “theory”
• In the context of MDL, the heuristics are extratheoretical, but from the point of view of the (psycho-)linguist, they are very important.
• The heuristics propose; the theory disposes.

Stems with their signatures
abrupt          NULL ly ness
abs             ence ent
absent          -minded NULL ia ly
absent-minded   NULL ly
absentee        NULL ism
absolu          NULL e ment
absorb          ait ant e er é ée
abus            ait er
abîm            e es ée

Now build up signature collection...
Top 10, 100K words
1. NULL.ed.ing
2. NULL.ed.ing.s
3. NULL.s
4. 's.NULL.s
5. NULL.ed.s
6. NULL.ly
7. NULL.ed
8. 's.NULL
9. NULL.d.s
10. NULL.ing

Verbose signature.... NULL.ed.ing. 58
heap      check      revolt
plunder   look       obtain
escort    proclaim   arrest
gain      destroy    stay
suspect   kill       consent
knock     track      succeed
answer    frighten   glitter
....

Stem allomorphy
In a corpus of French, we find pairs of stems:
ç:c/_# 10
commenç\commenc
menaç\menac
renonç\renonc
avanç\avanc
annonç\annonc
s'effaç\s'effac
enfonç\enfonc
recommenç\recommenc
perç\perc
forç\forc
lanç\lanc
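Pairs like these can be found mechanically once a candidate alternation is given. A minimal sketch, assuming the stem list is already available and the alternation (final ç versus final c) is supplied by hand; `allomorph_pairs` is a hypothetical helper, and how AutoMorphology proposes such alternations is not shown in these slides:

```python
def allomorph_pairs(stems, a="ç", b="c"):
    """Return (stem_a, stem_b) pairs that differ only in a final `a` vs `b`."""
    stem_set = set(stems)
    pairs = []
    for stem in sorted(stem_set):
        if stem.endswith(a):
            partner = stem[: -len(a)] + b
            if partner in stem_set:
                pairs.append((stem, partner))
    return pairs


# A few stems from the slide, plus two that should not pair up.
stems = ["commenç", "commenc", "menaç", "menac", "lanç", "lanc", "chant", "perç"]

pairs = allomorph_pairs(stems)
print(pairs)
```

Here "perç" is left unpaired only because its partner "perc" was deliberately omitted from the toy list.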

compressed length of corpus:

Heuristics Find more than one stem that commutes with more than one suffix
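This heuristic can be sketched as a grouping step. The version below is an illustration under stated assumptions, not the program's actual procedure: it takes a hand-supplied list of candidate suffixes (the real system has to discover them), and `find_signatures` is a hypothetical name. Signatures are written as sorted suffixes joined by dots, matching the notation used elsewhere in these slides.

```python
from collections import defaultdict


def find_signatures(words, suffixes):
    """Group stems by the set of suffixes they occur with.

    Keeps only signatures with more than one suffix and more than one stem.
    """
    stem_to_suffixes = defaultdict(set)
    for word in set(words):
        stem_to_suffixes[word].add("NULL")        # the bare word: stem + NULL
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem_to_suffixes[word[: -len(suffix)]].add(suffix)

    signature_to_stems = defaultdict(list)
    for stem, found in stem_to_suffixes.items():
        if len(found) > 1:                         # more than one suffix
            signature_to_stems[".".join(sorted(found))].append(stem)

    return {
        sig: sorted(stems)
        for sig, stems in signature_to_stems.items()
        if len(stems) > 1                          # more than one stem
    }


words = ["look", "looked", "looking", "kill", "killed", "killing",
         "boy", "boys", "town", "towns", "the"]
result = find_signatures(words, suffixes=["ed", "ing", "s"])
print(result)
```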

• Negotiate for where the stem/suffix break should be: mea, christia, roma, reig, rui, saxo, tow all take the suffixes n/ns.
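One simple form of this negotiation is to move a letter shared by every suffix back onto the stems. The sketch below shows only that shift; it does not score the two alternatives, which in an MDL setting would presumably be decided by comparing description lengths. `shift_common_letter` is a hypothetical helper.

```python
def shift_common_letter(stems, suffixes):
    """If every suffix starts with the same letter, move it onto the stems.

    An emptied suffix is written as "NULL". The inputs are returned unchanged
    when there is no common first letter or when NULL is already a suffix.
    """
    first_letters = {s[0] for s in suffixes if s != "NULL"}
    if "NULL" in suffixes or len(first_letters) != 1:
        return stems, suffixes
    letter = first_letters.pop()
    new_stems = [stem + letter for stem in stems]
    new_suffixes = [s[1:] or "NULL" for s in suffixes]
    return new_stems, new_suffixes


stems = ["mea", "christia", "roma", "reig", "rui", "saxo", "tow"]
new_stems, new_suffixes = shift_common_letter(stems, ["n", "ns"])
print(new_stems, new_suffixes)
```

After the shift the stems are ordinary words (mean, christian, roman, ...) and the signature becomes NULL.s.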

Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context...
4 Comparison with Early Generative Grammar
5 Situate MDL within a broader intellectual context
6 More substantive description of Automorphology’s design
7 Consequences for the broader context

Data → Mind/head/brain/nervous system → Analysis
But are the contributions of these two of equal magnitude, in the case of language? Otherwise put, to what extent does the structure here reside in the data -- and to what extent in the analyzer?

A rich, deductive structure?
• No shadow of a rich, deductive structure in the learner casting its image on the form of the learned morphology.
• A pure structuralism -- a structuralism without Jakobsonian dualities (but see my paper on Jakobson….)