Unsupervised learning of Natural languages Eitan Volsky Yasmine Meroz.


Introduction Grammar learning methods can be grouped into two kinds: supervised and unsupervised. Roughly speaking, unsupervised methods use only plain (or at most pre-tagged) sentences, while supervised methods are first initialized with structured sentences. Other forms of supervision exist as well, for example, probabilistic grammars.

Supervised methods clearly outperform unsupervised ones, but they depend on structured training data, which is much more time-consuming to produce, and in many cases it is impossible to find a treebank or corpus suitable for a specific task. Examples: –Deciphering text in an unknown language –DNA sequence analysis.

The bootstrapping process The process generates the syntactic structure of a sentence, starting from scratch when it is completely unsupervised. The structure has to be useful, so arbitrary, random, or incomplete structures are avoided. The system should try to minimize the amount of information it needs in order to learn structure.

The Scope of the article The article presents two unsupervised learning frameworks: –EMILE 4.1 –ABL (Alignment-Based Learning) We’ll present the frameworks and the algorithms that underlie them, and compare them on the ATIS and OVIS corpora.

EMILE 4.1 Some definitions first. In the sentence “David makes tea”: “David (.) tea” is a “context” and “makes” is an “expression”.

Substitution Classes - intuition If a language has a CFG, then expressions that are generated from the same non-terminal can substitute for each other in every context where that non-terminal is a valid constituent. If we have a sufficiently rich sample, we can expect to find classes of expressions that cluster together.

Primary and characteristic contexts and expressions A grammatical type T is defined as a pair (T_C, T_E), where T_C is a set of contexts and T_E is a set of expressions. Expressions and contexts from those sets are called primary. A characteristic context for T is a context that appears only with expressions of type T. Characteristic expressions are defined in the same way.

Example “walk” can be both a noun and a verb, so it cannot be characteristic for either noun phrases or verb phrases. “thing” can only be a noun, so it appears only in noun phrases. “thing” is characteristic for the type “noun”.

Shallow languages within the Chomsky hierarchy [Figure: the nested classes type 0, context-sensitive, context-free, and regular, with the shallow languages drawn across them.] Shallow seems to be an independent category.

Shallow languages - first criterion A grammar G has context separability if each type of G has a characteristic context, and expression separability if each type of G has a characteristic expression. A shallow language has to be both context separable and expression separable.

Shallow languages - second criterion A shallow language has to have a sample set of sentences S that induces characteristic contexts and expressions for all types of G_L. This set is called a characteristic sample. For every sentence s of this sample set: K(s) < log(|G|), where K(s) is the Kolmogorov complexity (descriptive length) of s.

Why Shallow languages? If the sample is drawn under a simple distribution (one dominated by a recursively enumerable distribution), the last criterion ensures that the sample can be learnt (its grammar induced) in time polynomial in |G|. Shallow grammars can therefore be learned efficiently from positive examples, which makes the poverty-of-the-stimulus argument, based on Gold’s results, unconvincing.

Natural Languages are Shallow It is claimed (but unproven) that natural languages are shallow. Natural languages have large lexicons and relatively few rules. Their shallowness ensures that if we sample enough sentences, the sample will be characteristic with high confidence.

How does EMILE really work ? Two Phases : –Clustering –Rule Induction

Corpus John makes tea. John likes coffee. John is eating. John likes tea. John likes eating. John makes coffee. …

Clustering Each x marks a context/expression pair observed in the corpus.

expr'    John is (.)  John likes (.)  John makes (.)  John (.) eating  John (.) coffee  John (.) tea
makes                                                                  x                x
likes                                                 x                x                x
is                                                    x
tea                   x               x
coffee                x               x
eating   x            x
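The context/expression matrix above can be rebuilt from the corpus with a few lines of Python. This is only a minimal sketch of the bookkeeping, not EMILE's actual implementation: the function names `context_expression_pairs` and `build_matrix` are made up for illustration, and the sketch enumerates every contiguous word span as an expression, which yields more rows and columns than the slide shows.

```python
from collections import defaultdict

# The six corpus sentences from the slide (punctuation dropped).
corpus = [
    "John makes tea", "John likes coffee", "John is eating",
    "John likes tea", "John likes eating", "John makes coffee",
]

def context_expression_pairs(sentence):
    """Yield (context, expression) for every contiguous word span.

    The context is the sentence with the span replaced by "(.)".
    """
    words = sentence.split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            expression = " ".join(words[start:end])
            context = " ".join(words[:start] + ["(.)"] + words[end:])
            yield context, expression

def build_matrix(sentences):
    """Map each context to the set of expressions seen in it (hypothetical helper)."""
    matrix = defaultdict(set)
    for sentence in sentences:
        for context, expression in context_expression_pairs(sentence):
            matrix[context].add(expression)
    return matrix

matrix = build_matrix(corpus)
for context in ("John likes (.)", "John (.) tea"):
    print(context, "->", sorted(matrix[context]))
```

Contexts that share many expressions (and expressions that share many contexts) are the candidates that the clustering phase groups into a type.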

Identification of clusters - settings The identification of clusters depends on the following settings: –Total_support% –Context_support% –Expression_support% Suppose that: Total_support% = Context_support% = 75%, Expression_support% = 80%

Contexts: John buys (.), John drinks (.), John likes (.), John makes (.). Number of these four contexts in which each expression was observed: tea 4, coffee 4, lemonade 3, soup 4, apples 1.

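Rule Induction Rules have the form T => s_0 [T_1] s_1 [T_2] s_2 [T_3] s_3, where the s_i are word sequences and the [T_i] are types. EMILE attempts to transform the collection of derivation rules it has found into a CFG consisting of those rules. Example: from [0] => what is a family fare and [19] => a family fare, it derives [0] => what is [19].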

ABL (Alignment-Based Learning) ABL is based on Harris’ principle of substitutability (1951) : All constituents of the same kind can be replaced by each other. ABL uses a reversed version of this principle : If parts of sentences can be substituted by each other, they are constituents of the same type.

The Algorithm The output of the algorithm is a labeled, bracketed version of the input corpus. The model learns by comparing all sentences in the input corpus to each other in pairs. Two phases: –Alignment learning –Selection learning

A Comparison of two sentences The comparison of two sentences falls into one of three categories: –All words in the two sentences are the same –The sentences are completely unequal –Some words are the same in both sentences and some are not.

Alignment Learning What is a family fare? What is the payload of an African swallow? The unequal parts of the sentences are possible constituents: {a family fare, the payload of an African swallow}
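A rough sketch of this alignment step in Python follows. It uses the standard library's `difflib.SequenceMatcher` as a stand-in aligner; that class finds matching blocks with its own heuristic and is not the edit-distance alignment ABL itself uses. The function name `unequal_parts` is hypothetical.

```python
from difflib import SequenceMatcher

def unequal_parts(sentence_a, sentence_b):
    """Return pairs of word spans that differ between two sentences.

    Each pair is a hypothesis: the two spans may be constituents of the
    same type, because they occur in the same context.
    """
    a, b = sentence_a.split(), sentence_b.split()
    # autojunk=False keeps difflib from discarding frequent words.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    hypotheses = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            hypotheses.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return hypotheses

pairs = unequal_parts(
    "What is a family fare",
    "What is the payload of an African swallow",
)
print(pairs)
```

For this sentence pair the only shared block is "What is", so the single hypothesis pairs "a family fare" with "the payload of an African swallow".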

The Edit Distance The edit distance is the minimum edit cost needed to transform one sentence into another (Wagner & Fischer 1974). The algorithm that finds the edit distance also finds the longest common subsequence, and it indicates how far apart the linked (equal) parts of the two sentences are.
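The dynamic program behind the edit distance can be sketched at word level as follows. The cost settings are an assumption made for this illustration (insertion and deletion cost 1, substitution cost 2); under exactly these costs the distance d and the length of the longest common subsequence are related by d = len(a) + len(b) - 2 * LCS. The function name `edit_distance` and the parameter `sub_cost` are illustrative.

```python
def edit_distance(a, b, sub_cost=2):
    """Wagner-Fischer dynamic program over two word lists.

    Insertions and deletions cost 1; substitutions cost `sub_cost`
    (an assumed setting, chosen so the result relates directly to the LCS).
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete the first i words of a
    for j in range(cols):
        dist[0][j] = j          # insert the first j words of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1]

s1 = "From San Francisco to Dallas".split()
s2 = "From Dallas to San Francisco".split()
d = edit_distance(s1, s2)
lcs_length = (len(s1) + len(s2) - d) // 2
print(d, lcs_length)
```

Tracing back through the table (not shown here) recovers which words were linked, and therefore which spans are left over as unequal parts.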

Example Two possible alignments of the same sentence pair.
Linking on “Dallas”:
From (San Francisco to)1 Dallas ()2
From ()1 Dallas (to San Francisco)2
Linking on “to”:
From (San Francisco)1 to (Dallas)2
From (Dallas)1 to (San Francisco)2

Overlapping Constituents I didn’t take my passport. I didn’t like this plane. If {this plane} was already stored, “like this plane” overlaps with it, and we cannot assign them different types because it would prevent us from inducing a CFG in a later stage.

Selection Learning In the selection learning phase, we try to get rid of the overlapping constituents by finding the best combination of constituents of each type. There are three ways to compute a constituent’s probability: –ABL: first-is-correct –ABL: leaf –ABL: branch

Selection Learning (cont’) Once the probabilities of the overlapping constituents have been computed, the probability of each combination is computed as the geometric mean of its constituents’ probabilities. A Viterbi-style optimization is used to do this efficiently.
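The effect of scoring with a geometric mean can be illustrated with a small Python sketch. The probabilities and candidate names below are invented for the example, and the candidates are compared exhaustively rather than with a Viterbi-style search. The point shown is purely arithmetic: since every probability is at most 1, a plain product shrinks with each added constituent, while the geometric mean normalizes for the number of constituents.

```python
import math

def geometric_mean(probabilities):
    """Geometric mean, computed in log space for numerical stability."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(log_sum / len(probabilities))

# Hypothetical, made-up constituent probabilities for two competing
# non-overlapping combinations of constituents.
candidates = {
    "one constituent": [0.5],
    "two constituents": [0.6, 0.6],
}

best_by_product = max(candidates, key=lambda name: math.prod(candidates[name]))
best_by_geometric_mean = max(
    candidates, key=lambda name: geometric_mean(candidates[name])
)
print(best_by_product, "|", best_by_geometric_mean)
```

With these numbers the product prefers the single constituent (0.5 against 0.36), whereas the geometric mean prefers the pair (0.6 against 0.5).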

Theoretical Comparison ABL is much greedier and thus learns faster and better on small corpora, but for efficiency reasons it cannot learn on large corpora. It stores all possible constituents and only then selects the best ones. EMILE was developed for large corpora (more than 100K sentences) and is much less greedy. It introduces a grammar rule only when enough information has been found to support it.

Conclusions Both frameworks work fairly well for unsupervised learning models. Their underlying ideas match rather well. It should be possible to develop a hybrid version that combines the best qualities of both algorithms.