Kewei Tu and Vasant Honavar


On the Utility of Curricula in Unsupervised Learning of Probabilistic Grammars
Kewei Tu and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University

Outline
- Unsupervised Grammar Learning
- Grammar Learning with a Curriculum
- The Incremental Construction Hypothesis
- Theoretical Analysis
- Empirical Support

Probabilistic Grammars
A probabilistic grammar is a set of probabilistic production rules that define a joint probability of a grammatical structure and its sentence (e.g., P = 2.2 × 10⁻⁶ for the example parse from [Jurafsky & Martin, 2006]).
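To make the "product of rule probabilities" idea concrete, here is a minimal sketch using a toy grammar and parse; the rules and numbers are invented for illustration and are not the Jurafsky & Martin example referenced above.

```python
# Minimal sketch: under a probabilistic grammar, the joint probability of a
# parse tree and its sentence is the product of the probabilities of the
# production rules used in the parse. Toy grammar; numbers are illustrative.

toy_rule_probs = {
    ("S", ("NP", "VP")): 0.9,
    ("NP", ("Det", "N")): 0.5,
    ("VP", ("Vi",)): 0.4,
    ("Det", ("a",)): 0.6,
    ("N", ("triangle",)): 0.2,
    ("Vi", ("rolls",)): 0.3,
}

def parse_probability(rules_used, rule_probs):
    """Joint probability of a parse and its sentence: product of rule probabilities."""
    p = 1.0
    for rule in rules_used:
        p *= rule_probs[rule]
    return p

# Rules used in a toy parse of "a triangle rolls"
parse = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("Det", ("a",)),
    ("N", ("triangle",)),
    ("VP", ("Vi",)),
    ("Vi", ("rolls",)),
]
print(parse_probability(parse, toy_rule_probs))  # ≈ 0.0065
```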

Probabilistic Grammars
Probabilistic grammars are widely used in:
- Natural language parsing
- Bioinformatics, e.g., RNA structure modeling
- Pattern recognition
In many applications no grammar is available ready for use, and specifying a grammar by hand is hard; machine learning offers a practical alternative.

Learning a grammar from a corpus
- Supervised methods rely on a training corpus of sentences annotated with grammatical structures (parses).
- Unsupervised methods do not require annotated data: a probabilistic grammar is induced directly from the training corpus.
Example training corpus: "A square is above the triangle. A triangle rolls. The square rolls. A triangle is above the square. A circle touches a square. …"
Example induced grammar: S → NP VP; NP → Det N; VP → Vt NP (0.3) | Vi PP (0.2) | rolls (0.2) | bounces (0.1); …

Current Approaches
Current approaches process the entire corpus to learn the grammar, including complex sentences such as this Wall Street Journal excerpt: "No, it wasn't Black Monday. But while the New York Stock Exchange didn't fall apart Friday as the Dow Jones Industrial Average plunged 190.58 points -- most of it in the final hour -- it barely managed to stay this side of chaos. Some “circuit breakers” installed after the October 1987 crash failed their first test, traders say, unable to cool the selling panic…"

Grammar Learning with a Curriculum
Start with the simplest sentences ("Good. Come here. …"), progress to increasingly more complex sentences ("The rabbit is behind the tree. Alice is sitting on the riverbank. …"), and eventually to full text (Alice: "I wonder if I've been changed in the night? Let me think. Was I the same when I got up this morning? I almost think I can remember feeling a little different…").

Curriculum Learning [Bengio et al., 2009]
A curriculum is a sequence of weighting schemes ⟨W_1, …, W_N⟩ over the training data:
- W_1 assigns more weight to “easier” training samples.
- Each subsequent weighting scheme assigns more weight to “harder” samples.
- W_N assigns uniform weight to every sample.
Learning is iterative: in each iteration, the learner is initialized with the model learned during the previous iteration and trained on the data weighted by the current weighting scheme.
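The iterative procedure described on this slide can be sketched as follows; this is a hypothetical outline rather than the authors' implementation, and `run_em`, `weighting_schemes`, and `init_grammar` are placeholder names for the base learner, the schemes W_1…W_N, and the initial model.

```python
# Sketch of curriculum learning (after Bengio et al., 2009), assuming a base
# learner `run_em(grammar, weighted_data)` that refines a grammar on weighted
# training data. All names here are placeholders.

def curriculum_learn(corpus, weighting_schemes, init_grammar, run_em):
    """weighting_schemes: sequence W_1..W_N, each mapping a sentence to a weight.
    W_1 favors 'easy' sentences; W_N weights all sentences uniformly."""
    grammar = init_grammar
    for weight_fn in weighting_schemes:
        weighted_data = [(sent, weight_fn(sent)) for sent in corpus]
        # Each stage is initialized with the grammar learned at the previous stage.
        grammar = run_em(grammar, weighted_data)
    return grammar
```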

Experiments
Task: learning a probabilistic dependency grammar from the Wall Street Journal corpus of the Penn Treebank.
- Base learning algorithm: expectation-maximization (EM)
- Sentence complexity measure: sentence length, or sentence likelihood given the learned grammar
- Weight assignment: 0 or 1, or a continuous function
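For the length-based complexity measure, the two weight-assignment schemes (0/1 versus a continuous function) could look like the sketch below; the logistic form of the continuous weight is an assumption for illustration, not necessarily the function used in the experiments.

```python
import math

def binary_weight(sentence, length_threshold):
    """0/1 scheme: include a sentence at this stage only if it is short enough."""
    return 1.0 if len(sentence.split()) <= length_threshold else 0.0

def continuous_weight(sentence, length_threshold, sharpness=1.0):
    """Continuous scheme: smoothly down-weight sentences longer than the
    threshold. The logistic form here is illustrative only."""
    excess = len(sentence.split()) - length_threshold
    return 1.0 / (1.0 + math.exp(sharpness * excess))
```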

Experimental Results
All four curricula (two complexity measures × two weight-assignment schemes) help learning.

Questions
- Under what conditions does a curriculum help in unsupervised learning of probabilistic grammars?
- How can we design good curricula?
- How can we design algorithms that can take advantage of the curricula?

The Incremental Construction Hypothesis
An ideal curriculum gradually emphasizes data samples that help the learner to successively discover new substructures (i.e., grammar rules) of the target grammar, which facilitates the learning.
We say a curriculum ⟨W_1, …, W_N⟩ satisfies incremental construction if:
- For any stage i, the weighted training data correspond to a sentence distribution defined by a probabilistic grammar G_i.
- For any i < j, G_i is a sub-grammar of G_j.
(See Section 3 of the paper for the more precise definitions; the intermediate grammars need not be strict sub-grammars.)
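A rough, hypothetical way to test the (strict) sub-grammar condition is to represent each intermediate grammar as a set of rules and check containment between successive stages; this ignores rule probabilities and the relaxation mentioned above.

```python
def is_subgrammar(g_i, g_j):
    """Rough check that grammar g_i is a sub-grammar of g_j.
    Grammars are represented as sets of production rules; this ignores rule
    probabilities and the relaxed notion mentioned on the slide."""
    return set(g_i) <= set(g_j)

def satisfies_incremental_construction(stage_grammars):
    """stage_grammars: list [G_1, ..., G_N] of rule sets, one per curriculum stage."""
    return all(
        is_subgrammar(stage_grammars[i], stage_grammars[i + 1])
        for i in range(len(stage_grammars) - 1)
    )
```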

Theoretical Analysis
Theorem: If a curriculum satisfies incremental construction, then for any two stages i and j that satisfy the condition given in the paper, a bound holds relating two quantities: the distance between the grammar rule probabilities of the corresponding intermediate grammars, and the total variation distance between the distributions of grammatical structures defined by the two grammars. (See the paper for the precise statement of the bound.)
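The two kinds of distance mentioned in the theorem can be illustrated with the toy functions below: an L1-style distance between rule-probability vectors, and the total variation distance between two distributions over an enumerable set of grammatical structures. These are illustrative stand-ins, not the exact quantities defined in the paper.

```python
def rule_probability_distance(probs_a, probs_b):
    """L1-style distance between the rule probabilities of two grammars
    defined over the same rule set (illustrative choice of norm)."""
    rules = set(probs_a) | set(probs_b)
    return sum(abs(probs_a.get(r, 0.0) - probs_b.get(r, 0.0)) for r in rules)

def total_variation_distance(dist_a, dist_b):
    """Total variation distance between two distributions over grammatical
    structures, assuming the structures can be enumerated (toy setting)."""
    structures = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(s, 0.0) - dist_b.get(s, 0.0)) for s in structures)
```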

Intermediate grammars (diagram): the sequence of intermediate grammars from G0 to Gn learned with a curriculum, contrasted with learning without a curriculum.

Guidelines for Curriculum Design
A good curriculum should:
- (approximately) satisfy incremental construction;
- effectively break down the target grammar into as many chunks as possible;
- at each stage, introduce the new rule(s) that result in the largest number of new sentences:
  - if r1 is required for r2 to be used, then r1 shall be introduced earlier than r2;
  - among rules with the same LFS, rules with larger probabilities shall be introduced first.
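The last two guidelines suggest an ordering over grammar rules. The sketch below orders rules topologically with respect to a hypothetical dependency relation `requires` (r1 before r2 whenever r2 requires r1) and breaks ties by rule probability; note that tie-breaking by probability alone is a simplification of the LFS-based tie-break stated on the slide.

```python
import heapq

def order_rules_for_curriculum(rule_probs, requires):
    """Order rules so that if r1 is required for r2, r1 comes first; among rules
    that become available at the same time, higher-probability rules come first.
    Assumes every rule mentioned in `requires` also appears in `rule_probs`.
    Illustrative sketch, not the paper's construction."""
    indegree = {r: len(requires.get(r, set())) for r in rule_probs}
    dependents = {r: set() for r in rule_probs}
    for r, deps in requires.items():
        for d in deps:
            dependents[d].add(r)
    # Max-heap on probability (negate for heapq's min-heap behavior).
    ready = [(-rule_probs[r], r) for r in rule_probs if indegree[r] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, r = heapq.heappop(ready)
        order.append(r)
        for nxt in dependents[r]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (-rule_probs[nxt], nxt))
    return order
```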

Guideline for Algorithm Design
Observation: the learning target at each stage of a curriculum is a partial grammar.
Guideline: avoid over-fitting to this partial grammar, which would hinder the acquisition of new grammar rules in later stages.
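One simple way to follow this guideline, offered here only as a hedged illustration and not as the paper's mechanism, is to interpolate the rule probabilities learned at each stage with a prior so that rules unused at the current stage are not driven to zero and can still be acquired later.

```python
def smooth_rule_probs(rule_probs, prior_probs, interpolation=0.1):
    """Interpolate stage-learned rule probabilities with a prior so that
    probabilities of rules unseen at this stage do not collapse to zero.
    If both inputs are normalized per left-hand-side nonterminal, the result
    stays normalized. The interpolation weight is an assumed hyperparameter."""
    return {
        rule: (1.0 - interpolation) * rule_probs.get(rule, 0.0)
              + interpolation * prior_probs[rule]
        for rule in prior_probs
    }
```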

Experiments on Synthetic Data
Data generated from the Treebank grammar of WSJ30. Curricula constructed based on the target grammar:
- Ideal: satisfies all the guidelines
- Sub-Ideal: does not satisfy the 3rd guideline; new grammar rules are chosen randomly at each stage
- Random: does not satisfy any guideline; new sentences are chosen randomly at each stage
- Ideal-10, Sub-Ideal-10, Random-10: introduce at least 10 new sentences at each stage, hence contain fewer stages
- Length-based: introduces new sentences based on their lengths
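A length-based curriculum like the last item could be constructed as sketched below, sorting sentences by length and growing the training set by at least a fixed number of sentences per stage (the batch size of 10 mirrors the "-10" variants above); this is a hypothetical reconstruction, not the exact experimental setup.

```python
def length_based_curriculum(corpus, min_new_per_stage=10):
    """Build curriculum stages as cumulative subsets of the corpus ordered by
    sentence length; each stage adds at least `min_new_per_stage` new sentences
    (a 0/1 weighting). Illustrative sketch only."""
    by_length = sorted(corpus, key=lambda s: len(s.split()))
    stages, used = [], 0
    while used < len(by_length):
        used = min(len(by_length), used + min_new_per_stage)
        stages.append(by_length[:used])
    return stages
```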

Experiments on Synthetic Data (results figure)

Length-based Curriculum
The length-based curriculum is very similar to the ideal curricula in this case (as measured by rank correlation).

Analysis on Real Data
Ideal curricula cannot be constructed in unsupervised learning from real data. However, we find evidence that the length-based curriculum can be seen as a proxy for an ideal curriculum on real data.

Evidence from WSJ30
The introduction of grammar rules is spread throughout the entire curriculum, and more frequently used rules are introduced earlier.

Evidence from WSJ30 Grammar rules introduced in earlier stages are always used in sentences introduced in later stages

Evidence from WSJ30
In the sequence of intermediate grammars, most rule probabilities first increase and then decrease, which satisfies a relaxed version of the definition of ideal curricula that satisfy incremental construction.

Conclusion
We have introduced the incremental construction hypothesis:
- an explanation of the benefits of curricula in unsupervised learning of probabilistic grammars;
- a source of guidelines for designing curricula as well as unsupervised grammar learning algorithms.
The hypothesis is supported by both theoretical analysis and experimental results (on both synthetic and real data).

Thank You! Q&A

Backup

l_r: the length of the shortest sentence in the set of sentences that use rule r
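Assuming each sentence of the treebank is available together with the set of rules used in its parse, l_r can be computed as in this small sketch (all names are placeholders):

```python
def shortest_sentence_length_per_rule(parsed_corpus):
    """parsed_corpus: iterable of (sentence, rules_used) pairs.
    Returns l_r = length of the shortest sentence whose parse uses rule r."""
    l_r = {}
    for sentence, rules_used in parsed_corpus:
        length = len(sentence.split())
        for rule in rules_used:
            if rule not in l_r or length < l_r[rule]:
                l_r[rule] = length
    return l_r
```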

Mean and standard deviation of the lengths of the sentences that use each rule

The change in the probabilities of VBD-headed rules over the stages of the length-based curriculum, in the treebank grammar.