Statistical NLP Spring 2010 Lecture 25: Diachronics Dan Klein – UC Berkeley

Evolution: Main Phenomena  Mutations of sequences over time  Speciation over time

Tree of Languages  Challenge: identify the phylogeny  Much work in biology, e.g. by Warnow, Felsenstein, Steel…  Also in linguistics, e.g. Warnow et al., Gray and Atkinson…

Statistical Inference Tasks  Inputs: modern text, a phylogeny (FR, IT, PT, ES)  Outputs: ancestral word forms (focus → fuego / feu), cognate groups / translations (fuego / feu, huevo / oeuf), grammatical inference (les faits sont très clairs)

Outline  Ancestral Word Forms  Cognate Groups / Translations  Grammatical Inference

Language Evolution: Sound Change  Latin camera /kamera/ → French chambre /ʃambʁ/  Deletion: /e/  Change: /k/ → /tʃ/ → /ʃ/  Insertion: /b/  Eng. camera comes straight from Latin (“camera obscura”); Eng. chamber comes from Old French, borrowed before the initial /t/ of /tʃ/ dropped

Diachronic Evidence  “tonitru non tonotru” (Appendix Probi [ca. 300])  “tonight not tonite” (Yahoo! Answers [2009])

Synchronic (Comparative) Evidence

Simple Model: Single Characters  [Tree figure: single characters (C, G) at the leaves, latent characters at internal nodes]
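To make the single-character model concrete, here is a minimal runnable Python sketch of one Gibbs step, resampling an internal node's character given its parent and children. The DNA alphabet, function names, and the 0.1 mutation probability are illustrative choices, not from the lecture.

import random

ALPHABET = "ACGT"

def branch_prob(parent, child, p_mut=0.1):
    # Stay the same with probability 1 - p_mut; otherwise mutate
    # uniformly to one of the other characters.
    return 1 - p_mut if parent == child else p_mut / (len(ALPHABET) - 1)

def resample_internal(parent, children, p_mut=0.1):
    # Posterior over this node's character is proportional to
    # P(node | parent) * product over children of P(child | node).
    weights = []
    for x in ALPHABET:
        w = branch_prob(parent, x, p_mut)
        for c in children:
            w *= branch_prob(x, c, p_mut)
        weights.append(w)
    total = sum(weights)
    return random.choices(ALPHABET, weights=[w / total for w in weights])[0]

# One Gibbs step on a node whose parent is 'C' and whose
# observed children are 'C' and 'G'.
print(resample_internal("C", ["C", "G"]))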

A Model for Word Forms  [Tree: LA focus /fokus/ at the root; internal IB node /fogo/; leaves ES fuego /fweɣo/, PT fogo /fogo/, IT fuoco /fwɔko/] [BGK 07]

Contextual Changes  [Alignment of /fokus/ and /fwɔko/; each edit is conditioned on its surrounding context]

Changes are Systematic  /fokus/ → /fweɣo/, /fogo/, /fwɔko/  /kentrum/ → /sentro/, /tʃɛntro/

Experimental Setup  Data sets  Small: Romance  French, Italian, Portuguese, Spanish  2344 words  Complete cognate sets  Target: (Vulgar) Latin  Large: Oceanic  661 languages  140K words  Incomplete cognate sets  Target: Proto-Oceanic [Blust, 1993]

Data: Romance

Learning: Objective  Maximize the likelihood of the observed modern forms (/fweɣo/, /fogo/, /fwɔko/), marginalizing out the unobserved ancestral forms (/fokus/)
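The objective on this slide appeared as a figure; a plausible reconstruction in LaTeX (my notation, not the slide's: theta are the sound-change parameters, T the fixed phylogeny, w_c the observed modern forms in cognate set c, and a the latent ancestral forms):

\ell(\theta) = \sum_{c \in \text{cognate sets}} \log \sum_{a} P(\mathbf{w}_c, a \mid T, \theta)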

Learning: EM  E-Step  Find (expected) sound change counts given the current parameters  Hard: the latent variables are string-valued  M-Step  Find parameters which fit the (expected) sound change counts  Easy: gradient ascent on θ
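Scaled down to the single-character model from a few slides back, here is a runnable toy version of this EM loop (my simplification, not the lecture's implementation; the real model is over strings, which is what makes the E-step hard): two observed leaves per set, one hidden ancestor, and a single shared mutation probability p.

ALPHABET = "ACGT"

def branch_prob(a, b, p):
    # Same substitution model as in the single-character sketch above.
    return 1 - p if a == b else p / (len(ALPHABET) - 1)

def em(leaf_pairs, p=0.3, iterations=20):
    for _ in range(iterations):
        exp_mut, total = 0.0, 0
        for x, y in leaf_pairs:
            # E-step: posterior over the hidden ancestor a, and from it
            # the expected number of mutated branches for this pair.
            post = {a: branch_prob(a, x, p) * branch_prob(a, y, p)
                    for a in ALPHABET}
            z = sum(post.values())
            exp_mut += sum(w / z * ((a != x) + (a != y))
                           for a, w in post.items())
            total += 2
        # M-step: re-estimate p from the expected mutation counts
        # (closed form in this toy; gradient ascent in the real model).
        p = exp_mut / total
    return p

print(em([("A", "A"), ("A", "G"), ("C", "C"), ("T", "T")]))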

Computing Expectations  Standard approach, e.g. [Holmes 2001]: Gibbs-sample each internal sequence in turn [Holmes 01, BGK 07]  [Figure: tree for ‘grass’]

A Gibbs Sampler  [Animation: one internal sequence of the ‘grass’ tree is resampled at a time, holding the others fixed]

Getting Stuck  How to jump to a state where the liquids /r/ and /l/ have a common ancestor?

Solution: Vertical Slices  Single-sequence resampling vs. ancestry resampling [BGK 08]

Details: Defining “Slices”  The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences  Correct construction of section(G) is non-trivial, but very efficient

Results: Alignment Efficiency  Is ancestry resampling faster than basic Gibbs? Hypothesis: larger gains for deeper trees  Setup: fixed wall time; synthetic data, same parameters  [Plot: sum-of-pairs score vs. depth of the phylogenetic tree]

Results: Romance

Learned Rules / Mutations

Comparison to Other Methods  Evaluation metric: edit distance from a reconstruction made by a linguist (lower is better)  Comparison to the system from [Oakes, 2000], which uses exact inference and deterministic rules  Reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]  [Chart: Us vs. Oakes]
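The evaluation metric is ordinary Levenshtein edit distance between the system's reconstruction and the linguist's; for reference, a minimal implementation (the standard dynamic program, not the lecture's code):

def edit_distance(a, b):
    # Levenshtein distance: prev[j] holds the distance between the
    # prefix of a processed so far and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute
        prev = cur
    return prev[-1]

print(edit_distance("fokus", "fwɔko"))  # 4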

Data: Oceanic  [Figure: phylogeny rooted at Proto-Oceanic]

Data: Oceanic

Results: Large Phylogenies  Centroid: a novel heuristic based on an approximation to the minimum Bayes risk  Reconstruction of Proto-Oceanic [Blust, 1993]  Both algorithms use 64 modern languages  [Chart: Us vs. Centroid]
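The minimum-Bayes-risk idea can be sketched as follows (my toy version; the lecture's centroid method is a heuristic approximation to this): among sampled ancestral forms, pick the one with the smallest total edit distance to the others.

def mbr_reconstruction(samples):
    def dist(a, b):
        # Same Levenshtein dynamic program as in the evaluation sketch.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]
    # Empirical Bayes risk of s = total distance to the other samples.
    return min(samples, key=lambda s: sum(dist(s, t) for t in samples))

print(mbr_reconstruction(["fokus", "focus", "fokos", "vokus"]))  # fokus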

Results: More Languages Help  [Plot: mean edit distance from the [Blust, 1993] reconstructions vs. number of modern languages used]

Results: Large Scale

Visualization: Learned Universals  *The model did not have features encoding natural classes

Regularity and Functional Load  In a language, some pairs of sounds are more contrastive than others (higher functional load)  Example: English “p”/“d” versus “t”/“th”  “p”/“d”: pot/dot, pin/din, press/dress, pew/dew, …  “t”/“th”: tin/thin
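A crude but runnable proxy for functional load (my sketch; [King, 67] actually uses an entropy-based definition): count how many word pairs would become homophones if two sounds merged. Treating “th” as a single symbol below is an orthographic shortcut.

def functional_load(lexicon, s1, s2):
    # Count minimal pairs distinguished only by s1 vs. s2: merging the
    # two sounds would make these words homophones.
    merged = {}
    for word in lexicon:
        key = word.replace(s2, s1)
        merged.setdefault(key, set()).add(word)
    return sum(len(ws) - 1 for ws in merged.values() if len(ws) > 1)

lexicon = ["pot", "dot", "pin", "din", "press", "dress", "tin", "thin"]
print(functional_load(lexicon, "p", "d"))   # 3 pairs collapse
print(functional_load(lexicon, "t", "th"))  # 1 pair collapses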

Functional Load: Timeline  Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]  Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]  Caveat: only four languages were used in King’s study [Hockett 67; Surendran et al., 06]  Our work: we reexamined the question with two orders of magnitude more data [BGK, under review]

Regularity and Functional Load  Data: only 4 languages from the Austronesian data  [Scatter plot: merger posterior probability vs. functional load as computed by [King, 67]; each dot is a sound change identified by the system]

Regularity and Functional Load  Data: all 637 languages from the Austronesian data  [Scatter plot: merger posterior probability vs. functional load as computed by [King, 67]]

Outline  Ancestral Word Forms  Cognate Groups / Translations  Grammatical Inference

Cognate Groups  ‘fire’: /fweɣo/, /fogo/, /fwɔko/  Other groups: /berβo/, /vɛrbo/ and /tʃɛntro/, /sentro/, /sɛntro/

Model: Cognate Survival  Latin omnis survives in Italian (ogni) but dies out elsewhere  [Trees over LA, IB, IT, ES, PT marking where the cognate survives]

Results: Grouping Accuracy  [Chart: fraction of words correctly grouped, by method] [Hall and Klein, in submission]

Semantics: Matching Meanings  day (EN) occurs with: “night”, “sun”, “week”  Tag (DE) occurs with: “nacht”, “sonne”, “woche”  tag (EN) occurs with: “name”, “label”, “along”
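Context matching can be sketched as cosine similarity between co-occurrence vectors, with a small seed dictionary translating the German context words into the English feature space (all counts below are made up for illustration):

import math

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

seed = {"nacht": "night", "sonne": "sun", "woche": "week"}

day = {"night": 12, "sun": 8, "week": 15}
# Map Tag (DE)'s context counts through the seed dictionary.
tag_de = {seed.get(w, w): c for w, c in
          {"nacht": 10, "sonne": 7, "woche": 14}.items()}
tag_en = {"name": 9, "label": 11, "along": 4}

print(cosine(day, tag_de))  # high: same contexts via the seed
print(cosine(day, tag_en))  # zero: disjoint contexts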

Outline  Ancestral Word Forms  Cognate Groups / Translations  Grammatical Inference

Grammar Induction  Task: given sentences, infer the grammar (and parse tree structures)  Examples: “congress narrowly passed the amended bill”; “les faits sont très clairs” (French: “the facts are very clear”)

Shared Prior les faits sont très clairs congress narrowly passed the amended bill

Results: Phylogenetic Prior  [Chart: results for Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English under groupings WG, NG, RM, G, IE, GL]  Avg rel gain: 29%

Conclusion  Phylogeny-structured models can:  Accurately reconstruct ancestral words  Give evidence in open linguistic debates  Detect translations from form and context  Improve language learning algorithms  Lots of questions still open:  Can we get better phylogenies using these high-res models?  What do these models have to say about the very earliest languages? Proto-World?

Thank you! nlp.cs.berkeley.edu

Machine Translation Approach  Parallel data: source text “nous acceptons votre opinion.” aligned with target text “we accept your view.”

Translations from Monotexts  Translation without parallel text?  Still need (lots of) sentences in each language [Fung 95, Koehn and Knight 02, Haghighi and Klein 08]

Task: Lexicon Matching  Find a matching m between source words s (state, world, name, nation) and target words t (estado, política, mundo, nombre) [Haghighi and Klein 08]

Data Representation  What are we generating? For the source word state: orthographic features (#st, tat, te#, each with weight 1.0) and context features (world, politics, society)

Data Representation  Likewise for the target word estado: orthographic features (#es, sta, do#) and context features (mundo, politica, sociedad)
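A minimal sketch of extracting these two feature views: boundary-marked character trigrams plus a bag of nearby words. The window size and helper names are my choices, not the paper's.

from collections import Counter

def orth_features(word):
    # Boundary-marked character trigrams, e.g. 'state' -> #st, sta, ..., te#
    padded = "#" + word + "#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def context_features(word, corpus, window=2):
    # Bag of words occurring within `window` positions of `word`.
    feats = Counter()
    for sent in corpus:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                feats.update(t for t in toks[lo:hi] if t != word)
    return feats

corpus = ["the state of the world", "state politics and society"]
print(orth_features("state"))
print(context_features("state", corpus))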

Generative Model (CCA)  state and estado are generated from a shared latent point in a canonical space, projected into the source and target feature spaces

Generative Model (Matching)  A matching m pairs source words s (state, world, name, nation) with target words t (estado, nombre, politica, mundo)

Inference: Hard EM  E-Step: find the best matching m  M-Step: solve a CCA problem
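A compact sketch of this hard-EM loop from off-the-shelf pieces (scikit-learn's CCA for the M-step, scipy's Hungarian algorithm for the E-step matching; the random feature matrices and dimensions are placeholders, not the paper's features):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cross_decomposition import CCA

def hard_em(S, T, n_iter=5, dim=2):
    # S: (n_src, d_src) source feature matrix; T: (n_tgt, d_tgt) target.
    # Start from the identity matching for simplicity.
    match = np.arange(min(len(S), len(T)))
    for _ in range(n_iter):
        # M-step: fit CCA on the currently matched pairs.
        cca = CCA(n_components=dim)
        cca.fit(S[:len(match)], T[match])
        U, V = cca.transform(S, T)
        # E-step: best matching under squared distance in canonical space.
        cost = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
        rows, match = linear_sum_assignment(cost)
    return match

rng = np.random.default_rng(0)  # fixed seed for reproducibility
S = rng.normal(size=(10, 6))
T = S @ rng.normal(size=(6, 5)) + 0.1 * rng.normal(size=(10, 5))
print(hard_em(S, T))  # ideally recovers the identity permutation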

Experimental Setup  Data: 2K most frequent nouns, texts from Wikipedia  Seed: 100 translation pairs  Evaluation: precision and recall against a lexicon obtained from Wiktionary  Report p0.33, the precision at recall 0.33

Lexicon Quality (EN-ES)  [Precision–recall curve]

Seed Lexicon Source  Automatic seed: edit distance seed  [Chart: precision by seed source; 4k EN-ES Wikipedia articles]

Analysis

Interesting Matches / Interesting Mistakes

Language Variation

Analysis  Orthographic features vs. context features

Language Variation

Markov Mutation Model  [Figure: mutation of /fokus/ toward /fwɔko/ as a Markov process]

Local Mutation along the Tree  [Figure: one contextual edit along a branch, cf. the /fokus/ → /fwɔko/ alignment]

Model: Many Words /fokus/ /fweɣo/ /fogo/ /fwɔko/

Results

Results: More Languages