Chapter 11. The Origin of Communicative System: Communicative Efficiency
Min Su Lee
The Computational Nature of Language Learning and Evolution


Contents
  11.1 Communicative Efficiency of Language
    1. Communicability in animal, human and machine communication
  11.2 Communicability for Linguistic Systems
    1. Basic notions
    2. Probability of events and a communicability function
  11.3 Reaching the Highest Communicability
    1. A special case of finite languages
    2. Generalizations
(C) 2009, SNU Biointelligence Lab

Contents (cont.)
  11.4 Implications for Learning
    1. Estimating P
    2. Estimating Q
    3. Sample Complexity Bounds
  11.5 Communicative Efficiency and Linguistic Structure
    1. Phonemic Contrasts and Lexical Structure
    2. Functional Load and Communicative Efficiency
    3. Perceptual Confusibility and Functional Load

Introduction
In this part, we turn our attention to
  - The genesis of human language from prelinguistic versions of it
  - How and why did the recursive communication system of human language arise in biological populations?
Communicative efficiency
  - Important in the evolution of competing linguistic groups where the different groups had different communicative ability
  - Differential fitness and natural selection
    - If communicative efficiency provides biological fitness to individuals in terms of increased ability to reproduce and survive, would populations converge to coherent linguistic states?
  - Coherence
    - Coherent population: linguistically homogeneous population

Introduction (cont.)
In this part, the book studies
  - The interplay between communicative efficiency, learning, fitness, and coherence
In this chapter, we
  - Develop the notion of communicative efficiency and fitness
  - Characterize language as a probabilistic association between form and meaning
  - Provide a natural definition of communicative efficiency between two linguistic agents possessing different languages
  - Report an empirical study on large linguistic corpora, which finds that the structure of the lexicon of several modern languages does not reflect optimality in terms of communicability

1. Communicative Efficiency of Language
Mutual intelligibility [communicative efficiency]
  - Quantify the rate of success in information transfer between two linguistic agents
  - Increasing the intelligibility F(L_1, L_2) between two languages, L_1 and L_2
  - Given a language L, what language L' maximizes the mutual intelligibility F(L, L') for two-way communication about the shared world?
  - What are some acquisition mechanisms/learning algorithms that can serve the task of improving intelligibility?
  - What are the consequences of individual language acquisition behavior for the population dynamics and the communicative efficiency of an interacting population of linguistic agents?

1. Communicative Efficiency of Language
Communicability
  - Language may be viewed as an association matrix A which links referents to signals
    - M referents and N signals
    - A is an N×M matrix
    - a_ij: relative strength of the association between signal i and meaning j
  - The matrix A characterizes the behavior of the linguistic agent in
    - Production mode: produce any of the signals corresponding to a particular meaning in proportion to the strength of the association
    - Comprehension mode: interpret a particular signal as any of the meanings in proportion to the association strength

1. Communicative Efficiency of Language
Communicability in animal communication
  - Finite lexical association matrix (animal signals and their specific meanings)
Communicability in human communication
  - Infinite lexical association matrices
    - Human grammars mediate a complex mapping between form and meaning
    - The sets of possible sentences and meanings are infinite (infinite expressibility of human grammars)
Communicability in machine communication
  - AI
    - Linguistic agents interact with each other in simulated worlds
    - Study whether coherent communication ultimately emerges
  - Natural language understanding systems
    - Develop a computer system that is able to communicate with a human
    - The underlying probability model is learned from data

2. Communicability for Linguistic Systems
Basic notions
  - S: the set of all possible linguistic forms (signals), {s_1, s_2, …}
  - M: the set of all possible semantic objects (meanings), {m_1, m_2, …}
  - Define a language to be a probability measure μ over S×M
    - Encoding matrix P (production mode)
      - Prob. of producing the signal s_i given that one wishes to convey the meaning m_j
      - P_ij = μ(s_i | m_j)
    - Decoding matrix Q (comprehension mode)
      - Prob. of interpreting the expression s_i to mean m_j by the same user
      - Q_ij = μ(m_j | s_i)
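
Both matrices can be read off a finite joint table by normalizing it in two different ways. The sketch below assumes the convention P_ij = μ(s_i | m_j) and Q_ij = μ(m_j | s_i), with signals as rows and meanings as columns. The table `mu` and the function names are made up for illustration, and every row and column is assumed to carry positive mass.

```python
def encoding_matrix(mu):
    """P[i][j] = mu(s_i | m_j): normalize each column of the joint table."""
    col_sums = [sum(col) for col in zip(*mu)]
    return [[x / col_sums[j] for j, x in enumerate(row)] for row in mu]


def decoding_matrix(mu):
    """Q[i][j] = mu(m_j | s_i): normalize each row of the joint table."""
    return [[x / sum(row) for x in row] for row in mu]


# Hypothetical joint measure over 2 signals (rows) and 3 meanings (columns).
mu = [[0.30, 0.20, 0.05],
      [0.05, 0.10, 0.30]]

P = encoding_matrix(mu)   # each column of P sums to one
Q = decoding_matrix(mu)   # each row of Q sums to one
```

The same joint measure therefore yields a column-stochastic encoder and a row-stochastic decoder, which is why P and Q generally differ.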

2. Communicability for Linguistic Systems
Probability of events and a communicability function
  - Given two communication systems (languages μ_1, μ_2) and a distribution σ over the meanings (events in the shared world)
  - The prob. that an event occurs whose meaning is successfully communicated
    - From μ_1 to μ_2: Σ_i Σ_j σ(m_j) μ_1(s_i | m_j) μ_2(m_j | s_i) = tr(P^(1) Λ (Q^(2))^T)
  - Define the communicability function of μ_1 and μ_2 (mutual intelligibility, communicative efficiency)
    - F(μ_1, μ_2) = (1/2) [ tr(P^(1) Λ (Q^(2))^T) + tr(P^(2) Λ (Q^(1))^T) ]
    - where Λ is a diagonal matrix s.t. Λ_ii = σ(m_i), tr(A) denotes the trace of matrix A, and P^(i), Q^(i) refer to the coding and decoding matrices associated with μ_i
    - Note that tr(P^(1) Λ (Q^(2))^T) is simply the prob. that an event occurs and is successfully communicated from a user of μ_1 to a user of μ_2
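
A small numerical sketch makes the trace expression concrete. It reads the communicability function as the average of the two one-way success probabilities, and it assumes P_ij = μ(s_i | m_j), Q_ij = μ(m_j | s_i) with signals as rows and meanings as columns. The three toy languages and all function names are hypothetical.

```python
def encoding_matrix(mu):
    """P[i][j] = mu(s_i | m_j): normalize each column of the joint table."""
    col_sums = [sum(col) for col in zip(*mu)]
    return [[x / col_sums[j] for j, x in enumerate(row)] for row in mu]


def decoding_matrix(mu):
    """Q[i][j] = mu(m_j | s_i): normalize each row of the joint table."""
    return [[x / sum(row) for x in row] for row in mu]


def success_prob(P_speaker, Q_hearer, sigma):
    """tr(P Lambda Q^T), i.e. the sum over i, j of sigma_j * P_ij * Q_ij."""
    return sum(sigma[j] * p * q
               for p_row, q_row in zip(P_speaker, Q_hearer)
               for j, (p, q) in enumerate(zip(p_row, q_row)))


def communicability(mu1, mu2, sigma):
    """Average of the success probabilities in the two directions."""
    P1, Q1 = encoding_matrix(mu1), decoding_matrix(mu1)
    P2, Q2 = encoding_matrix(mu2), decoding_matrix(mu2)
    return 0.5 * (success_prob(P1, Q2, sigma) + success_prob(P2, Q1, sigma))


# Hypothetical languages over 2 signals (rows) and 2 meanings (columns).
aligned = [[0.5, 0.0], [0.0, 0.5]]   # s_1 <-> m_1, s_2 <-> m_2
flipped = [[0.0, 0.5], [0.5, 0.0]]   # s_1 <-> m_2, s_2 <-> m_1
noisy = [[0.4, 0.1], [0.1, 0.4]]     # mostly aligned, some ambiguity
sigma = [0.5, 0.5]                   # uniform distribution over meanings

f_same = communicability(aligned, aligned, sigma)
f_opposite = communicability(aligned, flipped, sigma)
f_noisy = communicability(aligned, noisy, sigma)
```

Two users of the same unambiguous language always succeed, two users with opposite conventions never do, and the noisy language falls in between.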

3. Reaching the Highest Communicability
  - Given a language μ_0 with encoding and decoding matrices P_0 and Q_0
  - For any language μ with matrices P and Q, we have F(μ_0, μ) = (1/2) Σ_i Σ_j σ_j [ (P_0)_ij Q_ij + P_ij (Q_0)_ij ] (where σ_j = σ(m_j))
  - Define the best response as a language μ* s.t. F(μ_0, μ*) = sup_μ F(μ_0, μ)
    - The maximum possible mutual intelligibility between a user of μ_0 and a user of any allowable language
  - How to construct a family of languages μ_ε (ε > 0) s.t. F(μ_0, μ_ε) can be made arbitrarily close to sup_μ F(μ_0, μ)?

3. Reaching the Highest Communicability A special case of finite languages  Three simplifying assumptions  The languages are finite, and the matrices have the size N×M  The distribution σ is uniform, i.e. σ i = 1/M ∀ i  The measure μ 0 satisfies the property of unique maxima, i.e. for each i, there exist a unique p 0 (i) and a unique r 0 (i) s.t. –There exists strictly one element of each column of μ 0 (s|m) (row of μ 0 (m|s)) s.t. it is the biggest element in the column (row) (C) 2009, SNU Biointelligence Lab, 12

3. Reaching the Highest Communicability
A special case of finite languages (cont.)
  - Maximize communicative efficiency
    - Find a matrix Q* s.t. Q* = argmax_Q tr(P_0 Q^T), where we maximize over all matrices Q whose elements are non-negative and sum to one within each row
      - The best decoder Q*: Q*_ij = 1 if (P_0)_ij is the largest element of row i of P_0, and 0 otherwise
    - Find a matrix P* s.t. P* = argmax_P tr(P (Q_0)^T), where we maximize over all matrices P whose elements are non-negative and sum to one within each column
      - The best encoder P*: P*_ij = 1 if (Q_0)_ij is the largest element of column j of Q_0, and 0 otherwise
  - If a μ* existed s.t. μ*(s|m) = P* and μ*(m|s) = Q*, then μ* would be the best response
    - It turns out that, in general, μ* does not exist
    - However, there always exists a measure which approximates the performance of P* and Q* arbitrarily well
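
The best decoder and encoder can be computed by simple row and column argmax operations. The sketch below assumes a uniform σ, unique maxima, and the convention that rows index signals and columns index meanings. The joint table `mu0` and the function names are made up for illustration.

```python
def best_decoder(P0):
    """Q*: in each row i, put a one where P0 has its largest entry."""
    Q_star = []
    for row in P0:
        j_star = max(range(len(row)), key=lambda j: row[j])
        Q_star.append([1.0 if j == j_star else 0.0 for j in range(len(row))])
    return Q_star


def best_encoder(Q0):
    """P*: in each column j, put a one where Q0 has its largest entry."""
    n_signals, n_meanings = len(Q0), len(Q0[0])
    P_star = [[0.0] * n_meanings for _ in range(n_signals)]
    for j in range(n_meanings):
        i_star = max(range(n_signals), key=lambda i: Q0[i][j])
        P_star[i_star][j] = 1.0
    return P_star


# Hypothetical teacher language: 2 signals (rows), 3 meanings (columns).
mu0 = [[0.30, 0.20, 0.05],
       [0.05, 0.10, 0.30]]
col_sums = [sum(col) for col in zip(*mu0)]
P0 = [[x / col_sums[j] for j, x in enumerate(row)] for row in mu0]  # mu0(s|m)
Q0 = [[x / sum(row) for x in row] for row in mu0]                   # mu0(m|s)

Q_star = best_decoder(P0)   # one 1 per row
P_star = best_encoder(Q0)   # one 1 per column
```

In this toy case P* sends both m_1 and m_2 to s_1, while Q* decodes s_1 only as m_1. No single joint measure has exactly these two conditionals, which illustrates why an exact best response may fail to exist.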

3. Reaching the Highest Communicability
Theorem 11.1 (Komarova and Niyogi 2004)
  - For any finite language μ_0 satisfying the property of unique maxima, and a uniform probability distribution σ, we have [...]
  - In order to prove Theorem 11.1, we need to show that [...]
    - Recall the definitions of the best decoder and the best encoder
The auxiliary matrix and the absence of loops
  - Define an auxiliary matrix X
    - X contains nonzero entries ("ones") at the slots where either P* or Q* contains a nonzero entry
    - Draw lines connecting all the "ones" of the X matrix that belong to the same row, and all the "ones" of the X matrix that belong to the same column

3. Reaching the Highest Communicability
Lemma 11.1
  - Suppose that a finite measure μ_0 has the property of unique maxima. Then the graphs constructed as described above do not contain any closed loops.
Proof of Lemma 11.1
  - Assume that there exists a closed loop
  - Consider its "turning points"
    - Suppose there are 2K such vertices x_{α_i, β_j}, where the pair of integers (α_i, β_j) gives the coordinates of the vertex (1 ≤ i, j ≤ K)
    - Let x_{α_1, β_1} be connected with x_{α_1, β_2} by a horizontal line. Then x_{α_1, β_2} is connected with x_{α_2, β_2} by a vertical line, ..., and x_{α_K, β_1} is connected with x_{α_1, β_1} by a vertical line, thus closing the loop.

3. Reaching the Highest Communicability Proof of lemma 11.1 (cont.)  If a vertex corresponds to a “one” of the Q* matrix, then the corresponding slot of the P* matrix is zero, and vice versa.  Suppose that Q* α 1,β 1 =1, P* α 1,β 1 =0.  Q* α 1,β 2 =0 ( ∵ there can be only one nonzero element in the same row of the Q*)  P* α 1,β 2 =1 ( ∵ the corresponding vertex is preset in the X )  P* α 2,β 2 =0 ( ∵ we can only have one positive element in each column of P*)...  ( ∵ positive elements in the Q* correspond to the largest elements in the corresponding rows of the P 0 matrix   Show the system is incompatible  Let where  Rewrite (C) 2009, SNU Biointelligence Lab, 16

3. Reaching the Highest Communicability Proof of lemma 11.1 (cont.)  The system can be presented as a closed chain of inequalities for Q 0  This contradiction proves that there can be no close loops in the matrix X (C) 2009, SNU Biointelligence Lab, 17

3. Reaching the Highest Communicability
Constructing the matrix μ_ε
  - From Lemma 11.1, if we connect all the vertices of the matrix X by horizontal and vertical lines, the resulting graphs will contain no closed loops
  - For each of these graphs, perform the following procedure
    - Repeat this for all pairs of vertices
      - Take a pair of vertices
      - If they are connected by a horizontal (vertical) line, refer to the corresponding entries of the Q* matrix (P* matrix)
      - One of them will be one and the other zero
      - Draw an arrow on the graph from the element corresponding to zero to the element corresponding to one
    - Starting from some vertex, replace the corresponding element in X by ε
    - Following the arrows, keep replacing the elements of X by entries of the form ε^k, where the integer k increases or decreases from one vertex to the next depending on the direction of the arrow
  - The resulting matrix: A_ε
  - The measure μ_ε: [...]

3. Reaching the Highest Communicability
  - e.g. [...]
Proof of Theorem 11.1, part (b)
  - To find the entries of μ_ε(s|m), renormalize each column of the matrix μ_ε so that its elements sum to one
    - Each column will contain at most one segment of one of the graphs
    - By construction, the biggest element of this segment corresponds to the positive element of P*
    - In the limit ε → 0, the other elements become vanishingly small in comparison, so the column tends to the corresponding column of the P* matrix
  - Similarly, the rows of μ_ε(m|s) become, in the limit, the rows of the Q* matrix
  - ∴ The family of measures μ_ε satisfies the requirements of Theorem 11.1
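
The limiting argument can be checked numerically on a small case. The sketch below uses a made-up teacher language `mu0` (2 signals as rows, 3 meanings as columns, uniform σ) whose best encoder and decoder are P* = [[1,1,0],[0,0,1]] and Q* = [[1,0,0],[0,0,1]]. The matrix returned by `a_eps` was built by hand from the arrow procedure for this one case and is not a general implementation: entry (1,1) gets ε, entry (1,2) gets ε² because the arrow points from it to (1,1), and the isolated entry (2,3) gets ε.

```python
def col_normalize(A):
    """Conditional over signals given each meaning (columns sum to one)."""
    col_sums = [sum(col) for col in zip(*A)]
    return [[x / col_sums[j] for j, x in enumerate(row)] for row in A]


def row_normalize(A):
    """Conditional over meanings given each signal (rows sum to one)."""
    return [[x / sum(row) for x in row] for row in A]


def overlap(P, Q):
    """Sum over i, j of P_ij * Q_ij, i.e. tr(P Q^T)."""
    return sum(p * q for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row))


mu0 = [[0.30, 0.20, 0.05],
       [0.05, 0.10, 0.30]]
P0, Q0 = col_normalize(mu0), row_normalize(mu0)
P_star = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Q_star = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
M = 3  # number of meanings; sigma is uniform, so Lambda = I / M

# Performance of the (possibly unrealizable) pair P*, Q* against mu0.
target = (overlap(P0, Q_star) + overlap(P_star, Q0)) / (2 * M)


def a_eps(eps):
    """Hand-built A_eps for this example only."""
    return [[eps, eps ** 2, 0.0],
            [0.0, 0.0, eps]]


def f_eps(eps):
    """F(mu0, mu_eps); conditionals do not depend on the overall scale of A_eps."""
    A = a_eps(eps)
    return (overlap(P0, row_normalize(A)) + overlap(col_normalize(A), Q0)) / (2 * M)


gaps = [target - f_eps(eps) for eps in (0.1, 0.01, 0.001)]
```

The entries of `gaps` shrink roughly in proportion to ε, so the family approaches the performance of P* and Q* without any single member attaining it.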

3. Reaching the Highest Communicability
Generalizations
  - Three restrictions on the measures μ
    - Unique local maxima in the rows and columns of μ(s|m) and μ(m|s) respectively
      - If there are multiple maxima, it turns out that loops may exist. Lemma 11.1 can be modified using a neutral vertex of the graph
    - Uniform distribution of events in the world that need to be communicated
      - If events do not occur with uniform probability, redefine P* and Q* accordingly, e.g. [...]
    - Finite cardinality of S and M
      - A measure μ on a countably infinite space can be approximated arbitrarily closely by a measure with finite support; this reduces the infinite case to the finite case

4. Implications for Learning
Language learning
  - An agent is trying to learn a language in order to communicate with some other agent whose language is characterized by the measure μ
    - The best response μ* itself may not exist
    - An arbitrarily close approximation μ_ε (for any ε) does exist
    - The best the learner can do is estimate μ_ε
Two natural learning scenarios (how much information is available to the learner)
  - Full information
    - The learner is able to sample μ directly to get (sentence, meaning) pairs
  - Partial information
    - The learner only hears the sentence, while the intended meaning is latent
    - What the learner may reasonably have access to is whether its interpretation of the sentence was successful or not

4. Implications for Learning Estimating P  Q* is derivable from the P matrix of the teacher  Learning with full information  The learner has access to (s, m) pairs every time the teacher produces the sentence  Define the event A ij = Teacher produces s i to communicate m j  The prob. of event A ij :  If the teacher produces n (s, m) pairs at random in i.i.d. manner, then the ratio is an empirical estimate of the prob. of the event A ij  n  ∞, with prob. 1.  Bound the rate at which this convergence occurs –Applying Hoeffding’s inequality  (C) 2009, SNU Biointelligence Lab, 22

4. Implications for Learning Estimating P (cont.)  Learning with full information (cont.)  Assume N possible sentences and M possible meanings   Total NM different events whose probabilities need to be estimated (A ij, i=1,...., N, j=1,...,N)  Let event E ij be  By the union bound,   With the high probability (depending on n) all empirical estimates are close to respectively (C) 2009, SNU Biointelligence Lab, 23

4. Implications for Learning Estimating P (cont.)  Learning with partial information  Let the learner guess a meaning uniformly at random.  Define event A ij = Teacher produces s i ; Learner guesses m j ; Communication successful.  Prob. of event A ij :  The learner counts k ij : the number of times event A ij has occurred  Empirical estimates of the prob. of A ij :  Since M is fixed in advance and known, this allows the learner to guess for each i, j arbitrarily well   Uniform bound: (C) 2009, SNU Biointelligence Lab, 24

4. Implications for Learning Estimating Q  Estimating P*: derivable from the Q matrix of the teacher  Learning with full information  The learner picks a sentence uniformly at random (with prob. 1/N) and produces it for the teacher to hear  Define event A ij = Learner produces s i ; Teacher interprets as m j  Prob. of A ij :  After n trials, estimate of is   (C) 2009, SNU Biointelligence Lab, 25

4. Implications for Learning
Estimating Q (cont.)
  - Learning with partial information
    - The learner picks a (sentence, meaning) pair uniformly at random (with prob. 1/NM)
    - Define event A_ij = "learner produces (s_i, m_j); communication is successful"
    - Prob. of A_ij is Q_ij / (NM), where Q_ij = μ(m_j | s_i)
    - [...]

4. Implications for Learning Sample complexity bounds  Determine the number of learning events that need to occur so that with high prob., the learner will be able to develop a language with ε-good communicabilitly  Let the teacher’s measure be μ. & Assume that μ is s.t. the P and Q have unique row-wise and column-wise maxima respectively.  Margin (C) 2009, SNU Biointelligence Lab,

4. Implications for Learning Sample complexity bounds (cont.)  Learning with partial information  Proof of Theorem 11.2 –Let there be n/2 interactions where the teacher speaks and the learner listens and n/2 interactions of the other form. –Estimation of : & estimation of : –Setting ε = γ/4 – & –Using the fact that, and are both within γ/4 of the true values with prob. greater than (C) 2009, SNU Biointelligence Lab,

4. Implications for Learning Sample complexity bounds (cont.)  Learning with partial information (cont.), Proof of Theorem 11.2 (cont.) –The learner chooses Q* For each i the learner desires to obtain j i * The learner chooses Show Assume that this is not the case. Then, However, leads to a contradiction. Since for each i, the Q* is identified exactly –The learner chooses P*  Similarly, P* is also identified exactly –Ensure that n is large enough so that this occurs with high prob. is satisfied for  with the prob. greater than 1-δ, both P* and Q* are identified exactly (C) 2009, SNU Biointelligence Lab, 29

4. Implications for Learning
Remarks
  - The number of examples is seen to be a function of M, N, and γ
    - The margin γ, which depends on the teacher's language μ, determines how easy it is for the learner to estimate Q* and P*
    - It characterizes the learning difficulty of μ in this setting
  - Infinite matrices are not learnable
    - Infinite-dimensional spaces are known to be unlearnable
    - Further constraints will be required on the space of possible measures to which the teacher's language belongs
  - The constants in the bound on sample complexity may be tightened, although the order is essentially correct

4. Implications for Learning Sample complexity bounds (cont.)  Learning with full information (C) 2009, SNU Biointelligence Lab,

5. Communicative Efficiency and Linguistic Structure
An empirical study of the structure of lexical items suggests that a tight coupling between communicative efficiency and lexical structure may not always be present.
Phonemic contrasts and lexical structure
  - If all the phonemes in the sequence are heard correctly by the hearer, then the word has been successfully transmitted from speaker to hearer
    - Communicative efficiency would be high
  - If the hearer cannot distinguish between /p/ and /b/
    - The hearer would not be able to tell apart the words pat & bat, pit & bit, ...
    - Information would no longer be perfectly transmitted from speaker to hearer
    - How much information is lost on the whole?

5. Communicative Efficiency and Linguistic Structure
Phonemic contrasts and lexical structure (cont.)
  - p_1, ..., p_n: the probabilities with which the n words are used on average
  - W: lexicon
  - Entropy (information content) of the entire lexicon: H(W) = - Σ_{i=1..n} p_i log p_i
    - If all words were equally likely, H(W) = log n
    - H(W): average measure of the information transmitted from speaker to hearer by transmitting words of the lexicon
  - Reduced lexicon W({/p/, /b/}) = {c_1({/p/, /b/}), ..., c_k({/p/, /b/})}, where each class c_i groups the words that can no longer be told apart once /p/ and /b/ are merged
    - p(c_i) = Σ_{w ∈ c_i} p_w: the prob. with which the hearer will encounter a word that belongs to c_i({/p/, /b/})
    - H(W({/p/, /b/})) = - Σ_{i=1..k} p(c_i) log p(c_i): information content of the reduced lexicon
    - FL(/p/, /b/) = [H(W) - H(W({/p/, /b/}))] / H(W): normalized loss of information (functional load)
    - 0 ≤ FL ≤ 1; FL is the fraction of information lost (at the lexical level) by losing the ability to distinguish between /p/ and /b/
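
Functional load is straightforward to compute on a toy lexicon. The sketch below assumes the normalized entropy-loss definition FL = (H(W) - H(W_reduced)) / H(W), and it treats each letter of a word as one phoneme, which is a simplification. The five-word lexicon, its probabilities, and the function names are made up for illustration.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in bits; zero-probability items contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def functional_load(lexicon, merged):
    """Fraction of lexical information lost when the phonemes in `merged` collapse.

    lexicon: dict mapping each word to its usage probability.
    merged:  set of single-character phonemes the hearer cannot tell apart.
    """
    representative = min(merged)
    classes = defaultdict(float)            # reduced lexicon: class -> prob.
    for word, p in lexicon.items():
        key = "".join(representative if ch in merged else ch for ch in word)
        classes[key] += p                   # words that merge share a class
    h_full = entropy(lexicon.values())
    h_reduced = entropy(classes.values())
    return (h_full - h_reduced) / h_full


# Hypothetical equiprobable lexicon.
lexicon = {"pat": 0.2, "bat": 0.2, "pit": 0.2, "bit": 0.2, "cat": 0.2}

fl_pb = functional_load(lexicon, {"p", "b"})   # pat/bat and pit/bit collapse
fl_cx = functional_load(lexicon, {"c", "x"})   # no word pair collapses
```

Merging /p/ and /b/ collapses two word pairs and loses a sizeable share of the lexicon's information, while a contrast that distinguishes no word pair has zero functional load.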

5. Communicative Efficiency and Linguistic Structure
Functional load and communicative efficiency
  - Listener's guessing strategy (randomized strategy)
    - Prob. of transmission: [...], here [...]
    - Communicative efficiency: [...]
    - Loss in communicative efficiency: [...]
  - Information-theoretic measure of functional load
    - H(W) = log(n), H(W({/p/, /b/})) = - [...]
  - Range of FL: [...]

5. Communicative Efficiency and Linguistic Structure Perceptual confusiblity and functional load  If communicative efficiency played a role in the evolution of linguistic structure, we should observe a correlation between the perceptual difficulty of making a phonetic contrast and the functional load of that contrast  Empirical experiment  Data: Dutch, English, and Chinese  Perceptual confusibility between phonemes: psychoacoustic data (acoustic difference)  Phoneme-confusion matrix; experimental psycholinguistic data  Lexical data: corpus-based linguistic data (colloquial pronunciation patterns, frequency of usage, semantic and syntactic information)  Result  There is no significant correlation between functional load and confusibility (C) 2009, SNU Biointelligence Lab, 35

5. Communicative Efficiency and Linguistic Structure Perceptual confusiblity and functional load  Functional load against perceptual confusibility for phonetic distinctions in English  Other contextual cues help in identifying the word uniquely  Several possible interpretations  The structure of the lexicon does not display any sign of having been optimized to suit the perceptual limitations of humans  Communicative efficiency might play little role in the structure of natural languages  More proper quantitative formulation of functional load or communicative efficiency may be needed.  Internal optimization of linguistic interfaces rather than external optimization of communicative efficiency derives change and evolution (C) 2009, SNU Biointelligence Lab, 36