Chap2. Language Acquisition: The Problem of Inductive Inference (2.1 ~ 2.2) Min Su Lee The Computational Nature of Language Learning and Evolution
Contents 2.1 A framework for learning Remarks 2.2 The inductive inference approach Discussion Additional Results (C) 2009, SNU Biointelligence Lab, 2
Introduction. The problem of language acquisition: the computational difficulty of a problem that children solve so routinely. The child is exposed to a finite number of sentences as a result of interaction with its linguistic environment. The child can generalize from them and thus infer novel sentences it has not encountered before. Consider a language to be a set of sentences. Σ: a finite alphabet; Σ*: the universe of all possible finite sentences; L_t ⊂ Σ*: the target language. The child learner ultimately has access to a sequence of sentences s_1, s_2, s_3, ..., s_n, ... (where s_i ∈ L_t is the ith example available to the learner). The learner makes a guess about the target language L_t after each new sentence becomes available. There are an infinite number of possible languages that contain s_1, ..., s_k, so learning in the complete absence of prior information is impossible.
A framework for learning. Target language L_t ∈ L: a target language drawn from a class of possible target languages (L); this is the language the learner must identify on the basis of examples. Example sentences s_i ∈ L_t: drawn from the target language and presented to the learner; s_i is the ith such example sentence. Hypothesis language h ∈ H: drawn from a class of possible hypothesis languages that child learners construct on the basis of exposure to example sentences in the environment. Learning algorithm A: an effective procedure by which languages from H are chosen on the basis of example sentences received by the learner.
A framework for learning - Remarks 1. Languages and grammars. A recursively enumerable (r.e.) language has a potentially infinite number of sentences and a finite representation in terms of a Turing machine or a phrase structure grammar. Languages, machines, and grammars are formally equivalent ways of specifying the same set. G: the class of possible target grammars; L = {L_g | g ∈ G}. All computable phrase structure grammars may be enumerated as g_1, g_2, .... Given any r.e. language L, there are an infinite number of g_i's s.t. L_{g_i} = L. Any collection of grammars may then be defined by specifying their indices in an acceptable enumeration.
A framework for learning - Remarks 2. Example sentences. A psychologically plausible learning algorithm converges to the target in a manner that is independent of the order of presentation of sentences. Examples are presented in i.i.d. fashion according to a fixed underlying probability distribution μ. μ characterizes the relative frequency of the different kinds of sentences that children are likely to encounter during language acquisition, e.g. children are more likely to hear short sentences than substantially embedded longer sentences. μ might have support only over L_t, in which case only positive examples are presented to the learner.
A framework for learning - Remarks 3. Learning algorithm. An effective procedure allowing the learning child to construct hypotheses about the identity of the target language on the basis of the examples it has received; formally, a mapping from the set of all finite data streams to hypotheses in H. (s_1, s_2, ..., s_k): a particular finite data stream of k example sentences. D_k = {(s_1, ..., s_k) | s_i ∈ Σ*} = (Σ*)^k: the set of all possible sequences of k example sentences that the learner might potentially receive. D = ∪_{k>0} D_k: the set of all finite data sequences. H: the enumerable set of hypothesis grammars (languages). A: a partial recursive function s.t. A: D → H. Given a data stream t ∈ D, the learner's hypothesis is given by A(t) and is an element of H.
A framework for learning - Remarks 3. Learning algorithm (cont.). The behavior of the learner for a particular data stream depends only on the data stream and can be predicted either deterministically or probabilistically if the learning algorithm is analyzable. Some kinds of learning procedure: A consistent learner always maintains a hypothesis (call it h_n after n examples) that is consistent with the entire data set it has received so far: if the data set the learner has received is t_n = (s_1, ..., s_n), then the learner's grammatical hypothesis h_n is s.t. each s_i ∈ L_{h_n}. An empirical risk minimizing learner uses the following procedure: h_n = arg min_{h ∈ H} R(h, (s_1, ..., s_n)), where the risk function R(h, (s_1, ..., s_n)) measures the fit to the empirical data consisting of the example sentences. If the minimizer is not unique, the learner might conjecture the smallest or simplest grammar that fits the data.
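The empirical risk minimizing learner can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the toy hypothesis class `H`, the choice of risk (fraction of examples a hypothesis fails to contain), and the tie-break by language size are not taken from the slides.

```python
from typing import Dict, FrozenSet, Sequence

# Hypothetical toy hypothesis class: each named grammar stands for a finite language.
H: Dict[str, FrozenSet[str]] = {
    "h1": frozenset({"a"}),
    "h2": frozenset({"a", "ab"}),
    "h3": frozenset({"a", "ab", "abb"}),
}

def empirical_risk(language: FrozenSet[str], data: Sequence[str]) -> float:
    """Assumed risk R(h, data): fraction of example sentences not in the language."""
    return sum(s not in language for s in data) / len(data)

def erm_learner(data: Sequence[str]) -> str:
    """Return a risk-minimizing hypothesis; ties go to the smallest language."""
    return min(H, key=lambda name: (empirical_risk(H[name], data), len(H[name])))

print(erm_learner(["a", "ab"]))  # both h2 and h3 fit; the smaller one is chosen
```

With zero risk attainable, this learner is also a consistent learner in the sense above; the tie-break stands in for "smallest or simplest grammar that fits the data".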
A framework for learning - Remarks 3. Learning algorithm (cont.). Some kinds of learning procedure (cont.): A memoryless learning algorithm is one whose hypothesis at every point depends only on the current data and the previous hypothesis: A(t_n) depends only upon A(t_{n-1}) and s_n. Learning by enumeration: the learner enumerates all possible grammars in H in some order; let this enumeration be h_1, h_2, .... It begins with the conjecture h_1. Upon receiving new example sentences, the learner simply goes down this list and updates its conjecture to the first one that is consistent with the data seen so far.
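Learning by enumeration can be sketched in a few lines of Python. The enumeration used here (hypothesis i is the language {a, aa, ..., a^i}) and the `max_index` cutoff are hypothetical choices made only so that the example runs; the slides do not fix a particular class.

```python
from typing import Iterable, Iterator, List, Set

def language(i: int) -> Set[str]:
    """Hypothetical enumeration: h_i generates {a, aa, ..., a^i}."""
    return {"a" * k for k in range(1, i + 1)}

def consistent(i: int, data: List[str]) -> bool:
    """h_i is consistent if it contains every sentence seen so far."""
    return all(s in language(i) for s in data)

def enumeration_learner(text: Iterable[str], max_index: int = 1000) -> Iterator[int]:
    """Yield the index of the conjecture after each new sentence."""
    i, seen = 1, []          # start with the conjecture h_1
    for s in text:
        seen.append(s)
        # Earlier hypotheses were already rejected, so resume from the current one.
        while not consistent(i, seen):
            i += 1
            if i > max_index:
                raise ValueError("no consistent hypothesis within the cutoff")
        yield i

print(list(enumeration_learner(["a", "aaa", "aa", "aaa"])))
```

Resuming the search from the current index is equivalent to restarting from h_1, because a hypothesis that was inconsistent with a prefix of the data stays inconsistent with any extension of it.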
A framework for learning - Remarks 4. Criterion of success. Measure how well the learner has learned at any point in the learning process. Learnability: let d be a distance measure, g_t any target grammar, and h_n the learner's hypothesis grammar after n sentences. The learner's hypothesis converges to the target in the limit if lim_{n→∞} d(g_t, h_n) = 0. If this holds for every g_t ∈ G, the family G is said to be learnable by the learning algorithm A. Learnability implies that the generalization error eventually converges to zero as the number of examples goes to infinity. Generalization error: the quantity d(g_t, h_n) that measures the distance of the learner's hypothesis (after n examples) to the target.
A framework for learning - Remarks 5. Generality of the framework. The basic framework presented here for the analysis of learning systems is quite general. The target and hypothesis classes L and H might consist of grammars in a generative linguistics tradition. Learning algorithms could in principle be grammatical inference procedures, gradient descent schemes, Minimum Description Length (MDL) learning, maximum-likelihood learning via the EM algorithm, and so on. Consider the case in which the hypothesis class H is equal to the target class G.
The inductive inference approach. Definition 2.1: A text t for a language L is an infinite sequence of examples s_1, ..., s_n, ... s.t. (1) each s_i ∈ L, and (2) every element of L appears at least once in t. Notation: t(k) is the kth element of the text (s_k); t_k is the sequence of the first k elements of the text (s_1, ..., s_k); t_k ∈ D_k (if represented as a k-tuple).
The inductive inference approach. Definition 2.2: Fix a distance metric d, a target grammar g_t ∈ G, and a text t for the target language L_{g_t}. The learning algorithm A identifies (learns) the target g_t (L_{g_t}) on the text t in the limit if lim_{k→∞} d(A(t_k), g_t) = 0. If the target grammar g_t is identified (learned) by A for all texts of L_{g_t}, the learning algorithm is said to identify (learn) g_t in the limit. If all grammars in G can be identified (learned) in the limit, then the class of grammars G (correspondingly, the class of languages L) is said to be identifiable (learnable) in the limit. Thus learnability is equivalent to identification in the limit.
The inductive inference approach. Definition 2.3: Given a finite sequence x = s_1, s_2, ..., s_n (of length n), we denote the length of the sequence by lh(x) = n. For such a sequence x: range(x) is the set of all unique elements of x; x ⊆ L means that each s_i in x is contained in the language L, i.e. it is shorthand for range(x) ⊆ L; x ◦ y = x_1, ..., x_n, y_1, ..., y_m is the concatenation of two sequences x = x_1, ..., x_n and y = y_1, ..., y_m.
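The notation of Definition 2.3 maps directly onto tuple and set operations. The following is a small Python sketch; the function names `rng`, `contained_in`, and `concat` are illustrative choices, and sequences are assumed to be represented as tuples of strings.

```python
from typing import Set, Tuple

Seq = Tuple[str, ...]

def lh(x: Seq) -> int:
    """Length of the sequence x."""
    return len(x)

def rng(x: Seq) -> Set[str]:
    """range(x): the set of distinct elements of x."""
    return set(x)

def contained_in(x: Seq, L: Set[str]) -> bool:
    """x ⊆ L, i.e. range(x) ⊆ L."""
    return rng(x) <= L

def concat(x: Seq, y: Seq) -> Seq:
    """x ◦ y: x followed by y; its length is lh(x) + lh(y)."""
    return x + y

x, y = ("a", "b", "a"), ("c",)
print(lh(x), rng(x), contained_in(x, {"a", "b"}), concat(x, y))
```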
The inductive inference approach. Theorem 2.1 (after Blum and Blum; ε-version): If A identifies a target grammar g in the limit, then, for every ε > 0, there exists a locking data set l_ε ∈ D s.t. (1) l_ε ⊆ L_g, (2) d(A(l_ε), g) < ε, and (3) d(A(l_ε ◦ σ), g) < ε for all σ ∈ D where σ ⊆ L_g. In other words, after encountering the locking data set, the learner will be ε-close to the target as long as it continues to be given sentences from the target language. This suggests that if a grammar g (correspondingly, a language L_g) is identifiable (learnable) in the limit, a locking data set exists that "locks" the learner's conjectures to within an ε-ball of the target grammar after encountering this locking data set.
The inductive inference approach. Proof of Theorem 2.1 [proof body not recoverable from the source slide]
The inductive inference approach. Consider the important and classical case of exactly identifying the target language in the limit. Theorem 2.2 (Blum and Blum 1975): If A identifies a target grammar g in the limit, then there exists a locking data set l ∈ D such that (1) l ⊆ L_g, (2) d(A(l), g) = 0, and (3) d(A(l ◦ σ), g) = 0 for all σ ∈ D where σ ⊆ L_g.
The inductive inference approach. Theorem 2.3 (Gold 1967): If the family L consists of all the finite languages and at least one infinite language, then it is not learnable (identifiable) in the limit.
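The slide states Theorem 2.3 without proof. A sketch of the standard locking-sequence argument, built on Theorem 2.2, is given below. It assumes the 0-1 extensional distance (d(g, g') = 0 exactly when L_g = L_{g'}); the symbols g_∞, L_∞, L_f, and τ are introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
&\text{Suppose } \mathcal{A} \text{ identifies every language in } \mathcal{L},
  \text{ and let } g_\infty \text{ be a grammar for an infinite } L_\infty \in \mathcal{L}.\\
&\text{By Theorem 2.2 there is a locking data set } l \subseteq L_\infty
  \text{ with } d(\mathcal{A}(l), g_\infty) = 0 \text{ and }
  d(\mathcal{A}(l \circ \sigma), g_\infty) = 0 \text{ for all } \sigma \subseteq L_\infty.\\
&\text{Let } L_f = \mathrm{range}(l). \text{ Then } L_f \text{ is finite, so }
  L_f \in \mathcal{L} \text{ and } L_f \subsetneq L_\infty.\\
&\text{Let } t = l \circ \tau, \text{ where } \tau \text{ repeats the elements of } L_f
  \text{ forever; } t \text{ is a text for } L_f.\\
&\text{Every prefix } t_k \text{ with } k > \mathrm{lh}(l) \text{ has the form }
  l \circ \sigma \text{ with } \sigma \subseteq L_f \subseteq L_\infty,
  \text{ so } d(\mathcal{A}(t_k), g_\infty) = 0.\\
&\text{Hence } \mathcal{A}(t_k) \text{ generates } L_\infty \neq L_f \text{ for all large } k,
  \text{ so } \mathcal{A} \text{ does not identify } L_f \text{ on } t. \text{ Contradiction.}
\end{align*}
```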
The inductive inference approach. Corollary 2.1: The families of languages represented by (1) deterministic finite automata and (2) context-free grammars are not identifiable in the limit, since both the regular and the context-free languages contain all the finite languages and many infinite ones. All grammatical families in the core Chomsky hierarchy of grammars are unlearnable in this sense.
The inductive inference approach. Theorem 2.4 (Angluin 1980): The family L is learnable (identifiable) if and only if for each L ∈ L there is a finite subset D_L ⊆ L such that if L' ∈ L contains D_L, then L' is not a proper subset of L.
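The subset condition in Theorem 2.4 can be checked mechanically on a toy family. The Python sketch below is only an illustration under a strong assumption: every language in the family is finite, so a suitable D_L always exists (L itself works) and brute-force search terminates. The theorem is interesting precisely for infinite languages, which this sketch does not handle. The names `find_telltale` and `family` are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterator, List, Optional

Lang = FrozenSet[str]

def finite_subsets(L: Lang) -> Iterator[Lang]:
    """All subsets of a finite language, smallest first."""
    items = sorted(L)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def find_telltale(L: Lang, family: List[Lang]) -> Optional[Lang]:
    """Smallest D_L ⊆ L such that no L' in the family satisfies D_L ⊆ L' ⊊ L."""
    for D in finite_subsets(L):
        if not any(D <= Lp and Lp < L for Lp in family):
            return D
    return None

family = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
for L in family:
    print(sorted(L), "->", sorted(find_telltale(L, family)))
```

In this chain of languages, the sentence that distinguishes each language from its proper sublanguages serves as D_L: once the learner has seen it, no smaller language in the family is still a candidate.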
The inductive inference approach - Discussion. Remark 1: Difficulty of inferring a set from examples of members of this set. Children are not exposed directly to the grammar. They are exposed only to the expressions of their language, and their task is to learn the grammar that provides a compact encoding of the ambient language. Remark 2: The precise notion of convergence depends upon the distance metric d. Case 1: d(g_1, g_2) is 0-1 valued and depends on whether L_{g_1} = L_{g_2}. Under this metric, the learner may converge on the correct extensional set without converging to the correct grammar. Case 2: d(g_i, g_j) = |i - j| or |C(i) - C(j)|, where C(i) and C(j) are measures of grammatical complexity in some sense. This notion of convergence is significantly more stringent, and much less is learnable.
The inductive inference approach - Discussion. Remark 3: The proper role of language is to mediate a mapping between form and meaning, or symbol and referent. Reduce a language to a mapping from Σ_1* to Σ_2*, where Σ_1* is the set of all linguistic expressions (over some alphabet Σ_1) and Σ_2* is a characterization of the set of all possible meanings. A language L is then regarded as a subset of Σ_1* × Σ_2* (form-meaning pairs). Possible formulations of the notion of a language: 1. L ⊂ Σ*: the central and traditional notion of a language. 2. L ⊂ Σ_1* × Σ_2*: a subset of form-meaning pairs, in a formal sense no different from notion 1.
The inductive inference approach - Discussion. Remark 3 (cont.): Possible formulations of the notion of a language (cont.): 3. L: Σ* → [0, 1]: a language maps every expression to a real number between 0 and 1. For any expression s ∈ Σ*, the number L(s) characterizes the degree of well-formedness of that expression, with L(s) = 1 denoting perfect grammaticality and L(s) = 0 denoting complete lack of it. 4. L is a probability distribution μ on Σ*: this is the usual notion of a language in statistical language modeling. 5. L is a probability distribution μ on Σ_1* × Σ_2*. Extended notions of a language make the learning problem for the child harder rather than easier.
The inductive inference approach - Discussion. Remark 4: If the space L = H of possible languages is too large, then the family is no longer learnable. The class of regular languages (DFAs) is too large for learnability. Remark 5: Is the class of infinite regular languages unlearnable? Consider two languages L_1 and L_2 s.t. L_1 ⊂ L_2 and L_2 \ L_1 is an infinite set. Clearly one may find two such languages in L = {infinite regular languages}. Let σ_2 be a locking sequence for L_2. Then L_1 ∪ range(σ_2) is a language that contains σ_2, is a proper subset of L_2, and belongs to L. This language will not be learnable from a text whose prefix is σ_2, because the learner will lock on to L_2 on such a text.
The inductive inference approach - Discussion. Remark 6: The most compelling objection to the classical inductive inference paradigm comes from statistical quarters. It seems unreasonable to expect the learner to exactly identify the target on every single text. The Probably Approximately Correct (PAC) learning framework instead tries to learn the target approximately with high probability. The quantity d(g_t, h_n) is now a random variable that must converge to 0.
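The slide does not spell out in what sense the random variable d(g_t, h_n) must converge to 0. One common reading, consistent with "approximately, with high probability", is convergence in probability. The block below is a sketch under that assumption; the symbols ε, δ, and n_0 are introduced here, and the full PAC definition additionally constrains how fast n_0 may grow.

```latex
\[
\forall \varepsilon > 0:\quad
\lim_{n \to \infty} \Pr\big[\, d(g_t, h_n) > \varepsilon \,\big] = 0,
\]
\[
\text{equivalently:}\quad
\forall \varepsilon, \delta > 0\ \ \exists\, n_0(\varepsilon, \delta)\ \ \forall n \ge n_0:\quad
\Pr\big[\, d(g_t, h_n) > \varepsilon \,\big] < \delta .
\]
```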
The inductive inference approach - Additional results. Assumption: the text is generated in i.i.d. fashion according to a probability measure μ on the sentences of the target language L. We adopt this assumption in order to derive probabilistic bounds on the performance of the language learner. Measures on the product spaces: a measure μ_2 on the product space L × L; all texts t from the language L that have been generated according to i.i.d. draws from μ will be such that t_2 ∈ L × L. A measure μ_3 on the product space L × L × L, and so on. By the Kolmogorov Extension Theorem, a unique measure μ_∞ is guaranteed to exist on the set T = Π_{i=1}^∞ L_i (where L_i = L for each i). The set T consists of all texts that may be generated from L by i.i.d. draws according to μ. The measure μ_∞ is defined on T, and thus we have a measure on the set of all texts.
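The i.i.d. assumption is easy to simulate for a finite prefix of a text. The Python sketch below draws a prefix t_k from a distribution μ; the particular μ (a hypothetical four-sentence language in which shorter sentences are more probable) and the function name `sample_text_prefix` are illustrative assumptions.

```python
import random
from typing import Dict, List

def sample_text_prefix(mu: Dict[str, float], k: int, seed: int = 0) -> List[str]:
    """Draw the first k sentences of a text i.i.d. according to mu."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    sentences = list(mu)
    weights = [mu[s] for s in sentences]
    return rng.choices(sentences, weights=weights, k=k)

# Hypothetical measure on a tiny target language; shorter sentences are likelier.
mu = {"a": 0.5, "ab": 0.25, "abb": 0.15, "abbb": 0.10}
print(sample_text_prefix(mu, 10))
```

Because μ has support only on the target language, every sampled sentence is a positive example, matching the text-based setting of the earlier definitions.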
The inductive inference approach - Additional results. Theorem 2.5: Let A be an arbitrary learning algorithm and g be an arbitrary (not necessarily the target) grammar. Then the set of texts on which the learning algorithm A converges to g is measurable.
The inductive inference approach - Additional results. Definition 2.4: Let g be a target grammar, and let texts be presented to the learner in i.i.d. fashion according to a probability measure μ on L_g. If there exists a learning algorithm A s.t. μ_∞({t | lim_{k→∞} d(A(t_k), g) = 0}) = 1, then the target is said to be learnable with measure 1. The family G is said to be learnable with measure 1 if all grammars in G are learnable with measure 1 by some algorithm A. Notes: 1. If the measure μ is known in a certain sense, the entire family of r.e. sets becomes learnable with measure 1. 2. On the other hand, if μ is unknown, the superfinite families (those containing all the finite languages and at least one infinite language) are not learnable. Thus the class of learnable languages is not enlarged for distribution-free learning in a stochastic setting. 3. Computable distributions make languages learnable. Thus any collection of computable distributions is identifiable in the limit.
The inductive inference approach - Additional results. Consider the set of r.e. languages. L_1, L_2, L_3, ...: an enumeration of the r.e. languages. Measure μ_i: associated with L_i. If L_i is the target language, then examples are drawn in i.i.d. fashion according to the measure μ_i. A natural measure μ_{i,∞} exists on the set of texts for L_i. Theorem 2.6: With strong prior knowledge about the nature of the μ_i's, the family L of r.e. languages is measure-1 learnable.
The inductive inference approach - Additional results. Proof of Theorem 2.6 [proof body, spread over three slides, not recoverable from the source]
The inductive inference approach - Additional results. Definition 2.5: Consider a target grammar g and a text stochastically presented by i.i.d. draws from the target language L_g according to a measure μ. If a learning algorithm exists that can learn the target grammar with measure 1 for all measures μ, then g is said to be learnable in a distribution-free sense. A family of grammars G is learnable in a distribution-free sense if there exists a learning algorithm that can learn every grammar in the family with measure 1 in a distribution-free sense.
The inductive inference approach - Additional results. Note: When one considers statistical learning, the distribution-free requirement is the natural one, and all statistical estimation algorithms are required to converge in a distribution-free sense. When this restriction is imposed, the class of learnable families is not enlarged. Theorem 2.7 (Angluin 1988): If a family of grammars G is learnable with measure 1 (on almost all texts) in a distribution-free sense, then it is learnable in the limit in the Gold sense (on all texts). Theorem 2.8 (Pitt 1989): If a family of grammars G is learnable with measure p > 1/2, then it is learnable in the limit in the Gold sense.
The inductive inference approach - Additional results. A few positive results on the learning of grammatical families: 1. Conditions under which the family M (a collection of measures) becomes identifiable in the limit: uniformly computable distributions. Let M = {μ_0, μ_1, μ_2, ...} be a computable family of distributions, with a distance defined between any two distributions μ_i and μ_j. The family M is said to be uniformly computable if there exists a total recursive function f(i, x, ε) s.t. for every i, every x ∈ Σ*, and every ε > 0, f(i, x, ε) outputs a rational number p s.t. |μ_i(x) - p| < ε. The learner receives a text probabilistically drawn according to an unknown target measure from the family M. After k examples are received, the learning algorithm guesses A(t_k) ∈ M. One can then construct a learning algorithm that identifies the target measure in the limit.
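The role of the function f(i, x, ε) can be made concrete with a toy family. In the Python sketch below, the family (μ_i is a geometric distribution over strings a^n with ratio 1/(i+2)) is a hypothetical choice, not one from the slides; f returns a rational within ε of μ_i(x) by rounding the exact value down to a grid of width ε/2.

```python
from fractions import Fraction

def mu(i: int, x: str) -> Fraction:
    """Hypothetical family: mu_i(a^n) = (1 - p) * p**n with p = 1/(i+2); 0 elsewhere."""
    if set(x) - {"a"}:
        return Fraction(0)
    p = Fraction(1, i + 2)
    return (1 - p) * p ** len(x)

def f(i: int, x: str, eps: Fraction) -> Fraction:
    """Rational approximation p with |mu_i(x) - p| < eps, for any eps > 0."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    step = Fraction(eps) / 2
    # Round the exact value down to a multiple of step; the error is below step < eps.
    return (mu(i, x) // step) * step

print(f(1, "aa", Fraction(1, 100)), mu(1, "aa"))
```

A single procedure f serves every index i, which is what "uniformly" refers to: the learner can evaluate any candidate μ_i on any sentence to any precision.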
The inductive inference approach - Additional results. A few positive results on the learning of grammatical families (cont.): 2. Consider active learning on the part of the learner. The learner is allowed to make queries about the membership of arbitrary elements x ∈ Σ* (membership queries). This allows the regular languages to be learned in polynomial time, though context-free grammars remain unlearnable. It is certainly reasonable to consider the possibility that children explore the environment and that this active exploration facilitates learning and circumvents some of the intractability inherent in inductive inference.
The inductive inference approach - Additional results. A few positive results on the learning of grammatical families (cont.): 3. Consider further restrictions on the set of texts on which the learner is required to succeed. (a) Recursive texts: A text t is said to be recursive if {t_n | n ∈ N} is recursive. If a computable map (from data sets to grammars) exists that can learn a family of languages L from recursive texts, then L is algorithmically learnable from all texts. Restricting learnability to recursive texts does not enlarge the family of learnable languages.
The inductive inference approach - Additional results. A few positive results on the learning of grammatical families (cont.): 3. Further restrictions on the set of texts (cont.). (b) Ascending texts: A text t is said to be ascending if for all n < m, the length of t(n) is less than or equal to the length of t(m), i.e. sentences are presented in nondecreasing order of length. There are language families L that are learnable from ascending texts but not learnable from all texts. Superfinite families remain unlearnable in this setting.
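The ascending condition can be checked on any finite prefix of a text. In this short Python sketch (the function name `is_ascending` is illustrative), comparing adjacent sentences suffices, since "less than or equal" is transitive.

```python
from typing import Sequence

def is_ascending(prefix: Sequence[str]) -> bool:
    """True if sentence lengths never decrease along the prefix."""
    return all(len(a) <= len(b) for a, b in zip(prefix, prefix[1:]))

print(is_ascending(["a", "ab", "ba", "abb"]))   # lengths 1, 2, 2, 3
print(is_ascending(["ab", "a"]))                # length drops from 2 to 1
```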
The inductive inference approach - Additional results. A few positive results on the learning of grammatical families (cont.): 3. Further restrictions on the set of texts (cont.). (c) Informant texts: A text is said to be an informant if it consists of both positive and negative examples. Every element of Σ* appears in the text with a label indicating whether it belongs to the target language or not. All recursively enumerable sets are learnable from informant texts. However, it seems unlikely that the learning child ever gets an opportunity to sample the space of negative examples with enough coverage to get an unbiased estimate of the target language.
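Why negative examples help can be seen by running enumeration on labeled data. The Python sketch below is an illustration under stated assumptions: the three-element hypothesis list is hypothetical, membership is assumed decidable for every hypothesis, and the enumeration deliberately puts an overly general infinite language first.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Hypothesis = Callable[[str], bool]

# Hypothetical enumeration: the infinite language a+ first, then two finite ones.
hypotheses: List[Hypothesis] = [
    lambda s: len(s) > 0 and set(s) == {"a"},   # h_0: all nonempty strings of a's
    lambda s: s in {"a"},                       # h_1: {a}
    lambda s: s in {"a", "aa"},                 # h_2: {a, aa}
]

def informant_learner(
    labeled_text: Iterable[Tuple[str, bool]], hyps: List[Hypothesis]
) -> Iterator[int]:
    """Yield the index of the first hypothesis agreeing with every label seen so far."""
    i, seen = 0, []
    for sentence, label in labeled_text:
        seen.append((sentence, label))
        while not all(hyps[i](x) == y for x, y in seen):
            i += 1
            if i >= len(hyps):
                raise ValueError("no hypothesis in the list fits the labeled data")
        yield i

data = [("a", True), ("aa", True), ("aaa", False)]
print(list(informant_learner(data, hypotheses)))
```

From positive examples alone, h_0 would never be rejected when the target is {a, aa}; the negative label on "aaa" is what lets the learner retreat from the overly general conjecture.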
The inductive inference approach - Additional results. A few positive results on the learning of grammatical families (cont.): 4. Consider weaker convergence criteria, e.g. the framework of anomalies, where one is required to learn the target language up to (at most) k mistakes. 5. Consider various ways to incorporate structure into the learning problem, leading to learnability results, e.g. learning context-free grammars from structured examples, or learning categorial grammars.
Summary. These sections presented the central developments and results of the theory of inductive inference, which continues to provide the basic formal framework for reasoning about language acquisition. The main implication of these results is that learning in the complete absence of any prior information is infeasible.