INFORMATION THEORY AND IMPLICATIONS Dušica Filipović Đurđević Laboratory for Experimental Psychology, Department of Psychology, Faculty of Philosophy, University of Novi Sad, Novi Sad, Serbia Laboratory for Experimental Psychology, Department of Psychology, Faculty of Philosophy, University of Belgrade, Belgrade, Serbia
OUR PRODUCTS TELL A STORY ABOUT US A feature of a product made to fit us points to a feature of ours. UNDERSTANDING US: from the feature of the product made to fit us to our own feature.
LANGUAGE AS A NATURAL SYSTEM Language structure mirrors the human mind: a window into the human mind. Wundt: higher mental functions can be analysed only through an understanding of their products.
OUR GOAL Relate reaction time (RT) to the complexity of a given aspect of language.
HOW TO DESCRIBE THE COMPLEXITY OF LANGUAGE? Linguistic descriptions: provide a general framework and a detailed systematization. Probability Theory: frequencies and probabilities of language events. Information Theory: brings the two together.
PROBABILITY Relative frequency of an outcome (event e) in a series of n identical experiments (Pascal, 1654): P(e) = f(e)/n. Relative frequency of an item in a corpus.
PROBABILITY: EXAMPLE Imagine a corpus of 100 complex words with three suffixes: 50 -ness, 40 -ly, 10 -less. Probability of finding a particular suffix: P({ness}) = 50/100 = 0.5; P({ly}) = 40/100 = 0.4; P({less}) = 10/100 = 0.1.
INFORMATION LOAD Minus the logarithm of probability (the base of the logarithm can vary): I(e) = -log P(e). The less likely the event, the larger the amount of information it conveys.
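A minimal sketch in Python (not part of the original slides) that reproduces the two worked examples above: probabilities as relative frequencies in the toy suffix corpus, and information load as the negative log probability (base 2, i.e. in bits).

```python
from math import log2

suffix_counts = {"-ness": 50, "-ly": 40, "-less": 10}
n = sum(suffix_counts.values())      # corpus size: 100 complex words

for suffix, count in suffix_counts.items():
    p = count / n                    # P(e) = f(e) / n
    i = -log2(p)                     # information load in bits
    print(f"{suffix}: P = {p:.2f}, I = {i:.2f} bits")
# -ness: P = 0.50, I = 1.00 bits
# -ly:   P = 0.40, I = 1.32 bits
# -less: P = 0.10, I = 3.32 bits
```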
INFORMATION LOAD: PROCESSING EFFECTS Inflectional morphology Serbian inflected forms Kostić, 1991; 1995; Kostić, Marković, & Baucal, 2003 Sentence processing Surprisal Frank, 2010; 2013; Hale, 2001; Levy, 2008
SERBIAN INFLECTIONAL MORPHOLOGY
Nouns (repeated endings show case syncretism)
              Masculine              Feminine               Neuter
              Singular   Plural      Singular   Plural      Singular   Plural
Nominative    konj-ø     konj-i      vil-a      vil-e       sel-o      sel-a
Genitive      konj-a     konj-a      vil-e      vil-a       sel-a      sel-a
Dative        konj-u     konj-ima    vil-i      vil-ama     sel-u      sel-ima
Accusative    konj-a     konj-e      vil-u      vil-e       sel-o      sel-a
Instrumental  konj-em    konj-ima    vil-om     vil-ama     sel-om     sel-ima
Locative      konj-u     konj-ima    vil-i      vil-ama     sel-u      sel-ima
SERBIAN INFLECTIONAL MORPHOLOGY
Feminine nouns
suffix    F(ei)    p(i) = F(ei)/F(e)    Ii = -log p(i)
-a        18715    0.26                 1.94
-e        27803    0.39                 1.36
-i         7072    0.10                 3.32
-u         9918    0.14                 2.84
-om        4265    0.06                 4.06
-ama       4409    0.06                 4.06
F(e) = 72182
NOT JUST FREQUENCY Average probability per syntactic function/meaning
SERBIAN INFLECTIONAL MORPHOLOGY
Feminine nouns
suffix    case(s)                          F(ei)    R(ei)    F(ei)/R(ei)    p(i) = [F(ei)/R(ei)]/Σ    Ii = -log p(i)
-a        Nom. Sg., Gen. Pl.               18715     54      346.57         0.31                      1.47
-e        Gen. Sg., Nom. Pl., Acc. Pl.     27803    112      248.24         0.22                      2.25
-i        Dat. Sg., Loc. Sg.                7072     43      164.47         0.15                      2.74
-u        Acc. Sg.                          9918     58      171.00         0.15
-om       Ins. Sg.                          4265     32      133.28         0.12                      3.32
-ama      Dat. Pl., Loc. Pl., Ins. Pl.      4409     75       58.79         0.05                      5.06
Σ = 1122.35
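The weighting in the table above can be sketched as follows (Python, illustrative; not the original computation by Kostić and colleagues). Each suffix frequency F(ei) is divided by the number of syntactic functions/meanings it carries, R(ei), and the ratios are renormalised; this reproduces the p(i) column of the table.

```python
# (F(ei), R(ei)) pairs for the feminine-noun suffixes, taken from the table above
suffixes = {"-a": (18715, 54), "-e": (27803, 112), "-i": (7072, 43),
            "-u": (9918, 58), "-om": (4265, 32), "-ama": (4409, 75)}

ratios = {s: f / r for s, (f, r) in suffixes.items()}   # F(ei) / R(ei)
total = sum(ratios.values())                            # Σ ≈ 1122.35

for s, ratio in ratios.items():
    print(f"{s}: F/R = {ratio:7.2f}, p(i) = {ratio / total:.2f}")
```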
SERBIAN INFLECTIONAL MORPHOLOGY Kostić, 1991; 1995; Kostić, Marković, & Baucal, 2003
PROBABILITY: ADDITIVITY The probability of finding -ness or -ly or -less equals 1. If two events do not overlap, the probability of finding either of them equals the sum of their probabilities: P(e1 or e2) = P(e1) + P(e2) if e1 ∩ e2 = ∅.
JOINT PROBABILITIES Probability of the joint occurrence of multiple events. Our example: a corpus of 100 words. What is the probability of finding -ness and -ly? We already know that P({ness}) = 0.5 and P({ly}) = 0.4.
JOINT PROBABILITIES Intuitively: we find -ness in 50% of cases, and we find -ly in 40% of cases. Therefore, jointly, we find them in 50% of 40% of cases, that is, in 20% of cases. Formally: P({ness}, {ly}) = P({ness}) · P({ly}) = 0.5 · 0.4 = 0.2. Generally, P(e1, e2) = P(e1) · P(e2) if e1 and e2 are independent events.
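A toy check (Python, illustrative) of the multiplication rule for independent events on the suffix example above:

```python
p_ness = 0.5
p_ly = 0.4

# P(e1 and e2) = P(e1) * P(e2), valid only if the two events are independent
print(p_ness * p_ly)   # 0.2, i.e. 20% of cases, as on the slide
```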
CONDITIONAL PROBABILITY Often, events are dependent. An example: imagine a corpus in which -ly in a word is always followed by -ness. The probability of finding -ness in a word, given that we found -ly in that word, equals 1. Definition: the probability of the event e2 under the assumption that the event e1 has already occurred is called the conditional probability of the event e2 and is written P(e2|e1).
CONDITIONAL PROBABILITY An example: P({ly}) = 0.4, P({ness} | {ly}) = 1. Their joint probability: P({ly}, {ness}) = P({ly}) · P({ness} | {ly}) = 0.4 · 1 = 0.4. Generally: P(e1, e2) = P(e1) · P(e2|e1).
CONDITIONAL PROBABILITY Another example: P({ly}) = 0.4, P({ness} | {ly}) = 0. Their joint probability: P({ly}, {ness}) = 0.4 · 0 = 0.
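A small illustration (Python) of the general multiplication rule behind both examples, P(e1 and e2) = P(e1) · P(e2 | e1):

```python
p_ly = 0.4

p_ness_given_ly = 1.0              # a corpus in which -ly is always followed by -ness
print(p_ly * p_ness_given_ly)      # joint probability = 0.4

p_ness_given_ly = 0.0              # a corpus in which -ly is never followed by -ness
print(p_ly * p_ness_given_ly)      # joint probability = 0.0
```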
SENTENCE PROCESSING Surprisal: I(w_{t+1}) = -log p(w_{t+1} | w_1 … w_t), the information load of a word given the previously seen word(s) (Frank, 2010; 2013; Hale, 2001; Levy, 2008). Predicts reading latencies: less expected, i.e. more surprising, words take more time to process.
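A minimal surprisal sketch (Python): the toy corpus and the bigram estimator below are illustrative assumptions, not the language models used by Frank, Hale, or Levy.

```python
from math import log2
from collections import Counter

corpus = "the dog chased the cat the cat chased the mouse".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def surprisal(prev_word, next_word):
    """I(w_{t+1}) = -log2 p(w_{t+1} | w_t), estimated from bigram counts."""
    p = bigrams[(prev_word, next_word)] / unigrams[prev_word]
    return -log2(p)

print(surprisal("the", "cat"))     # 1.0 bit: a relatively expected continuation
print(surprisal("the", "mouse"))   # 2.0 bits: less expected, hence more surprising
```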
EVENT VS. SYSTEM (GROUP OF EVENTS) Amount of information: a description of one event. What about random variables? How to calculate the amount of information carried by a random variable manifesting through several events, e.g. the uncertainty of feminine noun inflection?
ENTROPY Answer: by calculating the entropy of the probability distribution of the potential values of the given random variable X, that is, by calculating its average uncertainty: H(X) = -Σ p(x) log p(x).
AN EXAMPLE Variable in question: suffix Potential values: ness, ly, less The same example... P({ness}) =50/100=0.5 P({ly}) =40/100=0.4 P({less}) =10/100=0.1
ENTROPY A higher number of events → higher uncertainty. More balanced probabilities → higher uncertainty.
MAXIMUM ENTROPY Maximum uncertainty: the entropy of the distribution with equally probable outcomes, Hmax = log N. An example: P({ness}) = 0.33, P({ly}) = 0.33, P({less}) = 0.33; Hmax = log2(3) ≈ 1.58 bits.
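A sketch (Python, illustrative) of entropy and maximum entropy for the three-suffix example: H = -Σ p log2 p, and H reaches its maximum log2 N when all N outcomes are equally probable.

```python
from math import log2

def entropy(probs):
    """H = -sum p * log2(p): the average uncertainty of the variable."""
    return -sum(p * log2(p) for p in probs if p > 0)

observed = [0.5, 0.4, 0.1]         # -ness, -ly, -less
uniform = [1/3, 1/3, 1/3]          # equally probable outcomes

print(entropy(observed))           # ~1.36 bits
print(entropy(uniform), log2(3))   # both ~1.58 bits: Hmax = log2 N
```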
SHANNON EQUITABILITY Ratio of observed and maximum entropy: E = H/Hmax. A general measure of "order" within a system, that is, of the distance between the observed state of the system and complete unpredictability. Good for comparing systems with different numbers of elements. Our example: E = 1.36/1.58 ≈ 0.86.
REDUNDANCY Complement of Shannon equitability, the two summing to 1: R = 1 - E. Tells the same story as equitability, but in the opposite direction.
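Continuing the same toy example (Python, illustrative): Shannon equitability as the ratio H/Hmax and redundancy as its complement.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

H = entropy([0.5, 0.4, 0.1])    # observed entropy, ~1.36 bits
H_max = log2(3)                 # maximum entropy for three outcomes, ~1.58 bits

equitability = H / H_max        # ~0.86
redundancy = 1 - equitability   # ~0.14; the two always sum to 1
print(equitability, redundancy)
```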
ENTROPY: PROCESSING EFFECTS Morphology Inflectional and derivational entropy Moscoso del Prado Martin, Kostić, & Baayen, 2004; Baayen, Feldman, Schreuder, 2006 Lexical ambiguity Semantic entropy Filipović Đurđević, & Kostić, 2006 Auditory comprehension Cohort entropy Kemps, Wurm, Ernestus, Schreuder, Baayen, 2005
SERBIAN INFLECTIONAL MORPHOLOGY
Feminine nouns
suffix    F(ei)    p(i) = F(ei)/F(e)    -p(i) log p(i)
-a        18715    0.26                 0.51
-e        27803    0.39                 0.53
-i         7072    0.10                 0.33
-u         9918    0.14                 0.40
-om        4265    0.06                 0.24
-ama       4409    0.06                 0.24
F(e) = 72182                            H = 2.25
SEMANTIC ENTROPY Polysemy as sense uncertainty: balanced vs. unbalanced sense probabilities. [Figure: sense probability distributions for two example words, e.g. horn, shell] Filipović Đurđević, 2007
SEMANTIC ENTROPY Measure of uncertainty: higher entropy → higher uncertainty → shorter RT. [Figure: sense probability distributions] Filipović Đurđević, 2007
COHORT ENTROPY Entropy over the cohort of words consistent with the input heard so far (e.g. CA…: cat, captain, caravan, cathedral, …). Kemps, Wurm, Ernestus, Schreuder, & Baayen, 2005
ENTROPY AND THE BRAIN Entropy: hippocampus (expected novelty, before it occurs). Surprisal: sensory processing (novelty per se). Strange, Duggins, Penny, Dolan, & Friston, 2005
JOINT ENTROPY – TOTAL ENTROPY OF THE SYSTEM Additivity of entropy: if we imagine two unrelated (independent) systems, X and Y, the total entropy will be H(X,Y) = H(X) + H(Y).
ENTROPY OF MORPHOLOGICAL FAMILY Hmax = log N, where N is the family size: family size corresponds to the entropy of a family whose members are equally probable, whereas entropy also takes the actual probabilities of the family members into account.
JOINT ENTROPY H([think], [thinker], thinker) = H([think]) + H([thinker] | [think]) + H(thinker | [think], [thinker]) BUT, notice: these events are mutually exclusive. Moscoso del Prado Martin, Kostić, & Baayen, 2004
JOINT ENTROPY – TOTAL ENTROPY OF THE SYSTEM If X and Y describe related events, we look at all possible outcome pairs and their probabilities P(x,y): H(X,Y) = -Σ Σ P(x,y) log P(x,y).
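A sketch (Python) of joint entropy computed over all outcome pairs; the joint probabilities below are invented for illustration.

```python
from math import log2

# p(x, y) for two related variables (invented numbers)
joint_p = {("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
           ("x2", "y1"): 0.2, ("x2", "y2"): 0.3}

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

H_xy = entropy(joint_p.values())        # H(X,Y) over all pairs (x, y)
H_x = entropy([0.4 + 0.1, 0.2 + 0.3])   # marginal entropy H(X)
H_y = entropy([0.4 + 0.2, 0.1 + 0.3])   # marginal entropy H(Y)

# H(X,Y) equals H(X) + H(Y) only when X and Y are independent ("unrelated")
print(H_xy, H_x + H_y)                  # ~1.85 vs. ~1.97 bits
```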
JOINT ENTROPY - CHARACTERISTICS It is never smaller than the entropy of the initial system: H(X,Y) ≥ H(X). Adding a new system can never reduce uncertainty! Two systems taken together can never have larger entropy than the sum of their individual entropies: H(X,Y) ≤ H(X) + H(Y).
CONDITIONAL ENTROPY Reveals the level of uncertainty that remains in random variable X once the value of random variable Y is known: H(X|Y) = H(X,Y) - H(Y). Equals zero if X is completely predictable from Y, that is, when X = f(Y). Maximal, that is, equal to H(X), when X and Y are independent, that is, when Y tells us nothing about X.
CONDITIONAL ENTROPY: PROCESSING EFFECTS Inflectional morphology Paradigm cell filling problem Ackerman, Blevins, & Malouf, 2009; Ackerman, & Malouf, 2013 Sentence processing Syntactic entropy Frank, 2010; 2013; Hale, 2001; Levy, 2008
CONDITIONAL ENTROPY The paradigm cell filling problem: given some inflected forms of a word (e.g. prozoru), how predictable are its remaining forms (e.g. prozore)? [Table: Serbian declension classes (učenik, Slavko, Pavle, prozor, selo, polje, ime, kube, žena, sudija, stvar) × cases (NOM, GEN, DAT, ACC, VOC, INS, LOC), with shared endings shown as merged cells] Ackerman, Blevins, & Malouf, 2009; Ackerman, & Malouf, 2013
ENTROPY
H(gen.sg) = -p(a) log p(a) - p(e) log p(e) - p(i) log p(i)
          = -8/11 log(8/11) - 2/11 log(2/11) - 1/11 log(1/11)
          = …
(the genitive singular endings -a, -e and -i occur in 8, 2 and 1 of the 11 declension classes, respectively)
Ackerman, Blevins, & Malouf, 2009; Ackerman, & Malouf, 2013
CONDITIONAL ENTROPY
H(gen.sg | dat.sg = i) = -p(e) log p(e) - p(i) log p(i)
                       = -2/3 log(2/3) - 1/3 log(1/3)
                       = …
(once the dative singular is known to be -i, only three classes remain: žena, sudija and stvar, with genitive singular -e, -e and -i)
Ackerman, Blevins, & Malouf, 2009; Ackerman, & Malouf, 2013
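A sketch (Python) of the two computations above; the counts come from the table of 11 declension classes (8, 2 and 1 classes with genitive singular -a, -e, -i, and, among the classes with dative singular -i, two with genitive -e and one with -i).

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# H(gen.sg): uncertainty about the genitive singular ending across all 11 classes
print(entropy([8/11, 2/11, 1/11]))   # ~1.10 bits

# H(gen.sg | dat.sg = -i): only the classes with dative singular -i remain
print(entropy([2/3, 1/3]))           # ~0.92 bits: knowing one cell reduces the uncertainty
```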
CONDITIONAL ENTROPY: PROCESSING EFFECTS
Uncertainty about the rest of the sentence. Probability estimation based on positional frequencies of parts of speech in a finite set of words (e.g. N = 10).
H(t) = -Σ p(w_t) log p(w_t)          t: Mary
H(t+1) = -Σ p(w_{t+1} | w_t) log p(w_{t+1} | w_t)          t+1: Mary left
ΔH = H(t) - H(t+1), ΔH > 0
Frank, 2010; 2013; Hale, 2001; Levy, 2008
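A toy sketch of entropy reduction (Python; the continuation probabilities are invented, not estimated from positional part-of-speech frequencies as in the original studies):

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

H_t = entropy([0.4, 0.3, 0.2, 0.1])   # H(t): uncertainty about the rest of the sentence after "Mary"
H_t1 = entropy([0.7, 0.3])            # H(t+1): fewer continuations remain after "Mary left"

delta_H = H_t - H_t1                  # ΔH > 0: the new word reduced the uncertainty
print(H_t, H_t1, delta_H)
```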
MUTUAL INFORMATION The amount of information shared by X and Y. Equals zero when X and Y are independent (they tell nothing about each other). Maximal, equal to H(X) (= H(Y)), when X and Y are identical: all the information is already contained in X, and adding Y tells nothing new (and vice versa).
RELATIONS AMONG MEASURES
I(X,Y) = H(X) - H(X|Y)
       = H(Y) - H(Y|X)
       = H(X) + H(Y) - H(X,Y)
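A quick numerical check of these identities (Python, with an invented joint distribution): mutual information computed from its definition equals H(X) + H(Y) - H(X,Y).

```python
from math import log2

joint_p = {("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
           ("x2", "y1"): 0.2, ("x2", "y2"): 0.3}
p_x = {"x1": 0.5, "x2": 0.5}   # marginal of X
p_y = {"y1": 0.6, "y2": 0.4}   # marginal of Y

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# I(X,Y) from the definition: sum of p(x,y) * log2[ p(x,y) / (p(x) p(y)) ]
mi_definition = sum(pxy * log2(pxy / (p_x[x] * p_y[y]))
                    for (x, y), pxy in joint_p.items())

# I(X,Y) from the entropies, as on the slide
mi_entropies = entropy(p_x.values()) + entropy(p_y.values()) - entropy(joint_p.values())

print(mi_definition, mi_entropies)   # both ~0.12 bits
```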
MEASURES OF DISTANCE BETWEEN DISTRIBUTIONS Kullback-Leibler divergence (relative entropy): D(p||q) = Σ p(x) log [p(x)/q(x)], where p is the distribution we are trying to predict and q is the starting distribution. Measures the distance between two distributions, but is not a true distance measure, because it is asymmetric.
MEASURES OF DISTANCE BETWEEN DISTRIBUTIONS Kullback-Leibler divergence (relative entropy) reveals the additional amount of information we need to predict q(x) if we already know p(x). A measure of the inefficiency of assuming that the distribution is q when the true distribution is p: the number of extra bits needed to choose an event from the set of possibilities when the coding scheme is based on q instead of on the true distribution p.
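A sketch (Python) of Kullback-Leibler divergence and of the asymmetry mentioned above; p and q reuse the toy suffix distributions from the earlier slides.

```python
from math import log2

def kl(p, q):
    """D(p || q) = sum of p(x) * log2( p(x) / q(x) )."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]   # true distribution
q = [1/3, 1/3, 1/3]   # assumed (coding) distribution

print(kl(p, q))       # extra bits paid for coding with q instead of p (~0.22)
print(kl(q, p))       # a different value (~0.30): D is not symmetric, hence not a true distance
```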
MEASURES OF DISTANCE BETWEEN DISTRIBUTIONS Jensen-Shannon divergence (symmetric, a true distance): JSD(p,q) = ½ D(p||m) + ½ D(q||m), where m = ½(p+q). Cross-entropy: H(p,q) = -Σ p(x) log q(x) = H(p) + D(p||q).
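A sketch (Python) of the two measures named above, under the same toy distributions: Jensen-Shannon divergence as a symmetrised KL divergence, and cross-entropy as H(p) plus the KL divergence.

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cross_entropy(p, q):
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.4, 0.1], [1/3, 1/3, 1/3]
print(jsd(p, q), jsd(q, p))                        # identical: JSD is symmetric, unlike KL
print(cross_entropy(p, q), entropy(p) + kl(p, q))  # the two expressions coincide
```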
MEASURES OF DISTANCE: PROCESSING EFFECTS Inflectional morphology Inflectional paradigms and classes Milin, Filipović Đurđević, & Moscoso del Prado Martin, 2009 Auditory comprehension Balling, & Baayen, 2012 Derivational morphology Derivational mini-paradigms and mini-classes Milin, Kuperman, Kostić, & Baayen, 2009
PARADIGM AND CLASS
Frequency distributions
Inflectional paradigm                  Inflectional class (feminine nouns)
Inflected form    Probability          Suffix    Probability
saun-a            0.31                 -a        0.26
saun-e            0.09                 -e        0.39
saun-i            0.34                 -i        0.10
saun-u            0.16                 -u        0.14
saun-om           0.05                 -om       0.06
saun-ama                               -ama
RELATIVE ENTROPY D(p||q) = Σ p(i) log [p(i)/q(i)] p(i) – probability distribution of the inflected forms of an inflectional paradigm (e.g. of the word knjiga) q(i) – probability distribution of the inflected forms of the inflectional class (e.g. feminine nouns)
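A sketch (Python) of this measure. The class distribution below uses the feminine-noun suffix probabilities from the earlier tables; the paradigm distribution is an illustrative stand-in for a single noun, and the last value of each list is filled in here only so that the toy distributions sum to one.

```python
from math import log2

def relative_entropy(p, q):
    """D(p || q) = sum of p(i) * log2( p(i) / q(i) )."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# suffix order: -a, -e, -i, -u, -om, -ama
paradigm = [0.31, 0.09, 0.34, 0.16, 0.05, 0.05]   # p(i): inflected forms of one noun (illustrative)
klass    = [0.26, 0.39, 0.10, 0.14, 0.06, 0.05]   # q(i): the inflectional class

# the larger D(p||q), the more the paradigm deviates from its class
print(relative_entropy(paradigm, klass))
```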
D(p||q) PREDICTS RT
RELATIVE ENTROPY IN AUDITORY COMPREHENSION [Figure: probabilities of lexical candidates sharing the onset ab (abc, abd, abde, abdef) at successive points in the speech signal] Balling & Baayen, 2012
WEIGHTED RELATIVE ENTROPY Baayen et al., 2011 Masked priming Self-paced sentence reading Visual lexical decision task (VLDT)
DERIVATIONAL MINI-CLASSES AND MINI-PARADIGMS Derived words: suffixes and prefixes; word pairs. PARADIGM: KIND – UNKIND. CLASS: KIND – UNKIND, TRUE – UNTRUE, PLEASANT – UNPLEASANT, … Cross-entropy predicts RT. Milin, Kuperman, Kostić, & Baayen, 2009
CONCLUSION There are many ways to describe language in terms of Information Theory. However, we learn nothing about the implementation. Information Theory helps us understand the constraints of the system, that is, why something is optimal. An important step towards understanding how something is processed and how it is implemented in the brain.
THANK YOU! This research was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia (grants 179033 and 179006).
READING MATERIAL
Chapter 2, "Mathematical foundations", in Manning, C. D., & Schütze, H. (2000). Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.
Bod, R. (2003). Introduction to elementary probability theory and formal stochastic language theory. In R. Bod, J. Hay, & S. Jannedy (Eds.), Probabilistic Linguistics. The MIT Press.
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press. http://www.inference.phy.cam.ac.uk/mackay/itila/book.html
Pluymaekers, M., Ernestus, M., & Baayen, R. H. (2005). Articulatory planning is continuous and sensitive to informational redundancy. Phonetica, 62, 146-159.
Wurm, L. H., Ernestus, M., Schreuder, R., & Baayen, R. H. (2006). Dynamics of the auditory comprehension of prefixed words: Cohort entropies and conditional root uniqueness points. The Mental Lexicon, 1, 125-146.
Milin, P., Filipović Đurđević, D., Kostić, A., & Moscoso del Prado Martín, F. (2009). The simultaneous effects of inflectional paradigms and classes on lexical recognition: Evidence from Serbian. Journal of Memory and Language, 60(1), 50-64.
Milin, P., Kuperman, V., Kostić, A., & Baayen, R. H. (2009). Paradigms bit by bit: An information theoretic approach to the processing of paradigmatic structure in inflection and derivation. In J. P. Blevins & J. Blevins (Eds.), Analogy in Grammar: Form and Acquisition (pp. 214-252). Oxford: Oxford University Press.
Moscoso del Prado Martín, F., Kostić, A., & Baayen, R. H. (2004). Putting the bits together: An information-theoretical perspective on morphological processing. Cognition, 94, 1-18.
Kostić, A., & Mirković, J. (2002). Processing of inflected nouns and levels of cognitive sensitivity. Psihologija, 35(3-4), 287-297.
Kostić, A., Marković, T., & Baucal, A. (2003). Inflectional morphology and word meaning: Orthogonal or co-implicative cognitive domains? In H. Baayen & R. Schreuder (Eds.), Morphological Structure in Language Processing (pp. 1-45). Berlin: Mouton de Gruyter.
Tabak, W., Schreuder, R., & Baayen, R. H. (2005). Lexical statistics and lexical processing: Semantic density, information complexity, sex, and irregularity in Dutch. In M. Reis & S. Kepser (Eds.), Linguistic Evidence (pp. 529-555). Mouton.
Balling, L., & Baayen, R. H. (2012). Probability and surprisal in auditory comprehension of morphologically complex words. Cognition, 125, 80-106.
Kemps, R., Wurm, L., Ernestus, M., Schreuder, R., & Baayen, R. H. (2005). Prosodic cues for morphological complexity: Comparatives and agent nouns in Dutch and English. Language and Cognitive Processes, 20, 43-73.
Baayen, R. H., Milin, P., Filipović Đurđević, D., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118, 438-482.
Frank, S. L. (2013). Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5, 475-494.