INFORMATION THEORY AND IMPLICATIONS


1 INFORMATION THEORY AND IMPLICATIONS
Dušica Filipović Đurđević
Laboratory for Experimental Psychology, Department of Psychology, Faculty of Philosophy, University of Novi Sad, Novi Sad, Serbia
Laboratory for Experimental Psychology, Department of Psychology, Faculty of Philosophy, University of Belgrade, Belgrade, Serbia


6 OUR PRODUCTS TELL A STORY ABOUT US

7 OUR PRODUCTS TELL A STORY ABOUT US
A feature of some artefact that is made to fit us corresponds to a feature of ours

8 UNDERSTANDING US
From a feature of some product that is made to fit us, we can infer a feature of ours

9 LANGUAGE AS A NATURAL SYSTEM
Language structure mirrors the human mind: it is a window into the human mind. Wundt: higher mental functions can be analysed only through an understanding of their products.

10 OUR GOAL Relate RT (reaction time) to the complexity of a given aspect of language

11 HOW TO DESCRIBE THE COMPLEXITY OF LANGUAGE?
Linguistic descriptions: provide a general framework and a detailed systematization
Probability theory: frequencies and probabilities of language events
Information theory: brings the two together

12 PROBABILITY Relative frequency of an outcome (event e) in a series of n identical experiments: P(e) ≈ f(e)/n (Pascal, 1654). In language research: the relative frequency of an item in a corpus.

13 PROBABILITY: EXAMPLE Imagine a corpus of 100 complex words with three suffixes: 50 with -ness, 40 with -ly, and 10 with -less. Probability of finding a particular suffix: P({ness}) = 50/100 = 0.5, P({ly}) = 40/100 = 0.4, P({less}) = 10/100 = 0.1
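A minimal sketch of the computation on this slide; the counts come from the toy corpus in the example, while the variable names are illustrative:

```python
# Suffix counts from the toy corpus of 100 complex words.
counts = {"ness": 50, "ly": 40, "less": 10}

total = sum(counts.values())  # 100 words in the corpus

# Relative frequency as an estimate of probability.
probabilities = {suffix: n / total for suffix, n in counts.items()}

print(probabilities)  # {'ness': 0.5, 'ly': 0.4, 'less': 0.1}
```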

14 INFORMATION LOAD Minus the logarithm of probability, I(e) = -log P(e) (the base of the logarithm can vary). The less likely the event, the larger the amount of information it conveys.
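A short continuation of the sketch above, computing information load as the negative logarithm of probability; base 2 is chosen here, so values are in bits (the slide notes the base can vary):

```python
import math

probabilities = {"ness": 0.5, "ly": 0.4, "less": 0.1}

# Information load in bits: the less likely the suffix, the larger the value.
information_load = {s: -math.log2(p) for s, p in probabilities.items()}

print(information_load)  # ness: 1.0, ly: ~1.32, less: ~3.32
```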

15 INFORMATION LOAD: PROCESSING EFFECTS
Inflectional morphology: Serbian inflected forms (Kostić, 1991; 1995; Kostić, Marković, & Baucal, 2003)
Sentence processing: surprisal (Frank, 2010; 2013; Hale, 2001; Levy, 2008)

16 SERBIAN INFLECTIONAL MORPHOLOGY
Noun paradigms (singular / plural): masculine konj, feminine vil-a, neuter sel-o
Nominative: konj-ø / konj-i; vil-a / vil-e; sel-o / sel-a
Genitive: konj-a
Dative: konj-u / konj-ima; vil-i / vil-ama; sel-u / sel-ima
Accusative: konj-e; vil-u
Instrumental: konj-em; vil-om; sel-om
Locative: (endings coincide with the dative)


18 SERBIAN INFLECTIONAL MORPHOLOGY
Feminine nouns: suffix, F(ei), p(i) = F(ei)/F(e), Ii = -log2 p(i)
-a: 18715, 0.26, 1.94
-e: 27803, 0.39, 1.36
-i: 7072, 0.10, 3.32
-u: 9918, 0.14, 2.84
-om: 4265, 0.06, 4.06
-ama: 4409, 0.06, 4.06
Total: F(e) = 72182

19 NOT JUST FREQUENCY Average probability per syntactic function/meaning

20 SERBIAN INFLECTIONAL MORPHOLOGY
Feminine nouns: suffix, case(s), F(ei), R(ei), F(ei)/R(ei), p(i) = [F(ei)/R(ei)]/Σ, Ii = -log p(i)
-a: Nom. Sg., Gen. Pl.; 18715; 54; 346.57; 0.31; 1.47
-e: Gen. Sg., Nom. Pl., Acc. Pl.; 27803; 112; 248.24; 0.22; 2.25
-i: Dat. Sg., Loc. Sg.; 7072; 43; 164.47; 0.15; 2.74
-u: Acc. Sg.; 9918; 58; 171.00
-om: Ins. Sg.; 4265; 32; 133.28; 0.12; 3.32
-ama: Dat. Pl., Loc. Pl., Ins. Pl.; 4409; 75; 58.79; 0.05; 5.06
Σ F(ei)/R(ei) = 1122.35
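A sketch of the normalization described by the column headers: frequency per syntactic function/meaning, F(ei)/R(ei), renormalized over the paradigm and log-transformed. Variable names are illustrative, and the slide's printed I values may reflect rounding or additional weighting from the cited analyses, so exact matches are not guaranteed.

```python
import math

# Suffix: (frequency F(ei), number of syntactic functions/meanings R(ei)),
# values taken from the table above.
suffixes = {
    "-a": (18715, 54), "-e": (27803, 112), "-i": (7072, 43),
    "-u": (9918, 58), "-om": (4265, 32), "-ama": (4409, 75),
}

# Average frequency per function/meaning, F(ei)/R(ei).
per_function = {s: f / r for s, (f, r) in suffixes.items()}
norm = sum(per_function.values())

# Renormalize and take the negative log to obtain an information value.
for s, fr in per_function.items():
    p = fr / norm
    print(f"{s}: F/R = {fr:.2f}, p = {p:.2f}, I = {-math.log2(p):.2f}")
```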

21 SERBIAN INFLECTIONAL MORPHOLOGY
Kostić, 1991; 1995; Kostić, Marković, & Baucal, 2003

22 PROBABILITY: ADDITIVITY
The probability of finding ness or ly or less equals 1: P({ness}) + P({ly}) + P({less}) = 0.5 + 0.4 + 0.1 = 1. If two events do not overlap, the probability of finding either of them equals the sum of their probabilities: if e1 ∩ e2 = ∅, then P(e1 ∪ e2) = P(e1) + P(e2).

23 JOINT PROBABILITIES Probability of the joint occurrence of multiple events
Our example, a corpus of 100 words: what is the probability of finding ness and ly? We already know that P({ness}) = 0.5 and P({ly}) = 0.4

24 JOINT PROBABILITIES Intuitively:
We find ness in 50% of cases and ly in 40% of cases. Therefore, jointly, we find them in 50% of 40% of cases, that is, in 20% of cases. Formally, for independent events: P(e1 ∩ e2) = P(e1) · P(e2), if e1 and e2 are independent events
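A one-line check of the multiplication rule for independent events, using the probabilities from the running example (variable names are illustrative):

```python
p_ness = 0.5
p_ly = 0.4

# For independent events, P(ness and ly) = P(ness) * P(ly).
p_joint = p_ness * p_ly
print(p_joint)  # 0.2, i.e. 20% of cases
```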

25 CONDITIONAL PROBABILITY
Often, events are dependent. An example: imagine a corpus in which ly in a word is always followed by ness. The probability of finding ness in a word, given that we found ly in that word, equals 1. Definition: the probability of the event e2 under the assumption that the event e1 has already occurred is called the conditional probability of the event e2 given e1 and is written P(e2|e1)

26 CONDITIONAL PROBABILITY
An example: P({ly}) = 0.4, P({ness}|{ly}) = 1. Their joint probability: P({ness} ∩ {ly}) = P({ness}|{ly}) · P({ly}) = 1 · 0.4 = 0.4. Generally: P(e1 ∩ e2) = P(e2|e1) · P(e1)

27 CONDITIONAL PROBABILITY
Another example: P({ly}) = 0.4, P({ness}|{ly}) = 0. Their joint probability: P({ness} ∩ {ly}) = 0 · 0.4 = 0
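A small sketch of the two conditional-probability examples above, using the general rule P(e1 ∩ e2) = P(e2|e1) · P(e1); the function and variable names are illustrative:

```python
def joint_probability(p_e1: float, p_e2_given_e1: float) -> float:
    """P(e1 and e2) = P(e2 | e1) * P(e1)."""
    return p_e2_given_e1 * p_e1

# Slide 26: -ly is always followed by -ness, so P(ness | ly) = 1.
print(joint_probability(p_e1=0.4, p_e2_given_e1=1.0))  # 0.4

# Slide 27: -ly is never followed by -ness, so P(ness | ly) = 0.
print(joint_probability(p_e1=0.4, p_e2_given_e1=0.0))  # 0.0
```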

28 SENTENCE PROCESSING Surprisal: I(w_{t+1}) = -log p(w_{t+1} | w_1 … w_t)
The information load of a word given the previously seen word(s) (Frank, 2010; 2013; Hale, 2001; Levy, 2008). Predicts reading latencies: less expected, more surprising words take more time to process.
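A toy sketch of surprisal computed from a conditional (here bigram-style) language model; the tiny probability table is invented purely for illustration and is not taken from the cited studies:

```python
import math

# Hypothetical conditional probabilities p(next word | previous word).
bigram_model = {
    ("Mary", "left"): 0.20,
    ("Mary", "sneezed"): 0.01,
}

def surprisal(prev: str, word: str) -> float:
    """I(w_{t+1}) = -log2 p(w_{t+1} | w_1..t), here conditioned on one word."""
    return -math.log2(bigram_model[(prev, word)])

print(surprisal("Mary", "left"))     # ~2.32 bits: an expected continuation
print(surprisal("Mary", "sneezed"))  # ~6.64 bits: a surprising continuation
```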

29 EVENT VS. SYSTEM (GROUP OF EVENTS)
Amount of information: a description of one event. What about random variables? How do we calculate the amount of information carried by a random variable that manifests through several events, e.g. the uncertainty of feminine noun inflection?

30 ENTROPY Answer: by calculating the entropy of the probability distribution of the potential values of the given random variable X, that is, by calculating its average uncertainty: H(X) = -Σ p(x) log p(x)

31 AN EXAMPLE Variable in question: suffix
Potential values: ness, ly, less The same example... P({ness}) =50/100=0.5 P({ly}) =40/100=0.4 P({less}) =10/100=0.1
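A minimal sketch computing the entropy of the suffix distribution from the running example (base-2 logs, so the result is in bits):

```python
import math

probabilities = [0.5, 0.4, 0.1]  # P(ness), P(ly), P(less)

# H(X) = -sum over outcomes of p * log2(p): the average uncertainty.
entropy = -sum(p * math.log2(p) for p in probabilities)

print(round(entropy, 2))  # ~1.36 bits
```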

32 ENTROPY Higher number of events → higher uncertainty
More balanced probabilities → higher uncertainty

33 MAXIMUM ENTROPY Maximum uncertainty: the entropy of the distribution with equally probable outcomes, Hmax = log N. An example: P({ness}) = P({ly}) = P({less}) = 1/3, so Hmax = log2 3 ≈ 1.58 bits

34 SHANNON EQUITABILITY Ratio of observed and maximum entropy: E = H/Hmax
A general measure of "order" within a system, that is, of the distance between the observed state of the system and complete unpredictability. Good for comparing systems with different numbers of elements. Our example: E = 1.36/1.58 ≈ 0.86

35 REDUNDANCY The complement of Shannon equitability, the two summing to 1: R = 1 - E. Tells the same story as equitability, but in the opposite direction. Our example: R ≈ 0.14
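A short sketch tying together maximum entropy, Shannon equitability, and redundancy for the three-suffix example; the function and variable names are illustrative:

```python
import math

def entropy(ps):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

observed = [0.5, 0.4, 0.1]

h_observed = entropy(observed)        # ~1.36 bits
h_max = math.log2(len(observed))      # log2(3) ~ 1.58 bits, the uniform case

equitability = h_observed / h_max     # ~0.86: how close the system is to uniform
redundancy = 1 - equitability         # ~0.14: the complement of equitability

print(h_observed, h_max, equitability, redundancy)
```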

36 ENTROPY: PROCESSING EFFECTS
Morphology: inflectional and derivational entropy (Moscoso del Prado Martín, Kostić, & Baayen, 2004; Baayen, Feldman, & Schreuder, 2006)
Lexical ambiguity: semantic entropy (Filipović Đurđević & Kostić, 2006)
Auditory comprehension: cohort entropy (Kemps, Wurm, Ernestus, Schreuder, & Baayen, 2005)

37 SERBIAN INFLECTIONAL MORPHOLOGY
Feminine nouns: suffix, F(ei), p(i) = F(ei)/F(e), -p(i) log2 p(i)
-a: 18715, 0.26, 0.51
-e: 27803, 0.39, 0.53
-i: 7072, 0.10, 0.33
-u: 9918, 0.14, 0.40
-om: 4265, 0.06, 0.24
-ama: 4409, 0.06, 0.24
Total: F(e) = 72182; H = 2.25
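A short sketch that recomputes the table's entropy directly from the raw suffix frequencies; summing the exact terms gives about 2.24 bits, while the slide's 2.25 is the sum of the rounded per-suffix terms:

```python
import math

# Frequencies F(ei) of Serbian feminine-noun suffixes (from the table above).
frequencies = {"-a": 18715, "-e": 27803, "-i": 7072,
               "-u": 9918, "-om": 4265, "-ama": 4409}

total = sum(frequencies.values())  # F(e) = 72182

# Inflectional entropy: H = -sum p(i) * log2 p(i).
H = -sum((f / total) * math.log2(f / total) for f in frequencies.values())

print(round(H, 2))  # ~2.24 bits from exact terms; 2.25 when per-suffix terms are rounded
```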


39 SEMANTIC ENTROPY Polysemy as sense uncertainty
Balanced vs. unbalanced sense probabilities (examples: horn, shell) (Filipović Đurđević, 2007)

40 SEMANTIC ENTROPY Measure of uncertainty
Higher entropy → higher uncertainty → shorter RT (sense probability; Filipović Đurđević, 2007)

41 COHORT ENTROPY cathedral cat captain caravan CA …
Kemps, Wurm, Ernestus, Schreuder, Baayen, 2005

42 ENTROPY AND THE BRAIN Entropy: hippocampus (expected novelty, before the event occurs)
Surprisal: sensory processing (novelty per se) (Strange, Duggins, Penny, Dolan & Friston, 2005)

43 JOINT ENTROPY – TOTAL ENTROPY OF THE SYSTEM
Additivity of entropy: if we imagine two unrelated (independent) systems, X and Y, the total entropy will be H(X,Y) = H(X) + H(Y)

44 ENTROPY OF MORPHOLOGICAL FAMILY
Hmax = log N, where N is the family size. Family size vs. entropy.

45 JOINT ENTROPY
H([think], [thinker], thinker) = H([think]) + H([thinker] | [think]) + H(thinker | [think], [thinker]). BUT, notice: these events are mutually exclusive. Moscoso del Prado Martín, Kostić, & Baayen, 2004

46 JOINT ENTROPY – TOTAL ENTROPY OF THE SYSTEM
If X and Y describe related events, we look at all possible outcomes pair-wise, through the joint distribution P(x,y): H(X,Y) = -Σ_x Σ_y p(x,y) log p(x,y)
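A minimal sketch of joint entropy computed pair-wise over a joint distribution p(x, y); the 2x2 table below is invented purely for illustration:

```python
import math

# Illustrative joint distribution p(x, y) over two binary variables (sums to 1).
p_xy = {("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
        ("x2", "y1"): 0.2, ("x2", "y2"): 0.3}

# H(X, Y) = -sum over all (x, y) pairs of p(x, y) * log2 p(x, y).
H_xy = -sum(p * math.log2(p) for p in p_xy.values())

print(round(H_xy, 2))  # ~1.85 bits
```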

47 JOINT ENTROPY - CHARACTERISTICS
Joint entropy is never smaller than the entropy of either initial system: adding a new system can never reduce uncertainty. Two systems taken together can never have larger entropy than the sum of their individual entropies: max(H(X), H(Y)) ≤ H(X,Y) ≤ H(X) + H(Y).

48 CONDITIONAL ENTROPY Reveals the level of uncertainty that remains in random variable X once the value of random variable Y is known: H(X|Y) = H(X,Y) - H(Y). Equals zero if X is completely predictable from Y, i.e. when X = f(Y). Maximal, equal to H(X), when X and Y are independent, i.e. when Y tells us nothing about X.

49 CONDITIONAL ENTROPY: PROCESSING EFFECTS
Inflectional morphology: the paradigm cell filling problem (Ackerman, Blevins, & Malouf, 2009; Ackerman & Malouf, 2013)
Sentence processing: syntactic entropy (Frank, 2010; 2013; Hale, 2001; Levy, 2008)

50 CONDITIONAL ENTROPY Paradigm cell filling problem: prozoru, prozore, … ?
NOM GEN DAT ACC VOC INS LOC učenik Ø a u e om i ima Slavko o / Pavle prozor selo polje em ime kube žena ama sudija o(a) stvar ju (i) Ackerman, Blevins, & Malouf, 2009; Ackerman, & Malouf, 2013

51 ENTROPY NOM GEN DAT ACC VOC INS LOC učenik Ø a u e om i ima Slavko o / Pavle prozor selo polje em ime kube žena ama sudija o(a) stvar ju (i)
H(gen.sg) = -p(a) log p(a) - p(e) log p(e) - p(i) log p(i) = -8/11 log(8/11) - 2/11 log(2/11) - 1/11 log(1/11) = …
Ackerman, Blevins, & Malouf, 2009; Ackerman & Malouf, 2013

52 CONDITIONAL ENTROPY NOM GEN DAT ACC VOC INS LOC učenik Ø a u e om i ima Slavko o / Pavle prozor selo polje em ime kube žena ama sudija o(a) stvar ju (i)
H(gen.sg | dat.sg = i) = -p(e) log p(e) - p(i) log p(i) = -2/3 log(2/3) - 1/3 log(1/3) = …
Ackerman, Blevins, & Malouf, 2009; Ackerman & Malouf, 2013
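A sketch of the two calculations on slides 51 and 52, using base-2 logs; the 8/11, 2/11, 1/11 and 2/3, 1/3 proportions are the ones given on the slides:

```python
import math

def entropy(ps):
    """H = -sum p * log2 p over a probability distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Slide 51: uncertainty about the genitive singular ending across classes.
h_gen = entropy([8/11, 2/11, 1/11])
print(round(h_gen, 2))  # ~1.10 bits

# Slide 52: the same uncertainty once we know the dative singular ends in -i.
h_gen_given_dat_i = entropy([2/3, 1/3])
print(round(h_gen_given_dat_i, 2))  # ~0.92 bits: knowing one cell reduces uncertainty
```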

53 CONDITIONAL ENTROPY: PROCESSING EFFECTS
Uncertainty about the rest of the sentence. Probability estimation based on positional frequencies of parts of speech in a finite set of words (e.g. N = 10).
H(t) = -Σ p(w_t) log p(w_t)    t: Mary
H(t+1) = -Σ p(w_{t+1} | w_t) log p(w_{t+1} | w_t)    t+1: Mary left
ΔH = H(t) - H(t+1), ΔH > 0
Frank, 2010; 2013; Hale, 2001; Levy, 2008

54 MUTUAL INFORMATION Amount of information shared by X and Y
Equals zero when X and Y are independent (they tell nothing about each other). Maximal, equal to H(X) and to H(Y), when X and Y are identical: all the information is already contained in X, and adding Y tells nothing new (and vice versa).

55 RELATIONS AMONG MEASURES
I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)
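A sketch verifying the identities on this slide on a small joint distribution; the table is illustrative, not taken from the cited studies:

```python
import math

def H(probabilities):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Illustrative joint distribution p(x, y) over two binary variables.
p_xy = {("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
        ("x2", "y1"): 0.2, ("x2", "y2"): 0.3}

# Marginals p(x) and p(y), obtained by summing out the other variable.
p_x = [p_xy[("x1", "y1")] + p_xy[("x1", "y2")], p_xy[("x2", "y1")] + p_xy[("x2", "y2")]]
p_y = [p_xy[("x1", "y1")] + p_xy[("x2", "y1")], p_xy[("x1", "y2")] + p_xy[("x2", "y2")]]

H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy.values())

# I(X;Y) = H(X) + H(Y) - H(X,Y); equivalently H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y).
mutual_information = H_x + H_y - H_xy
conditional_entropy_x_given_y = H_xy - H_y

print(round(mutual_information, 3))                   # ~0.125 bits
print(round(H_x - conditional_entropy_x_given_y, 3))  # same value, via H(X) - H(X|Y)
```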

56 MEASURES OF DISTANCE BETWEEN DISTRIBUTIONS
Kullback-Leibler divergence (relative entropy): D(p||q) = Σ_x p(x) log [p(x)/q(x)]
Measures the distance between two distributions, but is not a true distance measure, for reasons of asymmetry. Here p(x) is the starting distribution and q(x) is the distribution we are trying to predict.

57 MEASURES OF DISTANCE BETWEEN DISTRIBUTIONS
Kullback-Leibler divergence (relative entropy). Reveals the additional amount of information we need to predict q(x) if we already know p(x). A measure of the inefficiency of assuming that the distribution is q when the true distribution is p: the number of extra bits needed to choose an event from the set of possibilities when the coding scheme is based on q instead of on the true distribution p.

58 MEASURES OF DISTANCE BETWEEN DISTRIBUTIONS
Jensen-Shannon divergence (a true distance): JSD(p,q) = ½ D(p||m) + ½ D(q||m), with m = ½(p+q)
Cross-entropy: H(p,q) = -Σ_x p(x) log q(x) = H(p) + D(p||q)
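A compact sketch of the three distance-style measures from slides 56 to 58, written for two distributions over the same outcomes; it assumes q(x) > 0 wherever p(x) > 0, and the example distributions are illustrative:

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum p(x) * log2(p(x)/q(x)); asymmetric, not a true distance."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) * log2 q(x) = H(p) + D(p||q)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetrized divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.4, 0.1]   # e.g. an observed suffix distribution
q = [1/3, 1/3, 1/3]   # e.g. a uniform reference distribution

print(kl_divergence(p, q), js_divergence(p, q), cross_entropy(p, q))
```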

59 MEASURES OF DISTANCE: PROCESSING EFFECTS
Inflectional morphology: inflectional paradigms and classes (Milin, Filipović Đurđević, & Moscoso del Prado Martín, 2009)
Auditory comprehension: (Balling & Baayen, 2012)
Derivational morphology: derivational mini-paradigms and mini-classes (Milin, Kuperman, Kostić, & Baayen, 2009)

60 PARADIGM AND CLASS Frequency distributions
Inflectional paradigm (e.g. saun-), inflected forms: saun-a, saun-e, saun-i, saun-u, saun-om, saun-ama; inflected-form probabilities: 0.31, 0.09, 0.34, 0.16, 0.05, …
Inflectional class (feminine nouns), suffixes: -a, -e, -i, -u, -om, -ama; suffix probabilities: 0.26, 0.39, 0.10, 0.14, 0.06, …

61 RELATIVE ENTROPY p(i): probability distribution of the inflected forms of an inflectional paradigm (e.g. of the word knjiga). q(i): probability distribution of the suffixes of the corresponding inflectional class (e.g. feminine nouns).
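A sketch of the D(p||q) computation described here, where p is the inflected-form distribution of one noun's paradigm and q is the suffix distribution of its inflectional class. The paradigm probabilities below are invented for illustration (the slide's saun- values are incomplete in this transcript); the class values follow the feminine-noun suffix probabilities from slide 18.

```python
import math

# Suffix order: -a, -e, -i, -u, -om, -ama.
# q(i): inflectional-class (feminine noun) suffix probabilities, as on slide 18.
q = [0.26, 0.39, 0.10, 0.14, 0.06, 0.06]

# p(i): inflected-form probabilities of one hypothetical paradigm (illustrative values).
p = [0.31, 0.09, 0.34, 0.16, 0.05, 0.05]

# Relative entropy D(p||q): the extra bits paid for coding the paradigm with the class code.
d_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(round(d_pq, 2))
```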


63 D(p||q) PREDICTS RT

64 RELATIVE ENTROPY IN AUDITORY COMPREHENSION
ab abd abc 0.06 0.26 0.29 abde 0.13 0.14 abdef 0.52 0.57 0.03 Balling, & Baayen, 2012

65 WEIGHTED RELATIVE ENTROPY
Baayen et al., 2011: masked priming, self-paced sentence reading, VLDT

66 DERIVATIONAL MINI-CLASSES AND MINI-PARADIGMS
Milin, Kuperman, Kostić, & Baayen, 2009: derived words, suffixes and prefixes, word pairs.
Paradigm: KIND – UNKIND. Class: KIND – UNKIND, TRUE – UNTRUE, PLEASANT – UNPLEASANT.
Cross-entropy predicts RT

67 CONCLUSION There are many ways to describe language in terms of Information Theory. However, we learn nothing about implementation. Information Theory helps us understand the constraints of the system, i.e. why something is optimal, which is an important step towards understanding how something is processed and how it is implemented in the brain.

68 THANK YOU! This research was funded by the Ministry of Education, Science and Technological Development of the Republic of Serbia (project numbers: and ).

69 READING MATERIAL
Chapter 2, "Mathematical foundations", in Manning, C. D., & Schütze, H. (2000). Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.
Bod, R. (2003). Introduction to elementary probability theory and formal stochastic language theory. In R. Bod, J. Hay, & S. Jannedy (Eds.), Probabilistic Linguistics. Cambridge, MA: The MIT Press.
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press.
Pluymaekers, M., Ernestus, M., & Baayen, R. H. (2005). Articulatory planning is continuous and sensitive to informational redundancy. Phonetica, 62.
Wurm, L. H., Ernestus, M., Schreuder, R., & Baayen, R. H. (2006). Dynamics of the auditory comprehension of prefixed words: Cohort entropies and conditional root uniqueness points. The Mental Lexicon, 1.
Milin, P., Filipović Đurđević, D., Kostić, A., & Moscoso del Prado Martín, F. (2009). The simultaneous effects of inflectional paradigms and classes on lexical recognition: Evidence from Serbian. Journal of Memory and Language, 60(1).
Milin, P., Kuperman, V., Kostić, A., & Baayen, R. H. (2009). Paradigms bit by bit: An information-theoretic approach to the processing of paradigmatic structure in inflection and derivation. In J. P. Blevins & J. Blevins (Eds.), Analogy in Grammar: Form and Acquisition. Oxford: Oxford University Press.
Moscoso del Prado Martín, F., Kostić, A., & Baayen, R. H. (2004). Putting the bits together: An information-theoretical perspective on morphological processing. Cognition, 94, 1-18.
Kostić, A., & Mirković, J. (2002). Processing of inflected nouns and levels of cognitive sensitivity. Psihologija, 35(3-4).
Kostić, A., Marković, T., & Baucal, A. (2003). Inflectional morphology and word meaning: Orthogonal or co-implicative cognitive domains? In H. Baayen & R. Schreuder (Eds.), Morphological Structure in Language Processing. Berlin: Mouton de Gruyter, 1-45.
Tabak, W., Schreuder, R., & Baayen, R. H. (2005). Lexical statistics and lexical processing: Semantic density, information complexity, sex, and irregularity in Dutch. In M. Reis & S. Kepser (Eds.), Linguistic Evidence. Mouton.
Balling, L., & Baayen, R. H. (2012). Probability and surprisal in auditory comprehension of morphologically complex words. Cognition, 125.
Kemps, R., Wurm, L., Ernestus, M., Schreuder, R., & Baayen, R. H. (2005). Prosodic cues for morphological complexity: Comparatives and agent nouns in Dutch and English. Language and Cognitive Processes, 20.
Baayen, R. H., Milin, P., Filipović Đurđević, D., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118.
Frank, S. L. (2013). Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5.

