Bi-Weekly BTP Presentation
Bhargava Reddy (110050078)
Tuesday, 14-10-2014
Contents
- Defining entropy of a language
- Calculating the entropy of a language
- Letter and word frequencies
- Entropy of various world languages
- Entropy of Telugu
- The Indus script
- Entropy of word ordering
Motivation for Entropy in English
Aoccdrnig to rseearch at Elingsh uinervtisy, it deosn't mttaer in waht odrer the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe.
The actual sentence
According to research at an English university, it doesn't matter in what order the letters in a word are; the only important thing is that the first and last letter is at the right place. The rest can be a total mess and you can still read it without a problem. This is because we do not read every letter by itself but the word as a whole.
Letters removed in words
Acc__ding t_ res_a_ch a_ En_l_sh _ni_ersity, it do__n’t ma_t_r in wh__ or_er t_e l_tt_rs in _ wo__ are, t_e onl_ im_or_an_ thi__ is tha_ th_ fi_st an_ l_st le__er is a_ the ri__t pl__e. T_e re_t can b_ _ tot_l mes_ a_d yo_ can st__l re_d i_ with_ut _ pro__em. Thi_ i_ beca__e we do_’t read ev_r_ let__r by i_sel_ but the w__d as _ who__.
25% of the letters have been removed from the actual sentence.
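Both demonstrations are easy to reproduce. Below is a minimal Python sketch; the sentence and the random seed are illustrative choices, not from the slides, and punctuation is treated as part of a word for simplicity:

```python
import random

random.seed(0)  # reproducible demo

def scramble_inner_letters(text):
    """Keep each word's first and last letter, shuffle the interior."""
    def scramble(word):
        if len(word) <= 3:
            return word
        inner = list(word[1:-1])
        random.shuffle(inner)
        return word[0] + "".join(inner) + word[-1]
    return " ".join(scramble(w) for w in text.split())

def delete_letters(text, fraction=0.25):
    """Replace a random fraction of the letters with underscores."""
    chars = list(text)
    positions = [i for i, c in enumerate(chars) if c.isalpha()]
    for i in random.sample(positions, int(fraction * len(positions))):
        chars[i] = "_"
    return "".join(chars)

sentence = "It doesn't matter in what order the letters in a word are."
print(scramble_inner_letters(sentence))  # readable despite the scrambling
print(delete_letters(sentence))          # readable despite 25% deletions
```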
The Formula for Entropy
Based on Shannon's A Mathematical Theory of Communication
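Restated from the cited paper: for a source that emits symbols with probabilities p_i, the entropy is

```latex
H = -\sum_{i} p_i \log_2 p_i \qquad \text{(bits per symbol)}
```

H is largest when all symbols are equally likely and shrinks as the distribution becomes more predictable.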
Entropy of Language
If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language. The redundancy measures the amount of constraint imposed on a text in the language by its statistical structure, e.g. in English: the high frequency of the letter E, the strong tendency of H to follow T and of U to follow Q.
Based on Shannon's Prediction and Entropy of Printed English
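As a concrete reading of these definitions, here is a sketch that estimates the single-letter entropy of a text and its redundancy relative to a uniform 26-letter alphabet (corpus.txt is a placeholder for any English text file):

```python
from collections import Counter
from math import log2

def letter_entropy(text):
    """First-order estimate: H = -sum p(c) * log2 p(c) over letters."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    n = len(letters)
    return -sum((k / n) * log2(k / n) for k in counts.values())

text = open("corpus.txt", encoding="utf-8").read()  # placeholder corpus
H = letter_entropy(text)
redundancy = 1 - H / log2(26)  # share of the text fixed by letter statistics
print(f"H ~ {H:.2f} bits/letter, redundancy ~ {redundancy:.0%}")
```

This captures only single-letter statistics; conditioning on longer contexts lowers the estimate further, which is exactly what the F_N values on the following slides measure.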
Entropy calculation from the statistics of English
Based on Shannon's Prediction and Entropy of Printed English
Entropy of English
Based on Shannon's Prediction and Entropy of Printed English
Interpretation of the equation
Based on Shannon's Prediction and Entropy of Printed English
Calculations of the F_N
Based on Shannon's Prediction and Entropy of Printed English
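The F_N here are Shannon's N-gram entropies, restated from the cited paper:

```latex
F_N = -\sum_{i,j} p(b_i, j)\,\log_2 p_{b_i}(j)
```

where b_i ranges over all blocks of N-1 letters, j over the letter that follows the block, and p_{b_i}(j) is the conditional probability of j given b_i. F_N is the average uncertainty of a letter when the preceding N-1 letters are known; it decreases toward the true entropy H of the language as N grows.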
Letter Frequencies in English
Source: Wikipedia's article on letter frequency in English
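The single-letter entropy F_1 follows directly from such a table. A sketch using approximate percentages (rounded values of the kind the cited Wikipedia table lists; small differences in the source data shift the result slightly):

```python
from math import log2

# Approximate relative frequencies (percent) of English letters.
freq_percent = {
    "e": 12.70, "t": 9.06, "a": 8.17, "o": 7.51, "i": 6.97, "n": 6.75,
    "s": 6.33, "h": 6.09, "r": 5.99, "d": 4.25, "l": 4.03, "c": 2.78,
    "u": 2.76, "m": 2.41, "w": 2.36, "f": 2.23, "g": 2.02, "y": 1.97,
    "p": 1.93, "b": 1.49, "v": 0.98, "k": 0.77, "j": 0.15, "x": 0.15,
    "q": 0.10, "z": 0.07,
}

total = sum(freq_percent.values())  # normalize, since the table is rounded
F1 = -sum((f / total) * log2(f / total) for f in freq_percent.values())
print(f"F1 = {F1:.2f} bits/letter")  # close to Shannon's F1 = 4.14 bits
```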
Calculation of higher F_N
A similar calculation for F_3 gives a value of about 3.3 bits. Tables of N-gram frequencies are not available for N > 3, so F_4, F_5 and F_6 cannot be calculated the same way; word frequencies are used to assist in such situations. Let us look at the log-log plot of word probability against frequency rank.
Based on Shannon's Prediction and Entropy of Printed English
Word Frequencies
Based on Shannon's Prediction and Entropy of Printed English
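Shannon's paper approximates this curve by Zipf's law, p_n = 0.1/n for the n-th most frequent word, truncated at the rank where the probabilities sum to one. A minimal sketch of the word-entropy estimate that follows from that assumption:

```python
from math import log2

# Zipf approximation: p_n = 0.1 / n, truncated where the total reaches 1.
probs = []
total, n = 0.0, 1
while total + 0.1 / n <= 1.0:
    total += 0.1 / n
    probs.append(0.1 / n)
    n += 1

H_word = -sum(p * log2(p) for p in probs)
print(f"cutoff rank = {len(probs)}, word entropy ~ {H_word:.2f} bits/word")
# Dividing by an average word length (about 4.5 letters) converts this to
# an entropy per letter, linking word statistics back to the F_N values.
```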
Inferences
Entropy for various world languages
From the data we can infer that English has the lowest entropy and Finnish the highest. However, all the languages have comparable entropy when Shannon's experiment is taken into consideration. Languages shown: Finnish (fi), German (de), Swedish (sv), Dutch (nl), English (en), Italian (it), French (fr), Spanish (es), Portuguese (pt) and Greek (el).
Based on Word-length entropies and correlations of natural language written texts (2014)
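The cited paper works with sequences of word lengths rather than letters, which sidesteps differences between alphabets. A minimal sketch of that measure (corpus.txt is a placeholder; the paper also analyses correlations, not just this first-order entropy):

```python
from collections import Counter
from math import log2

def word_length_entropy(text):
    """Entropy of the word-length distribution (first-order estimate)."""
    lengths = [len(w) for w in text.split()]
    counts = Counter(lengths)
    n = len(lengths)
    return -sum((k / n) * log2(k / n) for k in counts.values())

text = open("corpus.txt", encoding="utf-8").read()  # placeholder corpus
print(f"word-length entropy: {word_length_entropy(text):.2f} bits")
```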
Zipf-like plots for various world languages
Based on Word-length entropies and correlations of natural language written texts (2014)
Entropy of the Telugu Language
Indian languages are highly phonetic, which makes computing their entropy directly a rather difficult task. The entropy of Telugu has therefore been calculated by transliterating it into English letters and using Shannon's experiment. The entropy is calculated in two ways:
1. Transliterating into English and then treating the symbols as English letters
2. Transliterating into English and then treating the symbols as Telugu letters
Based on Entropy of Telugu, Venkata Ravinder Paruchuri, 2011
Telugu Language Entropy
Based on Entropy of Telugu, Venkata Ravinder Paruchuri, 2011
Inferences
The entropy of Telugu is higher than that of English, which means that Telugu is more succinct than English: each syllable in Telugu (as in other Indian languages) carries more information than an English letter.
Indus Script
Little is known about the script from ancient times, and no firm conclusion has been reached about whether it is a linguistic script or not. From the adjacent diagram, however, we can see that the Indus script lies close to most world languages. We can thus infer that the Indus script could well be a linguistic script, although we have no solid proof of it.
Based on Entropy, the Indus Script and Language: A Reply to R. Sproat
Entropy of Word Ordering
We saw at the start that the scrambled sentence could be read without any confusion: the first and last letter of each word stayed in place and only the letters in between were randomized. The analogous question for word order is quantified by the data in the figure, as the sketch below illustrates.
Blue: randomized text. Green: original text. Red: relative difference.
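In the spirit of the cited Montemurro and Zanette study (which uses more careful block-entropy estimators than this), one can compare a conditional bigram entropy on a text and on a word-shuffled version of it; the gap is the information carried by word order. corpus.txt is a placeholder:

```python
import random
from collections import Counter
from math import log2

def conditional_bigram_entropy(words):
    """Estimate H(next word | current word) from bigram counts."""
    bigrams = Counter(zip(words, words[1:]))
    contexts = Counter(words[:-1])
    n = len(words) - 1
    return -sum((c / n) * log2(c / contexts[w]) for (w, _), c in bigrams.items())

words = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder
shuffled = words[:]
random.shuffle(shuffled)  # destroys word order, keeps word frequencies

h_orig = conditional_bigram_entropy(words)
h_rand = conditional_bigram_entropy(shuffled)
# Shuffled text is less predictable given the previous word, so its
# conditional entropy is higher; the difference is due to word ordering.
print(f"original: {h_orig:.2f} bits/word, shuffled: {h_rand:.2f} bits/word")
print(f"word-order information ~ {h_rand - h_orig:.2f} bits/word")
```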
References
- C. E. Shannon. Prediction and Entropy of Printed English. The Bell System Technical Journal, January 1951.
- Maria Kalimeri, Vassilios Constantoudis, Constantinos Papadimitriou. Word-length entropies and correlations of natural language written texts. arXiv, 2014.
- Venkata Ravinder Paruchuri. Entropy of Telugu. 2011.
- Rajesh P. N. Rao, Nisha Yadav, Mayank Vahia, Hrishikesh Joglekar. Entropy, the Indus Script and Language: A Reply to R. Sproat. Computational Linguistics 36(4), 2010.
- Marcelo A. Montemurro, Damián H. Zanette. Universal Entropy of Word Ordering Across Linguistic Families. PLoS ONE 6(5): e19875, doi:10.1371/journal.pone.0019875, 2011.