
Information Theory


A Note on Entropy of Telugu Prose
Information and Control, Vol. 13, No. 4, October 1968, pp. 281-285
P. Balasubrahmanyam and Gift Siromoney

An optimum code is constructed for the Telugu alphabet using the proportions of letters estimated from a large sample of Telugu prose. Unbiased estimates of the one-gram entropies of the different forms of prose writing are obtained. It is shown that this entropy can be treated neither as a language characteristic nor as a style characteristic. Further, the digram entropy and an approximation to the entropy of Telugu prose are obtained.

Telugu, spoken by about 50 million people in India, belongs to the Dravidian family of languages. The earlier works in Telugu are mainly in verse, and prose writing is a fairly modern development. Using random sampling methods, a large sample of 50,000 letters of modern Telugu prose was taken (Balasubrahmanyam, 1963). These prose works are classified into four groups, namely novels, short stories, plays and "others". The last group includes biographies, works on history and miscellaneous essays. In Telugu prose 16 "vowels" (achulu) and 36 "consonants" (hallulu) are used. In addition, the symbol called pūrnānuswāra is considered a separate letter when it follows any vowel except a. It is orthographically represented as a circle, and the combination of a followed by pūrnānuswāra is usually reckoned as a single letter, pronounced am, which is the fifteenth letter of the achulu ("vowels"). In classical Telugu poetry two more letters (arthānuswāra and visarga) occur, but these are omitted from our analysis since they are not used in modern Telugu prose.

Let p1, p2, ..., pn be the proportions of the different letters of the alphabet. The one-gram entropy (Shannon, 1948) is given by

$$H_1 = -\sum_{i=1}^{n} p_i \,\mathrm{ld}\, p_i$$

where 'ld' stands for the logarithm to the base 2. Following Basharin (1959) we write



where H1 is asymptotically normally distributed and N is the size of the sample. These values are obtained directly from the frequencies of the letters, using seven-figure tables of common logarithms.
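As a rough illustration of the estimation step, the sketch below computes the plug-in one-gram entropy from letter counts, a first-order bias correction, and an asymptotic standard error. The correction term (n - 1)/(2N) ld e and the variance expression are assumed here to be the usual first-order results associated with Basharin (1959); the function names are hypothetical and are not taken from the paper.

```python
from math import e, log2, sqrt


def one_gram_entropy(counts):
    """Plug-in estimate of H1 (bits per letter) from raw letter counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def bias_corrected_entropy(counts, alphabet_size=None):
    """Plug-in estimate plus the assumed first-order correction (n - 1)/(2N) ld e."""
    total = sum(counts)
    # Assumption: n defaults to the number of letters actually observed.
    n = alphabet_size if alphabet_size is not None else sum(1 for c in counts if c > 0)
    return one_gram_entropy(counts) + (n - 1) / (2 * total) * log2(e)


def standard_error(counts):
    """Asymptotic standard error: sqrt((sum p (ld p)^2 - H1^2) / N)."""
    total = sum(counts)
    h = one_gram_entropy(counts)
    second_moment = sum((c / total) * log2(c / total) ** 2 for c in counts if c > 0)
    # max(..., 0.0) guards against tiny negative values from rounding.
    return sqrt(max(second_moment - h * h, 0.0) / total)


if __name__ == "__main__":
    sample = [75, 25]  # toy counts for a two-letter alphabet
    print(one_gram_entropy(sample), bias_corrected_entropy(sample), standard_error(sample))
```

For a sample of 50,000 letters and an alphabet of about 53 letters, the assumed correction is below 0.001 bit, so it affects only the last quoted decimal places.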

The relative frequencies of letters for modern prose as a whole and for the groups novels, short stories, plays and "others" are given in Table I, along with Huffman's (1962) minimum redundancy code. This code was constructed from the relative frequencies for modern prose; its average length is 4.617 binary digits per letter, compared with the value 4.588 of H1. The efficiency (Reza, 1961) of the encoding, that is, the ratio of H1 to the average code length, is therefore 99.38%. The unbiased estimates of the one-gram entropy H1, with the corresponding standard errors for the different groups, are given in Table II.
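The construction of a minimum redundancy code and its efficiency can be sketched as follows. This is a generic binary Huffman procedure built on Python's standard `heapq` module, not a reproduction of the authors' hand computation; the function names and the five-letter toy alphabet are assumptions chosen so that the arithmetic is easy to follow.

```python
import heapq
from math import log2


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a binary Huffman code."""
    # Heap items are (weight, tie_breaker, {symbol: depth}); the unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)  # the two least frequent subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def efficiency(freqs, lengths):
    """Ratio of the one-gram entropy to the average code length."""
    total = sum(freqs.values())
    entropy = -sum((w / total) * log2(w / total) for w in freqs.values() if w > 0)
    average = sum(w * lengths[s] for s, w in freqs.items()) / total
    return entropy / average


if __name__ == "__main__":
    toy = {"a": 8, "b": 4, "c": 2, "d": 1, "e": 1}  # hypothetical counts
    lengths = huffman_code_lengths(toy)
    print(lengths, efficiency(toy, lengths))
```

In the toy alphabet every proportion is a power of one half, so the code lengths equal the information content of each letter and the efficiency is 1. For real letter frequencies the efficiency falls slightly below 1, as with the value quoted for Telugu.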

We find that the value of H1 for plays is significantly different from those for novels, short stories and "others". This is what one would expect, since Telugu plays are written in a pedantic and artificial style of prose quite distinct from the other forms of prose.

Herdan (Cherry, 1956) defined a language characteristic as one whose value differs significantly between languages but not between authors writing in the same language. Contrary to this, we find here that within the same language H1 takes significantly different values for different forms of prose. Further, Ramakrishna et al. (1962) have shown that H1 is practically the same for some of the major languages of India. Therefore H1 is not a language characteristic.

If H1 were a style characteristic, it would have to bring out the differences between styles of writing, but we find that short stories and "others" have practically the same values of H1 even though they are written in distinctly different styles. The values of H1 fail to bring out the difference, whereas the chi-square test (Herdan, 1956) does so very clearly; hence H1 is not very useful as a style characteristic. The failure of H1 to bring out differences in style has also been established (Siromoney and Rajagopalan, 1964) in the field of Karnatic music, where H1 does not differ between composers of distinct styles but has been shown to differ between different works of the same composer. When six works of Tamil poetry spread over a period of about 2000 years were analysed (Siromoney, 1963, 1964b), H1 again failed as a style characteristic.


TABLE I
RELATIVE FREQUENCIES OF LETTERS (PER 10,000 LETTERS) AND HUFFMAN CODE
Letter  Novels  Short stories  Plays  Others  Prose  Huffman code
a       1557    1630           1569   1748    1591   111
a:      582     535            534    482     552    0101
i       734     787            729    658     727    1100
i:      116     97             144    91      119    001011
u       863     807            975    974     902    000
u:      93      115            50     48      78     1000010
ru      15      10             18     32      18     100000000
ru:     0       0              0      0       0      100000001110001
e       159     167            120    135     146    011010
e:      207     182            228    157     204    110101
ai      41      50             33     55      42     10000111
o       83      85             77     54      78     1000001
o:      153     157            127    117     142    010011
au      4       2              8      14      6      10001110111
am      11      7              8      14      10     1000111010
m-      372     325            309    362     351    10011
aha     2       2              2      3       2      100000001011
k       363     365            342    317     352    10110
k-      17      0              9      20      14     001010001
g       192     210            167    198     188    110100
g-      4       5              5      6       5      10000000110
ng      0       0              3      2       1      10000000111001
ch      242     275            259    200     243    00110
ch-     2       0              3      3       2      100000001010
j       60      50             52     103     63     0100100
j-      0       0              0      0       0      100000001110000
nj      3       2              4      2       3      100000001111
t       150     207            147    102     148    011011
t-      4       0              3      6       4      10000000100
d       229     172            200    178     210    00100
d-      1       0              1      5       1      1000000011101
n-      27      12             42     35      31     01001010
th      304     337            286    283     299    01111
th-     8       5              18     26      13     001010000
dh      274     250            277    314     278    01100
dh-     36      37             32     49      37     10000001
n       714     737            743    654     715    1010
p       271     255            242    220     256    01000
p-      8       2              3      5       6      10001110110
b       87      62             93     100     88     1000110
b-      37      47             35     55      39     10000110
m       345     352            448    351     373    10111
y       255     207            234    311     253    00111
r       390     430            401    515     412    11011
r-      14      17             22     20      17     010010111
l       350     380            333    368     350    10010
l-      29      25             12     6       21     100011100
v       290     302            302    306     296    01110
s-      50      45             76     72      59     0010101
sh      28      22             41     26      30     00101001
s       166     160            166    134     161    100010
h       44      50             52     51      47     10001111
ksha    14      7              24     14      16     010010110
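The figures quoted in the text can be checked approximately against the Prose and Huffman code columns of Table I. The sketch below uses only the tabulated relative frequencies and the number of binary digits in each code word; because the table is rounded to whole units per 10,000 letters, the results should agree with the quoted 4.617 and 4.588 only to about two or three decimal places. The variable names are illustrative.

```python
from math import log2

# (relative frequency in prose, length of the Huffman code word in bits),
# transcribed row by row from Table I; the trailing comments name the letters.
PROSE = [
    (1591, 3), (552, 4), (727, 4), (119, 6), (902, 3), (78, 7),   # a a: i i: u u:
    (18, 9), (0, 15), (146, 6), (204, 6), (42, 8), (78, 7),       # ru ru: e e: ai o
    (142, 6), (6, 11), (10, 10), (351, 5), (2, 12), (352, 5),     # o: au am m- aha k
    (14, 9), (188, 6), (5, 11), (1, 14), (243, 5), (2, 12),       # k- g g- ng ch ch-
    (63, 7), (0, 15), (3, 12), (148, 6), (4, 11), (210, 5),       # j j- nj t t- d
    (1, 13), (31, 8), (299, 5), (13, 9), (278, 5), (37, 8),       # d- n- th th- dh dh-
    (715, 4), (256, 5), (6, 11), (88, 7), (39, 8), (373, 5),      # n p p- b b- m
    (253, 5), (412, 5), (17, 9), (350, 5), (21, 9), (296, 5),     # y r r- l l- v
    (59, 7), (30, 8), (161, 6), (47, 8), (16, 9),                 # s- sh s h ksha
]

total = sum(f for f, _ in PROSE)

# Average number of binary digits per letter under the tabulated code.
avg_length = sum(f * n for f, n in PROSE) / total

# Plug-in one-gram entropy from the rounded relative frequencies.
entropy = -sum((f / total) * log2(f / total) for f, _ in PROSE if f > 0)

# Kraft sum of the code lengths; a value of 1 means the code tree is complete.
kraft_sum = sum(2.0 ** -n for _, n in PROSE)

efficiency = entropy / avg_length

print(f"average length = {avg_length:.3f} bits, H1 = {entropy:.3f} bits")
print(f"Kraft sum = {kraft_sum}, efficiency = {100 * efficiency:.2f} %")
```

The entropy computed this way is the uncorrected estimate, so it can be expected to fall slightly below the unbiased value reported for prose as a whole.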



TABLE II
ESTIMATES OF H1 AND THEIR STANDARD ERRORS
Category         Sample size  U.B.E. of H1  Standard error
Novels           26488        4.5968        0.008473
Short stories    4001         4.5524        0.021078
Plays            13011        4.6662        0.016522
Others           6500         4.5620        0.018298
Total for prose  50000        4.5879        0.006305
U.B.E. = unbiased estimate, in bits per letter.
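The comparison between groups discussed in the text can be illustrated from Table II. The paper does not name the test it used; the sketch below assumes a two-sided z-test on independent samples, which the asymptotic normality of H1 suggests, with the conventional 5 % critical value of 1.96. The dictionary and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

# category: (unbiased estimate of H1, standard error), copied from Table II
TABLE_II = {
    "Novels": (4.5968, 0.008473),
    "Short stories": (4.5524, 0.021078),
    "Plays": (4.6662, 0.016522),
    "Others": (4.5620, 0.018298),
}


def z_score(first, second):
    """z statistic for the difference in H1 between two groups,
    assuming the two samples are independent."""
    h_a, se_a = TABLE_II[first]
    h_b, se_b = TABLE_II[second]
    return (h_a - h_b) / sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    for first, second in combinations(TABLE_II, 2):
        z = z_score(first, second)
        verdict = "significant" if abs(z) > 1.96 else "not significant"
        print(f"{first} vs {second}: z = {z:+.2f} ({verdict} at the 5 % level)")
```

Under these assumptions plays differ from each of the other three groups by more than three standard errors, while short stories and "others" differ by well under one, which is consistent with the conclusions drawn in the text.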

Shannon (1951) defines the entropy H as the limiting value, as n increases, of the n-gram entropy Hn, which is given by the equation

$$H_n = -\sum_{i,j} p(b_i, j)\,\mathrm{ld}\, p_{b_i}(j)$$

where bi is a block of (n-1) letters, j is an arbitrary letter following bi, p(bi, j) is the probability of the n-gram bi, j, and p_{b_i}(j) = p(bi, j)/p(bi) is the conditional probability of the letter j after the block bi. From a sample of 10,000 digrams of Telugu prose, the value of the digram entropy H2 was estimated to be 3.09 bits per letter.

An estimate of the entropy H was made using methods similar to those of Shannon (1951) and Brillouin (1956). From one book, strings of 75 letters were chosen at random and a subject was asked to guess the next letter. For English it has been established (Burton and Licklider, 1955) that a string of 32 letters is sufficient for such experiments, and for Telugu preliminary trials indicated that a length of 75 letters was quite sufficient. The subject guessed the 76th letter correctly in 61 trials out of 100. Assuming that each letter guessed correctly gives 1 bit of information and each letter that could not be guessed correctly gives H1 bits of information, H is estimated to be 0.61 x 1 + 0.39 x 4.588, or about 2.4 bits per letter, which must be considered an upper limit. By varying the subjects or the texts, one would expect to get different estimates of H. The variation between subjects can itself be used, as has been done for English (Siromoney, 1964a), to test a subject's proficiency in Telugu prose.
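The two estimates in this section can be sketched in a few lines. The first function assumes, following Shannon (1951), that the digram entropy H2 is the conditional entropy of a letter given the letter before it, estimated from digram counts. The second reproduces the arithmetic of the guessing experiment. Both function names are hypothetical, and the sample inputs are toy data rather than Telugu text.

```python
from collections import Counter
from math import log2


def digram_entropy(letters):
    """Estimate H2 in bits per letter from a sequence of letters.

    `letters` may be a string or a list of letter labels (useful when a
    single Telugu letter is transliterated with several characters).
    """
    pairs = Counter(zip(letters, letters[1:]))  # counts of (previous, next)
    total = sum(pairs.values())
    firsts = Counter()  # how often each letter occurs as the first of a digram
    for (previous, _), count in pairs.items():
        firsts[previous] += count
    # -sum p(b, j) * ld p(j | b), with p(j | b) = count(b, j) / count(b)
    return -sum(
        (count / total) * log2(count / firsts[previous])
        for (previous, _), count in pairs.items()
    )


def guessing_estimate(correct, trials, h1):
    """Estimate of H: 1 bit per correct guess, h1 bits per incorrect guess."""
    p_correct = correct / trials
    return p_correct * 1.0 + (1.0 - p_correct) * h1


if __name__ == "__main__":
    print(digram_entropy("abababab"))           # each letter fixes the next one
    print(guessing_estimate(61, 100, 4.588))    # figures quoted in the text
```

With 61 correct guesses in 100 trials and H1 = 4.588, the second function gives a value that rounds to the 2.4 bits per letter quoted above.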

REFERENCES

BALASUBRAHMANYAM, P. (1963), "An Application of Information Theory to Linguistics with Special Reference to Telugu." Master's thesis, University of Madras.
BASHARIN, G. P. (1959), On a statistical estimate for the entropy of a sequence of independent random variables, Teor. Veroyatnost. i Primenen. 4, 361-364.
BRILLOUIN, L. (1956), "Science and Information Theory," p. 25. Academic Press, New York.
BURTON, N. G. AND LICKLIDER, J. C. R. (1955), Long-range constraints in the statistical structure of printed English, Am. J. Psychol. 68, 650-653.
CHERRY, COLIN (Ed.) (1956), "Information Theory, Third London Symposium," p. 168. Butterworth, London.
HERDAN, G. (1956), "Language as Choice and Chance," p. 88. Noordhoff, Groningen.
HUFFMAN, D. A. (1962), A method for the construction of minimum redundancy codes, Proc. Inst. Radio Engrs. 40, 1098-1101.
RAMAKRISHNA, B. S. ET AL. (1962), "Some Aspects of Relative Efficiencies of Indian Languages." Indian Institute of Science, Bangalore.
REZA, F. M. (1961), "An Introduction to Information Theory," p. 133. McGraw-Hill, New York.
SHANNON, C. E. (1948), A mathematical theory of communication, Bell System Tech. J. 27, 379-423.
SHANNON, C. E. (1951), Prediction and entropy of printed English, Bell System Tech. J. 30, 50-64.
SIROMONEY, GIFT (1963), Entropy of Tamil prose, Inform. Control 6, 297-300.
SIROMONEY, GIFT (1964a), An information-theoretical test for familiarity with a foreign language, J. Psychol. Researches 8, 1-6.
SIROMONEY, GIFT (1964b), "Certain Applications of Information Theory." Doctoral thesis, University of Madras.
SIROMONEY, GIFT AND RAJAGOPALAN, K. R. (1964), Style as information in Karnatic music, J. Music Theory 8, 267-272.
