An optimum code is constructed for the Telugu alphabet using the proportions of letters estimated from a large sample of Telugu prose. Unbiased estimates of the one-gram entropies of the different forms of prose writing are obtained. It is shown that this entropy can be treated neither as a language characteristic nor as a style characteristic. Further, the digram entropy and an approximation of the entropy of Telugu prose are obtained.
Telugu, spoken by about 50 million people in India, belongs to the Dravidian family of languages. The earlier works in Telugu are mainly in verse, and prose writing is a fairly modern development. Using random sampling methods, a large sample of 50,000 letters of modern Telugu prose was taken (Balasubrahmanyam, 1963). These prose works are classified into four groups, namely novels, short stories, plays and "others". The last group includes biographies, works on history and miscellaneous essays. In Telugu prose 16 "vowels" (achulu) and 36 "consonants" (hallulu) are used. In addition, the symbol called pūrnānuswāra is considered a separate letter when it follows any vowel except a. It is orthographically represented as a circle, and the combination of a followed by pūrnānuswāra is usually reckoned as a single letter, pronounced am, which is the fifteenth letter of the achulu ("vowels"). In classical Telugu poetry two more letters (arthānuswāra and visarga) occur, but these are omitted from our analysis since they are not used in modern Telugu prose.
Let p1, p2, ..., pn be the proportions of the different letters of the alphabet. The one-gram entropy (Shannon, 1948), measured in binary digits per letter, is given by
$$H_1 = -\sum_{i=1}^{n} p_i \log_2 p_i .$$
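As a minimal sketch, the one-gram entropy can be computed directly from letter counts. The function name `one_gram_entropy` and the four-letter example counts are illustrative assumptions and are not taken from the Telugu sample; logarithms are taken to base 2 so that the result is in binary digits per letter.

```python
import math


def one_gram_entropy(counts):
    """Return H1 = -sum(p_i * log2(p_i)) in binary digits per letter."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    entropy = 0.0
    for count in counts:
        if count > 0:  # a letter that never occurs contributes nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical four-letter alphabet with proportions 1/2, 1/4, 1/8, 1/8.
example_counts = [4, 2, 1, 1]
print(one_gram_entropy(example_counts))  # prints 1.75
```

Letters with zero count are skipped, following the usual convention that 0 log 0 is taken as 0.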
The relative frequencies of letters for modern prose as a whole and for the groups novels, short stories, plays and "others" are given in Table I, along with Huffman's (1962) minimum redundancy code. This code was constructed from the relative frequencies for modern prose. Its average number of binary digits per letter is 4.617, compared with the value 4.588 of H1, so the efficiency of encoding (Reza, 1961) is 99.38%. The unbiased estimates of the one-gram entropy H1, with the corresponding standard errors for the different groups, are given in Table II.
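The construction of a minimum redundancy code and its efficiency can be sketched as follows. This is an illustrative implementation of the standard binary Huffman procedure, not the authors' own program; the function names and the four-symbol example weights are assumptions, and efficiency is taken here to be the ratio of H1 to the average codeword length.

```python
import heapq
import math


def huffman_code_lengths(weights):
    """Return the codeword length of each symbol in a binary Huffman code."""
    if len(weights) == 1:
        return [1]
    # Heap entries: (subtree weight, unique tie-breaker, symbol indices in subtree).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    next_id = len(weights)
    while len(heap) > 1:
        w1, _, symbols1 = heapq.heappop(heap)
        w2, _, symbols2 = heapq.heappop(heap)
        # Merging two subtrees adds one binary digit to every symbol below them.
        for i in symbols1 + symbols2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next_id, symbols1 + symbols2))
        next_id += 1
    return lengths


def coding_efficiency(weights):
    """Return (efficiency, average length, entropy) for a Huffman code."""
    total = sum(weights)
    probs = [w / total for w in weights]
    lengths = huffman_code_lengths(weights)
    average_length = sum(p * n for p, n in zip(probs, lengths))
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return entropy / average_length, average_length, entropy


# Hypothetical weights for a four-letter alphabet (not the Telugu data).
efficiency, average_length, entropy = coding_efficiency([5, 2, 2, 1])
print(f"H1 = {entropy:.4f}, average length = {average_length:.4f}, "
      f"efficiency = {efficiency:.4f}")
```

With the values quoted above, the ratio 4.588 / 4.617 is about 0.994, in line with the stated efficiency.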
We find that the value of H1 for plays is significantly different from those for novels, short stories and "others". This is what one would expect, since Telugu plays are written in a pedantic and artificial style of prose quite distinct from the other forms of prose. Herdan (Cherry, 1956) defined a language characteristic as one whose value is significantly different between languages but not between authors writing in the same language. Contrary to this, we find here that within the same language H1 has significantly different values for different forms of prose. Further, it has been shown by Ramakrishna (1962) that H1 is practically the same for some of the major languages of India. Therefore H1 is not a language characteristic.
If H1 were a style characteristic, it would have to bring out the differences between styles of writing, but we find that short stories and "others" have practically the same values of H1 even though they are written in distinctly different styles. The values of H1 fail to bring out the difference, whereas the chi-square test (Herdan, 1956) does so very clearly; hence H1 is not very useful as a style characteristic. The failure of H1 to bring out differences in style has also been established in the field of Karnatic music (Siromoney and Rajagopalan, 1964), where H1 does not differ between authors of distinct styles but has been shown to differ between different works of the same author. When six works of Tamil poetry spread over a period of about 2000 years were analysed (Siromoney, 1963, 1964b), H1 again failed as a style characteristic.
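The comparisons between groups can be illustrated with the estimates and standard errors of Table II. The source does not state which significance test was used; the sketch below assumes independent samples and a normal approximation, so that the difference of two estimates is divided by the square root of the sum of their squared standard errors. The names `TABLE_II` and `z_statistic` are illustrative.

```python
import math

# Unbiased estimates of H1 and their standard errors, copied from Table II.
TABLE_II = {
    "Novels": (4.5968, 0.008473),
    "Short stories": (4.5524, 0.021078),
    "Plays": (4.6662, 0.016522),
    "Others": (4.5620, 0.018298),
}


def z_statistic(group_a, group_b, table=TABLE_II):
    """Difference of two H1 estimates in units of its standard error."""
    h_a, se_a = table[group_a]
    h_b, se_b = table[group_b]
    # Assumes the two samples are independent, so the variances add.
    return (h_a - h_b) / math.sqrt(se_a ** 2 + se_b ** 2)


pairs = [
    ("Plays", "Novels"),
    ("Plays", "Short stories"),
    ("Plays", "Others"),
    ("Short stories", "Others"),
]
for group_a, group_b in pairs:
    print(f"{group_a} vs {group_b}: z = {z_statistic(group_a, group_b):.2f}")
```

Under these assumptions plays differ from each of the other groups by more than three standard errors, while short stories and "others" differ by well under one.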
TABLE I
RELATIVE FREQUENCIES OF LETTERS (PER 10,000 LETTERS) AND HUFFMAN CODE
Letters | Novels | Short stories | Plays | Others | Prose | Huffman code |
a | 1557 | 1630 | 1569 | 1748 | 1591 | 111 |
a: | 582 | 535 | 534 | 482 | 552 | 0101 |
i | 734 | 787 | 729 | 658 | 727 | 1100 |
i: | 116 | 97 | 144 | 91 | 119 | 001011 |
u | 863 | 807 | 975 | 974 | 902 | 000 |
u: | 93 | 115 | 50 | 48 | 78 | 1000010 |
ru | 15 | 10 | 18 | 32 | 18 | 100000000 |
ru: | 0 | 0 | 0 | 0 | 0 | 100000001110001 |
e | 159 | 167 | 120 | 135 | 146 | 011010 |
e: | 207 | 182 | 228 | 157 | 204 | 110101 |
ai | 41 | 50 | 33 | 55 | 42 | 10000111 |
o | 83 | 85 | 77 | 54 | 78 | 1000001 |
o: | 153 | 157 | 127 | 117 | 142 | 010011 |
au | 4 | 2 | 8 | 14 | 6 | 10001110111 |
am | 11 | 7 | 8 | 14 | 10 | 1000111010 |
m- | 372 | 325 | 309 | 362 | 351 | 10011 |
aha | 2 | 2 | 2 | 3 | 2 | 100000001011 |
k | 363 | 365 | 342 | 317 | 352 | 10110 |
k- | 17 | 0 | 9 | 20 | 14 | 001010001 |
g | 192 | 210 | 167 | 198 | 188 | 110100 |
g- | 4 | 5 | 5 | 6 | 5 | 10000000110 |
ng | 0 | 0 | 3 | 2 | 1 | 10000000111001 |
ch | 242 | 275 | 259 | 200 | 243 | 00110 |
ch- | 2 | 0 | 3 | 3 | 2 | 100000001010 |
j | 60 | 50 | 52 | 103 | 63 | 0100100 |
j- | 0 | 0 | 0 | 0 | 0 | 100000001110000 |
nj | 3 | 2 | 4 | 2 | 3 | 100000001111 |
t | 150 | 207 | 147 | 102 | 148 | 011011 |
t- | 4 | 0 | 3 | 6 | 4 | 10000000100 |
d | 229 | 172 | 200 | 178 | 210 | 00100 |
d- | 1 | 0 | 1 | 5 | 1 | 1000000011101 |
n- | 27 | 12 | 42 | 35 | 31 | 01001010 |
th | 304 | 337 | 286 | 283 | 299 | 01111 |
th- | 8 | 5 | 18 | 26 | 13 | 001010000 |
dh | 274 | 250 | 277 | 314 | 278 | 01100 |
dh- | 36 | 37 | 32 | 49 | 37 | 10000001 |
n | 714 | 737 | 743 | 654 | 715 | 1010 |
p | 271 | 255 | 242 | 220 | 256 | 01000 |
p- | 8 | 2 | 3 | 5 | 6 | 10001110110 |
b | 87 | 62 | 93 | 100 | 88 | 1000110 |
b- | 37 | 47 | 35 | 55 | 39 | 10000110 |
m | 345 | 352 | 448 | 351 | 373 | 10111 |
y | 255 | 207 | 234 | 311 | 253 | 00111 |
r | 390 | 430 | 401 | 515 | 412 | 11011 |
r- | 14 | 17 | 22 | 20 | 17 | 010010111 |
l | 350 | 380 | 333 | 368 | 350 | 10010 |
l- | 29 | 25 | 12 | 6 | 21 | 100011100 |
v | 290 | 302 | 302 | 306 | 296 | 01110 |
s- | 50 | 45 | 76 | 72 | 59 | 0010101 |
sh | 28 | 22 | 41 | 26 | 30 | 00101001 |
s | 166 | 160 | 166 | 134 | 161 | 100010 |
h | 44 | 50 | 52 | 51 | 47 | 10001111 |
ksha | 14 | 7 | 24 | 14 | 16 | 010010110 |
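As a consistency check on Table I, the sketch below transcribes the Prose column and the Huffman codewords, tests that no codeword is a prefix of another, evaluates the Kraft sum, and computes the frequency-weighted average codeword length. The list name `TABLE_I` and the helper names are illustrative assumptions; the figures themselves are those of the table.

```python
# (letter, prose frequency, Huffman codeword), transcribed from Table I.
TABLE_I = [
    ("a", 1591, "111"), ("a:", 552, "0101"), ("i", 727, "1100"),
    ("i:", 119, "001011"), ("u", 902, "000"), ("u:", 78, "1000010"),
    ("ru", 18, "100000000"), ("ru:", 0, "100000001110001"),
    ("e", 146, "011010"), ("e:", 204, "110101"), ("ai", 42, "10000111"),
    ("o", 78, "1000001"), ("o:", 142, "010011"), ("au", 6, "10001110111"),
    ("am", 10, "1000111010"), ("m-", 351, "10011"),
    ("aha", 2, "100000001011"), ("k", 352, "10110"), ("k-", 14, "001010001"),
    ("g", 188, "110100"), ("g-", 5, "10000000110"),
    ("ng", 1, "10000000111001"), ("ch", 243, "00110"),
    ("ch-", 2, "100000001010"), ("j", 63, "0100100"),
    ("j-", 0, "100000001110000"), ("nj", 3, "100000001111"),
    ("t", 148, "011011"), ("t-", 4, "10000000100"), ("d", 210, "00100"),
    ("d-", 1, "1000000011101"), ("n-", 31, "01001010"), ("th", 299, "01111"),
    ("th-", 13, "001010000"), ("dh", 278, "01100"), ("dh-", 37, "10000001"),
    ("n", 715, "1010"), ("p", 256, "01000"), ("p-", 6, "10001110110"),
    ("b", 88, "1000110"), ("b-", 39, "10000110"), ("m", 373, "10111"),
    ("y", 253, "00111"), ("r", 412, "11011"), ("r-", 17, "010010111"),
    ("l", 350, "10010"), ("l-", 21, "100011100"), ("v", 296, "01110"),
    ("s-", 59, "0010101"), ("sh", 30, "00101001"), ("s", 161, "100010"),
    ("h", 47, "10001111"), ("ksha", 16, "010010110"),
]


def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    ordered = sorted(codewords)
    # In sorted order a prefix is immediately followed by one of its extensions.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def kraft_sum(codewords):
    """Sum of 2**(-length); equals 1 for a complete binary prefix code."""
    return sum(2.0 ** -len(code) for code in codewords)


def average_length(table):
    """Frequency-weighted mean codeword length in binary digits per letter."""
    total = sum(freq for _, freq, _ in table)
    return sum(freq * len(code) for _, freq, code in table) / total


codes = [code for _, _, code in TABLE_I]
print(len(TABLE_I), is_prefix_free(codes), kraft_sum(codes))
print(round(average_length(TABLE_I), 3))  # prints 4.618
```

The 53 codewords form a complete prefix code (Kraft sum 1), and the weighted average length computed from the rounded Prose column, about 4.6176, agrees with the 4.617 binary digits per letter quoted in the text.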
TABLE II
ESTIMATES OF H1 AND THEIR STANDARD ERRORS
Category | Sample size (letters) | Unbiased estimate of H1 (binary digits per letter) | Standard error |
Novels | 26488 | 4.5968 | 0.008473 |
Short stories | 4001 | 4.5524 | 0.021078 |
Plays | 13011 | 4.6662 | 0.016522 |
Others | 6500 | 4.5620 | 0.018298 |
Total for prose | 50000 | 4.5879 | 0.006305 |
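Table II reports unbiased estimates and standard errors without stating the formulas behind them. One common route, assumed here and not confirmed by the source, is to add the first-order bias correction (k - 1) / (2 N ln 2) binary digits to the plug-in entropy, where k is the number of letters observed and N the sample size, and to take the standard error from the large-sample variance (sum of p log2(p)^2 minus H1 squared) divided by N. Results of this kind are associated with Basharin (1959). The function name and the example counts are illustrative.

```python
import math


def entropy_with_bias_correction(counts):
    """Return (bias-corrected H1, approximate standard error) in binary digits.

    Assumption: first-order correction (k - 1) / (2 N ln 2) and the
    large-sample variance (sum p * log2(p)**2 - H**2) / N.
    """
    n = sum(counts)
    if n <= 0:
        raise ValueError("counts must contain at least one positive value")
    probs = [c / n for c in counts if c > 0]
    k = len(probs)  # number of letters actually observed
    plug_in = -sum(p * math.log2(p) for p in probs)
    corrected = plug_in + (k - 1) / (2 * n * math.log(2))
    second_moment = sum(p * math.log2(p) ** 2 for p in probs)
    # Guard against a tiny negative value caused by rounding.
    variance = max(second_moment - plug_in ** 2, 0.0) / n
    return corrected, math.sqrt(variance)


# Hypothetical counts for a four-letter alphabet (not the Telugu data).
estimate, standard_error = entropy_with_bias_correction([400, 200, 100, 100])
print(f"H1 = {estimate:.4f} +/- {standard_error:.4f}")
```

For a sample of 50,000 letters and an alphabet of about 50 letters the correction is below 0.001 binary digits, so it matters mainly for the smaller groups.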
REFERENCES
BALASUBRAHMANYAM, P. (1963), "An Application of Information Theory to Linguistics with Special Reference to Telugu," Master's thesis, University of Madras.
BASHARIN, G. P. (1959), On a statistical estimate for the entropy of a sequence of independent random variables, Teor. Veroyatnostei i Prim. 4, 361-364.
BRILLOUIN, L. (1956), "Science and Information Theory," p. 25, Academic Press, New York.
BURTON, N. G. AND LICKLIDER, J. C. R. (1955), Long-range constraints in the statistical structure of printed English, Am. J. Psychol. 68, 650-653.
CHERRY, COLIN (Ed.) (1956), "Information Theory, Third London Symposium," p. 168. Butterworth, London.
HERDAN, G. (1956), "Language as Choice and Chance," p. 88. Noordhoff, Groningen.
HUFFMAN, D. A. (1962), A method for the construction of minimum redundancy codes, Proc. Inst. Radio Engrs. 40, 1098-1101.
RAMAKRISHNA, B. S. ET AL. (1962), "Some Aspects of Relative Efficiencies of Indian Languages." Indian Institute of Science, Bangalore.
REZA, F. M. (1961), "An Introduction to Information Theory," p. 133. McGraw-Hill, New York.
SHANNON, C. E. (1948), A mathematical theory of communication, Bell System Tech. J. 27, 379-423.
SHANNON, C. E. (1951), Prediction and entropy of printed English, Bell System Tech. J. 30, 50-64.
SIROMONEY, GIFT (1963), Entropy of Tamil prose, Inform. Control 6, 297-300.
SIROMONEY, GIFT (1964a), An information-theoretical test for familiarity with a foreign language, J. Psychol. Researches 8, 1-6.
SIROMONEY, GIFT (1964b), "Certain Applications of Information Theory." Doctoral thesis, University of Madras.
SIROMONEY, GIFT AND RAJAGOPALAN, K. R. (1964), Style as information in Karnatic music, J. Music Theory 8, 267-272.