Information Theory

A Note on Entropy of Telugu Prose
Information and Control, Vol. 13, No. 4, October 1968, 281-285
P. Balasubrahmanyam and Gift Siromoney

An optimum code is constructed for the Telugu alphabet using the proportions of letters estimated from a large sample of Telugu prose. Unbiased estimates of the one-gram entropies of the different forms of prose writing are obtained. It is shown that this entropy can be treated neither as a language characteristic nor as a style characteristic. Further, the digram entropy and an approximate value of the entropy of Telugu prose are obtained.

Telugu, spoken by about 50 million people in India, belongs to the Dravidian family of languages. The earlier works in Telugu are mainly in verse, and prose writing is a fairly modern development. Using random sampling methods, a large sample of 50,000 letters of modern Telugu prose was taken (Balasubrahmanyam, 1963). These prose works are classified into four groups, namely novels, short stories, plays and "others". The last group includes biographies, works on history and miscellaneous essays. In Telugu prose 16 "vowels" (achulu) and 36 "consonants" (hallulu) are used. In addition, the symbol called pūrnānuswāra is considered a separate letter when it follows any vowel except a. It is orthographically represented as a circle, and the combination of a followed by pūrnānuswāra is usually reckoned as a single letter, pronounced am, which is the fifteenth letter of the achulu ("vowels"). In classical Telugu poetry two more letters (arthānuswāra and visarga) occur, but these are omitted from our analysis since they are not used in modern Telugu prose.

Let p₁, p₂, ..., pₙ be the proportions of the n different letters of the alphabet. The one-gram entropy (Shannon, 1948) is given by

$$H_1 = -\sum_{i=1}^{n} p_i \,\mathrm{ld}\, p_i$$

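where 'ld' stands for the logarithm to the base 2. Following Basharin (1959) we write

$$E(\hat{H}_1) = H_1 - \frac{n-1}{2N}\,\mathrm{ld}\, e + O\!\left(\frac{1}{N^2}\right), \qquad \mathrm{Var}(\hat{H}_1) = \frac{1}{N}\left[\sum_{i=1}^{n} p_i\,(\mathrm{ld}\, p_i)^2 - H_1^2\right] + O\!\left(\frac{1}{N^2}\right),$$

where $\hat{H}_1$, the estimate of H₁ computed from the sample proportions, is asymptotically normally distributed and N is the size of the sample. These values are obtained directly from the frequencies of letters, using seven-figure tables of common logarithms.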
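The estimation step can be illustrated with a short sketch. It assumes the first-order results usually attributed to Basharin (1959): the plug-in estimate computed from sample proportions underestimates H₁ by about (n-1)/(2N) ld e, and its variance is about (Σ pᵢ (ld pᵢ)² - H₁²)/N. The function name `entropy_estimate` and the counts in the usage note are hypothetical and are not taken from the paper.

```python
from math import e, log2, sqrt


def entropy_estimate(counts, alphabet_size=None):
    """Return (plug-in H1, bias-corrected H1, standard error), all in bits per letter.

    counts        -- observed number of occurrences of each letter
    alphabet_size -- n in the bias term; defaults to len(counts)
    """
    N = sum(counts)
    n = alphabet_size if alphabet_size is not None else len(counts)
    # Letters that never occur contribute nothing to the sums.
    probs = [c / N for c in counts if c > 0]

    plug_in = -sum(p * log2(p) for p in probs)
    # First-order bias correction: add (n - 1) / (2N) * ld e.
    corrected = plug_in + (n - 1) / (2 * N) * log2(e)
    # First-order variance: (sum p * (ld p)^2 - H^2) / N.
    variance = (sum(p * log2(p) ** 2 for p in probs) - plug_in ** 2) / N
    return plug_in, corrected, sqrt(max(variance, 0.0))


# Hypothetical counts for a three-letter alphabet, for illustration only.
plug_in, corrected, std_err = entropy_estimate([50, 25, 25])
print(plug_in, corrected, std_err)
```

For the hypothetical counts (50, 25, 25) the plug-in value is 1.5 bits, the corrected value is about 1.514 bits and the standard error is 0.05 bit. For a sample of the size used here (53 letters, N = 50,000) the correction term is only about 0.00075 bit.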

The relative frequencies of letters for modern prose as a whole and for the groups short stories, novels, plays and "others" are given in Table I, along with Huffman's (1952) minimum redundancy code. This code was constructed from the relative frequencies for modern prose; its average number of binary digits per letter is 4.617, compared with the value 4.588 of H₁. Therefore the efficiency (Reza, 1961) of the encoding, that is, the ratio of H₁ to the average code length, is 99.38%. The unbiased estimates of the one-gram entropy H₁, with the corresponding standard errors for the different groups, are given in Table II.
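The figures quoted above can be checked against Table I. The sketch below is only an illustration and not the authors' procedure: it copies the Prose column and the published code words from Table I, verifies that the code is prefix-free and complete (Kraft sum equal to 1), and computes its average length and efficiency. It also builds a Huffman code from the same frequencies for comparison. The helper names are assumptions, and because the tabulated frequencies are rounded, the results agree with the published values only approximately.

```python
import heapq
from fractions import Fraction
from math import log2

# Table I: letter -> (Prose frequency, published code word).
TABLE_I = {
    "a": (1591, "111"), "a:": (552, "0101"), "i": (727, "1100"), "i:": (119, "001011"),
    "u": (902, "000"), "u:": (78, "1000010"), "ru": (18, "100000000"),
    "ru:": (0, "100000001110001"), "e": (146, "011010"), "e:": (204, "110101"),
    "ai": (42, "10000111"), "o": (78, "1000001"), "o:": (142, "010011"),
    "au": (6, "10001110111"), "am": (10, "1000111010"), "m-": (351, "10011"),
    "aha": (2, "100000001011"), "k": (352, "10110"), "k-": (14, "001010001"),
    "g": (188, "110100"), "g-": (5, "10000000110"), "ng": (1, "10000000111001"),
    "ch": (243, "00110"), "ch-": (2, "100000001010"), "j": (63, "0100100"),
    "j-": (0, "100000001110000"), "nj": (3, "100000001111"), "t": (148, "011011"),
    "t-": (4, "10000000100"), "d": (210, "00100"), "d-": (1, "1000000011101"),
    "n-": (31, "01001010"), "th": (299, "01111"), "th-": (13, "001010000"),
    "dh": (278, "01100"), "dh-": (37, "10000001"), "n": (715, "1010"),
    "p": (256, "01000"), "p-": (6, "10001110110"), "b": (88, "1000110"),
    "b-": (39, "10000110"), "m": (373, "10111"), "y": (253, "00111"),
    "r": (412, "11011"), "r-": (17, "010010111"), "l": (350, "10010"),
    "l-": (21, "100011100"), "v": (296, "01110"), "s-": (59, "0010101"),
    "sh": (30, "00101001"), "s": (161, "100010"), "h": (47, "10001111"),
    "ksha": (16, "010010110"),
}
freqs = {s: f for s, (f, _) in TABLE_I.items()}
codes = {s: c for s, (_, c) in TABLE_I.items()}


def average_length(freqs, lengths):
    """Frequency-weighted mean code-word length in binary digits per letter."""
    return sum(freqs[s] * lengths[s] for s in freqs) / sum(freqs.values())


def entropy_bits(freqs):
    """Plug-in one-gram entropy in bits per letter (zero frequencies skipped)."""
    total = sum(freqs.values())
    return -sum(f / total * log2(f / total) for f in freqs.values() if f > 0)


def is_prefix_free(words):
    """In sorted order a prefix is always immediately followed by a word it prefixes."""
    ordered = sorted(words)
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def huffman_lengths(freqs):
    """Code-word lengths of a binary Huffman code for the given weights."""
    # Heap items are (weight, unique tie-breaker, symbols in the subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every symbol under the merged node moves one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths


published_lengths = {s: len(c) for s, c in codes.items()}
kraft = sum(Fraction(1, 2 ** n) for n in published_lengths.values())
avg_published = average_length(freqs, published_lengths)
avg_huffman = average_length(freqs, huffman_lengths(freqs))
h1 = entropy_bits(freqs)
efficiency = h1 / avg_published

print(is_prefix_free(codes.values()), kraft == 1)
print(round(avg_published, 3), round(avg_huffman, 3), round(h1, 3), round(efficiency, 4))
```

Since a Huffman code is optimal among prefix codes for the weights it is built from, `avg_huffman` can never exceed `avg_published`; any small gap between the two would reflect the rounding of the tabulated frequencies rather than a defect in the published code.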

We find that the value of H₁ for plays is significantly different from those for novels, short stories and "others". That is what one would expect, since Telugu plays are written in a pedantic and artificial style of prose quite distinct from the other forms of prose. Herdan (Cherry, 1956) defined a language characteristic as one whose value is significantly different between languages but not between authors writing in the same language. Contrary to this, we find here that within the same language H₁ has significantly different values for different forms of prose. Further, it has been shown by Ramakrishna et al. (1962) that H₁ is practically the same for some of the major languages of India. Therefore H₁ is not a language characteristic. If H₁ were a style characteristic, it would have to bring out the differences in styles of writing, but we find that short stories and "others" have practically the same values of H₁ even though they are written in distinctly different styles. The values of H₁ fail to bring out this difference, whereas the chi-square test (Herdan, 1956) does so very clearly; hence H₁ is not very useful as a style characteristic. The failure of H₁ to bring out differences in style has also been established (Siromoney and Rajagopalan, 1964) in the field of Karnatic music, where H₁ does not differ between authors of distinct styles but has been shown to differ between different works of the same author. When six works of Tamil poetry spread over a period of about 2000 years were analysed (Siromoney, 1963, 1964b), H₁ again failed as a style characteristic.
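A chi-square comparison of two letter-frequency distributions can be sketched as follows. This is the standard Pearson statistic for the homogeneity of two samples, offered as an illustration rather than as Herdan's exact procedure; the function name and the toy counts are hypothetical. To apply it to Table I, the tabulated relative frequencies would first have to be converted back to approximate counts using the sample sizes in Table II, and letters with very small expected counts would normally be pooled.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two count vectors.

    counts_a[i] and counts_b[i] are the occurrences of letter i in the two samples.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat, used = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # letter absent from both samples carries no information
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
        used += 1
    return stat, used - 1


# Hypothetical two-letter samples, for illustration only.
stat, dof = chi_square_homogeneity([10, 20], [20, 10])
print(stat, dof)
```

The statistic is compared with the chi-square distribution on the returned number of degrees of freedom. Unlike H₁, it is sensitive to which letters carry the probability mass, so two samples with equal entropies can still differ markedly under this test.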

TABLE I
RELATIVE FREQUENCIES OF LETTERS (PER 10,000 LETTERS)

Letter	Novels	Short stories	Plays	Others	Prose	Huffman code
a	1557	1630	1569	1748	1591	111
a:	582	535	534	482	552	0101
i	734	787	729	658	727	1100
i:	116	97	144	91	119	001011
u	863	807	975	974	902	000
u:	93	115	50	48	78	1000010
ru	15	10	18	32	18	100000000
ru:	0	0	0	0	0	100000001110001
e	159	167	120	135	146	011010
e:	207	182	228	157	204	110101
ai	41	50	33	55	42	10000111
o	83	85	77	54	78	1000001
o:	153	157	127	117	142	010011
au	4	2	8	14	6	10001110111
am	11	7	8	14	10	1000111010
m-	372	325	309	362	351	10011
aha	2	2	2	3	2	100000001011
k	363	365	342	317	352	10110
k-	17	0	9	20	14	001010001
g	192	210	167	198	188	110100
g-	4	5	5	6	5	10000000110
ng	0	0	3	2	1	10000000111001
ch	242	275	259	200	243	00110
ch-	2	0	3	3	2	100000001010
j	60	50	52	103	63	0100100
j-	0	0	0	0	0	100000001110000
nj	3	2	4	2	3	100000001111
t	150	207	147	102	148	011011
t-	4	0	3	6	4	10000000100
d	229	172	200	178	210	00100
d-	1	0	1	5	1	1000000011101
n-	27	12	42	35	31	01001010
th	304	337	286	283	299	01111
th-	8	5	18	26	13	001010000
dh	274	250	277	314	278	01100
dh-	36	37	32	49	37	10000001
n	714	737	743	654	715	1010
p	271	255	242	220	256	01000
p-	8	2	3	5	6	10001110110
b	87	62	93	100	88	1000110
b-	37	47	35	55	39	10000110
m	345	352	448	351	373	10111
y	255	207	234	311	253	00111
r	390	430	401	515	412	11011
r-	14	17	22	20	17	010010111
l	350	380	333	368	350	10010
l-	29	25	12	6	21	100011100
v	290	302	302	306	296	01110
s-	50	45	76	72	59	0010101
sh	28	22	41	26	30	00101001
s	166	160	166	134	161	100010
h	44	50	52	51	47	10001111
ksha	14	7	24	14	16	010010110

TABLE II
ESTIMATES OF H₁ AND THEIR STANDARD ERRORS

Category	Sample size	Unbiased estimate of H₁ (bits per letter)	Standard error
Novels	26488	4.5968	0.008473
Short stories	4001	4.5524	0.021078
Plays	13011	4.6662	0.016522
Others	6500	4.5620	0.018298
Total for prose	50000	4.5879	0.006305
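The comparisons between groups can be reproduced approximately from Table II. The sketch below assumes that the four samples are independent and that each estimate is approximately normal, so that the difference between two estimates divided by the root sum of squares of their standard errors can be treated as a standard normal deviate. The paper does not state which test was used, so this two-sample z-test is an assumption, as are the names `TABLE_II` and `z_score`.

```python
from math import sqrt

# Table II: category -> (unbiased estimate of H1 in bits, standard error).
TABLE_II = {
    "Novels": (4.5968, 0.008473),
    "Short stories": (4.5524, 0.021078),
    "Plays": (4.6662, 0.016522),
    "Others": (4.5620, 0.018298),
}


def z_score(a, b):
    """Difference of the two H1 estimates in units of its standard error."""
    (h_a, se_a), (h_b, se_b) = TABLE_II[a], TABLE_II[b]
    return (h_a - h_b) / sqrt(se_a ** 2 + se_b ** 2)


for other in ("Novels", "Short stories", "Others"):
    print("Plays vs", other, round(z_score("Plays", other), 2))
print("Short stories vs Others", round(z_score("Short stories", "Others"), 2))
```

Under these assumptions the deviates for plays against each of the other three groups lie between roughly 3.7 and 4.2 in absolute value, well beyond the usual 5% limit of 1.96, whereas the deviate for short stories against "others" is only about 0.3 in absolute value.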

Shannon (1951) defines the entropy H as the limiting value of the n-gram entropy H_n, which is given by the equation

$$H_n = -\sum_{i,j} p(b_i, j)\,\mathrm{ld}\, p(b_i, j) + \sum_{i} p(b_i)\,\mathrm{ld}\, p(b_i)$$

where b_i is a block of (n-1) letters, j an arbitrary letter following b_i, p(b_i, j) the probability of the n-gram b_i, j, and p(b_i) = Σ_j p(b_i, j) the probability of the block b_i. From a sample of 10,000 digrams of Telugu prose, the value of the digram entropy H₂ was estimated to be 3.09 bits per letter.

An estimate of the entropy H was made using methods similar to those of Shannon (1951) and Brillouin (1956). From one book, strings of letters of length 75 were chosen at random and a subject was asked to guess the next letter. For English it has been established (Burton and Licklider, 1955) that a string of 32 letters is sufficient for such experiments, and for Telugu preliminary trials indicated that a length of 75 letters was quite sufficient. The subject guessed the 76th letter correctly in 61 trials out of 100. Assuming that each letter guessed correctly gives 1 bit of information and each letter that could not be guessed correctly gives H₁ bits of information, H is estimated to be 0.61 × 1 + 0.39 × 4.588 ≈ 2.4 bits per letter, which must be considered an upper limit. By varying the subjects or the texts, one would expect to get different estimates of H. This variation between subjects can itself be used, as has been done for English (Siromoney, 1964a), to test the proficiency of a subject in Telugu prose.
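Both estimates in this section reduce to short calculations. The sketch below computes a plug-in digram entropy from a sequence of letters as the entropy of the adjacent pairs minus the entropy of their first letters, and reproduces the guessing estimate from the figures quoted above. It is an illustration under stated assumptions: overlapping adjacent pairs are used, no bias correction is applied, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def _entropy(counter):
    """Plug-in entropy in bits of the distribution given by a Counter."""
    total = sum(counter.values())
    return -sum(c / total * log2(c / total) for c in counter.values())


def digram_entropy(letters):
    """H2 = H(adjacent pairs) - H(first letters of the pairs), in bits per letter.

    `letters` is a string or a list of letter tokens.
    """
    pairs = list(zip(letters, letters[1:]))
    first_letters = Counter(first for first, _ in pairs)
    return _entropy(Counter(pairs)) - _entropy(first_letters)


def guessing_upper_bound(correct, trials, h1):
    """Charge 1 bit for each correct guess and h1 bits for each miss, then average."""
    q = correct / trials
    return q * 1.0 + (1.0 - q) * h1


print(round(guessing_upper_bound(61, 100, 4.588), 3))
```

With 61 correct guesses out of 100 and H₁ = 4.588, the bound works out to about 2.399 bits per letter, which rounds to the 2.4 quoted in the text. For a strictly alternating toy sequence such as "abababab" the digram entropy is zero, because each letter fully determines the next.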

REFERENCES

BALASUBRAHMANYAM, P. (1963), "An Application of Information Theory to Linguistics with Special Reference to Telugu," Master's thesis, University of Madras.
BASHARIN, G. P. (1959), On a statistical estimate for the entropy of a sequence of independent random variables, Teor. Veroyatnost. i Primenen. 4, 361-364.
BRILLOUIN, L. (1956), "Science and Information Theory," p. 25, Academic Press, New York.
BURTON, N. G. AND LICKLIDER, J. C. R. (1955), Long-range constraints in the statistical structure of printed English, Am. J. Psychol. 68, 650-653.
CHERRY, COLIN (Ed.) (1956), "Information Theory, Third London Symposium," p. 168. Butterworth, London.
HERDAN, G. (1956), "Language as Choice and Chance," p. 88. Noordhoff, Groningen.
HUFFMAN, D. A. (1952), A method for the construction of minimum redundancy codes, Proc. Inst. Radio Engrs. 40, 1098-1101.
RAMAKRISHNA, B. S. ET AL. (1962), "Some Aspects of Relative Efficiencies of Indian Languages." Indian Institute of Science, Bangalore.
REZA, F. M. (1961), "An Introduction to Information Theory," p. 133. McGraw-Hill, New York.
SHANNON, C. E. (1948), A mathematical theory of communication, Bell System Tech. J. 27, 379-423.
SHANNON, C. E. (1951), Prediction and entropy of printed English, Bell System Tech. J. 30, 50-64.
SIROMONEY, GIFT (1963), Entropy of Tamil prose, Inform. Control 6, 297-300.
SIROMONEY, GIFT (1964a), An information-theoretical test for familiarity with a foreign language, J. Psychol. Researches 8, 1-6.
SIROMONEY, GIFT (1964b), "Certain Applications of Information Theory." Doctoral thesis, University of Madras.
SIROMONEY, GIFT AND RAJAGOPALAN, K. R. (1964), Style as information in Karnatic music, J. Music Theory 8, 267-272.
