STATISTICAL AIDS AND TAMIL STUDIES
Saiva Siddhanta, iii, 1968, pp. 44-48
Gift Siromoney and K. R. Rajagopalan

The need for the use of modern techniques in Tamilology was emphasized by several delegates from abroad who attended the Second World Tamil Conference held at Madras. However, they seemed to be unaware of some of the work on Indian languages done in this country using modern communication theory, even though these studies were published in international scientific journals abroad. In this paper we give a brief survey of some important statistical techniques that have been found useful in literary studies and apply one of the methods to compare the styles of two of our well-known authors.

Even though the early attempts at the use of statistical methods in literary studies can be traced back to the last century, the first significant contribution was made in 1939 by Udny Yule of Cambridge. Contributions by scholars such as C. B. Williams, W. C. Wake and G. Herdan followed. In the last few years electronic computers have been pressed into service and this has created wide interest in the subject.

Among the books of the Bible to be subjected to strict statistical analysis, the Pauline Epistles were the first. In the Hibbert Journal of 1948, W. C. Wake studied the lengths of sentences in the Epistles and found distinct evidence for dual authorship. Romans, Galatians, I Corinthians and II Corinthians chapters 10-13 have sentence-lengths which are statistically indistinguishable, with an average of 11 words. On the other hand, Thessalonians, Colossians and Philippians form a second group with an average of 17 words per sentence. Of these two groups, the first is believed to be the authentic writings of Paul when other factors are taken into account.

Another technique was used in 1963 by Morton to analyze the same Epistles using electronic computers. The Greek writers used the word "kai" very frequently. "Kai" means "and", and the use of the word is independent of the subject matter. The proportion of "kais" in the vocabulary of a writer was found to be constant and this was used as a statistical characteristic. This study confirmed the earlier work of Wake that Romans, Galatians and the major portion of Corinthians were of the same pattern but claimed that the other Epistles formed three different groups. This study also showed that some of the Epistles of Paul were too short to yield any reliable conclusions from statistical analysis. Other books of the New Testament have also been analysed using the computer.

In the study of disputed authorship several other techniques have been used. Yule, in his study of the vocabulary of different authors, confined himself to nouns and developed a measure of diversity of vocabulary called the "Characteristic". This technique was applied to the problem of the authorship of De Imitatione Christi and sufficient evidence was found to support the view that Thomas à Kempis, and not Gerson, was the author of the book. In the U. S., another technique based on Bayesian methods was used to study the Federalist papers. In Sweden the occurrence of rare words was used to identify authors and such statistical evidence was once admitted for the prosecution of a high official. The use of word-length as a criterion is not very common, but at the beginning of this century the works of Shakespeare and Bacon were analysed on this basis. It was found that Shakespeare's average was about 4 letters per word in contrast to Bacon's average of 3 letters.
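Yule's Characteristic, mentioned above, can be illustrated with a short sketch. The formula assumed here is the form commonly quoted in the literature, K = 10^4 (sum of i^2 V_i - N) / N^2, where V_i is the number of distinct words occurring exactly i times and N is the total number of word occurrences. The function name and the toy word lists are hypothetical and are not taken from Yule's study, which was confined to nouns.

```python
from collections import Counter


def yule_characteristic(words):
    """Yule's K for a list of words; lower values mean a more diverse vocabulary."""
    n = len(words)
    if n == 0:
        raise ValueError("need at least one word")
    word_counts = Counter(words)              # occurrences of each distinct word
    spectrum = Counter(word_counts.values())  # V_i: number of words occurring i times
    second_moment = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (second_moment - n) / (n * n)


# Toy samples (assumed data): one repetitive, one with no repeated word.
print(yule_characteristic(["a", "a", "a", "b"]))  # 3750.0
print(yule_characteristic(["a", "b", "c", "d"]))  # 0.0
```

A text that keeps returning to the same few words gives a large value, while a text in which no word is repeated gives zero.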

The development of information and communication theory gave rise to the study of the proportions of different letters used in a language. The proportions were used for calculating a measure called the "Entropy". Studies in several Indian languages were conducted by Ramakrishna of the Indian Institute of Science and appeared in the form of a monograph. Studies in Telugu were made by P. Balasubrahmanyam of the Madras Christian College and in Kannada by K. R. Rajagopalan. Rajagopalan further compared the proportion of Sanskrit words in Kannada prose in short stories and novels on the one hand and biographies and literary studies on the other. The former had 18% of Sanskrit words and the latter 40%. Using sentence-length as a criterion, he found that the South Kanara writers showed a preference for longer sentences compared to the Old Mysore and North Kanara writers.
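The entropy calculation can be sketched as follows, assuming the usual first-order definition H = -(sum of p log2 p) taken over the letter proportions p. The function name is hypothetical. Treating each Unicode character as one letter and ignoring spaces are simplifying assumptions; the studies cited above may have defined the letter inventory differently.

```python
import math
from collections import Counter


def letter_entropy(text):
    """First-order entropy, in bits per letter, from observed letter proportions."""
    letters = [ch for ch in text if not ch.isspace()]  # assumption: spaces are ignored
    if not letters:
        raise ValueError("text contains no letters")
    total = len(letters)
    counts = Counter(letters)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two equally frequent letters carry one bit per letter.
print(letter_entropy("abab"))  # 1.0
```

The more unevenly the letters are used, the lower the entropy falls below its maximum of log2(k) bits for an alphabet of k letters.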

Work by Siromoney in Tamil using information theory was started in 1959 and the proportions of different letters were found. These results were applied to the study of the keyboard of the Tamil typewriter and to the invention of a keyboard for a teleprinter in Tamil. Five years have elapsed since the results appeared in Tamil Culture and it is a great pity that not a single Tamil teleprinter has been constructed so far. The entropy of modern prose was found to be significantly different from that of the poetical works in Tamil. An attempt was made to distinguish the styles of Yutthakaandam and Utthirakaandam of Kambaraamaayanam in Information and Control (1963). The values of the entropies were not significantly different, but a more powerful method called the chi-square method established the difference between the two works.
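The exact form of the chi-square calculation is not given here, so the following is only a sketch of the standard chi-square test of homogeneity applied to the letter counts of two texts. The function name and the counts are hypothetical. The statistic is to be compared with a tabulated chi-square value for the returned degrees of freedom.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for two rows of category counts."""
    if len(counts_a) != len(counts_b) or len(counts_a) < 2:
        raise ValueError("need two rows with the same number (>= 2) of categories")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            raise ValueError("every category must occur in at least one text")
        # Expected counts if both texts shared the same letter proportions.
        expected_a = total_a * column / grand_total
        expected_b = total_b * column / grand_total
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b
    return statistic, len(counts_a) - 1


# Assumed counts of two letters in two texts.
statistic, degrees_of_freedom = chi_square_homogeneity([10, 20], [20, 10])
print(statistic, degrees_of_freedom)
```

Unlike a comparison of two entropy values, this test uses the whole table of letter counts, which is one reason it can detect differences that the entropies alone do not show.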

Recently we have made a study of the styles of two of our authors on the basis of sentence-length. Abraham Pandither's Nanmarai kaattum nanneri (denoted by A₁), a Christian work written in the Saiva Siddhanta idiom, and Karunaamirtha saagaram (A₂), his famous treatise on musicology, were taken and compared with the works of T. P. Meenakshisundaranaar. The works of the latter are Kaanal vari (M₁) and Saiaamil thiruvembaavai thiruppaavai (M₂). The string of words between two periods was taken as a sentence and the string of letters between two spaces was taken as a word. Samples were taken from each book using random sampling techniques with the assistance of two undergraduate students, V. N. Govindan and S. Ranganathan. The results are presented in the form of a table. The average lengths of sentences are 13.99 (A₁) and 14.38 (A₂) for Pandither and 9.22 (M₁) and 9.39 (M₂) for T. P. M. The standard error for A₁ is 10.95 and for A₂ 10.53. In contrast to this, for T. P. M., the standard errors are 5.76 (M₁) and 5.02 (M₂). On the average Pandither's sentences are about one and a half times as long as T. P. M.'s. Further, about 10% of the sentences of Pandither have a length greater than 30, in contrast to T. P. M.'s sentences, of which only about 1% are of a length greater than 30. However, T. P. M. himself wrote very long sentences once in a while, and we give the longest sentences of these two authors even though these sentences are not part of our samples. The shortest sentences are of unit length.
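The counting rule described above can be sketched in a few lines. The function name and the English sample text are hypothetical stand-ins. Splitting on any whitespace rather than on single spaces only is a small assumption made for convenience.

```python
import statistics


def sentence_lengths(text):
    """Length in words of each sentence, using the definitions given above."""
    lengths = []
    for chunk in text.split("."):  # a sentence is the string between two periods
        words = chunk.split()      # a word is the string between two spaces
        if words:                  # skip the empty remainder after the last period
            lengths.append(len(words))
    return lengths


sample = "One two three. Four five. Six."
lengths = sentence_lengths(sample)
print(lengths)                   # [3, 2, 1]
print(statistics.mean(lengths))  # 2
```

Applied to a random sample of sentences from a book, the same list of lengths yields the average and the frequency classes of the kind reported in the table.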

TABLE 1. Frequencies of sentences in terms of length

Length of sentences in words	Number of occurrences of sentences
	Pandither		Meenakshisundaranaar (T. P. M.)
	A1	A2	M1	M2
1-5	35	23	56	33
6-10	42	50	87	103
11-15	30	38	33	29
16-20	18	15	14	14
21-25	13	19	5	4
26-30	7	6	2	2
31-35	7	3	1	1
36-40	3	1	1	1
41-45	2	2	1	1
46 and above	1	3	1	1
Total number of sentences	158	160	199	186
Average length of sentences	13.99	14.38	9.22	9.39

In Karunaamirtha saagaram (p.11), Pandither writes

Meenakshisundaranaar (Kaanal vari p. 180) replies,

Pandither's work was published in 1917 and T. P. M.'s in 1961 and the two sentences reflect the enormous differences in style not only between the authors but also between the two periods. The obvious difference is in the nature of the vocabulary -- the large number of words of Sanskrit origin which were in common use during Pandither's days are no longer used today. In general, shorter sentences are preferred to longer ones by present day writers.

Using the distribution of the difference between means, we find that the average sentence-lengths are significantly different between the two authors. Taking Pandither alone, the average lengths of sentences are not significantly different between his two works, even though the works are on two different subjects. However, more work is needed to establish sentence-length as a useful criterion for distinguishing the styles of authors in Tamil prose.
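The comparison of means can be sketched with the figures quoted earlier in this paper. Two assumptions are made: the quoted dispersion figures (10.95, 10.53, 5.76) are treated as standard deviations of individual sentence lengths, and the large-sample normal approximation with the 5% two-sided critical value of 1.96 is used. The function name is hypothetical and the exact procedure followed in the study may have differed.

```python
import math


def z_for_difference(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Large-sample z statistic for the difference between two sample means."""
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error


CRITICAL_5_PERCENT = 1.96  # two-sided normal critical value (assumed test level)

# Pandither (A1) against T. P. M. (M1): averages, dispersions and sample sizes as quoted.
z_authors = z_for_difference(13.99, 10.95, 158, 9.22, 5.76, 199)
# Pandither's two works (A2 against A1).
z_works = z_for_difference(14.38, 10.53, 160, 13.99, 10.95, 158)

print(abs(z_authors) > CRITICAL_5_PERCENT)  # True: the authors differ
print(abs(z_works) > CRITICAL_5_PERCENT)    # False: the two works do not
```

Under these assumptions z comes out close to 5 between the two authors and well below 1 between Pandither's two works, which agrees with the conclusions stated above.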

REFERENCES

P. Balasubrahmanyam, "An application of information theory to linguistics with special reference to Telugu", Master's thesis, University of Madras, 1963.
T. P. Meenakshisundaranaar, "Kaanal vari", Kalaikkathir, 1961, Coimbatore.
--------, "Saiaamil Thiruvembaavai Thiruppaavai", 1961, Thiruchi.
A. Q. Morton and James McLeman, Christianity and the Computer, Hodder, 1964, London.
M. Abraham Pandither, "Karunaamirtha saagaram", 1917, Tanjore.
--------, "Nanmarai kaattum nanneri", 1918, Tanjore.
K. R. Rajagopalan, "A note on entropy of Kannada prose", Information and Control, vol. 8, 1965, pp. 640-644.
--------, "Some statistical methods applied to language studies", Kannada Studies, July 1966, pp. 93-98.
B. S. Ramakrishna et al., "The Relative Efficiencies of Indian Languages", Indian Institute of Science, 1962, Bangalore.
Gift Siromoney, "Entropy of Tamil prose", Information and Control, vol. 6, 1963, pp. 297-300.
--------, "Efficient methods of telegraphy, typewriting and teleprinting in Tamil", Tamil Culture, vol. 10, 1963, pp. 107-120.
W. C. Wake, "Authenticity of the Pauline Epistles", Hibbert Journal, vol. 47, 1947, p. 50.
C. B. Williams, "Statistics as an aid to literary studies", Science News, 24, 1952, pp. 99-106.