Information Theory
Shannon(10) discussed a method of measuring the predictability of printed English in the light of modern information theory. In printed English, the 26 letters of the alphabet occur in definite proportions(9); the letter 'e', for instance, is the most frequent letter and 'z' the least. In certain cases, when a particular letter in a sequence of letters is given, the next letter can easily be guessed: 'q', for instance, is invariably followed by 'u'. Such statistical properties of the language help the subject to predict the next letter correctly. A subject who is not familiar with the language does not use such properties when guessing the next letter. When the two-letter combination 'ps' occurs, it is in most cases easy to predict the next letter as 'y'. Such knowledge of trigram frequencies increases one's capacity to predict, and a subject who is familiar with a language makes use of the knowledge of N-gram frequencies acquired in the process of learning the language(4). A person who knows a language possesses, implicitly, an enormous knowledge of the statistics of the language. (However, the mere possession of some statistical tables does not lead to adequate knowledge of the language in question(6).) Given a sequence of letters taken out of context, a person who lacks an adequate knowledge of the language will find it difficult to predict the next letter correctly.
Entropy(10) is a statistical measure of the average "information" (or uncertainty) per letter in a given text. If the text is translated into binary digits (0 or 1) in the most efficient way, the entropy F is the average number of binary digits (or bits) per letter.

Shannon(10) devised certain guessing experiments to measure the predictability (defined in terms of entropy) of ordinary literary English. Similar experiments(2) were conducted on a larger scale to determine the extent to which the predictability of English depends on the number of preceding letters known to the subject, and it was found that increasing the number of known letters beyond 32 does not result in any noticeable increase in predictability. In this paper, we propose to study the variations in the predicting capacities of students learning English as a foreign language.
Procedure
134 students (boys) of the Madras Christian College, belonging to the Pre-University Class, were chosen. All of them belonged to the same group, taking among other subjects mathematics, physical sciences and world history. After consultation with the members of the department of English, John Buchan's "Thirty-nine Steps" was chosen as a book suitable to the capacities of the students tested. For the book, the level of abstraction (R = 81.0) and the reading ease (R.E. = 81.2) were estimated, using Flesch's scales(3), from a sample of 1000 words. The average percentage of "definite words" is 37.6, the average number of syllables per 100 words is 132.0 and the average sentence-length is 13.8 words, which means that the style is "easy" and the level of abstraction "fairly concrete".
Two hundred 32-letter sequences were taken from the book, using random numbers, which denoted the page, the line and the position of the first letter of the sequence. It is customary in information theory to consider SPACE (which occurs between any two words) as an additional letter, forming a 27-letter alphabet for English. Each student was given sheets of paper containing the two hundred 32-letter sequences, taken out of context, and each was asked to guess the next letter. They were not told that all the sequences were taken from the same book. The required written answer had to be one of the 26 letters or SPACE. An answer is reckoned as right only if it coincides with the 33rd letter of the original sequence in the book. All of them finished the test within the prescribed 2-hour limit, the average time taken being about 70 minutes. The scores are tabulated (Table 1) and the corresponding histogram is given (Figure 1).
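The sampling step can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how a sequence that runs past the end of a line was continued, so the sketch joins the remaining lines of the page with SPACE and redraws when the page is too short. The name `draw_item`, the toy page and the seed are all hypothetical.

```python
import random

def draw_item(pages, rng, length=32, max_tries=10_000):
    """Draw a random (page, line, position) and return (context, target).

    `pages` is a list of pages, each a list of non-empty line strings already
    reduced to the 27-symbol alphabet. `context` holds `length` symbols and
    `target` is the symbol that follows it in the text.
    """
    for _ in range(max_tries):
        p = rng.randrange(len(pages))
        line = rng.randrange(len(pages[p]))
        pos = rng.randrange(len(pages[p][line]))
        # Assumption: a sequence continues across line ends within the page.
        run = " ".join(pages[p][line:])[pos:]
        if len(run) > length:
            return run[:length], run[length]
    raise ValueError("no page holds a long enough run of symbols")

# Toy "book": one page of two lines (illustrative text, not from the novel).
toy_pages = [[
    "the quick brown fox jumps over the lazy dog and",
    "then trots quietly back to the farm before dusk",
]]
rng = random.Random(1964)  # fixed seed so the draw is repeatable
context, target = draw_item(toy_pages, rng)
print(repr(context), "->", repr(target))
```

A subject's answer to an item would then be marked right only when it equals `target`, the 33rd symbol.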
TABLE 1
Test Scores
| Score | Frequency |
| --- | --- |
| 5- | 2 |
| 15- | 1 |
| 25- | 1 |
| 35- | 4 |
| 45- | 7 |
| 55- | 12 |
| 65- | 18 |
| 75- | 21 |
| 85- | 31 |
| 95- | 20 |
| 105- | 11 |
| 115- | 5 |
| 125- | 1 |
| Total | 134 |
FIGURE 1
Histogram of Test Score.
The same experiment was tried on four university-trained British nationals and their scores were 128, 126, 126 and 121 out of 200, the time taken varying between 26 and 45 minutes.
Discussion of Reliability and Validity
To test the reliability of the test, a split-half technique(7, 5) was used. The coefficient of correlation between the scores on the first hundred sequences and those on the second hundred is 0.89, with a standard error of 0.02. To test the validity, the names of all those who scored above 100 on the one hand and below 66 on the other were submitted to two tutors who are well acquainted with the subjects. All 57 of these subjects were graded by the tutors (from their personal knowledge of the subjects) as "Good" or "Bad", according to their familiarity with English. From the test scores, all those who got above 100 were considered "Good" and those below 66 "Bad". In 88% of the cases, the classification from the test agreed with that made by the first tutor. In the case of the second tutor, his grades agreed with the test grades in 81% of the cases. The grades awarded independently by the two tutors agreed in 86% of the cases.
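The split-half figures can be checked with a short sketch. The paper does not state which formula was used for the standard error; the classical large-sample expression (1 - r^2) / sqrt(N) is assumed here because it reproduces the reported value for r = 0.89 and N = 134. The function names are illustrative.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standard_error_of_r(r, n):
    """Classical large-sample standard error of r (assumed formula)."""
    return (1 - r * r) / math.sqrt(n)

# Values reported in the paper: r = 0.89 for N = 134 students.
print(round(standard_error_of_r(0.89, 134), 3))  # 0.018, i.e. 0.02 to two places
```

In practice `pearson_r` would be applied to each student's score on sequences 1 to 100 and on sequences 101 to 200; those individual half-scores are not given in the paper.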
The grades awarded by the second tutor and those grades obtained from the test scores are represented in a 2 x 2 table (Table 2) and then tested for independence. The value of chi-square obtained is highly significant.
TABLE 2
Number of "Good" and "Bad" students as graded by the second tutor and by the test
| Test grade | Tutor II: Good | Tutor II: Bad | Total |
| --- | --- | --- | --- |
| Good | 23 | 6 | 29 |
| Bad | 5 | 23 | 28 |
| Total | 28 | 29 | 57 |
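The paper does not report the chi-square value itself. A minimal sketch of the calculation for the counts in Table 2 is given below, using the standard shortcut formula for a 2 x 2 table; whether the authors applied Yates's continuity correction is not stated, so both versions are shown. The function name is illustrative.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square statistic (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink |ad - bc| by n/2, never below zero.
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Counts from Table 2: rows are test grades, columns are the second tutor's grades.
plain = chi_square_2x2(23, 6, 5, 23)
corrected = chi_square_2x2(23, 6, 5, 23, yates=True)
print(round(plain, 2), round(corrected, 2))  # 21.53 19.14

# Proportion of cases in which test and tutor agree (the diagonal cells).
print(round((23 + 23) / 57, 2))  # 0.81
```

Either value is well above 10.83, the critical value of chi-square with one degree of freedom at the 0.1% level, which is consistent with the description "highly significant".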
Conclusion
The test described in the paper may be used to classify the students into groups on the basis of their familiarity with the language, as the test is quite simple and can be administered quickly. The test can be used to determine the width (vocabulary, spelling, grammar, idioms), accuracy (correctness of response) and quickness of the responses (familiarity with the language diminishes time, other things being equal) of subjects to long sequences of letters, taken out of context. It can be adapted to experiments in abnormal psychology and psychiatry.
Another variation of the test would be to instruct each subject to go on guessing the next letter until he gets the correct answer. The total number of guesses made by the subject can be used in constructing a score representing his familiarity with the language. A subject who is familiar with the language will make fewer guesses, and such scores will have greater reliability. In the absence of mechanical aids for carrying out such a test, each subject will have to be tested separately.
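One simple way such a score might be built is sketched below. The paper does not specify a scoring rule, so taking the plain sum of guesses is an assumption, and the names and sample responses are hypothetical.

```python
def guesses_needed(ordered_guesses, target):
    """Number of guesses up to and including the first correct one."""
    return ordered_guesses.index(target) + 1

def total_guess_score(items):
    """Sum of guesses over all items; a lower total suggests greater familiarity."""
    return sum(guesses_needed(guesses, target) for guesses, target in items)

# Hypothetical responses of one subject to three items: (guesses in order, correct symbol).
items = [
    (["e", " ", "a"], "a"),  # correct on the third guess
    (["u"], "u"),            # correct at once
    ([" ", "s"], "s"),       # correct on the second guess
]
print(total_guess_score(items))  # 6
```

With a 27-symbol alphabet and no repeated guesses, each item takes between 1 and 27 guesses, so a 200-item test would give totals between 200 and 5400.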
Experiments(1) have been conducted to estimate the upper bounds of the predictability of some languages, for example Swedish(7). The subjects chosen for such experiments were highly intelligent and proficient in the language. If "information" is taken in a quantitative sense (not in the semantic sense of meaning nor in the subjective sense of interest and values), it can be equated to statistical uncertainty. It is clear from our experiments that a text transmits more "information" to a person who is not familiar with the language used. When a person who does not know English (except the alphabet) has to take down a message in English, he has to be equally careful with every letter he takes down. For him (in a certain sense) F must be equated to F0.
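The paper does not define F0. In Shannon's notation(10), which it appears to follow, F0 (written F_0 below) is the entropy per letter when all symbols of the alphabet are treated as equally likely. For the 27-symbol alphabet used here this gives the value below; the inequality is the standard result that no distribution over 27 symbols has higher entropy than the uniform one.

```latex
F_0 = \log_2 27 \approx 4.75 \ \text{bits per letter}, \qquad F \le F_0 .
```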
REFERENCES
1. BRILLOUIN, LEON (1956): Science and Information Theory, 21-27, Academic Press Inc., New York.
2. BURTON, N. G. AND LICKLIDER, J. C. R. (1955): Long-range Constraints in the Statistical Structure of Printed English, American Journal of Psychology, 68, 650-53.
3. FLESCH, RUDOLF (1950): Measuring the Level of Abstraction, Journal of Applied Psychology, 34, 384-90.
4. FRY, D. B. (1960): Linguistic Theory and Experimental Research, Transactions of Philological Society, 13-39.
5. GUILFORD, J. P. (1950): Fundamental Statistics in Psychology and Education, McGraw-Hill Book Co., New York, 154-73.
6. GUILFORD, J. P. (1962): An Informational View of Mind, Journal of Psychological Research, 6, 25-34.
7. HANSSON, H. (1959): The Entropy of the Swedish Language, Transactions of the Second Prague Conference on Information Theory, Czechoslovak Academy of Sciences, Prague, 215-17.
8. LADO, ROBERT (1961): Language Testing, Longmans Green & Co. Ltd., London, 330-41.
9. SHANNON, C. E. (1948): A Mathematical Theory of Communication, Bell System Technical Journal, 27, 379-423.
10. SHANNON, C. E. (1951): Prediction and Entropy of Printed English, Bell System Technical Journal, 30, 50-64.