Computer methods of writer identification:
an application to Tamil handwriting
All India Interdisciplinary Symposium on Recent Trends of Research and Development in
Digital Techniques and Pattern Recognition,
Indian Statistical Institute, Calcutta; also STAT-28/77, February 1977.
Gift Siromoney, M. Chandrasekaran and R. Chandrasekaran.

ABSTRACT

Computer methods of writer identification developed elsewhere for English handwriting are extended to Tamil handwriting.
First, samples of Tamil handwriting from n persons are encoded into numerical form on the basis of over a hundred features of Tamil writing. A given piece of handwriting belonging to any one of the n writers has to be matched with the correct writer. The unknown handwriting is assigned to one of the n writers on the basis of a simple similarity measure defined for pairs of writings. If there are ties, they are resolved by using a weighted similarity measure.
Numerical methods of classifying Tamil handwriting into different main types are proposed.
Results of an experiment carried out by the authors with the aid of an IBM 370/155 computer are discussed.

Computer methods of identifying important features on handwritten bank checks were presented by Nagel and Rosenfeld (1973) at the All India Workshop and Symposium on Digital Image Processing held in Bangalore in 1973. Nagel (1973) has reviewed most of the earlier efforts to apply computer techniques to handwriting analysis. He has also used image enhancement techniques which make the handwriting more readable.

Image enhancement techniques have been applied to Sanskrit inscriptions of the seventh century at the University of Maryland (Siromoney, 1975). More recently, computer methods have been used by the authors for dating medieval Tamil inscriptions (Siromoney et al., 1976).

In many countries document examiners are employed to study properties of handwriting in the languages of that country. Document examiners detect forgeries and also examine concealed handwriting. This paper describes the first attempt at using computer methods of writer identification as applied to a modern Indian script.

We started with a simple experiment to find out how successful an observer is in identifying the writer of a prewritten passage.

A passage in Tamil was copied at normal speed by forty different people on separate sheets of paper. A different passage was also copied at normal speed by six of the forty persons. The forty handwritings were numbered from one to forty and displayed. The six handwritings of the second passage were marked from A to F. Each subject was given one of the six handwritings of the second passage and was asked to match it with one of the forty handwritings displayed. Subjects were allowed to take as much time as they liked. Out of the 108 subjects who took part in the experiment, 54 identified the writer correctly, an average success rate of 50 percent. The success rate for the six passages from A to F varied from 6 percent to 84 percent. All those who took part in the experiment were students of the Madras Christian College.

It is generally believed that each handwriting contains a number of individual features and characteristics, and that a person's handwriting is an excellent measure of identity. However, as the experiment described above shows, it may not be easy to identify the writer's characteristics in every case.

The modern Tamil script is derived from an early script called the Tamil-Brahmi script which goes back to the pre-Christian era. Tamil-Brahmi script closely resembles the Brahmi script of Asoka. There are twelve vowels and eighteen consonants and the relative proportions of these letters have been studied (Siromoney, 1963). Each consonant, in combination with a vowel gives rise to a consonant-vowel pair. The vowel is called uyir or life, the consonant is called mey or body and the consonant-vowel uyirmey, body with life. Compared to the English alphabet, there are more orthographic signs in Tamil and this leads to a larger number of features that can be used in handwriting analysis.

We have chosen 117 features for our study (Tables I and II). Many more features could be used, but some of them would be associated with rare letters. We have also avoided features which are highly correlated. For instance, "the top left-hand corner is smooth" is a feature we could have used separately for the letters ka, sa and tha. We use it as one feature and not as three separate features, because a person who has this characteristic with respect to the letter ka will often have it with respect to sa and tha as well. We have used only those features which are fairly stable within a given passage.

Features were extracted manually from each of the forty handwritings and encoded in binary form. A feature is reckoned to be present if it occurs in more than 75 percent of the cases. Features were numbered from 1 to 117. If the rth feature was present in the handwriting, the rth place in the binary number was "1"; otherwise it was "0". Each binary number had 117 binary positions. However, in our method, the maximum number of ones in any binary number is 34, since some features exclude the simultaneous occurrence of some others.
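The encoding rule described above can be illustrated with a minimal Python sketch. It is not part of the original program: the function name `encode` and the input layout (one true/false observation per occurrence of the relevant letter) are hypothetical conveniences, and the sketch only assumes the "more than 75 percent" rule and the 117 numbered positions stated in the text.

```python
from typing import Dict, List

N_FEATURES = 117   # number of features used in the paper
THRESHOLD = 0.75   # a feature counts as present above this fraction


def encode(observations: Dict[int, List[bool]],
           n_features: int = N_FEATURES) -> List[int]:
    """Return a 0/1 vector; position r-1 holds feature number r.

    `observations` maps a 1-based feature number to one boolean per
    occurrence of the relevant letter (True if the feature was seen).
    This input layout is an assumption made for illustration only.
    """
    vector = [0] * n_features
    for r, seen in observations.items():
        # "More than 75 percent": exactly 75 percent does not qualify.
        if seen and sum(seen) / len(seen) > THRESHOLD:
            vector[r - 1] = 1
    return vector


# Feature 1 seen in 4 of 5 occurrences (80%), feature 2 in 3 of 4 (75%).
sample = encode({1: [True, True, True, True, False],
                 2: [True, True, True, False]})
print(sample[:3])  # prints [1, 0, 0]
```

Only feature 1 is marked present here, because feature 2 reaches exactly 75 percent and the rule requires more than that.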

Our method is an adaptation of the method of Harmon and Sitar summarized by Nagel (1973).

The handwritings of forty writers were stored in the computer as binary numbers. This two-dimensional array will be called the library and denoted as LIB(I,J), where I varies from 1 to 40 and J from 1 to 117. A one-dimensional row vector TEST(J) is formed for each test sample and compared with the library. A similarity function TOT(I) is defined for every writer. It gives the number of matched pairs of ones and ignores the zeros.

$$\mathrm{TOT}(I) = \sum_{J=1}^{N} \mathrm{TEST}(J)\,\mathrm{LIB}(I,J)$$

where I varies from 1 to M, M is the number of samples stored, and N is the number of features. Let TOT(I) be a maximum for I=K. Then the test sample is matched with the Kth writer.
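As an illustration, the unweighted similarity measure and the choice of the writer K might be sketched in Python as follows. The toy library of three writers and six features is made up for the example, and the helper names `tot` and `best_match` are hypothetical; the paper's own implementation is the FORTRAN listing at the end.

```python
from typing import List


def tot(test: List[int], lib: List[List[int]]) -> List[int]:
    """TOT(I): count positions where TEST and row I of LIB both hold a 1."""
    return [sum(t * l for t, l in zip(test, row)) for row in lib]


def best_match(scores: List[int]) -> int:
    """Return the 1-based index K of the largest score.

    If several writers share the maximum, the first one is returned;
    tie handling is deliberately left out of this sketch.
    """
    return max(range(len(scores)), key=lambda i: scores[i]) + 1


# Made-up library: 3 writers, 6 binary features each.
lib = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 0, 0],
]
test = [0, 1, 1, 0, 0, 1]

scores = tot(test, lib)
print(scores, best_match(scores))  # prints [1, 3, 1] 2
```

Positions where either vector holds a zero contribute nothing to the sum, which is what "ignores the zeros" means in practice.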

We present here the results of this experiment. All six test samples were identified correctly. The three highest values of the similarity function for each test sample are given in Table III. We note that there is a clear difference between the first and the next ranks. This shows that there is a lot of scope for further work in computerizing the work of document examiners who work with manuscripts in Indian languages.
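The gap between the first and the next ranks can be checked directly from the figures printed in Table III. The Python sketch below copies those figures and computes, for each test sample, the best value minus the largest of the other two listed values. One caveat is an assumption worth stating: the table lists only three library samples per test sample, so the weighted margins are relative to those listed rows, not necessarily to all forty writers.

```python
# test sample -> [(library sample, TOT, WTOT), ...], copied from Table III;
# the first row of each list is the correctly identified writer.
TABLE_III = {
    1: [(13, 30, 63), (11, 22, 38), (31, 22, 43)],
    2: [(3, 29, 68), (29, 23, 52), (1, 22, 46)],
    3: [(11, 31, 58), (6, 25, 47), (9, 24, 44)],
    4: [(10, 31, 68), (14, 23, 45), (29, 23, 50)],
    5: [(14, 29, 68), (10, 24, 49), (5, 20, 42)],
    6: [(9, 30, 64), (11, 24, 43), (10, 23, 44)],
}


def margin(rows, column):
    """Top-row value minus the largest value among the remaining rows."""
    values = [row[column] for row in rows]
    return values[0] - max(values[1:])


tot_margins = {k: margin(rows, 1) for k, rows in TABLE_III.items()}
wtot_margins = {k: margin(rows, 2) for k, rows in TABLE_III.items()}

print(tot_margins)   # prints {1: 8, 2: 6, 3: 6, 4: 8, 5: 5, 6: 6}
print(wtot_margins)  # prints {1: 20, 2: 16, 3: 11, 4: 18, 5: 19, 6: 20}
```

On these figures the unweighted margin is never smaller than 5 matched features, and the weighted margin among the listed rows is never smaller than 11.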

We now describe a second method where the similarity function is defined in a different way. A weighted similarity measure will have greater discriminatory power. In the earlier method all features were given equal importance. Since rare features will be better discriminators than common features, this fact can be built into the weighted similarity function WTOT(I). In general WTOT(I) will be a better measure than TOT(I).

$$\mathrm{WTOT}(I) = \sum_{J=1}^{N} \mathrm{TEST}(J)\,\mathrm{LIB}(I,J)\,\mathrm{WEIGT}(J)$$

where I varies from 1 to M, M is the number of samples stored, and N is the number of features. In our experiment we have used 40 samples belonging to 40 writers. If a feature is present in all the samples we reject it. Otherwise we assign a suitable weight based on the dyadic (base 2) logarithm of the number of samples in which the feature occurs (Table IV).

Instead of giving weights ranging from 1 to 6 one may use other schemes to give weights to the features. If the number of samples exceeds 64 or 2⁶, the weights will vary from 1 to 7 and so on. As in TOT(I), WTOT(I) counts the number of matched pairs of ones giving suitable weights but ignoring the zeros.
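The weighting scheme of Table IV amounts to 6 minus the integer part of the base 2 logarithm of the number of samples in which a feature occurs. A minimal Python sketch follows, under stated assumptions: the names `weight` and `feature_weights` are hypothetical, and treating a "rejected" feature (present in every sample) as weight 0 is one reading of the text, whereas Table IV itself assigns weight 1 to the whole range 32 to 40.

```python
from typing import List


def weight(k: int, max_weight: int = 6) -> int:
    """Weight for a feature occurring in k library samples (Table IV)."""
    if k < 1:
        raise ValueError("feature must occur in at least one sample")
    # k.bit_length() - 1 equals floor(log2(k)) for any positive integer k.
    return max_weight - (k.bit_length() - 1)


def feature_weights(lib: List[List[int]],
                    reject_universal: bool = True) -> List[int]:
    """Per-feature weights for a 0/1 library (rows = writers).

    Assumption: features absent from every sample, and (optionally)
    features present in every sample, get weight 0.
    """
    m = len(lib)
    weights = []
    for column in zip(*lib):
        k = sum(column)
        if k == 0 or (reject_universal and k == m):
            weights.append(0)
        else:
            weights.append(weight(k))
    return weights


print([weight(k) for k in (1, 2, 3, 4, 7, 8, 15, 16, 31, 32, 40)])
# prints [6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1]
```

The `max_weight` parameter is left adjustable so that the same rule covers larger libraries, where the text says the weights would run from 1 to 7.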

Let WTOT(I) be a maximum for I=Q. Then the test sample is attributed to the Qth writer.
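One way to combine the two measures, following the abstract's statement that ties in the simple measure are resolved by the weighted one, is sketched below in Python. The function name `identify`, the toy data and the toy weights are all made up for illustration; the printed program instead reports both sets of totals and lists any ties.

```python
from typing import List


def identify(test: List[int], lib: List[List[int]],
             weights: List[int]) -> int:
    """Return the 1-based writer index chosen for the test sample.

    The unweighted total TOT(I) decides; if several writers share the
    maximum, the one with the largest weighted total WTOT(I) is taken.
    """
    tot = [sum(t * l for t, l in zip(test, row)) for row in lib]
    wtot = [sum(t * l * w for t, l, w in zip(test, row, weights))
            for row in lib]
    top = max(tot)
    tied = [i for i, score in enumerate(tot) if score == top]
    best = max(tied, key=lambda i: wtot[i])
    return best + 1


# Made-up example: writers 1 and 2 tie on TOT (2 each).
lib = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
weights = [1, 2, 3, 1]   # hypothetical weights, rarer features score higher
test = [1, 1, 1, 0]

print(identify(test, lib, weights))  # prints 2
```

Writers 1 and 2 both match two features, but writer 2 matches the rarer one (weighted totals 3 and 4), so the tie is resolved in favour of writer 2.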

In our experiment the writer of every one of the test samples was correctly identified. This once again shows that our method is basically sound and that it can be used to assist professional document examiners. The results are tabulated in Table III.

We also used cluster analytic techniques in order to classify Tamil handwriting into different types but the kind of features used in our study were found unsuitable. A smaller number of features of a different kind might lead to satisfactory classification of Tamil handwriting into major types.

The work was done on an IBM 370/155 computer as well as on an IBM 1130 computer at Madras.

REFERENCES

Nagel, R.N., (1973), Computer screening of handwritten signature: A proposal, TR-220, Computer Science Center, University of Maryland, College Park.
Nagel, R.N. and Rosenfeld, A., (1973), Steps towards signature verification, All India Workshop and Symposium on Digital Image Processing, Bangalore.
Siromoney, G., (1963), Entropy of Tamil prose, Information and Control, 6, 297-300.
Siromoney, G., (1975), Computer techniques of image enhancement in the study of a Pallava Grantha inscription, Studies in Indian Epigraphy, 2, 55-58.
Siromoney, G., Chandrasekaran, M. and Chandrasekaran, R., (1976), Computer methods of dating medieval Tamil inscriptions, (Submitted for publication).

TABLE I

SOME FEATURES OF MODERN TAMIL SCRIPT USED IN THIS STUDY

No.	Letter	Feature
1	a	Equal vertical arms present at the T-junction.
2	a	Lower vertical arm very long.
3	a	Upper vertical arm reduced to a point.
4	a	Loop instead of upper vertical arm.
5	aa	Loop present at bottom right corner.
6	aa	Bottom right corner ends in horizontal position.
7	i	From starting point moves down and right.
8	i	Loop present at the start.
9	ii	Both points are dots.
10	ii	Both points are closed circles.
11	u	Cusp present on the lower left side.
12	u	Smooth curve on the lower left corner.
13	u	Sharp corner present on the lower left position.
14	uu	Lengthening symbol within the horizontal arm.
15	uu	Lengthening symbol with long right limb moved right.
16	e	Starting point below top line.
17	e	Loop present on the left side.
18	o	Smooth C-curve at the first turning.
19	o	Cusp present at the second turning.
20	o	Cusp present at the first turning.
21	o	Smooth D-curve present at the second turning.

TABLE II

LIST OF FEATURES

TABLE III

THE THREE HIGHEST VALUES OF SIMILARITY MEASURES FOR TEST SAMPLES

Test Sample Number	Library Sample Number	Similarity Measure TOT(I)	Weighted Measure WTOT(I)	Identity of Writer (True)	Identity of Writer (Computed)
1	13	30	63	13	13
1	11	22	38
1	31	22	43
2	3	29	68	3	3
2	29	23	52
2	1	22	46
3	11	31	58	11	11
3	6	25	47
3	9	24	44
4	10	31	68	10	10
4	14	23	45
4	29	23	50
5	14	29	68	14	14
5	10	24	49
5	5	20	42
6	9	30	64	9	9
6	11	24	43
6	10	23	44

TABLE IV

ASSIGNMENT OF WEIGHT TO DIFFERENT FEATURES

Number of samples in which a feature occurs	Weight assigned to a feature
1	6
2 - 3	5
4 - 7	4
8 - 15	3
16 - 31	2
32 - 40	1

$JOB SR02, KP=26, PAGES=50, TIME=5
WATF IV 8441
C     THIS SUBROUTINE CHECKS IF THERE ARE ANY TIES
      SUBROUTINE TICHE(LTOT,ITIES,M,IP)
      DIMENSION LTOT(40),ITIES(40)
C     LTOT = COMPARISON TOTAL
C     M = NUMBER OF PASSAGES
C     IP = OUTPUT VALUE. IT IS ONE IF THERE ARE ANY TIES, ZERO OTHERWISE
C     FIND THE LARGEST COMPARISON TOTAL
      MAX=1
      DO 10 I=1,M
      IX=LTOT(I)
      IF(IX-MAX)10,10,12
   12 MAX=IX
   10 CONTINUE
C     COLLECT EVERY PASSAGE WHOSE TOTAL EQUALS THE MAXIMUM
      MN=0
      DO 13 I=1,M
      IF(MAX-LTOT(I))13,14,13
   14 MN=MN+1
      ITIES(MN)=I
   13 CONTINUE
      PRINT 950
      PRINT 920,(I,LTOT(I),I=1,M)
      IF(MN-1)15,16,15
   16 IT=ITIES(1)
      PRINT 930,IT
      IP=0
      GO TO 1000
   15 PRINT 940,(ITIES(I),I=1,MN)
      IP=1
  950 FORMAT(///T40,'THE COMPARISON TOTALS ARE GIVEN BELOW')
  920 FORMAT(/T50,'PASSAGE NO.',I3,2H =,I4)
  930 FORMAT(///T35,'THE GIVEN SAMPLE IS CLOSE TO THE PASSAGE NO.',I4)
  940 FORMAT(///T37,'THERE ARE TIES BETWEEN THE FOLLOWING PASSAGES.',
     1/(///T53,'PASSAGE NO.',I4))
 1000 CONTINUE
      RETURN
      END

C     IK = NUMBER OF OCCURRENCES OF A CHARACTER
C     IMAS = ASSIGNED WEIGHT
      SUBROUTINE MASS(IK,IMAS)
C     STEP IJ THROUGH 0,1,2,... UNTIL 2**IJ REACHES OR EXCEEDS IK
      IJ=-1
   25 IJ=IJ+1
      IN=2**IJ
      IF(IK-IN)26,27,25
C     IK LIES BELOW 2**IJ
   26 IMAS=7-IJ
      GO TO 28
C     IK IS EXACTLY 2**IJ
   27 IMAS=6-IJ
   28 CONTINUE
      RETURN
      END

C     MAIN PROGRAM TO DETERMINE THE PASSAGE WHICH IS CLOSE TO THE GIVEN
C     SAMPLE PASSAGE.
      INTEGER TEST(120),WEIGT(120),TIES(40),TOT(40),WTOT(40),TOTAL,TOTGW
      DIMENSION LIB(40,120)
C     M = NUMBER OF PASSAGES.
C     N = NUMBER OF DIFFERENT LETTERS.
C     L = NUMBER OF TEST SAMPLE PASSAGES
      READ 100,M,N,L
      DO 500 I=1,M
      READ 110,(LIB(I,J),J=1,N)
  500 CONTINUE
C     COUNT THE PASSAGES IN WHICH EACH FEATURE OCCURS
      DO 10 J=1,N
      IS=0
      DO 11 I=1,M
   11 IS=IS+LIB(I,J)
   10 WEIGT(J)=IS
      PRINT 125
  125 FORMAT('1')
C     ASSIGN WEIGHT (BASED ON DYADIC LOGARITHMIC MEASURE)
      IMAS=0
      DO 23 II=1,N
      IK=WEIGT(II)
      CALL MASS(IK,IMAS)
      WEIGT(II)=IMAS
   23 CONTINUE
C     COMPARE EACH TEST SAMPLE WITH EVERY PASSAGE IN THE LIBRARY
      DO 1000 IJK=1,L
      PRINT 112,IJK
      READ 110,(TEST(II),II=1,N)
      DO 30 I=1,M
      TOTAL=0
      TOTGW=0
      DO 35 J=1,N
      TOTAL=TOTAL+LIB(I,J)*TEST(J)
      TOTGW=TOTGW+LIB(I,J)*TEST(J)*WEIGT(J)
   35 CONTINUE
      TOT(I)=TOTAL
      WTOT(I)=TOTGW
   30 CONTINUE
      IP=0
      PRINT 115
      CALL TICHE(TOT,TIES,M,IP)
      PRINT 120
      CALL TICHE(WTOT,TIES,M,IP)
 1000 CONTINUE
  100 FORMAT(3I3)
  110 FORMAT(80I1)
  112 FORMAT(///T45,'TEST SAMPLE PASSAGE NO.',I4)
  115 FORMAT(//T45,'GENERAL COMPARISON OF TEST(J) WITH LIB(I,J).')
  120 FORMAT(//T15,'COMPARISON OF CHARACTERS OF TEST(J) WITH LIB(I,J)',
     1' GIVEN THE CHARACTERS.')
      STOP
      END
$ENTRY