MRC Psycholinguistic Database - User Manual Version 1.

SECTION 1 AN INTRODUCTION AND OVERVIEW

1.1 Aims of the Project

A database of information concerning various properties of a large number of English words has been established on computer. The database consists of 98,538 English words. For every one of these words, the database contains information about the word's spelling, syntactic category, and number of letters. For various large subsets of the total set of words, additional forms of information are available: phonetic transcription, number of phonemes, number of syllables, stress pattern, frequency of occurrence, imageability, concreteness, familiarity, meaningfulness and age of acquisition. Information about word associations is also available. Access to and manipulation of these various properties of the words in the database is achieved by submitting jobs written in a specially designed and highly simplified access language.

There are various uses to which this database may be put. Two general uses are:

(a) Selection of sets of words for use in psycholinguistic experimentation: it is common for such selection to operate using many criteria simultaneously (e.g. one might want a matched set of nouns and verbs, with matching on number of letters, number of syllables, and frequency of occurrence), and such selection is at best time-consuming and at worst impossible without computer assistance. The database access language has been written with this kind of matched selection very much in mind.

(b) Direct analysis of the linguistic information combined in the database: one can explore relationships between various linguistic properties, between spelling and pronunciation or between stress pattern and syntactic category, for instance.

1.2 Origins

The database was developed by combining information from a variety of sources. A major source was the Associative Thesaurus, a computerised database of word associations developed by G. Kiss and his associates at the Medical Research Council's Speech and Communication Research Unit (Kiss, Armstrong, Milroy and Piper, 1973). Another important source was a database consisting of a dictionary of phonetic transcriptions transferred to magnetic tape from Daniel Jones' Pronouncing Dictionary of the English Language (12th edition): this task was carried out by Professor L. Guierre (see Guierre, 1966), who kindly provided a copy of his tape. Professor A.L. Paivio provided a tape containing sets of ratings of words on imageability, concreteness, familiarity and meaningfulness, these ratings being extensions of the sets published by Paivio, Yuille and Madigan (1969). Dr. K.L. Gilhooly provided a tape containing the sets of ratings published by Gilhooly and Logie (1980). A tape containing the Colorado norms of imageability, concreteness, familiarity and meaningfulness (Toglia and Battig, 1978) was provided by Professor Battig's associates. Finally, the word-frequency counts of Kucera and Francis (1967) and Thorndike and Lorge (1942) and the Shorter Oxford English Dictionary database produced by Dolby, Resnikoff and MacMurray (1963) were incorporated; these three databases had been part of the Edinburgh Associative Thesaurus.

1.3 The Database

The database will be described briefly here; it is described in detail in Section 2. It consists of three files: the DICT file, the S-R file, and the R-S file. The DICT file contains entries for 98,538 words, with information about spelling, syntactic category, number of letters, phonemes and syllables, phonetic transcription, stress pattern, frequency of occurrence, imageability, concreteness, familiarity, meaningfulness, and age of acquisition of the words, although, as noted in 1.1 (and see Table 2A), not all of these items of information are available for every word. The S-R file consists of the word-association responses given by 100 subjects to 8,210 stimulus words. The R-S file contains the same information as the S-R file, organised in a different way: there are 22,086 words in the R-S file, and for each word there is a list of all the stimuli to which that word was given as a response when the word-association data were collected.

1.4 The Access Language

This will be described briefly here; it is described in detail in Section 3. The language has been written especially for this project; it is based upon the ADABAS language ADACOM.

Jobs written in this language begin with a READ statement, which refers to one of the three database files (DICT, S-R or R-S), thus specifying which file the user wishes to access. A second function of the READ statement is that it can include specifications of which categories of word the user wishes to access (all nouns, for example, or all words with more than seven letters, or all words with a frequency of occurrence exceeding 50 in the Kucera-Francis norms).

Further selection of words, in terms of the presence or absence of specific letters, phonemes or syllables, can be accomplished by the ACCEPT IF and REJECT IF statements.

Some forms of calculation can be carried out using the COMPUTE statement.

The words selected can be ordered in terms of any of their properties (e.g. ordered in terms of their concreteness value or their number of syllables) by using the SORT statement.

Finally, once a set of words has been selected and (if desired) sorted, they are output using the DISPLAY statement.

To illustrate the way the language works, here is a sample job:

    READ IN DICT WHERE WTYPE EQ 'N' AND NLET LE 7
    REJECT IF LETTERS EQ 'A'
    SORT K-F-FREQ
    DISPLAY WORD IMAG K-F-FREQ

This job selects initially all the nouns in the DICT file which have seven letters or fewer. It then rejects from this set of words all the words containing the letter A. The remaining words are sorted in order of their frequency values in the Kucera-Francis norms, and printed in this order. With each word is printed its imageability value and its Kucera-Francis frequency.
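For readers more used to general-purpose languages, the logic of this job can be sketched in Python. This is only an illustration and not part of the access language: the five word records and all their values are invented for the example, the record layout is hypothetical, and an ascending sort order is assumed. The sketch reads LE as "less than or equal to".

```python
# Hypothetical miniature stand-in for the DICT file; all values are invented.
DICT = [
    {"WORD": "DOG", "WTYPE": "N", "NLET": 3, "IMAG": 636, "K-F-FREQ": 75},
    {"WORD": "CAT", "WTYPE": "N", "NLET": 3, "IMAG": 617, "K-F-FREQ": 23},
    {"WORD": "RUN", "WTYPE": "V", "NLET": 3, "IMAG": 500, "K-F-FREQ": 212},
    {"WORD": "TRUTH", "WTYPE": "N", "NLET": 5, "IMAG": 300, "K-F-FREQ": 126},
    {"WORD": "ELEPHANT", "WTYPE": "N", "NLET": 8, "IMAG": 616, "K-F-FREQ": 7},
]


def run_sample_job(records):
    # READ IN DICT WHERE WTYPE EQ 'N' AND NLET LE 7
    selected = [r for r in records if r["WTYPE"] == "N" and r["NLET"] <= 7]
    # REJECT IF LETTERS EQ 'A'
    selected = [r for r in selected if "A" not in r["WORD"]]
    # SORT K-F-FREQ (ascending order is an assumption)
    selected.sort(key=lambda r: r["K-F-FREQ"])
    # DISPLAY WORD IMAG K-F-FREQ
    return [(r["WORD"], r["IMAG"], r["K-F-FREQ"]) for r in selected]


for row in run_sample_job(DICT):
    print(*row)
```

Each statement of the job narrows or reorders the set produced by the statement before it, which is why the READ criteria come first and DISPLAY comes last.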

1.5 Implementation

The system has been implemented on the IBM 370/165 computer at the Cambridge Computer Centre. There the database has been set up and will be maintained by the Adabas Software Package. User access to the database is by punched cards or via a telephone modem link using the Cambridge Computer Centre job control language Phoenix. Output can be listed at the user's own terminal, if he has one, or on lineprinter at Cambridge, in which case the lineprinter output is returned to the user by post.

Section 2

2.2.5 PHON, DPHON, and RDPHON

The 12th edition of Daniel Jones' Pronouncing Dictionary (Jones, 1963) was transferred to magnetic tape by Professor L. Guierre (Guierre, 1966). The phonetic symbols used on this tape, and their equivalents in the International Phonetic Alphabet, are listed in Table 2.D.

TABLE 2.D: The database phonetic alphabet

VOWELS

IPA symbol   Example       Database symbol
iː           bean          I.
ɑː           barn          A.
ɔː           born          C.
uː           boon          U.
ɜː           burn          6.
ɪ            pit           I
e            pet           E
æ            pat           A
ʌ            putt          4
ɒ            pot           C
ʊ            put           U
ə            another       6
eɪ           bay           EI
aɪ           buy           AI
ɔɪ           boy           CI
əʊ           [illegible]   OU
aʊ           now           AU
ɪə           peer          I6
eə           pair          36
ʊə           poor          U6

CONSONANTS

IPA symbol   Example       Database symbol
p            put           P
b            but           B
t            ten           T
d            den           D
k            can           K
m            man           M
n            not           N
l            like          L
r            run           R
f            full          F
v            very          V
s            some          S
z            zeal          Z
h            hat           H
w            went          W
g            game          G
tʃ           chain         [illegible]
dʒ           Jane          D7
ŋ            long          9
θ            thin          8
ð            then          5
ʃ            ship          3
ʒ            measure       7
j            yes           J
l̥            devoiced l    L(
m̥            devoiced m    M(
n̥            devoiced n    N(
r̥            devoiced r    R(

The difference between PHON and DPHON is that PHON includes syllable boundary indicators whilst DPHON consists only of phonetic symbols (e.g. ABATE is 6/BEIT versus 6BEIT). RDPHON is DPHON in reverse order.
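The relationship between the three fields can be illustrated with a short Python sketch. It rests on two assumptions that go beyond the text: that "/" is the only syllable boundary indicator, as in the ABATE example, and that "reverse order" means reversing the string character by character. The function names are hypothetical.

```python
def dphon_from_phon(phon: str) -> str:
    """Drop the syllable boundary indicator '/' from a PHON string."""
    return phon.replace("/", "")


def rdphon_from_dphon(dphon: str) -> str:
    """Reverse a DPHON string character by character (an assumption)."""
    return dphon[::-1]


phon = "6/BEIT"  # ABATE, as given in the text
dphon = dphon_from_phon(phon)
print(phon, dphon, rdphon_from_dphon(dphon))
```

A reversed transcription of this kind is convenient when words are to be grouped by how they end, since a search on the start of RDPHON is a search on the end of the word.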

Section 3

3.2.7 (cont'd)

This is unlikely to suit you, because the second READ will always start at the beginning of the DICT file, and will always then select the first verb in the file which matches in frequency the noun selected by the first READ statement. This introduces a nonrandomness which is likely to be unacceptable. To overcome it, you would want the second READ statement to start at random locations in the DICT file, thus:

    READ IN DICT WHERE WTYPE EQ 'N'
    LIMIT 1 RANDOM ISN2
    READ IN DICT WITH ISN STARTING FROM ISN2 WHERE WTYPE EQ 'V' AND K-F-FREQ EQ K-F-FREQ(1000)

Note the subscripting of ISN. A random starting point in the first READ statement is referred to as ISN1, and in the second READ statement as ISN2. There remains a slight problem: as mentioned before, when wraparound occurs with a READ statement which started at a random point (i.e. when the end of the DICT file is reached before the required number of words has been selected, so that the search proceeds on from the beginning of the file) it can happen that slightly fewer than the required number of words will be selected. In the example above, where only one word is required in the second READ statement, this could mean that when one search for a verb reaches the end of the file with no verb having yet been found, no verb, rather than one, will be selected. The solution, as mentioned above, is to use PRINT thus:

    READ IN DICT WHERE WTYPE EQ 'N'
    LIMIT 5 RANDOM ISN2 PRINT 1
    READ IN DICT WITH ISN STARTING FROM ISN2 WHERE WTYPE EQ 'V' AND K-F-FREQ EQ K-F-FREQ(1000)

Now, each time the first READ statement finds a noun, the second READ statement will start at a random point in the DICT file and select the first 5 verbs which it finds which have the same frequency as the noun - unless wraparound occurs, in which case slightly fewer than 5 verbs may be selected. From these 5 or slightly fewer verbs, the PRINT statement selects one. (If you had written PRINT 2, you would get 2 matching verbs for each noun). If no verb can be found which matches the noun on K-F-FREQ, control returns to the first READ statement, the next noun is selected, and the procedure continues.
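The control flow just described can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of how the system is implemented: the record layout, the key name KF (standing for K-F-FREQ), the function names and the demo values are all invented, PRINT is assumed to choose at random among the candidates, and the sketch always searches the whole file once, so it does not reproduce the "slightly fewer" behaviour on wraparound.

```python
import random


def scan_from(records, start, predicate, limit):
    """Scan the file once from `start`, wrapping round to the beginning,
    and collect up to `limit` records that satisfy `predicate`."""
    n = len(records)
    found = []
    for offset in range(n):
        rec = records[(start + offset) % n]
        if predicate(rec):
            found.append(rec)
            if len(found) == limit:
                break
    return found


def match_verbs(records, limit=5, keep=1, rng=random):
    """For each noun, pick `keep` frequency-matched verbs (LIMIT 5, PRINT 1)."""
    pairs = []
    for noun in records:  # first READ: every noun in file order
        if noun["WTYPE"] != "N":
            continue
        start = rng.randrange(len(records))  # RANDOM ISN2
        candidates = scan_from(
            records, start,
            lambda r: r["WTYPE"] == "V" and r["KF"] == noun["KF"],
            limit,
        )
        # With no candidates nothing is kept and the next noun is tried.
        for verb in rng.sample(candidates, min(keep, len(candidates))):
            pairs.append((noun["WORD"], verb["WORD"]))
    return pairs


# Invented demo records; KF values are not real Kucera-Francis counts.
demo = [
    {"WORD": "HOUSE", "WTYPE": "N", "KF": 10},
    {"WORD": "WALK", "WTYPE": "V", "KF": 10},
    {"WORD": "CARRY", "WTYPE": "V", "KF": 10},
    {"WORD": "ZEBRA", "WTYPE": "N", "KF": 99},
    {"WORD": "ABUT", "WTYPE": "V", "KF": 3},
]
print(match_verbs(demo, rng=random.Random(0)))
```

In the demo, ZEBRA has no verb of matching frequency, so it contributes no pair and the loop simply moves on to the next noun.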

Take this example to extremes now: suppose you wanted to match the nouns and verbs not only on frequency, but also on number of letters, number of syllables, and imageability. This could be done as follows:

    READ IN DICT WHERE WTYPE EQ 'N'
    LIMIT 5 RANDOM ISN2 PRINT 1
    READ IN DICT WITH ISN STARTING FROM ISN2 WHERE IMAG EQ IMAG(1000) AND K-F-FREQ EQ K-F-FREQ(1000) AND NLET EQ NLET(1000) AND NSYL EQ NSYL(1000) AND WTYPE EQ 'V'

Note the ordering of criteria in the second READ statement. Obviously there will be far fewer words in the DICT file which match the noun on IMAG than there are verbs in the DICT file, so less time will be consumed by the second READ statement if it tests for verbness only after finding a match on IMAG rather than the other way around. A second point to be noted is that nouns and verbs which have no IMAG ratings will be treated as matching on IMAG, and similarly with NSYL; and this is obviously unsatisfactory. It can be avoided by replacing the first READ statement with:

    READ IN DICT WHERE WTYPE EQ 'N' AND IMAG NE 0 AND NSYL NE 0

Now every noun selected by the first READ statement will have a value for IMAG and NSYL.

This example may look nice, but it has two problems. The first is that it may occupy a great deal of computer time. Since the matching criteria in the second READ statement are so stringent, each time the second READ statement is executed a large extent of the DICT file will need to be searched - sometimes all of it, when there is no matching verb for a noun - and these repeated lengthy searches of the DICT file, one search for every noun selected by the first READ statement, may add up to so much time that you cannot run the job to its completion. A second, related, problem is that, if you decide that the matching criteria are too stringent, you may want to relax them, by requiring, say, that K-F-FREQ matches give or take 10. With single READ statements this can be done using RANGE (see Section 3.2.2). At present, unfortunately, we have been unable to find a way of using RANGE in a second READ statement, so only exact matching can be requested in the second READ statement. We hope to introduce inexact matching in the second READ statement at some future date.
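Two of the points made above, the ordering of criteria and the spurious matching of words that lack ratings, can be illustrated with a short Python sketch. It assumes that a missing rating is stored as 0, which is what the NE 0 tests suggest. The record layout, the key name KF (standing for K-F-FREQ), the function names and the two demo records are invented for the example.

```python
def matches(noun, cand):
    """Exact match on four properties, with the verb test last.

    `and` stops at the first failing test, so putting the most selective
    criterion (IMAG) first means most candidates are discarded after a
    single comparison."""
    return (cand["IMAG"] == noun["IMAG"]
            and cand["KF"] == noun["KF"]
            and cand["NLET"] == noun["NLET"]
            and cand["NSYL"] == noun["NSYL"]
            and cand["WTYPE"] == "V")


def has_ratings(rec):
    """Equivalent of IMAG NE 0 AND NSYL NE 0 (0 assumed to mean 'missing')."""
    return rec["IMAG"] != 0 and rec["NSYL"] != 0


# Invented records: neither word has an IMAG or NSYL value.
noun = {"WORD": "GIST", "WTYPE": "N", "IMAG": 0, "KF": 5, "NLET": 4, "NSYL": 0}
verb = {"WORD": "ABUT", "WTYPE": "V", "IMAG": 0, "KF": 5, "NLET": 4, "NSYL": 0}

print(matches(noun, verb))   # the two 'missing' values compare as equal
print(has_ratings(noun))     # the extra test screens such nouns out
```

Screening the nouns is enough: once every selected noun has nonzero IMAG and NSYL, a verb with missing values can no longer match it.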

Selecting genuinely random samples of words

Suppose you want 50 four-letter nouns, and you want these to be a genuinely random sample from all of the four-letter nouns in the DICT file. Random sampling with replacement can be achieved thus:

    LIMIT 50
    READ IN DICT
    LIMIT 5 RANDOM ISN2 PRINT 1
    READ IN DICT WITH ISN STARTING FROM ISN2 WHERE NLET EQ 4 AND WTYPE EQ 'N'

Note the dummy use of the first READ statement; all it does is make sure that the second READ statement is executed 50 times.
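As a rough Python analogue of this job, the sketch below repeats a random-start search 50 times and keeps one word from each search. The record layout, the function names and the five demo words are invented, PRINT 1 is assumed to choose at random among the candidates found, and no claim is made that the draws are uniform over the qualifying words.

```python
import random


def random_sample(records, predicate, n=50, limit=5, rng=random):
    """Draw `n` words with replacement, one per random-start search."""
    size = len(records)
    sample = []
    for _ in range(n):  # the dummy outer READ under LIMIT 50
        start = rng.randrange(size)  # RANDOM ISN2
        candidates = []
        for offset in range(size):  # search with wraparound
            rec = records[(start + offset) % size]
            if predicate(rec):
                candidates.append(rec)
                if len(candidates) == limit:  # LIMIT 5
                    break
        if candidates:
            sample.append(rng.choice(candidates))  # PRINT 1
    return sample


def is_four_letter_noun(rec):
    return rec["NLET"] == 4 and rec["WTYPE"] == "N"


# Invented demo records.
demo = [
    {"WORD": "BOOK", "WTYPE": "N", "NLET": 4},
    {"WORD": "WALK", "WTYPE": "V", "NLET": 4},
    {"WORD": "TREE", "WTYPE": "N", "NLET": 4},
    {"WORD": "HOUSE", "WTYPE": "N", "NLET": 5},
    {"WORD": "LAMP", "WTYPE": "N", "NLET": 4},
]
words = [r["WORD"] for r in random_sample(demo, is_four_letter_noun,
                                          rng=random.Random(0))]
print(len(words), sorted(set(words)))
```

Because each of the 50 searches is independent of the others, the same word can be drawn more than once, which is what sampling with replacement means here.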

3.3 The COMPUTE statement

This statement performs mathematical calculations. Its form is as follows:

    COMPUTE ANSWx (Ny) value1 value2 ...

Here x is a unique integer identifier: you can do various COMPUTEs and their answers will be uniquely identified as ANSW1, ANSW2, and so on. (Ny) specifies the format of the computed answer, e.g. (N4) means up to four places to the left of the decimal point and none after, (N6.3) means six places before and three after, and so on. Here the sum of the number to the left of the decimal point and the number to the right of the decimal point must not exceed 15, so (N15), (N0.15) and (N8.7) are all right, but (N16), (N0.16) or (N7.9) are not. Value1, value2, etc. are numerical constants or the names of numerical word properties. For example:

    COMPUTE ANSW1 (N1.3) NLET / NPHON

gives you the number of letters per phoneme for each word in the set of words to which COMPUTE is applied. If one is carrying out computations whose results are being summed, it is necessary of course to set to zero first the location into which the results to be summed are added. This is done by using the MOVE statement. For example, if the data are being summed in ANSW1, one can initially set ANSW1 to zero by:

    MOVE 0 TO ANSW1 (N6)

where one assumes the ultimate value of ANSW1 may be as large as 999,999 with no decimal fraction. Suppose you were interested in the product-moment correlation between imageability and concreteness for all the function words contained in the Kucera-Francis norms. This could be obtained as follows:
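Purely as an illustration of the arithmetic such a job has to carry out, and not as a job in the access language, the running-sum approach can be sketched in Python. The function name and the rating pairs are invented for the example, and the formula is the standard computational form of the product-moment correlation, which may differ from the manual's own method.

```python
import math


def product_moment(pairs):
    """Product-moment correlation computed from running sums."""
    # Every accumulator is set to zero before summing begins,
    # which is the role MOVE 0 TO ANSWx plays in a job.
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2)
                            * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Invented (imageability, concreteness) pairs, not taken from the database.
ratings = [(200, 210), (300, 290), (400, 420), (500, 480)]
print(round(product_moment(ratings), 3))
```

Six accumulators are needed in all, which is why a job of this kind zeroes several ANSW locations before the first word is read.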

MEANC and MEANP (Meaningfulness)

Two sets of ratings of meaningfulness were available: PAIV and COLO. However, unlike ratings of the other variables, the intercorrelation here was low (+ .529) and it is clear that this is because the two sets of ratings were collected with different kinds of instructions (see Section 2.2.16). Therefore, no attempt was made to pool the two sets of meaningfulness ratings, and hence there are two different sets of meaningfulness strings in the database: MEANC and MEANP.