ASA PRESSROOM

Acoustical Society of America 
134th Meeting Lay Language Papers


Using Unconscious Linguistic Knowledge to Perceive Acoustically Ambiguous Speech Sounds

Elliott Moreton elliott@linguist.umass.edu
Department of Linguistics
University of Massachusetts
Amherst, MA 01003

Popular version of paper 2aSC4
Presented Tuesday afternoon, December 2, 1997
134th ASA Meeting, San Diego, CA
Embargoed until December 2, 1997

In every language there are tight restrictions on how sounds can be combined to make words. English,
for example, has the sounds /p/ and /f/, but "pfropf" is not a possible word of English. This is not
because it is unpronounceable -- German speakers can pronounce it (it means "stopper") -- but because
native English speakers reject *any* word which begins or ends with the consonant cluster /pf/.
Without having to think about it consciously, people seem to know which sequences their native
language allows and which ones it forbids -- the "phonotactics" of their language. The aim of this
experiment is to find out *how* they know.

There are two main views on this question. One, popular among linguists, is that knowing a language
involves unconsciously knowing a set of rules -- in this case, phonotactic rules that forbid particular
sound sequences like /pf/. Another, originating with the school of "connectionist" psychology, holds that
there are no such rules; English speakers find /pfropf/ objectionable simply because it is too different
from all of the thousands of words in their vocabulary, none of which contains /pf/. In this theory,
phonotactics emerges from the statistics of English vocabulary, organized as a neural network
(McClelland and Elman's TRACE model).

In order to test the competing theories, we exploit an effect discovered by Massaro and Cohen in 1983:
listeners seem to use phonotactics in deciding how to interpret an acoustically vague sound.

It is possible, for instance, to digitally synthesize sounds that are acoustically between /r/ and /l/, ranging
from very /r/-like, to middling, to very /l/-like, and to insert these ambiguous sounds into words.
Massaro and Cohen presented English speakers with syllables containing such sounds, and asked them to
say which ones sounded more like /r/ and which more like /l/. They found that the boundary between
"more like /r/" and "more like /l/" depended on what came before the ambiguous sound: after /t/, the
boundary was very close to /l/, while after /s/, it was very close to /r/.

(1) The Massaro-Cohen experiment

In English, /l/ is forbidden after /t/. Similarly, /r/ is not permitted after /s/. People's judgments of
the ambiguous sounds were distributed like this:
 
/tl/-/tr/:  /l/ -- more like /l/ -- | -------------- more like /r/ -------------- /r/
/sl/-/sr/:  /l/ -------------- more like /l/ -------------- | -- more like /r/ -- /r/
 
(The listener heard a sound sequence consisting of /t/ or /s/ plus either a pure /l/, a pure /r/, or an
acoustically ambiguous sound somewhere between /l/ and /r/. The horizontal axis shows the gradation
between /l/ and /r/: the extreme left endpoint represents a pure /l/, the extreme right a pure /r/, and
the synthesized sounds in between are acoustically closer to whichever endpoint they lie nearer. The
character "|" marks the boundary between the sounds listeners heard as /l/ and those they heard as /r/.
Since /tl/ is a forbidden sequence in American English, listeners tended to hear /tr/ even when the
ambiguous sound was acoustically closer to /l/. Similarly, since /sr/ is forbidden, listeners heard /sl/
more often than /sr/ even when the synthesized sound was acoustically closer to /r/.)

The English-speaking listeners heard more of the interval as /r/ after /t/ (/tl/ is illegal, while /tr/ is okay),
and more of the interval as /l/ after /s/ (/sl/ is legal, but /sr/ is not). The /l/-/r/ boundary is apparently
shifted by the phonotactics of English: listeners hear the illegal sound only when the acoustic evidence is
overwhelming. In other words, the mechanisms of speech perception seem to be taking into account the
*plausibility* of an /l/ or /r/ in the given context, and demanding stronger acoustic evidence before
assenting to the less plausible hypothesis.
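
One way to picture this trade-off is as acoustic evidence combined with a context-dependent bias. The
sketch below is illustrative only: the logistic form, the function names, and the slope, midpoint, and
bias values are all assumptions made for the example, not quantities from Massaro and Cohen's study.

```python
import math

def prob_r(step, bias, slope=1.5, midpoint=3.0):
    """Probability of reporting /r/ at a continuum step (hypothetical model).

    step: position on an /l/-to-/r/ continuum (0 = pure /l/, 6 = pure /r/).
    bias: context bias toward /r/ in log-odds; positive values favor /r/.
    """
    return 1.0 / (1.0 + math.exp(-(slope * (step - midpoint) + bias)))

def boundary(bias, slope=1.5, midpoint=3.0):
    """Step at which /r/ and /l/ responses are equally likely."""
    return midpoint - bias / slope

# Assumed bias values, chosen only to show the direction of the shift.
after_t = boundary(bias=+3.0)  # /tl/ is illegal, so the context favors /r/
after_s = boundary(bias=-3.0)  # /sr/ is illegal, so the context favors /l/
neutral = boundary(bias=0.0)

print(after_t, neutral, after_s)
```

With these made-up numbers the boundary moves toward the /l/ end after /t/ and toward the /r/ end after
/s/, which is the direction of the shift described above.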

According to the rule-based theory, plausibility is determined by consulting a phonotactic rule.
According to the connectionist theory, plausibility is determined by looking at real words that are similar
to the context. The connectionist theory therefore predicts that the strength of the phonotactic effect
(i.e., the size of the boundary shift) should depend on the composition of the set of similar words: a
context that is similar to many words, or to a few very common words, should produce a large boundary
shift, while one that is similar to very few or very rare words should produce a small shift. On the other
hand, the rule-based theory expects equal shifts in all contexts, since the rule is an impartial ban on
*all* occurrences of the illegal sequence, irrespective of context.

The experiment worked like this: We synthesized English nonsense words ending in closed syllables like "grihdj" (rhymes with "fridge"), "greedj", "krihdj", and "kreedj", with stress on the final syllable, then
lopped off the final consonant to make nonsense words ending in the open syllables "grih", "gree", "krih", and "kree."

We took pairs of these words that differed only in the vowel, and synthesized words with acoustically
ambiguous vowels that were in between the endpoints, to produce four scales that gradually changed
from one word to the other in five steps. Example:

(2) Typical stimulus family

CLOSED/g pulgrihdj ---- 1 ---- 2 ---- 3 ---- 4 ---- 5 ---- pulgreedj

CLOSED/k pulkrihdj ---- 1 ---- 2 ---- 3 ---- 4 ---- 5 ---- pulkreedj

OPEN/g pulgrih        ---- 1 ---- 2 ---- 3 ---- 4 ---- 5 ---- pulgree

OPEN/k pulkrih        ---- 1 ---- 2 ---- 3 ---- 4 ---- 5 ---- pulkree

[CLOSED indicates that the vowel sound of interest occurs in the middle of a word, and OPEN indicates that
it occurs at the end of a word. 1, 2, 3, 4, 5 represent words with ambiguous vowel sounds lying
acoustically in between the vowel sounds at the two endpoints. For example, in the top line, we took the
two nonsense words "pulgrihdj" (vowel sound "ih") and "pulgreedj" (vowel sound "ee") and constructed
five words whose vowel sounds lie in between "ih" and "ee."]

English-speaking listeners heard a series of trials that went like this: First they heard one endpoint, then
they heard an ambiguous word, then they heard the other endpoint. They were asked to say which
endpoint the ambiguous word was closer to. From their responses, we could determine the location of
the boundary between "ih" and "ee" in each of the four contexts.

Now, American English does not tolerate "ih" at the end of a word. There are no words ending in that
vowel, and you can't make up new ones. (For example: The letter i in "delicatessen" is pronounced "ih",
but when the word is shortened to "deli" it has to be pronounced "ee", since words cannot end in "ih".)
This means that "ih" is legal in the CLOSED contexts, but not in the OPEN ones. The Massaro-Cohen
effect should therefore move the "ih"-"ee" boundary towards "ih" in the OPEN contexts as compared to
the CLOSED ones. This is in fact what we found -- the boundary was about a half-step closer to "ih" in
the OPEN contexts:

(3) Actual results from experiment

CLOSED/g pulgrihdj ---- 1 ---- 2 ---- 3|---- 4 ---- 5 ---- pulgreedj

CLOSED/k pulkrihdj ---- 1 ---- 2 ---- 3|---- 4 ---- 5 ---- pulkreedj

OPEN/g pulgrih        ---- 1 ---- 2 --|- 3 ---- 4 ---- 5 ---- pulgree

OPEN/k pulkrih        ---- 1 ---- 2 --|- 3 ---- 4 ---- 5 ---- pulkree

(Just as in (1), the "|" symbol divides the sounds that were heard as "more like 'ih'" from those heard as "more like 'ee'".  It marks the location of the sound that would be heard one way 50% of the time and the other way the other 50%.)
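
The 50% point can be located from listeners' responses in several ways. The sketch below uses simple
linear interpolation between adjacent steps. It is not necessarily the procedure used in this study, and
the response proportions are invented for illustration; they are not data from the experiment.

```python
def fifty_percent_boundary(steps, prop_ee):
    """Return the continuum position where "ee" responses cross 50%,
    using linear interpolation between adjacent steps."""
    for i in range(len(steps) - 1):
        x0, x1 = steps[i], steps[i + 1]
        p0, p1 = prop_ee[i], prop_ee[i + 1]
        if p0 < 0.5 <= p1:
            return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)
    raise ValueError("responses never cross 50%")

steps = [1, 2, 3, 4, 5]

# Invented proportions of "ee" responses at each step (NOT real data).
closed = [0.05, 0.20, 0.45, 0.85, 0.95]
open_ = [0.08, 0.30, 0.60, 0.90, 0.97]

closed_boundary = fifty_percent_boundary(steps, closed)
open_boundary = fifty_percent_boundary(steps, open_)

# A positive value means the boundary lies closer to "ih" in the OPEN context.
shift = closed_boundary - open_boundary
print(closed_boundary, open_boundary, shift)
```

The quantity of interest is the difference between the two boundaries, since that difference is the
size of the phonotactic effect.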

The location of the boundary did not depend on whether the word contained a g or a k. This is
important, because the syllable "gree" is extremely common at the end of a word, while the others are
all very rare. The set of words that is similar to "pulgree" is much larger and more common than the set
of words that is similar to "pulkree" (out of every million words you read or hear, about 450 end in
"gree", while only 10 end in "kree"). Therefore, the connectionist theory predicts a larger boundary shift
for the g words than for the k words -- something more like this:

(4) Predictions of the statistical theory

CLOSED/g pulgrihdj ---- 1 ---- 2 ---- 3|---- 4 ---- 5 ---- pulgreedj

CLOSED/k pulkrihdj ---- 1 ---- 2 ---- 3|---- 4 ---- 5 ---- pulkreedj

OPEN/g pulgrih        ---- 1 ---- 2|---- 3 ---- 4 ---- 5 ---- pulgree

OPEN/k pulkrih        ---- 1 ---- 2 --|- 3 ---- 4 ---- 5 ---- pulkree

This is not what we found. The boundary shift was the same size for the g words and the k words, which
matches the prediction of the rule-based theory -- phonotactics applies uniformly and impartially in all
contexts.
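
The contrast between the two predictions can be made concrete with a toy calculation. In the sketch
below, only the two frequencies (about 450 and 10 per million) come from the text. The logarithmic
form, the `scale` and `size` parameters, and the function names are assumptions chosen for illustration;
neither theory is committed to these particular formulas.

```python
import math

# Word-final frequency per million words, as given in the text.
FREQ_PER_MILLION = {"g": 450, "k": 10}  # "gree" vs. "kree"

def rule_based_shift(context, size=0.5):
    """Rule-based view: the same shift in every context.
    The context argument is deliberately ignored; size is a free parameter."""
    return size

def frequency_based_shift(context, scale=0.2):
    """Toy statistical view: the shift grows with the frequency of the
    legal ending. The log form and scale are assumptions."""
    return scale * math.log10(1 + FREQ_PER_MILLION[context])

for context in ("g", "k"):
    print(context,
          rule_based_shift(context),
          round(frequency_based_shift(context), 2))
```

Whatever the exact formula, any account in which the shift increases with frequency predicts a larger
shift for the g words than for the k words, while the rule-based account predicts no difference.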

This experiment therefore supports the traditional linguistic view that knowing a language involves
knowing certain rules or constraints for manipulating symbols, and constitutes a problem for
connectionist theories that would eliminate symbol manipulation from cognition.

(Work supported by Public Health Service Grant 5 T32 HD 07327 to the author, and by
NIH/NIDCD Grant 5 R29 DC 01708 to John Kingston.)