This is se{w,r}ious: using acoustics, phonetic transcription, and naïve judgments to better understand how children learn (or fail to learn) the /r/ sound


Lay-language version of paper Growth in the Accuracy of Preschool Children’s /r/ Production: Evidence from a Longitudinal Study, poster 5aSC, presented in the session Speech Production, Friday, May 11, 8 am – 12 pm.


Mara Logerquist1

Alisha Martell1

Hyuna Mia Kim2

Benjamin Munson1 (contact author, +1 612 619 7724)

Jan Edwards2,3,4


1Department of Speech-Language-Hearing Sciences, University of Minnesota, Twin Cities, 2Department of Communication Sciences and Disorders, University of Wisconsin, Madison, 3Department of Hearing and Speech Sciences, University of Maryland, College Park,
4Language Science Center, University of Maryland, College Park


Few would dispute that language acquisition is a fascinating and remarkable feat.  Children progress from their first coos and cries to saying full sentences in a matter of just a few years.  Given all that is involved in spoken language, it seems almost unreal that children could accomplish this Herculean task in such a short time.  Even the seemingly simple task of learning to pronounce sounds is, on closer examination, rather tough.  Children have to listen to the adults around them to figure out what they should sound like.  Then, they have to approximate the adult productions that they hear with the very different vocal instrument that they have: children’s vocal tracts are about half the size of an adult’s.  Not surprisingly, specific difficulty in learning speech sounds is one of the most common communication disorders.

The English /r/ sound is a particularly interesting topic in speech sound acquisition.  It is one of the last sounds to be acquired.  For many children with developmental speech disorders, /r/ errors (which usually sound like the /w/ sound) are very persistent, even when other speech errors have been successfully corrected.  Perhaps because they are so common, /r/ errors are very socially salient.  We can easily find portrayals of children’s speech with /r/ errors in TV shows, such as Ming-Ming the duck’s catch phrase “This is serious” (with a production of /r/ that sounds like a /w/) on the show Wonder Pets.

The sound /r/ has a distinctive acoustic signature, as illustrated by productions of the words rock and walk (which rhyme in the speech of many people in Minnesota).  These illustrations are spectrograms, which are a type of acoustic record of a sound.  Spectrograms allow us to measure fine-grained detail in speech.  The red dots on these spectrograms are estimates of which of the many frequencies present in the speech signal are the loudest.  In the production of rock, the third-lowest peak frequency (which we call the third formant [F3]) is low (about 1500 Hz).  In the production of walk, it is much higher (about 2500 Hz).


The last two authors of this study, along with a third collaborator, Dr. Mary E. Beckman, recently finished a longitudinal study of relationships among speech perception, speech production, and word learning in children.  As part of this study, we collected numerous productions of late-acquired sounds in word-initial position (like the /r/ sound in rocking).  The ultimate goal of that study is to understand how speech production and perception early in life set the stage for vocabulary growth throughout the preschool years, and how vocabulary growth helps children refine their knowledge of speech sounds.  The study collected a treasure trove of data that we can use to analyze other, secondary questions.  Our ASA poster does just that.  In it, we ask whether we can predict which children with inaccurate /r/ productions at our second time point (TP2, when children were between 39 and 52 months old) improved their /r/ production by our third time point (TP3, when the children were between 52 and 64 months old), and which did not.  (TP2 followed a first time point, at which the children were 28 to 39 months old.)  Our candidate predictors were taken from a battery of standardized and non-standardized tests of speech perception, vocabulary knowledge, and nonlinguistic cognitive skills.

Our first stab at answering this question involved looking at phonetic transcriptions of children’s productions of /r/ and /w/.  We picked /w/ as a comparison sound because most of children’s /r/ errors sound like /w/.  Phonetic transcription was completed by trained phonetic transcribers using a very strict protocol.  We calculated the accuracy of /r/ and /w/ at both TP2 and TP3.  As the plot below (in which each dot represents a single child’s performance) shows, children’s performance generally improved: most of the children are above the y=x line.
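The accuracy calculation described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the record layout, the child IDs, and every value in it are invented here and are not data or code from the study.

```python
from collections import defaultdict

# Hypothetical transcription records: (child_id, time_point, target, transcribed).
# All values are invented for illustration; they are not data from the study.
records = [
    ("c01", "TP2", "r", "w"),
    ("c01", "TP2", "r", "r"),
    ("c01", "TP3", "r", "r"),
    ("c01", "TP3", "r", "r"),
    ("c02", "TP2", "r", "w"),
    ("c02", "TP2", "r", "w"),
    ("c02", "TP3", "r", "w"),
    ("c02", "TP3", "r", "r"),
    ("c03", "TP2", "r", "r"),
    ("c03", "TP2", "r", "r"),
    ("c03", "TP3", "r", "r"),
    ("c03", "TP3", "r", "w"),
]


def accuracy_by_child(records, target):
    """Return {child_id: {time_point: proportion transcribed as the target}}."""
    # counts[child][time_point] holds [number correct, number of attempts]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for child, time_point, tgt, heard in records:
        if tgt != target:
            continue
        counts[child][time_point][1] += 1
        if heard == tgt:
            counts[child][time_point][0] += 1
    return {
        child: {tp: correct / total for tp, (correct, total) in tps.items()}
        for child, tps in counts.items()
    }


r_accuracy = accuracy_by_child(records, "r")

# A child sits above the y = x line when TP3 accuracy exceeds TP2 accuracy.
improved = {child: acc["TP3"] > acc["TP2"] for child, acc in r_accuracy.items()}

for child, acc in sorted(r_accuracy.items()):
    print(child, acc["TP2"], acc["TP3"], "improved" if improved[child] else "did not improve")
```

In this made-up example, two children land above the y = x line and one lands below it.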

We examined predictors of how much growth in accuracy occurred from TP2 to TP3 for the subset of children whose accuracy of /r/ was below 50% at TP2.  Surprisingly, the results did not help us understand why some children improved more than others.  In general, we found that the children who had the most improvement were those with low speech perception and vocabulary scores at TP2.  A naïve interpretation might be that low vocabulary is associated with positive speech acquisition, which is an unlikely story!  Closer inspection showed that this relationship arose because the children who had the lowest accuracy scores at TP2 (that is, the children with the most room to grow) were also those who had the lowest vocabulary and speech perception scores.
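The "room to grow" point can be made concrete with a small sketch.  The accuracy values below are invented, and the helper name growth_summary is hypothetical.  Only the 50% cutoff at TP2 comes from the text above.  The sketch simply shows that a child who starts lower has a larger possible gain, so anything that goes along with a low starting score can look like a predictor of growth.

```python
# Hypothetical per-child /r/ accuracy (proportion correct); invented values.
accuracy = {
    "c01": {"TP2": 0.10, "TP3": 0.70},
    "c02": {"TP2": 0.40, "TP3": 0.80},
    "c03": {"TP2": 0.45, "TP3": 0.60},
    "c04": {"TP2": 0.90, "TP3": 0.95},  # left out below: not under 50% at TP2
}


def growth_summary(accuracy, cutoff=0.5):
    """For children below the cutoff at TP2, return growth and the largest possible growth."""
    summary = {}
    for child, acc in accuracy.items():
        if acc["TP2"] >= cutoff:
            continue
        summary[child] = {
            "growth": acc["TP3"] - acc["TP2"],  # observed change in accuracy
            "room": 1.0 - acc["TP2"],           # ceiling on how much could change
        }
    return summary


summary = growth_summary(accuracy)
for child, s in sorted(summary.items()):
    print(f"{child}: growth={s['growth']:.2f}, room to grow={s['room']:.2f}")
```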

We then went back to our data and asked whether we could get a finer-grained measure than the one we get from phonetic transcriptions.  We know from our own previous work, and from the work of others (especially Tara McAllister and colleagues, who have worked on /r/ extensively), that speech sounds are acquired gradually.  A child learning /s/ (the “s” sound) gradually learns, over the course of development, how to produce /s/ differently from other similar sounds (like the “th” sound and the “sh” sound).  McAllister and colleagues showed this to be the case with /r/, too.  Measures like phonetic transcription don’t do a very good job of capturing this gradual acquisition, since a transcription says only that a sound is correct or incorrect.  It doesn’t track degrees of accuracy or inaccuracy.

To examine gradual learning of /r/, we first tried to look at acoustic measures, like the F3 measures that are useful in characterizing adults’ /r/ productions.  A quick look at two spectrograms of children’s productions reveals how hard this endeavor actually is.  Both of these are the first 175 ms of two kids’ productions of the word rocking.  Both of them were transcribed as /w/.  In both of them, the F3 is hard to find.  It’s not nearly as clear as it is in the adults’ productions that are shown above.  In both of these cases, the algorithm that tracks formant frequencies gives values that are rather suspicious.  In short, we would need to carefully code these by hand to get anything approximating an accurate F3, and some tokens wouldn’t be amenable to any kind of acoustic analysis.  Given our large number of productions in this study (nearly 3,000!), this would take many hundreds of hours of work.




Subject 679L




Subject 671L


To remedy this, we decided to abandon acoustics.  Instead, we presented brief clips of children’s speech (the first 175 ms of words starting with the /r/ and /w/ sounds) to naïve listeners, where “naïve” means “without any specialized training in speech-language pathology, phonetics, or acoustics.”  We asked them to rate the children’s productions on a continuous scale, by clicking on an arrow like this:


The “r” sound                                                                                                              The “w” sound


When we examine listener ratings, we find quite a bit of variation across kids in how their sounds are rated.  Consider the ratings for the productions above.  Each one of the “r” symbols represents a rating by an individual.  The higher the rating, the more /w/-like it was judged to sound.























Listen to these sounds yourself, and ask where you would click on the line.  Do your judgments match those of our listeners?


We find that the production by 679L (which was rated by 125 listeners) is perceived as much more /w/-like than the production by 671L (which was rated by 20 listeners).  How do these data help us understand growth in /r/?  In our ongoing analyses, we are examining growth by using pooled listener ratings instead of phonetic transcription.  Our hope is that these finer-grained measures will help us better understand individual differences in how children learn the /r/ sound.
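One simple way to pool listener ratings is to average them for each production, as in the sketch below.  This is an assumption about what pooling could look like, not a description of the analysis on the poster.  The 0-to-1 scale, the token names, and the ratings themselves are all invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical ratings on an assumed 0-1 scale: 0 is the "r" end of the line,
# 1 is the "w" end. Tokens and values are invented, not taken from the study.
ratings = {
    "token_A": [0.82, 0.91, 0.77, 0.88],
    "token_B": [0.35, 0.52, 0.41],
}


def pool_ratings(ratings):
    """Summarize each token's ratings: listener count, mean, and standard deviation."""
    return {
        token: {"n": len(values), "mean": mean(values), "sd": stdev(values)}
        for token, values in ratings.items()
    }


pooled = pool_ratings(ratings)
for token, summary in pooled.items():
    print(f"{token}: n={summary['n']}, mean={summary['mean']:.2f}, sd={summary['sd']:.2f}")
```

Unlike a correct-or-incorrect transcription, the pooled mean can fall anywhere along the line, so two productions that would both be transcribed as /w/ can still receive different scores.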
