(1995). "Acoustic characteristics of American English vowels

Acoustic characteristics of American English vowels
James Hillenbrand, Laura A. Getty, Michael J. Clark, and Kimberlee Wheeler
Department of Speech Pathology and Audiology, Western Michigan University, Kalamazoo, Michigan 49008
(Received 10 August 1994; revised 7 November 1994; accepted 17 January 1995)
The purpose of this study was to replicate and extend the classic study of vowel acoustics by Peterson and Barney (PB) [J. Acoust. Soc. Am. 24, 175-184 (1952)]. Recordings were made of 45 men, 48 women, and 46 children producing the vowels /i, ɪ, e, ɛ, æ, ɑ, ɔ, o, ʊ, u, ʌ, ɝ/ in h-V-d syllables. Formant contours for F1-F4 were measured from LPC spectra using a custom interactive editing tool. For comparison with the PB data, formant patterns were sampled at a time that was judged by visual inspection to be maximally steady. Analysis of the formant data shows numerous differences between the present data and those of PB, both in terms of average frequencies of F1 and F2, and the degree of overlap among adjacent vowels. As with the original study, listening tests showed that the signals were nearly always identified as the vowel intended by the talker. Discriminant analysis showed that the vowels were more poorly separated than the PB data based on a static sample of the formant pattern. However, the vowels can be separated with a high degree of accuracy if duration and spectral change information is included.
PACS numbers: 43.70.Fq, 43.71.Es, 43.72.Ar
INTRODUCTION
The most widely cited experiment on the acoustics and perception of vowels is a surprisingly simple study conducted at Bell Telephone Laboratories by Peterson and Barney (1952) shortly after the introduction of the sound spectrograph. Peterson and Barney (PB) recorded two repetitions of ten vowels in /hVd/ context spoken by 33 men, 28 women, and 15 children. Acoustic measurements from narrow-band spectra consisted of formant frequencies (F1-F3), formant amplitudes, and fundamental frequency (F0). The measurements were taken at a single time slice that was judged to be "steady state." The /hVd/ signals were also presented to listeners for identification. The results of the measurement study showed a strong relationship between the intended vowel and the formant frequency pattern. However, there was considerable formant frequency variability from one speaker to the next, and there was a substantial degree of overlap in the formant frequency patterns among adjacent vowels. The listening study showed that the vowels were highly identifiable: The overall error rate was 5.6%, and nearly all of the errors involved confusions between adjacent vowels.
The PB measurements have played a central role in the development and testing of theories of vowel recognition. Acoustic measurements for the signals recorded by PB have been widely distributed to speech research laboratories (e.g., Watrous, 1991) and have been used in numerous studies to evaluate alternative models of vowel recognition (e.g., Nearey, 1978; Nearey et al., 1979; Syrdal, 1985; Syrdal and Gopal, 1986; Nearey, 1992; Lippmann, 1989; Miller, 1989; Hillenbrand and Gayvert, 1993a). Despite the widespread use of the PB measurements, there are several well recognized limitations to the database. Perhaps the most important limitation is that the database consists exclusively of acoustic measurements taken at a single time slice. Duration measurements were not made, and no information is available about the pattern of spectral change over time. There is now a solid body of evidence indicating that dynamic properties such as duration and spectral change play an important role in vowel perception (e.g., Ainsworth, 1972; Bennett, 1968; Di Benedetto, 1989a,b; Hillenbrand and Gayvert, 1993b; Jenkins et al., 1983; Nearey, 1989; Nearey and Assmann, 1986; Stevens, 1959; Strange, 1989; Strange et al., 1983; Tiffany, 1953; Whalen, 1989). Other limitations of the PB database include: (1) There is no indication that subjects were screened for dialect, and very little is known about the dialect of either the speakers or the listeners; (2) listening results were not reported separately for men, women, and child talkers; (3) no information is given about the age or gender of the child talkers; (4) measures were made from a relatively small group of children; (5) there is no way to determine the identifiability of individual tokens; (6) measurement reliability was not reported; and (7) since the original signals are no longer available, the database cannot be used to evaluate signal representations other than F0 and formant frequencies.
The present study represents an attempt to address these limitations. Recordings were made of /hVd/ utterances spoken by a large group of men, women, and children. Measurements were made of vowel duration, F0 contours, and formant frequency contours. The signals were also presented to a panel of listeners for identification. Finally, discriminant analysis was used to classify the signals using various combinations of the acoustic measurements.
J. Acoust. Soc. Am. 97 (5), Pt. 1, May 1995 0001-4966/95/97(5)/3099/13/$6.00 © 1995 Acoustical Society of America 3099
I. ACOUSTIC ANALYSIS
A. Methods
1. Talkers
Talkers consisted of 45 men, 48 women, and 46 ten- to 12-year-old children (27 boys, 19 girls). The majority of the speakers (87%) were raised in Michigan's lower peninsula, primarily the southeastern and southwestern parts of the state. The remainder were primarily from other areas of the upper midwest, such as Illinois, Wisconsin, Minnesota, northern Ohio, and northern Indiana. An extensive screening procedure was used to select these 139 subjects from a larger group. The most important part of the screening procedure was a careful dialect assessment, focusing especially on subjects' production of the /ɑ/-/ɔ/ distinction. The /ɑ/-/ɔ/ distinction is not maintained by many speakers of American English, a fact which we believed (incorrectly, as it turned out) might account for the relatively high confusability reported by PB for this pair of vowels.
The screening procedure began with a 5- to 7-min informal conversation with one of the experimenters. This conversation was tape recorded for later review by an experienced phonetician. Subjects next read a 128-word passage that contained several instances of words with /ɑ/ and /ɔ/. Subjects were eliminated if the phonetician noted any systematic departure from general American English, or if the speaker failed to maintain the /ɑ/-/ɔ/ distinction either in spontaneous speech or in the 128-word passage. Subjects were also required to pass a brief task which tested their ability to discriminate /ɑ/-/ɔ/ minimal pairs. In addition to the dialect assessment, subjects were eliminated if they: (1) were non-native speakers of English; (2) showed any evidence of a speech, language, or voice disorder; (3) showed any evidence of a current respiratory infection; or (4) failed a 20-dB pure-tone screening at 500, 1000, and 2000 Hz.
2. Recordings
Audio recordings were made of subjects reading lists containing 12 vowels: The ten vowels recorded by PB (/i, ɪ, ɛ, æ, ɑ, ɔ, ʊ, u, ʌ, ɝ/) plus /e/ and /o/. Also recorded were four diphthongs in /hVd/ context, and both vowels and diphthongs in isolation. Only results from the 12 /hVd/ utterances will be described in this report. Subjects read from one of 12 different randomizations of a list containing the words "heed," "hid," "hayed," "head," "had," "hod," "hawed," "hoed," "hood," "who'd," "hud," "heard," "hoyed," "hide," "hewed," and "how'd." Subjects were given as much time as needed to practice the task and demonstrate an understanding of the pronunciations that were expected for each key word. Recordings were made of several readings of the list once the experimenter was satisfied that the subject understood the task. Once the recording session began, the experimenter did not audition each stimulus and request additional readings based on the experimenter's judgment of correct pronunciation.¹ An attempt was made to record at least three readings of the list. This was often not possible in the case of the children, who took longer to train than adults and sometimes tired of the task after two readings.
The recordings were made with a digital audio recorder (Sony PCM-F1) and a dynamic microphone (Shure 570-S). One token of each stimulus from each talker was low-pass filtered at 7.2 kHz and digitized at 16 kHz with 12 bits of amplitude resolution on a PDP 11/73 computer. Unless there were problems with recording fidelity or background noise, tokens were taken from the subject's first reading of the list. The gain on an input amplifier was adjusted individually for each token so that the peak amplitude was at least 80% of the ±10-V dynamic range of the A/D, with no peak clipping.
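The level criterion described above (peak of at least 80% of a ±10-V converter range, no clipping, 12-bit resolution) can be illustrated with a small sketch. The function names, the string labels, and the rounding and clamping behaviour of the quantizer are assumptions made for illustration; the original digitization software is not described in the text.

```python
def check_input_level(samples_volts, full_scale=10.0, min_fraction=0.8):
    """Classify a token's peak level against a +/-full_scale converter range.

    Hypothetical helper: thresholds follow the text (80% of +/-10 V),
    the labels are illustrative only.
    """
    peak = max(abs(v) for v in samples_volts)
    if peak >= full_scale:
        return "clipped"        # peak reaches or exceeds the converter range
    if peak < min_fraction * full_scale:
        return "too quiet"      # gain should be raised for this token
    return "ok"


def quantize_12bit(volts, full_scale=10.0):
    """Map a voltage to a signed 12-bit code in the range -2048..2047."""
    step = 2 * full_scale / 4096        # 20 V spread over 2**12 levels
    code = round(volts / step)
    return max(-2048, min(2047, code))  # clamp to the representable range
```

With these assumptions one quantization step is 20 V / 4096, or roughly 4.9 mV.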
FIG. 1. Spectral peak display of the word "heard" spoken by a child. The dashed vertical lines indicate the beginning and end of the vowel nucleus. The top panel shows the signal after the original 14-pole LPC analysis, the middle panel shows the signal after reanalysis with 18 poles, and the bottom panel shows the signal after hand editing with a custom editing tool.
3. Acoustic measurements
a. Vowel duration and "steady-state" times. The starting and ending times of vocalic nuclei were measured by hand from high-resolution gray-scale digital spectrograms using standard measurement criteria (Peterson and Lehiste, 1960). In an attempt to produce a data set comparable to PB, two experimenters, working independently, made a judgment of steady-state time for each signal. The measures were made while viewing a spectral peak display (Fig. 1) and a gray-scale spectrogram. PB provide a very brief description of how steady-state times were located, indicating only that the spectrum was sampled, "... following the influence of the /h/ and preceding the influence of the /d/, during which a practically steady state is reached" (Peterson and Barney, 1952, p. 177). The two experimenters worked from this brief description, and from the ten examples shown in Fig. 2 of PB.
In addition to the hand measurements of steady-state times, we experimented with several methods of determining steady-state times automatically through an analysis of edited formant contours. Of the several methods that were tried, the technique that seemed to produce the best results defined steady state as the center of the sequence of seven analysis frames (56 ms) with the minimum slope in log F2-log F1 space (Miller, 1989).
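The automatic steady-state rule just described can be sketched as follows. The text does not say how "slope" was computed, so this sketch assumes it is the total frame-to-frame distance travelled in (log F1, log F2) space over the seven-frame window; the function name and that cost definition are illustrative assumptions, not the authors' implementation.

```python
import math


def steady_state_frame(f1, f2, window=7):
    """Return the index of the center frame of the `window`-frame run with
    the least movement in (log F1, log F2) space.

    f1, f2: per-frame formant frequencies in Hz (all values must be > 0).
    """
    if len(f1) != len(f2) or len(f1) < window:
        raise ValueError("need equal-length tracks with at least `window` frames")
    pts = [(math.log(a), math.log(b)) for a, b in zip(f1, f2)]
    # Distance moved between consecutive frames (assumed stand-in for slope).
    steps = [math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
    best_start, best_cost = 0, float("inf")
    for start in range(len(pts) - window + 1):
        cost = sum(steps[start:start + window - 1])
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start + window // 2


# Synthetic track: four moving frames, seven flat frames, two moving frames.
f1 = [400, 450, 500, 550] + [600] * 7 + [560, 520]
f2 = [2000, 1950, 1900, 1850] + [1800] * 7 + [1850, 1900]
center = steady_state_frame(f1, f2)   # index of the steady-state frame
```

With frames spaced 8 ms apart, the returned index multiplied by 8 gives the steady-state time in ms relative to the first frame.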
b. Formant contours. Formant-frequency analysis began with the calculation of 14-pole, 128-point linear predictive coding (LPC) spectra every 8 ms over 16 ms (256 point) Hamming windowed segments. The frequencies of the first seven spectral peaks were then extracted from the LPC spectrum files. The frequencies of spectral peaks were estimated with a three-point parabolic interpolation, yielding a finer resolution than the 62.5-Hz frequency quantization. Files containing the LPC peak data served as the input to a custom interactive editor. The editor allows the experimenter to reanalyze the signal with different LPC analysis parameters and to hand edit the formant tracks.
Editing and analysis decisions were based on an examination of the LPC peak display overlaid on a gray-scale spectrogram and, in some cases, on an examination of individual LPC or Fourier spectral slices. General knowledge of acoustic phonetics also played a role in the editing process. For example, editing decisions were frequently influenced by the experimenter's knowledge of the close proximity of F2 and F3 for vowels such as /i/ and /ɝ/, the close proximity of F1 and F2 for vowels such as /ɑ/ and /u/, and so on (see Ladefoged, 1967, for an excellent discussion of the inherent circularity in this method of estimating vowel formants, and for other insightful comments on the formant analysis). Considerations such as these often led the experimenter to conclude that a formant merger occurred. In these cases, the LPC spectra were recomputed with a larger number of poles until the merged formants separated.
Once the experimenter was satisfied with the analysis, editing commands could be used to hand edit any formant tracking errors that remained. Figure 1 shows an example of the utterance "heard" spoken by a ten-year-old boy: (a) after the original 14-pole analysis, (b) after reanalysis with 18 poles, and (c) after hand-editing. (For simplicity, the gray-scale spectrogram underlying the peak display is not shown.) The vertical lines indicate the beginning and end of the vowel nucleus. Two commands are available for hand editing the formant contours. One command allows the experimenter to use the mouse to delete a spurious peak, and a second command allows the experimenter to use the mouse to interpolate through "holes" in the formant contour. For example, in the center panel of Fig. 1, there is a gap in the F3 contour toward the end of the vowel. Clicking the mouse on either side of this gap causes the program to linearly interpolate formant frequencies through this gap.
It was not uncommon for utterances to show formant mergers throughout all or part of the vocalic nucleus that could not be resolved using these methods. In these cases, zeros were written into the higher of the two formant slots showing the merger (e.g., F3 was zeroed out in the case of an F2-F3 merger). Table I shows the frequency of occurrence of formant mergers for each of the 12 vowels.
For the present study, formants were edited only between the starting and ending times of the vowel. Contours for F1-F3 were measured for all signals, except in cases of unresolvable formant mergers. The fourth formant was measured only when a well-defined F4 contour was clearly visible both on the LPC peak display and the gray-scale spectrogram. The fourth formant was judged to be unmeasurable for 15.6% of the utterances.

TABLE I. Percentage of utterances showing a formant merger anywhere in the vowel nucleus. Shown in parentheses are the percentage of utterances showing a formant merger at "steady state."
Vowel   F1-F2        F2-F3
/i/     0.0 (0.0)    10.0 (8.0)
/ɪ/     0.0 (0.0)     0.0 (0.0)
/e/     0.0 (0.0)     9.3 (4.7)
/ɛ/     0.0 (0.0)     0.0 (0.0)
/æ/     0.0 (0.0)     3.3 (3.3)
/ɑ/     2.0 (2.0)     0.7 (0.0)
/ɔ/     4.0 (2.0)     0.7 (0.7)
/o/     2.7 (1.3)     0.0 (0.0)
/ʊ/     0.0 (0.0)     0.0 (0.0)
/u/     1.3 (0.7)     0.7 (0.7)
/ʌ/     0.7 (0.7)     0.0 (0.0)
/ɝ/     0.0 (0.0)    15.3 (11.3)

TABLE II. Average absolute difference between formant frequencies sampled at "steady-state" times determined by two judges. Figures in parentheses are differences as a percent of average formant frequency.
      Men            Women          Children       Overall
F1     7.5 (1.3%)     9.2 (1.5%)    10.7 (1.8%)     9.2 (1.5%)
F2    14.2 (0.8%)    20.0 (1.1%)    18.5 (1.1%)    17.6 (1.0%)
F3    18.6 (0.7%)    21.2 (0.7%)    27.6 (1.0%)    22.5 (0.8%)
F4    20.6 (0.5%)    31.5 (0.8%)    36.4 (0.9%)    29.5 (0.7%)

c. Fundamental frequency contours. F0 contours were extracted with an autocorrelation pitch tracker (Hillenbrand, 1988), followed by hand editing using the tool described above. Gross tracking errors such as pitch halving and pitch doubling were corrected by reanalyzing the signal with an option that imposes an upper or lower limit on the search for the autocorrelation peak. Any errors that remained were corrected using the editing commands that were described above.
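The three-point parabolic interpolation mentioned in the formant analysis above is a standard way to refine a spectral peak beyond the bin spacing. The sketch below is a generic version of that technique, not the authors' code; the function name and the `bin_hz` parameter (the spacing between spectrum points, supplied by the caller) are assumptions.

```python
def parabolic_peak(spectrum, k, bin_hz):
    """Refine the local maximum at bin k by fitting a parabola through
    the amplitudes at bins k-1, k, and k+1.

    Returns (frequency_hz, peak_amplitude). `spectrum` holds amplitudes
    (typically in dB) at equally spaced frequencies k * bin_hz.
    """
    a, b, c = spectrum[k - 1], spectrum[k], spectrum[k + 1]
    denom = a - 2 * b + c
    if denom == 0:                      # flat neighbourhood: nothing to refine
        return k * bin_hz, b
    offset = 0.5 * (a - c) / denom      # vertex position in bins, relative to k
    height = b - 0.25 * (a - c) * offset
    return (k + offset) * bin_hz, height
```

For a spectrum sampled at 16 kHz with 128 points up to the Nyquist frequency, `bin_hz` would be 8000 / 128; the fitted vertex can then fall anywhere between neighbouring bins.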
B. Results
1. Measurement reliability
a. Vowel duration. Vowel durations for 10% of the utterances were remeasured independently by a second experimenter. The utterances chosen for remeasurement were drawn at random from the total of 1668 signals, but with approximately equal numbers of men, women, and children. The averaged absolute difference between the original and remeasured durations was 6.9 ms. This result is in line with reliability data for vowel duration reported by Allen (1978) and Smith et al. (1986).
b. Steady-state times. Steady-state times were measured by two experimenters for all 1668 utterances. The average absolute difference between the two measurements was 21.1 ms, or 7.7% of average vowel duration. However, more important than the time difference between these two measurements is the difference in the formant frequency pattern at these two sample points. These results, shown in Table II, indicate that formant frequencies at the two sample points typically differed by roughly 1% of average formant frequency.
c. Formant frequencies. Two methods were used to estimate the reliability of the formant frequency measurements. The first method involved a simple reanalysis of 10% of the utterances using the LPC-based signal processing and editing techniques described previously. The second method involved a reanalysis of 10% of the utterances using the same peak picking and editing techniques but with 128-point cepstrally smoothed spectra instead of LPC spectra. The primary motivation for this comparison was Di Benedetto's (1989a) report that LPC produced comparable estimates of F2 and F3 but estimates of F1 that were low when compared with smoothed wideband Fourier spectra. The analysis carried out in the present study consisted of calculating Fourier spectra over 16 ms (256 point) Hamming windowed segments every 8 ms followed by cepstral smoothing. Cepstral smoothing was implemented with the "smooft" algorithm from Press et al. (1988). The size of the smoothing window was adjusted individually for each utterance to minimize spurious peaks or eliminate formant mergers. In this sense, the degree-of-smoothing parameter performed a role in the cepstrum analysis comparable to the number of poles in the LPC analysis. The peak picking and editing procedures described previously were used to extract formant frequencies from the cepstrally smoothed spectra.
Results for the LPC remeasurement are shown in Table III. The results are based on a frame-by-frame comparison of the signals, excluding from consideration any frame in which either signal showed a merger in the formant slot being compared. Results are given as average absolute differences and as signed differences. Overall, the absolute differences ranged from about 12 to 60 Hz, or between 1.0% and 2.0% of average formant frequency.
Table IV compares formant measurements obtained from LPC and cepstrally smoothed spectra. Positive numbers in the signed-difference columns indicate that the LPC-derived formants were higher in frequency than those derived from cepstrally smoothed spectra. In light of Di Benedetto's (1989a) findings, the signed differences are of particular interest. Consistent with Di Benedetto's results, the signed differences are quite small for formants above F1, especially as a percent of formant frequency. However, unlike Di Benedetto's findings, our results showed slightly higher first formants from LPC spectra. This discrepancy might be due to differences between the cepstral smoothing method used in the present study and the "pseudospectrum" method used by Di Benedetto. However, it should be noted that Di Benedetto's findings were based on analyses of utterances spoken by just two men and one woman.

TABLE III. Measurement-remeasurement reliability for formant frequencies obtained from a randomly selected 10% of the signals. Values are given as average absolute differences and average signed differences.
      Men             Women           Children        Overall
      Abs    Signed   Abs    Signed   Abs    Signed   Abs    Signed
F3    23.1   1.2      28.2   1.2      34.0   8.9      28.7   2.5
Entries for F1, F2, and F4 survive only as unordered values whose row and column assignment cannot be recovered: -1.8; 8.1, 2.8, 26.4, 2.8, 27.4, 2.4, 25.2; (following the F4 label) 56.2, 69.1, 15.4, 59.0, 4.0; -2.5, 20.7, -3.1, 14.2, 50.7, -3.3, -3.1, 12.2, 11.7, -2.6, 3.4.

TABLE IV. Comparison of formant frequencies derived from LPC analysis and cepstral smoothing. Results are given as average absolute differences and as average signed differences (LPC-cepstrum). Figures in parentheses are differences as a percent of average formant frequency.
      Absolute       Signed
F1    53.5 (8.9%)    41.5 (6.9%)
F2    60.4 (3.5%)    12.7 (0.7%)
F3    74.0 (2.6%)    10.9 (0.4%)
F4    91.1 (2.3%)     7.4 (0.2%)

d. Fundamental frequency. Remeasurement of fundamental frequency contours for 10% of the utterances showed a frame-by-frame average absolute difference of 1.7 Hz and an average signed difference of 0.6 Hz.
2. Measurement results
Acoustic measurements for the /hVd/ signals are shown in Table V. Fundamental frequency and formant values were sampled at the steady-state times determined by one of the two experimenters who made these judgments. The averages shown in the table, and the data displayed in the subsequent figures, are based on measurements from individual tokens that were well identified in the listening study, to be described in the next section. Specifically, for the purposes of these calculations, measurements were not included from individual tokens that produced an identification error rate of 15% or greater, where "error" simply means any instance in which a signal was identified as a vowel other than that intended by the talker. Using this criterion, the averages in this table are based on measurements from 88.5% of the signals. This allows an analysis of measurements for signals for which the talkers and listeners are in good agreement about the vowel that was spoken. In general, the removal of the more ambiguous signals had very little effect on the averages, with the important exception of /ɔ/. As will be discussed in the next section, there were several instances of attempts at /ɔ/ that were poorly identified and, in some cases, consistently identified as /ɑ/.

TABLE V. Average durations, fundamental frequencies, and formant frequencies of vowels produced by 45 men, 48 women, and 46 children. Averages are based on a subset of the tokens that were well identified by listeners (see text for details). The duration measurements are in ms; all others are in Hz.
          /i/    /ɪ/    /e/    /ɛ/    /æ/    /ɑ/    /ɔ/    /o/    /ʊ/    /u/    /ʌ/    /ɝ/
Dur  M    243    192    267    189    278    267    283    265    192    237    188    263
     W    306    237    320    254    332    323    353    326    249    303    226    321
     C    297    248    314    235    322    311    319    310    247    278    234    307
F0   M    138    135    129    127    123    123    121    129    133    143    133    130
     W    227    224    219    214    215    215    210    217    230    235    218    217
     C    246    241    237    230    228    229    225    236    243    249    236    237
F1   M    342    427    476    580    588    768    652    497    469    378    623    474
     W    437    483    536    731    669    936    781    555    519    459    753    523
     C    452    511    564    749    717   1002    803    597    568    494    749    586
F2   M   2322   2034   2089   1799   1952   1333    997    910   1122    997   1200   1379
     W   2761   2365   2530   2058   2349   1551   1136   1035   1225   1105   1426   1588
     C   3081   2552   2656   2267   2501   1688   1210   1137   1490   1345   1546   1719
F3   M   3000   2684   2691   2605   2601   2522   2538   2459   2434   2343   2550   1710
     W   3372   3053   3047   2979   2972   2815   2824   2828   2827   2735   2933   1929
     C   3702   3403   3323   3310   3289   2950   2982   2987   3072   2988   3145   2143
F4   M   3657   3618   3649   3677   3624   3687   3486   3384   3400   3357   3557   3334
     W   4352   4334   4319   4294   4290   4299   3923   3927   4052   4115   4092   3914
     C   4572   4575   4422   4671   4409   4307   3919   4167   4328   4276   4320   3788

a. Vowel duration. The pattern of durational differences among the vowels is very similar to that observed in connected speech. Our vowel durations from /hVd/ syllables are two-thirds longer than those measured in connected speech by Black (1949), but correlate strongly (r=0.91) with the connected speech data. There were significant differences in vowel duration across the three talker groups (F[2,33]=9.04, p<0.001). Newman-Keuls post-hoc analyses showed significantly shorter durations for the men when compared to either the women or the children. Longer durations for the children were expected based on numerous developmental studies (e.g., Smith, 1978; Kent and Forner, 1980) but the differences between the men and the women were not expected. We do not have an explanation for this finding and do not know if these male-female duration differences would also be seen in conversational speech samples.
b. Fundamental frequency. Figure 2 compares our average values of fundamental frequency with those of PB for each of the ten vowels common to the two studies. Data points that lie on the solid line in the scatterplot indicate identical values, while data points above the line indicate higher F0 values for PB. Average F0 values for the men and the women typically differed by only a few Hz when compared to the corresponding vowels recorded by PB. F0 values for our children averaged 28 Hz lower than the PB data.
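The reliability figures and the comparisons with PB in this section are expressed as average absolute and average signed differences between two sets of measurements. A minimal sketch of that computation follows, assuming that merged formant slots are coded as zeros (as described for the formant editing) and that such frames are skipped; the function name and return convention are hypothetical.

```python
def difference_summary(first, second):
    """Return (mean absolute difference, mean signed difference) for two
    frame-aligned measurement lists, computed as first minus second.

    Frames in which either value is 0 (a zeroed, merged formant slot)
    are excluded from both averages.
    """
    diffs = [a - b for a, b in zip(first, second) if a != 0 and b != 0]
    if not diffs:
        raise ValueError("no comparable frames")
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)
    mean_signed = sum(diffs) / len(diffs)
    return mean_abs, mean_signed


# Illustrative values only, not data from the study.
original = [500, 0, 520, 510]
remeasured = [495, 505, 530, 0]
mean_abs, mean_signed = difference_summary(original, remeasured)
```

The absolute figure reflects the typical size of the disagreement, while the signed figure shows whether one set of measurements is systematically higher than the other.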
c. Formant frequencies. Figure 3 shows the average frequencies for F1 and F2 for the three talker groups, along with ellipses fit to each vowel category. Figure 4 shows the individual data points. To improve the clarity of the display, data from /e/ and /o/ have been omitted, and the database has been thinned of redundant data points, resulting in the display of approximately half of the individual points. While there are clearly gross similarities to the PB data, there are numerous differences as well. The degree of crowding among adjacent vowel categories appears much greater than in the PB data, and many of the vowels are in different locations in F1-F2 space than in PB.
Figures 5-7 show acoustic vowel diagrams based on average formant frequencies from the present study and from PB for men, women, and children, respectively. It is difficult to arrive at a simple summary of the differences that are seen in these figures. However, to the extent that conventional articulatory interpretations of formant data are valid, a few general observations can be made. The formant data seem to imply a general tendency toward lower tongue positions in our data and, for the back vowels, more anterior tongue positions. It should be noted, of course, that these differences in the formant patterns could be explained by differences in lip posture instead of, or perhaps in addition to, differences in tongue position. The children in our study seem to show a general tendency toward centralization when compared to the PB data. The central vowels /ʌ/ and /ɝ/ produced by the adult talkers are alone in occupying nearly identical positions in the two sets of formant data.
The vowels occupy similar relative positions in the two sets of data, with the notable exception of /ɛ/ and /æ/. Our data indicate higher F2 values for /æ/ as compared with /ɛ/, and slightly lower F1 values for /æ/ than /ɛ/, although the difference in F1 is not consistent across talker groups. Analysis of data for individual talkers showed that the F2 differences between these two vowels were highly consistent: 91% of the talkers produced an /æ/ with a higher F2 than their /ɛ/. The F1 differences were less consistent, with 68% of the talkers producing an /æ/ with a lower F1 value than their /ɛ/. Our findings for these two vowels contrast not only with PB, but also with Di Benedetto's (1989a) results from three adult talkers, and with a large Texas Instruments database described by Syrdal (1985). As can be seen in Figs. 3 and 4, these two vowels show a very high degree of overlap in F1-F2 space. As will be seen below, these vowels are well identified by listeners, and can be separated well based on acoustic measurements only if spectral change is taken into account.
Figure 8 compares our average values of F3 with those of PB. Overall, the F3 values from the two studies are quite similar, with our measures averaging 113 Hz (4.7%) higher for the men, 47 Hz (1.7%) higher for the women, and 174 Hz (5.5%) lower for the children. The slightly higher F3 values for PB's children are consistent with the F0 differences described previously.

FIG. 2. Scatterplot of F0 values from the present study and from Peterson and Barney (1952) for each of the ten vowels common to the two studies. Data points above the solid line indicate higher F0 values for the Peterson and Barney data.
FIG. 3. Average values of F1 and F2 for men, women, and child talkers for 12 vowels with ellipses fit to the data ("ae"=/æ/, "a"=/ɑ/, "c"=/ɔ/, "^"=/ʌ/, "3"=/ɝ/).
FIG. 4. Values of F1 and F2 for 45 men, 48 women, and 46 children for 10 vowels with ellipses fit to the data ("ae"=/æ/, "a"=/ɑ/, "c"=/ɔ/, "^"=/ʌ/, "3"=/ɝ/). Measurements for /e/ and /o/ have been omitted, and the data have been thinned of redundant data points.
FIG. 5. Acoustic vowel diagrams showing average formant frequencies for men from the present study and from Peterson and Barney ("ae"=/æ/, "a"=/ɑ/, "c"=/ɔ/, "^"=/ʌ/, "3"=/ɝ/).
FIG. 6. Acoustic vowel diagrams showing average formant frequencies for women from the present study and from Peterson and Barney ("ae"=/æ/, "a"=/ɑ/, "c"=/ɔ/, "^"=/ʌ/, "3"=/ɝ/).
FIG. 7. Acoustic vowel diagrams showing average formant frequencies for children from the present study and from Peterson and Barney ("ae"=/æ/, "a"=/ɑ/, "c"=/ɔ/, "^"=/ʌ/, "3"=/ɝ/).
FIG. 8. Scatterplot of F3 values from the present study and from Peterson and Barney (1952). Data points above the solid line indicate higher F3 values for the Peterson and Barney data.
FIG. 9. Spectral change patterns associated with the 12 vowels ("ae"=/æ/, "a"=/ɑ/, "c"=/ɔ/, "^"=/ʌ/, "3"=/ɝ/). The abscissa is the difference between mel-transformed values of F1 and F0, and the ordinate is the difference between mel-transformed values of F2 and F3. The F3-F2 axis has been inverted to produce a display that more closely resembles a conventional F1/F2 plot. The symbol identifying the vowel is plotted at the location of the second sample of the formant pattern, and a line connects this point to the first sample. The largest symbols are used for the men and the smallest symbols are used for the children.

d. Spectral change patterns. Although the primary purpose of this study was to compare our static formant measurements with those of PB, a preliminary analysis was conducted of the patterns of formant frequency change associated with these utterances. Figure 9 is based on the formant pattern sampled at 20% and 80% of vowel duration, averaged across vowels produced by the three groups of talkers. F0 was sampled just once at steady state. The values have been converted to a mel scale using the technical approximation from Fant (1973) and are represented as F1-F0 vs F3-F2. (Note that the F3-F2 axis has been inverted to produce a display that more closely resembles a conventional F1/F2 plot.) This representation is similar to models proposed by Miller (1989) and Syrdal (1985), except that a mel scale is used in place of Miller's log scale and Syrdal's Bark scale.² The symbol identifying the vowel is plotted at the location of the second sample of the formant pattern, and a line connects this point to the first sample. As the figure
indicates, nearly all of the vowels show a good deal of formant frequency change. Further, the formants are moving in such a way as to enhance the contrast between vowels with similar static positions in formant space. For example, the /æ/-/ɛ/ pair shows a high degree of overlap when the formants are sampled at steady state. As the figure shows, these two vowels appear to exhibit distinct spectral change patterns. Likewise, /u/ and /ʊ/ show a high degree of overlap in static F1/F2 space but appear to show distinct patterns of spectral change. The influence of spectral change patterns on the separability of vowel categories will be examined in a more systematic way in the discriminant analysis studies described in Sec. III.

II. VOWEL IDENTIFICATION

A. Methods

1. Listeners

Listeners consisted of 20 undergraduate and graduate students in the Speech Pathology and Audiology Department at Western Michigan University, none of whom had participated as talkers. The choice of phonetically trained listeners was motivated by the findings of Assmann et al. (1982) indicating that a relatively large proportion of identification errors produced by untrained listeners are due primarily to the listeners' uncertainty about how to map perceived vowel quality onto orthographic symbols. All of the listeners had taken an undergraduate course in phonetics, although as a group they would not be considered experienced phoneticians. The dialect screening procedure and subject selection criteria described for the subjects who served as talkers were used for the listeners as well.

2. Procedures

Listeners were tested individually in a quiet room in two sessions lasting approximately 1 h each. Signals were low-pass filtered at 7.2 kHz at the output of a D/A converter (Tucker and Davis DD1), amplified, and delivered to a single loudspeaker (Boston Acoustics A60) at an average intensity of 77 dBA at the listener's head (approximately 70 cm from the loudspeaker). Over the course of the two sessions listeners identified one presentation of each of the 1668 /hVd/ signals. The signals were presented in fully random order (i.e., not blocked by talker), and the randomization was changed daily. This randomization method differs from PB, who tested subjects in blocks of trials which presented listeners with randomly ordered tokens from ten talkers.³

Subjects responded by pressing one of 12 keys on a computer keyboard that had been labeled both with the phonetic symbols and the corresponding key words (e.g., "heed," "hid," "head," etc.). Each listening test was preceded by a brief practice session to ensure that subjects understood how the key labels were to be interpreted.

3. Results

Table VI presents a summary of identification rates for each vowel category, along with comparable data from PB. The full confusion matrix is shown in Table VII. The results are generally quite similar to PB. The overall identification rates from the two studies are quite similar, as are the rates for most of the individual vowels. Our results show slightly poorer identification of the /ɑ/-/ɔ/ pair, but somewhat better identification of /ɪ/ and of the /æ/-/ɛ/ pair. The relatively high identifiability of /æ/ and /ɛ/ is interesting in light of the poor separation of these vowels based on static measures of F1 and F2.

Figure 10 shows a histogram of identification rates for the individual signals. The majority of the signals (65%) were identified unanimously by the listeners, and 89% were identified at rates of 90% or greater. For 33 signals (2%) the majority vote of the listeners was a vowel other than that intended by the talker. Most of these (55%) were intended as /ɔ/ but heard as /ɑ/, but there were also instances such as /æ/ heard as /ɛ/, /ɛ/ heard as /æ/, /ɑ/ heard as /ɔ/, and /ɪ/ heard as /ɛ/. It seems logical to interpret signals consistently heard as a vowel other than that intended by the talker as production errors. While the number of tokens involved is a small proportion of the total, by this criterion nearly 14% of the attempts at /ɔ/ were production errors. It is clear, then, that the dialect screening procedures were ineffective in some cases; that is, there were some speakers who passed the /ɑ/-/ɔ/ dialect screening but, for reasons that are not clear, did not produce a convincing /ɔ/ when the /hVd/ syllables were recorded.

Overall identification rates were highest for the women talkers and lowest for the children. A repeated-measures analysis of variance for talker gender using arcsine-transformed overall identification rates for the 20 listeners was significant [F(2,57)=14.7, p<0.001]. Post-hoc analysis showed significant differences among all three talker groups. It is important to note, however, that the magnitude of the effect is quite small, with only 1.9% separating the highest and lowest identification rates. It is interesting that there is no evidence at all that signals with higher fundamental frequencies are more poorly identified, despite the fact that formant peaks are more poorly defined in signals with wide harmonic spacing.

III. DISCRIMINANT ANALYSIS

The purpose of the discriminant analyses was to determine how well the vowels could be separated based on various combinations of the acoustic measurements and, where appropriate, to compare these results with similar analyses of the PB database. A quadratic discriminant analysis technique (Johnson and Wichern, 1982) was used for classification. In all cases, the "jackknife" technique was used in which statistics for an individual token are removed from the training data prior to the attempt to classify the token (see also Syrdal and Gopal, 1986). Tokens showing a merger in a formant slot that was included in the parameter list were not included in the analyses.

Table VIII shows classification results for static parameter sets, i.e., measurements sampled once at steady state. To facilitate comparisons with PB, /e/ and /o/ were not included in these tests. As might have been expected based on inspection of Figs. 3 and 4, our vowels do not separate as well as the PB data based on static measures of F1 and F2. The differences in category separability become smaller as more parameters are added, but in all cases the PB vowels are classified with greater accuracy than ours.
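The jackknife procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis software used in the study: it handles only two parameters (e.g., F1 and F2), assumes equal prior probabilities for all vowel categories, and uses hypothetical function names and made-up token values.

```python
import math


def gaussian_stats(points):
    """Mean and 2x2 sample covariance (sxx, sxy, syy) of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def quadratic_score(x, mean, cov):
    """Gaussian log density up to a constant (equal priors assumed)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    mahalanobis = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * mahalanobis


def jackknife_accuracy(tokens):
    """tokens: list of (label, (f1, f2)). Each token is removed from the
    training statistics before it is classified."""
    correct = 0
    for i, (label, x) in enumerate(tokens):
        by_class = {}
        for j, (lab, pt) in enumerate(tokens):
            if j != i:  # leave the test token out of the training data
                by_class.setdefault(lab, []).append(pt)
        scores = {lab: quadratic_score(x, *gaussian_stats(pts))
                  for lab, pts in by_class.items()}
        if max(scores, key=scores.get) == label:
            correct += 1
    return correct / len(tokens)


# Made-up F1/F2 values (Hz) for two well separated categories.
tokens = [
    ("i", (300, 2300)), ("i", (320, 2250)), ("i", (290, 2400)), ("i", (310, 2350)),
    ("a", (750, 1200)), ("a", (770, 1150)), ("a", (730, 1300)), ("a", (760, 1250)),
]
accuracy = jackknife_accuracy(tokens)
```

Because each category has its own covariance matrix, the decision boundaries are quadratic rather than linear. With larger parameter sets the same logic applies, but a general matrix inverse and determinant would be needed in place of the 2x2 formulas used here.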
Table IX demonstrates the effects of including vowel duration and spectral change information on classification accuracy. Only data from the present study are included, again omitting /e/ and /o/. The one-sample results are based on a single sample of the formant pattern at steady state, the two-sample results are based on samples taken at 20% and 80% of vowel duration, and the three-sample results are based on samples taken at 20%, 50%, and 80% of vowel duration. For the two- and three-sample parameter sets including F0, a single sample of F0 at steady state was used. Results are shown for all tokens in the database, and for a subset of the data that excluded individual tokens that showed identification error rates of 15% or greater (11.5% of the tokens).

It can be seen that including vowel duration in the parameter set results in a consistent improvement in performance, especially for the simplest parameter sets such as single-sample F1-F2. However, the most dramatic effect is seen when comparing a single sample of the formant pattern with two samples. The improvement in classification accuracy averages 11.2%, and is especially large for the parameter sets involving F1 and F2 alone. Adding a third sample of the formant pattern produces little or no improvement in classification accuracy. This would seem to suggest that only a very coarse representation of the spectral change pattern is needed for classification. It can also be seen that omitting tokens with relatively high identification error rates produces a consistent improvement in classification accuracy. This finding indicates that the errors produced by the pattern classifier tend to occur more often for tokens that are poorly identified by listeners.

Although not shown in the table, the same classification tests were conducted using the full set of 12 vowels. The overall pattern of results was quite similar to that shown in Table IX, except that the improvement in classification accuracy with two samples of the formant pattern was somewhat larger. This is a logical result given that the two vowels that were added, /e/ and /o/, are nearly always diphthongized.
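The one-, two-, and three-sample parameter sets can be illustrated with a short Python sketch. It assumes that each formant track is stored as equally spaced frames running from vowel onset to vowel offset and that values between frames are obtained by linear interpolation; both are assumptions made for illustration, and the function names and track values are hypothetical.

```python
def sample_track(track, fractions):
    """Linearly interpolate a formant track at fractions of vowel duration.

    `track` holds formant values (Hz) at equally spaced frames, with the
    first frame at vowel onset and the last at vowel offset (assumed).
    """
    last = len(track) - 1
    samples = []
    for frac in fractions:
        pos = frac * last          # fractional frame index
        lo = int(pos)
        hi = min(lo + 1, last)
        w = pos - lo
        samples.append((1.0 - w) * track[lo] + w * track[hi])
    return samples


def parameter_set(f1_track, f2_track, fractions=(0.2, 0.8), duration_ms=None):
    """Build an F1,F2 parameter vector, optionally including vowel duration."""
    features = sample_track(f1_track, fractions) + sample_track(f2_track, fractions)
    if duration_ms is not None:
        features.append(float(duration_ms))
    return features


# Hypothetical 11-frame tracks for a diphthongized vowel (illustrative values).
f1 = [480, 470, 460, 450, 440, 430, 420, 410, 400, 390, 380]
f2 = [1900, 1950, 2000, 2050, 2100, 2150, 2200, 2250, 2300, 2350, 2400]

two_sample = parameter_set(f1, f2, fractions=(0.2, 0.8), duration_ms=260)
three_sample = parameter_set(f1, f2, fractions=(0.2, 0.5, 0.8))
```

For a vowel whose formants move steadily, as in this made-up example, the two-sample vector already captures the direction and extent of the movement, which is consistent with the small gain reported for adding a third sample.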
IV. DISCUSSION

The original intent of this study was to collect a database of acoustic measurements for /hVd/ utterances comparable to PB, but with additional measures of duration and spectral change that could be used to study the role of dynamic properties in vowel recognition. The differences that were observed between our static measurements of formant patterns and those of PB were not anticipated. One possible explanation of these differences has to do with our use of LPC as opposed to the more direct spectrum analysis method used by PB. This possibility cannot be eliminated entirely, particularly since we attempted no direct comparisons of our LPC measurements with measures obtained using PB's spectrographic technique. However, the differences between our data and those of PB strike us as both too numerous and too diverse to be explained by differences in spectrum analysis methods. Although not entirely conclusive, the comparisons that were made between formants measured from LPC and cepstrally smoothed Fourier spectra also make it seem unlikely that the differences can be attributed to spectrum analysis methods. The close similarity in F3 values and the nearly identical formant values for the central vowels produced by the adult talkers from the two studies would also seem to argue against this interpretation.

It also seems unlikely that the discrepancies can be attributed to differences in the times at which the formant patterns were sampled. Our data showed that steady-state times could be located with a surprisingly high degree of reliability. In addition, in data not reported here, we determined for three vowels (/ɛ/, /æ/, and /u/) that even the most procrustean method of locating steady-state times could not bring the two sets of formant data into convergence.⁴

If we assume, then, that the acoustic measurements were made with roughly equal precision in the two studies, then it must be the case that many of these vowels were simply produced in different ways by the two groups of talkers. As was indicated previously, little is known about the dialect of the PB talkers except: (1) most of the women were raised in the mid-Atlantic region; (2) the men represented "... a broad regional sampling of the United States ..." (Peterson and Barney, 1952, p. 177); (3) a "few" of the talkers learned English as a second language; and (4) "most" of the talkers spoke General American. Perhaps more important than potential differences in regional dialect is the passage of some 40 years in the times at which the two sets of recordings were made. It is well known that significant changes in speech production can occur over a period of several decades. For example, Bauer (1985) showed clear evidence of significant diachronic vowel shifts when comparing recordings of British RP speakers made in 1949 with similar recordings made in 1966. A second comparison of comparable recordings made in 1982 showed a continuation of the same vowel shifts. There are almost certainly some differences between our data and those of PB that can be attributed, at least in part, to diachronic change. For example, the "raising" of /æ/, with the consequent reduction in contrast in the static formant positions for /æ/ and /ɛ/, has been well documented in several dialects of American English (e.g., Labov et al., 1972).

It should also be noted that the Texas Instruments database described by Syrdal (1985) shows some rather large differences from PB. The TI values for the front vowels are quite similar to PB, but rather large differences are seen for the back vowels. The differences from PB are in the same direction as in our data (i.e., implying lower and more anterior tongue positions compared to PB), but the discrepancies are even larger. As with the PB study, little is known about the dialect of the speakers in the TI study.

There is one final point worth noting about the discrepancies in formant frequencies across these three studies. There has been a tendency to view the PB database as a benchmark of sorts, establishing the set of formant frequencies for American English vowels. For example, the PB measurements are frequently used as control parameters in speech synthesis studies, and often serve to define prototypes for vowel categories. The PB measurements have also been heavily used to evaluate vowel normalization algorithms and are frequently used in cross-language comparisons and comparisons between normal and disordered speech. The present results, along with those of Syrdal (1985) and Bauer (1985), serve as a reminder that a study of this kind can only hope to establish a set of formant frequencies that are typical of a specific dialect at a specific time in the history of that dialect.

In contrast to the numerous differences in acoustic measurements between our study and PB, the two listening studies produced very similar results. The overall identification rates from the two studies are quite similar, as are the rates for individual vowels. Although a detailed analysis of the relationships between the acoustic and perceptual data from the present study will have to await further study, it seems quite clear that the frequencies of F1 and F2 at steady state are not good predictors of the identification results. The clearest example is the /æ/-/ɛ/ pair, which was identified quite well by listeners despite very poor separation in static F1-F2 space.

Another indication that static measures of F1 and F2 are poor predictors of vowel identification is the general finding that the significantly increased crowding of vowels in static F1-F2 space relative to PB was not accompanied by an increase in perceptual confusions among vowels. Although it can only be guessed at, one possibility that might be considered is that our talkers produced more heavily diphthongized vowels than PB's talkers. According to this view, the greater degree of crowding in the static formant space of our talkers might be offset by an increase in spectral change, resulting in a set of vowels that are as distinct as the PB vowels. However, this speculation cannot be confirmed and, in the absence of the original PB recordings, it is difficult to go beyond the general suggestion of dialect differences between the two groups of talkers.
TABLE VI. Overall percent correct identification by vowel category for the present study (HGCW) and for Peterson and Barney (PB).

Vowel       HGCW    PB
/i/         99.6    99.9
/ɪ/         98.8    92.9
/e/         98.3    a
/ɛ/         95.1    87.7
/æ/         94.1    96.5
/ɑ/         92.3    87.0
/ɔ/         82.0    92.8
/o/         99.2    a
/ʊ/         97.5    96.5
/u/         97.2    99.2
/ʌ/         90.8    92.2
/ɝ/         99.5    99.7
Total       95.4    94.4
Men         94.6    b
Women       95.6    b
Children    93.7    b

a These vowels were not recorded by Peterson and Barney.
b Peterson and Barney did not report results separately for men, women, and child talkers.

It is important to note that listeners were asked to identify utterances by choosing a label from a closed set of broad phonemic categories. It should not be concluded that all utterances that were assigned the same phonemic label are phonetically equivalent (see Ladefoged, 1967, for a discussion). Even a casual listening by an experienced phonetician shows clearly that there is a range of phonetic qualities within the vowel categories, even when considering only tokens that were well identified by listeners. It might be useful to reanalyze a subset of the utterances that are judged by experienced phoneticians to be good examples of the various vowel qualities. It seems likely that within-vowel-category variability in the acoustic measures for a subset of this kind would be substantially reduced.

One other aspect of the listening test deserves comment. The large number of speakers and full randomization of both talkers and vowels make it very unlikely that listeners could have made use of a vocal tract normalization process of the kind described by Ladefoged and Broadbent (1957) and others. While there is some evidence that listeners can make use of speaker-specific normalizing information in certain kinds of tasks, the relatively high identification rates obtained in the present study would seem to indicate that accurate vowel identification does not require calibration to individual speakers.

The discriminant analysis results showed that our vowels could not be separated well based on a single sample of
TABLE VII. Confusion matrix for /hVd/ utterances produced by 45 men, 48 women, and 46 children.
Vowel identified by listener
Ill
/i/
99.6
hi
/el
Vowel
It/
intended
/ae/
by
/o/
talker
/a/
1o/
/u/
/u/
0.6
/#
/e/
0.1
0.1
98.8
0.2
0.3
0.5
0.1
983
It/
/ae/
/al
/o/
/ul
lul
It,/
In'/
0.1
0.9
0.2
0.I
0. I
95.1
0.3
3.7
0.2
0.1
0.1
5.6
94.1
0.2
0.1
0.3
92.3
3.5
13.8
82.0
0.1
0.1
0.3
3.7
0.1
0.2
1.8
0.2
0.2
0. I
3.3
0.1
3.8
1.3
97.2
1.0
0.4
0.1
0.5
97.5
1.9
0.3
3.2
0.2
90.8
99.2
0.1
IM
/a,/
/a/
0.2
0. I
0.2
0.2
0.I
0.2
99.5
TABLE VIII. Quadratic discrimination results for the present data (HGCW) and for the PB data set based on a single sample of the formant pattern. The table shows overall classification accuracy using the "jackknife" method in which measurements for individual tokens are removed from the training statistics prior to classification.

Parameter set    HGCW    PB
F1,F2            68.2    74.9
F1,F2,F3         81.0    83.6
F0,F1,F2         78.2    85.9
F0,F1,F2,F3      84.7    86.6

[FIG. 10 axis: percent correct identification for individual tokens, 10 to 100 in steps of 5.]
the formant pattern, especially F1 and F2 alone. The same is true of the PB data, but to a lesser degree. Adding vowel duration measures resulted in consistent but fairly modest improvements in classification accuracy, and including two samples of the formant pattern produced large improvements in category separability. These findings are consistent with Zahorian and Jagharghi (1993), who showed much better classification rates for dynamic versus static representations of both formants and overall spectral shape. Zahorian and Jagharghi are also in agreement in showing much larger improvements in classification accuracy with the addition of spectral change as compared to vowel duration. The present results, along with those of Zahorian and Jagharghi, are consistent with many recent findings suggesting that the vowels of American English are more appropriately viewed not as points in phonetic space but rather as trajectories through phonetic space. For example, several studies have shown very high identification rates for "silent center" stimuli consisting of onglides and offglides only (e.g., Jenkins et al., 1983; Nearey, 1989; Nearey and Assmann, 1986). Complementing these results are Hillenbrand and Gayvert's (1993b) findings showing that steady-state vowels synthesized from the PB measurements are not well identified by listeners (see also Fairbanks and Grubb, 1961). Taken together, the silent center and steady-state resynthesis results suggest that static spectral targets are neither necessary nor sufficient for accurate vowel recognition.

It is interesting in this regard that the importance of dynamic information in vowel identification was recognized by PB, who commented, "It is the present belief that the complex acoustical patterns represented by the words are not adequately represented by a single section, but require a more complex portrayal" (Peterson and Barney, 1952, p. 184). The precise nature of this "more complex portrayal" remains unclear, however, since we still do not know how listeners map spectral change patterns onto perceived vowel quality. Our discriminant analysis findings indicated that a fairly coarse, two-sample representation of the formant pattern is all that is necessary for accurate vowel classification.

FIG. 10. Histogram of percent correct identification rates for individual tokens, where "correct" means that the listener identified the vowel as the one intended by the talker.
TABLE IX. Quadratic discrimination results for the present data showing the effect of including duration and spectral change information on classification accuracy. The table shows overall classification accuracy using the "jackknife" method in which measurements for individual tokens are removed from the training statistics prior to classification. The one-sample results are based on a single sample of the formant pattern at "steady state;" the two-sample results are based on samples taken at 20% and 80% of vowel duration; the three-sample results are based on samples taken at 20%, 50%, and 80% of vowel duration. ("NoDur"=vowel duration not included; "Dur"=vowel duration included.) Entries under the heading of "All tokens" used the full database; entries under the heading "Well identified tokens only" are based on a data set that did not include tokens with error rates of 15% or greater (11.5% of the tokens).

All tokens
                 One sample       Two samples      Three samples
Parameter set    NoDur   Dur      NoDur   Dur      NoDur   Dur
F1,F2            68.2    76.1     87.9    91.6     87.7    91.0
F1,F2,F3         81.0    84.6     91.6    93.6     91.8    92.8
F0,F1,F2         78.2    82.0     90.3    92.5     90.4    92.6
F0,F1,F2,F3      84.7    87.8     92.7    94.1     93.1    94.8

Well identified tokens only
                 One sample       Two samples      Three samples
Parameter set    NoDur   Dur      NoDur   Dur      NoDur   Dur
F1,F2            71.4    82.3     90.8    95.5     90.7    94.8
F1,F2,F3         85.3    88.7     95.4    97.3     95.3    96.6
F0,F1,F2         80.0    86.3     93.6    96.3     93.3    96.0
F0,F1,F2,F3      89.1    91.6     96.2    97.8     95.8    97.3
However, that does not imply that the details of the formant change pattern are unimportant to the listener. Additional studies using synthesis methods are needed to learn more about the specific mapping relations that are involved in vowel recognition. Also needed are studies of spectral change patterns in more complex phonetic environments than the /hVd/ utterances examined here (e.g., Stevens and House, 1963). While it seems certain that the associations which we observed between vowel categories and spectral change patterns will be less straightforward in more complex phonetic environments, the extent of the oversimplification in our data is as yet unknown.
ACKNOWLEDGMENTS

We are very grateful to Michelle Malta, who donated many hours to this project and set the tone for dedication and attention to detail. The able assistance of Emily Well and Matt Phillips is also gratefully acknowledged. We would also like to thank Terry Nearey for helpful advice provided at various points along the way, and Jim Flege for comments on a previous draft. This work was supported by a research grant from the National Institutes of Health (NIDCD 1-R01-DC01661) and by the Air Force Systems Command, Rome Air Development Center, Griffiss Air Force Base, and the Air Force Office of Scientific Research (Contract No. F30602-85-C-0008).

¹The issue of whether to screen talkers' productions based on the experimenter's phonetic judgments generated a great deal of discussion as the project was being planned. We ultimately settled on the method described above, in which we trained the subjects as carefully as possible but did not intervene once the tape recorder was started, except in cases of dysfluency or obvious reading errors. The reason that we chose this approach is that, along with PB, we wanted to determine how accurately listeners identified naturally produced vowels, where "accurate" is defined as agreement between the listener and the intent of the speaker. If stimuli are screened based on the experimenter's judgments of correct production, the question changes from a two-way agreement between the talker and the listener to a three-way agreement among the talker, the listener, and the experimenter. It is not known whether stimuli were screened for correct pronunciation when the PB recordings were made.
²The mel scale was chosen over the Bark and log transforms because this transformation did a better job of grouping vowels produced by men, women, and children. This representation is similar to a normalization scheme proposed by Peterson (1951).
³Four additional listeners were tested both on the 12-vowel, full randomization task described here and on a ten-vowel, blocked-by-speaker task comparable to PB. Results for the ten vowels common to the two tasks were very similar.
⁴A graduate student was given a copy of the PB data and asked to choose a steady-state location in such a way that the values of F1 and F2 would be closest to the averages from PB, even if that location made no sense in relation to the objective of finding the steadiest portion of the vowel. Differences between our data and those of PB remained even when this procedure was used.

Ainsworth, W. A. (1972). "Duration as a cue in the recognition of synthetic vowels," J. Acoust. Soc. Am. 51, 648-651.
Allen, G. D. (1978). "Vowel duration measurement: A reliability study," J. Acoust. Soc. Am. 63, 1176-1185.
Assmann, P., Nearey, T. M., and Hogan, J. (1982). "Vowel identification: Orthographic, perceptual, and acoustic factors," J. Acoust. Soc. Am. 71, 975-989.
Bauer, L. (1985). "Tracing phonetic change in the received pronunciation of British English," J. Phon. 13, 61-81.
Bennett, D. C. (1968). "Spectral form and duration as cues in the recognition of English and German vowels," Lang. Speech 11, 65-85.
Black, J. W. (1949). "Natural frequency, duration, and intensity of vowels in reading," J. Speech Hear. Disord. 14, 216-221.
Di Benedetto, M-G. (1989a). "Vowel representation: Some observations on temporal and spectral properties of the first formant frequency," J. Acoust. Soc. Am. 86, 55-66.
Di Benedetto, M-G. (1989b). "Frequency and time variations of the first formant: Properties relevant to the perception of vowel height," J. Acoust. Soc. Am. 86, 67-77.
Fairbanks, G., and Grubb, P. (1961). "A psychophysical investigation of vowel formants," J. Speech Hear. Res. 4, 203-219.
Fant, G. (1973). Speech Sounds and Features (MIT, Cambridge, MA).
Hillenbrand, J. (1988). "MPITCH: An autocorrelation fundamental-frequency tracker," [Computer Program], Western Michigan University, Kalamazoo, MI.
Hillenbrand, J., and Gayvert, R. T. (1993a). "Vowel classification based on fundamental frequency and formant frequencies," J. Speech Hear. Res. 36, 647-700.
Hillenbrand, J., and Gayvert, R. T. (1993b). "Identification of steady-state vowels synthesized from the Peterson and Barney measurements," J. Acoust. Soc. Am. 94, 668-674.
Jenkins, J. J., Strange, W., and Edman, T. R. (1983). "Identification of vowels in 'vowelless' syllables," Percept. Psychophys. 34, 441-450.
Johnson, R. A., and Wichern, D. W. (1982). Applied Multivariate Statistical Analysis (Prentice-Hall, Englewood Cliffs, NJ).
Kent, R. D., and Forner, L. L. (1980). "Speech segment durations in sentence recitations by children and adults," J. Phon. 8, 157-168.
Labov, W., Yaeger, M., and Steiner, R. (1972). "A quantitative study of sound change in progress," Report on National Science Foundation Contract NSF-FS-3287.
Ladefoged, P. (1967). Three Areas of Experimental Phonetics (Oxford U.P., London).
Ladefoged, P., and Broadbent, D. E. (1957). "Information conveyed by vowels," J. Acoust. Soc. Am. 29, 98-104.
Lippmann, R. P. (1989). "Review of neural networks for speech recognition," Neural Computation 1, 1-38.
Miller, J. D. (1989). "Auditory-perceptual interpretation of the vowel," J. Acoust. Soc. Am. 85, 2114-2134.
Nearey, T. M. (1978). Phonetic Feature Systems for Vowels (Indiana University Linguistics Club, Bloomington, IN).
Nearey, T. M. (1989). "Static, dynamic, and relational properties in vowel perception," J. Acoust. Soc. Am. 85, 2088-2113.
Nearey, T. M. (1992). "Applications of generalized linear modeling to vowel data," in Proceedings ICSLP 92, edited by J. Ohala, T. Nearey, B. Derwing, M. Hodge, and G. Wiebe (University of Alberta, Edmonton, AB), pp. 583-586.
Nearey, T. M., and Assmann, P. (1986). "Modeling the role of vowel inherent spectral change in vowel identification," J. Acoust. Soc. Am. 80, 1297-1308.
Nearey, T. M., Hogan, J., and Rozsypal, A. (1979). "Speech signals, cues and features," in Perspectives in Experimental Linguistics, edited by G. Prideaux (Benjamin, Amsterdam).
Peterson, G. E. (1951). "The phonetic value of vowels," Language 27, 541-553.
Peterson, G. E., and Barney, H. L. (1952). "Control methods used in a study of the vowels," J. Acoust. Soc. Am. 24, 175-184.
Peterson, G. E., and Lehiste, I. (1960). "Duration of syllable nuclei in English," J. Acoust. Soc. Am. 32, 693-703.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1988). Numerical Recipes in C (Cambridge U.P., Cambridge, MA).
Smith, B. L. (1978). "Temporal aspects of English speech production: A developmental perspective," J. Phon. 6, 37-67.
Smith, B. L., Hillenbrand, J., and Ingrisano, D. R. (1986). "Comparison of temporal measures of speech using spectrograms and digital oscillograms," J. Speech Hear. Res. 29, 270-274.
Stevens, K. N. (1959). "The role of duration in vowel identification," Q. Progr. Rep. 52, Research Laboratory of Electronics, MIT.
Stevens, K. N., and House, A. S. (1963). "Perturbation of vowel articulations by consonantal context: An acoustical study," J. Speech Hear. Res. 6, 111-128.
Strange, W. (1989). "Dynamic specification of coarticulated vowels spoken in sentence context," J. Acoust. Soc. Am. 85, 2135-2153.
Strange, W., Jenkins, J. J., and Johnson, T. L. (1983). "Dynamic specification of coarticulated vowels," J. Acoust. Soc. Am. 74, 695-705.
Syrdal, A. K. (1985). "Aspects of a model of the auditory representation of American English vowels," Speech Commun. 4, 121-135.
Syrdal, A. K., and Gopal, H. S. (1986). "A perceptual model of vowel recognition based on the auditory representation of American English vowels," J. Acoust. Soc. Am. 79, 1086-1100.
Tiffany, W. (1953). "Vowel recognition as a function of duration, frequency modulation and phonetic context," J. Speech Hear. Disord. 18, 289-301.
Watrous, R. L. (1991). "Current status of the Peterson-Barney vowel formant data," J. Acoust. Soc. Am. 89, 2459-2460.
Whalen, D. H. (1989). "Vowel and consonant judgments are not independent when cued by the same information," Percept. Psychophys. 46, 284-292.
Zahorian, S. A., and Jagharghi, A. J. (1993). "Spectral shape versus formants as acoustic correlates for vowels," J. Acoust. Soc. Am. 94, 1966-1982.