(1995). "Acoustic characteristics of American English vowels
Transcription
(1995). "Acoustic characteristics of American English vowels
Acoustic characteristics of American English vowels James Hillenbrand,LauraA. Getty,MichaelJ. Clark,and KimberleeWheeler Department ojeSpeechPathologyandAudiology,Western MichiganUniversity, Kalamazoo, Michigan49008 (Received 10August1994;revised7 November 1994;accepted 17 January 1995) The purposeof this studywas to replicateand extendthe classicstudyof vowel acousticsby Peterson andBarney(PB) [J.Acoust.Soc.Am. 24, 175-184 (1952)].Recordings weremadeof 45 men,48 women,and46 childrenproducingthevowels/i,t,e,e,a:,a,•,o,u,u,n,3•/in h-V-d syllables. Formantcontoursfor F1-F4 were measuredfrom LPC spectrausinga custominteractiveediting tool. For comparison with the PB data,formantpatternswere sampledat a time thatwasjudgedby visual inspectionto be maximallysteady.Analysisof the formantdatashowsnumerousdifferences betweenthepresentdataandthoseof PB,bothin termsof averagefrequencies of F1 andF2, and thedegreeof overlapamongadjacentvowels.As with theoriginalstudy,listeningtestsshowedthat thesignalswerenearlyalwaysidentifiedasthevowelintendedby thetalker.Discriminantanalysis showedthatthevowelsweremorepoorlyseparated thanthePB databasedon a staticsampleof the formantpattern.However,thevowelscanbe separated with a highdegreeof accuracyif duration andspectralchangeinformationis included. PACS numbers: 43.70.Fq,43.71.Es,43.72.Ar INTRODUCTION bodyof evidenceindicatingthatdynamicproperties suchas durationandspectralchangeplay an importantrole in vowel The mostwidelycitedexperimenton the acoustics and perception(e.g.,Ainsworth,1972; Bennett,1968;Di Beneperceptionof vowelsis a surprisinglysimplestudycon- detto, 1989ab; Hillenbrandand Gayreft, 1993b; Jenkins ductedat Bell TelephoneLaboratories by PetersonandBaretal., 1983; Nearey, 1989; Nearey and Assmann,1986; ney (1952) shortlyafterthe introduction of the soundspec- Stevens,1959; Strange,1989; Strangeet al., 1983;Tiffany, trograph. Peterson andBarney(PB) recorded two repetitions 1953; Whalen, 1989). Other limitationsof the PB database of ten vowels in /hVd/ contextspokenby 33 men, 28 include: (1) There is no indicationthat subjectswere women, and 15 children. Acoustic measurements from screened for dialect,andvery little is knownaboutthe dianarrow-band spectraconsisted of formantfrequencies (F llectof eitherthespeakers or thelisteners; (2) listening results F3), formantamplitudes, andfundamental frequency(F0). werenotreportedseparately for men,women,andchildtalkThe measurements were takenat a singletime slicethatwas ers;(3) no informationis givenaboutthe ageor genderof judgedto be "steadystate."The /hVd/ signalswere also the childtalkers;(4) measures weremadefrom a relatively presentedto listenersfor identification.The resultsof the smallgroupof children;(5) thereis nowayto determine the measurement studyshoweda strongrelationship betweenthe identifiability of individualtokens;(6) measurement reliabilintendedvowel andtheformantfrequencypattern.However, ity wasnotreported; and(7) sincetheoriginalsignalsareno therewas considerable formantfrequencyvariabilityfrom longer available,the databasecannotbe used to evaluate onespeakerto thenext,andtherewasa substantial degreeof signal representations other than F0 and formantfrequenoverlapin the formantfrequencypatternsamongadjacent cies. vowels. The listeningstudy showedthat the vowels were The presentstudyrepresents an attemptto addressthese highly identifiable:The overall error rate was 5.6%, and limitations.Recordings were madeof/hVd/utterancessponearlyall of theerrorsinvolvedconfusions betweenadjacent ken by a largegroupof men,women,andchildren.Measurevowels. The PB measurements haveplayeda centralrole in the development and testingof theoriesof vowel recognition. Acousticmeasurements for the signalsrecordedby PB have beenwidelydistributed to speech research laboratories (e.g., Watrous,1991) and havebeenusedin numerousstudiesto evaluate alternativemodels of vowel recognition(e.g., Nearey,1978; Neareyet aL, 1979; Syrdal,1985; Syrdaland Gopal,1986; Nearey,1992; Lippmann,1989; Miller, 1989; Hillenbrandand Gayreft, 1993a).Despitethe widespread use of the PB measurements, thereare severalwell recognizedlimitationsto thedatabase. Perhapsthemostimportant limitation is that the databaseconsi•t• exclusivelyof acoustic mentswere made of vowel duration,F0 contours,and for- mantfrequency contours. The signalswerealsopresented to a panel of listenersfor identification.Finally, discriminant analysiswas usedto classflythe signalsusingvariouscombinations of the acoustic measurements. I. ACOUSTIC ANALYSIS A. Methods 1. Talkers Talkers consistedof 45 men, 48 women, and 46 ten- to 12-year-oldchildren(27 boys,19 girls).The majorityof the measurements takenat a singletime slice.Durationmeasure- speakers (87%) wereraisedin Michigan'slowerpeninsula, ments were not made, and no information is available about primarily the southeastern and southwestern parts of the state.The remainderwere primarilyfrom otherareasof the thepatternof spectralchangeovertime.Thereis now a solid 3099 J. Acoust.Soc. Am. 97 (5), Pt. 1, May 1995 0001-4966/95/97(5)/3099/13/$6.00 ¸ 1995 AcousticalSocietyof America 3099 upper midwest,such as Illinois, Wisconsin,Minnesota, northernOhio, andnorthernIndiana.An extensivescreening procedure wasusedto selectthese139 subjects from a larger group.The most importantpart of the screeningprocedure was a carefuldialectassessment, focusingespeciallyon subjects' productionof the/a/-/•/distinction. The/a/-/•/distinctionis not maintainedby many speakersof American English,a fact whichwe believed(incorrecfiy, as it turned out) mightaccountfor the relativelyhighconfusability reportedby PB for thispair of vowels. The screening procedure beganwith a 5- to 7-min informal conversation with one of the experimenters. This conversationwas tape recordedfor later review by an experiencedphonetitian.Subjectsnext read a 128-wordpassage tOO0 r• 3000 that contained several instances of words with/o/and/•/. Subjectswere eliminatedif the phoneticiannotedany systematicdeparturefrom generalAmericanEnglish,or if the speakerfailed to maintainthe /o/-/•/ distinctioneitherin spontaneous speechor in the 128-wordpassage.Subjects were also requiredto passa brief task which testedtheir ability to discriminate/n/-/•/minimal pairs. In additionto the dialectassessment, subjects were eliminatedif they:(1) were non-native speakers of English;(2) showedany evidenceof a speech, language, or voicedisorder; (3) showed anyevidence of a currentrespiratory infection; or (4) faileda 20-dB pure-tonescreeningat 500, 1000, and 2000 Hz. 2. Recordings Audio recordings were madeof subjectsreadinglists containing12 vowels:The ten vowels recordedby PB (/ij,œ,•,o,•,u,u•%a•/)plus /e/ and /o/. Also recordedwere four diphthongs in/h-d/context, andbothvowelsanddiphthongsin isolation.Only resultsfrom the 12/hVd/utterances will be describedin thisreport.Subjectsreadfrom oneof 12 different randomizationsof a list containingthe words "heed,""hid," "hayed,""head,""had," "hod," "hawed," "hoed," "hood," "who'd," "hud," "heard," "hoyed," "hide," "hewed," and "how'd." Subjectswere given as muchtimeas neededto practicethe taskanddemonstrate an understanding of the pronunciations that were expectedfor eachkeyword.Recordings weremadeof severalreadings of the list oncethe experimenter was satisfiedthatthe subject understoodthe task. Once the recordingsessionbegan, the experimenterdid not auditioneachstimulusand requestadditionalreadingsbasedon the experimentefts judgmentof correct pronunciation.: An attempt wasmadeto record at leastthreereadingsof the list. This wasoftennot possiblein the caseof the children,who took longerto train thanadults and sometimestired of the task after two readings. The recordings weremadewith a digitalaudiorecorder (SonyPCM-F1)anda dynamicmicrophone (Shure570-S). One tokenof eachstimulusfrom eachtalkerwas low-pass filtered at 7.2 kHz and digitizedat 16 kHz with 12 bits of amplituderesolutionon a PDP 11/73computer. Unlessthere were problemswith recordingfidelity or backgroundnoise, tokensweretakenfrom the subject'sfirstreadingof the list. The gainon an inputamplifierwasadjusted individuallyfor eachtokensothatthepeakamplitudewasat least80% of the _+10-Vdynamicrangeof theA/D, with no peakclipping. 3100 J. Acoust.Soc.Am.,Vol.97, No. 5, Pt. 1, May 1995 TIME FIG. 1. Spectralpeakdisplayof the word "heard"spokenby a child.The dashedverticallinesindicatethe beginningandend of the vowel nucleus. The top panelshowsthe signalafterthe original14-poleLPC analysis,the middlepanelshowsthe signalafterreanalysiswith 18 poles,andthe bottom panelshowsthe signalafter handeditingwith a customeditingtool. & Acoustic measurements a. Voweldurationand "steady-state" times. The starting and endingtimesof vocalicnucleiwere measuredby hand from high-resolution gray-scaledigital spectrograms usingstandard measurement criteria(Peterson andLehiste, 1960).In anattemptto producea datasetcomparable to PB, two experimenters, workingindependently, madea judgment of steady-state timefor eachsignal.The measures weremade while viewinga spectralpeakdisplay(Fig. 1) anda grayscale spectrogram.PB provide a very brief descriptionof how steady-state timeswerelocated,indicatingonly thatthe spectrum wassampled,"... followingthe influenceof the/hi andprecedingthe influenceof the/d/, duringwhicha prac- ticallysteadystateis reached" (Peterson andBarney,1952, p. 177).The two experimenters workedfrom thisbrief description,andfromthetenexamples shownin Fig. 2 of PB. In additionto the hand measurements of steady-state times,we experimented with severalmethodsof determining steady-statetimes automaticallythroughan analysisof edited formant contours. Of the several methods that were tried, the techniquethat seemedto producethe bestresults definedsteadystateas the centerof the sequenceof seven analysisframes(56 ms)with theminimumslopein logF2logF1 space(Miller, 1989). b. Formant contours. Formant-frequencyanalysisbegan with the calculation of 14-pole, 128-point linearpredictive coding(LPC)spectra every8 msover16 ms(256 point)hammingwindowedsegments. The frequencies of the first sevenspectralpeakswere thenextractedfrom the LPC spectrumfiles. The frequenciesof spectralpeakswere estimatedwith a three-pointparabolicinterpolation, yieldinga finer resolutionthan the 61.5-Hz frequencyquantization. Files containingthe LPC peak data servedas the input to a custominteractiveeditor.The editorallowstheexperimenter Hillenbrand et al.: Acoustic characteristics of vowels 3100 TABLEI. Percentage of utt•ranc.½s showing a formantmergeranywhere in TABLE I1. Average absolute difference between formant frequencies thevowelnucleus. Shownin parentheses are thepercentage of utterances sampledat "steady-state" timesdemrmined by two judges.Figuresin pashowinga formantmergerat "steadystate." rentheses are differences as a percentof averageformantfrequency. Vowel /i/ /i/ /e/ /e! /aed /a/ /g Iol /u/ /u/ /M I•1 F1 -F2 0.0 0.0 0.0 o.o 0.0 2.0 4.0 2.7 o.o 1.3 0.7 0.0 F2-F3 (0.0) (0.0) (0.0) (o.o) (0.0) (2.0) (2.0) (1.3) (o.o) (0.7) (0.7) (0.0) 10.0 0.0 9.3 o.o 3.3 0.7 0.7 0.0 o.o 0.7 0.0 15.3 Men (s.o) (0.0) (4.7) (o.o) (3.3) (0.0) (0.7) (0.0) (o.o) (0.7) (0.0) (11.3) Women Children Overall F1 F2 F3 7.5 (1.3%) 14.2 (0.8%) 18.6 (0.7) 9.2 (1.5%) 20.0 (1.1%) 21.2 (0.7%) 10.7 (1.8%) 18.5 (1.1%) 27.6 (1.0%) 9.2 (1.5%) 17.6 (1.0%) 22.5 (0.8%) F4 20.6 (0.5%) 31.5 (0.8%) 36.4 (0.9%) 29.5 (0.7%) For the presentstudy,formantswere edited only betweenthe startingandendingtimesof the vowel. Contours for F1-F3 weremeasured for all signals,exceptin casesof unresolvable formantmergers.The fourthformantwas measuredonly whena well-definedF4 contourwasclearlyvisible both on the LPC peak displayand the gray-scalespectrogram.The fourthformantwasjudgedto be unmeasurable to reanalyzethe signalwith differentLPC analysisparam- for 15.6% of the utterances. eters and to hand edit the formant tracks. c. Fundamental frequencycontours.FO contourswere extracted with an autocorrelation pitchtracker(Hillenbrand, 1988), followedby handeditingusingthe tool described above.Grosstrackingerrorssuchas pitchhalvingandpitch doublingwere correctedby reanalyzingthe signalwith an optionthatimposesanupperor lowerlimit on thesearchfor Editingandanalysisdecisions werebasedon an examinationof the LPC peak displayoverlaidon a gray-scale spectrogram and,in somecases,on an examinationof individualLPC or Fourierspectralslices.Generalknowledgeof acousticphonetics alsoplayeda role in the editingprocess. For example,editingdecisions werefrequentlyinfluenced by the experimenter's knowledgeof the closeproximityof F2 andF3 for vowelssuchas/i/and/s,/, thecloseproximityof F1 and F2 for vowels suchas/a/and/u/, and so on (see Ladefoged, 1967,for an excellentdiscussion of theinherent circularityin thismethodof estimatingvowelformants,and for otherinsightfulcomments on theformantanalysis). Considerations suchas theseoftenled the experimenter to conclude that a formant mergeroccurred.In thesecases,the LPC spectrawererecomputed with a largernumberof poles until the mergedformantsseparated. Oncethe experimenter was satisfiedwith the analysis, editingcommandscould be usedto hand edit any formant trackingerrorsthatremained.Figure1 showsan exampleof theutterance "heard"spoken by a ten-year-old boy:(a) after the original14-poleanalysis,(b) after reanalysis with 18 poles,and(c) afterhand-editing. (For simplicity,the grayscalespectrogram underlying thepeakdisplayis notshown.) The vertical lines indicatethe beginningand end of the vowelnucleus.Two commands areavailablefor handediting theformantcontours. Onecommandallowstheexperimenter to use the mouseto deletea spuriouspeak, and a second commandallowstheexperimenter to usethe mouseto interpolatethrough"holes"in theformantcontour.For example, in thecenterpanelof Fig. 1, thereis a gapin theF3 contour towardthe end of the vowel. Clicking the mouseon either sideof this gap causesthe programto linearlyinterpolate formantfrequenciesthroughthis gap. It was not uncommon for utterances to show formant mergersthroughoutall or part of the vocalic nucleusthat could not be resolvedusingthesemethods.In thesecases, zeros were written into the higher of the two formant slots showingthe merger(e.g.,F3 was zeroedout in the caseof anF2-F3 merger).TableI showsthefrequency of occurreneeof formantmergersfor eachof the 12 vowels. 3101 d. Acoust. Soc. Am., VoL 97, No. 5, Pt. 1, May 1995 the autocorrelation peak.Any errorsthatremainedwerecorrected using the editing commandsthat were described above. B. Results 1. Measuroment reliability a. Vowel duration. Vowel durations for 10% of the ut- teranceswere remeasured independently by a secondexperimenter. The utterances chosen for remeasurement were drawnat randomfrom the total of 1668 signals,but with approximately equalnumbersof men,women,andchildren. The averagedabsolutedifferencebetweenthe original and remeasured durations was 6.9 ms. This result is in line with reliabilitydatafor voweldurationreportedby Allen (1978) and Smithet al. (1986). b. Steady-statetimes. Steady-statetimes were measuredby two experimenters for all 1668 utterances. The average absolutedifferencebetweenthe two measurements was 21.1 ms, or 7.7% of averagevowel duration.However, moreimportantthanthe time differencebetweenthesetwo measurements is the differencein theformantfrequencypattern at these two samplepoints.These results,shown in TablelI, indicatethatformantfrequencies at the two sample pointstypicallydifferedby roughly1% of averageformant frequency. c. Formantfrequencies.Two methodswereusedto estimatethereliabilityof theformantfrequencymeasurements. The firstmethodinvolveda simplereanalysis of 10% of the utterances usingthe LPC-basedsignalprocessing andediting techniquesdescribedpreviously.The secondmethod involved a reanalysisof 10% of the utterancesusingthe same peakpickingandeditingtechniques but with 128-pointcepstrallysmoothedspectrainsteadof LPC spectra.The primary motivationfor thiscomparison was Di Benedetto's (1989a) Hillenbrandet aL: Acousticcharacteristicsof vowels 3101 TABLE IIL Measurement-remeasutement reliability for formantfrequenciesobtainedfrom a randomlyselected10% of the signals.Valuesaregivenas averageabsoluledifferences andaveragesigneddifferences. Men Abs Women Signed -1.8 Abs Signed Signed 8.1 2.8 26.4 2.8 27.4 2.4 25.2 F3 23. I 1.2 28.2 1.2 34.0 8.9 28.7 2.5 F4 56.2 69.1 15.4 59.0 4.0 reportthat LPC producedcomparable estimates of F2 and F3 but estimatesof F1 thatwere low whencomparedwith smoothed widebandFourierspectra.The analysiscarriedout in the presentstudyconsisted of calculatingFourierspectra over16 ms (256 point)hammingwindowedsegments every 8 ms followed by cepstralsmoothing.Cepstralsmoothing was implementedwith the "smoofi" algorithmfrom Press et al. (1988). The size of the smoothingwindowwas adjustedindividuallyfor eachutteranceto minimize spurious peaks or eliminate formant mergers.In this sense,the degree-of-smoothing parameterperformeda role in the cepstrumanalysiscomparable to thenumberof polesin theLPC analysis.The peakpickingandeditingprocedures described previouslywereusedto extractformantfrequencies from the cepstrallysmoothedspectra. Results for the LPC remeasurement are shown in Table III. The resultsarebasedon a frame-by-framecomparison of thesignals,excludingfrom consideration any framein which eithersignalshoweda mergerin theformantslotbeingcompared.Resultsare given as averageabsolutedifferencesand as signed differences.Overall, the absolute differences rangedfrom about12 to 60 Hz, or between1.0% and2.0% of averageformantfrequency. Table IV comparesformant measurements obtained fromLPC andcepstrallysmoothed spectra.Positivenumbers in the signed-differencecolumns indicate that the LPCderivedformantswere higher in frequencythan thosederivedfrom cepstrallysmoothed spectra.In light of Di Bene- deRo's (1989a) findings,the signed differencesare of particularinterest.Consistent with Di Benedetto'sresults,the signeddifferencesare quite small for formantsaboveF1, especiallyas a percentof formantfrequency.However,unlike Di Benedetto'sfindings,our resultsshowedslightly higher first formantsfrom LPC spectra.This discrepancy might be due to differencesbetweenthe cepstralsmoothing methodusedin thepresentstudyandthe "pseudospectrum" methodusedby Di Benedetto.However, it shouldbe noted that Di Benedetto'sfindingswere basedon analysesof utter- -2.5 Abs 20.7 -3.1 14.2 Signed F2 50.7 -3.3 Abs Overall F1 -3.1 12.2 Children 11.7 -2.6 3.4 two experimenters who madethesejudgments. The averages shownin the lable,andthe datadisplayedin the subsequent figures,are basedon measurements from individualtokens that were well identifiedin the listeningstudy,to be describedin the next section.Specifically,for the purposes of thesecalculations,measurementswere not included from in- dividualtokensthat producedan identificationerror rate of 15% or greater,where"error" simplymeansany instancein which a signalwas identifiedas a vowel otherthan that intendedby the talker.Usingthiscriterion,theaverages in this tableare basedon measurements from 88.5% of the signals. This allows an analysisof measurements for signalsfor which the talkersand listenersare in goodagreementabout the vowel that was spoken.In general,the removalof the more ambiguoussignalshad very little effect on the averages,with the importantexceptionof/•/. As will be discussedin the next section, there were several instancesof attemptsat/a/that werepoorlyidentifiedand,in somecases, consistentlyidentifiedas/o/. a. Vowelduration. The patternof durationaldifferences amongthe vowels is very similar to that observedin connectedspeech.Our vowel durationsfrom/hVd/syllables are two-thirdslongerthanthosemeasured in connected speech by Black(1949), but correlatestrongly(r=0.91) with the connected speechdata.Thereweresignificantdifferences in vowelduration across thethreetalkergroups(F[2,33]=9.04, p<0.001). Newman-Keuls post-hoc analyses showed significantlyshorterdurations for themenwhencompared to either thewomenor thechildren.Longerdurationsfor thechildren were expectedbasedon numerousdevelopmentalstudies (e.g., Smith,1978;Kent and Forner,1980) but the differencesbetweenthe men and the womenwere not expected. We do not have an explanationfor this findingand do not know ff these male-female duration differences would also be seenin conversational speechsamples. b. Fundamental frequency. Figure2 comparesour averagevaluesof fundamentalfrequencywith thoseof PB for ances spoken by just two men and one woman. d. Fundamentalfrequency. Remeasurementof fundamentalfrequencycontoursfor 10% of the utterances showed a frame-by-frameaverageabsolutedifferenceof 1.7 Hz and an averagesigneddifferenceof 0.6 Hz. 2. Measurement results Acousticmeasurements for the/hVd/signals are shown in TableV. Fundamental frequencyandformantvalueswere sampledat the steady-statetimes determinedby one of the 3102 J. Acoust.Soc. Am., Vol. 97, No. 5, Pt. 1, May 1995 TABLE IV. Comparisonof formantfrequencies derivedfrom LPC analysis and cepstralsmoothing.Resultsare given as averageabsolutedifferences andas averagesigneddifferences (LPC-cepstrum). Figuresin parentheses are differencesas a percentof averageformantfrequency. Absolute Signed F1 53.5 (8.9%) 41.5 (6.9%) F2 F3 F4 60.4 (3.5%) 74.0 (2.6%) 91.1 (2.3%) 12.7 (0.7%) 10.9 (0.4%) 7.4 (0.2%) Hillenbrar•det al.: Acousticcharacteristics of vowels 3102 TABLE V. Averagedurations.fundamental frequencies, and formamfrequencies of vowelsproducedby 45 men,48 women,and46 children.Averagesare basedon a subsetof the tokensthatwere well identifiedby listeners(seetext for details).The durationmeasurements are in ms; all othersare in Hz. li/ Dur F0 F1 F2 F3 F4 M /el /el Ivel Io/ /3/ Iol lul lul //• I•/ 263 M 243 192 267 189 278 267 283 265 192 237 188 W 306 237 320 254 332 323 353 326 249 303 226 321 C 297 248 314 235 322 311 319 310 247 278 234 307 M 138 135 129 127 123 123 121 129 133 143 133 130 W 227 224 219 214 215 215 210 217 230 235 218 217 C 246 241 237 230 228 229 225 236 243 249 236 237 M 342 427 476 580 588 768 652 497 469 378 623 474 W 437 483 536 731 669 936 781 555 519 459 753 523 C 452 511 564 749 717 1002 803 597 568 494 749 586 M 2322 2034 2089 1799 1952 1333 997 910 1122 997 1200 1379 W 2761 2365 2530 2058 2349 1551 1136 1035 1225 1105 1426 1588 C 3081 2552 2656 2267 2501 1688 1210 1137 1490 1345 1546 1719 1710 M 3000 2684 2691 2605 2601 2522 2538 2459 2434 2343 2550 W 3372 3053 3047 2979 2972 2815 2824 2828 2827 2735 2933 1929 C 3702 3403 3323 3310 3289 2950 2982 2987 3072 2988 3145 2143 M 3657 3618 3649 3677 3624 3687 3486 3384 3400 3357 3557 3334 W 4352 4334 4319 4294 4290 4299 3923 3927 4052 4115 4092 3914 C 4572 4575 4422 4671 4409 4307 3919 4167 4328 4276 4320 3788 each of the ten vowels common to the two studies. Data pointsthat lie on the solid line in the scatterplot indicate identicalvalues,while data pointsabovethe line indicate higherF0 valuesfor PB. AverageF0 valuesfor the menand the womentypicallydifferedby only a few Hz whencomparedto the corresponding vowelsrecordedby PB. F0 valuesfor our childrenaveraged28 Hz lower thanthe PB data. c. Formantfrequencies. Figure 3 showsthe average frequencies for F1 andF2 for the threetalkergroups,along with ellipsesfit to eachvowel category.Figure4 showsthe individualdatapoints.To improvetheclarityof thedisplay, data from/e/and/o/have been omitted, and the databasehas been thinnedof redundantdata points,resultingin the dis- play of approximately half of the individualpoints.While thereare clearlygrosssimilaritiesto the PB data,thereare numerousdifferencesas well. The degree of crowding amongadjacentvowel categoriesappearsmuchgreaterthan in the PB data, and many of the vowels are in different locationsin F1-F2 spacethanin PB. Figures5-7 showacousticvowel diagramsbasedon averageformantfrequencies fromthepresentstudyandfrom PB for men,women,andchildren,respectively. It is difficult to arriveat a simplesummary of thedifferences thatareseen in thesefigures.However,to the extentthat conventional articulatory interpretations of formantdataare valid, a few generalobservations canbe made.The formantdataseemto imply a generaltendencytowardlower tonguepositionsin 280 260 3OOO 240 220 2200 2OO 180 160 1400 ß Men 140 Women Children 120 I 120 I I I I 140 160 180 200 F0 VALUES FROM 1000 I I I I 220 240 260 280 PRESENT STUDY 450 600 750 900 1050 1200 FIRST FORMANT (Hz) FIG. 2. Scatterplotof F0 valuesfromthepresentstudyandfromPeterson andBarney(1952)for eachof thetenvowelscommonto thetwo studies. Data pointsabovethe solidline indicatehigherF0 valuesfor the Peterson and Barneydata. 3103 d. Acoust.Soc.Am., Vol. 97, No. 5, Pt. 1, May 1995 FIG. 3. AveragevaluesofF1 andF2 for men,women,andchildtalkersfor 12 vowelswith ellipsesfit to the data ("ae"=/ae/, "a"=/a/, "c"=/3/, "^" =/,q, "a"=l:v/). Hillenbrandet al.: Acousticcharacteristics of vowels 3103 i. 3400 i i i i 3000 iii.i.iii i i i i i i 2600 ae ae i a•e 2200 a 1800 a & •aaaaa a 1400 a a c lOOO u 600 300 450 600 750 900 1050 1200 FIRST FORMANT (Hz) FIG.4. Values of F1 andF2 for46 men,48 women, and46 children for 10vowelswithellipses fit to thedata("ae"=/•e/, "a"=/o/, "c"=/3/, "n"=/M, "a"=/aq).Measurements for/e/and/o/havebeenomitted, andthedatahavebeenthinned of redundant datapoints. 300 250 i,. AVERAGE FORMANT VALUES FOR MEN -- i,, AVERAGE FOPd•ANT VALUES FORWOMEN 400 U ,_, 350 t• 500 [.-, 450 •550 7O0 O • t• 650 900 • ?50 850 -....... PRESENT STUDY PETERSON & BARNEY 1000 -- PRESENT ........ PETERSON & BARNEY 2800 2400 2200 2000 1800 1600 1400 1200 1000 800 SECOND FORMANT (Hz) 2400 STUDY 2000 1600 1200 800 SECOND FORMANT (Hz) FIG. 5. Acousticvoweldiagrams showingaverageformantfrequencies for FIG. 6. Acoustic voweldiagrams showing average formantfrequencies for menfromthepresent studyandfromPeterson andBarney("ae"=/•e/, womenfromthepresent studyandfromPeterson andBarney("ae"=/ae/, "a" =/o/, "c" =/3/, "^"=/,q, "a" =/a,/). 3104 d.Acoust. Soc.Am.,Vol.97, No.5, Pt.1, May1995 "a" =/a/, "c" =/3/, "n" =/•/, "a" =/a•/). Hillenbrand et al.:Acoustic characteristics ofvowels 3104 I I I I t I I 3750 300 i.. AVEPAGE FORMANTVALUES FOR CHILDREI• 3450 3150 285O 2550 2250 1950 1100 1200 -- PRESENT STUDY ....... PETERSON & BARNEY I I I 32OO 2800 2400 2000 1650 [• Women / [D Children / • I 1600 • ,8( 1200 1650 8OO 1950 2250 F3 VALUES SECONDFORMANT (Hz) I I I I I 2550 2850 3150 3450 3750 FROM PRESENT STUDY FIG. 7. Acousticvoweldiagramsshowingaverageformantfrequencies for childrenfrom the presentstudyandfrom PetersonandBarney("ae" =/a•/, FIG. 8. Scalterplot of F3 valuesfrom the presentsludyandfrom Peterson "a" =/o/, "c" =/el, "n" =/,•/, "a" =/aV). valuesfor the Petersonand Barneydata. our dataand,for thebackvowels,moreanteriortonguepo- associatedwith these utterances.Figure 9 is basedon the formantpatternsampledat 20% and80% of vowel duration, averagedacrossvowelsproducedby the threegroupsof talkers. F0 was sampledjust once at steadystate.The values havebeenconvertedto a mel scaleusingthe technicalap- sitions.It shouldbe noted,of course,that thesedifferencesin and Barney(1952). Data pointsabovethe solidline indicatehigherF3 the formantpatternscouldbe explainedby differences in lip postureinsteadof---orperhapsin additionto--differencesin tongueposition.The childrenin our studyseemto showa proximation from Fant (1973) and are represented as F1generaltendencytowardcentralizationwhen comparedto F0 vs F3-F2. (Note that the F3-F2 axis has been inverted thePB data.The centralvowels/d and/3•/produced by the a convenadulttalkersarealonein occupying nearlyidenticalpositions to producea displaythat morecloselyresembles tional FI IF2 plot.) This representation is similar to models in the two sets of formant data. proposed by Miller (1989) and Syrdal {1985), except thata The vowelsoccupysimilarrelativepositionsin the two mel scale is used in place of Miller's log scale and Syrda!'s setsof data,with the notableexceptionof/e/and/a:/. Our dataindicatehigherF2 valuesfor/a•/as compared with/e/, and slightlylowerF1 valuesfor/•e/than/e/, althoughthe differencein F1 is not consistentacrosstalker groups. Analysisof data for individualtalkersshowedthat the F2 differencesbetweenthesetwo vowelswere highly consistent:91% of the talkersproducedan/a•/with a higherF2 than their/e/. The F1 differenceswere less consistent,with 68% of the talkersproducingan/a•/with a lowerF1 value thantheir/e/. Our findingsfor thesetwo vowelscontrast not only with PB, but alsowith Di Benedetto's (1989a) results from threeadult talkers,and with a largeTexasInstruments database described by Syrdal(1985).As canbe seenin Figs. 3 and4, thesetwo vowelsshowa very highdegreeof overlapin F1-F2 space.As will beseenbelow,thesevowelsare well identifiedby listeners, andcanbe separated well based on acousticmeasurements only if spectralchangeis taken into account. of PB. Overall,theF3 valuesfrom the two studiesare quite similar,with our measures averaging113 Hz (4.7%) higher for themen,47 Hz (1.7%)higherfor thewomen,and174Hz (5.5%)lowerfor thechildren. The slightlyhigherF3 values differences de- scribedpreviously. d. Spectralchangepatterns. Althoughthe primarypurposeof this studywas to compareour staticformantmeasurements with thoseof PB, a preliminaryanalysiswas conducted of the patterns of formant frequency change 3105 the locationof the secondsamplcof the formantpattern,and a line connectsthis point to the first sample.As the figure 150 250 350 450 75O 85O 95O Figure8 comparesour averagevaluesof F3 with those for PB's children are consistent with the F0 Barkscale. 2Thesymbol identifying thevowelis plotted at J. Acoust.Soc. Am., Vol. 97, No. 5, Pt. 1, May 1995 I I l I_ 400 500 600 7110 FI - F0 (meh) FIG. 9. Spectralchangepatternsassociated with the 12 vowels ("ae" =[•e/, "a"=/o/, "c"=/a/, "n"=/M, "3"=/•-/). The abscissais the difference between reel-transformedvalues of F 1 and F0, and the ordinate is the difference between reel-transformed values of F2 and F3. The F3-F2 axis has been invertedto producea displaythat more closely resemblesa conventionalFI[F2 plot. The symbolidenlifyingthe vowel is plottedat the locationof thesecondsampleof theformantpattern,anda line connects this pointto the first sample.The largestsymbolsare usedfor the men and the smallestsymbolsare usedfor the children. Hillenbrandeta/.: Acousticcharacleristicsof vowels 3105 indicates,nearlyall of the vowelsshowa gooddealof formantfrequencychange.Further,theformantsare movingin sucha way as to enhancethe contrastbetweenvowelswith similarstaticpositionsin formantspace.For example,the /•e/-/e/pair showsa high degreeof overlapwhenthe formantsaresampledat steadystate.As thefigureshows,these two vowelsappearto exhibit distinctspectralchangepatterns.Likewise,/u/and/u/show a high degreeof overlapin staticF1/F2 spacebut appearto showdistinctpatternsof spectralchange.The influenceof spectralchangepatternson the separabilityof vowel categorieswill be examinedin a moresystematicway in the discriminantanalysisstudiesde- ratesfrom the two studiesarequitesimilar,as are the rates for mostof the individualvowels.Our resultsshowslightly pooreridentificationof the/o/-/o/pair, but somewhatbetter identificationof/l/ and of the /ze/-/e/ pair. The relatively highidenfifiabilityof/•e/and/e/is interesting in lightof the poorseparation of thesevowelsbasedon staticmeasures of F1 and F2. Figure 10 showsa histogramof identificationratesfor the individualsignals.The majorityof the signals(65%) were identifiedunanimouslyby the listeners,and 89% were scribed in Sec. III. identified at ratesof 90% or greater. For33 signals(2%) the majorityvote of the listenerswas a vowel other than that intended by thetalker.Mostof these(55%)wereintendedas II. VOWEL /3/but heardas/o/, but therewere also instancessuchas/a•/ heard as/e/,/œ/heard as/•e/,/o/heard as/3/, and/[/heard as IDENTIFICATION A. Methods 1. Listeners Listenersconsistedof 20 undergraduate and graduate studentsin the SpeechandPathologyandAudiologyDepartment at WesternMichigan University,none of whom had participatedas talkers.The choiceof phoneticallytrained listenerswas motivatedby the findingsof Assmannet al. (1982) indicatingthata relativelylargeproportion of identificationerrorsproduced by untrainedlistenersaredueprimarily to the listeners'uncertaintyabouthow to map perceived vowelqualityontoorthographic symbols.All of thelisteners hadtakenan undergraduate coursein phonetics, althoughas a grouptheywouldnot be considered experienced phoneticlans.The dialectscreeningprocedureand subjectselection criteriadescribedfor the subjectswho servedas talkerswere used for the listeners as well. 2. Procedures /e/. It seemslogicalto interpretsignalsconsistently heardas a vowel otherthanthat intendedby the talker as production errors.While the numberof tokensinvolved is a small proportionof the total,by this criterionnearly14% of the attemptsat/3/were production errors.It is clear,then,thatthe dialectscreeningprocedures were ineffectivein somecases; that is, there were some speakerswho passedthe /o/-/3/ dialectscreening but, for reasons thatare not clear,did not producea convincing/3/whenthe/hVd/syllableswere recorded. Overall identificationrateswere highestfor the women talkers and lowest for the children.A repeated-measures analysisof variancefor talker genderusingarcsinetransformed overall identification rates for the 20 listeners was significant[F(2,57)=14.7, p<0.001]. Post-hocanalysis showedsignificantdifferencesamongall threetalkergroups. It is importantto note, however,that the magnitudeof the effect is quitesmall,with only 1.9% separating the highest and lowestidentificationrates.It is interestingthat thereis no evidenceat all that signalswith higherfundamentalfrequencies aremorepoorlyidentified,despitethefact thatformant peaksare more poorly definedin signalswith wide harmonicspacing. Listenerswere testedindividuallyin a quietroomin two sessions lastingapproximately1 h each.Signalswere lowpassfiltered at 7.2 kHz at the outputof a D/A converter (TuckerandDavisDD]), amplified, anddelivered to a single loudspeaker (BostonAcoustics A60) at an averageintensity of 77 dBA at thelistener'shead(approximately 70 cm from III. DISCRIMINANT ANALYSIS theloudspeaker). Overthecourseof thetwosessions listeners identifiedone presentationof each of the 1668 /hVd/ The purposeof the discriminantanalyseswas to detersignals.The signalswere presentedin fully random order mine how well the vowelscouldbe separated basedon vari(i.e., not blocked by talker), and the randomizationwas ous combinations of the acoustic measurements and, where changeddaily. This randomizationmethoddiffersfrom PB, appropriate, to comparetheseresultswith similaranalyses of who testedsubjectsin blocksof trials which presentedlisthePB database. A quadraticdiscriminant analysistechnique teners withrandomly ordered tokens fromtentalkers. 3 (Johnsonand Winchern, 1982) was usedfor classification.In Subjectsrespondedby pressingone of 12 keys on a computerkeyboardthat hadbeenlabeledbothwith the phonetic symbols and the correspondingkey words (e.g., "heed," "hid," "head," etc.). Each listeningtest was precededby a brief practicesessionto ensurethat subjectsunderstoodhow the key labelswere to be interpreted. all cases,the "jackknife"techniquewas usedin whichsta- 3. Results TableVI presentsa summaryof identification ratesfor each vowel category,along with comparabledata from PB. The full confusion matrix is shown in Table VII. The results tisticsfor an individual token are removedfrom the training datapriorto theattemptto classifythetoken(seealsoSyrdal andGopal,1986).Tokensshowinga mergerin a formantslot that was includedin the parameterlist were not includedin the analyses. Table VIII showsclassificationresultsfor staticparameter sets,i.e., measurements sampledonceat steadystate.To facilitatecomparisons with PB,/el and/o/were not included in thesetests.As mighthavebeenexpectedbasedon inspection of Figs. 3 and 4, our vowelsdo not separateas well as are generallyquitesimilarto PB. The overallidentification the PB data based on static measures of F1 and F2. The 3106 d. Acoust.Soc.Am., VoL97, No. 5, Pt. 1, May 1995 Hillenbrandet al.: Acousticcharacteristics of vowels 3106 differences in categoryseparability becomesmalleras more parametersare added,but in all casesthe PB vowels are classified with greateraccuracy thanours. Table IX demonstrates the effectsof includingvowel durationand spectralchangeinformationon classification accuracy.Only data from the presentstudyare included, againomitting/e/and/o/. The one-sample resultsare based on a singlesampleof the formantpatternat steadystate,the two-sampleresultsare basedon samplestakenat 20% and 80% of vowel duration,and the three-sampleresultsare basedon samplestaken at 20%, 50%, and 80% of vowel duration.For the two- and three-sample parametersetsincludingF0, a singlesampleof F0 at steadystatewas used. Resultsare shown for all tokens in the database,and for a subset of the data that excluded individual tokens that showedidentification errorratesof 15% or greater(11.5%of the tokens). It can be seenthat includingvowel durationin the parameterset resultsin a consistentimprovementin performance,especiallyfor the simplestparametersetssuchas single-sample F1-F2. However,the mostdramaticeffectis seenwhencomparing a singlesampleof the formantpattern with two samples.The improvementin classification accuracy averages11.2%,andis especiallylargefor the parametersetsinvolvingF1 andF2 alone.Addinga thirdsample of the formantpatternproduceslittle or no improvementin classification accuracy. This wouldseemto suggest thatonly a very coarserepresentation of the spectralchangepatternis neededfor classification.It can also be seenthat omitting tokenswith relativelyhighidentification errorratesproduces a consistentimprovementin classificationaccuracy.This findingindicatesthattheerrorsproducedby the patternclassifier tend to occurmore often for tokensthat are poorly identifiedby listeners. Althoughnot shownin thetable,thesameclassification testswere conductedusingthe full set of 12 vowels.The overallpatternof resultswas quite similarto that shownin TableIX, exceptthatthe improvement in classification accuracywith two samplesof the formantpatternwassomewhat larger.This is a logicalresultgiven that the two vowelsthat were added,/e/and/o/, are nearly alwaysdiphthongized. If we assume,then, that the acousticmeasurementswere madewith roughlyequalprecisionin the two studies,thenit must be the case that many of thesevowels were simply producedin differentwaysby the two groupsof talkers.As was indicatedpreviously,little is knownaboutthe dialectof thePB talkersexcept:(1) Mostof thewomenwereraisedin themid-Atlanticregion;(2) themenrepresented "... a broad regionalsampling of theUnitedStates..." (Peterson andBarney,1952,p. 177);(3) a "few" of thetalkerslearnedEnglish as a secondlanguage; and (4) "most"of the talkersspoke GeneralAmerican. Perhapsmore importantthan potential differencesin regionaldialect is the passageof some40 yearsin the timesat which the two setsof recordingswere made. It is well known that significantchangesin speech productioncan occurover a periodof severaldecades.For example,Bauer(1985)showedclearevidence of significant aliachronic vowel shiftswhen comparingrecordingsof British RP speakers madein 1949with similarrecordings made in 1966. A secondcomparisonof comparablerecordings made in 1982 showed a continuation of the same vowel shifts.There are almostcertainlysomedifferences between our data and those of PB that can be attributed, at least in part, to aliachronic change.For example,the "raising"of /a•/, with the consequent reductionin contrastin the static formantpositions for/•e/and/•/, hasbeenwell documented in severaldialectsof AmericanEnglish(e.g., Labor et al., It should also be noted that the Texas Instruments data- Theoriginalintentof thisstudywasto collecta database of acousticmeasurements for/hVd/utterancescomparable to PB, but with additionalmeasuresof durationand spectral changethatcouldbe usedto studytherole of dynamicproperties in vowel recognition.The differencesthat were observedbetweenour staticmeasurements of formantpatterns andthoseof PB were not anticipated.One possibleexplanahas to do with our use of LPC as opposed to the moredirectspectrum analysismethodused by PB.Thispossibilitycannotbe eliminatedentirely,particularly sincewe attemptedno directcomparisons of our LPC measurements with measuresobtainedusing PB's spectro- graphictechnique.However,the differencesbetweenour data and those of PB strike us as both too numerous and too diverseto be explainedby differencesin spectrumanalysis methods.Althoughnot entirely conclusive,the comparisons 3107 setsof formant dataintoconvergence. 4 1972). IV. DISCUSSION tion of these differences that were made between formants measured from LPC and cepstrallysmoothedFourierspectraalso make it seemunlikely that the differencescan be attributedto spectrum analysismethods.The closesimilarityin F3 valuesand the nearlyidenticalformantvaluesfor the centralvowelsproducedby the adulttalkersfrom the two studieswould also seemto argueagainstthisinterpretation. It also seemsunlikely that the discrepancies can be attributedto differencesin the timesat whichthe formantpatternswere sampled.Our datashowedthat steady-state times couldbe locatedwith a surprisinglyhigh degreeof reliability. In addition,in datanot reportedhere,we determinedfor threevowels(/œ/,/a•/,and/u/)thateventhemostprocrustean methodof locatingsteadystatetimescouldnotbringthetwo d. Acoust.Soc. Am., Vol. 97, No. 5, Pt. 1, May 1995 basedescribed by Syrdal(1985) showssomeratherlarge differences from PB. The TI values for the front vowels are quitesimilarto PB, but ratherlargedifferences are seenfor the back vowels. The differences from PB are in the same directionas in our data(i.e., implyinglowerandmoreanteriortonguepositions compared to PB), but thediscrepancies are evenlarger.As with the PB study,little is knownabout the dialectof the speakersin the TI study. Thereis one final pointworthnotingaboutthe discrepanciesin formant frequenciesacrossthese three studies. There has been a tendencyto view the PB databaseas a benchmarkof sorts,establishingthe set of formant frequenciesfor AmericanEnglishvowels.For example,the PB mea- surementsare frequently used as control parametersin speechsynthesis studies,andoftenserveto defineprototypes for vowel categories.The PB measurements have also been Hillenbrandet aL: Acousticcharacteristicsof vowels 3107 heavilyusedto evaluatevowelnormalization algorithms and are frequentlyusedin cross-language comparisons andcomparisonsbetweennormalanddisordered speech.The present results,alongwith thoseof Syrdal(1985)andBauer(1985), serveasa reminderthata studyof thiskindcanonly hopeto establisha set of formantfrequencies that are typicalof a specificdialectat a specifictimein thehistoryof thatdialect. In contrast to the numerous differences in acoustic mea- surements betweenour studyandPB, the two listeningstudies producedvery similar results.The overall identification ratesfrom the two studiesare quitesimilar,as are the rates for individualvowels.Althougha detailedanalysisof the relationships betweenthe acousticandperceptualdatafrom the presentstudywill haveto await furtherstudy,it seems quiteclearthat the frequenciesof F1 andF2 at steadystate are not good predictorsof the identificationresults.The clearestexampleis the /a•/-/œ/ pair, which was identified quitewell by listeners despitevery poorseparation in static F1-F2 space. Another indication that static measures of F 1 and F2 are poor predictorsof vowel identificationis the general findingthatthesignificantly increased crowdingof vowelsin staticF1 -F2 spacerelativeto PB wasnot accompanied by an increasein perceptualconfusionsamong vowels. Althoughit can only be guessedat, one possibilitythat might be considered is thatourtalkersproduced moreheavilydiphthongizedvowelsthanPB's talkers.Accordingto thisview, thegreaterdegreeof crowdingin thestaticformantspaceof our talkersmightbe offsetby an increasein spectralchange, resultingin a set of vowelsthat are as distinctas the PB vowels.However,thisspeculation cannotbe confirmedand, in the absenceof the originalPB recordings,it is difficultto go beyondthe generalsuggestion of dialectdifferencesbetweenthe two groupsof talkers. It is importantto note that listenerswere askedto identify utterances by choosinga labelfrom a closedsetof broad phonerotecategories.It shouldnot be concludedthat all utterancesthat were assignedthe same phonemiclabel are phonetically equivalent (seeLadefoged, 1967,for a discus- TABLEVL Overallpercent correctidentification by vowelcategory. forthe presentstudy(HGCW) andfor Peterson andBarney(PB). HGCW PB /i/ 99.6 99.9 /I/ 98.8 92.9 /e/ 98.3 a /t/ 95.1 87.7 /a½/ 94.1 96.5 /o/ /a/ 92.3 82.0 87.0 92.8 /o/ 99.2 a /ul 97.5 96.5 /u/ 97.2 99.2 I,d /3-/ 90.8 99.5 92.2 99.7 Total: 95.4 94.4 Men: 94.6 b Women: 95.6 b Children 93.7 b aThesc vowelswerenotrecorded by Peterson andBarney. bPeterson andBarney didnotreport results separately formen,women, and child talkers. kensthatwerewell identifiedby listeners.It mightbe useful to reanalyzea subsetof the utterancesthat are judged by experienced phoneticians to be goodexamples of thevarious vowel qualities.It seemslikely that within-vowel-category variabilityin the acousticmeasures for a subsetof thiskind would be substantially reduced. One otheraspectof the listeningtestdeservescomment. The largenumberof speakersandfull randomizafion of both talkersandvowelsmakeit very unlikelythatlistenerscould havemadeuseof a vocaltractnormalization process of the kind described by LadefogedandBroadbent(1957) andothers. While there is some evidence that listeners can make use of speaker-specific normalizinginformationin certainkinds of tasks,the relativelyhigh identificationratesobtainedin thepresentstudywouldseemto indicatethataccuratevowel identificationdoes not require calibrationto individual sion).Evena casuallistening by anexperienced phonetictan speakers. showsclearly that there is a range of phoneticqualities The discriminant analysisresultsshowedthatour vowwithinthevowelcategories, evenwhenconsidering onlytoels couldnot be separated well basedon a singlesampleof TABLE VII. Confusionmatrixfor/hVd/utterancesproducedby 45 men,48 women,and46 children. Vowel identifiedby listener Ill /i/ 99.6 hi /el Vowel It/ intended /ae/ by /o/ talker /a/ 1o/ /u/ /u/ 0.6 /# /e/ 0.1 0.1 98.8 0.2 0.3 0.5 0.1 983 It/ /ae/ 3108 /al /o/ /ul lul It,/ In'/ 0.1 0.9 0.2 0.I 0. I 95.1 0.3 3.7 0.2 0.1 0.1 5.6 94.1 0.2 0.1 0.3 92.3 3.5 13.8 82.0 0.1 0.1 0.3 3.7 0.1 J. Acoust.Soc. Am., Vol. 97, No. 5, Pt. 1, May 1995 0.2 1.8 0.2 0.2 0. I 3.3 0.1 3.8 1.3 97.2 1.0 0.4 0.1 0.5 97.5 1.9 0.3 3.2 0.2 90.8 99.2 0.1 IM /a,/ /a/ 0.2 0. I 0.2 0.2 0.I 0.2 99.5 Hillenbrandet aL: Acousticcharacteristicsof vowels 3108 o TABLE VIII. Quadraticdiscrimination resultsfor thepresentdata(HGCW) andfor thePB datasetbasedon a singlesampleof theformantpattern.The tableshowsoverallclassification accuracyusingthe "jackknife"methodin whichmeasurements for individualtokensare removedfrom the training statisticsprior to classification. o Parameter set PERCENT CORRE. C• IDENTIFICATION FOR INDIVIDUAL TOKENS F 1 ,F2 F1,F2,F3 FO,F1 ,F2 FO,FI,F2,F3 HGCW PB 68.2 81.0 78.2 84.7 74.9 83.6 85.9 86.6 ]0 IS 20 2.• ;0 3S 40 45 .•0 ,•_; 60 65 70 75 80 8.; 90 95 100 phoneticspace.For example,severalstudieshave shown very high identificationratesfor "silentcenter"stimuliconsistingof onglidesand offglidesonly (e.g., Jenkinset al., 1983;Nearey,1989;NeareyandAssmann,1986).CompleFIG. 10. Ristogmmof percentcorrectidentification ratesfor individual tokens,where "correct" means that the listener identified the vowel as the mentingtheseresultsareHillenbrandandGayvert's(1993b) one intendedby the talker. findingsshowingthat steady-state vowelssynthesized from thePB measurements arenotwell identifiedby listeners(see theformantpattern,especially F1 andF2 alone.The sameis alsoFairbanksand Grubb,1961).Takentogether,the silent true of the PB data, but to a lesserdegree.Addingvowel centerand steady-state resynthesis resultssuggestthat static durationmeasuresresultedin consistentbut fairly modest spectraltargetsare neithernecessary nor sufficientfor accuimprovements in classification accuracy,and includingtwo rate vowel recognition. samplesof theformantpatternproducedlargeimprovements It is interestingin this regard that the importanceof dynamicinformationin vowel identificationwas recognized in categoryseparability. Thesefindingsare consistent with Zahorianand Jagharghi(1993), who showedmuchbetter by PB, who commented,"It is the presentbelief that the classification ratesfor dynamicversusstaticrepresentations complexacoustical patternsrepresented by thewordsarenot of bothformantsand overallspectralshape.Zahorianand adequatelyrepresented by a single •ction, but requirea lagharghiare alsoin agreement in showingmuchlargerimmore complexportrayal"(Petersonand Barney,1952, p. provementsin classificationaccuracywith the additionof 184).The precisenatureof this "morecomplexportrayal" spectralchangeas comparedto vowelduration.The present remains unclear, however, since we still do not know how results,alongwith thoseof ZahorianandJagharghi, are conlistenersmapspectralchangepatternsontoperceivedvowel sistentwith manyrecentfindingssuggesting that the vowels quality.Our discriminantanalysisfindingsindicatedthat a of AmericanEnglishare more appropriatelyviewed not as fairly coarse,two-samplerepresentation of the formantpatfor accuratevowel classification. pointsin phoneticspacebut ratheras trajectoriesthrough tern is all that is necessary PERCENT CORRECT IDENTI•-•C^TION TABLE IX. Quadraticdiscrimination resultsfor the presentdatashowingtheeffectof includingdurationand spectralchangeinformationon classification accuracy. The tableshowsoverallclassification accuracyusingthe "jacldmife"methodin whichmeasurements for individualtokensareremovedfromthetrainingstatistics prior to classification. The one-sample resultsarebasedon a singlesampleof theformantpatternat "steadystate;" the two-sampleresultsare basedon samplestakenat 20% and80% of vowel duration;the three-sample results arebasedonsamples takenat 20%,50%,and80%of vowelduration. ("NoDor"=voweldurationnotincluded; "Dur"=vowel durationincluded.)Entriesunderthe headingof "All tokens"usedthe full database; entries underthe heading"well identifiedtokensonly" arebasedon a datasetthatdid not includetokenswith error ratesof 15% or greater(11.5% of the tokens). All tokens One sample Two samples Three samples Parameter set NoDur Dar NoDur Dur NoDur Dur F1 ,F2 F1,F2,F3 F0,F1,F2 FO,FI ,F2,F3 68.2 81.0 78.2 84.7 76.1 84.6 82.0 87.8 87.9 91.6 90.3 02.7 92.5 94.1 87.7 91.8 90.4 93.1 91.0 92.8 92.6 94.8 One sample 91.6 93.6 Well identifiedtokensonly Two samples Three samples Parameter set NoDur Dur NoDur Dur NoDur Dur FI .F2 F1 ,F2,F3 FO,F1 ,F2 F0,F1 ,F2,F3 71.4 85.3 80.0 89.1 90.8 95.4 93.6 96.2 90.7 95.3 93.3 95.8 82.3 88.7 86.3 91.6 95.5 97.3 96.3 97.8 94.8 96.6 96.0 97.3 3109 d. Acoust.Soc.Am., Vol. 97, No. 5, Pt. 1, May 1995 Hillenbrandet al.: Acousticcharacteristics of vowels 3109 However,thatdoesnot imply thatthe detailsof the formant changepatternare unimportantto the listener.Additional studiesusingsynthesismethodsare neededto learn more aboutthe specificmappingrelationsthat are involvedin vowel recognition.Also neededare studiesof spectral changepatternsin more complex phoneticenvironments than the/hVd/utterancesexaminedhere (e.g., Stevensand House, 1963). While it seemscertain that the associations which we observedbetweenvowel categoriesand spectral changepatternswill be lessstraightforward in morecomplex phoneticenvironments,the extentof the oversimplification in our data is as yet unknown. Bennett,D.C. (1968}. "Spectralform anddurationas cuesin the recognition of Englishand Germanvowels,"Lang. Speech11, 65-85. Black,J. W. (1949)."Naturalfrequency, duration, andintensity of vowelsin reading,"1. SpeechHear.Disor& 14, 216-221. Di Benedetto, M-G. (1989a)."Vowelrepresentation: Someobservations on temporalandspectralproperties of thefirstformantfrequency," J. Acoust. Soc. Am. 86, 55-66. Di Benedetto, M-G. (1989b)."Frequency andtimevariations of the first formant:Properties relevantto theperception of vowelheight,"J. Acoust. Soc. Am. 86, 67-77. Fairbanks,G., and Gmbb,P. (1961). "A psychophysical investigation of vowelformants," J. SpeechHear.Res.4, 203-219. Fant,G. (1973).SpeechSoundsandFeatures(MIT, Cambridge, MA). Hiilenbrand, J. (1988). "MPITCH: An autocorrelation fundamental- frequency tracker,"[Computer Program], Western MichiganUniversity, Kalamazoo, MI. ACKNOWLEDGMENTS Hillenbrand,J., andGayvert,R. T. (1993a). "Vowel classification basedon fundamental frequency andformantfrequencies," J. SpeechHear.Res.36, 647-700. We are very grateful to Michelle Malta, who donated manyhoursto thisprojectandsetthe tonefor dedicationand attentionto detail. The able assistance of Emily Well and Matt Phillipsis alsogratefullyacknowledged. We wouldalso like to thankTerry Nearey for helpful adviceprovidedat variouspointsalongtheway, andJim Flegefor commentson a previousdraft. This work was supportedby a research grantfromtheNationalInstitutes of Health(NIDCD 1-R01De01661) andby theAir ForceSystemsCommand,Rome Air DevelopmentCenter,GriffissAir ForceBase,andtheAir Force Office of Scientific Research(ContractNo. F30602- Hillenbrand,J., and Gayreft, R. T. (1993b). "Identificationof steady-state vowels synthesizedfrom the Petersonand Barney measurements," J. Acoust. Soc. Am. 94, 668-674. Jenkins,J. J., Strange,W., and Edman,T. R. (1983). "Identificationof vowelsin 'vowelless'syllables,"Percept.Psychophys. 34, 441-450. Johnson, R. A., andWinchera,D. W. (1982).AppliedMultivariateStatistical Analysis(Prentice-Hall, Englewood Cliffs,NJ). Kent, R. D., and Fomer,L. L. (1980). "Speechsegmentdurationsin sentencerecitations by childrenandadults,"J. Phon.8, 157-168. Labor, W., Yaeger,M., and Steiner,R. (1972}. "A quantitative studyof soundchangein progress,"Reporton NationalScienceFoundation Contract NSF-FS-3287. Ladefoged, P. (1967).ThreeAreasof Experbnental Phonetics (OxfordU.P., London). 85-C-0008). Ladefoged,P., and Broadbent,D. E. (1957). "Informationconveyedby •Theissueof whether to screen talkers' productions based ontheexperi- Lippmann,R. P. (1989). "Reviewof neuralnetworksfor speechrecognition," NeuralComputation 1, 1-38. Miller,J. D. (1989)."Auditory-perceptual interpretation of thevowel,"1. vowels," J. Acoust. Soc. Am. 29, 98-104. menter'sphoneticjudgmentsgenerateda great deal of discussionas the projectwas beingplanned.We ultimatelysettledon the methoddescribed above,in which we trainedthe subjectsas carefullyas possiblebut did not interveneoncethe taperecorderwasstarted,exceptin casesof dysfluency or obviousreadingerrors.The reasonthatwe chosethisapproachis that, alongwith PB, we wantedto determinehow accuratelylistenersidentified naturallyproducedvowels,where "accurate"is definedas agreementbetween the listenerand the intent of the speaker.If stimuli are screened basedon theexperimenter's judgmentsof correctproduction, the question changesfroma two-wayagreement betweenthe talkerandthelistenerto a Ihree-wayagreement amongthetalker,thelistener,andtheexperimenter. It is notknownwhetherstimuliwerescreened for correctpronunciation when the PB recordingswere made. "Thereelscale waschosen overtheBarkandlogtmnsfurms because this transformation did a betterjob of groupingvowelsproducedby men, women, and children.This representation is similar to a normalization Acoust. Soc. Am. 85, 2114-2134. Nearey,T. M. (1978).PhoneticFeatureSystems for Vowels(IndianaUniversityLinguisticsClub, Bloomington,IN). Nearey,T. M. (1989). "Static,dynamic,andrelationalproperties in vowel perception," J. Acoust.Soc.Am. 8.5,2088-2113. Nearey,T. M. (1992). "Applications of generalized linearmodelingto voweldata,"in Proceedings ICSLP 92, editedby J. Ohala,T. Nearey,B. Derwing,M. Hodge,and G. Wiebe (Universityof Alberta,Edmonton, AB), pp.583-586. Nearey,T. M., andAssman, P.(1986)."Modelingtheroleof vowelinherent spectralchangein vowel identification,"J. Acoust.Soc.Am. 80, 12971308. Nearey,T M., Hogan,J., andRozsypal,A. (1979). "Speechsignals,cues and features,"in Perspectivesin ExperimentalLinguistics,editedby G. Prideaux(Benjamin, Amsterdam). scheme proposed by Peterson (1951). -Youradditional listeners weretested bothonthe12-vowel, full randomiza- Peterson,G. E. (1951). "The phoneticvalue of vowels," Language27, tion taskdescribedhereandon a ten-vowel,blocked-by-speaker taskcomparableto PB. Resultsfor the ten vowelscommonto the two taskswere very similar. 4Agraduate student wasgivena copyof thePBdataandasked tochoose a steady-state locationin sucha way that the valuesof F1 andF2 would be closestto the averagesfrom PB, even if thai locationmade no sensein relationto the objectiveof findingthe steadiestportionof the vowel. Differences between our data and those of Pl] remained even when this procedure was used. Ainsworth, W. A. (1972)."Durationasa cuein therecognition of synthetic vowels," J. Acoust. Soc. Am. 51, 648-651. Allen, G. D. (1978). "Vowel durationmeasurement: A reliabilitystudy,"J. Acoust. Soc. Am. 63, 1176-1185. Assmann,P., Nearey,T. E., and Hogan,J. (1982). "Vowel identification: Orthographic,perceptual,and acousticfactors,"J. Acoust.Soc.Am. 71, 975 -989. Bauer,L. {1985)."Tracingphonetic changein thereceivedpronunciation of BritishEnglish,"J. Phon. 13, 61-81. 3110 J. Acoust. Soc. Am., VoL 97, No. 5, Pt. 1, May 1995 541-553. Peterson, G. E., andBarney,H. L. (1952). "Controlmethods usedin a study of the vowels," J. Acoust. Soc. Am. 24, 175-184. Peterson, G. E., andLehiste,L (1960)."Durationof syllablenucleiin English,"J. Acoust.Soc.Am. 32,693-703. Press,W. H., Flannery,B. P., Teukolsky,S. A., and Vettefiing,M. W. T. (1988). "NumericalRecipesin C," (CambridgeU.P.,Cambridge,MA). Smith,B_ L. (1978). "Temporalaspectsof Englishspeechproduction: A developmentalperspective,"J. Phon. 6, 37-67. Smith,B. L., Hillenbrand,J., andIngrisano,D. R. (1986). "Comparison of temporalmeasuresof speechusing spectrograms and digital oscillograms,"J. SpeechHear.Res.29, 270-274. Stevens,K. N. 0959). "The role of duration in vowel identification,"Q. Progr.Rep. 52, ResearchLaboratoryof Electronics,MIT. Stevens,K. N., and House,A. S. (1963). "Perturbationof vowel articulationsby consonantal context:An acoustical study,"J. SpeechHear.Res.6, 111-128. Strange, W. (1989)."Dynamicspecification of coarticulated vowelsspoken in sentencecontext," J. Acoust. Soc. Am. 85, 2135-2153. Strange,W., Jenkins,J. J., and Johnson,T. L. (1983}. "Dynamic specifica- Hillenbrandet aL: Acousticcharacteristicsof vowels 3110 tion of coarticulatedvowels," I. Acoust. Soc.Am. 74, 695-705. Syrdal,A. K. (1985)."Aspects of a modelof theauditoryrepresentation of AmericanEnglishvowels,"SpeechCommun.4, 121-135. Syrdai,A. K., and Gopal,H. S. (1986). "A perceptual modelof vowel recognitionbasedon the auditoryrepresentatioa of AmericanEnglish vowels," J. Acoust. Soc. Am. 79, 1086-1100. TiffanyW. (1953)."Vowelrecognition asa function of duration, frequency modulation andphoneticcoutext,"J. SpeechHear.Disord.18, 289-301. 3111 J. Acoust.Soc. Am., VoL 97, No. 5, Pt. 1, May 1995 Watrous,R. L. (1991). "Currentstatusof the Peterson-Barneyvowel formantdata," I. Acoust.Soc.Am. 89, 2459-2460. Whalen,D. H. (1989). "Vowelandcoasonant judgmeats are aot independentwhencuedby thesameinformation," Percept. Psychophys. 46, 284292. Zahorian,S. A., andJagharghi, A. J. (1993). "Spectralshapeversusformants as acousticcorrelatesfor vowels," J. Acoust. Soc. Am. 94, 1966 1982. Hillenbrandet aL: Acousticcharacteristics of vowels 3111