Word prominence is a crucial component of how speakers of English communicate the information structure of an utterance, but little research has addressed how native listeners interpret non-natives’ production of prominence. This study uses Rapid Prosody Transcription [RPT] (Cole et al. 2010) to test whether native American English-speaking (L1) listeners respond to the same cues for prominence when listening to non-native (L2) speakers of English as when they listen to other native speakers.
Listeners followed a printed orthographic transcript as they listened to recordings of the Rainbow Passage (Fairbanks 1960) and were asked to underline all words that they perceived as prominent or “highlighted.” Twelve L1 listeners responded to recordings of 12 different L1 speakers, and another 12 listeners heard recordings of 12 L2 speakers whose native language is Latin American Spanish. Cole et al. (2017) found that prominence marking in RPT is reliable with 11 or more listeners. Speakers’ errors were transcribed as produced, so each transcript was slightly different. A word’s pScore is the number of listeners who marked it as prominent. Listener reliability was reported in Smith & Edmunds (2013); the kappa values indicate substantial listener agreement, although, surprisingly, agreement was higher for L2 speakers than for L1 speakers. Here we present results for 6 L1 and 6 L2 speakers. Each word was analyzed individually; function words were excluded because they are known to pattern differently from content words.
Hypothesis. Adopting Cole et al.’s (2010) contrast between “expectation-based” and acoustic factors, we hypothesize that expectation-based factors will have a stronger effect on the perception of prominence when listeners hear non-native speakers than when they hear native speakers, because non-native speakers are likely to produce acoustic cues differently from native English speakers, making those cues less informative for listeners.
Analysis. TextGrids for the recordings were prepared in Praat (Boersma & Weenink 2017) in order to extract duration and F0 values. The Rainbow Passage was synthesized using a text-to-speech system to obtain a version with standardized prosody, and the duration of each word was measured in this recording. The total duration of speech (excluding pauses over 150 ms) was calculated for each recording and for the synthesized version. For each speaker, the ratio between the total duration of speech in their recording and the duration of speech in the synthesis was calculated, to estimate how far their speaking rate diverged from that of the synthesis. The duration of each word in the synthesis was then multiplied by each speaker’s ratio to give the “predicted” duration of that word for that speaker, and this value was subtracted from the actual duration of the word in that speaker’s recording. The second acoustic factor was the maximum F0 in each word, normalized with respect to the speaker’s mean and standard deviation over the entire recording. Both maximum F0 and duration have been found to be significant predictors of prominence in RPT studies (Cole et al. under review). Each word token was assigned a “repetition number” identifying the number of times the word had occurred so far in the recording. Additional factors were the log frequency of the word in the Corpus of Contemporary American English (Davies 2008), its part of speech, and the proportion of listeners (from a different RPT study) who had identified the word as being followed by a boundary (bScore). The dependent variable was the pScore for each word. A zero-inflated Poisson regression (zeroinfl in the R package pscl) was used to assess the significance of the six predictor factors: duration, maximum F0, repetition number, log word frequency, part of speech, and bScore.
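The duration and F0 normalization and the repetition count described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the study: the function names, the list-based inputs, the reading of F0 normalization as a z-score, the numbering of a word's first occurrence as 1, and case-insensitive word matching are all assumptions made for the example.

```python
from collections import Counter


def duration_residuals(word_durs, synth_word_durs, total_speech, synth_total_speech):
    """Actual minus predicted duration for each word of one speaker.

    word_durs / synth_word_durs: per-word durations (s), aligned word by word.
    total_speech / synth_total_speech: total speech time (s) in the speaker's
    recording and in the synthesis, assumed to already exclude long pauses.
    """
    # Ratio estimating how much slower or faster the speaker is than the synthesis.
    ratio = total_speech / synth_total_speech
    # Predicted duration = synthesized duration scaled by the speaker's ratio.
    return [actual - synth * ratio for actual, synth in zip(word_durs, synth_word_durs)]


def normalize_f0_max(word_f0_max, speaker_mean, speaker_sd):
    """Maximum F0 of a word relative to the speaker's mean and SD.

    Assumption: the normalization is a z-score; the source only says
    "normalized with respect to the speaker's mean and standard deviation".
    """
    if speaker_sd <= 0:
        raise ValueError("speaker_sd must be positive")
    return (word_f0_max - speaker_mean) / speaker_sd


def repetition_numbers(words):
    """Running occurrence count for each word token in a recording.

    Assumptions: the first occurrence is numbered 1 and matching ignores case.
    """
    seen = Counter()
    numbers = []
    for word in words:
        key = word.lower()
        seen[key] += 1
        numbers.append(seen[key])
    return numbers
```

For example, with a speech-duration ratio of 1.2, a word lasting 0.20 s in the synthesis has a predicted duration of 0.24 s, so a 0.30 s token of that word is 0.06 s longer than predicted.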
Results. For responses to the L1 speakers, the regression showed effects of part of speech and word frequency at p<.001 and of duration and repetition number at p<.05; F0 and bScore were not significant. For responses to the L2 speakers, part of speech and word frequency were again significant at p<.001, and bScore at p<.01; neither of the acoustic measures was significant, nor was repetition number. These results partially support the hypothesis in that duration was a significant predictor for L1 speakers only; Figure 1 shows that this effect was modulated by part of speech. Further analysis is ongoing. It appears that listeners were able to use part-of-speech and frequency information regardless of who was speaking, but could use acoustic information only from other L1 speakers, whose productions are consistent with their expectations.
References
Boersma, P. & Weenink, D. (2017). PRAAT: Doing phonetics by computer [Computer program]. Version 6.0.27.
Cole, J., Mo, Y. & Hasegawa-Johnson, M. (2010). Signal-based and expectation-based factors in the perception of prosodic prominence. Laboratory Phonology, 1(2), 425-452.
Cole, J., Mahrt, T. & Roy, J. (2017). Crowd-sourcing prosodic annotation. Computer Speech & Language, 45, 300-325.
Cole, J., Hualde, J.I., Smith, C.L., Eager, C., Mahrt, T. & Napoleão de Souza, R. (under review). Acoustic, informational and structural bases of prominence perception.
Davies, M. (2008-2018). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. https://corpus.byu.edu/coca/.
Fairbanks, G. (1960). Voice and Articulation Drillbook. Harper and Row.
Smith, C. & Edmunds, P. (2013). Native English listeners’ perceptions of prosody in L1 and L2 reading. Proceedings of Interspeech 2013, # IS130507, pp. 235-238.
Text-to-speech synthesis used Natural Reader’s Sharon voice https://www.naturalreaders.com/