# Audio edition of the Spoken British National Corpus #

[AudioBNC][1] is a corpus of spoken British English, produced by forced alignment of most of the spoken portions of the British National Corpus (BNC). It provides a substantial, freely accessible corpus of sound files and Praat TextGrid files.

The version of the AudioBNC shared by SPADE is a subset, designed to use the parts of the corpus where the alignment was most reliable. To define this subset, utterances were kept if they met the following criteria: the utterance was longer than one second, the utterance contained at least two words, the mean harmonics-to-noise ratio of the recording was at least 5.6, and the mean difference in segmental boundaries between the original alignment and a re-alignment with the Montreal Forced Aligner was at most 30 milliseconds.

**Number of Speakers**: The original AudioBNC has recordings from 124 speakers who carried recorders, resulting in recordings with over 1000 speakers in total. There are also speakers in the 755 ‘context-governed’ recordings, for which we do not have an exact count. Each region-specific dataset from the AudioBNC contains the data from speakers of that region (based on the [AudioBNC coding scheme][2]), as well as speakers with _no_ defined regional coding. *Speakers with no dialectal information should be excluded prior to analysis*. **For the purposes of the SPADE project, we estimate that the derived datasets contain information from around 1800 speakers, of whom around 860 are assigned a dialect code.**

**Hours of Speech**: about 700, plus much more in the 755 ‘context-governed’ recordings. **We estimate the amount of speech analysed within SPADE to be around 700 hours**, though this is likely to be an underestimate.

**Year Recorded**: corpus collected between 1991 and 1994; most recordings are from 1992-1993.

**Data Guardian**: public, copyright the University of Oxford.

**Speaker Dimensions**: for ‘demographic’ recordings: age group, social class, sex; for ‘context-governed’ recordings: region.

The corpus has been split for our purposes by speaker dialect. Derived Measures Datasets exist for each region and can be found in the subcomponents. The regions are: Ireland, Scotland, Wales, Northern England, and Southern England. Because many AudioBNC recordings were made with low-pass filtering, sibilant measures were not taken for these corpora.

### Corpus Reference ###

John Coleman, Ladan Baghai-Ravary, John Pybus, and Sergio Grau (2012) Audio BNC: the audio edition of the Spoken British National Corpus. Phonetics Laboratory, University of Oxford. http://www.phon.ox.ac.uk/AudioBNC

[1]: http://www.phon.ox.ac.uk/AudioBNC
[2]: http://www.natcorp.ox.ac.uk/docs/URG.xml
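
### Illustrative Sketch: Subset Criteria ###

The selection criteria in the corpus description, together with the advice to exclude speakers who have no dialectal information, can be expressed as a simple filter. The Python sketch below is only an illustration and is not the SPADE project's own code. The `Utterance` record and all of its field names are hypothetical, the four criteria are assumed to apply jointly, and the harmonics-to-noise threshold is used as the bare number 5.6 because the description does not state a unit.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds as stated in the corpus description.
MIN_DURATION_S = 1.0              # utterance must be longer than one second
MIN_WORDS = 2                     # at least two words
MIN_MEAN_HNR = 5.6                # mean harmonics-to-noise ratio (unit not stated)
MAX_MEAN_BOUNDARY_DIFF_S = 0.030  # at most 30 ms mean boundary difference


@dataclass
class Utterance:
    """Hypothetical record; the field names are not the SPADE schema."""
    speaker: str
    dialect: Optional[str]        # None or "" when no regional coding exists
    duration_s: float
    n_words: int
    recording_mean_hnr: float     # mean HNR of the recording the utterance is from
    mean_boundary_diff_s: float   # vs. a Montreal Forced Aligner re-alignment


def meets_alignment_criteria(u: Utterance) -> bool:
    """True if the utterance passes all four criteria (assumed to be joint)."""
    return (
        u.duration_s > MIN_DURATION_S
        and u.n_words >= MIN_WORDS
        and u.recording_mean_hnr >= MIN_MEAN_HNR
        and u.mean_boundary_diff_s <= MAX_MEAN_BOUNDARY_DIFF_S
    )


def has_dialect(u: Utterance) -> bool:
    """True if the speaker has a non-empty dialect code."""
    return u.dialect is not None and u.dialect.strip() != ""


def select_for_analysis(utterances: List[Utterance]) -> List[Utterance]:
    """Keep utterances that pass the criteria and come from dialect-coded speakers."""
    return [u for u in utterances if meets_alignment_criteria(u) and has_dialect(u)]
```

Note that the derived datasets shared by SPADE already reflect the alignment criteria, so in practice only the dialect check (`has_dialect` here) would need to be applied before analysis.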