# SPeech Across Dialects of English (SPADE) #
SPADE project website: https://spade.glasgow.ac.uk/
The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora covering dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. Most of these corpora were shared by a Data Guardian, with the exception of five corpora licensed to McGill (Buckeye, ICE-Canada, Santa Barbara, Switchboard, and TIMIT). For an outline of the corpus collation, ethical issues, data management and processing, see [Sonderegger et al. (fc 2021)][1].
This OSF Project comprises the dissemination component of SPADE. In this first release, we make available acoustic measures for sibilants, and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), anonymised where required, together with information about dataset generation.
We worked with the SPADE corpora thanks to the generosity of the Data Guardians who together comprise [The SPADE Consortium][2]. Each SPADE corpus has its own wiki page, with a brief summary of its dimensions, provenance, Data Guardian, reference and conditions of use. Some larger corpora, such as the AudioBNC or the English Dialects App, were split into subcorpora for the purposes of SPADE analysis. The 'speaker dimensions' information lists the metadata available for speakers in that corpus, such as age, social class, ethnicity and location. Typically each dimension corresponds to a column in the measures datasets for that corpus.
## Measures Datasets ##
Here we deposit measures datasets for each of the shared SPADE corpora, plus Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Each measure type has a wiki page briefly describing the measures and the contents of each dataset. All scripts used to generate the measures are posted in the [Dataset Generation][3] Files section.
#### [Vowel Durations][4] ####
#### [Static Formants][5] ####
#### [Sibilant Measures][6] ####
#### [TextGrids][7] ####
Note that the datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. For example, data processing done for one case study, the analysis of sibilant measures for six corpora reported in [Stuart-Smith et al. 2019][8], is described in an [OSF project here][9], which shows some issues to be dealt with for ‘real’ analyses.
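As a rough illustration of the kind of cleaning meant here, the sketch below drops tokens whose value lies far from a speaker's median. It is not the procedure used in any SPADE analysis: the column names `speaker` and `duration`, the function name `drop_outliers`, and the threshold of three scaled median absolute deviations are all assumptions chosen for the example, and the rows are invented. Check the wiki page for each measure type for the actual column names.

```python
import statistics
from collections import defaultdict


def drop_outliers(rows, value_key="duration", group_key="speaker", max_dev=3.0):
    """Keep rows whose value lies within max_dev scaled MADs of the group median.

    `value_key` and `group_key` are hypothetical column names.
    """
    # Collect the numeric values for each group (here: each speaker).
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(float(row[value_key]))

    # Compute lower and upper limits per group.
    limits = {}
    for group, values in groups.items():
        median = statistics.median(values)
        # 1.4826 scales the MAD so that it is comparable to a standard
        # deviation when the data are normally distributed.
        mad = 1.4826 * statistics.median([abs(v - median) for v in values])
        limits[group] = (median - max_dev * mad, median + max_dev * mad)

    return [
        row
        for row in rows
        if limits[row[group_key]][0] <= float(row[value_key]) <= limits[row[group_key]][1]
    ]


# Invented rows standing in for a measures dataset read with csv.DictReader.
rows = [
    {"speaker": "s1", "duration": "0.10"},
    {"speaker": "s1", "duration": "0.11"},
    {"speaker": "s1", "duration": "0.12"},
    {"speaker": "s1", "duration": "0.09"},
    {"speaker": "s1", "duration": "2.50"},  # implausibly long token
    {"speaker": "s2", "duration": "0.20"},
    {"speaker": "s2", "duration": "0.22"},
]

cleaned = drop_outliers(rows)
```

Grouping by speaker keeps the rule from penalising speakers who are simply slower or faster overall. Real analyses will usually need further, measure-specific checks.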
Ethically non-invasive speech research is a key component of the SPADE project. We used whitelisting to anonymise measures datasets generated from non-public, restricted corpora. Information about how whitelisting was carried out is available [here][10].
## Conditions of Use ##
### License ###
The materials in this open-access OSF project are distributed under the terms of the [Creative Commons Attribution-NonCommercial 4.0 International License][11], which permits you to share — copy and redistribute the material in any medium or format, and/or adapt — remix, transform, and build upon the material, provided that you do not use the material for commercial purposes, and that you give appropriate credit, provide a link to the license, and indicate if changes were made.
### Citation ###
When using any of the datasets deposited here for whatever purpose (e.g. teaching/course materials, projects/dissertations, presentations, publications, etc.), cite the DOI for this OSF project **and** the corpus-specific reference for each corpus whose data has been used. Corpus references can be found on the individual corpus wiki pages.
**IMPORTANT NOTE:** The individual corpus wiki pages also state any corpus-specific conditions of use, which must be observed.
## Project Team ##
SPADE is a research collaboration across five British and North American institutions (University of Glasgow, McGill University, North Carolina State University, University of Edinburgh, University of Oregon). Details of the Project Team can be found [here][12]. Jane, Morgan and Jeff particularly thank Vanna Willerton, Rachel Macdonald and James Tanner, whose work in collating and constructing this OSF project has been substantial, and Michael McAuliffe, who led development of the software used in SPADE.
## Project Funding ##
SPADE was funded by the Transatlantic Platform (T-AP) Digging into Data Challenge via contributions from the ESRC, UK: ES/R003963/1, NSERC/CRSNG, Canada: RGPDD 501771-16, SSHRC/CRSH, Canada: 869-2016-0006, and the NSF, USA: SMA-1730479.
## Copyright ##
© 2020 J. Stuart-Smith, M. Sonderegger, and J. Mielke.
## Disclaimer ##
The Universities of Glasgow, McGill and North Carolina make no warranty whatsoever in relation to these materials including as to accuracy, quality or fitness for any particular purpose, and accept no liability in relation to the use of the materials or anything associated with such use.
## References ##
SPeech Across Dialects of English (SPADE): Large-scale digital analysis of a spoken language across space and time (2017-2020). ESRC Grant ES/R003963/1, NSERC/CRSNG Grant RGPDD 501771-16, SSHRC/CRSH Grant 869-2016-0006, NSF Grant SMA-1730479. (Digging into Data/Trans-Atlantic Platform).
Sonderegger, M., Stuart-Smith, J., McAuliffe, M., Macdonald, R. and Kendall, T. (fc 2021). Managing data for integrated speech corpus analysis in *SPeech Across Dialects of English* (SPADE). In Andrea Berez-Kroeker, Bradley McDonnell, Eve Koller, and Lauren Collister (eds), *Open Handbook of Linguistic Data Management*. MIT Press Open.
Stuart-Smith, J., Sonderegger, M., Macdonald, R., Mielke, J., McAuliffe, M. and Thomas, E. (2019). Large-scale Acoustic Analysis of Dialectal and Social Factors in English /s/-retraction. In: International Congress of Phonetic Sciences (ICPhS 2019), Melbourne, Australia, 5-9 Aug 2019, pp. 1273-1277.
[1]: https://spade.glasgow.ac.uk/wp-content/uploads/2019/11/sondereggerEtAl_handbook_2020.pdf
[2]: https://spade.glasgow.ac.uk/the-spade-consortium/
[3]: https://osf.io/ja94t/
[4]: https://osf.io/4jfrm/wiki/Duration%20Datasets/
[5]: https://osf.io/4jfrm/wiki/Static%20Formant%20Datasets/
[6]: https://osf.io/4jfrm/wiki/Sibilant%20Measures%20Datasets/
[7]: https://osf.io/4jfrm/wiki/TextGrids/
[8]: http://eprints.gla.ac.uk/183726/
[9]: https://osf.io/bknrg/
[10]: https://osf.io/4jfrm/wiki/Whitelisting/
[11]: http://creativecommons.org/licenses/by-nc/4.0/
[12]: https://spade.glasgow.ac.uk/project-team/