Dataset Generation

## Dataset Generation

The measures datasets were generated from the SPADE corpora using [software][1] built as part of the SPADE project: the Integrated Speech Corpus ANalysis (ISCAN) package, which builds on the earlier PolyglotDB package. These packages enable integrated speech corpus analysis using Python scripts or a GUI. The software was developed from 2016 to 2019 by a team at McGill, led by [Michael McAuliffe][2]. The [ISCAN][3] software is available on GitHub, with [full documentation][4]. A paper by [McAuliffe et al. (2019)][5] gives a high-level description.

This subproject (Dataset Generation) contains the Python scripts used to generate the measures datasets with ISCAN, one for each acoustic measure listed below, as well as a Praat script and some auxiliary R scripts.

Brief descriptions of the measures datasets, together with a summary of all column labels for each acoustic measure, are given here:

- [vowel durations][6]
- [static formants][7]
- [sibilant measures][8]

The ISCAN, Praat and R scripts used to take the measures are posted in this subproject (`sibilants.py`, etc.). Specific information about the code used in the ISCAN scripts can be found in the [documentation here][9].

This subproject also contains the "whitelist" of words (`whitelist_spade.csv`) used to anonymise derived measures datasets generated from restricted corpora before posting, as detailed [here][10]. For example, the sibilants dataset for the Canadian Prairies corpus, `spade-Canadian-Prairies_sibilants_whitelisted.csv`, was generated by applying the script `sibilants.py` to the [Canadian Prairies Corpus][11], then removing all rows corresponding to words not in `whitelist_spade.csv`.

IMPORTANT NOTE: These scripts use *automatic* acoustic analysis procedures, which take as input the word- and phone-level segmentation for each corpus. For all corpora bar Buckeye and TIMIT, this segmentation is itself automatic, from forced alignment.
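
The whitelisting step described above (removing all rows whose word is not in `whitelist_spade.csv`) can be illustrated with a minimal sketch. This is not the SPADE team's actual code: the column name `word`, the column `cog`, the helper names `load_whitelist` and `filter_to_whitelist`, and the toy rows are all assumptions made for illustration, so check the column summaries linked above for the real labels.

```python
import csv
import io


def load_whitelist(handle, word_column="word"):
    """Read the set of permitted words from an open CSV handle.

    The column name "word" is an assumption, not the documented label.
    """
    return {row[word_column] for row in csv.DictReader(handle)}


def filter_to_whitelist(measures_handle, out_handle, allowed, word_column="word"):
    """Copy only rows whose word is in `allowed`; return the number of rows kept."""
    reader = csv.DictReader(measures_handle)
    writer = csv.DictWriter(
        out_handle, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[word_column] in allowed:
            writer.writerow(row)
            kept += 1
    return kept


# Invented in-memory stand-ins for whitelist_spade.csv and a raw measures file.
whitelist_csv = io.StringIO("word\nsea\nshe\n")
measures_csv = io.StringIO(
    "speaker,word,cog\n"
    "s01,sea,6500\n"
    "s01,smithson,6100\n"
    "s02,she,4200\n"
)
output_csv = io.StringIO()

allowed_words = load_whitelist(whitelist_csv)
n_kept = filter_to_whitelist(measures_csv, output_csv, allowed_words)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`. Whether matching should ignore case or whitespace depends on how the whitelist was compiled, which this sketch does not attempt to reproduce.
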
The resulting derived measures datasets posted here give the raw, automatically generated measures, and users may decide on their own procedures for data cleaning (e.g. outlier removal) before analysis. All analyses published so far by the SPADE team use some data cleaning; a detailed example is [here][12].

## Reference

McAuliffe, M., Coles, A., Goodale, M., Mihuc, S., Wagner, M., Stuart-Smith, J., & Sonderegger, M. (2019). ISCAN: A system for integrated phonetic analyses across speech corpora. Proceedings of the 19th International Congress of Phonetic Sciences, 1322–1326.

[1]: http://spade.glasgow.ac.uk/software/
[2]: https://memcauliffe.com/
[3]: https://github.com/MontrealCorpusTools/ISCAN
[4]: https://iscan.readthedocs.io/en/latest/#
[5]: https://spade.glasgow.ac.uk/wp-content/uploads/2019/04/iscan-icphs2019-revised.pdf
[6]: https://osf.io/4jfrm/wiki/Duration%20Datasets/
[7]: https://osf.io/4jfrm/wiki/Static%20Formant%20Datasets/
[8]: https://osf.io/4jfrm/wiki/Sibilant%20Measures%20Datasets/
[9]: https://polyglotdb.readthedocs.io/en/latest/
[10]: https://osf.io/4jfrm/wiki/Whitelisting/
[11]: https://osf.io/ud3e4/
[12]: https://osf.io/bknrg/
