## Dataset Generation

The measures datasets were generated from the SPADE corpora using [software][1] built as part of the SPADE project: the Integrated Speech Corpus ANalysis (ISCAN) package, which builds on the earlier PolyglotDB package. These packages enable integrated speech corpus analysis using Python scripts or a GUI. The software was developed from 2016 to 2019 by a team at McGill, led by [Michael McAuliffe][2]. The [ISCAN][3] software is available on GitHub, with [full documentation][4]. A paper by [McAuliffe et al. (2019)][5] gives a high-level description.

This subproject (Dataset Generation) contains the Python scripts used to generate the measures datasets with ISCAN, one for each acoustic measure below, as well as a Praat script and some auxiliary R scripts.

Brief descriptions of the measures datasets, together with a summary of all column labels for each acoustic measure, are given here:

- [vowel durations][6]
- [static formants][7]
- [sibilant measures][8]

The ISCAN, Praat and R scripts used to take the measures are posted in this subproject (`sibilants.py`, etc.). Specific information about the code used in the ISCAN scripts can be found in the [documentation here][9].

This subproject also contains the "whitelist" of words (`whitelist_spade.csv`) used to anonymise derived measures datasets generated from restricted corpora before posting, as detailed [here][10]. For example, the sibilants dataset for the Canadian Prairies corpus, `spade-Canadian-Prairies_sibilants_whitelisted.csv`, was generated by applying the script `sibilants.py` to the [Canadian Prairies Corpus][11], then removing all rows corresponding to words not in `whitelist_spade.csv`.

IMPORTANT NOTE: These scripts use *automatic* acoustic analysis procedures, which take as input the word-level and phone-level segmentation for each corpus. For all corpora except Buckeye and TIMIT, this segmentation is itself automatic, produced by forced alignment. The derived measures datasets posted here give these raw measures, and users may decide on their own procedures for data cleaning (e.g. outlier removal) before analysis. All analyses published so far by the SPADE team use some data cleaning; a detailed example is [here][12].

## Reference

McAuliffe, M., Coles, A., Goodale, M., Mihuc, S., Wagner, M., Stuart-Smith, J., & Sonderegger, M. (2019). ISCAN: A system for integrated phonetic analyses across speech corpora. Proceedings of the 19th International Congress of Phonetic Sciences, 1322–1326.

[1]: http://spade.glasgow.ac.uk/software/
[2]: https://memcauliffe.com/
[3]: https://github.com/MontrealCorpusTools/ISCAN
[4]: https://iscan.readthedocs.io/en/latest/#
[5]: https://spade.glasgow.ac.uk/wp-content/uploads/2019/04/iscan-icphs2019-revised.pdf
[6]: https://osf.io/4jfrm/wiki/Duration%20Datasets/
[7]: https://osf.io/4jfrm/wiki/Static%20Formant%20Datasets/
[8]: https://osf.io/4jfrm/wiki/Sibilant%20Measures%20Datasets/
[9]: https://polyglotdb.readthedocs.io/en/latest/
[10]: https://osf.io/4jfrm/wiki/Whitelisting/
[11]: https://osf.io/ud3e4/
[12]: https://osf.io/bknrg/
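
## Example: applying the whitelist

The whitelisting step described under Dataset Generation (removing all rows whose word is not in `whitelist_spade.csv`) can be illustrated with a minimal sketch. This is not the SPADE team's actual code. The column names (`word`, `phone`, `cog`), the sample rows, and the helper functions `load_whitelist` and `filter_rows` are hypothetical, and exact string matching on the word is an assumption. The real whitelist and measures files may be laid out differently.

```python
import csv
import io


def load_whitelist(handle, column="word"):
    """Read the allowed words from a CSV file object into a set.

    The column name "word" is an assumption about the whitelist layout.
    """
    return {row[column].strip() for row in csv.DictReader(handle)}


def filter_rows(handle, whitelist, word_column="word"):
    """Return (fieldnames, rows), keeping only rows whose word is whitelisted."""
    reader = csv.DictReader(handle)
    kept = [row for row in reader if row[word_column].strip() in whitelist]
    return reader.fieldnames, kept


# Tiny in-memory stand-ins for whitelist_spade.csv and a measures dataset.
# All column names and values here are made up for illustration.
whitelist_csv = io.StringIO("word\nsee\nshe\n")
measures_csv = io.StringIO(
    "speaker,word,phone,cog\n"
    "s01,see,s,6500\n"
    "s01,Smithson,s,6200\n"
    "s02,she,sh,4100\n"
)

whitelist = load_whitelist(whitelist_csv)
fieldnames, kept_rows = filter_rows(measures_csv, whitelist)

# Write the surviving rows back out in CSV form.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(kept_rows)
print(out.getvalue())
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(..., newline="")`. The row containing the non-whitelisted word is dropped entirely, not masked, which matches the description of the step above.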