Dataset Generation
The measures datasets were generated from the SPADE corpora using software built as part of the SPADE project: the Integrated Speech Corpus ANalysis (ISCAN) package, which builds on the earlier PolyglotDB package. These packages enable integrated speech corpus analysis using Python scripts or a GUI. The software was developed from 2016 to 2019 by a team at McGill, led by Michael McAuliffe. ISCAN is available on GitHub, with full documentation. A paper by McAuliffe et al. (2019) gives a high-level description.
This subproject (Dataset Generation) contains the Python scripts used to generate the measures datasets with ISCAN, one for each acoustic measure below, as well as a Praat script and some auxiliary R scripts.
Brief descriptions of the measures datasets, together with a summary of all column labels for each acoustic measure, are given here.
The ISCAN, Praat and R scripts used to take the measures are posted in this subproject (sibilants.py, etc.). Specific information about the code used in the ISCAN scripts can be found in the documentation here.
This subproject also contains the "whitelist" of words (whitelist_spade.csv) used to anonymise derived measures datasets generated from restricted corpora, before posting, as detailed here.
For example, the sibilants dataset for the Canadian Prairies corpus, spade-Canadian-Prairies_sibilants_whitelisted.csv, was generated by applying the script sibilants.py to the Canadian Prairies corpus, then removing all rows corresponding to words not in whitelist_spade.csv.
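The whitelisting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used by the SPADE team: the column names (word, phone, cog), the example rows, and the exact-match comparison are all assumptions, and the real files may be laid out differently.

```python
import csv
import io


def load_whitelist(lines, column="word"):
    """Read the whitelisted words into a set (column name is an assumption)."""
    return {row[column] for row in csv.DictReader(lines)}


def keep_whitelisted(rows, whitelist, column="word"):
    """Keep only the rows whose word appears in the whitelist (exact match)."""
    return [row for row in rows if row[column] in whitelist]


# Tiny in-memory stand-ins for whitelist_spade.csv and a measures dataset.
# The contents and column labels here are hypothetical.
whitelist_csv = io.StringIO("word\nsee\nshe\n")
measures_csv = io.StringIO(
    "speaker,word,phone,cog\n"
    "s01,see,s,6500\n"
    "s01,Saskatoon,s,6100\n"
    "s02,she,sh,4200\n"
)

whitelist = load_whitelist(whitelist_csv)
measures = list(csv.DictReader(measures_csv))
filtered = keep_whitelisted(measures, whitelist)

print([row["word"] for row in filtered])  # → ['see', 'she']
```

With real files, the io.StringIO objects would be replaced by open file handles. Whether matching should ignore case is a decision this sketch does not make.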
IMPORTANT NOTE: These scripts use automatic acoustic analysis procedures, which take as input the word and phone-level segmentation for each corpus. For all corpora bar Buckeye and TIMIT, this segmentation is itself automatic, from forced alignment. The resulting derived measures datasets posted here give these raw measures, and users may decide on their own procedures for data cleaning (e.g. outlier removal) before analysis. All analyses published so far by the SPADE team use some data cleaning; a detailed example is here.
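As one illustration of the kind of data cleaning a user might apply, the sketch below drops values that lie more than k standard deviations from each speaker's mean. This is a generic technique chosen for illustration only, not the procedure used in SPADE publications. The field names (speaker, cog), the threshold k=2.5, and the example values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def drop_outliers(rows, value_key, group_key, k=2.5):
    """Drop rows whose value is more than k sample SDs from their group mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])

    kept = []
    for row in rows:
        values = groups[row[group_key]]
        if len(values) < 3:
            # Too few tokens to estimate spread; keep the row unchanged.
            kept.append(row)
            continue
        centre, spread = mean(values), stdev(values)
        if spread == 0 or abs(row[value_key] - centre) <= k * spread:
            kept.append(row)
    return kept


# Hypothetical measurements for one speaker, with one implausible value (500).
values = [100, 102, 98, 101, 99, 100, 101, 99, 100, 500]
rows = [{"speaker": "s01", "cog": v} for v in values]

cleaned = drop_outliers(rows, value_key="cog", group_key="speaker")
print(len(rows), len(cleaned))  # → 10 9
```

A threshold based on standard deviations is sensitive to the outliers themselves, so a user may prefer a median-based rule or measure-specific plausibility limits.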
Reference
McAuliffe, M., Coles, A., Goodale, M., Mihuc, S., Wagner, M., Stuart-Smith, J., & Sonderegger, M. (2019). ISCAN: A system for integrated phonetic analyses across speech corpora. Proceedings of the 19th International Congress of Phonetic Sciences, 1322–1326.