The VoxCommunis Corpus contains acoustic models, lexicons, and force-aligned TextGrids with phone- and word-level segmentations derived from the [Mozilla Common Voice Corpus][1]. The Mozilla Common Voice Corpus contains transcribed audio data from over 70 languages. Both the Mozilla Common Voice Corpus and the derivative VoxCommunis Corpus are free to download and use under a CC0 license. As of writing, most files are based on Common Voice Version 7.0; files with the suffix "\_cv10" are based on Common Voice Version 10.0.

The lexicons are developed using [Epitran][2] and the [XPF Corpus][3], which are both rule-based G2P systems. Some manual correction has been applied, and we hope to continue improving these; updates from the community are welcome.

The acoustic models were trained using the [Montreal Forced Aligner (version 2.0)][4], and the force-aligned TextGrids are obtained directly from those alignments. These acoustic models can be downloaded and re-used with the Montreal Forced Aligner to align new data.

The spkr\_files contain the mapping from the original client\_id to the simplified spkr\_id used in the formant data. The speaker IDs in the formant data are based on the client\_id order in the validated set of Common Voice Version 7.0 and are generated by running remap\_spkrs.py on validated.tsv (included in the Common Voice language-specific download).

For use of this derivative data, please cite the original corpus ([Mozilla Common Voice Corpus][1]), as well as:

Ahn, Emily, and Chodroff, Eleanor. (2022). [VoxCommunis: A corpus for cross-linguistic phonetic analysis][5]. Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022).

Please be forewarned: errors have been flagged in the Armenian corpus, which merges Western and Eastern Armenian and misses several schwas in the transcription.
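The exact logic lives in remap\_spkrs.py, but based on the description above, the remapping can be sketched as follows: each client\_id is assigned a simplified speaker ID in order of its first appearance in validated.tsv. This is a minimal illustration; the column name (`client_id`) matches the Common Voice release, while the function name and the numeric ID format are our assumptions, not the script's actual output format.

```python
import csv

def remap_speakers(validated_tsv_path):
    """Sketch of the client_id -> spkr_id remapping: IDs are assigned
    by order of first appearance in validated.tsv.
    (Illustrative only; see remap_spkrs.py for the actual logic.)"""
    mapping = {}
    with open(validated_tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            client = row["client_id"]
            if client not in mapping:
                # First time this speaker is seen: next sequential ID
                mapping[client] = len(mapping) + 1
    return mapping
```

A speaker who contributes multiple validated clips keeps a single ID, since only the first occurrence of each client\_id creates a new entry.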
If you identify other major errors, please let us know; we will document the limitations and update the resource if and when possible. We will likely migrate this repository to GitHub soon to enable pull requests and facilitate updates.

Additional code for classifying speakers as having "high" or "low" formant settings can be found here: https://github.com/emilyahn/outliers/blob/main/src/assign_formant_range.py

[1]: https://commonvoice.mozilla.org/en/datasets
[2]: https://github.com/dmort27/epitran
[3]: https://cohenpr-xpf.github.io/XPF/Convert-to-IPA.html
[4]: https://montreal-forced-aligner.readthedocs.io/en/latest/index.html
[5]: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.566.pdf