This dataset includes:
1) Transcriptions of programmers speaking one or a few lines of Java code, along with the actual program statements they correspond to.
2) Single-line comments extracted from the [CodeXGLUE][2] dataset. The CodeXGLUE dataset is derived from the curated [CodeSearchNet][1] dataset, which was used for a code summarization task.
3) A 4-gram word-level mixture language model created by mixing a 4-gram [LibriSpeech][3] model, a 4-gram model trained on the single-line comment dataset from [CodeXGLUE][4], and a 4-gram model trained on the SpokenJava transcripts. The language model has a 203K-word vocabulary.
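
As a rough illustration of what mixing three language models can mean, the sketch below linearly interpolates one word's probability across three component models. It assumes linear interpolation, which is a common way to build mixture n-gram models but is not confirmed here as the method used for this dataset. The function name `mix_probability`, the weights, and the probabilities are all hypothetical values chosen for the example.

```python
import math


def mix_probability(component_probs, weights):
    """Linearly interpolate one word's probability across component models.

    component_probs: P(word | history) under each component model.
    weights: one mixture weight per model; the weights must sum to 1.
    """
    if len(component_probs) != len(weights):
        raise ValueError("need exactly one weight per component model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * p for w, p in zip(weights, component_probs))


# Hypothetical weights for the LibriSpeech, code-comment, and spoken-code
# models, in that order. These are NOT the weights used for this dataset.
weights = [0.5, 0.3, 0.2]

# Hypothetical probabilities each model assigns to the same next word
# given the same three-word history.
probs = [0.001, 0.010, 0.200]

mixed = mix_probability(probs, weights)
print(f"mixed probability: {mixed:.4f}")
print(f"log10 probability: {math.log10(mixed):.3f}")
```

In this toy case a word that is rare in read speech but common in spoken code still receives a usable probability, which is the usual motivation for mixing a general-domain model with in-domain ones.
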
You can read more about the research leading to this dataset in these papers:
Nowrin, S. and Vertanen, K. [Leveraging Large Pretrained Models for Line-by-Line Spoken Program Recognition][5]. Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing (2024).
Nowrin, S. and Vertanen, K. [Programming by Voice: Exploring User Preferences and Speaking Styles][6]. Proceedings of the 5th Conference on Conversational User Interfaces (2023).
Nowrin, S., Ordóñez, P. and Vertanen, K. [Exploring Motor-impaired Programmers' Use of Speech Recognition][7]. Proceedings of the ACM SIGACCESS Conference on Computers and Accessibility (2022).
[1]: https://github.com/github/CodeSearchNet
[2]: https://github.com/microsoft/CodeXGLUE
[3]: https://www.openslr.org/12
[4]: https://github.com/microsoft/CodeXGLUE
[5]: https://keithv.com/pub/linebyline/
[6]: https://keithv.com/pub/progvoice/
[7]: https://keithv.com/pub/progspeech/