Synthetic Data in Communication Sciences and Disorders: Promoting an Open, Reproducible, and Cumulative Science
Date created: 2024-07-31 11:36 AM | Last Updated: 2024-10-20 12:51 PM
Category: Project
Description: Reproducibility is a core principle of science, and access to a study’s data is essential to reproduce its findings. However, data sharing is uncommon in the field of Communication Sciences and Disorders (CSD), often due to concerns related to privacy and disclosure risks. Synthetic data offers a potential solution to this barrier by generating artificial datasets that do not represent real individuals yet retain statistical properties and relationships from the original data. This study evaluates the performance of synthetic data generation using open data from previously published studies across the American Speech-Language-Hearing Association (ASHA) ‘Big Nine’ domains. Findings suggest that synthetic data can effectively maintain statistical properties and relationships across a wide range of data commonly seen in the field of CSD. While some studies with fewer observations than recommended (i.e., n < 130) showed lower agreement and greater variability in p-values and effect size estimates, this pattern was not consistently observed. Therefore, researchers who use synthetic data should assess its stability in preserving their results. This study concludes with a general framework for sharing open data to facilitate computational reproducibility and foster a cumulative science in the field of CSD.
Data and analysis scripts to reproduce the analyses reported in this study are provided below.
To reproduce the manuscript, click "Github: AustinRThompson/...", then click "Download as zip". Unzip the download and open the Rproj file. Within the manuscript folder, synthetic_manuscript.qmd reproduces the manuscript in its entirety, whereas tutorial_script.R reproduces only the tutorial code.