# OHBM 2022 Abstract

## Introduction

About one-third of all datasets (N = 640)[1] shared on OpenNeuro[2] in the Brain Imaging Data Structure (BIDS)[3] format are missing data about participants. Among the two-thirds with a participants.tsv, only roughly half have a participants.json, even though tools such as PyNIDM[4] exist to create BIDS-compliant sidecar files. Strikingly, only 2.3% of datasets have phenotypic or assessment data.

Participant data (e.g. age), phenotype data (e.g. blood tests), and assessment data (e.g. responses to a survey), collectively referred to here as phenotypic data, are regularly collected, but data wranglers are not always equipped to put these data into a shareable format. While they load these data into dataframe objects for analysis, they rarely write reproducible code to export them as tabular data. Another factor that influences data structure and, indirectly, sharing is the large longitudinal study design (e.g. ABCD, HCP, and UK Biobank), which is increasingly common. Where data sharing is allowed, the main self-reported barriers to FAIR-principled data sharing are the time it takes, lack of funding, and lack of knowledge[5]. This project aims to expand on the current BIDS standards for phenotypic data with guidelines and public tools.

## Methods

We developed strategies for working with phenotypic data that comply with the current BIDS standard, including use cases for end users. Given the flexibility offered by the definition of a session in the BIDS standard (i.e. "a logical grouping of neuroimaging and behavioral data consistent across subjects"), we analyzed the current BIDS standard to determine optimal approaches, including for cases where longitudinal or asynchronous phenotypic data must be organized alongside scans, or where a session contains multiple assessments.
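
One of the simplest end-user use cases is the export step that the Introduction finds missing: writing a participants table together with its data dictionary. The following is a minimal sketch using only the Python standard library, not one of the tools described in this abstract. The function name `write_participants`, the example columns (`age`, `sex`), and their descriptions are hypothetical; the `participant_id` column, the `n/a` placeholder for missing values, and the `Description`, `Units`, and `Levels` sidecar keys follow BIDS conventions.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical example data; column names and values are illustrative only.
EXAMPLE_ROWS = [
    {"participant_id": "sub-02", "age": 31, "sex": "F"},
    {"participant_id": "sub-01", "age": 27},  # 'sex' was not recorded
]
EXAMPLE_DESCRIPTIONS = {
    "age": {"Description": "Age at enrollment", "Units": "years"},
    "sex": {"Description": "Self-reported sex",
            "Levels": {"F": "female", "M": "male"}},
}


def write_participants(rows, descriptions, bids_root):
    """Write participants.tsv and participants.json under bids_root.

    rows: list of dicts, each with a 'participant_id' key.
    descriptions: dict mapping a column name to its data dictionary entry.
    Returns the column order used in the TSV file.
    """
    root = Path(bids_root)
    root.mkdir(parents=True, exist_ok=True)

    # participant_id comes first; other columns follow in sorted order
    # so that repeated runs produce identical files.
    extra = sorted({key for row in rows for key in row} - {"participant_id"})
    columns = ["participant_id"] + extra

    with open(root / "participants.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t",
                                restval="n/a", lineterminator="\n")
        writer.writeheader()
        for row in sorted(rows, key=lambda r: r["participant_id"]):
            writer.writerow(row)  # keys absent from a row are written as n/a

    # The sidecar documents only the columns that are present and described.
    sidecar = {col: descriptions[col] for col in columns if col in descriptions}
    (root / "participants.json").write_text(
        json.dumps(sidecar, indent=2) + "\n", encoding="utf-8")
    return columns


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_participants(EXAMPLE_ROWS, EXAMPLE_DESCRIPTIONS, tmp)
        print((Path(tmp) / "participants.tsv").read_text(encoding="utf-8"))
```

Sorting both rows and columns is a design choice made here so that the exported files are stable under version control; it is not a BIDS requirement.
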

## Results

When phenotypic data are present, the BIDS standard[6] currently requires that phenotypic and assessment data be placed in a phenotype folder at the root level of the dataset. We narrowed down the options for appropriate treatment of phenotype data to three possibilities for now.

1. **Inheritance**: With the BIDS inheritance principle in mind, this option is to include a phenotype subfolder in every subject (or session) folder with responses for just that subject. You may use this approach when subjects are tested longitudinally and phenotypic data are so numerous that storing them on a subject or session basis is less cumbersome and makes the data easier to manipulate.
2. **One per measurement tool**: As the BIDS standard currently describes, this option is to provide one pair of files, a data dictionary and a data table, for each measurement tool. The user collects all subjects, sessions, or runs of data as one entry per row (whichever is the smallest unit). Use this approach when data are relatively simple and short, with rows limited to a few thousand and columns to a hundred.
3. **External descriptor(s)**: Similar to the current Genetic Descriptor in BIDS, this option requires little or no transformation and simply points to an external location from which the data can be pulled for wrangling. Use this approach when data use agreements prevent sharing of certain data, or to simplify sharing given the complexity of one's phenotypic data.

Work has begun with the BIDS maintainers to integrate these community-consensus-driven guidelines, along with the creation of examples and open-source tools to help wrangle and generate phenotypic files.

## Conclusions

An extract, transform, and load (ETL) pipeline proves necessary to curate large datasets. While an ETL pipeline can be a series of "manual" steps to copy data from one or more sources into a destination system that represents the data differently, such as BIDS, it can be made reproducible by documenting each step[7].
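
To illustrate what documenting each step can look like in practice, here is a minimal ETL sketch under stated assumptions; it is not the tooling referred to in this abstract. The raw export, its column names (`record_id`, `visit`, `phq9_total`, `gad7_total`), the function names, and the use of a `session_id` column are all hypothetical. Only the `phenotype/<tool>.tsv` target layout, the `participant_id` column, and the `n/a` placeholder follow the current BIDS standard.

```python
import csv
import io
import json

# Hypothetical raw export: one row per participant visit, two tools per row.
RAW_CSV = """record_id,visit,phq9_total,gad7_total
1,baseline,4,3
1,followup,6,
2,baseline,11,9
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records, tool):
    """Transform: keep one tool's column and build BIDS-style identifiers."""
    column = f"{tool}_total"
    return [
        {
            "participant_id": f"sub-{int(r['record_id']):02d}",
            "session_id": f"ses-{r['visit']}",  # assumed column, see lead-in
            column: r[column] or "n/a",  # empty cells become n/a
        }
        for r in records
    ]


def load(rows):
    """Load: serialize rows as tab-separated text ready to be written out."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_pipeline(text, tools):
    """Run every step and keep a log entry for each, as documentation."""
    log = []
    records = extract(text)
    log.append({"step": "extract", "rows": len(records)})
    outputs = {}
    for tool in tools:
        rows = transform(records, tool)
        log.append({"step": "transform", "tool": tool, "rows": len(rows)})
        target = f"phenotype/{tool}.tsv"
        outputs[target] = load(rows)
        log.append({"step": "load", "target": target})
    return outputs, log


if __name__ == "__main__":
    outputs, log = run_pipeline(RAW_CSV, ["phq9", "gad7"])
    print(outputs["phenotype/gad7.tsv"])
    print(json.dumps(log, indent=2))
```

Keeping each stage as a separate function and returning the log alongside the outputs means the record of what was done is produced by the same code that did it, rather than written up afterwards.
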
We want to share our solution and spark conversations with users and other complementary initiatives such as NIDM, Common Data Elements, Reproschema, and Psych-DS to lower the barriers to FAIR-principled[8] data and achieve broader community adoption.

## References

1. Counts of https://github.com/OpenNeuroDatasets repositories as of 2021-12-14.
2. Markiewicz, C. J., Gorgolewski, K. J., Feingold, F., Blair, R., Halchenko, … & Poldrack, R. (2021). The OpenNeuro resource for sharing of neuroscience data. eLife, 10, e71774. https://doi.org/10.7554/eLife.71774
3. Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., … & Poldrack, R. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3(1), 160044. https://doi.org/10.1038/sdata.2016.44
4. PyNIDM: Neuroimaging Data Model in Python (3.9.5). (2021). [Python]. https://pynidm.readthedocs.io/en/latest/
5. Paret, C., Unverhau, N., Feingold, F., Poldrack, R. A., Stirner, M., Schmahl, C., & Sicorello, M. (2021). Survey on Open Science Practices in Functional Neuroimaging [Preprint]. Scientific Communication and Education. https://doi.org/10.1101/2021.11.26.470115
6. BIDS-Contributors. (2021). The Brain Imaging Data Structure (BIDS) Specification. https://doi.org/10.5281/ZENODO.3686061
7. Denney, M. J., Long, D. M., Armistead, M. G., Anderson, J. L., & Conway, B. N. (2016). Validating the extract, transform, load process used to populate a large clinical research database. International Journal of Medical Informatics, 94, 271–274. https://doi.org/10.1016/j.ijmedinf.2016.07.009
8. Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., … & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18