Presentation abstract:
There is a pressing need to establish best practices for data curation professionals in response to the increasing prevalence and application of machine learning (ML) across disciplines. ML outputs are resource intensive to create, requiring large amounts of training and test data, processing power, and specialized programming knowledge; broad sharing of these outputs can therefore make future research more efficient by enabling reuse. However, formal, community-accepted guidelines and recommended practices for documenting and sharing ML objects remain sparse within library-centric professions and across data repositories. In this talk, we will discuss an ongoing project to better understand current practices for sharing and reuse of ML components (data, code, workflows, etc.). A core part of this project is an in-depth analysis of ML objects from a selection of repositories that specialize in ML research workflows and outputs, as well as several generalist repositories, including Figshare and Zenodo. By analyzing the metadata of ML objects extracted via API and web scraping, we aim to address a variety of questions relevant to reusability, such as: What is the most commonly used license for ML components? How often are the training and test datasets necessary for reproducing and evaluating an ML model included with an ML object? Is the software environment clearly documented? Answers to these questions are only the first step toward understanding the landscape of ML objects in the context of reusability. In addition to assessing how ML objects are being shared, we will leverage the FAIR Principles to identify and classify the minimum viable metadata that makes an ML object and/or project reusable for a typical practitioner. We look forward to feedback from and discussion with the RDAP community.