Ubiquitous Research Preservation: Transforming Knowledge Preservation in Computational Science

Research preservation is crucial for supporting researchers’ sensemaking and knowledge sharing. However, researchers' limited compliance with capturing strategies is a barrier to creating complete scientific repositories. In this paper, we introduce Ubiquitous Research Preservation, which we envision to automate preservation in computational science. We contribute a characterization of preservation processes, illustrate the spectrum of technology interventions, and describe research challenges and opportunities for Ubiquitous Research Preservation in computation-based scientific domains.


INTRODUCTION
Preservation of scientific knowledge enables researchers to reflect on past choices and to share resources and findings with the scientific community. Yet, preserving and sharing research requires substantial effort [1]. Studies have shown that documentation and preservation technology needs to ease scientists' efforts and make use of automated recording and processing mechanisms [3,6,8]. This paper focuses on research preservation in computational and data-driven science. Although barriers to capturing and sharing resources in computation-based science are rather low, availability and sharing of digital resources remain a major concern [2,7]. In fact, shortcomings in personal repositories often require creative solutions.

Motivation and Background
Oleksik et al. [6] reported on their observational study on electronic lab notebooks (ELN) in a research organization. They found that the flexibility of digital media can lead to much less precision during experiment recording and that "freezing" parts of the record might be necessary. The authors stressed that "ELN environments need to incorporate automatic or semi-automatic features that are supported by sophisticated technologies [...]." Studying the use of a hybrid laboratory notebook, Tabard et al. [8] found that "users clearly do not want to focus on the process of capturing information." Yet, they also noted that automated mechanisms can be intrusive and that users need to be in control of the recording and sharing. They illustrated the importance of reflection in the scientific process and highlighted how access to preserved, redundant information supports reflection, as "scientists understand how their thoughts have evolved over time." In our ongoing research, we study practices around research preservation in High Energy Physics (HEP) [3,4]. In an interview study with HEP data analysts [3], we found that lack of preservation and sharing highly impacts the ability to reuse and reproduce work in this data-intensive, computational environment. We also found that HEP data analysis work is based on common building blocks that foster implementation of automated recording strategies.

Box 1: Characterizing Researcher Interaction
Based on our research in experimental physics, we introduce and define two dimensions to characterize preservation practices from a researcher point of view: Initiative and Resource Awareness.
Initiative: Identifies the entity responsible for initiating a preservation process.
User-Initiated: The researcher is responsible for process initiation and control. The user decides on suitable occasions.
Machine-Initiated: The machine initiates and controls processes. Decisions might be based on workflow knowledge, pre-configured domain rules, and/or user-configured rules.
Resource Awareness: Describes how aware researchers are of the selection of resources in the preservation process.
Conscious: Only resources selected by the user are preserved.
Unaware: The user has no direct control over the resources that are preserved. However, they might have previously set rules for this process.
Kery et al. [5] asked scientists to think about "a magical perfect record" in their study of literate programming tools. Participants created queries referring to "many kinds of contextual details, including libraries used, output, plots, [...]." Participants described their inability to find prior analyses and illustrated consequences. The authors found that in literate programming tools, "version control is currently poor enough that records of prior iterations often do not exist."

TECHNOLOGY INTERVENTIONS FOR RESEARCH PRESERVATION
To describe the spectrum of technology intervention in the preservation of machine-processed research, we characterize preservation efforts from a researcher point of view, taking into account our research in experimental physics [3,4]. Researchers commonly document, preserve and possibly share information and resources in lab notebooks, cloud services or dedicated research preservation services (e.g. Figshare and Zenodo). Alternatively, they commit assets to repositories (e.g. GitHub). In either case, those actions are mostly user-initiated. Scientists who, for any reason, decide to preserve or share their research make a conscious selection of their study's data and materials. We assigned those characteristics to the dimensions Initiative and Resource Awareness, as described in Box 1.
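The two dimensions from Box 1 can be read as a small data structure. The following is a minimal sketch, not part of our tooling: the class names, the `quadrant` helper and the example label are illustrative assumptions. It only shows how a preservation practice could be tagged with one value per dimension.

```python
from dataclasses import dataclass
from enum import Enum


class Initiative(Enum):
    """Who initiates the preservation process (Box 1)."""
    USER_INITIATED = "user-initiated"
    MACHINE_INITIATED = "machine-initiated"


class ResourceAwareness(Enum):
    """How aware the researcher is of which resources are preserved (Box 1)."""
    CONSCIOUS = "conscious"
    UNAWARE = "unaware"


@dataclass(frozen=True)
class PreservationPractice:
    """Hypothetical record type; the label is free text chosen for illustration."""
    label: str
    initiative: Initiative
    awareness: ResourceAwareness

    def quadrant(self) -> str:
        # Combine both dimensions into the "initiative/awareness" notation.
        return f"{self.initiative.value}/{self.awareness.value}"


# A researcher deliberately uploading selected files to a preservation service.
manual_upload = PreservationPractice(
    label="upload selected files to a preservation service",
    initiative=Initiative.USER_INITIATED,
    awareness=ResourceAwareness.CONSCIOUS,
)

# All combinations the two dimensions allow.
all_quadrants = [
    f"{i.value}/{a.value}" for i in Initiative for a in ResourceAwareness
]
```

Under these assumptions, today's common practices fall into the user-initiated/conscious combination, while the remaining three combinations describe interventions with more machine support.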

Towards Ubiquitous Research Preservation
We characterize automated preservation strategies based on Initiative and Resource Awareness. In contrast to current user-initiated preservation efforts, machine-supported recording of workflows would be Machine-Initiated. Here, researchers could be Unaware of continuous background preservation efforts. This envisioned transformation is based on the demonstrated need to support researchers through automated preservation processes. The described dimensions and characteristics enable a wide spectrum of technology interventions, as depicted in Figure 1. For example, technology could implement a completely machine-initiated/unaware preservation of computational processes. Such an approach could guarantee (near-)continuous workflow recording, possibly taking inspiration from extreme forms of documentation like lifelogging.
Related work showed that control is an important factor in research preservation. Technology supporting user-initiated/unaware interactions might make an important contribution towards acceptance. For example, a researcher who considers a process to be relevant in the future could start an application or execute a command that initiates recording of computational states and changes (see Figure 2). The researcher should be able to stop this process at any time.
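A user-initiated/unaware interaction of this kind could be sketched as follows. This is a simplified illustration under our own assumptions, not the application shown in Figure 2: the `RecordingSession` class and the `snapshot` helper are hypothetical, and "computational states and changes" are reduced to content hashes of files in one directory. The researcher decides when recording starts and stops, but does not select individual files.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file below root (by relative path) to a SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


class RecordingSession:
    """Hypothetical recorder that the researcher starts and stops explicitly."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.before: dict[str, str] | None = None

    def start(self) -> None:
        # User-initiated: nothing is recorded before this call.
        self.before = snapshot(self.root)

    def stop(self) -> dict[str, list[str]]:
        # The researcher can stop at any time; the result lists what changed.
        if self.before is None:
            raise RuntimeError("recording was never started")
        after = snapshot(self.root)
        shared = after.keys() & self.before.keys()
        changes = {
            "added": sorted(after.keys() - self.before.keys()),
            "removed": sorted(self.before.keys() - after.keys()),
            "modified": sorted(k for k in shared if after[k] != self.before[k]),
        }
        self.before = None
        return changes
```

A session would be used by calling `start()` before the relevant work and `stop()` afterwards. A real recorder would also need to capture the environment, commands and outputs, which this sketch leaves out.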
Machine-initiated/conscious interaction could also provide researchers with control. Here, the machine might actively propose that users preserve certain processes. This decision would need to be based on pre-defined triggers or in-depth workflow knowledge. A researcher might receive a notification detailing the proposed initiation of a preservation process or activity (see Figure 3).
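The machine-initiated/conscious case can be sketched in the same spirit. The event type, the trigger and the `ask_user` callback below are hypothetical names chosen for illustration; the notification in Figure 3 is represented only by that callback. The machine proposes preservation when a pre-defined trigger fires, and only the resources the researcher selects are kept.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class WorkflowEvent:
    """Hypothetical description of something that happened in a workflow."""
    kind: str
    resources: tuple[str, ...]


@dataclass(frozen=True)
class Proposal:
    """What the machine suggests preserving, and why."""
    reason: str
    resources: tuple[str, ...]


# A trigger returns a reason if the event should lead to a proposal.
Trigger = Callable[[WorkflowEvent], Optional[str]]


def job_finished_trigger(event: WorkflowEvent) -> Optional[str]:
    # Assumed example of a pre-defined trigger.
    if event.kind == "job_finished" and event.resources:
        return "an analysis job finished and produced output"
    return None


def propose(event: WorkflowEvent, triggers: Iterable[Trigger]) -> Optional[Proposal]:
    # Machine-initiated: the first matching trigger yields a proposal.
    for trigger in triggers:
        reason = trigger(event)
        if reason is not None:
            return Proposal(reason, event.resources)
    return None


def handle(
    event: WorkflowEvent,
    triggers: Iterable[Trigger],
    ask_user: Callable[[Proposal], Iterable[str]],
) -> tuple[str, ...]:
    """Return the resources to preserve; an empty tuple means nothing is kept."""
    proposal = propose(event, triggers)
    if proposal is None:
        return ()
    # Conscious: keep only proposed resources that the user selected.
    selected = set(ask_user(proposal))
    return tuple(r for r in proposal.resources if r in selected)
```

In an interactive tool, `ask_user` would show the notification and return the researcher's selection; returning an empty selection declines the proposal.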
We refer to the spectrum of technology interventions for machine-supported recording of computation-based research as Ubiquitous Research Preservation (URP). We define URP and URP technology in Box 2.

Box 2: Definitions
Ubiquitous Research Preservation (URP) refers to the machine-supported scientific knowledge recording and preservation process of computational workflows.
URP technology initiates and/or controls partial or complete preservation.

RESEARCH CHALLENGES AND FUTURE WORK
Our research and related studies illustrated various challenges resulting from automated recording strategies. Here, we expand on challenges and opportunities for research on URP technology:
Usefulness. To create complete "magical records", preserved data need to be annotated, searchable and suitable for desired use cases. It will be important to manage the signal-to-noise ratio, as well as to find suitable ways for information discovery and presentation.
Generalizability. As URP technology profits from knowledge about research practices for recording and presenting information, development of assistive technology across heterogeneous environments needs to be further researched. Research questions include: How can technology assess researchers' practices and needs, and integrate into their workflows? Can we create accessible templates based on learned and confirmed structures? How does technology adapt to scientific novelty and creativity?
Control. Acceptance of URP technology will depend on researchers' perceived control over the preservation process. Figure 4 shows our <Recorder> that continuously captures the screen and title of applications that the user selected for recording. Though we need to further evaluate the <Recorder>, it is clear that researchers want to control capturing and sharing. This conflict between exercising control over the preservation process and desired automated preservation requires further study.
Integration. The landscape of connected devices that measure, generate or process scientific data is large and diverse. Devices range from desktop computers to microscopes and sensors. Integrating all those data sources into the preservation process poses further challenges regarding user control and system architectures. As depicted in Figure 5, some devices will implement URP strategies themselves. Even though our examples and developments are mostly limited to computer applications, a wide variety of connected devices can offer URP by directly communicating with repository servers. Other devices can be connected to URP technology, which acts as a proxy in the preservation process.
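The two integration paths described above, direct communication with a repository server and communication through a proxy, can be sketched as follows. All class names are hypothetical, the repository is an in-memory stand-in rather than a real server, and the assignment of a microscope and a sensor to the two paths is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Repository:
    """In-memory stand-in for a repository server."""
    records: list[tuple[str, str, bytes]] = field(default_factory=list)

    def store(self, source: str, via: str, payload: bytes) -> None:
        # Keep track of which device produced the data and how it arrived.
        self.records.append((source, via, payload))


@dataclass
class UrpDevice:
    """Device that implements a URP strategy and talks to the repository itself."""
    name: str
    repository: Repository

    def emit(self, payload: bytes) -> None:
        self.repository.store(self.name, "direct", payload)


@dataclass
class UrpProxy:
    """URP technology that preserves data on behalf of connected devices."""
    name: str
    repository: Repository

    def forward(self, device_name: str, payload: bytes) -> None:
        self.repository.store(device_name, f"proxy:{self.name}", payload)


@dataclass
class PlainDevice:
    """Device without URP support; it only knows the proxy it is connected to."""
    name: str
    proxy: UrpProxy

    def emit(self, payload: bytes) -> None:
        self.proxy.forward(self.name, payload)


repo = Repository()
UrpDevice("microscope", repo).emit(b"image-bytes")
PlainDevice("sensor", UrpProxy("workstation", repo)).emit(b"42")
```

Recording the path alongside the source is one possible design choice; it would let a repository distinguish data preserved by the device itself from data preserved on its behalf.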

DISCUSSION AND CONCLUSION
We described our past and current efforts, aiming to spark discussions and further research on machine-automated preservation in computation-based science. We illustrated a broad spectrum of technology interventions that we refer to as Ubiquitous Research Preservation (URP). We expect URP to make a positive impact on researchers' ability to reflect on past processes, to provide training material and to improve the reproducibility of their work. Yet, we do not intend to oversimplify complex use cases. Preservation is a first step towards supporting those, but it is not the only requirement. In particular, the decision to share resources depends not only on the effort to preserve data, but also on various other factors, including competition, fear of judgement and privacy policies.
We described four major research challenges, crucial for the design and acceptance of URP technology. Usefulness and control will be crucial for the acceptance and use of URP systems. Generalizability needs to be considered, to provide fast and wide access to URP tools and to include even branches of science and organizations that find it challenging to spend considerable resources on the development and adaptation of URP systems. Finally, the diverse landscape of connected, data-producing or data-processing devices needs to be integrated into URP systems. Developments and URP architectures must not be limited to computer applications.
Our research focuses on computational science, as automated, machine-supported knowledge preservation promises to best map experimental processes and resources. Yet, as all science has become, to varying degrees, connected to computation, we expect URP to benefit scientific domains beyond computational science. Similarly, URP is likely to impact technology users well beyond science.