# The BIDS Best Practices in Data Science Series

This series is a set of reflections and write-ups from meetings we regularly hold at the [Berkeley Institute for Data Science](bids.berkeley.edu), where we bring together a wide range of people from across the UC Berkeley campus and beyond to discuss how to do something in data science well, or at least better.

## Ten Simple Rules on Writing Clean and Reliable Open-Source Scientific Software

<a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009481">Download PDF here.</a>

Functional, usable, and maintainable open-source software is increasingly essential to scientific research, but there is a large variation in formal training for software development and maintainability. Here, we propose 10 “rules” centered on 2 best practice components: clean code and testing. These 2 areas are relatively straightforward and provide substantial utility relative to the learning investment. Adopting clean code practices helps to standardize and organize software code in order to enhance readability and reduce cognitive load for both the initial developer and subsequent contributors; this allows developers to concentrate on core functionality and reduce errors. Clean coding styles make software code more amenable to testing, including unit tests that work best with modular and consistent software code. Unit tests interrogate specific and isolated coding behavior to reduce coding errors and ensure intended functionality, especially as code increases in complexity; unit tests also implicitly provide example usages of code. Other forms of testing are geared to discover erroneous behavior arising from unexpected inputs or emerging from the interaction of complex codebases.
Although conforming to coding styles and designing tests can add time to the software development project in the short term, these foundational tools can help to improve the correctness, quality, usability, and maintainability of open-source scientific software code. They also advance the principal point of scientific research: producing accurate results in a reproducible way. In addition to suggesting several tips for getting started with clean code and testing practices, we recommend numerous tools for the popular open-source scientific software languages Python, R, and Julia.

## Principles of Data Analysis Workflows

<a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008770">Download PDF here.</a>

Abstract: A systematic and reproducible “workflow” (the process that moves a scientific investigation from raw data to coherent research question to insightful contribution) should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students new to research and current researchers who are new to data-intensive work.

## Challenges of Doing Data-Intensive Research in Teams, Labs, and Groups

<a href="https://osf.io/preprints/socarxiv/a7b3m/download">Download PDF here.</a>

Abstract: What are the challenges and best practices for doing data-intensive research in teams, labs, and other groups? This paper reports from a discussion in which researchers from many different disciplines and departments shared their experiences on doing data science in their domains. The issues we discuss range from the technical to the social, including issues with getting on the same computational stack, workflow and pipeline management, handoffs, composing a well-balanced team, dealing with fluid membership, fostering coordination and communication, and not abandoning best practices when deadlines loom. We conclude by reflecting about the extent to which there are universal best practices for all teams, as well as how these kinds of informal discussions around the challenges of doing research can help combat impostor syndrome.

## Best Practices for Fostering Diversity and Inclusion in Data Science

<a href="https://osf.io/preprints/socarxiv/8gsjz/download">Download PDF here.</a>

Abstract: What actions can we take to foster diverse and inclusive workplaces in the broad fields around data science? This paper reports from a discussion in which researchers from many different disciplines and departments raised questions and shared their experiences with various aspects around diversity, inclusion, and equity. The issues we discuss include fostering inclusive interpersonal and small group dynamics, rules and codes of conduct, increasing diversity in less-representative groups and disciplines, organizing events for diversity and inclusion, and long-term efforts to champion change.

## Best Practices for Managing Turnover in Data Science Groups, Teams, and Labs

<a href="https://osf.io/preprints/socarxiv/wsxru/download">Download PDF here.</a>

Abstract: Turnover is a fact of life for any project, and academic research teams can face particularly high levels of people who come and go through the duration of a project. In this article, we discuss the challenges of turnover and some potential practices for helping manage it, particularly for computational- and data-intensive research teams and projects. The topics we discuss include establishing and implementing data management plans, file and format standardization, workflow and process documentation, clear team roles, and check-in and check-out procedures.

## Resistance to Adoption of Best Practices

<a href="https://osf.io/preprints/socarxiv/qr8cz/download">Download PDF here.</a>

Abstract: There are many recommendations of "best practices" for those doing data science, data-intensive research, and research in general. These documents usually present a particular vision of how people should work with data and computing, recommending specific tools, activities, mechanisms, and sensibilities. However, implementation of best (or better) practices in any setting is often met with resistance from individuals and groups, who perceive some drawbacks to the proposed changes to everyday practice. We offer some definitions of resistance, identify the sources of researchers' hesitancy to adopt new ways of working, and describe some of the ways resistance is manifested in data science teams. We then offer strategies for overcoming resistance based on our group members' experiences working alongside resistors or resisting change themselves. Our discussion concluded with many remaining questions left to tackle, some of which are listed at the end of this piece.