Introduction
Health data researchers are increasingly required to develop complex analytic code to implement sophisticated analyses on large health datasets. Although writing analysis scripts (box 1) for academic projects is distinct from general purpose software development, the two activities share many features. A researcher’s script usually consists of a sequence of commands executed by a computer to extract, reshape, clean, describe and analyse data. If the quality of this analytic code cannot be reasonably assured, then the results cannot be trusted: programming errors have resulted in high profile retractions.1–3 Similarly, if lengthy scripts for data management cannot be re-used, then work is needlessly duplicated.
Glossary
Analytic Script: A series of commands, written in a programming or statistical language such as R, Stata or Python, that is executed by a computer. These commands carry out the analysis and may involve data extraction, cleaning, processing and analysis.
Commit: An individual change or revision to a file or set of files.9
Docstring: This is non-executable text that is attached to a unit of code, such as a function, and documents what the code does. For example, it may describe inputs, outputs and specific errors.
Functions: These are pieces of code that can be run (or invoked) to execute the commands they contain.
Library: This is a collection of code that does a particular task or set of tasks, and can be imported and used in other projects.
Open source: Code or software projects where the source code is freely available and may be changed and shared by others.
Pull: This is the term that describes when you fetch files from GitHub or similar. You can “pull” the most up to date file onto your computer, or “pull” changes that your colleague may have made.9
Pull Request: These are changes to a repository proposed by a user, which the other project collaborators can accept, reject or comment on.9
Push: This is the term that describes when you send your committed changes back to GitHub (or a similar platform). Once pushed, others will be able to see your suggested changes to any files.9
Repository: This is a project space within GitHub or GitLab that holds a project. The easiest way of conceptualising this is as a folder that contains all your project files and stores each file’s revision history.9
Requirements/Dependencies: These are software libraries that are required to run a particular project or piece of code. They normally have a version number, for example, version 0.0.1 or 0.0.2.
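To make the glossary entries for Functions and Docstring more concrete, the following is a minimal illustrative sketch in Python. The function name body_mass_index and its parameters are hypothetical, chosen only for this example; they do not come from any particular project or library.

```python
def body_mass_index(weight_kg, height_m):
    """Return the body mass index (BMI) in kg/m^2.

    Inputs:
        weight_kg: body weight in kilograms (must be positive).
        height_m: height in metres (must be positive).
    Output:
        The BMI as a float.
    Errors:
        ValueError if either input is zero or negative.
    """
    # Hypothetical validation rule, included to show a documented error.
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight_kg and height_m must be positive")
    return weight_kg / height_m ** 2


# Invoking the function runs the commands it contains.
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 1))  # prints 22.9

# The docstring is attached to the function and can be read back.
print(body_mass_index.__doc__)
```

The text inside the triple quotes is the docstring: it is never executed, but it records the inputs, outputs and errors so that a colleague can use the function without reading its body.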
The software engineering community has developed a range of techniques to improve the quality, re-usability, efficiency and readability of code. Organisations such as the Software Sustainability Institute4 support this approach to code development and provide more detailed guidance and education, which are well worth reviewing. In this brief guide we explain how researchers can borrow best practices and freely available tools from this community to improve their work. We cover three topics: Writing High Quality Code, Working Collaboratively and Sharing your work. Throughout the piece we often refer to examples from Python or R, two popular open source programming languages used by academics, but our advice applies generally, and there will be analogues to these examples in any commonly used statistical or general purpose programming language.