Reproducible research
Reproducible research refers to the idea that the results of a research study can be duplicated or reproduced by other researchers. In the context of scientific and academic research, it means that the data, methods, and code used in a study are made available to others, allowing them to replicate the study’s findings. The goal is to ensure transparency, verifiability, and the ability to build upon existing work. To me, what best guides reproducibility is to always keep in mind that:
You should not save your results, you should save how you made them.
If you completely document how your results are produced, they can always be generated anew - by you or anyone else.
Still, even with the best of intentions, simply recording the commands you used and writing down what you did rarely suffices in practice: you may forget to document something, things you feel are implicit may not be to the reader, you may put something down in the wrong order, you may forget your last-minute install of a library in an updated version, and so on. Any such thing will leave anyone trying to reproduce your work stuck, not knowing what to do. Even if they figure out a way to proceed, it may not be the way you originally did it, defeating the purpose of reproducing. This page takes you through the approach to reproducibility we use in our group and teaches you how to follow these seven rules:
- Define the version and location of the data (if any) that serve as raw input to your analysis.
- Use conda to define which external tools and libraries your analysis depends on.
- Use git for version control and backup, and host your repository at the group’s home on GitHub.
- Use gwf to pipeline your computationally demanding steps in an executable workflow.
- Use executable jupyter notebooks for your experiments, plots, analyses, statistical tests, and interpretation.
- Use quarto to assemble reports or manuscripts that link directly to results in your notebooks.
- Make your GitHub repository public and render your project as web pages on the group web page.
Specify raw input data
Report the location and version of the raw data that serve as input to your analysis. Make sure this data is “read only” so that no one accidentally modifies or deletes it. You can do that like this:
Terminal
chmod a=r important_input_file.txt
or recursively for an entire folder:
Terminal
chmod -R a=r input_file_folder
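One simple way to make the exact version of the input data verifiable is to record a checksum for each raw file alongside a note of where the data came from. The file and folder names below are only placeholders:
Terminal
# record a checksum for each raw input file (hypothetical names)
md5sum input_file_folder/* > data_checksums.md5
# anyone can later verify they have exactly the same data
md5sum -c data_checksums.md5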
Define dependencies
Employ tools like Conda to define the dependencies of your project, ensuring that others can run the same code with the same dependencies.
When you create your conda environment, and whenever you install new packages, you should update the environment.yml file in the binder directory. That way, the specification of your environment is always up to date and allows anyone to create the conda environment required to run your analysis. You do that by running this command:
Terminal
conda env export --from-history -f ./binder/environment.yml
Once your project is well underway and you have settled on the set of packages needed for your project, you can make your environment more stable (and thus reproducible) by creating the environment again, installing all the packages in one go. To do that, first create a test environment to make sure that your exported environment can actually be created (you do not want to end up with a new environment that does not work like the old one):
Terminal
conda env create -n test -f binder/environment.yml
If, and only if, the test environment is successfully created and works the way it should, you delete the test environment and the birc-project environment:
Terminal
conda env remove -n test
conda env remove -n birc-project
and create a new version of the birc-project environment:
Terminal
conda env create -f binder/environment.yml
Then, to make life easier for someone reproducing your project on the GenomeDK cluster, you must run this command as well (it records all the packages used, including the Linux-specific versions used here):
Terminal
conda activate birc-project
conda env export -f ./binder/environment-genomedk.yml
Use version control
Use version control systems like Git to track changes to code and documentation over time. This helps in understanding the evolution of the research and facilitates collaboration. It also allows you to tag the state of your repository upon publication, so you can keep working on it later without disrupting its state at publication.
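Tagging the published state of the repository might look like this (the tag name is just an example):
Terminal
git tag -a publication-v1.0 -m "State of the repository at publication"
git push origin publication-v1.0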
In the world of data projects, there are three kinds of data files.
Type 1 files: Files representing the input to your project: sequencing reads, genomes, genotypes, etc. These are usually difficult or expensive to reproduce, and once made they never change. They are often too large to fit on GitHub, so they are saved (and backed up) indefinitely on the cluster and can be distributed upon request to other researchers who want to replicate your analysis. Often the data is only a local copy of data also available in a public database. You can put such files in the data folder of your repository on the cluster, but unless they are very small, do not git add these files to Git control.
Type 2 files: Files representing your work and the results from the project: documentation, scripts, code, gwf workflows, notebooks, plots, tables, etc. These files are usually as small as they are precious, and they are the ones you add to your Git repository and push to GitHub every time you change them. These files are workflow.py and the files you put in the binder, scripts, notebooks, results, figures, and reports folders.
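Committing and pushing type 2 files is the usual Git cycle; the file names and the commit message below are only examples:
Terminal
git add workflow.py notebooks/01_first_analysis.ipynb
git commit -m "Add mapping step and first analysis notebook"
git push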
Type 3 files: Files representing intermediary steps to get from files of type 1 to files of type 2. These can be large and many, and they often account for most of the hard disk space consumed by your project. However, they are easily regenerated if your project is reproducible, so type 3 files need not be saved indefinitely, and they should not be added to your Git repository. In fact, type 3 files should be deleted as soon as the project is finished. All such files must be put in the steps folder or in folders inside the steps folder. This way, removing all type 3 files is safely done by deleting the content of the steps folder.
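Because everything under steps can be regenerated by the workflow, cleaning up once the project is finished can be as simple as this (double-check the path before running a recursive delete):
Terminal
rm -rf steps/*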
Make computational steps an executable workflow
Pipeline the steps of the computationally demanding part of your analysis in executable workflows using cluster workflow tools like GWF. This ensures that interdependent steps are run in the right order and rerun if necessary.
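With the targets of your analysis defined in workflow.py, running and monitoring the workflow is done from the command line with gwf, along these lines (assuming the Slurm backend used on the cluster):
Terminal
# point gwf at the Slurm backend (only needed once per project)
gwf config set backend slurm
# submit all targets that are not yet up to date
gwf run
# check the status of the targets
gwf status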
Make your analysis executable
Use Jupyter notebooks to assemble the code, ducumentation, and figures relevant to each part of your subsequent analysis. That way your analysis is what is produced by running your notebooks in order, no more, no less. placeholder notebooks in the notebooks folder
name them so they sort lexicographically: 01_first_analysis.ipynb
, 02_second_analysis.ipynb
.
Make your reporting executable
Compile your manuscript and any supplementary documents by linking to figures and results in your notebooks. This is easily done using tools like Quarto:
- Write your report in Markdown, Quarto markdown, or notebooks.
- Configure how the project is rendered in the _quarto.yml file.
- Render the project as you go to check the result (see the command below).
- At the end of the project, set output-dir to docs, render the whole thing, and add the docs folder to the repository.
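Rendering the report is then a single command run from the root of the repository:
Terminal
quarto render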
Make your entire project publicly available
Make your repository publicly available on GitHub. In addition to meeting the scientific requirement, reproducibility allows anyone to benefit from and build upon your work, greatly increasing its value.
Reproducing your work
If you follow the steps above, reproducing your results would only entail the following steps:
- Retrieve the raw data in the specified version.
- Clone your Git repository.
- Create the conda environment from the environment.yml file.
- Run the GWF workflow on a compute cluster for the computationally demanding parts of your analysis.
- Run the Jupyter notebooks in the specified order.
- Run Quarto on the resulting state of the repository to build the manuscript and supplementary information.
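On the cluster, that sequence might boil down to something like the following sketch (the repository URL is a placeholder, and the notebook and rendering steps assume the layout and environment name described above):
Terminal
git clone https://github.com/<group>/<project>.git
cd <project>
conda env create -f binder/environment.yml
conda activate birc-project
gwf run                                                           # computationally demanding steps
jupyter nbconvert --to notebook --execute --inplace notebooks/*.ipynb   # runs notebooks in lexicographic order
quarto render                                                     # build manuscript and supplementary pages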
Below I will go over how to organize your project so that these steps also contribute to making your own life easier.