A few thoughts on organizing computational (biology) projects

e · l · n
Jun 23, 2015

I read this excellent article with practical recommendations on how to organize a computational project, in terms of directory structure.

Directory structure matters

The importance of a good directory structure seems to often be overlooked in teaching about computational biology, but can be the difference between a successful project, and one where every change or re-run of some part of a workflow, will require days of manual fiddling to get hand on the right data, in the right format, in the right place, with the right version of the workflow, with the right parameters, and then succeed to run it without errors.

Identify static and moving parts of your code and config

Among the many good ideas in the article I especially liked the ones about:

This lead me to realize something else that has bugged me for a long time, based on troubles we had with our project setup, but which I wasn't clear about before:

In our own project for example, some things that we assumed would be static, such as workflow definitions (we thought that just the parameters, config and run scripts would change), will actually change. Sometimes a little, and sometimes a lot.

Thus, I'm quite sure that our next step, after a re-organization we're doing now, will include a full copy of the workflow definition in each experiment folder.

I'm sure this is different for each project though, depending on where you are on the "production pipeline vs explorative experimentation workflow" - scale. In our case, we assumed we were closer to a "production pipeline" than we actually were, and that caused us a lot of trouble.

A tentative project directory structure

Apart from the realization above, I've also been sketching on a new directory structure, heavily inspired from the paper, with some additions from our own experience.

So, this is what I think I would like a project directory structure to look like right now:

├── bin  <-------------------# Binary files and executables (jar files & proj-wide scripts etc)
├── cfg <--------------------# Project-wide configuraiotn
├── doc  <-------------------# Any documents, such as manuscripts being written
├── exp  <-------------------#  The main experiments folder
│   ├── 2000-01-01-example <-# An example Experiment
│   │   ├── audit  <---------# Audit logs from workflow runs (higher level than normal logs)
│   │   ├── bin   <----------# Experiment-specific executables and scripts
│   │   ├── cfg  <-----------# Experiment-specific config
│   │   ├── dat  <-----------# Any data generated by workflows
│   │   ├── doc   <----------# Experiment-specific documents
│   │   ├── log   <----------# Log files from workflow runs (lower level than audit logs)
│   │   ├── raw   <----------# Raw-data to be used in the experiment (not to be changed)
│   │   ├── res  <-----------# Results from workflow runs
│   │   ├── run   <----------# All files rel. to running experiment: Workflows, run confs/scripts...
│   │   └── tmp   <----------# Any temporary files not supposed to be saved
├── raw  <-------------------# Project-wide raw data
├── res  <-------------------# Project-wide results
└── src  <-------------------# Project-wide source code (that needs to be compiled)

We'll see how many changes this will get when putting it into practice :) But in the meanwhile, feedback and ideas are of course welcome as early as possible.

EDIT 21 March 2019: More consistent use of three-letter folder names, when applicable.