Troubleshooting Nextflow pipelines

e · l · n
Nov 1, 2023

Background - The number one skill

In my new position as a bioinformatician in clinical microbiology at the Karolinska University Hospital, I'm no longer only developing my own pipeline tools :) but am also a user of Nextflow, which is the leading pipeline tool in clinical genomics, at least in Sweden.

We had evaluated Nextflow before in my work at pharmb.io, but that was before DSL2 and the support for re-usable modules (which was one reason we needed to develop our own tools to support our challenges, as explained in the paper). Thus, there's definitely some stuff to get into.

Based on my years in bioinformatics and data science, I've seen that the number one skill you need to develop is the ability to troubleshoot effectively, because things will invariably fail in all kinds of ways. In the process, you will probably also learn a lot about the technology stack you are using.

You need to be pretty intentional about developing this skill though, as it is seldom, if ever, taught properly in undergraduate programs. This means troubleshooting is far too often treated as a trial-and-error practice, perhaps combined with some informed guessing. While this can work, there are far more efficient ways.

Anyway, this is why I'm writing this post: I wanted to document some key troubleshooting techniques I have picked up for working with Nextflow. Some of this is available in the documentation, but I didn't find a coherent summary of it all, and I also learned some further tricks through experimentation that can perhaps be contributed somewhere else later. But to start with, they are published below. Feel free to give feedback, and also let me know about your own favorite tips in the comments!

Existing troubleshooting resources

As mentioned, Nextflow already provides some information on troubleshooting tools and techniques, and the idea here is not to reiterate it all, so I'll start by pointing out some of the most important resources:

  1. Perhaps the page most similar to this one is the troubleshooting guide on the Nextflow Training website.
  2. There is also some info on available debugging tools in the overview page in the main Nextflow docs.
  3. The doc page on Tracing & visualisation provides info on some very useful tools to understand what your pipeline does.
  4. The workflow introspection page also contains tips on how you can inspect especially the objects in the DSL/Groovy part of the pipeline.
  5. A search for "troubleshooting" in the Nextflow docs gives a few pages specific to various execution platforms, as well as on caching and resuming, which should be generally useful.
  6. Mahesh Binzer-Panchal started an issue to collect various common errors in Nextflow.
  7. These now seem to be collected into this website with gotchas and common errors.
  8. Apart from the Slack, there is now also a community forum, where you can search for previous issues and ask about your own.
  9. Last but definitely not least, there is the super-vibrant nf-core community, with an even more active Slack server and lots of other community resources.

Some further troubleshooting tips

1. The execution log

First of all, Nextflow itself provides some pretty good hints when an error occurs.

The first thing to note is that if your terminal gets filled with error output and you don't manage to read it all before it scrolls by, you can always find the latest log in a file named .nextflow.log inside your execution folder.

To scroll through this file in a searchable way, without wrapping long lines, I recommend using the less -S command:

less -S .nextflow.log

Using less, you can search by typing / followed by a search phrase, step through the results with n (next match) and N (previous match), scroll using the arrow keys, PageDown/PageUp or the vim equivalents, and quit with q.

Sometimes it is easier to just "grep" through the file, for example searching case-insensitively for "error", and perhaps piping that to less -S:

grep -i error .nextflow.log | less -S

2. The work folder

At least for jobs executed locally, Nextflow typically points out the path to the task's "work folder" in the execution log mentioned above. Depending on your configuration, the path typically contains the word "work" and ends with a long random string of letters and numbers called a "hash", for example /tmp/nf/work/a3/0e2ea68c421ced4797e00de9e73155. When you have a hard-to-track-down bug, it is a good idea to cd into this folder and explore it, so in this example:

cd /tmp/nf/work/a3/0e2ea68c421ced4797e00de9e73155
ls -1a
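If you have lost track of which work folder a task used, the two-level layout (a two-character prefix directory containing the hash-named task directory) makes it easy to list candidates with find. A minimal sketch, using a mock work root since the real path depends on your configuration:

```shell
# Mock work root for illustration -- in practice, use your pipeline's real work dir
root=$(mktemp -d)
mkdir -p "$root/a3/0e2ea68c421ced4797e00de9e73155"

# Task folders sit exactly two levels below the work root:
# <work>/<2-char prefix>/<hash>
find "$root" -mindepth 2 -maxdepth 2 -type d
```

Adding something like -newermt '1 hour ago' narrows this down to recently touched task folders.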

The -1a flags, or in particular -1tra, are very useful, and have the following functions:

  - -1: print one entry per line
  - -a: also show hidden files (those whose names start with a dot)
  - -t: sort entries by modification time
  - -r: reverse the sort order, so the most recently changed files end up last

Since the -t and -r flags are not strictly needed, we will skip them below for simplicity.

As you will see, the folder contains a number of hidden files, viewable only with the -a flag to ls, that contain some key information on how the job is being executed:

$ ls -1a
.
..
.command.begin
.command.err
.command.log
.command.out
.command.run
.command.sh
.exitcode
<some more files not relevant here>

These files are very useful to acquaint yourself with. In summary, their functions are:

  - .command.sh: the actual task script, generated from the process definition
  - .command.run: the wrapper script that stages input files, sets up containers and environment variables etc., and then runs .command.sh
  - .command.out and .command.err: the captured stdout and stderr of the task
  - .command.log: the combined output of the task
  - .command.begin: a marker file created as soon as the task starts
  - .exitcode: a file holding the exit code of the finished task
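A quick triage when standing in a work folder is to check the exit code first, and then the captured stderr. A sketch on mock files so it can run anywhere; in a real work folder you would skip the two setup lines:

```shell
# Setup: mock a failed task's files (a real work folder already has these)
printf '1' > .exitcode
printf 'samtools: command not found\n' > .command.err

# Triage: exit code first, then the last lines of stderr
echo "exit code: $(cat .exitcode)"
tail -n 20 .command.err
```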

3. Debugging a command in the work folder

If you have a tricky failing task and you don't really understand why it fails, it might be a good idea to execute the task scripts manually.

You will typically want to do that by executing the .command.run script, which in turn executes .command.sh, since the former creates the temporary directories, stages files etc. that are needed for the task to run properly.

So, you would do:

bash .command.run

... and watch for any detailed output that might give you hints.
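The task's exit status is also reflected in the shell's $? right after the run, which is handy when the output scrolls by. A sketch using a mock .command.run, since the real one is generated by Nextflow:

```shell
# Mock .command.run for illustration only -- Nextflow generates the real one
printf '#!/bin/bash\necho simulated failure >&2\nexit 7\n' > .command.run

# The || branch reports the status without aborting the shell on failure
bash .command.run || echo "exit status: $?"
```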

Make all commands visible with set -x

The problem with the above is that the .command.run script does a lot of "magic" and setup grunt work that you don't see.

Thus, to make everything it does more visible, you can add the command set -x at the top of both the .command.run and .command.sh scripts, using a text editor (the nano terminal-based editor is available on most systems, and is more user-friendly than vim for new users: nano .command.run ... save with Ctrl+O, and exit with Ctrl+X).

Then you can execute it again:

bash .command.run
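If you'd rather not edit the generated scripts at all, bash's -x flag gives the same trace as set -x for the outer script (though it does not propagate into .command.sh, which runs in its own shell, so for full coverage the edit above is still useful). A small self-contained demo on a stand-in script:

```shell
# Demo of -x tracing on a stand-in script (not a real .command.run)
printf 'msg="hello"\necho "$msg"\n' > demo.sh

bash -x demo.sh 2> trace.log   # the trace goes to stderr
grep '^+' trace.log            # each executed command is shown with a '+' prefix
```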

Even better is to pipe all the output to a file, so that you can read it later at your own pace:

bash .command.run &> out.log

(The & in &> will make sure that both stdout and stderr are redirected to the file)
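To see the effect of &> in isolation, here is a tiny self-contained check, where one line goes to stdout and one to stderr, and both end up in the file:

```shell
# &> is bash shorthand for '> file 2>&1': both streams go to the file
{ echo "to stdout"; echo "to stderr" >&2; } &> out.log

grep -c "to " out.log   # both lines were captured
```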

Better still is to BOTH redirect to a file AND pipe the output to something like less -S, so you can see and scroll through it immediately:

bash .command.run |& tee out.log | less -S

Here, |& pipes both stderr and stdout to the next command, which is tee. tee takes a filename where it writes its output, and also forwards the output to the next command, which is here less -S.

4. Turning off the cleanup parts, to explore temporary folders

One caveat when running the .command.run script is that it always cleans up its temporary folders when it finishes. This means certain subtle errors might be harder to pin down, since you can't explore those temporary folders manually.

One way to get around this is to comment out those parts in the .command.run file before running it (in bash, you comment out a line by adding a # character at its beginning).

In particular, check the on_exit() function, which might look like so:

on_exit() {
    exit_status=${nxf_main_ret:=$?}
    printf $exit_status > /tmp/nf/work/a3/0e2ea68c421ced4797e00de9e73155/.exitcode
    set +u
    [[ "$tee1" ]] && kill $tee1 2>/dev/null
    [[ "$tee2" ]] && kill $tee2 2>/dev/null
    [[ "$ctmp" ]] && rm -rf $ctmp || true
    rm -rf $NXF_SCRATCH || true
    sync || true
    exit $exit_status
}

Here, you could comment out for example the rm -rf $NXF_SCRATCH || true line, if you want the temporary folder to remain (it is typically put in /tmp and named something like /tmp/nxf.XXXXXXXXX; if you add set -x to the beginning of the script as explained above, you should be able to see the exact path when executing the command).

You can also have a look at the nxf_main() and nxf_launch() functions. nxf_main() is located at the bottom of the script and is the over-arching function that calls the other sub-functions (on_exit() is not called explicitly, but is set up to run whenever nxf_main() returns or is interrupted). nxf_launch() is the function that executes the .command.sh script, together with environment variables, container setup etc., which is a fairly common source of errors.

Their content should be visible in the script output when you add set -x to the script and run it, but it can also be good to examine them manually!

If you want to be able to quickly enable these two adjustments when in a work folder, you can add the following bash function to your ~/.bash_aliases file:

function debugnf() {
    sed -i '2s/^/set -x\n/' .command.{run,sh};
    sed -i 's/rm /#rm /g' .command.run;
}

Then, when in a work folder, you can just run debugnf before running any of the .command.run or .command.sh files manually.
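To sanity-check what debugnf changes, here are the same two sed commands applied to mock files (real usage is simply running debugnf inside an actual work folder):

```shell
# Mock scripts for illustration -- the real ones are generated by Nextflow
printf '#!/bin/bash\nrm -rf "$NXF_SCRATCH" || true\n' > .command.run
printf '#!/bin/bash\necho task\n' > .command.sh

# The same two commands the debugnf function runs (GNU sed assumed):
sed -i '2s/^/set -x\n/' .command.{run,sh}   # insert 'set -x' as line 2
sed -i 's/rm /#rm /g' .command.run          # comment out the cleanup rm:s

sed -n '2p' .command.run    # prints: set -x
grep '#rm' .command.run     # the rm line is now commented out
```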

Summary

Hope you were able to learn something from the tips in this post! And, perhaps you know some further great tips for debugging? Feel free to share them, or at least a link to them, in the comments below!

Changelog