One thing, after brewing in my mind for some time, became even clearer to me today: dynamic scheduling in scientific workflow tools is important.
What I mean is that new tasks should be schedulable during the execution of a workflow, not just during its scheduling phase.
What strikes me is that far from all workflow tools allow this. Many tools strictly separate the running of a workflow into two stages: a scheduling phase, where the complete graph of tasks is laid out up front, and an execution phase, where those already-scheduled tasks are run.
The case where we really needed this was for running machine learning algorithms on data sets of various sizes. To obtain optimal models, we first optimize the cost parameter of our training step by running a grid search over a number of cost values.
The performance of training with each cost value is then evaluated and an optimal cost is chosen. And now comes the interesting part: we want to schedule a predefined workflow with this newly selected cost value.
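The grid-search step itself is simple. Here is a minimal sketch of what I mean, where `train_and_evaluate` and the cost grid are hypothetical stand-ins for the real training and evaluation code:

```python
# Minimal sketch of the cost-parameter grid search.
# train_and_evaluate is a hypothetical stand-in for the real
# training step; here it just pretends that 0.1 is the best cost.

def train_and_evaluate(cost):
    """Train a model with the given cost parameter and return a
    validation error to be minimized (dummy implementation)."""
    return (cost - 0.1) ** 2

def select_best_cost(cost_grid):
    # Evaluate every cost value and pick the one with the lowest error.
    errors = {cost: train_and_evaluate(cost) for cost in cost_grid}
    return min(errors, key=errors.get)

best_cost = select_best_cost([0.001, 0.01, 0.1, 1.0, 10.0])
print(best_cost)  # the selected cost value
```

The catch is not this selection step, but what comes after it: feeding `best_cost` into further workflow tasks.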
This is not easily possible in Luigi, though, even with our SciLuigi extension, since Luigi separates scheduling from execution, and since parameters such as the cost value are initialized at scheduling time. Thus we cannot use a value resulting from a calculation to parameterize the next task in a SciLuigi workflow.
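To illustrate the problem (this is a toy model of the two-phase idea, not Luigi's actual API): all parameters have to be concrete when the task graph is built, before anything has run, so a value computed during execution arrives too late to parameterize a new task.

```python
# Toy illustration of the scheduling/execution split (not Luigi's
# real API): parameters are fixed when the graph is built.

class Task:
    def __init__(self, name, param):
        self.name, self.param = name, param

    def run(self):
        return f"{self.name}({self.param})"

def build_graph(cost):
    # Scheduling phase: every parameter, including cost, must be a
    # concrete value here, before any task has run.
    return [Task("train", cost), Task("evaluate", cost)]

def execute(graph):
    # Execution phase: tasks run, but the graph can no longer grow,
    # so a cost computed *during* this phase cannot parameterize a
    # newly added task.
    return [task.run() for task in graph]

print(execute(build_graph(0.1)))
```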
Of course we found a work-around for this: we created a task that takes the chosen cost value and executes a shell command to start a separate Python process running the remaining part of the workflow. It works, but the parts are not closely integrated, we incur extra overhead, and the separate workflow instance produces its own logging, audit files, etc.
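The workaround boils down to something like the following sketch. The script name `train_workflow.py` and its `--cost` flag are hypothetical placeholders for the real second-stage workflow:

```python
# Sketch of the workaround: once the best cost is known, shell out
# and start the remaining workflow as a separate process.
# "train_workflow.py" and "--cost" are hypothetical placeholders.
import subprocess
import sys

def followup_command(cost):
    # Build the command line for the second workflow instance.
    return [sys.executable, "train_workflow.py", "--cost", str(cost)]

def run_followup(cost):
    # Launch the second workflow as its own process -- which is
    # exactly why it ends up with its own logging, audit files, etc.
    return subprocess.run(followup_command(cost))
```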
Thus, this is something I would like to see in the next workflow system I use: the ability to schedule new tasks continuously during the execution of a workflow.
Interestingly, this is a feature that comes for free in tools that adhere to the dataflow paradigm. In most dataflow tools you have independently running processes that receive messages with input data, and they continuously fire off new tasks as messages arrive, until the system hands them a message telling them to shut down. In other words, "dynamic scheduling" is simply how dataflow systems work, which I find interesting. I think the dataflow system Nextflow works like this (correct me if I'm wrong, Paolo! :) ). And so does my little experiment in a pure Go workflow library, which I started hacking on out of frustration with some other tools a long time ago, although it still lacks most other popular features ;)
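The message-loop mechanism described above can be sketched in a few lines. This is a deliberately minimal model using threads and queues as stand-ins for dataflow processes and channels (the doubling "task" is a dummy workload, and the shutdown sentinel is one common way to signal termination):

```python
# Sketch of the dataflow idea: a worker (here a thread) fires a new
# task for every message it receives on its input channel, until a
# shutdown sentinel arrives.
import queue
import threading

SHUTDOWN = object()  # sentinel message telling the process to stop

def worker(inbox, outbox):
    while True:
        msg = inbox.get()
        if msg is SHUTDOWN:
            break
        # Each incoming message schedules (and runs) a new task;
        # doubling is a dummy stand-in for real work.
        outbox.put(msg * 2)

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()
for x in [1, 2, 3]:  # new tasks are created as data arrives...
    inbox.put(x)
inbox.put(SHUTDOWN)  # ...until the shutdown message
t.join()
results = [outbox.get() for _ in range(3)]
print(results)  # [2, 4, 6]
```

Note that nothing here resembles a pre-built task graph: how many tasks run is decided entirely by how many messages flow through, which is the whole point.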
I just had not realized how important this feature could be for very common use cases.
Bionics IT is currently serving as a research and development blog for Samuel Lampa, a PhD student in Pharmaceutical Bioinformatics at Uppsala University.