Why didn't Go get a breakthrough in bioinformatics (yet)?

May 13, 2024
A gopher doing bioinformatics

As we are, according to some expert opinions, living in the Century of Biology, I found it interesting to reflect on Go's usage within the field.

Go has some great features that make it really well suited for biology, such as its concurrency primitives and generally good performance.

Go has in fact garnered some use for bioinformatics tools over the years, with some indications that its use is increasing. Examples of popular tools and toolkits are SeqKit (a veritable Swiss Army knife for bioinformatics), the BioGo toolkit, more recently the Gonetics package, and not least the Poly package for synthetic biology. And this is in addition to heavy use in infrastructure-oriented projects like the Benthos stream processing tool, the Reflow pipeline tool and the Pachyderm orchestration suite.

Still, Go is far from having had a breakthrough in bioinformatics, which is surprising given its incredible growth in popularity outside the field. CS-oriented bioinformatics folks seem to prefer the much more complicated Rust language for implementing new tools (see e.g. this Nature article and Heng Li's post).

I find this preference puzzling, as I know that for a large part of the bioinformatics community, a language like Rust will pretty much remain out of reach because of its excessively steep learning curve, leading to an even deeper division in the bio community between tool makers and tool consumers. To me, Go seems much more like a language that could reasonably lessen this divide, and cater both to tool developers and to people who less frequently write new tools but perhaps sometimes want to port some homegrown scripts into compiled code for speed.

This got me reflecting on the status of Go as a routine language for bioinformatics. To put my thoughts into perspective, though, I need to start from the beginning, with a small personal lookback.

A personal lookback

When I had just started out my career as a bioinformatician, I pretty soon found myself looking for a good compiled language to learn as a complement to Python, for when the speed of a scripting language would not be enough.

I scoured the web for languages, and even organized some crowd-sourcing of languages to watch, as well as later some benchmarking of a few of these languages.

What I was looking for was something that would be close to the feel of Python where it is really easy to just open a file and read it line by line, and write the output to another file, but that would be compiled and fast, to be able to process the increasingly huge amounts of sequencing data being produced.

I was initially interested especially in the D language, mainly because the syntax for reading files felt so fluent and natural:

import std.stdio;

void main() {
  foreach(line; stdin.byLine()) {
    writeln("Got input line: ", line);
  }
}

But later, in part because some peers recommended betting on Go because of the Google backing, I ended up heavily invested in Go instead. I eventually became fascinated with the concurrency primitives of Go, which I played around with a lot. This resulted in libraries like SciPipe and FlowBase, and also some comparisons with other languages with similar features, such as Crystal.

And the concurrency primitives are also why I eventually stuck with Go as one of my go-to languages. This is because a lot of bioinformatics problems seem to be naturally modelled as pipelines of operations on a stream of data, and the concurrency primitives in Go (channels and goroutines) make it exceedingly easy to build such pipelines in a way that allows each process to run on a separate CPU core.
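To make the pipeline idea concrete, here is a minimal sketch of such a pipeline built from channels and goroutines. The function names (readLines, mapLines, normalize, complement) are made up for this illustration and are not taken from SciPipe or FlowBase. Each stage runs in its own goroutine, which the Go runtime is free to schedule on separate CPU cores when they are available, although it does not guarantee any particular placement.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// readLines starts a goroutine that sends each line of r on the returned
// channel and closes the channel when the input ends.
// Scanner errors are ignored here to keep the sketch short.
func readLines(r io.Reader) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			out <- scanner.Text()
		}
	}()
	return out
}

// mapLines starts a goroutine that applies fn to every string received on
// in and sends the result on the returned channel.
func mapLines(in <-chan string, fn func(string) string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for s := range in {
			out <- fn(s)
		}
	}()
	return out
}

// normalize trims surrounding whitespace and upper-cases a sequence.
func normalize(seq string) string {
	return strings.ToUpper(strings.TrimSpace(seq))
}

// complement replaces each DNA base with its complementary base.
func complement(seq string) string {
	return strings.NewReplacer("A", "T", "T", "A", "C", "G", "G", "C").Replace(seq)
}

func main() {
	// Wire the stages together: read -> normalize -> complement -> print.
	lines := readLines(os.Stdin)
	normalized := mapLines(lines, normalize)
	complemented := mapLines(normalized, complement)
	for seq := range complemented {
		fmt.Println(seq)
	}
}
```

Since the channels are unbuffered, each stage hands over one line at a time, so the whole input never needs to fit in memory. Adding another processing step is just one more mapLines call in the chain.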

Something that has continuously bothered me with Go though is how clunky it is for working with files, which is the bread and butter of bioinformatics work for various reasons. Compare the above D code example with the following similar code written in Go:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        line := scanner.Text()
        fmt.Println("Got input line:", line)
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "error reading standard input:", err)
    }
}

So, just to read input line by line, you need to instantiate a buffered scanner, and remember all the unnatural APIs for scanning and retrieving the text from the scan, not to mention the whole error handling thing. From 4 lines of code in D, to 12 lines in Go (excluding closing braces).

This, I think, is one major explanation for why Go has never really had a breakthrough in the larger bioinformatics community. For a seasoned programmer this is not a big deal, as they are used to looking things up in documentation as they go, and also recognize the common patterns of things that need to happen under the hood to read a file anyway. But for the average bioinformatician, this level of complexity and nitty-gritty detail is simply a non-starter. Through enormous efforts the bioinformatics community has trained hordes of biologists to be somewhat familiar with basic Python scripts and perhaps a few Bash commands. But getting them to be comfortable with this level of detail will simply not happen in the foreseeable future.

But, again, I think this is a shame. Because with its concurrency primitives and generally good performance, Go actually suits a lot of bioinformatics workloads excellently.

I wonder whether anything, and if so what, could be done to make the simple things, like reading and writing files, easy in a future version of Go?

Addendum I: Why not Crystal?

I realize I should comment a bit on why I haven't gone with Crystal. Crystal is a super curious language: it has the kind of fluent syntax I'm looking for (heavily inspired by Ruby), has performed pretty well in some comparisons, and also sports Go-like concurrency primitives, as demonstrated by my previous comparisons. It is plagued, though, by compile times that seem inherently, even exponentially, long, which hinders effective development of larger projects, and by limitations in its cross-platform compatibility. If these issues could be addressed, I'd be very interested in re-evaluating it!

Addendum II: What are some other contenders?

Apart from Rust and Crystal, and as you can see from my list of compiled languages, there are a lot of potential alternatives for a go-to compiled, fast language for bioinformatics. Two of the most interesting ones I'm aware of right now are Julia and Zig. While Julia has been picking up quite a bit of usage in biology, it isn't a properly compiled language, but rather a scripting language providing speedups via just-in-time compilation. It doesn't to my knowledge (yet) have a great story for ahead-of-time compilation of statically linked binaries. Zig remains an interesting language because of its very close integration with the C programming language (the Zig toolchain can compile C code directly, which means you can even just use Zig as a more modern compiler and toolchain for compiling C code). I think we have yet to see any major uptake of Zig in the bioinformatics community, though. And then there is Mojo. But I think Mojo is way too young to say anything with confidence about how it will develop or whether it will be able to gain a foothold in bioinformatics.

Samuel Lampa (@smllmp)

Note: Some discussion is happening around the post on Twitter and Reddit


Edit history:
Edit 2024-05-15, 20:07: Added mention of Benthos
Edit 2024-05-16, 19:08: Correction: processes -> goroutines (Thanks Mihai Todor!)
Edit 2024-05-16, 19:31: Added addendum about Crystal (Thanks Alexander Adam for raising the question!)
Edit 2024-05-16, 19:45: Added addendum about Julia and Zig too (Thanks again to Mihai Todor for bringing up Zig!)
Edit 2024-05-16, 19:50~: Added comment about Mojo.
Edit 2024-05-16, 20:02: Improved intro.
Edit 2024-05-17, 11:59: Improved intro again with more reasoning on why Go is a good suggestion for bio in the first place.