NGS Bioinformatics Course Day 3: New Luigi helper tool, "real-world" NGS pipelines

e · l · n
Mar 3, 2015

It turned out I didn't have the time and energy to blog every day at the NGS Bioinformatics Intro course, so here comes a wrap-up with some random notes and tidbits from the last days, including some concluding remarks!

During these days we started working on a more realistic NGS pipeline: analysing re-sequencing samples (slides, tutorial).

First, some outcome from this tutorial

What do I mean by "outcome"? Well, as I tried to manually copy and paste the bag of hairy, nasty, long bash command-line strings in the tutorial pages, which all depended upon each other, I got so frustrated that I decided to try to encode them in a workflow language / tool.

I quickly tried out Snakemake, and was in fact impressed by the quality and robustness of this tool, but I figured it would take too long to properly grok its make-inspired way of thinking in time to finish the labs. Instead I set out to use Spotify's Luigi tool, which I'm already intimately familiar with.

The only problem now was the relatively extreme amount of boilerplate that needs to be written for each Luigi task (really, it's not that bad, but compared to just running a shell command, it is).

This made me realise some ideas I've been pondering for a long time, about how a few Flow-based programming inspired ideas could be implemented in Luigi, and so I created a little (50 LOC) helper library for doing just this! The lib, called "Luigi's monkey wrench", is available on GitHub.

And, of course, I had to encode the tutorial in it too! You can find the luigi's-monkey-wrench code for the tutorial in this gist!

Let's just include a small snippet, to give you a taste for how the code looks with this helper:

import luigi
from luigis_monkey_wrench import *
 
REF='human_17_v37.fasta'
INDIVIDUALS=['NA06984','NA07000']
SAMPLES=['1','2']
BASENAME='.ILLUMINA.low_coverage.17q_'
PICARDDIR='/sw/apps/bioinfo/picard/1.69/kalkyl/'
KNOWNSITES=('/proj/g2014207/labs/gatk/'
            'ALL.chr17.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf')
 
class GATKWorkflow(WorkflowTask):
 
    def requires(self):
        for i in INDIVIDUALS:
            # Workflow definition
            # ---------------------------------------------------------------------
            # file() will return a pseudo task that just outputs an existing file,
            #        while not running anything.
            # shell() will create a new task with a command that can take inputs
            #         and outputs.
            fq1 = file('fastq:{i}/{i}{b}1.fq'.format(i=i,b=BASENAME))
            fq2 = file('fastq:{i}/{i}{b}2.fq'.format(i=i,b=BASENAME))
 
            # Step 2 in [1]--------------------------------------------------------
            aln1 = shell('bwa aln {ref} <i:fastq> > <o:sai:<i:fastq>.sai>'.format(
                ref=REF))
            aln1.inports['fastq'] = fq1.outport('fastq')
 
            aln2 = shell('bwa aln {ref} <i:fastq> > <o:sai:<i:fastq>.sai>'.format(
                ref=REF))
            aln2.inports['fastq'] = fq2.outport('fastq')
 
            # ... and so the script continues ...
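
Assuming the script ends with a standard luigi.run() call (the gist has the full details), the workflow can then be kicked off from the command line in the usual Luigi fashion, something like this (the script file name here is made up):

python gatk_workflow.py GATKWorkflow --local-scheduler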

For a closer look, have a look at the full tutorial code in the gist, and the Luigi's monkey wrench library itself, on GitHub.

Overall structure of tutorial

The overall structure of the tutorial was something like this:

  1. Setup data (see the indexing sketch after this list)
  2. Map reads to a reference (generating .bam file)
  3. Re-calibrate (new .bam file)
  4. Identify ("call") variants (resulting in .vcf file, "variant calling file")
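
For the data setup in step 1, this included (among other things, if I recall correctly) indexing the reference, with commands like the ones listed under "Common commands" further below. A minimal sketch, using the reference file name from the REF variable in the workflow code above:

# Index the reference for bwa, using the bwtsw algorithm
bwa index -a bwtsw human_17_v37.fasta

# Create a plain FASTA index (.fai), used by samtools and downstream tools
samtools faidx human_17_v37.fasta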

The mapping step

Regarding the mapping step, some popular mapping tools were mentioned:

... and then, interestingly, a bit about what kind of computational tricks they use to do the mapping in an efficient way:
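
In the tutorial itself, the mapping is done with bwa, as in the workflow snippet above. A hedged sketch of the paired-end mapping commands (the file names are illustrative, not the tutorial's exact ones):

# Align each read file separately (cf. the aln1/aln2 tasks above)
bwa aln human_17_v37.fasta reads_1.fq > reads_1.sai
bwa aln human_17_v37.fasta reads_2.fq > reads_2.sai

# Combine the two alignments into SAM format for the read pairs
bwa sampe human_17_v37.fasta reads_1.sai reads_2.sai reads_1.fq reads_2.fq > aligned.sam

# Convert to BAM and sort (samtools 0.1.x-style syntax)
samtools view -bS aligned.sam > aligned.bam
samtools sort aligned.bam aligned.sorted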

The re-calibration step

For the re-calibration step, the more detailed sub-steps were (see the command sketch after this list):

  1. Local re-alignment
    1. (Loosely defined as: "Using all the reads in a section of the genome, to get a better hint of which variant is probably the 'correct' one")
  2. Removal of duplicates
  3. Re-calibration (adjusting for biases)
    1. Reported vs. empirical quality score
    2. (Biases from which cycle in the machine, etc.)
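
A hedged sketch of what these sub-steps can look like on the command line, with the Picard and GATK tools hinted at by the paths in the workflow code above (the exact walkers, the GATK 2/3-era syntax, and the file names are my assumptions, not necessarily what the tutorial used):

# 1. Local re-alignment around indels (two-step GATK procedure)
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_17_v37.fasta \
    -I aligned.sorted.bam -o realign_targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_17_v37.fasta \
    -I aligned.sorted.bam -targetIntervals realign_targets.intervals -o realigned.bam

# 2. Removal of duplicates with Picard (1.x-era separate-jar style)
java -jar $PICARDDIR/MarkDuplicates.jar INPUT=realigned.bam OUTPUT=dedup.bam \
    METRICS_FILE=dup_metrics.txt

# 3. Base quality score re-calibration against known variant sites
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_17_v37.fasta \
    -I dedup.bam -knownSites known_sites.vcf -o recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R human_17_v37.fasta \
    -I dedup.bam -BQSR recal.table -o recalibrated.bam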

Typical software in this section was GATK and the Picard tools (as also used in the workflow code above).

The variant calling step

Some unsorted notes:
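
To give a concrete flavour of this step too, a hedged sketch of a variant calling command with the same generation of GATK (the choice of the UnifiedGenotyper walker, and the file names, are my assumptions):

# Call SNPs/indels on the re-calibrated BAM, producing a VCF file
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human_17_v37.fasta \
    -I recalibrated.bam -o variants.vcf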

Discovery of structural variants

We had some discussion on discovery of structural variants too. Find some random notes below:

Some methods used to do this:

See abbreviations

Common commands

bwa index -a bwtsw  # Index the reference with the bwtsw algorithm
samtools faidx      # Create a FASTA index (.fai) of the reference
bwa sampe           # Generate alignments in SAM format given paired-end reads

Some common file types

See also