NGS Bioinformatics Course Day 3: New Luigi helper tool, "real-world" NGS pipelines

e · l · n

Mar 3, 2015

It turned out I didn't have the time and strength to blog every day at the NGS Bioinformatics Intro course, so here comes a wrap up with some random notes and tidbits from the last days, including any concluding remarks!

These days we started working on a more realistic NGS pipeline, on analysing re-sequencing samples (slides, tutorial).

First some outcome from this tutorial

What do I mean with "outcome"? Well, as I tried to manually copy and paste the bag of hairy nasty long bash commandline strings in the tutorial pages, that all depended upon each other, I got so frustrated that I decided to try to encode them in a workflow language / tool.

I quickly tried out SnakeMake, and was in fact impressed by the quality and robustness of this tool, but I figured it would take too long to properly grok the make-inspired way of thinking in this tool, for me to finish the labs in time. Instead I set out to use spotify's luigi tool, which I'm intimately familiar with already.

Only problem now, was that the relatively extreme amounts of boilerplate that needs to be written for each luigi task (really it's not that bad, but compared to just running a shell command, it is).

This resulted in that I realised some ideas I've been pondering on for a long time, on how a little bit of Flow-based programming inspired ideas could be implemented in Luigi, and so I created a little (50 LOC) helper library for doing just this! The lib, called "Luigi's monkey wrench" is available on github.

And, of course I had to encode the tutorial in it too! You can find the luigi's-monkey-wrench code for the tutorial in this gist!

Let's just include a small snippet, to give you a taste for how the code looks with this helper:

import luigi
from luigis_monkey_wrench import *
 
REF='human_17_v37.fasta'
INDIVIDUALS=['NA06984','NA07000']
SAMPLES=['1','2']
BASENAME='.ILLUMINA.low_coverage.17q_'
PICARDDIR='/sw/apps/bioinfo/picard/1.69/kalkyl/'
KNOWNSITES=('/proj/g2014207/labs/gatk/'
            'ALL.chr17.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf')
 
class GATKWorkflow(WorkflowTask):
 
    def requires(self):
        for i in INDIVIDUALS:
            # Workflow definition
            # ---------------------------------------------------------------------
            # files() will return a pseudo task that just outputs an existing file,
            #         while not running anything.
            # shell() will create a new task with a command that can take inputs
            #         and outputs.
            fq1 = file('fastq:{i}/{i}{b}1.fq'.format(i=i,b=BASENAME))
            fq2 = file('fastq:{i}/{i}{b}2.fq'.format(i=i,b=BASENAME))
 
            # Step 2 in [1]--------------------------------------------------------
            aln1 = shell('bwa aln {ref} <i:fastq> > <o:sai:<i:fastq>.sai>'.format(
                ref=REF))
            aln1.inports['fastq'] = fq1.outport('fastq')
 
            aln2 = shell('bwa aln {ref} <i:fastq> > <o:sai:<i:fastq>.sai>'.format(
                ref=REF))
            aln2.inports['fastq'] = fq2.outport('fastq')
 
            # ... and so the script continues ...

For a closer look, have a look at:

Overall structure of tutorial

The overall structure of the task was something like:

Setup data
Map reads to a reference (generating .bam file)
Re-calibrate (new .bam file)
Identify ("call") variants (resulting in .vcf file, "variant calling file")

The mapping step

Regarding the mapping step, it was mentioned about some popular mapping tools:

... then, interestingly, it was mentioned a bit of what kind of computation tricks they use to do the mapping in an efficient way:

Hashtable (Maq, Eland, SOAP, Shrimp, Zoom)
Suffix trees: (BWA, ...)
Burrows-Wheeler transofrm, for file compression (Many tools?)

The re-calibration step

For the re-calibration step, the more detailed steps was:

Local re-aliagnment
1. (Loosely defined as: "Using all the reads in a section of the genome, to get a better hint of which variant is probably the 'correct' one")
Removal of duplicates
Re-calibration (adjust for biases)
1. Reported vs. empirical quality score
2. (Biaes from which cycle in the machine etc)

Typical software in this section was:

Samtools - Tools for working with the Short-read Alignments (SAM) format (and the binary BAM counterpart)
Picard - Java based tool for working with SAM/BAM formats
GATK - Genome Analysis Toolkit
IGV - Integrated Genomics Viewer

The variant calling step

Some unsorted notes:

FreeBayes - Bayesian SNP calling
Bayesian methods instead of a simple pileup.
Population-based calling.

Discovery of structural variants

We had some discussion on discovery of structural variants too. Find some random notes below:

Harder (than SNP calling)
No perfect way to do this
Different methods depending on data and type of variant that you want to find.

Some methods used to do this:

Read depth analysis (Differences in coverage depth (how many times a certain sequence appear to have been sequenced) might indicate duplicattions etc)
Mate pair sequencing (To know the approximate distance between "pair" of known sequences, on longer distances than the typical read length)
Long-reads (Goes without saying, that longer reads will identify large-scale differences easier ... the problem is just to get long enough reads)
De Novo assembly (Really hard, but when it works, it gives a really clear picture)

See abbreviations

PGM - Personal Genome Machine
WGS - Whole Genome Study (as opposed to Exome study, which skips the Introns)

Common commands

bwa aln -a bwtsw # Align with bwtsw algorithm
samtools faidx
bwa sampe        # Generate alignments in SAM format given paired-end reads

Some common file types

.fastq - FASTA format with quality scores.
.sam - Short-read alignment file format
.bam - Binary SAM format (for compression)
.sai - SAM file index
.gtf / .gff / .bed - Annotation file formats