NGS Bioinformatics Intro Course Day 1


Feb 9, 2015

Just finished day 1 of the introductory course on Bioinformatics for Next Generation Sequencing data at SciLifeLab Uppsala. Attaching a photo from one of the hands-on tutorial sessions, with the tutorial leaders standing to the right.

Today's content was mostly introductions to the Linux command line in general, and the UPPMAX HPC environment in particular, an area I'm already very familiar with after two years as a sysadmin at UPPMAX. Thus, today I mostly got to help out the other students a bit.

By the way, this strong focus on getting everybody to use the Linux command line, rather than just throwing an easy-to-use GUI at them, seems to be something that is not super common internationally (and something that we touch on in our lessons learned article, from implementing the UPPNEX resource). It is widely acclaimed though, here at SciLifeLab, among most if not all of those who have already made their way into the world of command line use. GUIs are typically - in popular opinion here - way too constraining for the juggling around of terabytes of data between different storage systems etc., which is so typical of many NGS analysis use cases.

Anyways, in the time between student questions, I also browsed around on a very interesting site with introductory bioinformatics materials that someone sent me:

Course in "Applied Bioinformatics 2014" at Penn State University

This course was created by the creator of the widely used questions-and-answers site biostars.org, and seems to contain tons of very good, hands-on and practical material. I will probably continue digging through this material after the current course is finished (which is also when I should have a better head start into the field).

In spite of the day mostly focusing on the Linux introduction, I still picked up the first "NGS bioinformatics" nugget for today:

The "rule-of-thumbish" kind of fact that the typical steps in an exome analysis are:

Filtering out low quality reads
Aligning the reads to a reference genome
Finding all the SNPs (Single Nucleotide Polymorphisms) in the data
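
To make the three steps above a bit more concrete, here is a minimal toy sketch in Python. Everything in it is made up for illustration: the reference sequence, the reads, the quality threshold and the function names are my own assumptions, and a real exome analysis would use dedicated tools for each step rather than anything like this. In particular, a real SNP caller looks at many reads covering each position, while this sketch just reports every mismatch it sees.

```python
# Toy illustration only: all data, names and thresholds here are assumptions.

REFERENCE = "ACGTACGTTAGCCGATAGCT"  # made-up reference sequence


def mean_quality(qualities):
    """Average of a list of per-base quality scores."""
    return sum(qualities) / len(qualities)


def filter_reads(reads, min_mean_quality=20):
    """Step 1: drop reads whose mean base quality is below the threshold."""
    return [r for r in reads if mean_quality(r["qual"]) >= min_mean_quality]


def align(read_seq, reference, max_mismatches=1):
    """Step 2: naive ungapped alignment.

    Returns the offset with the fewest mismatches, or None if every
    offset has more than max_mismatches mismatches.
    """
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(reference) - len(read_seq) + 1):
        window = reference[pos:pos + len(read_seq)]
        mm = sum(1 for a, b in zip(read_seq, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def call_variants(reads, reference):
    """Step 3: report (position, reference base, read base) for each mismatch."""
    variants = set()
    for r in reads:
        pos = align(r["seq"], reference)
        if pos is None:
            continue  # read could not be placed on the reference
        for i, base in enumerate(r["seq"]):
            if base != reference[pos + i]:
                variants.add((pos + i, reference[pos + i], base))
    return sorted(variants)


reads = [
    {"seq": "ACGTACGT", "qual": [30] * 8},  # matches the reference exactly
    {"seq": "TAGCGGAT", "qual": [35] * 8},  # one base differs from the reference
    {"seq": "GGGGGGGG", "qual": [5] * 8},   # low quality, removed in step 1
]

good_reads = filter_reads(reads)
variants = call_variants(good_reads, REFERENCE)
print(variants)
```

The point is only the shape of the pipeline: each step consumes the output of the previous one, which is also why workflow tools fit this kind of analysis so well.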

Otherwise, in the afternoon, after most students had gotten over the initial Linux command line hurdle, I found some time to test out the Python-based, "make-inspired" workflow tool snakemake. I have heard good things about snakemake from a large number of folks I know, so it has been on my shortlist to check out (otherwise, I'm mostly familiar with tools like Spotify's Luigi, Galaxy, Yabi, BPipe and Nextflow).

Snakemake has so far left me with a largely positive impression, thanks to its terse and efficient syntax, great logging and visualization facilities, and even a web-based GUI to control jobs. Still, a number of things left me not totally sold, such as its dependence on Python 3 (which I fear will cause incompatibility with numpy & co) and the somewhat backwardish make-style syntax (you specify what file pattern you want to get in the end, rather than feeding your pipeline with whatever input data you have). I will have to continue testing it out in the near future to see how bad/not-so-bad those things are.
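
To illustrate what I mean by "backwardish", here is a small plain-Python sketch of the general make-style idea: you ask for a target file, and the tool works backwards through rules whose output patterns match it. This is not snakemake's actual syntax or implementation, just a toy model; the rule names, the `{sample}` placeholder handling and the file extensions are assumptions picked for the example.

```python
import re

# Hypothetical rules: each maps an output filename pattern to an input pattern.
RULES = [
    {"name": "call_snps", "output": "{sample}.vcf", "input": "{sample}.bam"},
    {"name": "align", "output": "{sample}.bam", "input": "{sample}.filtered.fastq"},
    {"name": "filter", "output": "{sample}.filtered.fastq", "input": "{sample}.fastq"},
]


def match(pattern, filename):
    """Return the value of {sample} if filename fits pattern, else None."""
    # Escape the literal parts and put a capture group where {sample} was.
    regex = "([^.]+)".join(re.escape(part) for part in pattern.split("{sample}"))
    m = re.fullmatch(regex, filename)
    return m.group(1) if m else None


def plan(target, existing_files):
    """Work backwards from target; return (rule, input, output) in run order."""
    if target in existing_files:
        return []  # nothing to do, the file is already there
    for rule in RULES:
        sample = match(rule["output"], target)
        if sample is not None:
            needed = rule["input"].replace("{sample}", sample)
            # First plan whatever is needed to create the input, then this rule.
            return plan(needed, existing_files) + [(rule["name"], needed, target)]
    raise ValueError("no rule produces " + target)


steps = plan("patient1.vcf", existing_files={"patient1.fastq"})
for name, src, dst in steps:
    print(name + ":", src, "->", dst)
```

Note how the starting point is the file you want in the end (`patient1.vcf`), not the data you have: the input file is only discovered at the very end of the backwards search. A "push"-style tool would instead start from `patient1.fastq` and send it forward through the steps.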

Anyways, this should be more than enough for today. Now looking forward to getting into the "real" bioinformatics content tomorrow, Tuesday (starting with a walk-through of common file formats used in NGS analysis).