Bioinformatics Seminars

3 October 2017

BioShake: a Haskell EDSL for bioinformatics pipelines

Justin Bedo
WEHI Bioinformatics

Rapid technological progress has recently driven great advances in
biology, notably in areas such as genomics. These technologies have
drastically increased the use of computing in data processing and
analysis. It is now common to spend many hours of compute time
processing biological data in what is known as a bioinformatics
pipeline.

These bioinformatics pipelines are typically composed of numerous
programs and stages, loosely coupled through intermediate files. They
tend to be quite complex and require substantial computational time;
hence a good pipeline must manage intermediate files, guarantee
re-entrability (the ability to re-enter and continue a partially run
pipeline), and provide a clear syntax that makes pipelines easy to
describe and understand.

I will present BioShake: a Haskell EDSL for bioinformatics pipelines
that has several advantages over existing frameworks:

1. Type-level programming is strongly leveraged to catch errors in
pipeline construction at compile time. Errors such as mismatched file
types or metadata result in a compile error rather than a runtime
error. This catches errors significantly earlier, reducing debugging
time. As bioinformatics pipelines tend to have long runtimes, this is
especially advantageous.

2. The naming of outputs at the various stages of a pipeline is
abstracted by BioShake. Outputs at a stage can be explicitly named if
they are desired outputs. Thus, the burden of constructing names for
temporary files is alleviated.

3. BioShake builds on top of Shake, an industrial-strength build tool
also implemented as an EDSL in Haskell. BioShake thus inherits the
reporting features, robust dependency tracking, and resumption
capabilities offered by the underlying Shake architecture.

4. Unlike the underlying Shake, which expects dependencies to be
specified, BioShake allows forward specification of pipelines. As
bioinformatics pipelines tend to be quite long and mostly linear, this
eases the cognitive burden during pipeline design and also improves
readability.
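
Points 1 and 4 can be illustrated with a minimal Haskell sketch. This
is not BioShake's actual API: every name below (FastQ, Bam, Vcf, Stage,
|>, input, align, sortBam, callVariants, steps) is hypothetical. The
sketch assumes that file types are tracked as phantom type parameters,
so that joining mismatched stages fails to compile, and that stages are
chained in forward order from input to output.

```haskell
module Main where

-- Hypothetical file-type tags (illustrative only, not BioShake's types).
data FastQ
data Bam
data Vcf

-- A stage records the commands run so far. The phantom parameter 'a'
-- tracks the type of file the stage produces.
newtype Stage a = Stage [String]

-- Forward composition: feed the output of one stage into the next.
infixl 1 |>
(|>) :: Stage a -> (Stage a -> Stage b) -> Stage b
x |> f = f x

-- Hypothetical stages; each one only appends a description of its step.
input :: FilePath -> Stage FastQ
input path = Stage ["read " ++ path]

align :: Stage FastQ -> Stage Bam
align (Stage cs) = Stage (cs ++ ["align"])

sortBam :: Stage Bam -> Stage Bam
sortBam (Stage cs) = Stage (cs ++ ["sort"])

callVariants :: Stage Bam -> Stage Vcf
callVariants (Stage cs) = Stage (cs ++ ["call variants"])

-- Extract the recorded steps in execution order.
steps :: Stage a -> [String]
steps (Stage cs) = cs

-- The pipeline reads in the order the data flows.
pipeline :: Stage Vcf
pipeline = input "sample.fastq" |> align |> sortBam |> callVariants

-- Rejected by the type checker, because callVariants expects a Stage Bam
-- but input yields a Stage FastQ:
-- broken = input "sample.fastq" |> callVariants

main :: IO ()
main = mapM_ putStrLn (steps pipeline)
```

Uncommenting the `broken` definition makes compilation fail, so the
mismatch is reported before any long-running stage is started.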
