Build a full workflow

The task

We want to perform read mapping and variant genotyping against a genome graph, a data structure which contains known genetic variation observed in a population or a species.

We will use gramtools to do this, which I develop.

The data

We have 5 read sets from Mycobacterium tuberculosis, a bacterial pathogen. We also have its reference genome and a file describing known variants, in the VCF format. The data files are in P3/data.

We will build a genome graph, map the reads to it, and find out which samples have which variants.

Desired workflow

../_images/P3_dag.svg

The steps of the workflow to implement. Only two samples are shown.

Steps explained

Step

Purpose

Command

build

Build a genome graph from genome reference + variants

gramtools build {options}

quasimap

Map reads from a sample to a genome graph

gramtools quasimap {options}

infer

Infer which variants the mapped sample has

gramtools infer {options}

compress

compress a VCF file and index it for fast access

bgzip {file} && bcftools {file}.gz

mergeVcfs

combine compressed VCFs into one multi-sample VCF

bcftools merge -O z {inputs} > {output}

plots

make plots of the variation found (Brice’s script)

python3 scripts/analyse_variants.py {in}.vcf.gz {outdir}

Get information the command options by running them on the command-line.

build

This command takes a kmer-size option, reasonable values are < 12.

quasimap

This command takes a run-directory option, which stores all results for a given sample. It used again in the infer step.

plots

The script plots, given a multi-sample VCF file, i) variant density and ii) the site frequency spectrum.

This script uses the excellent python package scikit-allel for analysing genetic variation, if you’re interested!

Tips

Writing

Here is the Snakefile we start from:

import pandas as pd

configfile: "config/config.yaml"

samples = pd.read_csv("config/samples.tsv", sep="\t")

rule all:
	input:
		"{data}/gramtools/build"

		#expand("{results}/nuc_diversity.pdf",results=config["results"]) # The final expected result

rule build:
	input:
		ref="{data}/TB_ref.fa",
		vcf="{data}/known_variants.vcf.gz"
	output:
		directory("{data}/gramtools/build")
	params:
		#k = ???
	shell:
		#???

Keep updating the ``all`` rule with each step you add, so that you always have a functional workflow.

Data

Start by using only two samples, as in the provided config/samples.tsv.

Then when your workflow works with two samples, run on all of them using:

mv config/all_samples.tsv config/samples.tsv