Build a full workflow¶

The task¶

We want to perform read mapping and variant genotyping against a genome graph, a data structure which contains known genetic variation observed in a population or a species.

We will use gramtools to do this, which I develop.

The data¶

We have 5 read sets from Mycobacterium tuberculosis, a bacterial pathogen. We also have its reference genome and a file describing known variants, in the VCF format. The data files are in P3/data.

We will build a genome graph, map the reads to it, and find out which samples have which variants.

Desired workflow¶

The steps of the workflow to implement. Only two samples are shown.¶

Steps explained¶

Step	Purpose	Command
build	Build a genome graph from genome reference + variants	`gramtools build {options}`
quasimap	Map reads from a sample to a genome graph	`gramtools quasimap {options}`
infer	Infer which variants the mapped sample has	`gramtools infer {options}`
compress	compress a VCF file and index it for fast access	`bgzip {file} && bcftools {file}.gz`
mergeVcfs	combine compressed VCFs into one multi-sample VCF	`bcftools merge -O z {inputs} > {output}`
plots	make plots of the variation found (Brice’s script)	`python3 scripts/analyse_variants.py {in}.vcf.gz {outdir}`

Get information the command options by running them on the command-line.

build¶

This command takes a kmer-size option, reasonable values are < 12.

quasimap¶

This command takes a run-directory option, which stores all results for a given sample. It used again in the infer step.

plots¶

The script plots, given a multi-sample VCF file, i) variant density and ii) the site frequency spectrum.

This script uses the excellent python package scikit-allel for analysing genetic variation, if you’re interested!

Tips¶

Writing¶

Here is the Snakefile we start from:

import pandas as pd

configfile: "config/config.yaml"

samples = pd.read_csv("config/samples.tsv", sep="\t")

rule all:
	input:
		"{data}/gramtools/build"

		#expand("{results}/nuc_diversity.pdf",results=config["results"]) # The final expected result

rule build:
	input:
		ref="{data}/TB_ref.fa",
		vcf="{data}/known_variants.vcf.gz"
	output:
		directory("{data}/gramtools/build")
	params:
		#k = ???
	shell:
		#???

Keep updating the ``all`` rule with each step you add, so that you always have a functional workflow.

Data¶

Start by using only two samples, as in the provided config/samples.tsv.

Then when your workflow works with two samples, run on all of them using:

mv config/all_samples.tsv config/samples.tsv