Build a full workflow¶
The task¶
We want to perform read mapping and variant genotyping against a genome graph, a data structure which contains known genetic variation observed in a population or a species.
We will use gramtools to do this, which I develop.
The data¶
We have 5 read sets from Mycobacterium tuberculosis, a bacterial pathogen. We also have its reference genome and a file describing known variants, in the VCF format. The data files are in P3/data.
We will build a genome graph, map the reads to it, and find out which samples have which variants.
Steps explained¶
Step |
Purpose |
Command |
---|---|---|
Build a genome graph from genome reference + variants |
|
|
Map reads from a sample to a genome graph |
|
|
infer |
Infer which variants the mapped sample has |
|
compress |
compress a VCF file and index it for fast access |
|
mergeVcfs |
combine compressed VCFs into one multi-sample VCF |
|
make plots of the variation found (Brice’s script) |
|
Get information the command options by running them on the command-line.
build¶
This command takes a kmer-size option, reasonable values are < 12.
quasimap¶
This command takes a run-directory option, which stores all results for a given sample. It used again in the infer
step.
plots¶
The script plots, given a multi-sample VCF file, i) variant density and ii) the site frequency spectrum.
This script uses the excellent python package scikit-allel for analysing genetic variation, if you’re interested!
Tips¶
Writing¶
Here is the Snakefile we start from:
import pandas as pd
configfile: "config/config.yaml"
samples = pd.read_csv("config/samples.tsv", sep="\t")
rule all:
input:
"{data}/gramtools/build"
#expand("{results}/nuc_diversity.pdf",results=config["results"]) # The final expected result
rule build:
input:
ref="{data}/TB_ref.fa",
vcf="{data}/known_variants.vcf.gz"
output:
directory("{data}/gramtools/build")
params:
#k = ???
shell:
#???
Keep updating the ``all`` rule with each step you add, so that you always have a functional workflow.
Data¶
Start by using only two samples, as in the provided config/samples.tsv.
Then when your workflow works with two samples, run on all of them using:
mv config/all_samples.tsv config/samples.tsv