EMUMADZ pipeline case study 2¶
BAM to SNP for F2-F2 comparisons¶
Copyright © 2025 Tyrone ChenCode in this repository is provided under a MIT license. This documentation is provided under a CC-BY-4.0 license.
Visit our lab website here. Contact Benjamin Hogan at ben.hogan@petermac.org.
Usage¶
Note
All steps are run in directory source.
Data setup¶
Software requirements:
bcftools
gatk
samtools
snpEff
Note
Here we assume that bcftools,gatk,samtools,snpEff are all installed and the user has completed the steps in the install instructions.
Data requirements:
Reference genome fasta file
Samplesheet with sample information
Chromosome name mappings
Here is the directory structure:
project_root/
├── source/ # Scripts are run from here
├── data/ # Input data (BAM files, reference genome, samplesheet)
└── results.F2-F2/ # All output files organized by processing step
├── 01_variant_called
├── 02_chromosome_filtered
├── 03_stringent_filtered
├── 04_normalised
├── 05_samples_merged
├── 06_snps_filtered
├── 07_mutant_candidates
├── 08_annot_all
├── 08_annot_eff
├── 08_annot_vep
├── 09_finalised
└── 10_visualisations
Create a samplesheet with sample identifiers, paths to bam file and the sample type:
sample_identity alignment_file sample_type
some_sample_1 /path/to/1.bam reference
some_sample_2 /path/to/2.bam reference
some_sample_3 /path/to/3.bam reference
some_sample_4 /path/to/4.bam reference
some_sample_5 /path/to/5.bam mutant
... ... ...
In our case, this is how our table looks like:
Here we setup the file structure and some other metadata.
# until step 3, this doesnt matter because all files are treated independently
ids="122-3-I 15-1 15-3 161-2 163-4L 175-1 18-2-4 22-1-I 245-2-A 24-6 255-B 259-1 261-C-C 262-1 262-5-8 263-1 263-III-1 268-3-2 278-1 282-2-1 287-1-D 299-3 322-2 335-A 373-I 38-A 49-2-A 49-2-C 53-2 74-3"
SAMPLESHEETS=$(for i in $ids; do echo samplesheet.F2-F2.$i.tsv; done)
# convenience function since it will be used in most downstream operations
setup() {
# navigate to source directory
if [[ ! $(basename $PWD) -eq source ]]; then
echo "All scripts are set to run from directory source"
echo "See directory structure above for more information"
exit 1
fi
# various configuration variables
THREADS=16
# GATK
GATK_MEM="64g"
GATK_JAVA_OPTS="-Xmx${GATK_MEM} -XX:+UseParallelGC"
# ENSEMBL VEP
VEP_ASSEMBLY="Zv9"
VEP_VERSION="79"
VEP_CACHE="${HOME}/.vep"
VEP_BUFFER="8192"
# snpEff
SNPEFF_CONFIG="/home/tchen/.local/share/mamba/envs/snps/share/snpeff-5.2-1/snpEff.config"
SNPEFF_GENOME="Zv9.75"
# paths relative to source directory
DATA_DIR="../data/"
RESULTS_DIR="../results.F2-F2/"
SAMPLESHEET="${DATA_DIR}/samplesheet.F2-F2.tsv"
REFERENCE_FA="${DATA_DIR}/Reference/GCA_000002035.2_Zv9_genomic.fna"
REF_FIXED_FA="${DATA_DIR}/Reference/GCA_000002035.2_Zv9_genomic.ChrFixed.fna"
CHROM_MAP="${DATA_DIR}/chrom_table.tsv"
# parse samplesheet and load samples into array
MUT_SAMPLES=($(awk '$3=="mutant" {print $1}' ${SAMPLESHEET}))
REF_SAMPLES=($(awk '$3=="reference" {print $1}' ${SAMPLESHEET}))
echo "${#MUT_SAMPLES[@]} mutants: ${MUT_SAMPLES[@]}"
echo "${#REF_SAMPLES[@]} references: ${REF_SAMPLES[@]}"
# create results directory structure
mkdir -p ${RESULTS_DIR}/{01_variant_called,02_chromosome_filtered,03_stringent_filtered,04_normalised,05_samples_merged,06_snps_filtered,07_mutant_candidates,08_annot_eff,08_annot_vep,08_annot_all,09_finalised,10_visualisations}
# set downstream directories and suffixes
# if step 3 is actioned, edit these as needed
PROCESSED_VCF_DIR="${RESULTS_DIR}/02_chromosome_filtered"
}
Create some metadata files. The chromosome file maps the chromosome IDs onto the chr{1..25} more human-readable format. Have the fasta genome reference file on hand.
setup_metadata() {
# verify reference genome preparation
if [[ ! -f "${REFERENCE_FA}.fai" ]]; then
echo "Creating FASTA index..."
samtools faidx ${REFERENCE_FA}
fi
if [[ ! -f "${REFERENCE_FA%.*}.dict" ]]; then
echo "Creating sequence dictionary..."
gatk CreateSequenceDictionary -R ${REFERENCE_FA}
fi
# same but for the fixed chromosomes
if [[ ! -f "${REF_FIXED_FA}.fai" ]]; then
echo "Creating FASTA index..."
samtools faidx ${REFERENCE_FA}
fi
# create chromosome list (adjust range as needed)
grep -P "^chr[0-9]+\t" ${REF_FIXED_FA}.fai > \
${DATA_DIR}/chromosomes.txt
}
setup
setup_metadata
Nextflow¶
Note
This step is equivalent to running the rest of the pipeline manually for the given sample(s).
Example of using nextflow:
samplesheet="../data/samplesheet.F2-F2.122-3-I.tsv"
reference_fa="../data/Reference/GCA_000002035.2_Zv9_genomic.fna"
ref_fixed_fa="../data/Reference/GCA_000002035.2_Zv9_genomic.ChrFixed.fna"
chrom="../data/chrom_table.tsv"
results="../results.F2-F2/122-3-I/"
nextflow run emumadz_pipeline.nf \
--samplesheet ${samplesheet} \
--reference_fa ${reference_fa} \
--ref_fixed_fa ${ref_fixed_fa} \
--chrom_map ${chrom} \
--outdir ${results} \
-resume
Whole zebrafish genome assembly¶
The ``seqliner`` assembly step is not covered in this documentation.
Variant calling and read filtering¶
We filter the bam files for low quality reads and perform variant calling simultaneously.
Note
We apply intentionally minimal filtering at the early stage and apply custom filters later. Here, we are mainly interested in filtering poorly mapped reads.
# this helper function retrieves bam file path given sample id
get_bam_path() {
local sample_id=$1
awk -v id="$sample_id" '$1==id {print $2}' ${SAMPLESHEET}
}
call_variants() {
local sample=$1
local sample_type=$2
local bam_path=$(get_bam_path $sample)
echo "Processing ${sample_type}: $sample ($bam_path)"
# Verify BAM file and index exist
if [[ ! -f "${bam_path}" ]]; then
echo "No BAM: ${bam_path}"
return 1
fi
if [[ ! -f "${bam_path/bam/bai}" ]]; then
echo "No index: ${bam_path/bam/bai}"
return 1
fi
# gatk HaplotypeCaller with quality filters and threading
gatk --java-options "${GATK_JAVA_OPTS}" HaplotypeCaller \
-R ${REFERENCE_FA} \
-I ${bam_path} \
-O "${RESULTS_DIR}/01_variant_calling/${sample}.vcf.gz" \
--minimum-mapping-quality 20 \
--min-base-quality-score 20 \
--stand-call-conf 30 \
--max-reads-per-alignment-start 50 \
--max-mnp-distance 1 \
--native-pair-hmm-threads ${THREADS} \
--verbosity INFO
echo "Completed variant calling for ${sample_type}: $sample"
}
setup
export REFERENCE_FA RESULTS_DIR GATK_JAVA_OPTS THREADS SAMPLESHEET
for ref in "${REF_SAMPLES[@]}"; do call_variants ${ref} reference; done
for mut in "${MUT_SAMPLES[@]}"; do call_variants ${mut} mutant; done
echo "Variant calling completed for all samples"
Caution
The above was tested given an allocation of 40 cpus and 80 GB of memory. You may want to adjust the number of parallel jobs and memory corresponding to your available resources.
Caution
gatk settings are specific to version 4.5.0.0-gcc-13.2.0 and may not work for other versions.
Optional hard filtering.
Warning
These filters may be too stringent for this use case. Only use if you are seeing an excess of false positives. Note that it is possible to later filter out false positives.
# Function to apply GATK hard filters
apply_stringent_filters() {
local sample=$1
local infile_dir="${RESULTS_DIR}/02_chromosome_filtered"
local outfile_dir="${RESULTS_DIR}/03_stringent_filtered"
echo "Applying hard filters to: $sample"
# Apply GATK hard filters
gatk --java-options "${GATK_JAVA_OPTS}" VariantFiltration \
-R ${REFERENCE_FA} \
-V "${infile_dir}/${sample}.vcf" \
-O "${outfile_dir}/${sample}_filtered.vcf" \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8"
# select only PASS variants
gatk --java-options "${GATK_JAVA_OPTS}" SelectVariants \
-R ${REFERENCE_FA} \
-V "${outfile_dir}/${sample}_filtered.vcf" \
-O "${outfile_dir}/${sample}.vcf" \
--exclude-filtered
# clean up intermediate files
rm -f "${outfile_dir}/${sample}_filtered.vcf"
echo "Completed stringent filtering for: $sample"
}
# store hard-filtered variants
setup
mkdir -p "${RESULTS_DIR}/03_hard_filtering"
# Process all samples
for sample in "${ALL_SAMPLES[@]}"; do
apply_hard_filters $sample
done
Rename chromosomes in vcf with map file, and discard non-chromosomes¶
data/chrom_table.tsv contains the chromosome name mappings. Use this to remap the chromosomes.
At the same time, we are only interested in the chromosomes, which start with chr and go from chr1 to chr25. So we discard all other contigs (including from the header).
filter_and_rename_chromosomes() {
local sample=$1
local infile_dir="${RESULTS_DIR}/01_variant_called"
local outfile_dir="${RESULTS_DIR}/02_chromosome_filtered"
echo "Filtering chromosomes for: $sample"
# 33550336 use extreme range to cover all
for i in {1..25}; do
echo -e "chr${i}\t1\t1111111111111000000000000"
done > ../data/chromosomes_only.txt
# we are going to paste text, so leave it decompressed
# rename all contigs with mapping
bcftools annotate --threads ${THREADS} \
--rename-chr ${CHROM_MAP} \
${infile_dir}/${sample}.vcf.gz | \
grep -v '##contig=<ID=FR' > "${outfile_dir}/${sample}_reannot.vcf"
# filter to main chromosomes only and remove non-chromosomal variants from header
bcftools view -t $(cut -f1 ${DATA_DIR}/chromosomes.txt | paste -sd,) \
"${outfile_dir}/${sample}_reannot.vcf" \
--threads ${THREADS} \
-o "${outfile_dir}/${sample}_chr_temp.vcf"
# clean up header to remove non-chromosomal contigs
bcftools reheader -f ${DATA_DIR}/chromosomes.txt \
"${outfile_dir}/${sample}_chr_temp.vcf" | \
bgzip -@ ${THREADS} -c > "${outfile_dir}/${sample}.vcf.gz"
bcftools index --threads ${THREADS} \
"${outfile_dir}/${sample}.vcf.gz"
# clean up intermediate files
rm -f "${outfile_dir}/${sample}_reannot.vcf" \
"${outfile_dir}/${sample}_chr_temp.vcf"
echo "Completed chromosome filtering for: $sample"
}
# process all samples sequentially
setup
ALL_SAMPLES=("${MUT_SAMPLES[@]}" "${REF_SAMPLES[@]}")
for sample in "${ALL_SAMPLES[@]}"; do
filter_and_rename_chromosomes $sample
done
Decompose multiallelic variants into individual instances¶
The aim of the normalisation is to preserve differential events that may occur across alleles. For example, if a reference control is heterozygous for A/T point mutation, and the mutant sample is homozygous for a G point mutation, the mutant is considered to have a SNP event.
normalise_vcf() {
local sample=$1
local infile_dir="${PROCESSED_VCF_DIR}"
local outfile_dir="${RESULTS_DIR}/04_normalised/"
local input_vcf="${infile_dir}/${sample}.vcf.gz"
local output_vcf="${outfile_dir}/${sample}.vcf.gz"
echo "Normalising VCF for: $sample"
# normalize variants: decompose complex variants and left-align indels
bcftools norm -m-both -f "${REF_FIXED_FA}" "${input_vcf}" \
--threads ${THREADS} --write-index -Ob -o "${output_vcf}"
echo "Completed normalisation for: $sample"
return 0
}
echo "Normalising VCFs before merging..."
ALL_SAMPLES=("${MUT_SAMPLES[@]}" "${REF_SAMPLES[@]}")
for sample in "${ALL_SAMPLES[@]}"; do
normalise_vcf $sample
done
Merge files with refs¶
Samples are then merged for side-by-side comparison between mutant samples and reference controls.
For each sample, merge the four references: - TL2209397-ENUref-Female.vcf.gz - TL2209398-ENUref-Male.vcf.gz - TL2209399-TUref-Female.vcf.gz - TL2209400-TUref-Male.vcf.gz
Each sample occupies the first slot in the vcf sample columns, followed by the references.
Caution
The order of samples and references is important.
merge_vcf() {
local sample=$1
local infile_dir="${RESULTS_DIR}/04_normalised/"
local outfile_dir="${RESULTS_DIR}/05_samples_merged/"
# create VCF list for this mutant + all references (using normalised files)
VCF_LIST="${infile_dir}/${sample}.vcf.gz"
for ref in "${REF_SAMPLES[@]}"; do
VCF_LIST="${VCF_LIST} ${infile_dir}/${ref}.vcf.gz"
done
# merge this mutant with all references
echo "Merging ${sample} with references..."
bcftools merge ${VCF_LIST} --threads ${THREADS} --write-index \
-Ob -o "${outfile_dir}/${sample}.vcf.gz"
}
for sample in "${MUT_SAMPLES[@]}"; do
merge_vcf $sample
done
Filter for SNP only events¶
Other mutations are not of interest since N-ethyl-N-nitrosourea (ENU) used in the forward genetic screen mainly induces point mutations.
snps_vcf() {
local sample=$1
local infile_dir="${RESULTS_DIR}/05_samples_merged/"
local outfile_dir="${RESULTS_DIR}/06_snps_filtered/"
# filter for SNPs only
echo "Filtering SNPs for ${sample}..."
bcftools view -i 'TYPE="snp"' --threads ${THREADS} \
"${infile_dir}/${sample}.vcf.gz" \
--write-index -Ob -o "${outfile_dir}/${sample}.vcf.gz"
}
for sample in "${MUT_SAMPLES[@]}"; do
snps_vcf $sample
done
Identify candidate SNPs¶
Obtain SNPs which occur in the sample but not the references. There are several criteria in our filter, summarised below: 1. mutant sample must be covered to a depth of >= 1 read. 2. At least one reference control must be covered to a depth of >= 1 read. 3. If the reference control is multiallelic for a SNP, it should not match the mutant sample.
Beyond this, a light mapping quality filter was implemented during the variant calling stage.
The reference control may be for example a grandparent or a wild-type sibling. The pipeline is agnostic to input and will run regardless. An arbitrary number of reference controls can be used.
We explain the criteria above with some examples.
Showing the concept of coverage in (1) and (2):
MUT SAMPLE PASS FAIL PASS
MAPPED -------------- --------------
GENOME ============== ============== ==============
REF CONTROL 1
MAPPED -------------- --------------
GENOME ============== ============== ==============
REF CONTROL 2
MAPPED --------------
GENOME ============== ============== ==============
Showing the concept of multiallelic SNP events in (3): - ORI is the true original - REF is the reference control - MUT is the mutant sample
Let us assume, the true canonical nucleotide is C at a given position.:
==================
PLATONIC IDEAL REF
==================
ORI ------C-------
------C-------
=====EXAMPLE1=====
REF homozygous
MUT homozygous
SNP in ORI identical to REF
SNP in REF different to MUT
This is a SNP event
ORI ------C-------
------C-------
REF ------C-------
------C-------
MUT ------G-------
------G-------
=====EXAMPLE2=====
REF homozygous
MUT homozygous
SNP in ORI different to REF
SNP in REF identical to MUT
This is not a SNP event
ORI ------C-------
------C-------
REF ------G-------
------G-------
MUT ------G-------
------G-------
=====EXAMPLE3=====
REF homozygous
MUT heterozygous
SNP in ORI different to REF
SNP in REF different to MUT
This is a SNP event
ORI ------C-------
------C-------
REF ------A-------
------T-------
MUT ------G-------
------G-------
=====EXAMPLE4=====
REF heterozygous
MUT homozygous
SNP in ORI different to REF
SNP in REF partially different to MUT
This is not a SNP event
ORI ------C-------
------C-------
REF ------A-------
------G-------
MUT ------G-------
------G-------
=====EXAMPLE5=====
REF homozygous
MUT heterozygous
SNP in ORI different to REF
SNP in REF partially different to MUT
This is not a SNP event
ORI ------C-------
------C-------
REF ------G-------
------G-------
MUT ------A-------
------G-------
The three criteria are implemented together.
mutant sample must be covered to a depth of >= 1 read.
At least one reference control must be covered to a depth of >= 1 read.
If the reference control is multiallelic for a SNP, it should not match the mutant sample.
find_candidates_vcf() {
local sample=$1
local infile_dir="${RESULTS_DIR}/06_snps_filtered/"
local outfile_dir="${RESULTS_DIR}/07_mutant_candidates/"
NUM_REFS=${#REF_SAMPLES[@]}
# mutant AND at least one ref has >=1 read depth
DP_FILTER="FORMAT/DP[0]>=1"
for ((i=1; i<=NUM_REFS; i++)); do
if [ $i -eq 1 ]; then
DP_FILTER="${DP_FILTER} && (FORMAT/DP[${i}]>=1"
else
DP_FILTER="${DP_FILTER} || FORMAT/DP[${i}]>=1"
fi
done
# the trailing bracket isnt a typo btw
DP_FILTER="${DP_FILTER})"
# instead of a strict genotyping, we relax thresholds
# this more effectively approximates genotype instead
# mutant needs read depth of >= 0.85 and refs <= 0.15
AF_FILTER='FORMAT/AD[0:1]/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= 0.85'
for ((i=1; i<=NUM_REFS; i++)); do
# Missing data in references = assumed reference (good)
# Only filter if there IS alt allele data and it's too high
AF_FILTER="${AF_FILTER} && (FORMAT/AD[${i}:1]=\".\" || FORMAT/AD[${i}:0]=\".\" || FORMAT/AD[${i}:1]/(FORMAT/AD[${i}:0]+FORMAT/AD[${i}:1]) <= 0.15)"
done
# combine filters
STRICT_FILTER="(${DP_FILTER}) && (${AF_FILTER})"
bcftools filter -i "${STRICT_FILTER}" --threads ${THREADS} --write_index \
"${infile_dir}/${sample}.vcf.gz" -Ob -o "${outfile_dir}/${sample}.vcf.gz"
}
for sample in "${MUT_SAMPLES[@]}"; do
snps_vcf $sample
done
Screen impact of SNP¶
Either VEP or snpEff will produce similarly formatted results. We run both separately and combine their annotations at the end.
VEP¶
We run ENSEMBL’s variant effect predictor VEP on the data. Install instructions are provided separately on our install page.
Important
Since we are using an ancient zebrafish genome, a specific combination of settings must be used. A prerequisite is installing an older cached version of the genome, which is covered in the install section. If this was not done, it is unlikely that VEP will work as intended.
annot_vep() {
local sample=$1
local infile_dir="${RESULTS_DIR}/07_mutant_candidates/"
local outfile_dir="${RESULTS_DIR}/08_annot_vep/"
echo "Predicting variant impact for ${sample}..."
vep \
--cache \
--dir_cache ${VEP_CACHE} \
--species danio_rerio \
--assembly ${VEP_ASSEMBLY} \
--cache_version ${VEP_VERSION} \
--offline \
--regulatory \
--vcf \
--variant_class \
--fork ${THREADS} \
--buffer_size ${VEP_BUFFER} \
--stats_html \
--stats_text \
--force_overwrite \
--compress_output bgzip \
--input_file "${infile_dir}/${sample}.vcf.gz" \
--output_file "${outfile_dir}/${sample}.vcf.gz"
bcftools index --threads ${THREADS} \
"${outfile_dir}/${sample}.vcf.gz"
}
for sample in "${MUT_SAMPLES[@]}"; do
annot_vep $sample
done
snpEff¶
We run snpEff to screen for SNPs which are predicted to be impactful. Annotations are added to the data in the process.
Warning
snpEff runs on java, and as a result may run into memory and other issues. You may need to set java memory flags accordingly.
# adjust max values accordingly
export JAVA_OPTIONS="-Xms512m -Xmx16g"
# SnpEff version SnpEff 5.2 (build 2023-09-29 06:17)
snpEff download Zv9.75
annot_eff() {
local sample=$1
local infile_dir="${RESULTS_DIR}/07_mutant_candidates/"
local outfile_dir="${RESULTS_DIR}/08_annot_eff/"
export JAVA_OPTIONS="-Xms512m -Xmx16g"
csv_stat=${outfile_dir}/${sample}.csv
fas_prot=${outfile_dir}/${sample}.fa
web_stat=${outfile_dir}/${sample}.html
out_path=${outfile_dir}/${sample}.vcf
snpEff \
-c "${SNPEFF_CONFIG}" \
-csvStats "${csv_stat}" \
-i vcf \
-o vcf \
-fastaProt "${fas_prot}" \
-s "${web_stat}" \
"${SNPEFF_GENOME}" \
"${infile_dir}/${sample}.vcf.gz" > "${out_path}"
bgzip -@ ${THREADS} ${out_path}
bcftools index --threads ${THREADS} ${out_path}.gz
}
for sample in "${MUT_SAMPLES[@]}"; do
annot_eff $sample
done
Combine SNP annotations¶
Both VEP and snpEff annotations are now available. These are recombined so that the aggregated information is available in one file. Note that we split the VEP and snpEff steps for efficiency, but running either method sequentially will obtain the same result.
annot_all() {
local sample=$1
local annot_eff="${RESULTS_DIR}/08_annot_eff/"
local annot_vep="${RESULTS_DIR}/08_annot_vep/"
local outfile_dir="${RESULTS_DIR}/08_annot_all/"
bcftools merge --threads ${THREADS} \
--force-samples \
${annot_vep}/${sample}.vcf.gz \
${annot_eff}/${sample}.vcf.gz \
--write-index -Ob -o ${outfile_dir}/${sample}.vcf.gz
}
for sample in "${MUT_SAMPLES[@]}"; do
annot_all $sample
done
Visualising data¶
Preparing data for visualisation¶
Ensure that contigs and scaffolds are removed.
THREADS=16
for i in $(find . -type f -name "*.bam"); do
j="${i%.bam}_remap1.bam"
k="${i%.bam}_remap2.bam"
# remap chromosome names in header first
samtools view -@ ${THREADS} -H "${i}" | \
awk 'NR==FNR{map[$1]=$2; next} {for(old in map) gsub(old, map[old])} 1' \
${CHROM_MAP} - > "${i}_head"
samtools reheader "${i}_head" "${i}" > "${j}"
samtools index -@ ${THREADS} "${j}"
# now filter to chr1-25
CHRS=$(seq 1 25 | sed 's/^/chr/' | tr '\n' ' ')
samtools view -@ ${THREADS} -b -o "${k}" "${j}" ${CHRS}
# rework headers
samtools view -@ ${THREADS} -H "${k}" | \
awk '/^@SQ/{if($0 ~ /SN:chr([1-9]|1[0-9]|2[0-5])\t/) print; next} {print}' \
> "${k}_head"
samtools reheader "${k}_head" "${k}" > "${i}"
samtools index -@ ${THREADS} "${i}"
rm "${j}" "${k}" "${i}_head" "${k}_head"
done
Visualise individual SNPs (single base level resolution)¶
The aim is to visualise all reads at each SNP position. This allows a user to investigate individual SNPs at single base level resolution. Complements and acts as a layer of validation for genome-level visualisation.
The tables generated here contain all the necessary information which can be visualised in the user’s method of choice (e.g. R or pandas dataframes.) The bam files (subsetted or full) can be viewed in the user’s genomic browser of choice, e.g. IGV, JBrowse, gosling.
Visualising the entire genome is ideal but the size of the alignment data makes this step impractical. Therefore, we subset the bam alignment files for reads containing the final subset of SNPs only for a portable but still informative bam file. The full bam file can also be loaded and mounted on a separate volume if needed. Here we parse the final vcf files into a format suitable for visualisation.
Hint
Skipping bam file subsetting and inspecting the full file is also valid. However, portability is an on-going issue due to bam file size (in this case ~10-20GB each).
prep_bam() {
local sample=$1
local infile_dir="${RESULTS_DIR}/08_annot_all/"
local outfile_dir="${RESULTS_DIR}/10_visualisations/"
python parse_vcf.py \
"${infile_dir}/${sample}.vcf.gz" \
"${SAMPLESHEET}" \
"${outfile_dir}/${sample}.csv" \
--subset_path "${outfile_dir}/${sample}" \
--subset_workers ${THREADS} \
--coverage_report "${outfile_dir}/${sample}_coverage.csv" \
--chrom_mapping ${CHROM_MAP} \
--force_overwrite
}
for sample in "${MUT_SAMPLES[@]}"; do
prep_bam $sample
done
The tables generated here contain all the necessary information which can be visualised in the user’s method of choice (e.g. R or pandas dataframes.) The bam files (subsetted or full) can be viewed in the user’s genomic browser of choice, e.g. IGV, JBrowse, gosling.
Visualise mutation hotspots (genome-level visualisation)¶
The aim is to visualise homozygosity hotspots across the entire genome. This allows a user to “zoom in” on hotspots for further investigation. Complements and acts as a layer of validation for base-level visualisation.
prep_tdf() {
local sample=$1
local infile_dir="${RESULTS_DIR}/06_snps_filtered/"
local outfile_dir="${RESULTS_DIR}/10_visualisations/"
python plot_homozygosity.py \
"${infile_dir}/${sample}.vcf.gz" \
${sample} \
-i ${REF_FIXED_FA} \
-t 0.85 \
-n 16
}
for sample in "${MUT_SAMPLES[@]}"; do
prep_tdf $sample
done
We use the scoring algorithm described in Henke et al., 2013.
Note
In theory, this step can be performed directly after vcf files are merged and SNPs are filtered out. This is ok if you want to generate independent tdf files for your own use. However, in the context of this pipeline, data is appended to json files created in the previous step.
Tip
After this step, no further data is generated.
Visualisation module¶
Caution
This part is more involved and is intended for developers who want to host a web page to view the data.
Tip
Can also be set up on localhost to view.
Install dependencies¶
Install npm following the instructions for your own operating system. A few linux examples are provided.
# Ubuntu/Debian
sudo apt install nodejs npm
# CentOS/RHEL
sudo yum install nodejs npm
Hint
If you get an error saying the host name cannot be resolved, try the following (at your own risk).
sudo sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
sudo sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
Then install igv-dist.
Note
IGV code is also redistributed in this repository under the MIT license.
npm install express
mkdir igv-dist
curl -o igv-dist/igv.min.js https://cdn.jsdelivr.net/npm/igv@2.15.11/dist/igv.min.js
curl -o igv-dist/igv.css https://cdn.jsdelivr.net/npm/igv@2.15.11/dist/igv.css
Our html file calls the local script.
<script src="/api/igv-proxy/igv.min.js"></script>
<link rel="stylesheet" href="/api/igv-proxy/igv.css">
Secure hosting¶
To set up a https secure host:
Obtain a free domain name at
duckdnsApply
certbotto generate a SSL certificateSetup a
nginxconfiguration.
Note
This is highly user-specific, and will change depending on your level of access to the host server, file location and open port(s). For this reason, no specific instructions are provided. However, following the instructions for each of the above three services should work in most scenarios.
Custom genome¶
Current version of IGV does not support genome assembly danRer7 (Zv9) by default.
Tip
If you want to add your own custom genome, you can follow the steps here.
genome_dir="igv-genomes/danRer7/"
fasta_path=${genomes_dir}/danRer7.fa
index_path=${genomes_dir}/danRer7.fa.fai
annot_path=${genomes_dir}/danRer7.gff
mapfile=${genomes_dir}/mapfile.txt
mkdir -p ${genomes_dir}
cp ${REF_FIXED_FA} ${fasta_path}
cp ${REF_FIXED_FA}.fai ${index_path}
wget 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.4_Zv9/GCF_000002035.4_Zv9_genomic.gff.gz' -o ${annot_path}
gzip -d ${annot_path}.gz
paste <(grep -v '#' ${annot_path} | cut -f1 | uniq | grep NC_007) \
<(seq 1 25 | sed 's/^/chr/') > ${mapfile}
awk 'BEGIN{FS=OFS="\t"} FNR==NR{map[$1]=$2; next} {if($1 in map) $1=map[$1]; print}' ${mapfile} ${annot_path} > tmp.gff
mv tmp.gff ${annot_path}
Note
Use the fixed fasta file and indices with the chromosome names corresponding to the bam files.
Directory structure¶
At the end, your directory structure should look like this:
vis/
├── variant_viewer.html
├── file_server.py
├── data/
│ ├── ${SAMPLE_NAME}/
│ │ ├── ${SAMPLE_NAME}_mutant.bam
│ │ ├── ${SAMPLE_NAME}_mutant.bam.bai
│ │ ├── ${REFERENCE_NAME}_reference.bam
│ │ ├── ${REFERENCE_NAME}_reference.bam.bai
│ │ ...
│ ├── ${SAMPLE_NAME}.csv
│ ├── ${SAMPLE_NAME}.json
│ └── ${SAMPLE_NAME}_coverage.csv
├── igv-dist/
│ ├── igv.css
│ └── igv.min.js
└── igv-genomes/
├── danRer7/
│ ├── danRer7.fa
│ ├── danRer7.fa.fai
│ ├── danRer7.gff
│ └── mapfile.txt
└── genomes.json
We need the mapfile.txt in this case to match chromosome names to the updated ones. Configuration files are provided as part of the repository, but larger genome and data files are not. Instructions on how to obtain or regenerate these are covered in previous steps.
Hint
For data, symlinking results/10_visualisations to vis/data should work normally after completing the previous steps in the pipeline. For visualisation, the gff file is optional but provides useful annotations for gene regions.
Core server files:
variant_viewer.html
file_server.py
Data files, where the contents of data are the same as 10_visualisations:
data/
├── ${SAMPLE_NAME}/
│ ├── ${SAMPLE_NAME}_mutant.bam
│ ├── ${SAMPLE_NAME}_mutant.bam.bai
│ ├── ${REFERENCE_NAME}_reference.bam
│ ├── ${REFERENCE_NAME}_reference.bam.bai
│ ...
├── ${SAMPLE_NAME}.csv
├── ${SAMPLE_NAME}.json
└── ${SAMPLE_NAME}_coverage.csv
IGV genome browser source files:
igv-dist/
├── igv.css
└── igv.min.js
Custom reference genomes (in this case danRer7):
igv-genomes/
├── danRer7/
│ ├── danRer7.fa
│ ├── danRer7.fa.fai
│ └── danRer7.gff (optional)
└── genomes.json
Caution
The chromosome names in the gff files must match the fasta and alignment files.
You can modify genomes.json as needed for your use case. Then modify the html directly to add the corresponding value matching the id to the drop-down list.
{
"danRer7": {
"id": "danRer7",
"name": "Zebrafish Zv9 (danRer7)",
"fastaURL": "/genomes/danRer7/danRer7.fa",
"indexURL": "/genomes/danRer7/danRer7.fa.fai",
"chromosomeOrder": "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chr23,chr24,chr25,chrM"
}
}