EMUMADZ pipeline case study 3

MIT License CC BY 4.0 GitHub Docker Build Status Docker Pulls Website Zenodo

FASTQ to SNP for F2-F2 comparisons

Copyright © 2025 Tyrone Chen ORCID logo, Richard Lupat ORCID logo, Michelle Meier ORCID logo, Maia Zethoven ORCID logo, Gregory Baillie ORCID logo, Scott Paterson ORCID logo, Oguzhan Baltaci ORCID logo, Cas Simons ORCID logo, Jason Li ORCID logo, Andrew Cox ORCID logo, Kelly Smith ORCID logo, Benjamin Hogan ORCID logo

Code in this repository is provided under a MIT license. This documentation is provided under a CC-BY-4.0 license.

Visit our lab website here. Contact Benjamin Hogan at ben.hogan@petermac.org.

Usage

Note

All steps are run in directory source.

Alignment

We align paired end reads to the genome with bwa-mem.

Tip

Approximately 12 hours of runtime per sample with 8 threads and 32GB memory.

INPUT_FILE="../data/F2-F2.20260123/fastqsheet.tsv"
REF_GENOME="../data/Reference/GCA_000002035.2_Zv9_genomic.ChrFixed.fna"
OUTPUT_DIR="../data/Alignment_File/F2-F2.20260123/"
THREADS=8

mkdir -p ${OUTPUT_DIR}

# if not already indexed
bwa index ${REF_GENOME}

while read R1 R2; do
    SAMPLE=$(basename ${R1} | sed 's/_R1.*//' | sed 's/.R1.*//')
    bwa mem -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}" \
        -t ${THREADS} ${REF_GENOME} ${R1} ${R2} | \
        samtools sort -@ ${THREADS} -o ${OUTPUT_DIR}/${SAMPLE}.bam -
    samtools index ${OUTPUT_DIR}/${SAMPLE}.bam
done < ${INPUT_FILE}

Note

If bwa mem was run without the -R, then the following can be used to add the header back in.

for i in *bam; do
    SAMPLE=$(echo ${i} | sed "s|.bam||")
    samtools addreplacerg -@ 15 -r \
        "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}" \
        --write-index -o ../${i} ${i}
    samtools index -b -@ 15 ../${i}
done

Data setup

Software requirements:

  • bcftools

  • gatk

  • samtools

  • snpEff

Note

Here we assume that bcftools,gatk,samtools,snpEff are all installed and the user has completed the steps in the install instructions.

Data requirements:

  • Reference genome fasta file

  • Samplesheet with sample information

  • Chromosome name mappings

Here is the directory structure:

project_root/
├── source/         # Scripts are run from here
├── data/           # Input data (BAM files, reference genome, samplesheet)
└── results.F2-F2.20260123/        # All output files organized by processing step
    ├── 01_variant_called
    ├── 02_chromosome_filtered
    ├── 03_stringent_filtered
    ├── 04_normalised
    ├── 05_samples_merged
    ├── 06_snps_filtered
    ├── 07_mutant_candidates
    ├── 08_annot_all
    ├── 08_annot_eff
    ├── 08_annot_vep
    ├── 09_finalised
    └── 10_visualisations

Create a samplesheet with sample identifiers, paths to bam file and the sample type:

sample_identity alignment_file  sample_type
some_sample_1   /path/to/1.bam  reference
some_sample_2   /path/to/2.bam  reference
some_sample_3   /path/to/3.bam  reference
some_sample_4   /path/to/4.bam  reference
some_sample_5   /path/to/5.bam  mutant
...             ...             ...

In our case, this is how our table looks like:

Here we setup the file structure and some other metadata.

# until step 3, this doesnt matter because all files are treated independently
ids="12-2-5-F 245-2-1-F 261-C-B-B 262-5-8-1 263-2-A-C 282-2-1-1 32-1-B 350-A-4-C 95-2-A Dut"
for i in ${ids}; do
    out="samplesheet.F2-F2.${i}.tsv"
    printf "sample_identity\talignment_file\tsample_type\n" > ${out}
    grep ${i} samplesheet.F2-F2.20260123.tsv | \
        sed -E "s/-Mut_S[0-9]+|-Sib_S[0-9]+//" >> ${out}
done
SAMPLESHEETS=$(for i in $ids; do echo samplesheet.F2-F2.$i.tsv; done)

# convenience function since it will be used in most downstream operations
setup() {
    # navigate to source directory
    if [[ ! $(basename $PWD) -eq source ]]; then
        echo "All scripts are set to run from directory source"
        echo "See directory structure above for more information"
        exit 1
    fi

    # various configuration variables
    THREADS=16

    # GATK
    GATK_MEM="64g"
    GATK_JAVA_OPTS="-Xmx${GATK_MEM} -XX:+UseParallelGC"

    # ENSEMBL VEP
    VEP_ASSEMBLY="Zv9"
    VEP_VERSION="79"
    VEP_CACHE="${HOME}/.vep"
    VEP_BUFFER="8192"

    # snpEff
    SNPEFF_CONFIG="/home/tchen/.local/share/mamba/envs/snps/share/snpeff-5.2-1/snpEff.config"
    SNPEFF_GENOME="Zv9.75"

    # paths relative to source directory
    DATA_DIR="../data/"
    RESULTS_DIR="../results.F2-F2.20260123/"
    SAMPLESHEET="${DATA_DIR}/samplesheet.F2-F2.20260123.tsv"
    REFERENCE_FA="${DATA_DIR}/Reference/GCA_000002035.2_Zv9_genomic.fna"
    REF_FIXED_FA="${DATA_DIR}/Reference/GCA_000002035.2_Zv9_genomic.ChrFixed.fna"
    CHROM_MAP="${DATA_DIR}/chrom_table.tsv"

    # parse samplesheet and load samples into array
    MUT_SAMPLES=($(awk '$3=="mutant" {print $1}' ${SAMPLESHEET}))
    REF_SAMPLES=($(awk '$3=="reference" {print $1}' ${SAMPLESHEET}))
    echo "${#MUT_SAMPLES[@]} mutants: ${MUT_SAMPLES[@]}"
    echo "${#REF_SAMPLES[@]} references: ${REF_SAMPLES[@]}"

    # create results directory structure
    mkdir -p ${RESULTS_DIR}/{01_variant_called,02_chromosome_filtered,03_stringent_filtered,04_normalised,05_samples_merged,06_snps_filtered,07_mutant_candidates,08_annot_eff,08_annot_vep,08_annot_all,09_finalised,10_visualisations}

    # set downstream directories and suffixes
    # if step 3 is actioned, edit these as needed
    PROCESSED_VCF_DIR="${RESULTS_DIR}/02_chromosome_filtered"
}

Create some metadata files. The chromosome file maps the chromosome IDs onto the chr{1..25} more human-readable format. Have the fasta genome reference file on hand.

setup_metadata() {
    # verify reference genome preparation
    if [[ ! -f "${REFERENCE_FA}.fai" ]]; then
        echo "Creating FASTA index..."
        samtools faidx ${REFERENCE_FA}
    fi
    if [[ ! -f "${REFERENCE_FA%.*}.dict" ]]; then
        echo "Creating sequence dictionary..."
        gatk CreateSequenceDictionary -R ${REFERENCE_FA}
    fi

    # same but for the fixed chromosomes
    if [[ ! -f "${REF_FIXED_FA}.fai" ]]; then
        echo "Creating FASTA index..."
        samtools faidx ${REFERENCE_FA}
    fi

    # create chromosome list (adjust range as needed)
    grep -P "^chr[0-9]+\t" ${REF_FIXED_FA}.fai > \
        ${DATA_DIR}/chromosomes.txt
}

setup
setup_metadata

Nextflow

Note

This step is equivalent to running the rest of the pipeline manually for the given sample(s).

Example of using nextflow:

samplesheet="../data/samplesheet.F2-F2.20260123.122-3-I.tsv"
reference_fa="../data/Reference/GCA_000002035.2_Zv9_genomic.fna"
ref_fixed_fa="../data/Reference/GCA_000002035.2_Zv9_genomic.ChrFixed.fna"
chrom="../data/chrom_table.tsv"
results="../results.F2-F2.20260123/122-3-I/"

nextflow run emumadz_pipeline.nf \
    --samplesheet ${samplesheet} \
    --reference_fa ${reference_fa} \
    --ref_fixed_fa ${ref_fixed_fa} \
    --chrom_map ${chrom} \
    --outdir ${results} \
    -resume

Whole zebrafish genome assembly

The ``seqliner`` assembly step is not covered in this documentation.

Variant calling and read filtering

We filter the bam files for low quality reads and perform variant calling simultaneously.

Note

We apply intentionally minimal filtering at the early stage and apply custom filters later. Here, we are mainly interested in filtering poorly mapped reads.

# this helper function retrieves bam file path given sample id
get_bam_path() {
    local sample_id=$1
    awk -v id="$sample_id" '$1==id {print $2}' ${SAMPLESHEET}
}

call_variants() {
    local sample=$1
    local sample_type=$2
    local bam_path=$(get_bam_path $sample)

    echo "Processing ${sample_type}: $sample ($bam_path)"

    # Verify BAM file and index exist
    if [[ ! -f "${bam_path}" ]]; then
        echo "No BAM: ${bam_path}"
        return 1
    fi
    if [[ ! -f "${bam_path/bam/bai}" ]]; then
        echo "No index: ${bam_path/bam/bai}"
        return 1
    fi

    # gatk HaplotypeCaller with quality filters and threading
    gatk --java-options "${GATK_JAVA_OPTS}" HaplotypeCaller \
        -R ${REFERENCE_FA} \
        -I ${bam_path} \
        -O "${RESULTS_DIR}/01_variant_calling/${sample}.vcf.gz" \
        --minimum-mapping-quality 20 \
        --min-base-quality-score 20 \
        --stand-call-conf 30 \
        --max-reads-per-alignment-start 50 \
        --max-mnp-distance 1 \
        --native-pair-hmm-threads ${THREADS} \
        --verbosity INFO

    echo "Completed variant calling for ${sample_type}: $sample"
}

setup
export REFERENCE_FA RESULTS_DIR GATK_JAVA_OPTS THREADS SAMPLESHEET

for ref in "${REF_SAMPLES[@]}"; do call_variants ${ref} reference; done
for mut in "${MUT_SAMPLES[@]}"; do call_variants ${mut} mutant; done

echo "Variant calling completed for all samples"

Caution

The above was tested given an allocation of 40 cpus and 80 GB of memory. You may want to adjust the number of parallel jobs and memory corresponding to your available resources.

Caution

gatk settings are specific to version 4.5.0.0-gcc-13.2.0 and may not work for other versions.

Optional hard filtering.

Warning

These filters may be too stringent for this use case. Only use if you are seeing an excess of false positives. Note that it is possible to later filter out false positives.

# Function to apply GATK hard filters
apply_stringent_filters() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/02_chromosome_filtered"
    local outfile_dir="${RESULTS_DIR}/03_stringent_filtered"

    echo "Applying hard filters to: $sample"

    # Apply GATK hard filters
    gatk --java-options "${GATK_JAVA_OPTS}" VariantFiltration \
        -R ${REFERENCE_FA} \
        -V "${infile_dir}/${sample}.vcf" \
        -O "${outfile_dir}/${sample}_filtered.vcf" \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
        --filter-expression "SOR > 3.0" --filter-name "SOR3" \
        --filter-expression "FS > 60.0" --filter-name "FS60" \
        --filter-expression "MQ < 40.0" --filter-name "MQ40" \
        --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
        --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8"

    # select only PASS variants
    gatk --java-options "${GATK_JAVA_OPTS}" SelectVariants \
        -R ${REFERENCE_FA} \
        -V "${outfile_dir}/${sample}_filtered.vcf" \
        -O "${outfile_dir}/${sample}.vcf" \
        --exclude-filtered

    # clean up intermediate files
    rm -f "${outfile_dir}/${sample}_filtered.vcf"

    echo "Completed stringent filtering for: $sample"
}

# store hard-filtered variants
setup
mkdir -p "${RESULTS_DIR}/03_hard_filtering"

# Process all samples
for sample in "${ALL_SAMPLES[@]}"; do
    apply_hard_filters $sample
done

Rename chromosomes in vcf with map file, and discard non-chromosomes

data/chrom_table.tsv contains the chromosome name mappings. Use this to remap the chromosomes.

At the same time, we are only interested in the chromosomes, which start with chr and go from chr1 to chr25. So we discard all other contigs (including from the header).

filter_and_rename_chromosomes() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/01_variant_called"
    local outfile_dir="${RESULTS_DIR}/02_chromosome_filtered"

    echo "Filtering chromosomes for: $sample"
    # 33550336 use extreme range to cover all
    for i in {1..25}; do
        echo -e "chr${i}\t1\t1111111111111000000000000"
    done > ../data/chromosomes_only.txt

    # we are going to paste text, so leave it decompressed
    # rename all contigs with mapping
    bcftools annotate --threads ${THREADS} \
        --rename-chr ${CHROM_MAP} \
        ${infile_dir}/${sample}.vcf.gz | \
    grep -v '##contig=<ID=FR' > "${outfile_dir}/${sample}_reannot.vcf"

    # filter to main chromosomes only and remove non-chromosomal variants from header
    bcftools view -t $(cut -f1 ${DATA_DIR}/chromosomes.txt | paste -sd,) \
        "${outfile_dir}/${sample}_reannot.vcf" \
        --threads ${THREADS} \
        -o "${outfile_dir}/${sample}_chr_temp.vcf"

    # clean up header to remove non-chromosomal contigs
    bcftools reheader -f ${DATA_DIR}/chromosomes.txt \
        "${outfile_dir}/${sample}_chr_temp.vcf" | \
    bgzip -@ ${THREADS} -c > "${outfile_dir}/${sample}.vcf.gz"

    bcftools index --threads ${THREADS} \
        "${outfile_dir}/${sample}.vcf.gz"

    # clean up intermediate files
    rm -f "${outfile_dir}/${sample}_reannot.vcf" \
        "${outfile_dir}/${sample}_chr_temp.vcf"

    echo "Completed chromosome filtering for: $sample"
}

# process all samples sequentially
setup

ALL_SAMPLES=("${MUT_SAMPLES[@]}" "${REF_SAMPLES[@]}")
for sample in "${ALL_SAMPLES[@]}"; do
    filter_and_rename_chromosomes $sample
done

Decompose multiallelic variants into individual instances

The aim of the normalisation is to preserve differential events that may occur across alleles. For example, if a reference control is heterozygous for A/T point mutation, and the mutant sample is homozygous for a G point mutation, the mutant is considered to have a SNP event.

normalise_vcf() {
    local sample=$1
    local infile_dir="${PROCESSED_VCF_DIR}"
    local outfile_dir="${RESULTS_DIR}/04_normalised/"
    local input_vcf="${infile_dir}/${sample}.vcf.gz"
    local output_vcf="${outfile_dir}/${sample}.vcf.gz"

    echo "Normalising VCF for: $sample"

    # normalize variants: decompose complex variants and left-align indels
    bcftools norm -m-both -f "${REF_FIXED_FA}" "${input_vcf}" \
        --threads ${THREADS} --write-index -Ob -o "${output_vcf}"

    echo "Completed normalisation for: $sample"
    return 0
}

echo "Normalising VCFs before merging..."

ALL_SAMPLES=("${MUT_SAMPLES[@]}" "${REF_SAMPLES[@]}")
for sample in "${ALL_SAMPLES[@]}"; do
    normalise_vcf $sample
done

Merge files with refs

Samples are then merged for side-by-side comparison between mutant samples and reference controls.

For each sample, merge the four references: - TL2209397-ENUref-Female.vcf.gz - TL2209398-ENUref-Male.vcf.gz - TL2209399-TUref-Female.vcf.gz - TL2209400-TUref-Male.vcf.gz

Each sample occupies the first slot in the vcf sample columns, followed by the references.

Caution

The order of samples and references is important.

merge_vcf() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/04_normalised/"
    local outfile_dir="${RESULTS_DIR}/05_samples_merged/"

    # create VCF list for this mutant + all references (using normalised files)
    VCF_LIST="${infile_dir}/${sample}.vcf.gz"
    for ref in "${REF_SAMPLES[@]}"; do
        VCF_LIST="${VCF_LIST} ${infile_dir}/${ref}.vcf.gz"
    done

    # merge this mutant with all references
    echo "Merging ${sample} with references..."
    bcftools merge ${VCF_LIST} --threads ${THREADS} --write-index \
        -Ob -o "${outfile_dir}/${sample}.vcf.gz"
}

for sample in "${MUT_SAMPLES[@]}"; do
    merge_vcf $sample
done

Filter for SNP only events

Other mutations are not of interest since N-ethyl-N-nitrosourea (ENU) used in the forward genetic screen mainly induces point mutations.

snps_vcf() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/05_samples_merged/"
    local outfile_dir="${RESULTS_DIR}/06_snps_filtered/"

    # filter for SNPs only
    echo "Filtering SNPs for ${sample}..."
    bcftools view -i 'TYPE="snp"' --threads ${THREADS} \
        "${infile_dir}/${sample}.vcf.gz" \
        --write-index -Ob -o "${outfile_dir}/${sample}.vcf.gz"
}

for sample in "${MUT_SAMPLES[@]}"; do
    snps_vcf $sample
done

Identify candidate SNPs

Obtain SNPs which occur in the sample but not the references. There are several criteria in our filter, summarised below: 1. mutant sample must be covered to a depth of >= 1 read. 2. At least one reference control must be covered to a depth of >= 1 read. 3. If the reference control is multiallelic for a SNP, it should not match the mutant sample.

Beyond this, a light mapping quality filter was implemented during the variant calling stage.

The reference control may be for example a grandparent or a wild-type sibling. The pipeline is agnostic to input and will run regardless. An arbitrary number of reference controls can be used.

We explain the criteria above with some examples.

Showing the concept of coverage in (1) and (2):

MUT SAMPLE           PASS                FAIL                PASS

    MAPPED      --------------                          --------------
    GENOME      ==============      ==============      ==============

REF CONTROL 1

    MAPPED      --------------                          --------------
    GENOME      ==============      ==============      ==============

REF CONTROL 2

    MAPPED      --------------
    GENOME      ==============      ==============      ==============

Showing the concept of multiallelic SNP events in (3): - ORI is the true original - REF is the reference control - MUT is the mutant sample

Let us assume, the true canonical nucleotide is C at a given position.:

==================
PLATONIC IDEAL REF
==================

ORI ------C-------
    ------C-------

=====EXAMPLE1=====
REF homozygous
MUT homozygous
SNP in ORI identical to REF
SNP in REF different to MUT
This is a SNP event

ORI ------C-------
    ------C-------

REF ------C-------
    ------C-------

MUT ------G-------
    ------G-------

=====EXAMPLE2=====
REF homozygous
MUT homozygous
SNP in ORI different to REF
SNP in REF identical to MUT
This is not a SNP event

ORI ------C-------
    ------C-------

REF ------G-------
    ------G-------

MUT ------G-------
    ------G-------

=====EXAMPLE3=====
REF homozygous
MUT heterozygous
SNP in ORI different to REF
SNP in REF different to MUT
This is a SNP event

ORI ------C-------
    ------C-------

REF ------A-------
    ------T-------

MUT ------G-------
    ------G-------

=====EXAMPLE4=====
REF heterozygous
MUT homozygous
SNP in ORI different to REF
SNP in REF partially different to MUT
This is not a SNP event

ORI ------C-------
    ------C-------

REF ------A-------
    ------G-------

MUT ------G-------
    ------G-------

=====EXAMPLE5=====
REF homozygous
MUT heterozygous
SNP in ORI different to REF
SNP in REF partially different to MUT
This is not a SNP event

ORI ------C-------
    ------C-------

REF ------G-------
    ------G-------

MUT ------A-------
    ------G-------

The three criteria are implemented together.

  1. mutant sample must be covered to a depth of >= 1 read.

  2. At least one reference control must be covered to a depth of >= 1 read.

  3. If the reference control is multiallelic for a SNP, it should not match the mutant sample.

find_candidates_vcf() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/06_snps_filtered/"
    local outfile_dir="${RESULTS_DIR}/07_mutant_candidates/"

    NUM_REFS=${#REF_SAMPLES[@]}

    # mutant AND at least one ref has >=1 read depth
    DP_FILTER="FORMAT/DP[0]>=1"
    for ((i=1; i<=NUM_REFS; i++)); do
        if [ $i -eq 1 ]; then
            DP_FILTER="${DP_FILTER} && (FORMAT/DP[${i}]>=1"
        else
            DP_FILTER="${DP_FILTER} || FORMAT/DP[${i}]>=1"
        fi
    done
    # the trailing bracket isnt a typo btw
    DP_FILTER="${DP_FILTER})"

    # instead of a strict genotyping, we relax thresholds
    # this more effectively approximates genotype instead
    # mutant needs read depth of >= 0.85 and refs <= 0.15
    AF_FILTER='FORMAT/AD[0:1]/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= 0.85'
    for ((i=1; i<=NUM_REFS; i++)); do
        # Missing data in references = assumed reference (good)
        # Only filter if there IS alt allele data and it's too high
        AF_FILTER="${AF_FILTER} && (FORMAT/AD[${i}:1]=\".\" || FORMAT/AD[${i}:0]=\".\" || FORMAT/AD[${i}:1]/(FORMAT/AD[${i}:0]+FORMAT/AD[${i}:1]) <= 0.15)"
    done

    # combine filters
    STRICT_FILTER="(${DP_FILTER}) && (${AF_FILTER})"

    bcftools filter -i "${STRICT_FILTER}" --threads ${THREADS} --write_index \
        "${infile_dir}/${sample}.vcf.gz" -Ob -o "${outfile_dir}/${sample}.vcf.gz"
}

for sample in "${MUT_SAMPLES[@]}"; do
    snps_vcf $sample
done

Screen impact of SNP

Either VEP or snpEff will produce similarly formatted results. We run both separately and combine their annotations at the end.

VEP

We run ENSEMBL’s variant effect predictor VEP on the data. Install instructions are provided separately on our install page.

Important

Since we are using an ancient zebrafish genome, a specific combination of settings must be used. A prerequisite is installing an older cached version of the genome, which is covered in the install section. If this was not done, it is unlikely that VEP will work as intended.

annot_vep() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/07_mutant_candidates/"
    local outfile_dir="${RESULTS_DIR}/08_annot_vep/"

    echo "Predicting variant impact for ${sample}..."
    vep \
        --cache \
        --dir_cache ${VEP_CACHE} \
        --species danio_rerio \
        --assembly ${VEP_ASSEMBLY} \
        --cache_version ${VEP_VERSION} \
        --offline \
        --regulatory \
        --vcf \
        --variant_class \
        --fork ${THREADS} \
        --buffer_size ${VEP_BUFFER} \
        --stats_html \
        --stats_text \
        --force_overwrite \
        --compress_output bgzip \
        --input_file "${infile_dir}/${sample}.vcf.gz" \
        --output_file "${outfile_dir}/${sample}.vcf.gz"
    bcftools index --threads ${THREADS} \
        "${outfile_dir}/${sample}.vcf.gz"
}

for sample in "${MUT_SAMPLES[@]}"; do
    annot_vep $sample
done

snpEff

We run snpEff to screen for SNPs which are predicted to be impactful. Annotations are added to the data in the process.

Warning

snpEff runs on java, and as a result may run into memory and other issues. You may need to set java memory flags accordingly.

# adjust max values accordingly
export JAVA_OPTIONS="-Xms512m -Xmx16g"
# SnpEff version SnpEff 5.2 (build 2023-09-29 06:17)
snpEff download Zv9.75

annot_eff() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/07_mutant_candidates/"
    local outfile_dir="${RESULTS_DIR}/08_annot_eff/"

    export JAVA_OPTIONS="-Xms512m -Xmx16g"

    csv_stat=${outfile_dir}/${sample}.csv
    fas_prot=${outfile_dir}/${sample}.fa
    web_stat=${outfile_dir}/${sample}.html
    out_path=${outfile_dir}/${sample}.vcf

    snpEff \
        -c "${SNPEFF_CONFIG}" \
        -csvStats "${csv_stat}" \
        -i vcf \
        -o vcf \
        -fastaProt "${fas_prot}" \
        -s "${web_stat}" \
        "${SNPEFF_GENOME}" \
        "${infile_dir}/${sample}.vcf.gz" > "${out_path}"
    bgzip -@ ${THREADS} ${out_path}
    bcftools index --threads ${THREADS} ${out_path}.gz
}

for sample in "${MUT_SAMPLES[@]}"; do
    annot_eff $sample
done

Combine SNP annotations

Both VEP and snpEff annotations are now available. These are recombined so that the aggregated information is available in one file. Note that we split the VEP and snpEff steps for efficiency, but running either method sequentially will obtain the same result.

annot_all() {
    local sample=$1
    local annot_eff="${RESULTS_DIR}/08_annot_eff/"
    local annot_vep="${RESULTS_DIR}/08_annot_vep/"
    local outfile_dir="${RESULTS_DIR}/08_annot_all/"

    bcftools merge --threads ${THREADS} \
        --force-samples \
        ${annot_vep}/${sample}.vcf.gz \
        ${annot_eff}/${sample}.vcf.gz \
        --write-index -Ob -o ${outfile_dir}/${sample}.vcf.gz
}

for sample in "${MUT_SAMPLES[@]}"; do
    annot_all $sample
done

Visualising data

Preparing data for visualisation

Ensure that contigs and scaffolds are removed.

THREADS=16
for i in $(find . -type f -name "*.bam"); do
    j="${i%.bam}_remap1.bam"
    k="${i%.bam}_remap2.bam"

    # remap chromosome names in header first
    samtools view -@ ${THREADS} -H "${i}" | \
        awk 'NR==FNR{map[$1]=$2; next} {for(old in map) gsub(old, map[old])} 1' \
        ${CHROM_MAP} - > "${i}_head"
    samtools reheader "${i}_head" "${i}" > "${j}"
    samtools index -@ ${THREADS} "${j}"

    # now filter to chr1-25
    CHRS=$(seq 1 25 | sed 's/^/chr/' | tr '\n' ' ')
    samtools view -@ ${THREADS} -b -o "${k}" "${j}" ${CHRS}

    # rework headers
    samtools view -@ ${THREADS} -H "${k}" | \
        awk '/^@SQ/{if($0 ~ /SN:chr([1-9]|1[0-9]|2[0-5])\t/) print; next} {print}' \
        > "${k}_head"
    samtools reheader "${k}_head" "${k}" > "${i}"
    samtools index -@ ${THREADS} "${i}"
    rm "${j}" "${k}" "${i}_head" "${k}_head"
done
Visualise individual SNPs (single base level resolution)

The aim is to visualise all reads at each SNP position. This allows a user to investigate individual SNPs at single base level resolution. Complements and acts as a layer of validation for genome-level visualisation.

The tables generated here contain all the necessary information which can be visualised in the user’s method of choice (e.g. R or pandas dataframes.) The bam files (subsetted or full) can be viewed in the user’s genomic browser of choice, e.g. IGV, JBrowse, gosling.

Visualising the entire genome is ideal but the size of the alignment data makes this step impractical. Therefore, we subset the bam alignment files for reads containing the final subset of SNPs only for a portable but still informative bam file. The full bam file can also be loaded and mounted on a separate volume if needed. Here we parse the final vcf files into a format suitable for visualisation.

Hint

Skipping bam file subsetting and inspecting the full file is also valid. However, portability is an on-going issue due to bam file size (in this case ~10-20GB each).

prep_bam() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/08_annot_all/"
    local outfile_dir="${RESULTS_DIR}/10_visualisations/"

    python parse_vcf.py \
        "${infile_dir}/${sample}.vcf.gz" \
        "${SAMPLESHEET}" \
        "${outfile_dir}/${sample}.csv" \
        --subset_path "${outfile_dir}/${sample}" \
        --subset_workers ${THREADS} \
        --coverage_report "${outfile_dir}/${sample}_coverage.csv" \
        --chrom_mapping ${CHROM_MAP} \
        --force_overwrite
}

for sample in "${MUT_SAMPLES[@]}"; do
    prep_bam $sample
done

The tables generated here contain all the necessary information which can be visualised in the user’s method of choice (e.g. R or pandas dataframes.) The bam files (subsetted or full) can be viewed in the user’s genomic browser of choice, e.g. IGV, JBrowse, gosling.

Visualise mutation hotspots (genome-level visualisation)

The aim is to visualise homozygosity hotspots across the entire genome. This allows a user to “zoom in” on hotspots for further investigation. Complements and acts as a layer of validation for base-level visualisation.

prep_tdf() {
    local sample=$1
    local infile_dir="${RESULTS_DIR}/06_snps_filtered/"
    local outfile_dir="${RESULTS_DIR}/10_visualisations/"

    python plot_homozygosity.py \
        "${infile_dir}/${sample}.vcf.gz" \
        ${sample} \
        -i ${REF_FIXED_FA} \
        -t 0.85 \
        -n 16
}

for sample in "${MUT_SAMPLES[@]}"; do
    prep_tdf $sample
done

We use the scoring algorithm described in Henke et al., 2013.

Note

In theory, this step can be performed directly after vcf files are merged and SNPs are filtered out. This is ok if you want to generate independent tdf files for your own use. However, in the context of this pipeline, data is appended to json files created in the previous step.

Tip

After this step, no further data is generated.

Visualisation module

Caution

This part is more involved and is intended for developers who want to host a web page to view the data.

Tip

Can also be set up on localhost to view.

Install dependencies

Install npm following the instructions for your own operating system. A few linux examples are provided.

# Ubuntu/Debian
sudo apt install nodejs npm

# CentOS/RHEL
sudo yum install nodejs npm

Hint

If you get an error saying the host name cannot be resolved, try the following (at your own risk).

sudo sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
sudo sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*

Then install igv-dist.

Note

IGV code is also redistributed in this repository under the MIT license.

npm install express

mkdir igv-dist
curl -o igv-dist/igv.min.js https://cdn.jsdelivr.net/npm/igv@2.15.11/dist/igv.min.js
curl -o igv-dist/igv.css https://cdn.jsdelivr.net/npm/igv@2.15.11/dist/igv.css

Our html file calls the local script.

<script src="/api/igv-proxy/igv.min.js"></script>
<link rel="stylesheet" href="/api/igv-proxy/igv.css">

Secure hosting

To set up a https secure host:

  1. Obtain a free domain name at duckdns

  2. Apply certbot to generate a SSL certificate

  3. Setup a nginx configuration.

Note

This is highly user-specific, and will change depending on your level of access to the host server, file location and open port(s). For this reason, no specific instructions are provided. However, following the instructions for each of the above three services should work in most scenarios.

Custom genome

Current version of IGV does not support genome assembly danRer7 (Zv9) by default.

Tip

If you want to add your own custom genome, you can follow the steps here.

genome_dir="igv-genomes/danRer7/"
fasta_path=${genomes_dir}/danRer7.fa
index_path=${genomes_dir}/danRer7.fa.fai
annot_path=${genomes_dir}/danRer7.gff
mapfile=${genomes_dir}/mapfile.txt

mkdir -p ${genomes_dir}

cp ${REF_FIXED_FA} ${fasta_path}
cp ${REF_FIXED_FA}.fai ${index_path}
wget 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.4_Zv9/GCF_000002035.4_Zv9_genomic.gff.gz' -o ${annot_path}
gzip -d ${annot_path}.gz

paste <(grep -v '#' ${annot_path} | cut -f1 | uniq | grep NC_007) \
    <(seq 1 25 | sed 's/^/chr/') > ${mapfile}

awk 'BEGIN{FS=OFS="\t"} FNR==NR{map[$1]=$2; next} {if($1 in map) $1=map[$1]; print}' ${mapfile} ${annot_path} > tmp.gff

mv tmp.gff ${annot_path}

Note

Use the fixed fasta file and indices with the chromosome names corresponding to the bam files.

Directory structure

At the end, your directory structure should look like this:

vis/
├── variant_viewer.html
├── file_server.py
├── data/
│   ├── ${SAMPLE_NAME}/
│   │   ├── ${SAMPLE_NAME}_mutant.bam
│   │   ├── ${SAMPLE_NAME}_mutant.bam.bai
│   │   ├── ${REFERENCE_NAME}_reference.bam
│   │   ├── ${REFERENCE_NAME}_reference.bam.bai
│   │   ...
│   ├── ${SAMPLE_NAME}.csv
│   ├── ${SAMPLE_NAME}.json
│   └── ${SAMPLE_NAME}_coverage.csv
├── igv-dist/
│   ├── igv.css
│   └── igv.min.js
└── igv-genomes/
    ├── danRer7/
    │   ├── danRer7.fa
    │   ├── danRer7.fa.fai
    │   ├── danRer7.gff
    │   └── mapfile.txt
    └── genomes.json

We need the mapfile.txt in this case to match chromosome names to the updated ones. Configuration files are provided as part of the repository, but larger genome and data files are not. Instructions on how to obtain or regenerate these are covered in previous steps.

Hint

For data, symlinking results/10_visualisations to vis/data should work normally after completing the previous steps in the pipeline. For visualisation, the gff file is optional but provides useful annotations for gene regions.

Core server files:

variant_viewer.html
file_server.py

Data files, where the contents of data are the same as 10_visualisations:

data/
├── ${SAMPLE_NAME}/
│   ├── ${SAMPLE_NAME}_mutant.bam
│   ├── ${SAMPLE_NAME}_mutant.bam.bai
│   ├── ${REFERENCE_NAME}_reference.bam
│   ├── ${REFERENCE_NAME}_reference.bam.bai
│   ...
├── ${SAMPLE_NAME}.csv
├── ${SAMPLE_NAME}.json
└── ${SAMPLE_NAME}_coverage.csv

IGV genome browser source files:

igv-dist/
├── igv.css
└── igv.min.js

Custom reference genomes (in this case danRer7):

igv-genomes/
├── danRer7/
│   ├── danRer7.fa
│   ├── danRer7.fa.fai
│   └── danRer7.gff (optional)
└── genomes.json

Caution

The chromosome names in the gff files must match the fasta and alignment files.

You can modify genomes.json as needed for your use case. Then modify the html directly to add the corresponding value matching the id to the drop-down list.

{
    "danRer7": {
        "id": "danRer7",
        "name": "Zebrafish Zv9 (danRer7)",
        "fastaURL": "/genomes/danRer7/danRer7.fa",
        "indexURL": "/genomes/danRer7/danRer7.fa.fai",
        "chromosomeOrder": "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chr23,chr24,chr25,chrM"
    }
}