Internal reference only

MIT License CC BY 4.0 GitHub Build Status Docker Pulls Website Zenodo

Important

This contains information about an earlier iteration of the pipeline and is only relevant if you want to compare previous in-house steps to current protocols. If you are a new user of the pipeline and see this page, you can probably ignore the information here.

Copyright © 2025 Tyrone Chen ORCID logo, Richard Lupat ORCID logo, Michelle Meier ORCID logo, Maia Zethoven ORCID logo, Gregory Baillie ORCID logo, Scott Paterson ORCID logo, Oguzhan Baltaci ORCID logo, Cas Simons ORCID logo, Jason Li ORCID logo, Andrew Cox ORCID logo, Kelly Smith ORCID logo, Benjamin Hogan ORCID logo

Code in this repository is provided under a MIT license. This documentation is provided under a CC-BY-4.0 license.

Visit our lab website here. Contact Benjamin Hogan at ben.hogan@petermac.org.

Changelog

  • Uses updated versions of gatk, bcftools, snpEff, VEP to replace obsolete functionality.

  • Uses csi instead of tbi indexing for stability with long chromosomes.

  • Reworked logic of the original pipeline to handle missing values.

  • Increased threshold for read mapping quality which was causing false negatives.

  • Lowered threshold for homozygous calls to handle leakage which was causing false negatives.

  • All-in-one pipeline which can handle any combination of mutant and control data.

  • Now handles mulitallelic SNPs correctly (i.e. if a reference has a SNP but this SNP is different to the mutant, it is correctly identified as a SNP event in the mutant.)

  • Concept of candidate strictness is removed and replaced with allele frequencies with user-defined filters.

  • Variant impact prediction now uses two independent annotation prediction software.

  • Modern visualisation methods independent of IGV genome browser.

Important information

Warning

Parts of the pipeline silently truncate file names. To fix this, original filenames were shortened. A map to the original file names were preserved.

Experimental design

We have 4 FO fish. These are mutagenised in a ENU forward genetics screen. The germline mutants are obtained and propagated until the F2 generation. The aim is to find SNPs in the F2 generation that are not present in the F0 generation.

Note

These are functionally germline mutants.

To filter out variants present in the Zebrafish reference genome, variant calling is performed against a reference in all cases. - Variants in F0 are called against the Zebrafish Zv9 danRer7 reference genome. - Variants in F2 are called against the Zebrafish Zv9 danRer7 reference genome.

VCF

The original variant caller did not pool the reference files.

BAM

Alignment bam files provide the most direct SNP readouts.

SNZL and FOSHZL usage

Usage tested on CentOS7 only. Works in apptainer/singularity containers with CentOS7. Note that the software is no longer available on pypi.

FOSHZL is used to create homozygosity plots. These appear as peaks showing regions of interest on genome browsers. Note that these regions are tiled for visualisation purposes (ie not single base pair resolution).

SNZL performs various calculations to determine if a SNP is a good candidate. SNZL claims to account for the following criteria:

  1. The mutant must be covered to a depth of >= 1 read

  2. >= of the unaffected sibling or other samples must be covered to a depth of >= 1 read

  3. If the mutant and sibling have a single allele, it must not be the same in both

  4. The mutant must have a majority allele that is not the majority allele in any of the ‘other’ samples

Usage

Data setup

Original data (reference only)

Caution

The original data setup was sample-centric as opposed to more conventional process-centric directory structure. In addition, code had hardcoded paths which limited file movement. This impacted all steps of analysis and reduced reproducibility. To lower the impact of the original file layout, the data directory now contains samplesheets with updated file paths to symlinks and their attributes. However, it is possible that some errors remain.

Run 1.extract_filepaths.sh with the required command line arguments to generate the samplesheets with paths to old files. The samplesheets are tab separated files with the following fields:

Samplesheet column information

Field

Description

Sample_Identity

Unique identifier for the sample

Fastq_File

Path to the raw FASTQ/.gz file

Alignment_File

Path to the aligned BAM file

VCF_Original

Path to the original VCF file from GATK variant calling

VCF_Merged

Path to the merged VCF file per chromosome

VCF_ChrFixed

Path to the VCF file with correctly renamed chromosomes

VCF_Annotated

Path to the annotated VCF file post-SNPEff run

VCF_Candidates

Path to the VCF file containing strict candidate information

Vis_NoGaps_NBases*

Number of bases not in gaps (bedgraph + tdf for visualisation)

Vis_NoGaps_NSnps*

Number of SNPs not in gaps (bedgraph + tdf for visualisation)

Vis_WithGaps_NBases*

Number of bases in gaps (bedgraph + tdf for visualisation)

Vis_WithGaps_NSnps*

Number of SNPs in gaps (bedgraph + tdf for visualisation)

Metadata_Json

Path to the JSON file containing metadata for visualisation

Sample_Short

Short identifier for the sample

Warning

A downstream step silently truncates file names if it exceeds a certain limit. No explanation for this behaviour was available. The 1.extract_filepaths.sh script will try to “guess” file paths by performing an arbitrary truncation and subsequent substring matching.

Note

Vis_ fields should contain the file name but not the file extensions, both bedgraph and tdf files will be generated. Note that both files contain the same information but tdf is optimised for viewing with the IGV Genome Browser.

Example samplesheet showing one sample

Sample_Identity

Fastq_File

Alignment_File

VCF_Original

VCF_Merged

VCF_ChrFixed

VCF_Annotated

VCF_Candidates

Snzl_NoGaps_NBases

Snzl_NoGaps_NSnps

Snzl_WithGaps_NBases

Snzl_WithGaps_NSnps

Json

Sample_Short

TL2211776-1-5-MAN-20220915

NA

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Bam/TL2211776-1-5-MAN-20220915_merged.markdups.bam

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_gcv_UG.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_Ref_merged.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_Ref_merged_ChromFixed.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_Ref_merged_ChromFixed.vcf.annotated.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/candidates.vcf

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.nogap_nbases

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.nogap_nsnps

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.withgap_nbases

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.withgap_nsnps

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/TL2211776-1-5-MAN-20220915_candidates.json

TL2211776-1-5

The 2.validate_samples.py script checks the validity of each file in the samplesheet per sample. This also drops the fastq column. Automatically runs in 1.extract_filepaths.sh.

python 2.validate_samples.py ../data/samplesheet.tmp -o ../data/samplesheet.tsv

Caution

Custom directories and/or files exist since the data passed through multiple iterations. To some extent this is accommodated in the setup and validation scripts, but make sure to double-check everything.

For the purposes of rerunning the postprocessing, we only want the bam alignment files and original vcf files, since we will be recreating everything after this step. Running 3.setup_dataset.sh will create the corresponding input and output directories:

../data/Alignment_File
../data/VCF_Original
../results/Sample_Identity
../results/Fastq_File
../results/Alignment_File
../results/VCF_Merged
../results/VCF_ChrFixed
../results/VCF_Annotated
../results/VCF_Candidates
../results/Snzl_NoGaps_NBases
../results/Snzl_NoGaps_NSnps
../results/Snzl_WithGaps_NBases
../results/Snzl_WithGaps_NSnps
../results/Json

While the bam and vcf files are not directly included in this repository due to size constraints, md5 sums are preserved in data/Alignment_File/bam.md5 and data/VCF_Original/vcf.md5.

Four samplesheet files are generated: - samplesheet.220701_A01221_0125_BHCCHNDMXY.tsv - samplesheet.220930_A00692_316_AHVMJFDSX3.tsv - samplesheet.221014_A00692_0319_BHGJ7CDMXY.tsv - samplesheet.231201_A00692_0394_231208_A00692_0396.tsv

The samplesheet.220701_A01221_0125_BHCCHNDMXY.tsv contains F0 grandparent reference generation metadata. All other samplesheets contain F2 generation metadata. samplesheet_original.tsv contains these legacy file paths.

3.setup_dataset.sh copies bam alignment files and vcf variant calling files over from their specified locations in samplesheet_original.tsv to data/Alignment_File and data/VCF_Original respectively. We generate samplesheet_postprocess.tsv for use in this postprocessing analysis.

Usage

High-level automated approach

A nextflow pipeline developed by the Peter MacCallum Cancer Centre Bioinformatics Core is also available. This ingests raw fastq files as input, performs alignments to generate bam files, and performs variant calling to generate vcf files. Note that the paths are hardcoded at time of writing. Using this pipeline incorporates all steps below.

Data setup

Original data

Caution

The original data setup was sample-centric as opposed to more conventional process-centric directory structure. In addition, code had hardcoded paths which limited file movement. This impacted all steps of analysis and reduced reproducibility. To lower the impact of the original file layout, the data directory now contains samplesheets with updated file paths to symlinks and their attributes. However, it is possible that some errors remain.

Run 1.extract_filepaths.sh with the required command line arguments to generate the samplesheets with paths to old files. The samplesheets are tab separated files with the following fields:

Samplesheet column information

Field

Description

Sample_Identity

Unique identifier for the sample

Fastq_File

Path to the raw FASTQ/.gz file

Alignment_File

Path to the aligned BAM file

VCF_Original

Path to the original VCF file from GATK variant calling

VCF_Merged

Path to the merged VCF file per chromosome

VCF_ChrFixed

Path to the VCF file with correctly renamed chromosomes

VCF_Annotated

Path to the annotated VCF file post-SNPEff run

VCF_Candidates

Path to the VCF file containing strict candidate information

Vis_NoGaps_NBases*

Number of bases not in gaps (bedgraph + tdf for visualisation)

Vis_NoGaps_NSnps*

Number of SNPs not in gaps (bedgraph + tdf for visualisation)

Vis_WithGaps_NBases*

Number of bases in gaps (bedgraph + tdf for visualisation)

Vis_WithGaps_NSnps*

Number of SNPs in gaps (bedgraph + tdf for visualisation)

Metadata_Json

Path to the JSON file containing metadata for visualisation

Sample_Short

Short identifier for the sample

Warning

A downstream step silently truncates file names if it exceeds a certain limit. No explanation for this behaviour was available. The 1.extract_filepaths.sh script will try to “guess” file paths by performing an arbitrary truncation and subsequent substring matching.

Note

Vis_ fields should contain the file name but not the file extensions, both bedgraph and tdf files will be generated. Note that both files contain the same information but tdf is optimised for viewing with the IGV Genome Browser.

Example samplesheet showing one sample

Sample_Identity

Fastq_File

Alignment_File

VCF_Original

VCF_Merged

VCF_ChrFixed

VCF_Annotated

VCF_Candidates

Snzl_NoGaps_NBases

Snzl_NoGaps_NSnps

Snzl_WithGaps_NBases

Snzl_WithGaps_NSnps

Json

Sample_Short

TL2211776-1-5-MAN-20220915

NA

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Bam/TL2211776-1-5-MAN-20220915_merged.markdups.bam

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_gcv_UG.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_Ref_merged.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_Ref_merged_ChromFixed.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/Vcf/TL2211776-1-5-MAN-20220915_Ref_merged_ChromFixed.vcf.annotated.vcf.gz

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/candidates.vcf

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.nogap_nbases

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.nogap_nsnps

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.withgap_nbases

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/bedgraph/TL2211776-1-5-MAN-20220915.withgap_nsnps

/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/Alignment_VariantCall/220930_A00692_316_AHVMJFDSX3/TL2211776-1-5-MAN-20220915/TL2211776-1-5-MAN-20220915_candidates.json

TL2211776-1-5

The 2.validate_samples.py script checks the validity of each file in the samplesheet per sample. This also drops the fastq column. Automatically runs in 1.extract_filepaths.sh.

python 2.validate_samples.py ../data/samplesheet.tmp -o ../data/samplesheet.tsv

Caution

Custom directories and/or files exist since the data passed through multiple iterations. To some extent this is accommodated in the setup and validation scripts, but make sure to double-check everything.

For the purposes of rerunning this experiment, we only want the bam alignment files since we will be recreating everything starting from the variant calling step. Running 3.setup_dataset.sh will create the corresponding input and output directories:

../data/Alignment_File
../results/Sample_Identity
../results/Fastq_File
../results/Alignment_File
../results/VCF_Original
../results/VCF_Merged
../results/VCF_ChrFixed
../results/VCF_Annotated
../results/VCF_Candidates
../results/Snzl_NoGaps_NBases
../results/Snzl_NoGaps_NSnps
../results/Snzl_WithGaps_NBases
../results/Snzl_WithGaps_NSnps
../results/Json

While the bam files are not directly included in this repository due to size constraints, md5 sums are preserved in data/Alignment_File/bam.md5.

Four samplesheet files are generated: - samplesheet.220701_A01221_0125_BHCCHNDMXY.tsv - samplesheet.220930_A00692_316_AHVMJFDSX3.tsv - samplesheet.221014_A00692_0319_BHGJ7CDMXY.tsv - samplesheet.231201_A00692_0394_231208_A00692_0396.tsv

The samplesheet.220701_A01221_0125_BHCCHNDMXY.tsv contains F0 grandparent reference generation metadata. All other samplesheets contain F2 generation metadata. samplesheet_original.tsv contains these legacy file paths.

This dataset

3.setup_dataset.sh copies bam alignment files over from their specified locations in samplesheet_original.tsv to data/Alignment_File. A new samplesheet_new.tsv is created for the latest analysis.

Whole zebrafish genome assembly

The ``seqliner`` assembly step is not covered in this documentation. The pipeline starts from the aligned bam files

Variant calling

NCBI reference genome

Download the reference genome here. Note that the chromosome names may not be the same as the newly assembled ones.

F0 genomes

There are 4 F0 references. - TL2209397-ENUref-Female-6-MAN-20220655 - TL2209398-ENUref-Male-6-MAN-20220656 - TL2209399-TUref-Female-4-MAN-20220657 - TL2209400-TUref-Male-4-MAN-20220658

F2 genomes

There are 44 F2 samples.

Call references against the NCBI reference

The vcf files are generated by calling variants from the: - NCBI reference genome against the F0 genomes - NCBI reference genome against the F2 genomes

We want to find SNPs in the F2 generation which are not in the F1 generation.

F0 pooling

Perform the variant calling for the F0, pooling samples.

Attention

In the first iteration, samples were aggregated only but not pooled during variant calling.

Run the 4.run_gatk.sh script and follow the instructions.

Caution

The script is not designed to be run in one go. Each step submits a series of slurm jobs, which generates the files used in the next stage of the pipeline.

Attention

The original iteration of the pipeline used an older variant caller UnifiedGenotyper. This method is now obsolete and documentation is sparse. We use the updated HaplotypeCaller, which is functionally similar and has a higher performance. We follow the pipeline described here.

3. Generate homozygosity peaks for visualisation

4. Quantify SNP impact

5. Identify strict candidate SNPs

New way

Example on chromosome 24 of sample TL2312073-163-4L-MAN-20231_Ref_merged_ChromFixed.vcf.annotated.vcf.

infile_path="TL2312073-163-4L-MAN-20231_Ref_merged_ChromFixed.vcf.annotated.vcf"
greppattern='./.:.:.:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./.:.:.:.:.$'
head -n 1172 ${infile_path} > chr24.vcf
grep chr24 ${infile_path} >> chr24.vcf
grep -P ${greppattern} chr24.vcf > chr24.strict.vcf

Custom module

Description of claims:

The snzl module claims to identify strict candidate SNPs using the following criteria: 1. The mutant must be covered to a depth of >= 1 read 2. At least one of the unaffected sibling or other samples must be covered to a depth of >= 1 read 3. If the mutant and sibling have a single allele, it must not be the same in both 4. The mutant must have a majority allele that is: a. not the majority allele in any of the ‘other’ samples

Unit tests of ``snzl``

Setup test environment

Caution

These steps are specific to the Peter Mac compute cluster

Load singularity container. You can either load it manually:

apptainer shell --bind \
   /team_folders/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/,/etc/profile.d/ --home /hogan_lab/Hogan_Lab_People/Tyrone_Chen/repos/hogan-lab-shared-repository/wgs_vbm/ /config/spack/containers/centos7/container.sif
source /etc/profile.d/modules.sh
source /team_folders/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/gjbzebrafishtools/venv/bin/activate
module load tabix
module load bcftools

Or specify this as an option in the sbatch script:

#SBATCH --container /config/spack/containers/centos7/container.sif
source /etc/profile.d/modules.sh
export LD_LIBRARY_PATH=/config/binaries/gsl/2.7.1/lib:/config/binaries/gcc/12.2.0/lib64:/config/binaries/R/4.2.0.Core/lib64/R/lib:/config/binaries/python/3.8.1/lib:/config/binaries/hdf5/1.10.5/lib
source /team_folders/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/gjbzebrafishtools/venv/bin/activate

sinteractive option:

sinteractive -p rhel_short --time 0-08:00:00 --mem 64G --container /config/spack/containers/centos7/container.sif
source /etc/profile.d/modules.sh
source /team_folders/hogan_lab/Hogan_Lab_Shared/Zebrafish_Homozygosity/gjbzebrafishtools/venv/bin/activate

Warning

Library snzl only works on CentOS7.

Warning

Library snzl no longer exists in public.

Test design

Check the original assumptions by manually editing the VCF, varying SNP and DP, and see if snzl picks it up.

To test these claims, in each case a vcf file of a target sample was subsampled to generate a small test sample. Most headers were removed.

All of the steps below are carried out in the test directory.

Controls (DP quantity)

Note

DP indicates read depth. Discussion on an ideal read depth is outside the scope of this document. 10-30 is generally considered OK.

Five entries with varying levels of read depth {0,10,20,30,40}. In this sample, SNPs are not detected in any references, making this a strict candidate. SNP effect was predicted to be MODERATE, matching filter threshold. For the purposes of this test, SNP positions are artificial.

head -n1172 TL2312073-163-4L-MAN-20231_Ref_merged_ChromFixed.vcf.annotated.vcf > header.vcf
(cat header.vcf; for i in {1..5}; do echo $(grep chr24 TL2312073-163-4L-MAN-20231_Ref_merged_ChromFixed.vcf.annotated.vcf | grep -m 1 423800); done) > test_dp.tmp
tac test_dp.tmp | sed -e "1,5s| |\t|g" -e "1s/DP=10/DP=40/" -e "1s/423800/5/" -e "2s/DP=10/DP=30/" -e "2s/423800/4/" -e "3s/DP=10/DP=20/" -e "3s/423800/3/" -e "4s/423800/2/" -e "5s/423800/1/" -e "5s/DP=10/DP=0/" | tac > test_dp.vcf
rm test_dp.tmp
bgzip test_dp.vcf
bcftools index -t test_dp.vcf.gz

Here is an example of one entry in the file. A strict candidate is chosen where only the sample contains the SNP, and it is absent in the F0 generation. Only the DP value is modified in each iteration:

chr24   5       .       G       T       269.98  .       BaseQRankSum=0;Dels=0;ExcessHet=3.0103;FS=0;HaplotypeScore=0.9997;MQ=60;MQ0=0;MQRankSum=0;QD=27;ReadPosRankSum=1.036;SOR=0.307;DP=40;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gtt/Ttt|V7F|128|RECK|protein_coding|CODING|ENSDART00000129135|1|T|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||2776||157|INSC|protein_coding|CODING|ENSDART00000131091||T|WARNING_TRANSCRIPT_NO_STOP_CODON)      GT:AD:DP:GQ:PL  0/1:1,9:10:13:298,0,13  ./.:.:.:.:.    ./.:.:.:.:.      ./.:.:.:.:.     ./.:.:.:.:.

The snzl strict candidate pipeline is run.

infile_path="test_dp.vcf.gz"
outfile_path="test_dp_out.vcf"
sample_name="TL2312073-163-4L-MAN-20231116"
chromosome="chr24"

# see only 2 alleles (not sure what cases would result in >2 in zebrafish)
bcftools view --max-alleles 2 ${infile_path} ${chromosome} | \
   snzl --no-filtered --output ${outfile_path} \
      - candidate_snp -csm ${sample_name} \
      snpeff -sem ${sample_name} -see EFF -semi MODERATE

We expect the strict candidate to be detected where read depths > 10.

Instead, no candidates were detected:

grep -v '#' test_dp_out.vcf | wc -l
# get zero lines detected
Controls (SNP quantity)

Note

./.:.:.:.:. indicates a non-SNP event in the corresponding sample.

Sample with varying numbers of SNPs across the four references {0,1,2,3,4}. In this sample, read depth is set to 100. SNP effect was predicted to be MODERATE, matching filter threshold. For the purposes of this test, SNP positions are artificial:

1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1

This sample was identified as a region of high interest by the original pipeline.

OUT="test_snp.vcf"
SNP='1/1:0,16:16:15:158,15,0'
NAN='./.:.:.:.:.'

head -n1172 TL2312073-163-4L-MAN-20231_Ref_merged_ChromFixed.vcf.annotated.vcf > header.vcf
mv header.vcf ${OUT}
paste -d '\t' \
   <(grep -m 1 5996897 chr24.vcf | cut -f1-10 | sed "s|5996897|1|") \
   <(printf "${NAN}\t${NAN}\t${NAN}\t${NAN}\n") >> ${OUT}
paste -d '\t' \
   <(grep -m 1 5996897 chr24.vcf | cut -f1-10 | sed "s|5996897|2|") \
   <(printf "${SNP}\t${NAN}\t${NAN}\t${NAN}\n") >> ${OUT}
paste -d '\t' \
   <(grep -m 1 5996897 chr24.vcf | cut -f1-10 | sed "s|5996897|3|") \
   <(printf "${SNP}\t${SNP}\t${NAN}\t${NAN}\n") >> ${OUT}
paste -d '\t' \
   <(grep -m 1 5996897 chr24.vcf | cut -f1-10 | sed "s|5996897|4|") \
   <(printf "${SNP}\t${SNP}\t${SNP}\t${NAN}\n") >> ${OUT}
paste -d '\t' \
   <(grep -m 1 5996897 chr24.vcf | cut -f1-10 | sed "s|5996897|5|") \
   <(printf "${SNP}\t${SNP}\t${SNP}\t${SNP}\n") >> ${OUT}
sed -i "s|DP=25|DP=100|" $OUT
bgzip ${OUT}
bcftools index -t ${OUT}.gz

Here is an example of one entry in the file. Only the SNP field is modified in each iteration, with an increasing number of blanks: ./.:.:.:.:.

chr24   5       .       A       T       129.9   .       BaseQRankSum=0;Dels=0;ExcessHet=3.0103;FS=0;HaplotypeScore=0;MQ=12.77;MQ0=6;MQRankSum=0.967;QD=3.97;ReadPosRankSum=0.967;SOR=1.179;DP=100;AF=1;MLEAC=2;MLEAF=1;AN=4;AC=4;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gaA/gaT|E32D|310|si:ch211-193e5.4|protein_coding|CODING|ENSDART00000132686|2|T|WARNING_TRANSCRIPT_INCOMPLETE),UPSTREAM(MODIFIER||1545||310|si:ch211-193e5.3|protein_coding|CODING|ENSDART00000077933||T|WARNING_TRANSCRIPT_INCOMPLETE),UPSTREAM(MODIFIER||2423||279|si:ch211-193e5.3|protein_coding|CODING|ENSDART00000153736||T|WARNING_TRANSCRIPT_NO_START_CODON),UPSTREAM(MODIFIER||237||278|si:ch211-193e5.4|protein_coding|CODING|ENSDART00000077922||T|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||2894|||CT573382.1|miRNA|NON_CODING|ENSDART00000119433||T),DOWNSTREAM(MODIFIER||705|||CT573382.2|miRNA|NON_CODING|ENSDART00000119567||T),DOWNSTREAM(MODIFIER||2377||503|acbd5a|protein_coding|CODING|ENSDART00000135124||T),DOWNSTREAM(MODIFIER||2377||502|acbd5a|protein_coding|CODING|ENSDART00000122018||T),DOWNSTREAM(MODIFIER||2377||501|acbd5a|protein_coding|CODING|ENSDART00000007373||T)      GT:AD:DP:GQ:PL  1/1:7,2:9:6:63,6,0      1/1:0,16:16:15:158,15,01/1:0,16:16:15:158,15,0  1/1:0,16:16:15:158,15,0 1/1:0,16:16:15:158,15,0

The snzl strict candidate pipeline is run.

infile_path="test_snp.vcf.gz"
outfile_path="test_snp_out.vcf"
sample_name="TL2312073-163-4L-MAN-20231116"
chromosome="chr24"

# see only 2 alleles (not sure what cases would result in >2 in zebrafish)
bcftools view --max-alleles 2 ${infile_path} ${chromosome} | \
   snzl --no-filtered --output ${outfile_path} \
      - candidate_snp -csm ${sample_name} \
      snpeff -sem ${sample_name} -see EFF -semi MODERATE

We expect that all strict candidates will be retained and all non-strict candidates discarded.

Instead, all non-strict candidates were retained and all strict candidates were discarded:

grep -v '#' test_snp_out.vcf | grep -P './.:.:.:.:.\t./.:.:.:.:.\t./.:.:.:.:.\t./.:.:.:.:.' | wc -l
# 0
Controls (SNP presence and DP quantity)

Next, both the tests are combined. test_snp.vcf.gz entries are duplicated and assigned DP values of either 1 or 100.

(cat test_snp.vcf; grep -v '#' test_snp.vcf | sed "s|DP=100|DP=1|") | tac | sed -e "1s|\t5\t|\t50\t|" -e "2s|\t4\t|\t40\t|" -e "3s|\t3\t|\t30\t|" -e "4s|\t2\t|\t20\t|" -e "5s|\t1\t|\t10\t|" | tac > test_snp_dp.vcf
bgzip test_snp_dp.vcf
bcftools index -t test_snp_dp.vcf.gz

The snzl strict candidate pipeline is run.

infile_path="test_snp_dp.vcf.gz"
outfile_path="test_snp_dp_out.vcf"
sample_name="TL2312073-163-4L-MAN-20231116"
chromosome="chr24"

# see only 2 alleles (not sure what cases would result in >2 in zebrafish)
bcftools view --max-alleles 2 ${infile_path} ${chromosome} | \
   snzl --no-filtered --output ${outfile_path} \
      - candidate_snp -csm ${sample_name} \
      snpeff -sem ${sample_name} -see EFF -semi MODERATE

We expect that all strict candidates will be retained passing a read depth threshold and all non-strict candidates discarded. Instead all non-strict candidates were retained regardless of read depth.:

grep -v '#' test_dp_out.vcf | wc -l
# get 8 lines detected, all non-strict and full range of read depths
Test snzl strict candidate filtering

Use the same file as above, but vary snzl settings.

infile_path="test_snp_dp.vcf.gz"
outfile_path="test_snp_dp_csm_out.vcf"
sample_name="TL2312073-163-4L-MAN-20231116"
chromosome="chr24"

# see only 2 alleles (not sure what cases would result in >2 in zebrafish)
bcftools view --max-alleles 2 ${infile_path} ${chromosome} | \
   snzl --no-filtered --output ${outfile_path} \
      - candidate_snp -css -csm ${sample_name} \
      snpeff -sem ${sample_name} -see EFF -semi MODERATE

We expect strict candidate entries to be returned, but instead no entries are returned. Upscaling this test to a full sample had the same result (not shown for size reasons).

6. Filter out predicted high impact SNPs

7. Filter out only SNPs (excluding CNV, INDELS, etc)

Example hyperparameter.json file
echo TEST