API Reference

MIT License CC BY 4.0 GitHub Build Status Docker Pulls Website Zenodo Copyright © 2025 Tyrone Chen ORCID logo, Richard Lupat ORCID logo, Michelle Meier ORCID logo, Maia Zethoven ORCID logo, Gregory Baillie ORCID logo, Scott Paterson ORCID logo, Oguzhan Baltaci ORCID logo, Cas Simons ORCID logo, Jason Li ORCID logo, Andrew Cox ORCID logo, Kelly Smith ORCID logo, Benjamin Hogan ORCID logo

Code in this repository is provided under a MIT license. This documentation is provided under a CC-BY-4.0 license.

Visit our lab website here. Contact Benjamin Hogan at ben.hogan@petermac.org.

class emumadz.parse_vcf.VCFParser(vcf_file: str, samplesheet_file: str, chrom_mapping_file: str | None = None)[source]

Bases: object

calculate_allele_freq(genotype: str) Tuple[float | None, str | None][source]

Calculate allele frequency from genotype.

Parameters:

genotype (str) – Genotype string (e.g., “0/1”, “1/1”)

Returns:

Tuple of (frequency_float, count_string)

Return type:

Tuple[Optional[float], Optional[str]]

create_bam_subset(bam_file: str, variant_positions: List[Tuple[str, int]], output_path: str) bool[source]

Create subset BAM containing reads around variant positions.

Parameters:
  • bam_file (str) – Path to source BAM file

  • variant_positions (List[Tuple[str, int]]) – List of (chromosome, position) tuples

  • output_path (str) – Output BAM file path

Returns:

Success status

Return type:

bool

create_reference_bam_subsets(subset_path: str, variant_positions: List[Tuple[str, int]], force_overwrite: bool = False) Dict[str, str][source]

Create chromosome-renamed reference BAM subsets for all reference samples.

Parameters:
  • subset_path (str) – Output directory for subset BAMs

  • variant_positions (List[Tuple[str, int]]) – List of (chromosome, position) tuples

  • force_overwrite (bool) – Overwrites bam files

Returns:

Dictionary mapping sample_id to output BAM path

Return type:

Dict[str, str]

generate_coverage_report(data: DataFrame, subset_dir: str) DataFrame[source]

Generate coverage report for all BAM subsets.

Parameters:
  • data (pd.DataFrame) – DataFrame with variant data

  • subset_dir (str) – Directory containing subset BAMs

Returns:

Coverage report DataFrame

Return type:

pd.DataFrame

get_impact_color(impact: str) str[source]

Get color coding for impact level.

Parameters:

impact (str) – Impact level string (HIGH, MODERATE, LOW, MODIFIER)

Returns:

Hex color code

Return type:

str

get_relative_bam_file(output_file: str, bam_filename: str, subset_path: str) str[source]

Get the correct relative path from output JSON to BAM file.

Parameters:
  • output_file (str) – Path to the output JSON/CSV file

  • bam_filename (str) – BAM filename (e.g., “sample_mutant.bam”)

  • subset_path (str) – Directory where BAM subsets are created

Returns:

Relative path from JSON to BAM file

Return type:

str

match_sample(vcf_sample: str, column_index: int) Tuple[str | None, str | None][source]

Match VCF sample to samplesheet entry based on column position.

Parameters:
  • vcf_sample (str) – Sample name from VCF

  • column_index (int) – Zero-based index of the sample column in VCF

Returns:

Tuple of (bam_file, sample_type)

Return type:

Tuple[Optional[str], Optional[str]]

parse_ann_field(ann_string: str) List[Dict[str, str]][source]

Parse SnpEff ANN field.

Parameters:

ann_string (str) – SnpEff ANN annotation string

Returns:

List of parsed annotation dictionaries

Return type:

List[Dict[str, str]]

parse_csq_field(csq_string: str) List[Dict[str, str]][source]

Parse VEP CSQ field.

Parameters:

csq_string (str) – VEP CSQ annotation string

Returns:

List of parsed annotation dictionaries

Return type:

List[Dict[str, str]]

process_sample_variants(sample_data: Dict[str, str | DataFrame], force_overwrite: bool = False) bool[source]

Process all variants for a single sample and create subset BAM.

Parameters:
  • sample_data (Dict[str, Union[str, pd.DataFrame]]) – Dictionary containing sample info and variants

  • force_overwrite (bool) – Whether to overwrite existing files

Returns:

Success status

Return type:

bool

process_vcf(output_file: str, subset_path: str | None = None, max_workers: int = 4, force_overwrite: bool = False, coverage_report: str | None = None) DataFrame[source]

Process VCF and create analysis table.

Parameters:
  • output_file (str) – Output CSV file path

  • subset_path (Optional[str]) – Directory for subset BAM files

  • max_workers (int) – Number of parallel workers for BAM subsetting

  • force_overwrite (bool) – Whether to overwrite existing files

  • coverage_report (Optional[str]) – Path for coverage report CSV

Returns:

DataFrame with processed variants

Return type:

pd.DataFrame

sample_info

Parse annotated VCF files and create BAM subsets for variant analysis.

This class processes VCF files with VEP/SnpEff annotations, matches samples to BAM files, and optionally creates chromosome-renamed BAM subsets around variant positions for visualization.

Parameters:
  • vcf_file (str) – Path to annotated VCF file

  • samplesheet_file (str) – Tab-separated file mapping sample IDs to BAM paths and types

  • chrom_mapping_file (Optional[str]) – Optional TSV file mapping old to new chromosome names

Example

>>> parser = VCFParser('variants.vcf', 'samples.tsv', 'chr_mapping.tsv')
>>> df = parser.process_vcf('output.csv', subset_path='bam_subsets/')
validate_bam_subset(subset_path: str, expected_variants: List[Tuple[str, int]]) Tuple[bool, str | Dict][source]

Validate that a BAM subset contains reads covering expected variant positions.

Parameters:
  • subset_path (str) – Path to subset BAM file

  • expected_variants (List[Tuple[str, int]]) – List of (chromosome, position) tuples

Returns:

Tuple of (success_bool, result_dict)

Return type:

Tuple[bool, Union[str, Dict]]

emumadz.parse_vcf.main() None[source]
class emumadz.plot_homozygosity.HenkeScorer(min_coverage: int = 1, homozygosity_threshold: float = 0.85)[source]

Bases: HomozygosityScorer

Scorer: (hom / het) × (missing_ref_alleles / present_ref_alleles)

Implements the Henke et al. (2013) mapping score formula, which identifies regions where mutants are homozygous for alleles that differ from reference strains. High scores indicate candidate linked regions.

The score is calculated as:

(homozygous_SNPs / heterozygous_SNPs) × (ref_alleles_absent / ref_alleles_present)

min_coverage

Minimum read depth required for a sample to be considered

homozygosity_threshold

Allele fraction threshold for “soft” homozygosity (0.85 means ≥85% of reads support one allele counts as homozygous)

Example

scorer = HenkeScorer(min_coverage=10, homozygosity_threshold=0.85)

calculate_score(record, variant: VariantData, sample_name: str, ref_samples: List[str]) Dict[str, int][source]

Calculate per-variant homozygosity counts for window aggregation.

Examines one VCF record and returns count flags that are later summed across genomic windows to compute the Henke score.

Parameters:
  • record – pysam VCF record object

  • variant – VariantData container with position info

  • sample_name – Name of mutant sample

  • ref_samples – List of reference sample names

Returns:

  • mut_is_hom: 1 if mutant is homozygous

  • mut_is_het: 1 if mutant is heterozygous

  • mut_has_map: 1 if all reference alleles present in mutant

  • mut_miss_map: 1 if any reference alleles missing from mutant

Return type:

Dictionary with binary flags (0 or 1)

Example

>>> counts = scorer.calculate_score(record, variant, "mutant1", ["ref1", "ref2"])
>>> counts
{'mut_is_hom': 1, 'mut_is_het': 0, 'mut_has_map': 0, 'mut_miss_map': 1}
get_name() str[source]
requires_references() bool[source]
class emumadz.plot_homozygosity.HomozygosityAnalyser(scorer: HomozygosityScorer, window_size: int = 10000, step_size: int = 1000, use_snp_windows: bool = False, snp_window_size: int = 200, snp_step_size: int = 20, samplesheet_path: str = None)[source]

Bases: object

Processes VCF files to generate homozygosity scores across genomic windows.

Supports two windowing strategies:
  • BP-based: Fixed genomic coordinates (e.g., 10kb windows)

  • SNP-based: Fixed number of variants (e.g., 200 SNP windows)

BP windows are useful for uniform genome coverage, while SNP windows normalize for variable mutation density across the genome.

scorer

HomozygosityScorer instance (e.g., HenkeScorer)

window_size

Window size in bp (BP mode) or SNP count (SNP mode)

step_size

Step size in bp (BP mode) or SNP count (SNP mode)

use_snp_windows

If True, use SNP-count windows; if False, use BP windows

Example

>>> scorer = HenkeScorer(min_coverage=10)
>>> analyser = HomozygosityAnalyser(scorer, window_size=50000, step_size=5000)
>>> variants_df, windows_df, sample = analyser.process_vcf("data.vcf", False, 8)
process_vcf(vcf_path: str, all_variants: bool, ncpu: int) Tuple[DataFrame, DataFrame, str][source]

Process entire VCF file with parallel chromosome processing.

Automatically infers sample names from VCF header (first sample = mutant, remaining samples = references). Processes each chromosome in parallel and combines results.

Parameters:
  • vcf_path – Path to VCF file (must have AD format field)

  • all_variants – If True, process all variant types; if False, SNPs only

  • ncpu – Number of CPU cores for parallel processing (one per chromosome optimal)

Returns:

Tuple of (combined_variants_df, combined_windows_df, sample_name, display_name) - combined_variants_df: All per-variant counts across genome - combined_windows_df: All window scores across genome - sample_name: Name of mutant sample - display_name: Name shown on display

Raises:

ValueError – If VCF has <2 samples or samples not found in header

Example

>>> analyser = HomozygosityAnalyser(scorer, window_size=10000)
>>> vars_df, wins_df, sample = analyser.process_vcf("input.vcf", False, 8)
>>> print(f"Analyzed {len(wins_df)} windows for sample {sample}")
exception emumadz.plot_homozygosity.HomozygosityError[source]

Bases: Exception

Custom exception for homozygosity analysis errors

class emumadz.plot_homozygosity.HomozygosityScorer[source]

Bases: ABC

Abstract base class for homozygosity scoring methods

abstractmethod calculate_score(record, variant: VariantData, sample_name: str, ref_samples: List[str]) Dict[str, int][source]
abstractmethod get_name() str[source]
abstractmethod requires_references() bool[source]
class emumadz.plot_homozygosity.VariantData(chrom: str, pos: int, is_snp: bool)[source]

Bases: object

Container for variant information

chrom: str
get_sample_data(record, sample_name: str) Dict[source]

Get sample data on demand

is_snp: bool
pos: int
emumadz.plot_homozygosity.add_homozygosity_to_variant_json(variant_json_path: str, tdf_file: str, bedgraph_file: str, window_size: int, display_name: str = None, overwrite: bool = True) None[source]

Add or update homozygosity track information in variant JSON file.

Parameters:
  • variant_json_path – Path to variant JSON file from VCF parser

  • tdf_file – Path to TDF file (or None if not generated)

  • bedgraph_file – Path to bedgraph file

  • window_size – Analysis window size

  • display_name – Sample display name (from samplesheet)

  • overwrite – If True, overwrite existing homozygosity data; if False, append new track

emumadz.plot_homozygosity.bedgraph_to_tdf(bedgraph_file: str, fasta_index: str, output_tdf: str) bool[source]

Convert bedgraph to TDF (Tiled Data Format) for efficient IGV visualization.

TDF is a binary format that loads much faster than text bedgraph for large genomic datasets. Requires IGVTools to be installed and in PATH.

Parameters:
  • bedgraph_file – Input bedgraph file path

  • fasta_index – Reference genome .fai index file (e.g., hg38.fa.fai)

  • output_tdf – Output TDF file path

Returns:

True if conversion succeeded, False if IGVTools not available or failed

Example

>>> success = bedgraph_to_tdf("scores.bedgraph", "genome.fa.fai", "scores.tdf")
>>> if success:
...     print("Load scores.tdf in IGV for visualization")
emumadz.plot_homozygosity.load_samplesheet(samplesheet_path: str) Dict[str, str][source]

Load samplesheet to map sample names to VCF sample IDs.

Parameters:

samplesheet_path – Path to tab-separated samplesheet

Returns:

Dictionary mapping sample_identity to VCF sample names

Example samplesheet format:

sample_identity vcf_sample_name alignment_file sample_type mutant_1 TL2312073-163-4L /path/to.bam mutant

emumadz.plot_homozygosity.main()[source]
emumadz.plot_homozygosity.parse_allele_depths(ad_field) List[int][source]

Parse AD field with comprehensive error handling

emumadz.plot_homozygosity.write_bedgraph(df: DataFrame, output_file: str, track_name: str, description: str)[source]

Write windowed scores to UCSC BedGraph format for genome browser visualization.

BedGraph format: chrom <tab> start <tab> end <tab> score Output can be loaded directly into IGV, UCSC Genome Browser, or converted to TDF.

Parameters:
  • df – DataFrame with columns: chrom, start, end, score

  • output_file – Path for output .bedgraph file

  • track_name – Track name for genome browser display

  • description – Track description for genome browser

Example

>>> write_bedgraph(windows_df, "output.bedgraph",
...                "mutant1_henke", "Henke homozygosity scores")