API Reference¶

Code in this repository is provided under a MIT license. This documentation is provided under a CC-BY-4.0 license.

Visit our lab website here. Contact Benjamin Hogan at ben.hogan@petermac.org.

class emumadz.parse_vcf.VCFParser(vcf_file: str, samplesheet_file: str, chrom_mapping_file: str | None = None)[source]¶

Bases: object

calculate_allele_freq(genotype: str) → Tuple[float | None, str | None][source]¶

Calculate allele frequency from genotype.

Parameters:: genotype (str) – Genotype string (e.g., “0/1”, “1/1”)
Returns:: Tuple of (frequency_float, count_string)
Return type:: Tuple[Optional[float], Optional[str]]

create_bam_subset(bam_file: str, variant_positions: List[Tuple[str, int]], output_path: str) → bool[source]¶

Create subset BAM containing reads around variant positions.

Parameters:

bam_file (str) – Path to source BAM file
variant_positions (List[Tuple[str, int]]) – List of (chromosome, position) tuples
output_path (str) – Output BAM file path

Returns:

Success status

Return type:

bool

create_reference_bam_subsets(subset_path: str, variant_positions: List[Tuple[str, int]], force_overwrite: bool = False) → Dict[str, str][source]¶

Create chromosome-renamed reference BAM subsets for all reference samples.

Parameters:

subset_path (str) – Output directory for subset BAMs
variant_positions (List[Tuple[str, int]]) – List of (chromosome, position) tuples
force_overwrite (bool) – Overwrites bam files

Returns:

Dictionary mapping sample_id to output BAM path

Return type:

Dict[str, str]

generate_coverage_report(data: DataFrame, subset_dir: str) → DataFrame[source]¶

Generate coverage report for all BAM subsets.

Parameters:

data (pd.DataFrame) – DataFrame with variant data
subset_dir (str) – Directory containing subset BAMs

Returns:

Coverage report DataFrame

Return type:

pd.DataFrame

get_impact_color(impact: str) → str[source]¶

Get color coding for impact level.

Parameters:: impact (str) – Impact level string (HIGH, MODERATE, LOW, MODIFIER)
Returns:: Hex color code
Return type:: str

get_relative_bam_file(output_file: str, bam_filename: str, subset_path: str) → str[source]¶

Get the correct relative path from output JSON to BAM file.

Parameters:

output_file (str) – Path to the output JSON/CSV file
bam_filename (str) – BAM filename (e.g., “sample_mutant.bam”)
subset_path (str) – Directory where BAM subsets are created

Returns:

Relative path from JSON to BAM file

Return type:

str

match_sample(vcf_sample: str, column_index: int) → Tuple[str | None, str | None][source]¶

Match VCF sample to samplesheet entry based on column position.

Parameters:

vcf_sample (str) – Sample name from VCF
column_index (int) – Zero-based index of the sample column in VCF

Returns:

Tuple of (bam_file, sample_type)

Return type:

Tuple[Optional[str], Optional[str]]

parse_ann_field(ann_string: str) → List[Dict[str, str]][source]¶

Parse SnpEff ANN field.

Parameters:: ann_string (str) – SnpEff ANN annotation string
Returns:: List of parsed annotation dictionaries
Return type:: List[Dict[str, str]]

parse_csq_field(csq_string: str) → List[Dict[str, str]][source]¶

Parse VEP CSQ field.

Parameters:: csq_string (str) – VEP CSQ annotation string
Returns:: List of parsed annotation dictionaries
Return type:: List[Dict[str, str]]

process_sample_variants(sample_data: Dict[str, str | DataFrame], force_overwrite: bool = False) → bool[source]¶

Process all variants for a single sample and create subset BAM.

Parameters:

sample_data (Dict[str, Union[str, pd.DataFrame]]) – Dictionary containing sample info and variants
force_overwrite (bool) – Whether to overwrite existing files

Returns:

Success status

Return type:

bool

process_vcf(output_file: str, subset_path: str | None = None, max_workers: int = 4, force_overwrite: bool = False, coverage_report: str | None = None) → DataFrame[source]¶

Process VCF and create analysis table.

Parameters:

output_file (str) – Output CSV file path
subset_path (Optional[str]) – Directory for subset BAM files
max_workers (int) – Number of parallel workers for BAM subsetting
force_overwrite (bool) – Whether to overwrite existing files
coverage_report (Optional[str]) – Path for coverage report CSV

Returns:

DataFrame with processed variants

Return type:

pd.DataFrame

sample_info¶

Parse annotated VCF files and create BAM subsets for variant analysis.

This class processes VCF files with VEP/SnpEff annotations, matches samples to BAM files, and optionally creates chromosome-renamed BAM subsets around variant positions for visualization.

Parameters:

vcf_file (str) – Path to annotated VCF file
samplesheet_file (str) – Tab-separated file mapping sample IDs to BAM paths and types
chrom_mapping_file (Optional[str]) – Optional TSV file mapping old to new chromosome names

Example

>>> parser = VCFParser('variants.vcf', 'samples.tsv', 'chr_mapping.tsv')
>>> df = parser.process_vcf('output.csv', subset_path='bam_subsets/')

validate_bam_subset(subset_path: str, expected_variants: List[Tuple[str, int]]) → Tuple[bool, str | Dict][source]¶

Validate that a BAM subset contains reads covering expected variant positions.

Parameters:

subset_path (str) – Path to subset BAM file
expected_variants (List[Tuple[str, int]]) – List of (chromosome, position) tuples

Returns:

Tuple of (success_bool, result_dict)

Return type:

Tuple[bool, Union[str, Dict]]

emumadz.parse_vcf.main() → None[source]¶

class emumadz.plot_homozygosity.HenkeScorer(min_coverage: int = 1, homozygosity_threshold: float = 0.85)[source]¶

Bases: HomozygosityScorer

Scorer: (hom / het) × (missing_ref_alleles / present_ref_alleles)

Implements the Henke et al. (2013) mapping score formula, which identifies regions where mutants are homozygous for alleles that differ from reference strains. High scores indicate candidate linked regions.

The score is calculated as:: (homozygous_SNPs / heterozygous_SNPs) × (ref_alleles_absent / ref_alleles_present)

min_coverage¶: Minimum read depth required for a sample to be considered

homozygosity_threshold¶: Allele fraction threshold for “soft” homozygosity (0.85 means ≥85% of reads support one allele counts as homozygous)

Example

scorer = HenkeScorer(min_coverage=10, homozygosity_threshold=0.85)

calculate_score(record, variant: VariantData, sample_name: str, ref_samples: List[str]) → Dict[str, int][source]¶

Calculate per-variant homozygosity counts for window aggregation.

Examines one VCF record and returns count flags that are later summed across genomic windows to compute the Henke score.

Parameters:

record – pysam VCF record object
variant – VariantData container with position info
sample_name – Name of mutant sample
ref_samples – List of reference sample names

Returns:

mut_is_hom: 1 if mutant is homozygous
mut_is_het: 1 if mutant is heterozygous
mut_has_map: 1 if all reference alleles present in mutant
mut_miss_map: 1 if any reference alleles missing from mutant

Return type:

Dictionary with binary flags (0 or 1)

Example

>>> counts = scorer.calculate_score(record, variant, "mutant1", ["ref1", "ref2"])
>>> counts
{'mut_is_hom': 1, 'mut_is_het': 0, 'mut_has_map': 0, 'mut_miss_map': 1}

get_name() → str[source]¶

requires_references() → bool[source]¶

class emumadz.plot_homozygosity.HomozygosityAnalyser(scorer: HomozygosityScorer, window_size: int = 10000, step_size: int = 1000, use_snp_windows: bool = False, snp_window_size: int = 200, snp_step_size: int = 20, samplesheet_path: str = None)[source]¶

Bases: object

Processes VCF files to generate homozygosity scores across genomic windows.

Supports two windowing strategies:

BP-based: Fixed genomic coordinates (e.g., 10kb windows)
SNP-based: Fixed number of variants (e.g., 200 SNP windows)

BP windows are useful for uniform genome coverage, while SNP windows normalize for variable mutation density across the genome.

scorer¶: HomozygosityScorer instance (e.g., HenkeScorer)

window_size¶: Window size in bp (BP mode) or SNP count (SNP mode)

step_size¶: Step size in bp (BP mode) or SNP count (SNP mode)

use_snp_windows¶: If True, use SNP-count windows; if False, use BP windows

Example

>>> scorer = HenkeScorer(min_coverage=10)
>>> analyser = HomozygosityAnalyser(scorer, window_size=50000, step_size=5000)
>>> variants_df, windows_df, sample = analyser.process_vcf("data.vcf", False, 8)

process_vcf(vcf_path: str, all_variants: bool, ncpu: int) → Tuple[DataFrame, DataFrame, str][source]¶

Process entire VCF file with parallel chromosome processing.

Automatically infers sample names from VCF header (first sample = mutant, remaining samples = references). Processes each chromosome in parallel and combines results.

Parameters:

vcf_path – Path to VCF file (must have AD format field)
all_variants – If True, process all variant types; if False, SNPs only
ncpu – Number of CPU cores for parallel processing (one per chromosome optimal)

Returns:

Tuple of (combined_variants_df, combined_windows_df, sample_name, display_name) - combined_variants_df: All per-variant counts across genome - combined_windows_df: All window scores across genome - sample_name: Name of mutant sample - display_name: Name shown on display

Raises:

ValueError – If VCF has <2 samples or samples not found in header

Example

>>> analyser = HomozygosityAnalyser(scorer, window_size=10000)
>>> vars_df, wins_df, sample = analyser.process_vcf("input.vcf", False, 8)
>>> print(f"Analyzed {len(wins_df)} windows for sample {sample}")

exception emumadz.plot_homozygosity.HomozygosityError[source]¶

Bases: Exception

Custom exception for homozygosity analysis errors

class emumadz.plot_homozygosity.HomozygosityScorer[source]¶

Bases: ABC

Abstract base class for homozygosity scoring methods

abstractmethod calculate_score(record, variant: VariantData, sample_name: str, ref_samples: List[str]) → Dict[str, int][source]¶

abstractmethod get_name() → str[source]¶

abstractmethod requires_references() → bool[source]¶

class emumadz.plot_homozygosity.VariantData(chrom: str, pos: int, is_snp: bool)[source]¶

Bases: object

Container for variant information

chrom: str¶

get_sample_data(record, sample_name: str) → Dict[source]¶: Get sample data on demand

is_snp: bool¶

pos: int¶

emumadz.plot_homozygosity.add_homozygosity_to_variant_json(variant_json_path: str, tdf_file: str, bedgraph_file: str, window_size: int, display_name: str = None, overwrite: bool = True) → None[source]¶

Add or update homozygosity track information in variant JSON file.

Parameters:

variant_json_path – Path to variant JSON file from VCF parser
tdf_file – Path to TDF file (or None if not generated)
bedgraph_file – Path to bedgraph file
window_size – Analysis window size
display_name – Sample display name (from samplesheet)
overwrite – If True, overwrite existing homozygosity data; if False, append new track

emumadz.plot_homozygosity.bedgraph_to_tdf(bedgraph_file: str, fasta_index: str, output_tdf: str) → bool[source]¶

Convert bedgraph to TDF (Tiled Data Format) for efficient IGV visualization.

TDF is a binary format that loads much faster than text bedgraph for large genomic datasets. Requires IGVTools to be installed and in PATH.

Parameters:

bedgraph_file – Input bedgraph file path
fasta_index – Reference genome .fai index file (e.g., hg38.fa.fai)
output_tdf – Output TDF file path

Returns:

True if conversion succeeded, False if IGVTools not available or failed

Example

>>> success = bedgraph_to_tdf("scores.bedgraph", "genome.fa.fai", "scores.tdf")
>>> if success:
...     print("Load scores.tdf in IGV for visualization")

emumadz.plot_homozygosity.load_samplesheet(samplesheet_path: str) → Dict[str, str][source]¶

Load samplesheet to map sample names to VCF sample IDs.

Parameters:: samplesheet_path – Path to tab-separated samplesheet
Returns:: Dictionary mapping sample_identity to VCF sample names

Example samplesheet format:: sample_identity vcf_sample_name alignment_file sample_type mutant_1 TL2312073-163-4L /path/to.bam mutant

emumadz.plot_homozygosity.main()[source]¶

emumadz.plot_homozygosity.parse_allele_depths(ad_field) → List[int][source]¶: Parse AD field with comprehensive error handling

emumadz.plot_homozygosity.write_bedgraph(df: DataFrame, output_file: str, track_name: str, description: str)[source]¶

Write windowed scores to UCSC BedGraph format for genome browser visualization.

BedGraph format: chrom <tab> start <tab> end <tab> score Output can be loaded directly into IGV, UCSC Genome Browser, or converted to TDF.

Parameters:

df – DataFrame with columns: chrom, start, end, score
output_file – Path for output .bedgraph file
track_name – Track name for genome browser display
description – Track description for genome browser

Example

>>> write_bedgraph(windows_df, "output.bedgraph",
...                "mutant1_henke", "Henke homozygosity scores")