API Reference¶
Code in this repository is provided under a MIT license. This documentation is provided under a CC-BY-4.0 license.
Visit our lab website here. Contact Benjamin Hogan at ben.hogan@petermac.org.
- class emumadz.parse_vcf.VCFParser(vcf_file: str, samplesheet_file: str, chrom_mapping_file: str | None = None)[source]¶
Bases:
object- calculate_allele_freq(genotype: str) Tuple[float | None, str | None][source]¶
Calculate allele frequency from genotype.
- Parameters:
genotype (str) – Genotype string (e.g., “0/1”, “1/1”)
- Returns:
Tuple of (frequency_float, count_string)
- Return type:
Tuple[Optional[float], Optional[str]]
- create_bam_subset(bam_file: str, variant_positions: List[Tuple[str, int]], output_path: str) bool[source]¶
Create subset BAM containing reads around variant positions.
- Parameters:
bam_file (str) – Path to source BAM file
variant_positions (List[Tuple[str, int]]) – List of (chromosome, position) tuples
output_path (str) – Output BAM file path
- Returns:
Success status
- Return type:
bool
- create_reference_bam_subsets(subset_path: str, variant_positions: List[Tuple[str, int]], force_overwrite: bool = False) Dict[str, str][source]¶
Create chromosome-renamed reference BAM subsets for all reference samples.
- Parameters:
subset_path (str) – Output directory for subset BAMs
variant_positions (List[Tuple[str, int]]) – List of (chromosome, position) tuples
force_overwrite (bool) – Overwrites bam files
- Returns:
Dictionary mapping sample_id to output BAM path
- Return type:
Dict[str, str]
- generate_coverage_report(data: DataFrame, subset_dir: str) DataFrame[source]¶
Generate coverage report for all BAM subsets.
- Parameters:
data (pd.DataFrame) – DataFrame with variant data
subset_dir (str) – Directory containing subset BAMs
- Returns:
Coverage report DataFrame
- Return type:
pd.DataFrame
- get_impact_color(impact: str) str[source]¶
Get color coding for impact level.
- Parameters:
impact (str) – Impact level string (HIGH, MODERATE, LOW, MODIFIER)
- Returns:
Hex color code
- Return type:
str
- get_relative_bam_file(output_file: str, bam_filename: str, subset_path: str) str[source]¶
Get the correct relative path from output JSON to BAM file.
- Parameters:
output_file (str) – Path to the output JSON/CSV file
bam_filename (str) – BAM filename (e.g., “sample_mutant.bam”)
subset_path (str) – Directory where BAM subsets are created
- Returns:
Relative path from JSON to BAM file
- Return type:
str
- match_sample(vcf_sample: str, column_index: int) Tuple[str | None, str | None][source]¶
Match VCF sample to samplesheet entry based on column position.
- Parameters:
vcf_sample (str) – Sample name from VCF
column_index (int) – Zero-based index of the sample column in VCF
- Returns:
Tuple of (bam_file, sample_type)
- Return type:
Tuple[Optional[str], Optional[str]]
- parse_ann_field(ann_string: str) List[Dict[str, str]][source]¶
Parse SnpEff ANN field.
- Parameters:
ann_string (str) – SnpEff ANN annotation string
- Returns:
List of parsed annotation dictionaries
- Return type:
List[Dict[str, str]]
- parse_csq_field(csq_string: str) List[Dict[str, str]][source]¶
Parse VEP CSQ field.
- Parameters:
csq_string (str) – VEP CSQ annotation string
- Returns:
List of parsed annotation dictionaries
- Return type:
List[Dict[str, str]]
- process_sample_variants(sample_data: Dict[str, str | DataFrame], force_overwrite: bool = False) bool[source]¶
Process all variants for a single sample and create subset BAM.
- Parameters:
sample_data (Dict[str, Union[str, pd.DataFrame]]) – Dictionary containing sample info and variants
force_overwrite (bool) – Whether to overwrite existing files
- Returns:
Success status
- Return type:
bool
- process_vcf(output_file: str, subset_path: str | None = None, max_workers: int = 4, force_overwrite: bool = False, coverage_report: str | None = None) DataFrame[source]¶
Process VCF and create analysis table.
- Parameters:
output_file (str) – Output CSV file path
subset_path (Optional[str]) – Directory for subset BAM files
max_workers (int) – Number of parallel workers for BAM subsetting
force_overwrite (bool) – Whether to overwrite existing files
coverage_report (Optional[str]) – Path for coverage report CSV
- Returns:
DataFrame with processed variants
- Return type:
pd.DataFrame
- sample_info¶
Parse annotated VCF files and create BAM subsets for variant analysis.
This class processes VCF files with VEP/SnpEff annotations, matches samples to BAM files, and optionally creates chromosome-renamed BAM subsets around variant positions for visualization.
- Parameters:
vcf_file (str) – Path to annotated VCF file
samplesheet_file (str) – Tab-separated file mapping sample IDs to BAM paths and types
chrom_mapping_file (Optional[str]) – Optional TSV file mapping old to new chromosome names
Example
>>> parser = VCFParser('variants.vcf', 'samples.tsv', 'chr_mapping.tsv') >>> df = parser.process_vcf('output.csv', subset_path='bam_subsets/')
- validate_bam_subset(subset_path: str, expected_variants: List[Tuple[str, int]]) Tuple[bool, str | Dict][source]¶
Validate that a BAM subset contains reads covering expected variant positions.
- Parameters:
subset_path (str) – Path to subset BAM file
expected_variants (List[Tuple[str, int]]) – List of (chromosome, position) tuples
- Returns:
Tuple of (success_bool, result_dict)
- Return type:
Tuple[bool, Union[str, Dict]]
- class emumadz.plot_homozygosity.HenkeScorer(min_coverage: int = 1, homozygosity_threshold: float = 0.85)[source]¶
Bases:
HomozygosityScorerScorer: (hom / het) × (missing_ref_alleles / present_ref_alleles)
Implements the Henke et al. (2013) mapping score formula, which identifies regions where mutants are homozygous for alleles that differ from reference strains. High scores indicate candidate linked regions.
- The score is calculated as:
(homozygous_SNPs / heterozygous_SNPs) × (ref_alleles_absent / ref_alleles_present)
- min_coverage¶
Minimum read depth required for a sample to be considered
- homozygosity_threshold¶
Allele fraction threshold for “soft” homozygosity (0.85 means ≥85% of reads support one allele counts as homozygous)
Example
scorer = HenkeScorer(min_coverage=10, homozygosity_threshold=0.85)
- calculate_score(record, variant: VariantData, sample_name: str, ref_samples: List[str]) Dict[str, int][source]¶
Calculate per-variant homozygosity counts for window aggregation.
Examines one VCF record and returns count flags that are later summed across genomic windows to compute the Henke score.
- Parameters:
record – pysam VCF record object
variant – VariantData container with position info
sample_name – Name of mutant sample
ref_samples – List of reference sample names
- Returns:
mut_is_hom: 1 if mutant is homozygous
mut_is_het: 1 if mutant is heterozygous
mut_has_map: 1 if all reference alleles present in mutant
mut_miss_map: 1 if any reference alleles missing from mutant
- Return type:
Dictionary with binary flags (0 or 1)
Example
>>> counts = scorer.calculate_score(record, variant, "mutant1", ["ref1", "ref2"]) >>> counts {'mut_is_hom': 1, 'mut_is_het': 0, 'mut_has_map': 0, 'mut_miss_map': 1}
- class emumadz.plot_homozygosity.HomozygosityAnalyser(scorer: HomozygosityScorer, window_size: int = 10000, step_size: int = 1000, use_snp_windows: bool = False, snp_window_size: int = 200, snp_step_size: int = 20, samplesheet_path: str = None)[source]¶
Bases:
objectProcesses VCF files to generate homozygosity scores across genomic windows.
- Supports two windowing strategies:
BP-based: Fixed genomic coordinates (e.g., 10kb windows)
SNP-based: Fixed number of variants (e.g., 200 SNP windows)
BP windows are useful for uniform genome coverage, while SNP windows normalize for variable mutation density across the genome.
- scorer¶
HomozygosityScorer instance (e.g., HenkeScorer)
- window_size¶
Window size in bp (BP mode) or SNP count (SNP mode)
- step_size¶
Step size in bp (BP mode) or SNP count (SNP mode)
- use_snp_windows¶
If True, use SNP-count windows; if False, use BP windows
Example
>>> scorer = HenkeScorer(min_coverage=10) >>> analyser = HomozygosityAnalyser(scorer, window_size=50000, step_size=5000) >>> variants_df, windows_df, sample = analyser.process_vcf("data.vcf", False, 8)
- process_vcf(vcf_path: str, all_variants: bool, ncpu: int) Tuple[DataFrame, DataFrame, str][source]¶
Process entire VCF file with parallel chromosome processing.
Automatically infers sample names from VCF header (first sample = mutant, remaining samples = references). Processes each chromosome in parallel and combines results.
- Parameters:
vcf_path – Path to VCF file (must have AD format field)
all_variants – If True, process all variant types; if False, SNPs only
ncpu – Number of CPU cores for parallel processing (one per chromosome optimal)
- Returns:
Tuple of (combined_variants_df, combined_windows_df, sample_name, display_name) - combined_variants_df: All per-variant counts across genome - combined_windows_df: All window scores across genome - sample_name: Name of mutant sample - display_name: Name shown on display
- Raises:
ValueError – If VCF has <2 samples or samples not found in header
Example
>>> analyser = HomozygosityAnalyser(scorer, window_size=10000) >>> vars_df, wins_df, sample = analyser.process_vcf("input.vcf", False, 8) >>> print(f"Analyzed {len(wins_df)} windows for sample {sample}")
- exception emumadz.plot_homozygosity.HomozygosityError[source]¶
Bases:
ExceptionCustom exception for homozygosity analysis errors
- class emumadz.plot_homozygosity.HomozygosityScorer[source]¶
Bases:
ABCAbstract base class for homozygosity scoring methods
- abstractmethod calculate_score(record, variant: VariantData, sample_name: str, ref_samples: List[str]) Dict[str, int][source]¶
- class emumadz.plot_homozygosity.VariantData(chrom: str, pos: int, is_snp: bool)[source]¶
Bases:
objectContainer for variant information
- chrom: str¶
- is_snp: bool¶
- pos: int¶
- emumadz.plot_homozygosity.add_homozygosity_to_variant_json(variant_json_path: str, tdf_file: str, bedgraph_file: str, window_size: int, display_name: str = None, overwrite: bool = True) None[source]¶
Add or update homozygosity track information in variant JSON file.
- Parameters:
variant_json_path – Path to variant JSON file from VCF parser
tdf_file – Path to TDF file (or None if not generated)
bedgraph_file – Path to bedgraph file
window_size – Analysis window size
display_name – Sample display name (from samplesheet)
overwrite – If True, overwrite existing homozygosity data; if False, append new track
- emumadz.plot_homozygosity.bedgraph_to_tdf(bedgraph_file: str, fasta_index: str, output_tdf: str) bool[source]¶
Convert bedgraph to TDF (Tiled Data Format) for efficient IGV visualization.
TDF is a binary format that loads much faster than text bedgraph for large genomic datasets. Requires IGVTools to be installed and in PATH.
- Parameters:
bedgraph_file – Input bedgraph file path
fasta_index – Reference genome .fai index file (e.g., hg38.fa.fai)
output_tdf – Output TDF file path
- Returns:
True if conversion succeeded, False if IGVTools not available or failed
Example
>>> success = bedgraph_to_tdf("scores.bedgraph", "genome.fa.fai", "scores.tdf") >>> if success: ... print("Load scores.tdf in IGV for visualization")
- emumadz.plot_homozygosity.load_samplesheet(samplesheet_path: str) Dict[str, str][source]¶
Load samplesheet to map sample names to VCF sample IDs.
- Parameters:
samplesheet_path – Path to tab-separated samplesheet
- Returns:
Dictionary mapping sample_identity to VCF sample names
- Example samplesheet format:
sample_identity vcf_sample_name alignment_file sample_type mutant_1 TL2312073-163-4L /path/to.bam mutant
- emumadz.plot_homozygosity.parse_allele_depths(ad_field) List[int][source]¶
Parse AD field with comprehensive error handling
- emumadz.plot_homozygosity.write_bedgraph(df: DataFrame, output_file: str, track_name: str, description: str)[source]¶
Write windowed scores to UCSC BedGraph format for genome browser visualization.
BedGraph format: chrom <tab> start <tab> end <tab> score Output can be loaded directly into IGV, UCSC Genome Browser, or converted to TDF.
- Parameters:
df – DataFrame with columns: chrom, start, end, score
output_file – Path for output .bedgraph file
track_name – Track name for genome browser display
description – Track description for genome browser
Example
>>> write_bedgraph(windows_df, "output.bedgraph", ... "mutant1_henke", "Henke homozygosity scores")