API

class genoray.VCF(path, filter=None, phasing=False, dosage_field=None, progress=False, with_gvi_index=True)[source]#

Create a VCF reader.

Parameters:

path (str | Path) – Path to the VCF file.
filter (Filter | None, default: None) –
A Filter bundling a cyvcf2 record predicate with its matching .gvi polars expression, or None to disable filtering.

Note

The record predicate must tolerate missing fields: if you access an INFO or FORMAT field, not every variant is guaranteed to carry it. The cyvcf2.Variant API provides the .get method on the INFO and FORMAT attributes for this purpose. For example, lambda v: v.INFO.get("AF", 0) > 0.05 skips any variant with an AF <= 0.05 and, by treating a missing AF as 0, any variant without an AF.

Note

The expr polars expression is applied to the polars DataFrame returned by get_record_info(), not to the VCF file itself, so it cannot use the cyvcf2.Variant API. For example, to filter variants by an INFO field, use pl.col("AF") > 0.05 rather than lambda v: v.INFO.get("AF", 0) > 0.05.
read_as – Type of data to read from the VCF file. Can be VCF.Genos8, VCF.Genos16, VCF.Dosages, VCF.Genos8Dosages, or VCF.Genos16Dosages.
phasing (bool, default: False) – Whether to include phasing information on genotypes. If True, the ploidy axis will be length 3 such that phasing is indicated by the 3rd value: 0 = unphased, 1 = phased. If False, the ploidy axis will be length 2.
dosage_field (str | None, default: None) – Name of the dosage field to read from the VCF file. Required for any mode that includes dosages: VCF.Dosages, VCF.Genos8Dosages, or VCF.Genos16Dosages.
progress (bool, default: False) – Whether to show a progress bar while reading the VCF file.
with_gvi_index (bool)
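
As a minimal sketch of the missing-field-tolerant record predicate described in the filter notes: the FakeVariant class below is a hypothetical stand-in for cyvcf2.Variant (only a dict-backed INFO attribute is modelled), so the example runs without a VCF file.

```python
class FakeVariant:
    """Hypothetical stand-in for cyvcf2.Variant; only INFO is modelled."""

    def __init__(self, info):
        self.INFO = info  # a plain dict offers the same .get(key, default) call


def keep_common(v):
    # Treat a missing AF as 0 so the predicate never fails on absent fields.
    return v.INFO.get("AF", 0) > 0.05


variants = [FakeVariant({"AF": 0.20}), FakeVariant({"AF": 0.01}), FakeVariant({})]
kept = [keep_common(v) for v in variants]
print(kept)  # [True, False, False]
```

The third variant has no AF at all and is skipped rather than raising, which is the behaviour the note asks for.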

class Dosages(instance)#

class Genos16(instance)#

class Genos16Dosages(instance)#

class Genos8(instance)#

class Genos8Dosages(instance)#

chunk(contig, start=0, end=POS_MAX, max_mem='4g', mode=Genos16)[source]#

Iterate over genotypes and/or dosages for a range in chunks limited by max_mem.

Parameters:

contig (str) – Contig name.
start (int | integer, default: 0) – 0-based start position.
end (int | integer, default: POS_MAX) – 0-based, exclusive end position.
max_mem (int | str, default: "4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).
mode (type[TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)], default: Genos16) – Type of data to read.

Return type:

Generator[TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)]

Returns:

Generator of genotypes and/or dosages. Genotypes have shape (samples, ploidy, variants) and dosages have shape (samples, variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.
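
To get a feel for how max_mem bounds a chunk, the arithmetic can be sketched as follows. This is an illustration only, not genoray's implementation: variants_per_chunk is a hypothetical helper, and the 2-byte item size assumes that Genos16 stores 16-bit integers.

```python
def variants_per_chunk(max_mem_bytes, n_samples, ploidy=2, itemsize=2):
    """Largest variant count whose (samples, ploidy, variants) block fits the budget.

    itemsize=2 is an assumption (16-bit genotypes), not a documented fact.
    """
    bytes_per_variant = n_samples * ploidy * itemsize
    return max_mem_bytes // bytes_per_variant


# A 1 GiB budget with 1,000 diploid samples costs 4,000 bytes per variant.
n = variants_per_chunk(2**30, n_samples=1000)
print(n)  # 268435
```

With phasing=True the ploidy axis has length 3, so passing ploidy=3 gives the correspondingly smaller count.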

get_record_info(contig=None, start=None, end=None, fields=None, info=None, lazy=False)[source]#

Overloads:

self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (Literal[False]) → pl.DataFrame
self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (Literal[True]) → pl.LazyFrame
self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (bool) → pl.DataFrame | pl.LazyFrame

Get a DataFrame of any non-FORMAT fields in the VCF for a given range or the entire VCF.

Variants are filtered if the VCF instance has a filter set.

Parameters:

contig (str | None, default: None) – Contig name. If None, will read the entire VCF.
start (int | integer | None, default: None) – 0-based start position.
end (int | integer | None, default: None) – 0-based, exclusive end position.
fields (list[str] | None, default: None) – List of non-FORMAT, non-INFO fields to include. Returns all by default.
info (list[str] | None, default: None) – List of INFO fields to include. Returns all by default.
lazy (bool, default: False) – If True, return a polars.LazyFrame instead of collecting to a polars.DataFrame.

Return type:

DataFrame | LazyFrame

n_vars_in_ranges(contig, starts=0, ends=POS_MAX)[source]#

Return the number of variants in each of the given ranges.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions of the ranges.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Returns:

Shape: (ranges). Number of variants in each of the given ranges.

Return type:

ndarray[tuple[Any, ...], dtype[uintc]]
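
The counting can be pictured as a binary search over sorted variant positions. A rough sketch, assuming 0-based positions and counting a variant by its start position only (so a deletion that begins before a range is not counted); count_in_ranges is a hypothetical helper, not part of genoray.

```python
from bisect import bisect_left


def count_in_ranges(positions, starts, ends):
    """Count sorted 0-based positions falling in each half-open [start, end)."""
    return [
        bisect_left(positions, end) - bisect_left(positions, start)
        for start, end in zip(starts, ends)
    ]


positions = [3, 10, 10, 25, 40]  # sorted 0-based variant starts on one contig
counts = count_in_ranges(positions, starts=[0, 10, 26], ends=[10, 26, 40])
print(counts)  # [1, 3, 0]
```

The result has one entry per range, matching the (ranges) shape described above; the exclusive end means the variant at position 40 is not counted in the last range.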

read(contig, start=0, end=POS_MAX, mode=Genos16, out=None)[source]#

Read genotypes and/or dosages for a range.

Parameters:

contig (str) – Contig name.
start (int | integer, default: 0) – 0-based start position.
end (int | integer, default: POS_MAX) – 0-based, exclusive end position.
mode (type[TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)], default: Genos16) – Type of data to read.
out (Optional[TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)], default: None) – Output array to fill with genotypes and/or dosages. If None, a new array will be created.

Return type:

TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)

Returns:

Genotypes and/or dosages. Genotypes have shape (samples, ploidy, variants) and dosages have shape (samples, variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.
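
The layout described above can be illustrated with nested lists standing in for the returned array. The values are made up, no file is read, and summing the calls assumes biallelic 0/1 allele codes.

```python
# Toy genotypes with shape (samples=2, ploidy=2, variants=3); -1 marks missing.
genos = [
    [[0, 1, -1], [1, 1, -1]],  # sample 0: haplotype 0, then haplotype 1
    [[0, 0, 0], [0, 1, 0]],    # sample 1
]


def alt_count(sample, variant):
    """ALT allele count for one sample at one variant, or None if a call is missing."""
    calls = [hap[variant] for hap in genos[sample]]
    if -1 in calls:
        return None
    return sum(calls)


print(alt_count(0, 0), alt_count(0, 1), alt_count(0, 2))  # 1 2 None
```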

set_samples(samples)[source]#

Set the samples to read from the VCF file. Modifies the VCF reader in place and returns it.

Parameters: samples (ArrayLike | None) – List of sample names to read from the VCF file.
Return type: Self
Returns: The VCF reader with the specified samples.

using_pbar(pbar)[source]#

Create a context where the given progress bar will be incremented by any calls to a read method.

Parameters: pbar (tqdm_asyncio) – Progress bar to use while reading variants. This will be incremented per variant during any calls to a read function.

available_samples: list[str]#: List of available samples in the VCF file.

contigs: list[str]#: Naturally sorted list of available contigs in the VCF file.

property current_samples: list[str]#: List of samples currently being read from the VCF file.

dosage_field: str | None#: Name of the dosage field to read from the VCF file. Required if you want to use modes that include dosages.

property filter: Filter | None#

The Filter currently in effect, or None if no filter is set.

Assigning vcf.filter = vcf.filter round-trips.

property n_samples: int#: Number of samples currently selected.

property nbytes: int#

Total in-memory footprint, in bytes, of resident (non-mmap’d) data structures held by this reader.

Currently this is the gvi variant index (CHROM/POS/REF/ALT/ILEN). Returns 0 before the index is loaded.

path: Path#: Path to the VCF file.

phasing: bool#: Whether to include phasing information on genotypes. If True, the ploidy axis will be length 3 such that phasing is indicated by the 3rd value: 0 = unphased, 1 = phased. If False, the ploidy axis will be length 2.

ploidy: int = 2#: Ploidy of the VCF file. This is currently always 2 since we use cyvcf2.

class genoray.PGEN(geno_path, filter=None, dosage_path=None, load_index=True)[source]#

Create a PGEN reader.

Parameters:

geno_path (str | Path) – Path to the PGEN file. Only used for genotypes if a dosage path is provided as well.
filter (Expr | None, default: None) – Polars expression to filter variants. Should return True for variants to keep. Will have at least the columns CHROM, POS (1-based), REF, ALT, and ILEN available to use.
dosage_path (str | Path | None, default: None) – Path to a dosage PGEN file. If None, the genotype PGEN file will be used for both genotypes and dosages.
load_index (bool, default: True)

class Dosages(instance)#

class Genos(instance)#

class GenosDosages(instance)#

class GenosPhasing(instance)#

class GenosPhasingDosages(instance)#

chunk(contig, start=0, end=POS_MAX, max_mem='4g', mode=Genos)[source]#

Iterate over genotypes and/or dosages for a range in chunks limited by max_mem.

Parameters:

contig (str) – Contig name.
start (int | integer, default: 0) – 0-based start position.
end (int | integer, default: POS_MAX) – 0-based, exclusive end position.
max_mem (int | str, default: "4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).
mode (type[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)], default: Genos) – Type of data to read. Can be Genos, Dosages, GenosPhasing, GenosDosages, or GenosPhasingDosages.

Return type:

Generator[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)]

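Returns:

Generator of genotypes and/or dosages. Genotypes have shape (samples, ploidy, variants) and dosages have shape (samples, variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.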

chunk_ranges(contig, starts=0, ends=POS_MAX, max_mem='4g', mode=Genos)[source]#

Read genotypes and/or dosages for multiple ranges in chunks limited by max_mem.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions.
max_mem (int | str, default: "4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).
mode (type[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)], default: Genos) – Type of data to read. Can be Genos, Dosages, GenosPhasing, GenosDosages, or GenosPhasingDosages.

Return type:

Generator[Generator[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)]]

Returns:

Generator of generators of genotypes and/or dosages, one inner generator per range. Genotypes have shape (samples, ploidy, variants) and dosages have shape (samples, variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.

Examples

gen = reader.chunk_ranges(...)
for range_ in gen:
    if range_ is None:
        continue
    for chunk in range_:
        # do something with chunk
        pass

n_vars_in_ranges(contig, starts=0, ends=POS_MAX)[source]#

Return the number of variants in each of the given ranges.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions of the ranges.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Returns:

Shape: (ranges). Number of variants in the given ranges.

Return type:

ndarray[tuple[Any, ...], dtype[uintc]]

read(contig, start=0, end=POS_MAX, mode=Genos)[source]#

Read genotypes and/or dosages for a range.

Parameters:

contig (str) – Contig name.
start (int | integer, default: 0) – 0-based start position.
end (int | integer, default: POS_MAX) – 0-based, exclusive end position.
mode (type[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)], default: Genos) – Type of data to read. Can be Genos, Dosages, GenosPhasing, GenosDosages, or GenosPhasingDosages.

Return type:

TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)

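Returns:

Genotypes and/or dosages. Genotypes have shape (samples, ploidy, variants) and dosages have shape (samples, variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.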

read_ranges(contig, starts=0, ends=POS_MAX, mode=Genos)[source]#

Read genotypes and/or dosages for multiple ranges.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions.
mode (type[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)], default: Genos) – Type of data to read. Can be Genos, Dosages, GenosPhasing, GenosDosages, or GenosPhasingDosages.

Return type:

tuple[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages), ndarray[tuple[Any, ...], dtype[int_]]]

Returns:

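A tuple of (data, offsets). The data are the genotypes and/or dosages for all ranges, concatenated along the variants axis. The offsets have shape (ranges+1) and slice out the data for each range from the variants axis like so: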

Examples

data, offsets = reader.read_ranges(...)
data[..., offsets[i] : offsets[i + 1]]  # data for range i

Note that the number of variants for range i is np.diff(offsets)[i].
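
A toy version of the offsets convention, using a flat list in place of the variants axis. The counts and labels are made up for illustration.

```python
from itertools import accumulate

counts = [2, 0, 3]                     # variants found in each of 3 ranges
offsets = [0, *accumulate(counts)]     # shape (ranges + 1)
data = ["v0", "v1", "v2", "v3", "v4"]  # variants axis, ranges concatenated

per_range = [data[offsets[i] : offsets[i + 1]] for i in range(len(counts))]
print(offsets)    # [0, 2, 2, 5]
print(per_range)  # [['v0', 'v1'], [], ['v2', 'v3', 'v4']]
```

An empty range shows up as two equal consecutive offsets, so slicing never needs a special case.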

set_samples(samples)[source]#

Set the samples to use.

Parameters: samples (ArrayLike | None) – List of sample names to use. If None, all samples will be used.
Return type: Self

var_idxs(contig, starts=0, ends=POS_MAX)[source]#

Get variant indices and the number of indices per range.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions of the ranges.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Returns:

Shape: (tot_variants). Variant indices for the given ranges, as 0-based positions into this reader’s (filtered) _index, i.e. reader._index[var_idxs] is always valid. With no filter these equal the physical PVAR row order.

Shape: (ranges+1). Offsets to get variant indices for each range.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[uintc]], ndarray[tuple[Any, ...], dtype[int_]]]

available_samples: list[str]#: List of available samples in the PGEN file.

contigs: list[str] | None = None#: Naturally sorted list of contig names in the PGEN file.

property current_samples: list[str]#: List of samples that are currently being used, in order.

property dosage_path: Path | None#: Path to the dosage file.

property filter: Expr | None#: Polars expression to filter variants. Should return True for variants to keep.

property geno_path: Path#: Path to the genotype file.

property n_samples: int#: Number of samples in the file.

property nbytes: int#

Total in-memory footprint, in bytes, of resident (non-mmap’d) data structures held by this reader.

Sums the gvi variant index dataframe and the StartsEndsIlens cache. Returns 0 after _free_index().

ploidy = 2#: Ploidy of the samples. The PGEN format currently only supports diploid (2).

class genoray.SparseVar(path, attrs=None, fields=None)[source]#

Open a Sparse Variant (SVAR) directory.

Parameters:

path (str | Path) – Path to the SVAR directory.
attrs (IntoExpr | None, default: None) – Expression of attributes to load in addition to the ALT and ILEN columns.
fields (Sequence[str] | None, default: None) – Names of fields to load from the SVAR directory. Must be keys of available_fields. Only VCF FORMAT fields with Number=G are currently supported as custom fields.

classmethod from_pgen(out, pgen, max_mem, overwrite=False, with_dosages=False, n_jobs=-1, *, regions=None, samples=None, merge_overlapping=False, regions_overlap='pos', haploid=False)[source]#

Create a Sparse Variant (.svar) from a PGEN.

Parameters:

out (str | Path) – Path to the output directory.
pgen (PGEN) – PGEN file to write from.
max_mem (int | str) – Maximum memory to use while writing.
overwrite (bool, default: False) – Whether to overwrite the output directory if it exists.
with_dosages (bool, default: False) – Whether to write dosages.
n_jobs (int, default: -1) – Number of jobs to use for parallel processing.
regions (str | tuple[str, int, int] | PathLike | object | None, default: None) – Region(s) to include. Accepts the same input types as write_view: a "chrom:start-end" string (1-based, end-inclusive), a (chrom, start, end) tuple (0-based, end-exclusive), a BED file path, or a frame-like. None (default) includes all regions.
samples (str | Sequence[str] | PathLike | None, default: None) – Sample name(s) to include (a name, a sequence of names, or a path to a newline-delimited file). Caller order is preserved, deduped by first occurrence. None (default) includes all samples. Variants whose minor allele count is 0 across the chosen samples are dropped from the output; if every variant drops, a ValueError is raised.
merge_overlapping (bool, default: False) – If False (default) raise on overlapping input regions; if True dedupe via pyranges merge.
regions_overlap (Literal['pos', 'record', 'variant'], default: "pos") – "pos" (default), "record", or "variant" — same semantics as write_view.
haploid (bool, default: False) – If True, OR-collapse the ploidy axis into a single haploid call per sample (a variant present on any haplotype becomes one call) and record ploidy=1 in the output metadata. Intended for unphased somatic data. Default False.

classmethod from_vcf(out, vcf, max_mem, overwrite=False, with_dosages=False, n_jobs=-1, *, regions=None, samples=None, merge_overlapping=False, regions_overlap='pos', haploid=False)[source]#

Create a Sparse Variant (.svar) from a VCF/BCF.

Parameters:

out (str | Path) – Path to the output directory.
vcf (VCF) – VCF file to write from.
max_mem (int | str) – Maximum memory to use while writing.
overwrite (bool, default: False) – Whether to overwrite the output directory if it exists.
with_dosages (bool, default: False) – Whether to write dosages.
n_jobs (int, default: -1) – Number of jobs to use for parallel processing.
regions (str | tuple[str, int, int] | PathLike | object | None, default: None) – Region(s) to include. Accepts the same input types as write_view: a "chrom:start-end" string (1-based, end-inclusive), a (chrom, start, end) tuple (0-based, end-exclusive), a BED file path, or a frame-like. None (default) includes all regions.
samples (str | Sequence[str] | PathLike | None, default: None) – Sample name(s) to include (a name, a sequence of names, or a path to a newline-delimited file). Caller order is preserved, deduped by first occurrence. None (default) includes all samples. Variants whose minor allele count is 0 across the chosen samples are dropped from the output; if every variant drops, a ValueError is raised.
merge_overlapping (bool, default: False) – If False (default) raise on overlapping input regions; if True dedupe via pyranges merge.
regions_overlap (Literal['pos', 'record', 'variant'], default: "pos") – "pos" (default), "record", or "variant" — same semantics as write_view.
haploid (bool, default: False) – If True, OR-collapse the ploidy axis into a single haploid call per sample (a variant present on any haplotype becomes one call) and record ploidy=1 in the output metadata. Intended for unphased somatic data. Default False.
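
The OR-collapse that haploid=True describes can be sketched on a toy carrier matrix. This only illustrates the stated semantics with made-up values; it is not genoray's code.

```python
# Toy carrier flags with shape (samples=2, ploidy=2, variants=3).
genos = [
    [[0, 1, 0], [0, 0, 0]],  # sample 0
    [[1, 0, 0], [1, 1, 0]],  # sample 1
]
n_variants = 3

# A variant present on any haplotype becomes a single haploid call,
# so the ploidy axis shrinks from 2 to 1.
haploid = [
    [[int(any(hap[v] == 1 for hap in sample)) for v in range(n_variants)]]
    for sample in genos
]
print(haploid)  # [[[0, 1, 0]], [[1, 1, 0]]]
```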

read_ranges(contig, starts=0, ends=POS_MAX, samples=None)[source]#

Read the genotypes for the given ranges.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions of the ranges.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.
samples (ArrayLike | None, default: None) – List of sample names to read. If None, read all samples.

Returns:

Ragged[V_IDX_TYPE] with shape (ranges, samples, ploidy, ~variants). When fields are loaded: an awkward record array of the same outer shape where result.genos is Ragged[V_IDX_TYPE] and each additional field (e.g. result.dosages) is a Ragged of its respective dtype. All arrays are backed by memory-mapped data so only the offsets reside in RAM.

Return type:

TypeVar(_SRT)

read_ranges_with_length(contig, starts=0, ends=POS_MAX, samples=None)[source]#

Read genotypes for the given ranges, each with the minimum variants to reach the query length.

This can mean either fewer or more variants than would be returned by read_ranges, depending on the presence of indels.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions of the ranges.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.
samples (ArrayLike | None, default: None) – List of sample names to read. If None, read all samples.

Return type:

TypeVar(_SRT)

Returns:

Same return structure as read_ranges().

var_ranges(contig, starts=0, ends=POS_MAX)[source]#

Get variant index ranges for each query range.

For each query range, return the minimum and maximum indices of the variants that overlap it. Note that some variants within those index ranges may not actually overlap the query range, for example when a deletion spans the start of the query.

Parameters:

contig (str) – Contig name.
starts (ArrayLike, default: 0) – 0-based start positions of the ranges.
ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Returns:

Shape: (ranges, 2). The first column is the start variant index and the second column is the end variant index for each range.

Return type:

ndarray[tuple[Any, ...], dtype[intc]]
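
Why an index range can contain non-overlapping variants is easiest to see on a small case. The sketch below is an assumption-laden illustration, not genoray's algorithm: variants are given as 0-based half-open reference extents, var_range is a hypothetical helper, and the end index is treated as exclusive.

```python
# (start, end) reference extents, 0-based half-open, sorted by start.
variants = [(5, 20), (8, 9), (15, 16)]  # index 0 is a long deletion


def var_range(q_start, q_end):
    """Return (first, last + 1) indices of variants overlapping [q_start, q_end)."""
    hits = [i for i, (s, e) in enumerate(variants) if s < q_end and e > q_start]
    return (hits[0], hits[-1] + 1) if hits else (0, 0)


print(var_range(12, 18))  # (0, 3)
```

Variants 0 and 2 overlap the query, but the returned range (0, 3) also covers variant 1, which ends at position 9 and does not overlap.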

with_fields(fields=None)[source]#

Overloads:

self, fields (Sequence[str]) → SparseVar[Ragged[np.void]]
self, fields (Literal[False]) → SparseVar[Ragged[V_IDX_TYPE]]
self, fields (None) → SparseVar[_SRT]

Return a shallow copy of this SparseVar with updated fields.

Parameters:

fields (Union[Sequence[str], Literal[False], None], default: None) –

None: leave fields unchanged (returns shallow copy).
Sequence[str]: names of fields to load from the SVAR directory. Must be keys of available_fields.
False: drop all fields, returning a SparseVar[Ragged[V_IDX_TYPE]].

Return type:

SparseVar

write_view(regions, samples, output, fields=None, reference=None, merge_overlapping=False, regions_overlap='pos', overwrite=False, threads=None, progress=False)[source]#

Write a subset of this SparseVar to a new directory.

Parameters:

regions (str | tuple[str, int, int] | Path | DataFrame) – Region(s) to include. Accepts the same input types as _normalize_regions(): a "chrom:start-end" string, a (chrom, start, end) tuple, a BED file path, or a polars/pandas/pyranges frame.
samples (str | Sequence[str] | Path) – Samples to include. Accepts a single sample name, a list, or a path to a file of newline-separated names.
output (str | Path) – Destination directory for the new SparseVar.
fields (Sequence[str] | None, default: None) – Fields to carry over (None = all available except "mutcat"; [] = none). The derived mutcat field is never copied positionally by write_view because its mutation codes — especially DBS adjacency — are only valid for the full variant set; subsetting may drop a DBS partner and leave a stale 5’ code. Pass reference= to recompute mutcat on the subset instead (see below). Explicitly including "mutcat" in fields without also providing reference raises a ValueError.
reference (Reference | str | Path | None, default: None) – If provided (a Reference instance, or a path to a FASTA file), annotate_mutations() is called on the output view after all other data have been written, recomputing mutcat codes for the subset. This is the supported way to get a valid mutcat field on a view. When None (default), no annotation is performed and the output will not have a mutcat field. When provided, the FASTA is validated up front (before any output is written) so a bad path fails fast.
merge_overlapping (bool, default: False) – If True silently merge overlapping regions; if False raise ValueError when overlaps are detected.
regions_overlap (Literal['pos', 'record', 'variant'], default: "pos") – How variants are matched to regions — "pos", "record", or "variant". See _resolve_kept_var_idxs().
overwrite (bool, default: False) – Whether to overwrite output if it already exists.
threads (int | None, default: None) – Number of Numba threads to use. None uses all available CPUs.
progress (bool, default: False) – If True, display a phase-level rich progress bar while the view is written (one tick per major step: counting, genotypes, each field, index build, and mutation annotation when reference is given). Defaults to False (no bar, no overhead).

Return type:

None

Notes

Variants whose minor allele count is 0 in the chosen sample subset are dropped from the output. If every candidate variant drops, a ValueError is raised — the same code path that fires when regions itself selects no variants.

contigs: list[str]#: Contigs in the order they appear in the dataset. Variants are only sorted within each contig.

property index: DataFrame#

The full variant index, materialized on first access.

Table of variants with columns CHROM, POS, REF, ALT, ILEN, and any additional attributes specified in attrs on construction.

property n_samples: int#: Number of samples in the dataset.

property n_variants: int#: Number of variants in the dataset.

property nbytes: int#

Total in-memory footprint, in bytes, of resident (non-mmap’d) data held by this reader.

Only the polars variant index counts; genos and fields are memory-mapped and excluded.

class genoray.SparseVar2(path, *, fields=None)[source]#

Reader for a finished SVAR2 store (M6a skeleton).

Loads the top-level meta.json and opens one native genoray._core.PyContigReader per contig. Query methods land in M6b (raw two-channel result) and M6c (decoded seqpro.rag.Ragged).

Parameters:

path (Any)
fields (Sequence[str] | None)

classmethod concat(output, sources, *, mode='copy', overwrite=False)[source]#

Concatenate disjoint-contig SVAR2 stores (identical samples/ploidy/fields) into one.

Return type:

None

Parameters:

output (str | Path)
sources (Sequence[str | Path | 'SparseVar2'])
mode (Mode)
overwrite (bool)

classmethod from_pgen(out, source, reference=None, *, regions=None, samples=None, merge_overlapping=False, regions_overlap='pos', no_reference=False, skip_out_of_scope=False, chunk_size=None, threads=None, overwrite=False, long_allele_capacity=8 * 1024 * 1024, signatures=False, dosages=None, check_ref='e', progress=False, log_level='info')[source]#

Convert a PLINK2 PGEN to an SVAR2 store.

Genotypes are read through the pgenlib package; variant metadata comes from the sibling .pvar/.pvar.zst and sample names from the .psam.

Exactly one of reference or no_reference=True is required, with the same meaning as from_vcf(): with a reference, indels are validated against and left-aligned to the FASTA; with no_reference, both are skipped and the input is trusted to be already normalized. Returns the number of out-of-scope (symbolic/breakend) ALTs dropped (0 unless skip_out_of_scope).

PGEN is diploid, so there is no ploidy parameter.

chunk_size: variants per conversion chunk. Defaults to a value derived from a memory budget, since a packed dense chunk costs chunk_size * n_samples * 2 / 8 bytes.
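
The cost formula above can be inverted to pick a chunk size for a given budget. A minimal sketch: both helpers are hypothetical, and the 256 MiB budget is an arbitrary example value, not genoray's default.

```python
def packed_chunk_bytes(chunk_size, n_samples):
    """Bytes for a packed dense chunk: chunk_size * n_samples * 2 / 8, rounded up."""
    bits = chunk_size * n_samples * 2
    return -(-bits // 8)  # ceiling division


def chunk_size_for_budget(budget_bytes, n_samples):
    """Largest chunk_size whose packed dense chunk fits in budget_bytes."""
    return budget_bytes * 8 // (n_samples * 2)


budget = 256 * 2**20  # example budget of 256 MiB (an assumption)
size = chunk_size_for_budget(budget, n_samples=10_000)
print(size, packed_chunk_bytes(size, 10_000))  # 107374 268435000
```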

regions restricts conversion to one or more .pvar variant-index ranges. Region strings use Genoray’s existing convention: "chrom:start-end" is 1-based inclusive and is converted to a 0-based half-open interval; tuple/BED/frame inputs are already 0-based half-open. Overlapping regions raise unless merge_overlapping=True.
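
The coordinate conversion for region strings can be sketched as below. parse_region is a hypothetical helper written for illustration; genoray's own parsing may differ in validation and accepted syntax.

```python
def parse_region(region):
    """Convert "chrom:start-end" (1-based, inclusive) to a 0-based half-open tuple."""
    # rpartition keeps contig names that themselves contain ":" intact.
    chrom, _, span = region.rpartition(":")
    start, end = (int(x) for x in span.split("-"))
    # Shift only the start: an inclusive 1-based end equals an exclusive 0-based end.
    return chrom, start - 1, end


print(parse_region("chr1:100-200"))  # ('chr1', 99, 200)
```

Both forms cover the same 101 bases, which is a quick sanity check on the off-by-one.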

regions_overlap controls which variants a region keeps, matching bcftools --regions-overlap: "pos" (POS inside [start,end)), "record" (POS in [start,end+1), so an indel at the region’s last base is kept), or "variant" (the anchor-trimmed variant extent overlaps the region). In "variant" mode a multiallelic record is kept whole if ANY of its alleles truly overlaps the region; individual non-overlapping alleles are not dropped.
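
The three modes can be written down directly from the intervals given above. This is a sketch of the stated rules for a single allele, not genoray's implementation: keeps is a hypothetical helper, the anchor-trimmed extent is supplied by the caller, and the multiallelic "kept whole" rule is not modelled.

```python
def keeps(mode, pos, start, end, extent=None):
    """Decide whether a variant is kept by the region [start, end).

    pos is the variant's POS in the region's coordinate system; extent is its
    anchor-trimmed half-open (ext_start, ext_end) span, used by "variant" only.
    """
    if mode == "pos":
        return start <= pos < end
    if mode == "record":
        return start <= pos < end + 1
    if mode == "variant":
        ext_start, ext_end = extent
        return ext_start < end and ext_end > start
    raise ValueError(f"unknown mode: {mode!r}")


# A record sitting exactly at `end` is kept by "record" but not by "pos".
print(keeps("pos", 200, 100, 200), keeps("record", 200, 100, 200))  # False True
# A deletion anchored before the region whose deleted bases reach into it.
print(keeps("pos", 98, 100, 200), keeps("variant", 98, 100, 200, extent=(99, 102)))  # False True
```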

samples selects and reorders .psam samples by name, preserving caller order and de-duplicating first occurrences – the store’s available_samples (and every decoded column) matches that order exactly, regardless of each sample’s original .psam position.

dosages stores one or more .pgen dosage tracks as SVAR2 FORMAT fields, in addition to (never instead of) the hardcalls read as usual. Each DosageField names its source: "self" reads the dosage track from source itself (the same file supplying hardcalls), while a separate .pgen path reads dosages from that file instead (e.g. a VAF/CCF track kept apart because pgenlib derives hardcalls from dosage when both live in one file). A separate dosage file’s .psam samples must match source’s exactly, same names in the same order, and its .pvar must align 1:1 on variant count with source’s. Like every FORMAT field, dosages are genotype-aligned: under var_key routing a non-carrier’s dosage is dropped from the store.

Not supported (and silently ignored rather than errored, where noted):

INFO fields. .pvar INFO extraction is not implemented.

Haplotype resolution for unphased heterozygotes follows the allele-code order pgenlib returns — the same caveat from_vcf() carries for unphased GT.

check_ref: policy for a record whose REF disagrees with the reference FASTA (ignored when no_reference=True). "e" (default) raises and aborts the build, matching bcftools norm --check-ref e. "x" drops the offending record (including a REF that runs past the contig end) and continues, logging a per-contig count. Comparison is case-insensitive, so soft-masked (lowercase) reference bases match.
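
The two policies and the case-insensitive comparison can be sketched as a small function. check_ref_record is a hypothetical helper for illustration; the exception type that genoray actually raises is not stated in this documentation, so ValueError here is an assumption.

```python
def check_ref_record(ref, fasta_bases, policy="e"):
    """Return True to keep the record or False to drop it; raise under policy "e"."""
    # Case-insensitive, so soft-masked (lowercase) reference bases match.
    if ref.upper() == fasta_bases.upper():
        return True
    if policy == "e":
        raise ValueError(f"REF {ref!r} does not match reference {fasta_bases!r}")
    if policy == "x":
        return False
    raise ValueError(f"unknown policy: {policy!r}")


print(check_ref_record("ACG", "acg"))            # True
print(check_ref_record("A", "C", policy="x"))    # False
```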

progress: if True, render live write progress. In a terminal or Jupyter, this is a rich progress bar (one row per in-flight contig); elsewhere it falls back to compact heartbeat lines ("chr1 42% (12,345/29,000) …"), throttled to roughly one line every 5s per contig. Regardless of progress, a one-line "chrom done: N kept, M excluded (Ts)" summary is printed per contig once it finishes, unless log_level="off". Default False keeps writes silent aside from those summaries.

log_level: minimum severity for structured write-time log lines – one of "off", "warning", "info" (default), or "debug". "off" disables all output, including the per-contig summaries and progress rendering (a pure no-op, zero overhead). The environment variable GENORAY_LOG overrides this argument when set to one of the same four values (e.g. GENORAY_LOG=debug for troubleshooting without touching call sites).
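
The override rule can be sketched as below. resolve_log_level is a hypothetical helper that mirrors the description, not genoray's code; the environ parameter exists only to make the sketch testable, and exact-match (case-sensitive) comparison is an assumption.

```python
import os

LEVELS = ("off", "warning", "info", "debug")


def resolve_log_level(log_level="info", environ=None):
    """Pick the effective level: a valid GENORAY_LOG value wins over the argument."""
    environ = os.environ if environ is None else environ
    override = environ.get("GENORAY_LOG")
    # Values outside the four known levels are ignored in this sketch.
    return override if override in LEVELS else log_level


print(resolve_log_level("info", {"GENORAY_LOG": "debug"}))    # debug
print(resolve_log_level("warning", {}))                       # warning
print(resolve_log_level("info", {"GENORAY_LOG": "verbose"}))  # info
```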

Return type:

int

Parameters:

out (str | Path)
source (str | Path)
reference (str | Path | None)
regions (str | tuple[str, int, int] | PathLike | object | None)
samples (str | Sequence[str] | PathLike | None)
merge_overlapping (bool)
regions_overlap (Literal['pos', 'record', 'variant'])
no_reference (bool)
skip_out_of_scope (bool)
chunk_size (int | None)
threads (int | None)
overwrite (bool)
long_allele_capacity (int)
signatures (bool)
dosages (Sequence[DosageField] | None)
check_ref (Literal['e', 'x'])
progress (bool)
log_level (Literal['off', 'warning', 'info', 'debug'])

classmethod from_svar1(out, source, reference=None, *, regions=None, samples=None, merge_overlapping=False, regions_overlap='pos', no_reference=False, skip_out_of_scope=False, chunk_size=None, threads=None, overwrite=False, long_allele_capacity=8 * 1024 * 1024, signatures=False, fields=None, check_ref='e', progress=False, log_level='info')[source]#

Convert a SVAR1 (SparseVar) store to an SVAR2 store natively.

Reads no VCF and no htslib: SVAR1 is already sparse, so this reconstructs variant records from SVAR1’s arrays and reuses the same conversion spine as from_vcf().

Exactly one of reference or no_reference=True is required, same meaning as from_vcf(). ploidy is read from SVAR1’s metadata. Returns the number of out-of-scope (symbolic/breakend) ALTs dropped.

Only biallelic SVAR1 stores are supported (SVAR1’s geno==1 model); multiallelic input raises. fields selects which SVAR1 FORMAT fields (e.g. dosages) are carried through: None (default) carries all of them, matching prior behavior; [] carries none; a subset of names carries only those. An unknown name raises ValueError listing the available fields. mutcat is never selectable this way and is always dropped (pass signatures=True to recompute signatures from the reference). Because SVAR1 discarded non-carrier FORMAT values, a dense-routed variant’s non-carrier cells are filled with the field’s default/missing sentinel — field output is byte-identical to from_vcf() only for var_key (carrier-only) routing.

regions restricts conversion to one or more genomic ranges. Region strings use Genoray’s existing convention: "chrom:start-end" is 1-based inclusive and is converted to a 0-based half-open interval; tuple/BED/frame inputs are already 0-based half-open. Overlapping regions raise unless merge_overlapping=True. Unlike from_pgen(), SVAR1 has no on-disk covering-range index to narrow against, so a selected contig’s local variants are still scanned in full – the per-record filter (in the Rust Svar1RecordSource) is what actually restricts the output.

samples selects and reorders SVAR1 samples by name, preserving caller order and de-duplicating first occurrences – the store’s available_samples (and every decoded column) matches that order exactly, regardless of each sample’s original SVAR1 position.
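The ordering rule for samples (caller order kept, repeats collapsed to their first occurrence) can be sketched in plain Python. The helper name select_samples is hypothetical and is not part of genoray; it only illustrates the stated rule.

```python
def select_samples(requested):
    """Keep caller order and drop repeats after their first occurrence."""
    # dict preserves insertion order, so fromkeys() de-duplicates stably.
    return list(dict.fromkeys(requested))


order = select_samples(["NA12878", "HG002", "NA12878", "HG001", "HG002"])
print(order)
```

The result depends only on the order of the request, never on where each sample sits in the source store.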

Return type:

int

Parameters:

out (str | Path)
source (str | Path)
reference (str | Path | None)
regions (str | tuple[str, int, int] | PathLike | object | None)
samples (str | Sequence[str] | PathLike | None)
merge_overlapping (bool)
regions_overlap (Literal['pos', 'record', 'variant'])
no_reference (bool)
skip_out_of_scope (bool)
chunk_size (int | None)
threads (int | None)
overwrite (bool)
long_allele_capacity (int)
signatures (bool)
fields (Sequence[str] | None)
check_ref (Literal['e', 'x'])
progress (bool)
log_level (Literal['off', 'warning', 'info', 'debug'])

classmethod from_vcf(out, source, reference=None, *, regions=None, samples=None, merge_overlapping=False, regions_overlap='pos', no_reference=False, skip_out_of_scope=False, ploidy=2, chunk_size=25_000, threads=None, overwrite=False, long_allele_capacity=8 * 1024 * 1024, signatures=False, info_fields=None, format_fields=None, check_ref='e', progress=False, log_level='info')[source]#

Convert a bgzipped VCF or BCF to an SVAR2 store.

Exactly one of reference or no_reference=True is required. With a reference, indels are validated against and left-aligned to the FASTA; with no_reference, validation and left-alignment are skipped and the input is trusted to be already normalized. Returns the number of out-of-scope (symbolic/breakend) ALTs dropped (0 unless skip_out_of_scope).

regions restricts conversion to one or more indexed VCF fetch intervals. Region strings use Genoray’s existing convention: "chrom:start-end" is 1-based inclusive and is converted to a 0-based half-open interval; tuple/BED/frame inputs are already 0-based half-open. Overlapping regions raise unless merge_overlapping=True.
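As a minimal sketch of the coordinate convention above, the following converts a 1-based inclusive region string into a 0-based half-open tuple. The function parse_region is a hypothetical illustration, not genoray's own parser, and it assumes plain integers without thousands separators.

```python
def parse_region(region: str) -> tuple[str, int, int]:
    """Convert "chrom:start-end" (1-based inclusive) to (chrom, start, end) 0-based half-open."""
    # rpartition tolerates contig names that themselves contain ":".
    chrom, _, span = region.rpartition(":")
    start_str, _, end_str = span.partition("-")
    start, end = int(start_str), int(end_str)
    if not chrom or start < 1 or end < start:
        raise ValueError(f"malformed region: {region!r}")
    # [start, end] 1-based inclusive covers the same bases as [start - 1, end) 0-based.
    return chrom, start - 1, end


print(parse_region("chr1:100-200"))
```

Only the start shifts by one; the end value is unchanged because the inclusive 1-based end and the exclusive 0-based end are the same number.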

samples selects and reorders VCF samples by name, preserving caller order and de-duplicating first occurrences.

signatures: if True, classify SBS96/ID83 codes during the write and store the mutcat sidecar (factored into the dense/var_key cost model). Requires a reference; raises if no_reference=True.

info_fields, format_fields: scalar-numeric (Integer/Float, and Flag for INFO) header fields to carry through to the SVAR2 store. Each entry is either a bare field name (dtype auto-narrowed from the header, no default fill) or an InfoField/FormatField spec (explicit dtype/default). default fills VCF-missing entries; otherwise a reserved sentinel/NaN is written. FORMAT fields are genotype-aligned: non-carrier values are dropped for var_key-routed variants.

log_level: minimum severity for structured write-time log lines — one of “off”, “warning”, “info” (default), or “debug”. “off” disables all output, including the per-contig summaries and progress rendering (a pure no-op, zero overhead). The environment variable GENORAY_LOG overrides this argument when set to one of the same four values (e.g. GENORAY_LOG=debug for troubleshooting without touching call sites).
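The precedence between the log_level argument and GENORAY_LOG can be sketched as follows. The helper effective_log_level is hypothetical; it assumes that an environment value outside the four documented levels is ignored and that matching is case-sensitive, neither of which the text above confirms.

```python
import os

LEVELS = ("off", "warning", "info", "debug")


def effective_log_level(log_level="info", environ=None):
    """Return the level that wins: a valid GENORAY_LOG overrides the argument."""
    environ = os.environ if environ is None else environ
    override = environ.get("GENORAY_LOG")
    if override in LEVELS:  # assumption: other values do not override
        return override
    if log_level not in LEVELS:
        raise ValueError(f"log_level must be one of {LEVELS}")
    return log_level


print(effective_log_level("info", {"GENORAY_LOG": "debug"}))
```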

Return type:

int

Parameters:

out (str | Path)
source (str | Path)
reference (str | Path | None)
regions (str | tuple[str, int, int] | PathLike | object | None)
samples (str | Sequence[str] | PathLike | None)
merge_overlapping (bool)
regions_overlap (Literal['pos', 'record', 'variant'])
no_reference (bool)
skip_out_of_scope (bool)
ploidy (int)
chunk_size (int)
threads (int | None)
overwrite (bool)
long_allele_capacity (int)
signatures (bool)
info_fields (Sequence[str | InfoField] | None)
format_fields (Sequence[str | FormatField] | None)
check_ref (Literal['e', 'x'])
progress (bool)
log_level (Literal['off', 'warning', 'info', 'debug'])

classmethod from_vcf_list(out, sources, reference=None, *, regions=None, merge_overlapping=False, regions_overlap='pos', no_reference=False, skip_out_of_scope=False, ploidy=2, chunk_size=None, max_mem=None, threads=None, overwrite=False, long_allele_capacity=8 * 1024 * 1024, signatures=False, info_fields=None, format_fields=None, check_ref='e', progress=False, log_level='info')[source]#

Build one SVAR2 store from many single-sample VCFs/BCFs via a native k-way merge (no bcftools merge, no intermediate multi-sample VCF).

Each file in sources must have exactly one sample column; that sample becomes one sample in the resulting store, named after its VCF header sample name (duplicates across files are rejected). A site present in some input files but absent from another is filled hom-ref (`0`) for the samples that lack it. An in-file ./. (missing) call is not separately preserved once merged: SVAR2’s sparse layout stores only ALT-carrying entries, so a missing hap and a hom-ref hap both produce zero entries and are indistinguishable through decode or region_counts. (The -1 missing sentinel is a dense genoray.VCF/genoray.PGEN convention; it is not part of SVAR2’s decode.) The merge is join-on-atom: files are merged one contig at a time by walking each file’s already-sorted record stream in lockstep, so a variant is one shared row in the output store iff its normalized (pos, ref, alt) atom matches exactly across files, not merely its position.
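The join-on-atom behaviour can be illustrated with a small pure-Python k-way merge. This is a sketch of the idea only, not genoray's Rust implementation: merge_atoms is a hypothetical helper, each input is a pre-sorted list of (pos, ref, alt) tuples for one contig, and a single 0/1 carrier flag stands in for the real genotype.

```python
import heapq
from itertools import groupby


def merge_atoms(streams):
    """Merge per-file sorted atom lists into rows of (atom, per-sample carrier flags)."""
    n = len(streams)
    # Tag every atom with its file index; heapq.merge walks the sorted
    # streams in lockstep and yields one globally sorted stream.
    tagged = [[(atom, i) for atom in stream] for i, stream in enumerate(streams)]
    rows = []
    for atom, group in groupby(heapq.merge(*tagged), key=lambda t: t[0]):
        carriers = [0] * n  # samples lacking the atom stay hom-ref (0)
        for _, file_index in group:
            carriers[file_index] = 1
        rows.append((atom, carriers))
    return rows


file_a = [(100, "A", "G"), (250, "CT", "C")]
file_b = [(100, "A", "T"), (250, "CT", "C")]
rows = merge_atoms([file_a, file_b])
for row in rows:
    print(row)
```

The two records at position 100 stay separate rows because their ALT alleles differ, while the deletion at 250 becomes one shared row carried by both samples.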

sources accepts three forms (resolved by module-level _resolve_vcf_sources):

a Sequence of paths – explicit, in the given order.
a single directory Path – every *.vcf.gz then every *.bcf directly inside it (non-recursive), each group name-sorted.
a single file Path – if it ends in .vcf.gz/.bcf, that one file; otherwise treated as a manifest (one path per line, blank and #-comment lines skipped, relative entries resolved against the manifest’s directory).
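The three accepted forms can be sketched as a small resolver. This is an illustrative reimplementation under the rules listed above, not the library's _resolve_vcf_sources; the name resolve_sources and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_sources(sources):
    """Resolve a sequence, a directory, a single VCF/BCF, or a manifest to a list of paths."""
    if not isinstance(sources, (str, Path)):
        return [Path(p) for p in sources]  # explicit sequence, order kept
    path = Path(sources)
    if path.is_dir():
        # Non-recursive: all *.vcf.gz first, then all *.bcf, each group sorted.
        return sorted(path.glob("*.vcf.gz")) + sorted(path.glob("*.bcf"))
    if path.name.endswith((".vcf.gz", ".bcf")):
        return [path]
    entries = []
    for line in path.read_text().splitlines():  # anything else is a manifest
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entry = Path(line)
        entries.append(entry if entry.is_absolute() else path.parent / entry)
    return entries


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.vcf.gz", "a.vcf.gz", "c.bcf"):
        (root / name).touch()
    manifest = root / "cohort.txt"
    manifest.write_text("# cohort\na.vcf.gz\n\nc.bcf\n")
    from_dir = [p.name for p in resolve_sources(root)]
    from_manifest = [p.name for p in resolve_sources(manifest)]

print(from_dir, from_manifest)
```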

As with from_vcf(), each input VCF’s records must already be position-sorted per contig; an unsorted file raises ValueError naming the offending file and positions rather than silently corrupting the k-way merge.

Every input file must also use the same contig naming scheme (e.g. all chr1-style or all 1-style) – the merge matches contigs by an exact per-file string, so a cohort mixing schemes raises ValueError up front (naming the conflicting files/spellings) instead of silently producing a half-hom-ref-filled store.

Opens all N input files concurrently (one file descriptor per file, per contig); at large N (roughly N > (RLIMIT_NOFILE - 64) / 2) this raises ValueError with the ulimit -n remedy rather than htslib’s more confusing “no index?” error. There is no batched/hierarchical merge to fall back on for very large cohorts (future work) – raise the open-file limit instead.
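To estimate in advance whether a cohort fits under the current open-file limit, the rough bound above can be evaluated with the standard-library resource module (Unix only). The helper max_mergeable_files is hypothetical and simply applies the approximate formula quoted in the text; the library's exact check may differ.

```python
import resource


def max_mergeable_files(soft_limit: int, reserved: int = 64) -> int:
    """Approximate largest N allowed by N <= (RLIMIT_NOFILE - 64) / 2."""
    return max((soft_limit - reserved) // 2, 0)


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft == resource.RLIM_INFINITY:
    print("no soft limit on open files")
else:
    print(f"soft limit {soft}: about {max_mergeable_files(soft)} input files")
```

With the common default soft limit of 1024, the bound works out to roughly 480 files, so larger cohorts need a higher ulimit -n before the merge is started.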

Exactly one of reference or no_reference=True is required, with the same semantics as from_vcf(): with a reference, atoms are validated against it and left-aligned before merging; with no_reference, both are skipped and each atom’s REF is reconstructed from its own record’s REF bytes. Caveat specific to this method: because merging is a per-contig k-way join on normalized (pos, ref, alt) atoms across independently produced files, skipping left-alignment under no_reference means a shared site only joins into one output row if every input file already represents it identically (e.g. all inputs came from the same caller, or were all already run through bcftools norm against the same reference). Two files encoding the same indel differently (different anchor base, different padding) will NOT join under no_reference – they surface as two separate variants in the output store instead of one shared row, silently. signatures=True requires a reference (not no_reference).

info_fields/format_fields: same declaration API as from_vcf() (resolved against the FIRST file in sources’ header). INFO fields merge first-carrier-wins: when a site is shared across files, the value comes from the lowest-numbered (earliest in sources order) file that carries the atom, not the last or the max. FORMAT fields remain per-sample, exactly as in from_vcf: each sample gets its own file’s value, and a sample that doesn’t carry the atom gets the field’s default.

regions restricts the merge to one or more indexed VCF fetch intervals, with the same convention as from_vcf(): "chrom:start-end" is 1-based inclusive (converted to 0-based half-open); tuple/BED/frame inputs are already 0-based half-open. Overlapping regions raise unless merge_overlapping=True. regions_overlap controls which variants a region keeps, matching bcftools --regions-overlap: “pos” (POS inside [start,end)), “record” (POS in [start,end+1), so an indel at the region’s last base is kept), or “variant” (the anchor-trimmed variant extent overlaps the region). In “variant” mode a multiallelic record is kept whole if ANY of its alleles truly overlaps the region; individual non-overlapping alleles are not dropped. The mode applies identically to every input file in the merge.
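The three regions_overlap modes can be compared with a small predicate. This sketch takes the intervals literally from the description above and assumes the anchor-trimmed variant extent is supplied as a half-open [var_start, var_end) interval in the same coordinates as the region; region_keeps is a hypothetical helper, not a genoray function.

```python
def region_keeps(mode, pos, var_start, var_end, start, end):
    """Decide whether a record is kept by region [start, end) under the given mode."""
    if mode == "pos":
        return start <= pos < end
    if mode == "record":
        return start <= pos < end + 1
    if mode == "variant":
        # Assumed half-open interval overlap on the trimmed variant extent.
        return var_start < end and var_end > start
    raise ValueError(f"unknown mode: {mode!r}")


# A deletion anchored just left of the region whose deleted bases fall inside it.
deletion = dict(pos=99, var_start=100, var_end=103, start=100, end=200)
for mode in ("pos", "record", "variant"):
    print(mode, region_keeps(mode, **deletion))
```

In this example only “variant” keeps the record, because POS itself lies outside the region while the trimmed extent overlaps it.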

from_vcf_list has no samples parameter – each input is single-sample and the cohort is defined by sources.

Returns the number of out-of-scope (symbolic/breakend) ALTs dropped (0 unless skip_out_of_scope).

chunk_size: variants per conversion chunk. When omitted (None, default), a memory-budget-derived value (_auto_chunk_size) is used so one packed dense chunk stays ~`_DENSE_CHUNK_TARGET_BYTES` regardless of cohort size — the same default from_pgen and from_svar1 already use. Scope: this bounds only the dense-chunk term, which is a small fraction of peak RAM at typical cohort sizes (the budget does not bite until roughly 43k inputs, below which it returns the historical 25_000). It is a large-cohort guardrail, not a fix for overall RAM scaling in the number of input files. Pass an int to override with a fixed count.

max_mem caps the bytes one in-flight dense chunk may occupy, sizing chunk_size against the packed genotype grid plus staged FORMAT values. It is a worst-case ceiling: the estimate assumes every variant routes dense, so cohorts whose variants route sparse (e.g. private somatic calls) use considerably less.

For large multi-contig merges, also consider setting MALLOC_ARENA_MAX in the environment – see the note below.

Note

Peak RSS on very large multi-contig merges is dominated by glibc arena behaviour, not by live data: glibc sizes its arena count from the machine’s core count (8 x ncores, so 768 on a 96-core node) and never unmaps a heap once created.

Setting MALLOC_ARENA_MAX=2 in the environment before the process starts can help – measured 9.20 GB -> 6.89 GB peak on a 3-contig, 1000-file merge at no time cost. It is not a safe default: with thousands of concurrent readers contending on two arena locks the same knob measured 12% worse RAM and 73% slower on a 4000-file single-contig merge. Measure before adopting it.
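Because the variable has to be present before the process starts, one way to try it from Python is to launch the conversion in a child process with a modified environment. The sketch below only demonstrates that the child sees the value; the child command is a stand-in for a real conversion script.

```python
import os
import subprocess
import sys

# Copy the parent environment and add the knob for the child only.
env = {**os.environ, "MALLOC_ARENA_MAX": "2"}

child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['MALLOC_ARENA_MAX'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
seen = child.stdout.strip()
print(seen)
```

Scoping the variable to one child process keeps it from affecting other workloads, which matters given the regression reported for the 4000-file case.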

Return type:

int

Parameters:

out (str | Path)
sources (str | Path | Sequence[str | Path])
reference (str | Path | None)
regions (str | tuple[str, int, int] | PathLike | object | None)
merge_overlapping (bool)
regions_overlap (Literal['pos', 'record', 'variant'])
no_reference (bool)
skip_out_of_scope (bool)
ploidy (int)
chunk_size (int | None)
max_mem (int | str | None)
threads (int | None)
overwrite (bool)
long_allele_capacity (int)
signatures (bool)
info_fields (Sequence[str | InfoField] | None)
format_fields (Sequence[str | FormatField] | None)
check_ref (Literal['e', 'x'])
progress (bool)
log_level (Literal['off', 'warning', 'info', 'debug'])

split_by_contig(out_dir, *, mode='copy', overwrite=False)[source]#

Explode into one single-contig store per contig at out_dir/{contig}.svar2.

Return type:

list[Path]

Parameters:

out_dir (str | Path)
mode (Literal['copy', 'hardlink', 'symlink', 'move'])
overwrite (bool)

subset_contigs(output, contigs, *, mode='copy', overwrite=False)[source]#

Write a new SVAR2 store containing only contigs (metadata + file copy).

Return type:

None

Parameters:

output (str | Path)
contigs (str | Sequence[str])
mode (Literal['copy', 'hardlink', 'symlink', 'move'])
overwrite (bool)

with_fields(fields)[source]#

A new reader over the same store that also decodes fields.

Keys are those of available_fields: the bare field name when it is unique across INFO/FORMAT, else bcftools-style INFO/DP / FORMAT/DP.

Return type:

SparseVar2

Parameters:

fields (Sequence[str])
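The key-naming rule can be sketched as follows: a name that appears in only one category keeps its bare form, and a name shared by INFO and FORMAT is qualified. The helper field_keys is hypothetical and only mirrors the rule stated above.

```python
from collections import Counter


def field_keys(fields):
    """Map (category, name) pairs to keys: bare when unique, CATEGORY/name otherwise."""
    counts = Counter(name for _, name in fields)
    return [
        name if counts[name] == 1 else f"{category}/{name}"
        for category, name in fields
    ]


keys = field_keys([("INFO", "AF"), ("INFO", "DP"), ("FORMAT", "DP")])
print(keys)
```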

write_view(regions, samples, output, fields=None, reference=None, *, merge_overlapping=False, regions_overlap='pos', reroute='auto', overwrite=False, threads=None, progress=False, log_level='info')[source]#

Write a region/sample subset of this store to output.

regions/samples accept the same inputs as the query methods (region string, (chrom, start, end) tuple, BED path, or a samples sequence / path to a sample list). regions_overlap controls how a variant’s span is matched against the requested regions (“pos”/“record”/“variant”; see _normalize_regions/_resolve_kept_rows); merge_overlapping silently merges overlapping input regions instead of raising.

fields defaults to None, meaning no fields are carried through (genotypes only) — this always succeeds, even on a store that has INFO/FORMAT fields (available_fields non-empty). “mutcat” is always excluded from fields — pass reference= to recompute it instead of copying.

Both reroute=True and reroute=False go through the same slicer backend, which carries fields and recomputes mutcat from reference (when given) on either path:

reroute=True reruns the var_key/dense routing cost model over the subset. This is size-optimal (each variant is re-routed to whichever representation is smaller for the subset’s sample/carrier counts).
reroute=False directly slices each variant’s existing on-disk representation (byte-copy, no cost model), preserving the representation regardless of the subset’s sample/carrier counts. Recommended when the subset is expected to route the same way as the source anyway (e.g. slicing somatic/rare-variant cohorts, where nearly every variant is already var_key-routed) or when the view must be produced under tight memory constraints.
“auto” (default) resolves to False when any FORMAT field is carried (any entry of fields other than “mutcat” whose available_fields[…].category == “format”), True otherwise. A dense->var_key flip stores one value per carrier call and has no slot for a non-carrier sample’s FORMAT value, so re-routing a source-dense variant under a FORMAT-carrying view would silently drop that value; “auto” prefers fidelity in that case and takes the size-optimal re-route otherwise (genotype-only / INFO-only views, which have no per-sample slot to lose).
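The resolution of reroute="auto" can be sketched as a small function. The names resolve_reroute and categories are hypothetical; categories stands in for a lookup of each field's category, which the real reader exposes through available_fields.

```python
def resolve_reroute(reroute, fields, categories):
    """Resolve reroute to a bool: "auto" is False when any FORMAT field is carried."""
    if reroute != "auto":
        return bool(reroute)
    carried = [f for f in (fields or []) if f != "mutcat"]
    carries_format = any(categories[f] == "format" for f in carried)
    # Prefer fidelity (no re-route) when a per-sample FORMAT value could be lost.
    return not carries_format


categories = {"AF": "info", "DS": "format"}
print(resolve_reroute("auto", None, categories))
print(resolve_reroute("auto", ["AF"], categories))
print(resolve_reroute("auto", ["AF", "DS"], categories))
```

An explicit True or False is passed through unchanged, so a caller who knows the subset routes like the source can still force either path.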

progress: if True, render live write progress. Unlike the other from_* writers, write_view has no per-record stream to sample – progress here is coarse, one line per contig (no within-contig bar movement): in a terminal or Jupyter this is a live-updating list of in-flight/finished contigs, elsewhere compact “chrom done” lines as each contig finishes. Regardless of progress, a one-line “chrom done: N kept, 0 excluded (Ts)” summary is printed per contig once it finishes, unless log_level=“off” (slicing never excludes variants, so excluded is always 0). Default False keeps writes silent aside from those summaries.

threads caps the number of contigs sliced concurrently (autodetected from available CPUs when None), same convention as from_vcf. Peak memory is O(output size) per in-flight contig times threads; with reference= given, each in-flight contig additionally holds that contig’s reference sequence in memory.

Raises FileExistsError if output exists and overwrite=False, and ValueError if output resolves to this store’s own path (writing a view in place is not supported).

Return type:

None

Parameters:

regions (str | tuple[str, int, int] | Path | object)
samples (str | Sequence[str] | Path)
output (str | Path)
fields (Sequence[str] | None)
reference (str | Path | None)
merge_overlapping (bool)
regions_overlap (Literal['pos', 'record', 'variant'])
reroute (bool | Literal['auto'])
overwrite (bool)
threads (int | None)
progress (bool)
log_level (Literal['off', 'warning', 'info', 'debug'])

class genoray.Reference(path, contigs)[source]#

A reference genome backed by an indexed FASTA, read on demand via pysam.

One contig is held in memory at a time and sliced for flanking-base lookups. Queries accept chr-prefixed or unprefixed contig names interchangeably.

Do not instantiate directly; use Reference.from_path().

Parameters:

path (Path)
contigs (list[str])

contig_array(contig)[source]#

Return the full contig sequence as a cached uint8 array.

Accepts chr-prefixed or unprefixed names. One contig is held in memory at a time (shared with fetch()).

Return type:

ndarray[tuple[Any, ...], dtype[ubyte]]

Parameters:

contig (str)

fetch(contig, start, end)[source]#

Return reference bytes for 0-based half-open [start, end).

Positions outside the contig are padded with N. Returns a uint8 array; bytes(...) gives the ASCII sequence.

Return type:

ndarray[tuple[Any, ...], dtype[ubyte]]

Parameters:

contig (str)
start (int)
end (int)
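The padding behaviour of fetch() can be illustrated without a FASTA. This sketch works on a plain bytes object rather than the uint8 array the method returns, and it assumes that positions before 0 are padded the same way as positions past the contig end; fetch_padded is a hypothetical stand-in.

```python
def fetch_padded(contig_seq: bytes, start: int, end: int) -> bytes:
    """Return bases for 0-based half-open [start, end), padding out-of-contig positions with N."""
    length = len(contig_seq)
    # Clamp both ends into [0, length] before slicing to avoid negative-index wraparound.
    lo = min(max(start, 0), length)
    hi = min(max(end, 0), length)
    left_pad = max(0, min(end, 0) - start)        # positions below 0
    right_pad = max(0, end - max(start, length))  # positions at or past the contig end
    return b"N" * left_pad + contig_seq[lo:hi] + b"N" * right_pad


print(fetch_padded(b"ACGT", -2, 3))
print(fetch_padded(b"ACGT", 2, 6))
```

The returned length always equals end - start, which is what lets callers take fixed-width flanking windows near contig edges.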

genoray.cosmic_signatures(kind, *, version='3.4', genome='GRCh38')[source]#

Fetch (and cache) the COSMIC reference signatures for kind.

Parameters:

kind (Literal['SBS96', 'DBS78', 'ID83', 'SBS192', 'SBS384']) – One of "SBS96", "DBS78", "ID83".
version (str, default: "3.4") – COSMIC signature release (default "3.4").
genome (str, default: "GRCh38") – Reference build for SBS/DBS ("GRCh37" or "GRCh38"). Ignored for ID83 (indel signatures are build-independent in the COSMIC release).

Returns:

A MutationType column (in genoray’s canonical codebook order for kind) followed by one column per COSMIC signature, ready to pass to fit_signatures().

Return type:

DataFrame

genoray.fit_signatures(catalogue, reference, *, max_delta=0.01, min_activity=0.005, n_jobs=1, backend='loky')[source]#

Refit a mutation catalogue against reference signatures.

Parameters:

catalogue (DataFrame) – A mutation_matrix-shaped frame: a MutationType column followed by one numeric column per sample.
reference (DataFrame) – A MutationType column followed by one column per reference signature. Columns need not be pre-normalized; each is scaled to sum 1 so reported activities are in mutation-count units.
max_delta (float, default: 0.01) – Minimum cosine-similarity improvement to keep adding a signature (forward-selection stop criterion).
min_activity (float, default: 0.005) – Minimum fractional contribution; signatures below this are pruned.
n_jobs (int, default: 1) – Number of parallel workers for the per-sample refit (passed to joblib.Parallel). 1 (default) runs serially; -1 uses all cores. Results are identical regardless of n_jobs.
backend (str, default: "loky") – joblib backend (default "loky", process-based). Samples are refit independently, so a process backend avoids GIL contention from the forward-selection orchestration.

Returns:

One row per sample: a Sample column, one Float column per reference signature (activities, 0.0 if unselected), and a cosine_similarity column for the final reconstruction.

Return type:

DataFrame

Raises:

ValueError – If a MutationType present in the catalogue is missing from the reference (rows cannot be aligned).
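Two details above lend themselves to a tiny worked example: why scaling each reference column to sum 1 puts activities in mutation-count units, and what the reported cosine_similarity measures. The sketch uses one signature and three mutation types with made-up numbers; it does not reproduce the forward-selection refit itself, and the helper names are hypothetical.

```python
import math


def scale_to_unit_sum(column):
    """Scale a signature column so its entries sum to 1."""
    total = sum(column)
    return [x / total for x in column]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


signature = scale_to_unit_sum([2.0, 6.0, 2.0])  # un-normalized reference column
observed = [20.0, 60.0, 20.0]                   # one sample's mutation counts
activity = sum(observed)                        # mutations attributed to the signature
reconstruction = [activity * w for w in signature]
similarity = cosine_similarity(observed, reconstruction)
print(activity, similarity)
```

Because the signature sums to 1, the reconstruction sums to the activity, so an activity reads directly as a number of mutations.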

`genoray.exprs`#

Polars expressions for filtering a genoray index (extension .gvi).

These require the minimum set of index columns:

"CHROM" : pl.Utf8
"POS" : pl.Int64
"REF" : pl.Utf8
"ALT" : pl.List[Utf8]
"ILEN" : pl.List[Int32]

Applicable to PGEN indexes, and to VCF indexes when one has been built.

Note

For PGEN, all columns that existed in the underlying PVAR will be available in the index.

genoray.exprs.ILEN = <Expr ['[(col("ALT").list.eval(element…']>#

Indel length of the variant. Positive for insertions, negative for deletions, and zero for SNPs and MNPs.

genoray.exprs.is_biallelic = <Expr ['[(col("ALT").list.length()) ==…']>#

True if the variant is biallelic (one ALT allele).

genoray.exprs.is_breakend = <Expr ['col("ALT").list.eval(element()…']>#

True if any ALT allele is a breakend (BND) in mate-pair / single-breakend notation (e.g. G[chr2:321[, ]chr2:321]G, .TGCA, TGCA.), per the VCF 4.x spec (§5.4).

Breakends are a distinct ALT class from symbolic <...> alleles, so is_symbolic does not flag them. But like symbolic alleles they are not expandable into nucleotides — the bracket/colon/position bytes corrupt personalized DNA buffers in haplotype consumers (e.g. genvarloader). Their _symbolic_ilen() value is null (so they are also is_imprecise).

To drop breakends, use expr=~genoray.exprs.is_breakend in a genoray.Filter. To drop all un-expandable ALTs (symbolic + breakends) for haplotype consumers, combine:

expr=~genoray.exprs.is_symbolic & ~genoray.exprs.is_breakend

genoray.exprs.is_imprecise = <Expr ['col("ILEN").list.eval(element(…']>#

True if any ALT allele’s ILEN could not be precisely determined (an un-sizable symbolic allele: IMPRECISE, missing SVLEN/END, or an unsupported symbolic type). Such alleles carry null ILEN. Filter them out with expr=~genoray.exprs.is_imprecise (in a genoray.Filter for VCF, or directly for PGEN) to keep precise structural variants while dropping the rest; use ~genoray.exprs.is_symbolic to drop all symbolic alleles (required for haplotype consumers such as genvarloader, which cannot expand any symbolic ALT).

genoray.exprs.is_indel = <Expr ['col("ILEN").list.eval([([(elem…']>#

True if all ALT alleles are indels (insertions or deletions).

Un-sizable symbolic alleles (null ILEN) are treated as neither SNP nor indel: a row containing any null ILEN element evaluates to False.

genoray.exprs.is_snp = <Expr ['col("ILEN").list.eval([([(elem…']>#

True if all ALT alleles are SNPs (single nucleotide polymorphisms).

Un-sizable symbolic alleles (null ILEN) are treated as neither SNP nor indel: a row containing any null ILEN element evaluates to False.

genoray.exprs.is_symbolic = <Expr ['col("ALT").list.eval(element()…']>#

True if any ALT allele is a symbolic allele (e.g. <DEL>, <INS>, <DUP>, <INV>, <CNV>, <BND> — anything matching <…> per the VCF 4.x spec).

Symbolic ALTs are placeholders for structural variants whose exact replacement nucleotides are unknown. Downstream haplotype injection (e.g. via genvarloader) cannot expand them — the literal <DEL> ASCII bytes end up in personalized DNA buffers and become non-canonical bytes for translators.

To drop symbolic records, pass this as a filter. For PGEN, the single filter expression suffices:

pgen = genoray.PGEN("file.pgen", filter=~genoray.exprs.is_symbolic)

For VCF, bundle it with the equivalent cyvcf2 record predicate in a genoray.Filter (both are required):

vcf = genoray.VCF(
    "file.vcf.gz",
    filter=genoray.Filter(
        record=lambda rec: not any(a.startswith("<") for a in rec.ALT),
        expr=~genoray.exprs.is_symbolic,
    ),
)

SparseVar.from_vcf / from_pgen inherit the source’s filter, so the SVAR is filtered to match.
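For the record side of a genoray.Filter, the distinction between symbolic and breakend ALT strings can be sketched with plain string checks. These helpers are hypothetical approximations of the VCF notation described above, not the expressions genoray evaluates; the function names are reused here only for readability.

```python
def is_symbolic(alt: str) -> bool:
    """True for angle-bracketed placeholders such as <DEL> or <INS>."""
    return alt.startswith("<") and alt.endswith(">")


def is_breakend(alt: str) -> bool:
    """True for mate-pair (bracketed) or single-breakend (leading/trailing dot) notation."""
    if "[" in alt or "]" in alt:
        return True
    # A lone "." means a missing ALT, not a breakend, hence the length check.
    return len(alt) > 1 and (alt.startswith(".") or alt.endswith("."))


def is_expandable(alt: str) -> bool:
    """True when the ALT is neither symbolic nor a breakend."""
    return not (is_symbolic(alt) or is_breakend(alt))


for alt in ("A", "<DEL>", "G[chr2:321[", "]chr2:321]G", ".TGCA", "TGCA."):
    print(alt, is_symbolic(alt), is_breakend(alt))
```

A record predicate built on is_expandable would be written as lambda rec: all(is_expandable(a) for a in rec.ALT), the record-side counterpart of ~is_symbolic & ~is_breakend.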
