API#

class genoray.VCF(path, filter=None, pl_filter=None, phasing=False, dosage_field=None, progress=False, with_gvi_index=True)[source]#

Create a VCF reader.

Parameters:
  • path (str | Path) – Path to the VCF file.

  • filter (Callable[[Variant], bool] | None, default: None) –

    Function to filter variants. Should return True for variants to keep.

    Note

    To avoid KeyErrors, this function needs to be tolerant to missing fields. For example, if you access an INFO or FORMAT field, not all variants are guaranteed to have the same fields. The cyvcf2.Variant API provides the .get method on the INFO and FORMAT attributes. For example, lambda v: v.INFO.get("AF", 0) > 0.05 will skip any variants with an AF <= 0.05 or a missing AF by treating missing AFs as 0.

  • pl_filter (Expr | None, default: None) –

    Polars expression to filter variants. Should return True for variants to keep. Must match the filter function.

    Note

    This expression will be applied to the polars DataFrame returned by get_record_info(). It is not applied to the VCF file itself, so it will not be able to use the cyvcf2.Variant API. For example, if you want to filter variants by INFO field, you can use: pl.col("AF") > 0.05 but you can not use: lambda v: v.INFO.get("AF", 0) > 0.05 because the expression will be applied to the polars DataFrame, not the VCF file.

  • read_as – Type of data to read from the VCF file. Can be VCF.Genos, VCF.Dosages, or VCF.GenosDosages.

  • phasing (bool, default: False) – Whether to include phasing information on genotypes. If True, the ploidy axis will be length 3 such that phasing is indicated by the 3rd value: 0 = unphased, 1 = phased. If False, the ploidy axis will be length 2.

  • dosage_field (str | None, default: None) – Name of the dosage field to read from the VCF file. Required if read_as is VCF.Dosages, VCF.Genos8Dosages, or VCF.Genos16Dosages.

  • progress (bool, default: False) – Whether to show a progress bar while reading the VCF file.

  • with_gvi_index (bool)

class Dosages(instance)#
class Genos16(instance)#
class Genos16Dosages(instance)#
class Genos8(instance)#
class Genos8Dosages(instance)#
chunk(contig, start=0, end=POS_MAX, max_mem='4g', mode=Genos16)[source]#

Iterate over genotypes and/or dosages for a range in chunks limited by max_mem.

Parameters:
  • contig (str) – Contig name.

  • start (int | integer, default: 0) – 0-based start position.

  • end (int | integer, default: POS_MAX) – 0-based, exclusive end position.

  • max_mem (int | str, default: "4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).

  • mode (type[TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)], default: Genos16) – Type of data to read.

Return type:

Generator[TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)]

Returns:

Generator of genotypes and/or dosages. Genotypes have shape (samples ploidy variants) and dosages have shape (samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.

get_record_info(contig=None, start=None, end=None, fields=None, info=None, lazy=False)[source]#
Overloads:
  • self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (Literal[False]) → pl.DataFrame

  • self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (Literal[True]) → pl.LazyFrame

  • self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (bool) → pl.DataFrame | pl.LazyFrame

Parameters:
Return type:

DataFrame | LazyFrame

Get a DataFrame of any non-FORMAT fields in the VCF for a given range or the entire VCF. Will filter variants if the VCF instance has a filter function.

Parameters:
  • contig (str | None, default: None) – Contig name. If None, will read the entire VCF.

  • start (int | integer | None, default: None) – 0-based start position.

  • end (int | integer | None, default: None) – 0-based, exclusive end position.

  • fields (list[str] | None, default: None) – List of non-FORMAT, non-INFO fields to include. Returns all by default.

  • info (list[str] | None, default: None) – List of INFO fields to include. Returns all by default.

  • lazy (bool)

Return type:

DataFrame | LazyFrame

n_vars_in_ranges(contig, starts=0, ends=POS_MAX)[source]#

Return the start and end indices of the variants in the given ranges.

Parameters:
  • contig (str) – Contig name.

  • starts (ArrayLike, default: 0) – 0-based start positions of the ranges.

  • ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Return type:

ndarray[tuple[Any, ...], dtype[uintc]]

Returns:

Shape: (ranges). Number of variants in the given ranges.

read(contig, start=0, end=POS_MAX, mode=Genos16, out=None)[source]#

Read genotypes and/or dosages for a range.

Parameters:
Return type:

TypeVar(T, Genos8, Genos16, Dosages, Genos8Dosages, Genos16Dosages)

Returns:

Genotypes and/or dosages. Genotypes have shape (samples ploidy variants) and dosages have shape (samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.

set_samples(samples)[source]#

Set the samples to read from the VCF file. Modifies the VCF reader in place and returns it.

Parameters:

samples (ArrayLike | None) – List of sample names to read from the VCF file.

Return type:

Self

Returns:

The VCF reader with the specified samples.

using_pbar(pbar)[source]#

Create a context where the given progress bar will be incremented by any calls to a read method.

Parameters:

pbar (tqdm_asyncio) – Progress bar to use while reading variants. This will be incremented per variant during any calls to a read function.

available_samples: list[str]#

List of available samples in the VCF file.

contigs: list[str]#

Naturally sorted list of available contigs in the VCF file.

property current_samples: list[str]#

List of samples currently being read from the VCF file.

dosage_field: str | None#

Name of the dosage field to read from the VCF file. Required if you want to use modes that include dosages.

property filter: Callable[[Variant], bool] | None#

Function to filter variants. Should return True for variants to keep.

property n_samples: int#

Number of samples currently selected.

property nbytes: int#

Total in-memory footprint, in bytes, of resident (non-mmap’d) data structures held by this reader. Currently this is the gvi variant index (CHROM/POS/REF/ALT/ILEN). Returns 0 before the index is loaded.

path: Path#

Path to the VCF file.

phasing: bool#

Whether to include phasing information on genotypes. If True, the ploidy axis will be length 3 such that phasing is indicated by the 3rd value: 0 = unphased, 1 = phased. If False, the ploidy axis will be length 2.

ploidy: int = 2#

Ploidy of the VCF file. This is currently always 2 since we use cyvcf2.

class genoray.PGEN(geno_path, filter=None, dosage_path=None, load_index=True)[source]#

Create a PGEN reader.

Parameters:
  • path – Path to the PGEN file. Only used for genotypes if a dosage path is provided as well.

  • filter (Expr | None, default: None) – Polars expression to filter variants. Should return True for variants to keep. Will have at least the columns CHROM, POS (1-based), REF, ALT, and ILEN available to use.

  • dosage_path (str | Path | None, default: None) – Path to a dosage PGEN file. If None, the genotype PGEN file will be used for both genotypes and dosages.

  • geno_path (str | Path)

  • load_index (bool)

class Dosages(instance)#
class Genos(instance)#
class GenosDosages(instance)#
class GenosPhasing(instance)#
class GenosPhasingDosages(instance)#
chunk(contig, start=0, end=POS_MAX, max_mem='4g', mode=Genos)[source]#

Iterate over genotypes and/or dosages for a range in chunks limited by max_mem.

Parameters:
  • contig (str) – Contig name.

  • start (int | integer, default: 0) – 0-based start position.

  • end (int | integer, default: POS_MAX) – 0-based, exclusive end position.

  • max_mem (int | str, default: "4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).

  • mode (type[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)], default: Genos) – Type of data to read. Can be Genos, Dosages, GenosPhasing, GenosDosages, or GenosPhasingDosages.

Return type:

Generator[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)]

Returns:

Generator of genotypes and/or dosages. Genotypes have shape (samples ploidy variants) and dosages have shape (samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.

chunk_ranges(contig, starts=0, ends=POS_MAX, max_mem='4g', mode=Genos)[source]#

Read genotypes and/or dosages for multiple ranges in chunks limited by max_mem.

Parameters:
  • contig (str) – Contig name.

  • starts (ArrayLike, default: 0) – 0-based start positions.

  • ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions.

  • max_mem (int | str, default: "4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).

  • mode (type[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)], default: Genos) – Type of data to read. Can be Genos, Dosages, GenosPhasing, GenosDosages, or GenosPhasingDosages.

Return type:

Generator[Generator[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)]]

Returns:

Generator of generators of genotypes and/or dosages of each ranges’ data. Genotypes have shape (samples ploidy variants) and dosages have shape (samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.

Examples

gen = reader.read_ranges_chunks(...)
for range_ in gen:
    if range_ is None:
        continue
    for chunk in range_:
        # do something with chunk
        pass
n_vars_in_ranges(contig, starts=0, ends=POS_MAX)[source]#

Return the start and end indices of the variants in the given ranges.

Parameters:
  • contig (str) – Contig name.

  • starts (ArrayLike, default: 0) – 0-based start positions of the ranges.

  • ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Returns:

Shape: (ranges). Number of variants in the given ranges.

Return type:

ndarray[tuple[Any, ...], dtype[uintc]]

read(contig, start=0, end=POS_MAX, mode=Genos)[source]#

Read genotypes and/or dosages for a range.

Parameters:
Return type:

TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages)

Returns:

Genotypes and/or dosages. Genotypes have shape (samples ploidy variants) and dosages have shape (samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.

read_ranges(contig, starts=0, ends=POS_MAX, mode=Genos)[source]#

Read genotypes and/or dosages for multiple ranges.

Parameters:
Return type:

tuple[TypeVar(T, Genos, Dosages, GenosPhasing, GenosDosages, GenosPhasingDosages), ndarray[tuple[Any, ...], dtype[int_]]]

Returns:

Genotypes and/or dosages. Genotypes have shape (samples ploidy variants) and dosages have shape (samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.

Shape: (ranges+1). Offsets to slice out data for each range from the variants axis like so:

Examples

data, offsets = reader.read_ranges(...)
data[..., offsets[i] : offsets[i + 1]]  # data for range i

Note that the number of variants for range i is np.diff(offsets)[i].

set_samples(samples)[source]#

Set the samples to use.

Parameters:

samples (ArrayLike | None) – List of sample names to use. If None, all samples will be used.

Return type:

Self

var_idxs(contig, starts=0, ends=POS_MAX)[source]#

Get variant indices and the number of indices per range.

Parameters:
  • contig (str) – Contig name.

  • starts (ArrayLike, default: 0) – 0-based start positions of the ranges.

  • ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[uintc]], ndarray[tuple[Any, ...], dtype[int_]]]

Returns:

Shape: (tot_variants). Variant indices for the given ranges.

Shape: (ranges+1). Offsets to get variant indices for each range.

available_samples: list[str]#

List of available samples in the PGEN file.

contigs: list[str] | None = None#

Naturally sorted list of contig names in the PGEN file.

property current_samples: list[str]#

List of samples that are currently being used, in order.

property dosage_path: Path | None#

Path to the dosage file.

property filter: Expr | None#

Polars expression to filter variants. Should return True for variants to keep.

property geno_path: Path#

Path to the genotype file.

property n_samples: int#

Number of samples in the file.

property nbytes: int#

Total in-memory footprint, in bytes, of resident (non-mmap’d) data structures held by this reader. Sums the gvi variant index dataframe and the StartsEndsIlens cache. Returns 0 after _free_index().

ploidy = 2#

Ploidy of the samples. The PGEN format currently only supports diploid (2).

class genoray.SparseVar(path, attrs=None, fields=None)[source]#

Open a Sparse Variant (SVAR) directory.

Parameters:
  • path (str | Path) – Path to the SVAR directory.

  • attrs (IntoExpr | None, default: None) – Expression of attributes to load in addition to the ALT and ILEN columns.

  • fields (Sequence[str] | None, default: None) – Names of fields to load from the SVAR directory. Must be keys of available_fields. Only VCF FORMAT fields with Number=G are currently supported as custom fields.

classmethod from_pgen(out, pgen, max_mem, overwrite=False, with_dosages=False, n_jobs=-1)[source]#

Create a Sparse Variant (.svar) from a PGEN.

Parameters:
  • out (str | Path) – Path to the output directory.

  • pgen (PGEN) – PGEN file to write from.

  • max_mem (int | str) – Maximum memory to use while writing.

  • overwrite (bool, default: False) – Whether to overwrite the output directory if it exists.

  • with_dosages (bool, default: False) – Whether to write dosages.

  • n_jobs (int, default: -1) – Number of jobs to use for parallel processing.

classmethod from_vcf(out, vcf, max_mem, overwrite=False, with_dosages=False, n_jobs=-1)[source]#

Create a Sparse Variant (.svar) from a VCF/BCF.

Parameters:
  • out (str | Path) – Path to the output directory.

  • vcf (VCF) – VCF file to write from.

  • max_mem (int | str) – Maximum memory to use while writing.

  • overwrite (bool, default: False) – Whether to overwrite the output directory if it exists.

  • with_dosages (bool, default: False) – Whether to write dosages.

  • n_jobs (int, default: -1) – Number of jobs to use for parallel processing.

annotate_with_gtf(gtf, level_filter=1, write_back=True, *, strand_encoding=None, codon_null_token=None)[source]#

Annotate variants with gene_id, strand, and codon_pos from GTF CDS features.

Computes codon position for SNVs only; indels receive strand but null codon_pos.

Parameters:
  • gtf (str | DataFrame) – Path to GTF file (.gtf or .gtf.gz) or pre-loaded Polars DataFrame.

  • level_filter (int | None, default: 1) – If set, keep rows with GTF ‘level’ <= level_filter (1 = highest quality).

  • write_back (bool, default: True) – If True, update self.var_table in-place and write to index.arrow file.

  • strand_encoding (dict[str | None, int] | None, default: None) – Encode strand as integers. Example: {‘+’: 0, ‘-’: 1, None: 2}

  • codon_null_token (int | None, default: None) – Replace null codon_pos with this integer for ML models.

Returns:

Columns: varID (UInt32), gene_id (Utf8), strand (Utf8/Int16), codon_pos (Int8/Int16)

Return type:

DataFrame

Examples

>>> svar = SparseVar("data.svar")
>>> annot = svar.annotate_with_gtf("gencode.v45.gtf.gz")
>>> annot.head()
cache_afs()[source]#

Cache the allele frequencies on disk. Will also load all possible attributes and add the AF column in-memory.

read_ranges(contig, starts=0, ends=POS_MAX, samples=None)[source]#

Read the genotypes for the given ranges.

Parameters:
  • contig (str) – Contig name.

  • starts (ArrayLike, default: 0) – 0-based start positions of the ranges.

  • ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

  • samples (ArrayLike | None, default: None) – List of sample names to read. If None, read all samples.

Return type:

TypeVar(_SRT)

Returns:

When no fields are loaded: Ragged[V_IDX_TYPE] with shape (ranges, samples, ploidy, ~variants). When fields are loaded: an awkward record array of the same outer shape where result.genos is Ragged[V_IDX_TYPE] and each additional field (e.g. result.dosages) is a Ragged of its respective dtype. All arrays are backed by memory-mapped data so only the offsets reside in RAM.

read_ranges_with_length(contig, starts=0, ends=POS_MAX, samples=None)[source]#

Read the genotypes for the given ranges such that each entry of variants is guaranteed to have the minimum amount of variants to reach the query length. This can mean either fewer or more variants than would be returned than by read_ranges, depending on the presence of indels.

Parameters:
  • contig (str) – Contig name.

  • starts (ArrayLike, default: 0) – 0-based start positions of the ranges.

  • ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

  • samples (ArrayLike | None, default: None) – List of sample names to read. If None, read all samples.

Return type:

TypeVar(_SRT)

Returns:

Same return structure as read_ranges().

var_ranges(contig, starts=0, ends=POS_MAX)[source]#

Get variant index ranges for each query range. i.e. For each query range, return the minimum and maximum variant that overlaps. Note that this means some variants within those ranges may not actually overlap with the query range if there is a deletion that spans the start of the query.

Parameters:
  • contig (str) – Contig name.

  • starts (ArrayLike, default: 0) – 0-based start positions of the ranges.

  • ends (ArrayLike, default: POS_MAX) – 0-based, exclusive end positions of the ranges.

Return type:

ndarray[tuple[Any, ...], dtype[intc]]

Returns:

Shape: (ranges, 2). The first column is the start index of the variant and the second column is the end index of the variant.

with_fields(fields=None)[source]#
Overloads:
  • self, fields (Sequence[str]) → SparseVar[Ragged[np.void]]

  • self, fields (Literal[False]) → SparseVar[Ragged[V_IDX_TYPE]]

  • self, fields (None) → SparseVar[_SRT]

Parameters:

fields (Sequence[str] | Literal[False] | None)

Return type:

SparseVar

Return a shallow copy of this SparseVar with updated fields.

Parameters:

fields (Union[Sequence[str], Literal[False], None], default: None) –

  • None: leave fields unchanged (returns shallow copy).

  • Sequence[str]: names of fields to load from the SVAR directory. Must be keys of available_fields.

  • False: drop all fields, returning a SparseVar[Ragged[V_IDX_TYPE]].

Return type:

SparseVar

write_view(regions, samples, output, fields=None, merge_overlapping=False, regions_overlap='pos', overwrite=False, threads=None)[source]#

Write a subset of this SparseVar to a new directory.

Parameters:
  • regions (str | tuple[str, int, int] | Path) – Region(s) to include. Accepts the same input types as _normalize_regions(): a "chrom:start-end" string, a (chrom, start, end) tuple, a BED file path, or a polars/pandas/pyranges frame.

  • samples (str | Sequence[str] | Path) – Samples to include. Accepts a single sample name, a list, or a path to a file of newline-separated names.

  • output (str | Path) – Destination directory for the new SparseVar.

  • fields (Sequence[str] | None, default: None) – Fields to carry over (None = all available; [] = none).

  • merge_overlapping (bool, default: False) – If True silently merge overlapping regions; if False raise ValueError when overlaps are detected.

  • regions_overlap (Literal['pos', 'record', 'variant'], default: "pos") – How variants are matched to regions — "pos", "record", or "variant". See _resolve_kept_var_idxs().

  • overwrite (bool, default: False) – Whether to overwrite output if it already exists.

  • threads (int | None, default: None) – Number of Numba threads to use. None uses all available CPUs.

Return type:

None

Notes

Variants whose minor allele count is 0 in the chosen sample subset are dropped from the output. If every candidate variant drops, a ValueError is raised — the same code path that fires when regions itself selects no variants.

contigs: list[str]#

Contigs in the order they appear in the dataset. Variants are only sorted within each contig.

index: DataFrame#

CHROM, POS, REF, ALT, ILEN, and any additional attributes specified in attrs on construction.

Type:

Table of variants with columns

property n_samples: int#

Number of samples in the dataset.

property n_variants: int#

Number of variants in the dataset.

property nbytes: int#

Total in-memory footprint, in bytes, of resident (non-mmap’d) data held by this reader. Only the polars variant index counts; genos and fields are memory-mapped and excluded.

genoray.exprs#

Polars expressions for filtering a genoray index (extension .gvi) given the minimum set of index columns:

  • "CHROM" : pl.Utf8

  • "POS" : pl.Int64

  • "REF" : pl.Utf8

  • "ALT" : pl.List[Utf8]

  • "ILEN" : pl.List[Int32]

Applicable for PGEN files and the experimental VCF._load_index() method.

Note

For PGEN, all columns that existed in the underlying PVAR will be available in the index.

genoray.exprs.symbolic_ilen(alt='ALT', ref='REF', svlen='SVLEN', end='END', imprecise='IMPRECISE')[source]#

Per-ALT corrected ILEN as List[Int32].

Non-symbolic, non-breakend ALTs use the literal len(ALT) - len(REF). Precise symbolic <DEL>/<INS>/<DUP> use SVLEN magnitude (or |END - POS| for <DEL>/<DUP> when SVLEN is absent): -|len| for <DEL>, +|len| for <INS>/<DUP>. Everything we cannot size precisely (IMPRECISE flag, missing length, unsupported type such as <BND>, <CNV>, <INV>, <*>, and breakend mate notation such as G[chr2:321[) becomes null.

svlen/end/imprecise name scalar columns already extracted by the caller (per record). SVLEN magnitude is read as |SVLEN| so the VCF 4.3/4.4 sign-convention flip does not matter.

Note

Evaluated per-record via a Python callback (map_elements) at index-build time; expect one Python call per variant row.

Return type:

Expr

Parameters:
genoray.exprs.ILEN = <Expr ['[(col("ALT").list.eval(element…']>#

Indel length of the variant. Positive for insertions, negative for deletions, and zero for SNPs and MNPs.

genoray.exprs.is_biallelic = <Expr ['[(col("ALT").list.length()) ==…']>#

True if the variant is biallelic (one ALT allele).

genoray.exprs.is_breakend = <Expr ['col("ALT").list.eval(element()…']>#

True if any ALT allele is a breakend (BND) in mate-pair / single-breakend notation (e.g. G[chr2:321[, ]chr2:321]G, .TGCA, TGCA.), per the VCF 4.x spec (§5.4).

Breakends are a distinct ALT class from symbolic <...> alleles, so is_symbolic does not flag them. But like symbolic alleles they are not expandable into nucleotides — the bracket/colon/position bytes corrupt personalized DNA buffers in haplotype consumers (e.g. genvarloader). Their symbolic_ilen() value is null (so they are also is_imprecise).

To drop breakends, pass pl_filter=~genoray.exprs.is_breakend. To drop all un-expandable ALTs (symbolic + breakends) for haplotype consumers, combine:

pl_filter=~genoray.exprs.is_symbolic & ~genoray.exprs.is_breakend
genoray.exprs.is_imprecise = <Expr ['col("ILEN").list.eval(element(…']>#

True if any ALT allele’s ILEN could not be precisely determined (an un-sizable symbolic allele — IMPRECISE, missing SVLEN/END, or an unsupported symbolic type). Such alleles carry null ILEN. Filter them out with pl_filter=~genoray.exprs.is_imprecise to keep precise structural variants while dropping the rest; use ~genoray.exprs.is_symbolic to drop all symbolic alleles (required for haplotype consumers such as genvarloader, which cannot expand any symbolic ALT).

genoray.exprs.is_indel = <Expr ['col("ILEN").list.eval([([(elem…']>#

True if all ALT alleles are indels (insertions or deletions).

Un-sizable symbolic alleles (null ILEN) are treated as neither SNP nor indel: a row containing any null ILEN element evaluates to False.

genoray.exprs.is_snp = <Expr ['col("ILEN").list.eval([([(elem…']>#

True if all ALT alleles are SNPs (single nucleotide polymorphisms).

Un-sizable symbolic alleles (null ILEN) are treated as neither SNP nor indel: a row containing any null ILEN element evaluates to False.

genoray.exprs.is_symbolic = <Expr ['col("ALT").list.eval(element()…']>#

True if any ALT allele is a symbolic allele (e.g. <DEL>, <INS>, <DUP>, <INV>, <CNV>, <BND> — anything matching <…> per the VCF 4.x spec).

Symbolic ALTs are placeholders for structural variants whose exact replacement nucleotides are unknown. Downstream haplotype injection (e.g. via genvarloader) cannot expand them — the literal <DEL> ASCII bytes end up in personalized DNA buffers and become non-canonical bytes for translators.

To drop symbolic records, pass this as a filter. For PGEN, the single filter expression suffices:

pgen = genoray.PGEN("file.pgen", filter=~genoray.exprs.is_symbolic)

For VCF, pair it with the equivalent cyvcf2 filter (both are required):

vcf = genoray.VCF(
    "file.vcf.gz",
    filter=lambda rec: not any(a.startswith("<") for a in rec.ALT),
    pl_filter=~genoray.exprs.is_symbolic,
)

SparseVar.from_vcf / from_pgen inherit the source’s filter, so the SVAR is filtered to match.