API#
- class genoray.VCF(path, filter=None, pl_filter=None, phasing=False, dosage_field=None, progress=False, with_gvi_index=True)[source]#
Create a VCF reader.
- Parameters:
filter (
Callable[[Variant],bool] |None, default:None) –Function to filter variants. Should return True for variants to keep.
Note
To avoid KeyErrors, this function needs to be tolerant to missing fields. For example, if you access an INFO or FORMAT field, not all variants are guaranteed to have the same fields. The cyvcf2.Variant API provides the
.getmethod on the INFO and FORMAT attributes. For example,lambda v: v.INFO.get("AF", 0) > 0.05will skip any variants with an AF <= 0.05 or a missing AF by treating missing AFs as 0.pl_filter (
Expr|None, default:None) –Polars expression to filter variants. Should return True for variants to keep. Must match the filter function.
Note
This expression will be applied to the polars DataFrame returned by
get_record_info(). It is not applied to the VCF file itself, so it will not be able to use the cyvcf2.Variant API. For example, if you want to filter variants by INFO field, you can use:pl.col("AF") > 0.05but you can not use:lambda v: v.INFO.get("AF", 0) > 0.05because the expression will be applied to the polars DataFrame, not the VCF file.read_as – Type of data to read from the VCF file. Can be VCF.Genos, VCF.Dosages, or VCF.GenosDosages.
phasing (
bool, default:False) – Whether to include phasing information on genotypes. If True, the ploidy axis will be length 3 such that phasing is indicated by the 3rd value: 0 = unphased, 1 = phased. If False, the ploidy axis will be length 2.dosage_field (
str|None, default:None) – Name of the dosage field to read from the VCF file. Required if read_as is VCF.Dosages, VCF.Genos8Dosages, or VCF.Genos16Dosages.progress (
bool, default:False) – Whether to show a progress bar while reading the VCF file.with_gvi_index (bool)
- class Dosages(instance)#
- class Genos16(instance)#
- class Genos16Dosages(instance)#
- class Genos8(instance)#
- class Genos8Dosages(instance)#
- chunk(contig, start=0, end=POS_MAX, max_mem='4g', mode=Genos16)[source]#
Iterate over genotypes and/or dosages for a range in chunks limited by
max_mem.- Parameters:
contig (
str) – Contig name.end (
int|integer, default:POS_MAX) – 0-based, exclusive end position.max_mem (
int|str, default:"4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).mode (
type[TypeVar(T,Genos8,Genos16,Dosages,Genos8Dosages,Genos16Dosages)], default:Genos16) – Type of data to read.
- Return type:
Generator[TypeVar(T,Genos8,Genos16,Dosages,Genos8Dosages,Genos16Dosages)]- Returns:
Generator of genotypes and/or dosages. Genotypes have shape
(samples ploidy variants)and dosages have shape(samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.
- get_record_info(contig=None, start=None, end=None, fields=None, info=None, lazy=False)[source]#
- Overloads:
self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (Literal[False]) → pl.DataFrame
self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (Literal[True]) → pl.LazyFrame
self, contig (str | None), start (int | np.integer | None), end (int | np.integer | None), fields (list[str] | None), info (list[str] | None), lazy (bool) → pl.DataFrame | pl.LazyFrame
- Parameters:
- Return type:
DataFrame | LazyFrame
Get a DataFrame of any non-FORMAT fields in the VCF for a given range or the entire VCF. Will filter variants if the VCF instance has a filter function.
- Parameters:
contig (
str|None, default:None) – Contig name. If None, will read the entire VCF.start (
int|integer|None, default:None) – 0-based start position.end (
int|integer|None, default:None) – 0-based, exclusive end position.fields (
list[str] |None, default:None) – List of non-FORMAT, non-INFO fields to include. Returns all by default.info (
list[str] |None, default:None) – List of INFO fields to include. Returns all by default.lazy (bool)
- Return type:
DataFrame | LazyFrame
- n_vars_in_ranges(contig, starts=0, ends=POS_MAX)[source]#
Return the start and end indices of the variants in the given ranges.
- Parameters:
- Return type:
- Returns:
Shape:
(ranges). Number of variants in the given ranges.
- read(contig, start=0, end=POS_MAX, mode=Genos16, out=None)[source]#
Read genotypes and/or dosages for a range.
- Parameters:
contig (
str) – Contig name.end (
int|integer, default:POS_MAX) – 0-based, exclusive end position.mode (
type[TypeVar(T,Genos8,Genos16,Dosages,Genos8Dosages,Genos16Dosages)], default:Genos16) – Type of data to read.out (
Optional[TypeVar(T,Genos8,Genos16,Dosages,Genos8Dosages,Genos16Dosages)], default:None) – Output array to fill with genotypes and/or dosages. If None, a new array will be created.
- Return type:
TypeVar(T,Genos8,Genos16,Dosages,Genos8Dosages,Genos16Dosages)- Returns:
Genotypes and/or dosages. Genotypes have shape
(samples ploidy variants)and dosages have shape(samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.
- set_samples(samples)[source]#
Set the samples to read from the VCF file. Modifies the VCF reader in place and returns it.
- using_pbar(pbar)[source]#
Create a context where the given progress bar will be incremented by any calls to a read method.
- Parameters:
pbar (
tqdm_asyncio) – Progress bar to use while reading variants. This will be incremented per variant during any calls to a read function.
- dosage_field: str | None#
Name of the dosage field to read from the VCF file. Required if you want to use modes that include dosages.
- property filter: Callable[[Variant], bool] | None#
Function to filter variants. Should return True for variants to keep.
- property nbytes: int#
Total in-memory footprint, in bytes, of resident (non-mmap’d) data structures held by this reader. Currently this is the gvi variant index (CHROM/POS/REF/ALT/ILEN). Returns 0 before the index is loaded.
- class genoray.PGEN(geno_path, filter=None, dosage_path=None, load_index=True)[source]#
Create a PGEN reader.
- Parameters:
path – Path to the PGEN file. Only used for genotypes if a dosage path is provided as well.
filter (
Expr|None, default:None) – Polars expression to filter variants. Should return True for variants to keep. Will have at least the columns CHROM, POS (1-based), REF, ALT, and ILEN available to use.dosage_path (
str|Path|None, default:None) – Path to a dosage PGEN file. If None, the genotype PGEN file will be used for both genotypes and dosages.geno_path (str | Path)
load_index (bool)
- class Dosages(instance)#
- class Genos(instance)#
- class GenosDosages(instance)#
- class GenosPhasing(instance)#
- class GenosPhasingDosages(instance)#
- chunk(contig, start=0, end=POS_MAX, max_mem='4g', mode=Genos)[source]#
Iterate over genotypes and/or dosages for a range in chunks limited by
max_mem.- Parameters:
contig (
str) – Contig name.end (
int|integer, default:POS_MAX) – 0-based, exclusive end position.max_mem (
int|str, default:"4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).mode (
type[TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages)], default:Genos) – Type of data to read. Can beGenos,Dosages,GenosPhasing,GenosDosages, orGenosPhasingDosages.
- Return type:
Generator[TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages)]- Returns:
Generator of genotypes and/or dosages. Genotypes have shape
(samples ploidy variants)and dosages have shape(samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.
- chunk_ranges(contig, starts=0, ends=POS_MAX, max_mem='4g', mode=Genos)[source]#
Read genotypes and/or dosages for multiple ranges in chunks limited by
max_mem.- Parameters:
contig (
str) – Contig name.starts (
ArrayLike, default:0) – 0-based start positions.ends (
ArrayLike, default:POS_MAX) – 0-based, exclusive end positions.max_mem (
int|str, default:"4g") – Maximum memory to use for each chunk. Can be an integer or a string with a suffix (e.g. “4g”, “2 MB”).mode (
type[TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages)], default:Genos) – Type of data to read. Can beGenos,Dosages,GenosPhasing,GenosDosages, orGenosPhasingDosages.
- Return type:
Generator[Generator[TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages)]]- Returns:
Generator of generators of genotypes and/or dosages of each ranges’ data. Genotypes have shape
(samples ploidy variants)and dosages have shape(samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.
Examples
gen = reader.read_ranges_chunks(...) for range_ in gen: if range_ is None: continue for chunk in range_: # do something with chunk pass
- n_vars_in_ranges(contig, starts=0, ends=POS_MAX)[source]#
Return the start and end indices of the variants in the given ranges.
- Parameters:
- Returns:
Shape:
(ranges). Number of variants in the given ranges.- Return type:
- read(contig, start=0, end=POS_MAX, mode=Genos)[source]#
Read genotypes and/or dosages for a range.
- Parameters:
contig (
str) – Contig name.end (
int|integer, default:POS_MAX) – 0-based, exclusive end position.mode (
type[TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages)], default:Genos) – Type of data to read. Can beGenos,Dosages,GenosPhasing,GenosDosages, orGenosPhasingDosages.
- Return type:
TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages)- Returns:
Genotypes and/or dosages. Genotypes have shape
(samples ploidy variants)and dosages have shape(samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.
- read_ranges(contig, starts=0, ends=POS_MAX, mode=Genos)[source]#
Read genotypes and/or dosages for multiple ranges.
- Parameters:
contig (
str) – Contig name.starts (
ArrayLike, default:0) – 0-based start positions.ends (
ArrayLike, default:POS_MAX) – 0-based, exclusive end positions.mode (
type[TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages)], default:Genos) – Type of data to read. Can beGenos,Dosages,GenosPhasing,GenosDosages, orGenosPhasingDosages.
- Return type:
tuple[TypeVar(T,Genos,Dosages,GenosPhasing,GenosDosages,GenosPhasingDosages),ndarray[tuple[Any,...],dtype[int_]]]- Returns:
Genotypes and/or dosages. Genotypes have shape
(samples ploidy variants)and dosages have shape(samples variants). Missing genotypes have value -1 and missing dosages have value np.nan. If just using genotypes or dosages, will be a single array, otherwise will be a tuple of arrays.Shape: (ranges+1). Offsets to slice out data for each range from the variants axis like so:
Examples
data, offsets = reader.read_ranges(...) data[..., offsets[i] : offsets[i + 1]] # data for range i
Note that the number of variants for range
iisnp.diff(offsets)[i].
- var_idxs(contig, starts=0, ends=POS_MAX)[source]#
Get variant indices and the number of indices per range.
- property filter: Expr | None#
Polars expression to filter variants. Should return True for variants to keep.
- property nbytes: int#
Total in-memory footprint, in bytes, of resident (non-mmap’d) data structures held by this reader. Sums the gvi variant index dataframe and the StartsEndsIlens cache. Returns 0 after _free_index().
- ploidy = 2#
Ploidy of the samples. The PGEN format currently only supports diploid (2).
- class genoray.SparseVar(path, attrs=None, fields=None)[source]#
Open a Sparse Variant (SVAR) directory.
- Parameters:
path (str | Path) – Path to the SVAR directory.
attrs (IntoExpr | None, default:
None) – Expression of attributes to load in addition to the ALT and ILEN columns.fields (Sequence[str] | None, default:
None) – Names of fields to load from the SVAR directory. Must be keys ofavailable_fields. Only VCF FORMAT fields withNumber=Gare currently supported as custom fields.
- classmethod from_pgen(out, pgen, max_mem, overwrite=False, with_dosages=False, n_jobs=-1)[source]#
Create a Sparse Variant (.svar) from a PGEN.
- classmethod from_vcf(out, vcf, max_mem, overwrite=False, with_dosages=False, n_jobs=-1)[source]#
Create a Sparse Variant (.svar) from a VCF/BCF.
- annotate_with_gtf(gtf, level_filter=1, write_back=True, *, strand_encoding=None, codon_null_token=None)[source]#
Annotate variants with gene_id, strand, and codon_pos from GTF CDS features.
Computes codon position for SNVs only; indels receive strand but null codon_pos.
- Parameters:
gtf (
str|DataFrame) – Path to GTF file (.gtf or .gtf.gz) or pre-loaded Polars DataFrame.level_filter (
int|None, default:1) – If set, keep rows with GTF ‘level’ <= level_filter (1 = highest quality).write_back (
bool, default:True) – If True, update self.var_table in-place and write to index.arrow file.strand_encoding (
dict[str|None,int] |None, default:None) – Encode strand as integers. Example: {‘+’: 0, ‘-’: 1, None: 2}codon_null_token (
int|None, default:None) – Replace null codon_pos with this integer for ML models.
- Returns:
Columns: varID (UInt32), gene_id (Utf8), strand (Utf8/Int16), codon_pos (Int8/Int16)
- Return type:
DataFrame
Examples
>>> svar = SparseVar("data.svar") >>> annot = svar.annotate_with_gtf("gencode.v45.gtf.gz") >>> annot.head()
- cache_afs()[source]#
Cache the allele frequencies on disk. Will also load all possible attributes and add the AF column in-memory.
- read_ranges(contig, starts=0, ends=POS_MAX, samples=None)[source]#
Read the genotypes for the given ranges.
- Parameters:
- Return type:
TypeVar(_SRT)- Returns:
When no fields are loaded:
Ragged[V_IDX_TYPE]with shape(ranges, samples, ploidy, ~variants). When fields are loaded: an awkward record array of the same outer shape whereresult.genosisRagged[V_IDX_TYPE]and each additional field (e.g.result.dosages) is aRaggedof its respective dtype. All arrays are backed by memory-mapped data so only the offsets reside in RAM.
- read_ranges_with_length(contig, starts=0, ends=POS_MAX, samples=None)[source]#
Read the genotypes for the given ranges such that each entry of variants is guaranteed to have the minimum amount of variants to reach the query length. This can mean either fewer or more variants than would be returned than by
read_ranges, depending on the presence of indels.- Parameters:
- Return type:
TypeVar(_SRT)- Returns:
Same return structure as
read_ranges().
- var_ranges(contig, starts=0, ends=POS_MAX)[source]#
Get variant index ranges for each query range. i.e. For each query range, return the minimum and maximum variant that overlaps. Note that this means some variants within those ranges may not actually overlap with the query range if there is a deletion that spans the start of the query.
- Parameters:
- Return type:
- Returns:
Shape:
(ranges, 2). The first column is the start index of the variant and the second column is the end index of the variant.
- with_fields(fields=None)[source]#
- Overloads:
self, fields (Sequence[str]) → SparseVar[Ragged[np.void]]
self, fields (Literal[False]) → SparseVar[Ragged[V_IDX_TYPE]]
self, fields (None) → SparseVar[_SRT]
- Parameters:
- Return type:
Return a shallow copy of this
SparseVarwith updated fields.- Parameters:
fields (
Union[Sequence[str],Literal[False],None], default:None) –None: leave fields unchanged (returns shallow copy).Sequence[str]: names of fields to load from the SVAR directory. Must be keys ofavailable_fields.False: drop all fields, returning aSparseVar[Ragged[V_IDX_TYPE]].
- Return type:
- write_view(regions, samples, output, fields=None, merge_overlapping=False, regions_overlap='pos', overwrite=False, threads=None)[source]#
Write a subset of this SparseVar to a new directory.
- Parameters:
regions (
str|tuple[str,int,int] |Path) – Region(s) to include. Accepts the same input types as_normalize_regions(): a"chrom:start-end"string, a(chrom, start, end)tuple, a BED file path, or a polars/pandas/pyranges frame.samples (
str|Sequence[str] |Path) – Samples to include. Accepts a single sample name, a list, or a path to a file of newline-separated names.output (
str|Path) – Destination directory for the new SparseVar.fields (
Sequence[str] |None, default:None) – Fields to carry over (None= all available;[]= none).merge_overlapping (
bool, default:False) – IfTruesilently merge overlapping regions; ifFalseraiseValueErrorwhen overlaps are detected.regions_overlap (
Literal['pos','record','variant'], default:"pos") – How variants are matched to regions —"pos","record", or"variant". See_resolve_kept_var_idxs().overwrite (
bool, default:False) – Whether to overwrite output if it already exists.threads (
int|None, default:None) – Number of Numba threads to use.Noneuses all available CPUs.
- Return type:
Notes
Variants whose minor allele count is 0 in the chosen sample subset are dropped from the output. If every candidate variant drops, a
ValueErroris raised — the same code path that fires whenregionsitself selects no variants.
- contigs: list[str]#
Contigs in the order they appear in the dataset. Variants are only sorted within each contig.
- index: DataFrame#
CHROM, POS, REF, ALT, ILEN, and any additional attributes specified in attrs on construction.
- Type:
Table of variants with columns
genoray.exprs#
Polars expressions for filtering a genoray index (extension .gvi)
given the minimum set of index columns:
"CHROM":pl.Utf8"POS":pl.Int64"REF":pl.Utf8"ALT":pl.List[Utf8]"ILEN":pl.List[Int32]
Applicable for PGEN files and the experimental VCF._load_index() method.
Note
For PGEN, all columns that existed in the underlying PVAR will be available in the index.
- genoray.exprs.symbolic_ilen(alt='ALT', ref='REF', svlen='SVLEN', end='END', imprecise='IMPRECISE')[source]#
Per-ALT corrected ILEN as
List[Int32].Non-symbolic, non-breakend ALTs use the literal
len(ALT) - len(REF). Precise symbolic<DEL>/<INS>/<DUP>useSVLENmagnitude (or|END - POS|for<DEL>/<DUP>whenSVLENis absent):-|len|for<DEL>,+|len|for<INS>/<DUP>. Everything we cannot size precisely (IMPRECISEflag, missing length, unsupported type such as<BND>,<CNV>,<INV>,<*>, and breakend mate notation such asG[chr2:321[) becomesnull.svlen/end/imprecisename scalar columns already extracted by the caller (per record).SVLENmagnitude is read as|SVLEN|so the VCF 4.3/4.4 sign-convention flip does not matter.Note
Evaluated per-record via a Python callback (
map_elements) at index-build time; expect one Python call per variant row.
- genoray.exprs.ILEN = <Expr ['[(col("ALT").list.eval(element…']>#
Indel length of the variant. Positive for insertions, negative for deletions, and zero for SNPs and MNPs.
- genoray.exprs.is_biallelic = <Expr ['[(col("ALT").list.length()) ==…']>#
True if the variant is biallelic (one ALT allele).
- genoray.exprs.is_breakend = <Expr ['col("ALT").list.eval(element()…']>#
True if any ALT allele is a breakend (BND) in mate-pair / single-breakend notation (e.g.
G[chr2:321[,]chr2:321]G,.TGCA,TGCA.), per the VCF 4.x spec (§5.4).Breakends are a distinct ALT class from symbolic
<...>alleles, sois_symbolicdoes not flag them. But like symbolic alleles they are not expandable into nucleotides — the bracket/colon/position bytes corrupt personalized DNA buffers in haplotype consumers (e.g.genvarloader). Theirsymbolic_ilen()value isnull(so they are alsois_imprecise).To drop breakends, pass
pl_filter=~genoray.exprs.is_breakend. To drop all un-expandable ALTs (symbolic + breakends) for haplotype consumers, combine:pl_filter=~genoray.exprs.is_symbolic & ~genoray.exprs.is_breakend
- genoray.exprs.is_imprecise = <Expr ['col("ILEN").list.eval(element(…']>#
True if any ALT allele’s ILEN could not be precisely determined (an un-sizable symbolic allele —
IMPRECISE, missingSVLEN/END, or an unsupported symbolic type). Such alleles carrynullILEN. Filter them out withpl_filter=~genoray.exprs.is_impreciseto keep precise structural variants while dropping the rest; use~genoray.exprs.is_symbolicto drop all symbolic alleles (required for haplotype consumers such as genvarloader, which cannot expand any symbolic ALT).
- genoray.exprs.is_indel = <Expr ['col("ILEN").list.eval([([(elem…']>#
True if all ALT alleles are indels (insertions or deletions).
Un-sizable symbolic alleles (
nullILEN) are treated as neither SNP nor indel: a row containing anynullILEN element evaluates toFalse.
- genoray.exprs.is_snp = <Expr ['col("ILEN").list.eval([([(elem…']>#
True if all ALT alleles are SNPs (single nucleotide polymorphisms).
Un-sizable symbolic alleles (
nullILEN) are treated as neither SNP nor indel: a row containing anynullILEN element evaluates toFalse.
- genoray.exprs.is_symbolic = <Expr ['col("ALT").list.eval(element()…']>#
True if any ALT allele is a symbolic allele (e.g.
<DEL>,<INS>,<DUP>,<INV>,<CNV>,<BND>— anything matching<…>per the VCF 4.x spec).Symbolic ALTs are placeholders for structural variants whose exact replacement nucleotides are unknown. Downstream haplotype injection (e.g. via
genvarloader) cannot expand them — the literal<DEL>ASCII bytes end up in personalized DNA buffers and become non-canonical bytes for translators.To drop symbolic records, pass this as a filter. For PGEN, the single
filterexpression suffices:pgen = genoray.PGEN("file.pgen", filter=~genoray.exprs.is_symbolic)
For VCF, pair it with the equivalent cyvcf2
filter(both are required):vcf = genoray.VCF( "file.vcf.gz", filter=lambda rec: not any(a.startswith("<") for a in rec.ALT), pl_filter=~genoray.exprs.is_symbolic, )
SparseVar.from_vcf/from_pgeninherit the source’s filter, so the SVAR is filtered to match.