Sparse Variant Format (SVAR)
Motivation
Typical genomic data formats such as VCF/BCF and PLINK encode genotypes as a dense matrix. However, these matrices are typically extremely sparse (< 1% density), especially for whole genome sequencing or cancer data. To avoid consuming excessive disk space, these formats use block-wise compression. Block-compressed data can easily become a processing bottleneck in machine learning applications, where random sampling is required during training. This was a major problem during the development of GenVarLoader, for example. By using a sparse format instead, we were able to avoid compression while keeping the file size on par with an equivalent compressed BCF. As a result, we can memory map the genotypes to work with larger-than-RAM data, with random access that is much faster than with compressed formats. For example, GenVarLoader computes directly on the SVAR format, and this is a major factor in its 1000x speedup over alternative methods.
Creating SVAR files
from genoray import SparseVar, VCF, PGEN
SparseVar.from_vcf("out.svar", "file.vcf.gz", max_mem="4g")
SparseVar.from_pgen("out.svar", "file.pgen", max_mem="4g")
svar = SparseVar("out.svar")
Reading SVAR
# shape: (ranges, samples, ploidy, ~variants)
sp_genos = svar.read_ranges("1", starts=0, ends=365, samples="Aang")
When reading from an SVAR file, read_ranges returns a Ragged[V_IDX_TYPE]: a ragged array in which
the number of ALT calls varies per sample and ploid. For a brief visual description of Ragged
arrays, see this section of the GenVarLoader FAQ.
The returned array can be arbitrarily large because its data is backed by a
numpy.memmap object (only the offsets reside in RAM).
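To illustrate why a memmap-backed ragged array stays cheap, here is a minimal sketch using only NumPy. The file name, the toy data, and the offsets are hypothetical and are not taken from the SVAR on-disk layout; the point is only that the flat data stays on disk while a small offsets array in RAM decides which slice to read.

```python
import os
import tempfile

import numpy as np

# Write a flat array of hypothetical variant indices to disk.
# The file name is an assumption made for this sketch.
path = os.path.join(tempfile.mkdtemp(), "variant_idxs.bin")
np.arange(1_000, dtype=np.uint32).tofile(path)

# Memory-map the file: nothing is loaded until the array is sliced.
data = np.memmap(path, dtype=np.uint32, mode="r")

# Offsets are small and live in RAM; offsets[i]:offsets[i + 1] delimits row i.
offsets = np.array([0, 10, 10, 250, 1_000], dtype=np.int64)

# Slicing touches only the part of the file that backs this row.
row = np.asarray(data[offsets[2]:offsets[3]])
```

Because the offsets array grows with the number of rows rather than the number of stored values, it can remain in memory even when the mapped data is far larger than RAM.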
Each value in the ragged array is a variant index: the row number in svar.index for the
variant that is present in each range, sample, and ploid.
v_idxs = sp_genos.to_awkward()[0, 0, 0].to_numpy()
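The ragged layout described above can be sketched with plain NumPy. This is an illustrative model, not genoray's implementation: the toy data, the `get_row` helper, and the row-major ordering of (range, sample, ploid) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: 1 range, 2 samples, ploidy 2 -> 4 ragged rows.
n_ranges, n_samples, ploidy = 1, 2, 2

# Flat array of variant indices for all rows, concatenated.
data = np.array([0, 3, 3, 1, 2, 4], dtype=np.uint32)
# offsets[i]:offsets[i + 1] delimits row i; the third row is empty.
offsets = np.array([0, 2, 3, 3, 6], dtype=np.int64)


def get_row(r: int, s: int, p: int) -> np.ndarray:
    """Return the variant indices for one (range, sample, ploid) entry."""
    # Assumes rows are stored in row-major (C) order.
    i = np.ravel_multi_index((r, s, p), (n_ranges, n_samples, ploidy))
    return data[offsets[i]:offsets[i + 1]]


# Number of ALT calls per (range, sample, ploid).
lengths = np.diff(offsets).reshape(n_ranges, n_samples, ploidy)
```

Here `get_row(0, 0, 0)` yields the variant indices for the first ploid of the first sample, and `lengths` shows how the row sizes differ, which is what makes the array ragged.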
Loading additional fields
Custom numeric fields stored as .npy files in the SVAR directory can be loaded alongside
genotype indices. Only VCF FORMAT fields with Number=G are currently supported.
import numpy as np
# Load at construction time
svar = SparseVar("out.svar", fields={"dosages": np.float32})
# Or derive from an existing SparseVar (shallow copy, re-opens the memmaps)
svar_with = svar.with_fields({"dosages": np.float32})
# read_ranges now returns an awkward record array
result = svar_with.read_ranges("1", starts=0, ends=365)
result.genos.data # flat array of variant indices (uint32)
result.dosages.data # flat array of dosage values (float32)
# Drop all fields to get back a plain Ragged[V_IDX_TYPE]
svar_plain = svar_with.with_fields(False)
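As a rough mental model for how a field can sit alongside the genotype indices, the sketch below stores a per-call value in a .npy file and memory-maps it with NumPy. It assumes the field holds one value per stored variant index and shares the genotype offsets; that alignment, the file name, and the toy values are assumptions for illustration and are not confirmed by the SVAR documentation.

```python
import os
import tempfile

import numpy as np

# Hypothetical field file; the real SVAR directory layout may differ.
field_path = os.path.join(tempfile.mkdtemp(), "dosages.npy")

# Toy genotype data: 3 ragged rows of variant indices.
genos = np.array([0, 3, 3, 1, 2], dtype=np.uint32)
offsets = np.array([0, 2, 3, 5], dtype=np.int64)

# One dosage per stored variant index (assumed alignment), saved as .npy.
np.save(field_path, np.array([0.9, 1.0, 0.4, 0.7, 1.0], dtype=np.float32))

# mmap_mode="r" memory-maps the file instead of loading it into RAM.
dosages = np.load(field_path, mmap_mode="r")

# The same offsets slice both arrays, so each index keeps its dosage.
row = 2
start, stop = offsets[row], offsets[row + 1]
pairs = list(zip(genos[start:stop].tolist(), dosages[start:stop].tolist()))
```

Under this assumption, adding a field costs one extra flat array and no extra offsets, which is consistent with the flat `.data` arrays exposed for `genos` and `dosages` above.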
There’s a lot more that can be done with SparseVar; this documentation will be expanded as time permits.