# Sparse Variant Format (SVAR)

## Motivation

Common genomic data formats such as VCF/BCF and PLINK encode genotypes as a dense matrix. These matrices are typically extremely sparse (< 1% density), especially for whole genome sequencing or cancer data, so the formats rely on block-wise compression to avoid consuming excessive disk space. However, block-compressed data can easily become a data processing bottleneck in machine learning applications, where random sampling is required during training. This was a major problem during the development of [GenVarLoader](https://github.com/mcvickerlab/GenVarLoader), for example.

By using a sparse format instead, we were able to avoid compression while keeping the file size on par with an equivalent compressed BCF. As a result, we can memory map the genotypes to work with larger-than-RAM data, with random access that is much faster than with compressed formats. For example, GenVarLoader computes directly on the SVAR format, and this is a major factor in its 1000x speedup over alternative methods.

## Creating SVAR files

```python
from genoray import SparseVar, VCF, PGEN

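# Convert a VCF or a PGEN file to SVAR (run whichever matches your input)
SparseVar.from_vcf("out.svar", "file.vcf.gz", max_mem="4g")
SparseVar.from_pgen("out.svar", "file.pgen", max_mem="4g")

# Open the resulting SVAR
svar = SparseVar("out.svar")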
```

## Reading SVAR

```python
# shape: (ranges, samples, ploidy, ~variants)
sp_genos = svar.read_ranges("1", starts=0, ends=365, samples="Aang")
```

When using an SVAR file, `read_ranges` returns a `Ragged[V_IDX_TYPE]`, a ragged array in which
the number of ALT calls varies per sample and ploid. For a brief visual description of Ragged
arrays, see [this section of the GenVarLoader FAQ](https://genvarloader.readthedocs.io/en/latest/faq.html#why-does-a-dataset-return-ragged-objects-and-what-are-they).
The returned array can be arbitrarily large because its data is backed by a
[`numpy.memmap`](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html) object
(only the offsets reside in RAM).
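
To make the layout concrete, here is a minimal NumPy-only sketch of the general idea: a flat, memory-mapped array of values plus a small in-memory offsets array. It does not use genoray, and the file name, the toy values, and the `get_group` helper are hypothetical; the actual on-disk layout of SVAR may differ.

```python
import os
import tempfile

import numpy as np

# Toy ragged layout: 3 groups of variant indices with lengths 2, 0, and 3.
flat = np.array([4, 9, 1, 5, 7], dtype=np.uint32)
offsets = np.array([0, 2, 2, 5], dtype=np.int64)  # small, kept in RAM

# Write the flat values to disk, then memory-map them instead of loading them.
# The file name is hypothetical and unrelated to the real SVAR directory layout.
path = os.path.join(tempfile.mkdtemp(), "toy_variant_idxs.bin")
flat.tofile(path)
data = np.memmap(path, dtype=np.uint32, mode="r")


def get_group(i: int) -> np.ndarray:
    """Return group i as a view into the memory-mapped data."""
    return data[offsets[i] : offsets[i + 1]]


groups = [get_group(i).tolist() for i in range(len(offsets) - 1)]
print(groups)  # [[4, 9], [], [1, 5, 7]]
```

Slicing by offsets only touches the pages of the file that back the requested group, which is why this kind of layout supports random access without decompression.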

Each value in the ragged array is a variant index: the row number in `svar.index` for the
variant that is present in each range, sample, and ploid.

```python
v_idxs = sp_genos.to_awkward()[0, 0, 0].to_numpy()
```
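
Since the values are row numbers, they can be used to look up per-variant attributes. The sketch below illustrates this with plain NumPy arrays standing in for the variant table; `index_pos` and `index_alt` are hypothetical names, and the actual type and columns of `svar.index` are not assumed here.

```python
import numpy as np

# Hypothetical stand-in for the variant table: row i describes variant i.
index_pos = np.array([101, 150, 222, 310, 364], dtype=np.int64)
index_alt = np.array(["A", "T", "G", "C", "T"])

# Variant indices for one (range, sample, ploid) entry, like `v_idxs` above.
v_idxs = np.array([0, 3, 4], dtype=np.uint32)

# Integer-array indexing turns row numbers into per-variant attributes.
positions = index_pos[v_idxs]
alts = index_alt[v_idxs]
print(positions.tolist())  # [101, 310, 364]
print(alts.tolist())  # ['A', 'C', 'T']
```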

## Loading additional fields

Custom numeric fields stored as `.npy` files in the SVAR directory can be loaded alongside
genotype indices. Only VCF FORMAT fields with `Number=G` are currently supported.

```python
import numpy as np

# Load at construction time
svar = SparseVar("out.svar", fields={"dosages": np.float32})

# Or derive from an existing SparseVar (shallow copy, re-opens the memmaps)
svar_with = svar.with_fields({"dosages": np.float32})

# read_ranges now returns an awkward record array
result = svar_with.read_ranges("1", starts=0, ends=365)
result.genos.data    # flat array of variant indices (uint32)
result.dosages.data  # flat array of dosage values (float32)

# Drop all fields to get back a plain Ragged[V_IDX_TYPE]
svar_plain = svar_with.with_fields(False)
```
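
As an illustration of how the flat arrays might be used together, the following NumPy-only sketch filters calls by dosage. It assumes that the genotype and field arrays are aligned element by element and share one set of offsets; that alignment, the toy values, and the 0.5 threshold are assumptions made for the example, not documented behavior.

```python
import numpy as np

# Toy flat arrays standing in for `result.genos.data` and `result.dosages.data`.
genos_data = np.array([4, 9, 1, 5, 7], dtype=np.uint32)
dosages_data = np.array([0.9, 1.0, 0.4, 0.8, 1.0], dtype=np.float32)

# Assumed shared offsets: group i spans offsets[i]:offsets[i + 1] in both arrays.
offsets = np.array([0, 2, 2, 5], dtype=np.int64)

# Keep only calls with dosage >= 0.5 (arbitrary threshold for illustration).
keep = dosages_data >= 0.5
filtered_genos = genos_data[keep]

# Rebuild offsets: kept_before[j] counts the kept calls among the first j calls.
kept_before = np.concatenate([[0], np.cumsum(keep)])
filtered_offsets = kept_before[offsets]

print(filtered_genos.tolist())  # [4, 9, 5, 7]
print(filtered_offsets.tolist())  # [0, 2, 2, 4]
```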

There's a lot more that can be done with `SparseVar`; this documentation will be expanded as time permits.
