Cookbook

Iterate and reframe datasets

Let’s first load the bencoded data from the compressed file:

from allisbns.dataset import load_bencoded

input_path = "aa_isbn13_codes_20251118T170842Z.benc.zst"
with open(input_path, "rb") as f:
    input_data = load_bencoded(f)

Then create an iterator over all datasets and iterate:

from allisbns.dataset import iterate_datasets, CodeDataset
from allisbns.isbn import LAST_ISBN

for dataset in iterate_datasets(input_data):
    ...

The iteration can be narrowed to selected collections only:

for dataset in iterate_datasets(
    input_data, collections=["md5", "rgb"]
):
    ...

Datasets can also be lazily reframed to new bounds. For example, let’s iterate over the ‘978’ region of all datasets:

from allisbns.isbn import get_prefix_bounds

# Get the corresponding bounds
start_isbn, end_isbn = get_prefix_bounds("978")

# Create the iterator, fill all datasets to the end ISBN
iterator = iterate_datasets(input_data, fill_to_isbn=end_isbn)

# Use a generator expression to lazily reframe datasets
reframing = (x.reframe(start_isbn, end_isbn) for x in iterator)
for reframed_dataset in reframing:
    ...
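To make the idea of prefix bounds concrete, here is a minimal standalone sketch that computes the first and last valid ISBN-13 for a given prefix. The helper names are hypothetical, and whether get_prefix_bounds() returns values with or without the check digit is an assumption here; only the check-digit rule itself is standard.

```python
def isbn13_check_digit(first12: str) -> int:
    """Return the ISBN-13 check digit for the first 12 digits."""
    # Weights alternate 1, 3, 1, 3, ... starting from the left
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def toy_prefix_bounds(prefix: str) -> tuple[str, str]:
    """Hypothetical helper: first and last ISBN-13 starting with prefix."""
    first12 = prefix.ljust(12, "0")  # smallest 12-digit body
    last12 = prefix.ljust(12, "9")   # largest 12-digit body
    return (
        first12 + str(isbn13_check_digit(first12)),
        last12 + str(isbn13_check_digit(last12)),
    )


print(toy_prefix_bounds("978"))
```

The real function may represent bounds differently (for example, as integers), so treat this only as an illustration of what "the ‘978’ region" covers.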

Merge all datasets

Create the iterator as above and union all datasets together:

from allisbns.isbn import LAST_ISBN
from allisbns.merge import union

# The bounds must be the same
iterator = iterate_datasets(input_data, fill_to_isbn=LAST_ISBN)

all_merged = union(iterator)
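Why the bounds must match can be seen in a toy model of a union. The sketch below assumes that codes alternate between the lengths of streaks (ISBNs present) and gaps (ISBNs absent), starting with a streak; that layout and all function names are assumptions for illustration, not the implementation behind union().

```python
from itertools import groupby


def decode(codes: list[int]) -> list[bool]:
    """Expand alternating streak/gap lengths into presence flags."""
    present = True  # assumed: the sequence starts with a streak
    flags: list[bool] = []
    for length in codes:
        flags.extend([present] * length)
        present = not present
    return flags


def encode(flags: list[bool]) -> list[int]:
    """Collapse presence flags back into alternating streak/gap lengths."""
    codes: list[int] = []
    expected = True
    for value, run in groupby(flags):
        if value != expected:
            codes.append(0)  # zero-length streak so a gap can come first
            expected = not expected
        codes.append(len(list(run)))
        expected = not expected
    return codes


def toy_union(*all_codes: list[int]) -> list[int]:
    """Union of several code sequences that cover the same range."""
    decoded = [decode(codes) for codes in all_codes]
    if len({len(flags) for flags in decoded}) != 1:
        raise ValueError("all inputs must cover the same bounds")
    return encode([any(column) for column in zip(*decoded)])


print(toy_union([2, 3, 1], [0, 1, 2, 3]))
```

Position-wise merging only makes sense when every input spans the same range, which is what filling all datasets to the same end ISBN guarantees in this model.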

After merging, we can save the resulting codes to a file for later use. For example, let’s temporarily save them to a binary file in NumPy format:

import numpy as np

timestamp = str(input_path).split(".")[0].split("_")[-1]
output_path = f"xy_isbn13_codes_{timestamp}_all.npy"

with open(output_path, "wb") as f:
    np.save(f, all_merged.codes, allow_pickle=False)

To write the result in the original compressed format, we can use write_bencoded():

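from pathlib import Path

# output_path is a plain string, so wrap it in Path to swap the suffix
with open(Path(output_path).with_suffix(".benc.zst"), "wb") as f: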
    all_merged.write_bencoded(f, prefix="all")

Store datasets in HDF5

The original files with ISBN codes have a simple structure: all codes are packed into a single bencoded dictionary and shipped compressed with Zstd. That yields a compact distribution (~80 MB) but forces full decompression (~700 MB) to read any subset, which slows down interactive use. An alternative is to store the codes in a container format optimized for homogeneous arrays and partial access without decompressing the whole file, such as HDF5, NetCDF, or Zarr.
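The layout of the bencoded dictionary can be illustrated with a minimal encoder for a flat dictionary of byte strings. This is a sketch with made-up collection payloads and a hypothetical function name; it follows the general bencode rules (sorted keys, length-prefixed byte strings) and is not the project's own writer.

```python
def bencode_bytes_dict(mapping: dict[bytes, bytes]) -> bytes:
    """Encode a flat dict of byte strings as a bencoded dictionary."""
    parts = [b"d"]  # dictionaries start with "d" and end with "e"
    for key in sorted(mapping):  # bencode requires keys in sorted order
        for item in (key, mapping[key]):
            # Byte strings are written as <length>:<bytes>
            parts.append(str(len(item)).encode() + b":" + item)
    parts.append(b"e")
    return b"".join(parts)


# Two made-up collections with zero-filled placeholder payloads
toy = {b"md5": bytes(12), b"rgb": bytes(8)}
payload = bencode_bytes_dict(toy)
print(len(payload))
```

All collections sit back to back in one stream, with no index to jump to a given key, and that stream is compressed as a whole.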

Here we experiment with HDF5 to convert bencoded files and save grouped analysis results. Experimental support for working with HDF5 files is also available.

Current way

Currently, reading the codes of a single collection from a bencoded file (without keeping the uncompressed data in memory afterwards) can be written as follows:

import struct

import bencodepy
import zstandard

def read_codes(path: str, collection: str) -> tuple[int, ...]:
    with open(path, "rb") as f:
        with zstandard.ZstdDecompressor().stream_reader(f) as s:
            uncompressed_data = bencodepy.bread(s)
    packed_binary_codes = uncompressed_data[collection.encode()]
    return struct.unpack(
        f"{len(packed_binary_codes) // 4}I",
        packed_binary_codes
    )
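The unpacking step can be tried in isolation on a few made-up values. This sketch uses an explicit little-endian prefix ("<"), which is an assumption; the function above relies on the native byte order instead.

```python
import struct

# Made-up values standing in for a collection's codes
values = (2, 3, 1, 7)

# Each "I" is an unsigned 32-bit integer, so 4 bytes per value
packed = struct.pack(f"<{len(values)}I", *values)

# The number of values is recovered from the byte length
count = len(packed) // 4
unpacked = struct.unpack(f"<{count}I", packed)

print(len(packed), unpacked)
```

Note that struct.unpack() builds a tuple of Python integers, which takes noticeably more memory than the packed bytes for large collections.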

On a modest laptop, this takes over two seconds per call, which is noticeable in interactive sessions:

In [1]: %timeit read_codes("aa_isbn13_codes_20251118T170842Z.benc.zst", "md5")
2.38 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Convert bencoded files

Alternatively, we provide the convert-bencoded-to-h5.py script (link) for converting the *.benc.zst files to HDF5.

$ uv run python scripts/convert-bencoded-to-h5.py \
    aa_isbn13_codes_20251118T170842Z.benc.zst
$ ls -sh
78M  aa_isbn13_codes_20251118T170842Z.benc.zst
82M  aa_isbn13_codes_20251118T170842Z.h5

The conversion is pretty quick and produces comparable file sizes. For the compression, we use Blosc (non-standard, available via hdf5plugin) with the shuffle and Zstd filters activated:

hdf5plugin.Blosc(
    cname="zstd",
    clevel=5,
    shuffle=hdf5plugin.Blosc.SHUFFLE
)
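The effect of the shuffle filter can be illustrated without HDF5. A byte shuffle regroups the bytes of fixed-size items so that all first bytes come first, then all second bytes, and so on, which tends to produce longer runs of similar bytes for the compressor. The sketch below uses the standard library only, with zlib standing in for Zstd and made-up data; it is not how Blosc implements the filter.

```python
import struct
import zlib


def shuffle(data: bytes, itemsize: int = 4) -> bytes:
    """Group the i-th byte of every item together."""
    return b"".join(data[i::itemsize] for i in range(itemsize))


def unshuffle(data: bytes, itemsize: int = 4) -> bytes:
    """Invert shuffle() for data whose length is a multiple of itemsize."""
    count = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * count:(i + 1) * count]
    return bytes(out)


# Made-up data: 1000 small unsigned 32-bit integers
raw = struct.pack("<1000I", *range(1000))
shuffled = shuffle(raw)

print("plain:   ", len(zlib.compress(raw)))
print("shuffled:", len(zlib.compress(shuffled)))
```

For small integers most high-order bytes are zero, so after shuffling they form long constant runs; the actual gain depends on the data and the compressor.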

After that, reading codes as NumPy arrays is as simple as:

import h5py
import numpy as np
import numpy.typing as npt

def read_codes(path: str, collection: str) -> npt.NDArray[np.int32]:
    with h5py.File(path, "r") as f:
        return f[collection][:]

In [1]: %timeit read_codes("aa_isbn13_codes_20251118T170842Z.h5", "md5")
104 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This setup preserves compact storage while enabling fast partial reads.

Groups and attributes

The structured nature of HDF5 is useful for storing datasets and the corresponding metadata in a single file. For example, see the Compare dumps example, where we compare the ‘md5’ datasets from the two latest dumps to find additions and deletions and save the results in an HDF5 file using groups.
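The logic of finding additions and deletions between two dumps reduces to set differences. The sketch below shows it on plain Python sets with made-up ISBNs; the linked example works on datasets and may do this differently, so this only illustrates the underlying idea.

```python
# Made-up contents of the same collection in two consecutive dumps
old_dump = {9780000000002, 9780306406157, 9781234567897}
new_dump = {9780306406157, 9781234567897, 9789999999991}

# Present now but not before
additions = new_dump - old_dump
# Present before but not anymore
deletions = old_dump - new_dump

print(sorted(additions), sorted(deletions))
```

Each result could then go into its own group of an HDF5 file, next to attributes that record which dumps were compared.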

HDF5 support

The experimental hdf5 branch adds a module for reading and iterating over datasets from HDF5 files. See the module docstrings for documentation.

To install allisbns with HDF5 support, run the following command:

uv add git+https://github.com/xymaxim/allisbns@hdf5[h5]

Read a dataset from a converted source file:

from allisbns.hdf5 import from_h5

from_h5("aa_isbn13_codes_20251222T170326Z.h5", "md5")

Iterate over all datasets available in a source file:

import h5py

from allisbns.hdf5 import iterate_datasets

with h5py.File("aa_isbn13_codes_20251222T170326Z.h5") as f:
    for dataset in iterate_datasets(f):
        ...