Cookbook
Iterate and reframe datasets
Let’s first load the bencoded data from the compressed file:
from allisbns.dataset import load_bencoded
input_path = "aa_isbn13_codes_20251118T170842Z.benc.zst"
with open(input_path, "rb") as f:
    input_data = load_bencoded(f)
Then create an iterator over all datasets and iterate:
from allisbns.dataset import iterate_datasets, CodeDataset
from allisbns.isbn import LAST_ISBN
for dataset in iterate_datasets(input_data):
    ...
The iteration can be narrowed to selected collections only:
for dataset in iterate_datasets(
    input_data, collections=["md5", "rgb"]
):
    ...
The datasets can also be lazily reframed to new bounds while iterating. For example, let’s iterate over the ‘978’ region of all datasets:
from allisbns.isbn import get_prefix_bounds
# Get the corresponding bounds
start_isbn, end_isbn = get_prefix_bounds("978")
# Create the iterator, fill all datasets to the end ISBN
iterator = iterate_datasets(input_data, fill_to_isbn=end_isbn)
# Use a generator expression to lazily reframe datasets
reframing = (x.reframe(start_isbn, end_isbn) for x in iterator)
for reframed_dataset in reframing:
    ...
Merge all datasets
Create the iterator as above and union all datasets together:
from allisbns.isbn import LAST_ISBN
from allisbns.merge import union
# All datasets must share the same bounds, so fill each to the last ISBN
iterator = iterate_datasets(input_data, fill_to_isbn=LAST_ISBN)
all_merged = union(iterator)
After merging, we can save the resulting codes to a file for later use. For
example, let’s temporarily save them to a binary file in NumPy format:
import numpy as np
# Extract the timestamp from the input file name
timestamp = str(input_path).split(".")[0].split("_")[-1]
output_path = f"xy_isbn13_codes_{timestamp}_all.npy"
with open(output_path, "wb") as f:
    np.save(f, all_merged.codes, allow_pickle=False)
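To read the codes back later, `np.load` with `allow_pickle=False` is the counterpart of the call above. The following is a minimal, self-contained sketch: the small array is a made-up stand-in for `all_merged.codes`, and the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for all_merged.codes (hypothetical values)
codes = np.array([3, 10, 2], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "codes_example.npy"
    with open(path, "wb") as f:
        np.save(f, codes, allow_pickle=False)
    # Load the array back; pickling stays disabled when reading as well
    with open(path, "rb") as f:
        restored = np.load(f, allow_pickle=False)
```

The dtype and shape are stored in the `.npy` header, so the restored array does not need any extra metadata to be interpreted.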
To write it down in the original format with compression, we can use
write_bencoded():
from pathlib import Path
# output_path is a string, so wrap it in Path to swap the suffix
with open(Path(output_path).with_suffix(".benc.zst"), "wb") as f:
    all_merged.write_bencoded(f, prefix="all")
Store datasets in HDF5
The original files with ISBN codes have a fairly simple structure: all codes are packed into a single bencoded dictionary and shipped compressed with Zstd. That yields a compact distribution (~80 MB) but forces full decompression (~700 MB) to read any subset, which slows down interactive use. An alternative is to store the codes in a container format optimized for homogeneous arrays and for partial access without decompressing the whole file, such as HDF5, NetCDF, or Zarr.
Here we experiment with HDF5 to convert bencoded files and save grouped analysis results. The library also has experimental support for working with HDF5 files.
Current way
Currently, reading the codes of a single collection from a bencoded file (without keeping uncompressed data in memory) can be written as follows:
import struct
import bencodepy
import zstandard
def read_codes(path: str, collection: str) -> tuple[int, ...]:
    with open(path, "rb") as f:
        with zstandard.ZstdDecompressor().stream_reader(f) as s:
            # Decode the whole dictionary from the decompressed stream
            uncompressed_data = bencodepy.bread(s)
    packed_binary_codes = uncompressed_data[collection.encode()]
    # Unpack the bytes as unsigned 32-bit integers
    return struct.unpack(
        f"{len(packed_binary_codes) // 4}I",
        packed_binary_codes
    )
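The returned tuple is still the packed representation, not a list of ISBNs. As a rough sketch, assuming the layout commonly described for these dumps (an assumption, not something stated above), where values alternate between the length of a streak of present ISBNs and the length of the gap that follows, the number of ISBNs in a collection could be counted as shown below. The helper names `unpack_codes` and `count_isbns` and the sample bytes are made up for illustration.

```python
import struct


def unpack_codes(packed: bytes) -> tuple[int, ...]:
    """Unpack a byte string into unsigned 32-bit integers (native order)."""
    return struct.unpack(f"{len(packed) // 4}I", packed)


def count_isbns(codes: tuple[int, ...]) -> int:
    """Sum streak lengths, assuming codes alternate streak, gap, streak, ..."""
    return sum(codes[0::2])


# Synthetic sample: 3 ISBNs present, a gap of 10, then 2 more present
sample = struct.pack("3I", 3, 10, 2)
codes = unpack_codes(sample)
total = count_isbns(codes)
```

If the actual layout differs, only `count_isbns` needs to change; the unpacking step mirrors the `struct.unpack` call used above.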
On a low-powered laptop, I got the following timing, which is noticeable in interactive sessions:
In [1]: %timeit read_codes("aa_isbn13_codes_20251118T170842Z.benc.zst", "md5")
2.38 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Convert bencoded files
Alternatively, we provide the convert-bencoded-to-h5.py script (link)
for converting the *.benc.zst files to HDF5.
$ uv run python scripts/convert-bencoded-to-h5.py \
aa_isbn13_codes_20251118T170842Z.benc.zst
$ ls -sh
78M aa_isbn13_codes_20251118T170842Z.benc.zst
82M aa_isbn13_codes_20251118T170842Z.h5
The conversion is quick and produces a file of comparable size. For compression, we use Blosc (a non-standard filter, available via hdf5plugin) with the shuffle and Zstd filters enabled:
hdf5plugin.Blosc(
    cname="zstd",
    clevel=5,
    shuffle=hdf5plugin.Blosc.SHUFFLE
)
After that, reading codes as NumPy arrays is as simple as:
import h5py
import numpy as np
import numpy.typing as npt
def read_codes(path: str, collection: str) -> npt.NDArray[np.int32]:
    with h5py.File(path, "r") as f:
        return f[collection][:]
In [1]: %timeit read_codes("aa_isbn13_codes_20251118T170842Z.h5", "md5")
104 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This setup keeps storage compact while enabling fast partial reads.
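The function above still reads the whole dataset with `[:]`. Because HDF5 stores compressed datasets in chunks, slicing an h5py dataset reads and decompresses only the chunks that overlap the requested range. Here is a small sketch with a temporary file; it uses the built-in gzip filter instead of Blosc so that it runs without hdf5plugin, and the values, chunk size, and file name are made up:

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

# Made-up data standing in for the codes of one collection
values = np.arange(1_000_000, dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.h5"
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "md5", data=values, chunks=(65_536,), compression="gzip"
        )
    with h5py.File(path, "r") as f:
        # Slicing the dataset reads only the requested range from disk
        part = f["md5"][500_000:500_010]
```

The same slicing should work for files written with the Blosc filter, provided hdf5plugin is imported before the file is opened.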
Groups and attributes
The structured nature of HDF5 makes it possible to store datasets and the corresponding metadata in a single file. For example, see the Compare dumps example, where we compare the ‘md5’ datasets from the two latest dumps to find additions and deletions and save the results in an HDF5 file using groups.
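As a generic illustration of groups and attributes with h5py (this is not the layout used in the Compare dumps example; the group, dataset, and attribute names as well as the values are hypothetical), results and their metadata could be stored together like this:

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "comparison.h5"
    with h5py.File(path, "w") as f:
        group = f.create_group("md5")
        # Metadata is stored next to the data as attributes
        group.attrs["old_dump"] = "20251118T170842Z"
        group.attrs["new_dump"] = "20251222T170326Z"
        group.create_dataset(
            "additions", data=np.array([5, 2, 7], dtype=np.int32)
        )
        group.create_dataset(
            "deletions", data=np.array([1, 4], dtype=np.int32)
        )
    with h5py.File(path, "r") as f:
        new_dump = f["md5"].attrs["new_dump"]
        additions = f["md5/additions"][:]
```

Datasets inside a group can be addressed either step by step (`f["md5"]["additions"]`) or with a path-like key (`f["md5/additions"]`).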
HDF5 support
The experimental hdf5 branch adds a module for reading and iterating over
datasets from HDF5 files. See the docstrings in the module for
documentation.
To install allisbns with HDF5 support, run the following command:
uv add git+https://github.com/xymaxim/allisbns@hdf5[h5]
Read a dataset from a converted source file:
from allisbns.hdf5 import from_h5
from_h5("aa_isbn13_codes_20251222T170326Z.h5", "md5")
Iterate over all datasets available in a source file:
import h5py
from allisbns.hdf5 import iterate_datasets
with h5py.File("aa_isbn13_codes_20251222T170326Z.h5") as f:
    for dataset in iterate_datasets(f):
        ...