Cookbook

Iterate and reframe datasets

Let’s first load the bencoded data from the compressed file:

from allisbns.dataset import load_bencoded

input_path = "aa_isbn13_codes_20251118T170842Z.benc.zst"
with open(input_path, "rb") as f:
    input_data = load_bencoded(f)

Then create an iterator over all datasets and iterate:

from allisbns.dataset import iterate_datasets

for dataset in iterate_datasets(input_data):
    ...

The iteration can be narrowed to selected collections only:

for dataset in iterate_datasets(
    input_data, collections=["md5", "rgb"]
):
    ...

The datasets can also be lazily reframed to new bounds. For example, let’s iterate over the ‘978’ region of all datasets:

from allisbns.isbn import get_prefix_bounds

# Get the corresponding bounds
start_isbn, end_isbn = get_prefix_bounds("978")

# Create the iterator, fill all datasets to the end ISBN
iterator = iterate_datasets(input_data, fill_to_isbn=end_isbn)

# Use a generator expression to lazily reframe datasets
reframing = (x.reframe(start_isbn, end_isbn) for x in iterator)
for reframed_dataset in reframing:
    ...
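
The laziness here comes from the generator expression itself rather than from anything specific to the library: no dataset is reframed until the loop asks for the next item. Here is a minimal sketch of that behaviour, using a hypothetical FakeDataset stand-in (not part of allisbns) whose reframe() only records that it was called:

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset; not part of allisbns."""

    calls = []  # names of the datasets reframed so far, in order

    def __init__(self, name):
        self.name = name

    def reframe(self, start, end):
        # Record the call so we can observe when the work happens
        FakeDataset.calls.append(self.name)
        return (self.name, start, end)


datasets = iter([FakeDataset("a"), FakeDataset("b"), FakeDataset("c")])

# Building the generator does not call reframe() yet
reframing = (x.reframe(0, 10) for x in datasets)
calls_after_creation = list(FakeDataset.calls)

# Each step of iteration reframes exactly one dataset
first = next(reframing)
calls_after_first = list(FakeDataset.calls)

# Consuming the rest reframes the remaining datasets
rest = list(reframing)
calls_at_end = list(FakeDataset.calls)
```

Because only one reframed dataset needs to exist at a time, this pattern should keep memory use low when each dataset is large.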

Merge and save datasets

Create the iterator as above and take the union of all datasets:

from allisbns.isbn import LAST_ISBN
from allisbns.merge import union

# All datasets must share the same bounds, so fill each to the last ISBN
iterator = iterate_datasets(input_data, fill_to_isbn=LAST_ISBN)

all_merged = union(iterator)

After merging, we can save the resulting codes to a file for later use. For example, let’s temporarily save them to a binary file in NumPy format:

import numpy as np

# Reuse the timestamp from the input file name
timestamp = str(input_path).split(".")[0].split("_")[-1]
output_path = f"ms_isbn13_codes_{timestamp}_all.npy"

with open(output_path, "wb") as f:
    np.save(f, all_merged.codes, allow_pickle=False)
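
To check a saved file, the array can be read back with np.load(). The sketch below uses a small made-up array in place of all_merged.codes (whose actual dtype and shape are not shown here) and writes to a temporary directory, so it runs on its own:

```python
import tempfile
from pathlib import Path

import numpy as np

# Made-up stand-in for `all_merged.codes`; the real dtype may differ
codes = np.array([3, 5, 2, 10], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_codes.npy"

    with open(path, "wb") as f:
        np.save(f, codes, allow_pickle=False)

    # allow_pickle=False on the read side refuses pickled object arrays
    loaded = np.load(path, allow_pickle=False)

# The values and the dtype should both survive the round trip
round_trip_ok = np.array_equal(codes, loaded) and loaded.dtype == codes.dtype
```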

To write it in the original compressed format, we can use write_bencoded():

from pathlib import Path

with open(Path(output_path).with_suffix(".benc.zst"), "wb") as f:
    all_merged.write_bencoded(f, prefix="all")
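
Note that with_suffix() is a pathlib.Path method and replaces only the last suffix. That fits the single-suffix .npy name, but it would not strip a double suffix such as .benc.zst in one call. A small standard-library sketch of the file-name handling used in this section:

```python
from pathlib import Path

input_path = "aa_isbn13_codes_20251118T170842Z.benc.zst"

# Text before the first dot, then the last underscore-separated field
timestamp = str(input_path).split(".")[0].split("_")[-1]

npy_path = Path(f"ms_isbn13_codes_{timestamp}_all.npy")

# Only the final suffix (".npy") is replaced
benc_path = npy_path.with_suffix(".benc.zst")

# A double suffix is reported piece by piece
suffixes = benc_path.suffixes

# Calling with_suffix() again swaps only the trailing ".zst"
again = benc_path.with_suffix(".npy")
```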