Overview

Working with datasets

Source data

As part of the derived metadata, Anna and the team periodically publish files with ISBN codes named aa_isbn13_codes_*.benc.zst. They can be downloaded via the aa_derived_mirror_metadata torrent from this page.

Codes

The packed ISBN codes are a sequence of integers that represent the lengths of alternating streak (present ISBNs) and gap (missing ISBNs) segments, for example:

3, 2, 1, ...

The codes are mapped to 2 billion ISBNs covering both the 978 and 979 prefixes. The scheme is similar to run-length encoding and efficiently describes ISBN availability for a selected dataset. The above codes can be expanded to the following boolean mask:

True, True, True, False, False, True, ...
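
The expansion can be sketched in a few lines of plain Python. The helper name unpack_codes is hypothetical and is not part of the package; it only illustrates the alternating streak/gap convention described here.

```python
def unpack_codes(codes):
    """Expand alternating streak/gap lengths into a list of booleans."""
    mask = []
    filled = True  # codes start with a streak segment
    for length in codes:
        mask.extend([filled] * length)
        filled = not filled  # streaks and gaps alternate
    return mask


print(unpack_codes([3, 2, 1]))
# [True, True, True, False, False, True]
```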

It is assumed that the offset of all published datasets is 9780000000002. Also, codes end with a streak segment, so the last gap segment is omitted. See here for how codes are generated and for a description of the binary format used to store them.
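
Going the other way, a boolean mask could be packed roughly as follows. This is a sketch under the two conventions stated in this section (codes start with a streak, and a trailing gap is dropped); pack_mask is a hypothetical name and not the generator used to produce the published files.

```python
from itertools import groupby


def pack_mask(mask):
    """Run-length encode a boolean mask into streak/gap lengths."""
    codes = [len(list(run)) for _, run in groupby(mask)]
    if mask and not mask[0]:
        codes.insert(0, 0)  # a leading gap needs an empty first streak
    if mask and not mask[-1]:
        codes.pop()  # the trailing gap is omitted
    return codes


print(pack_mask([True, True, True, False, False, True, False, False]))
# [3, 2, 1]
```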

In our package, we use several terms related to ISBN availability. We distinguish between packed and unpacked codes: PackedCodes refers to the original representation of the codes and UnpackedCodes to the decoded one, as boolean values. To distinguish between the individual ISBNs from streak and gap segments, we call them filled and unfilled ISBNs, respectively.
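
To make the filled/unfilled terminology concrete, here is a minimal sketch that lists filled ISBNs directly from packed codes. The function filled_isbns is hypothetical, and the offset is taken as the ISBN-12 integer 978000000000 (that is, 9780000000002 without its check digit), which is an assumption made for this illustration.

```python
def filled_isbns(codes, offset):
    """Yield ISBN-12 integers that fall inside streak segments."""
    position = offset
    for index, length in enumerate(codes):
        if index % 2 == 0:  # even positions are streaks
            yield from range(position, position + length)
        position += length  # gaps only advance the position


print(list(filled_isbns([3, 2, 1], 978_000_000_000)))
# [978000000000, 978000000001, 978000000002, 978000000005]
```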

Read datasets

Several ways

There are several ways to read datasets from the bencoded compressed files.

The first way is quick and short. It decompresses the file and extracts a single dataset:

>>> from allisbns.dataset import CodeDataset
>>>
>>> CodeDataset.from_file(
...     source="aa_isbn13_codes_20251118T170842Z.benc.zst",
...     collection="md5",
... )
CodeDataset(array([    6,     1,     9, ...,     1, 91739,     1],
  shape=(14737375,), dtype=int32), bounds=(978000000000, 979999468900))

The second way is more practical when you need to read several datasets from the same file.

from allisbns.dataset import CodeDataset, load_bencoded, unpack_data

# Load the bencoded data
input_path = "aa_isbn13_codes_20251118T170842Z.benc.zst"
with open(input_path, "rb") as f:
    input_data = load_bencoded(f)

# Extract the desired datasets
md5 = CodeDataset(unpack_data(input_data[b"md5"]))
rgb = CodeDataset(unpack_data(input_data[b"rgb"]))

The third way allows you to iterate over datasets in the input data. The above example can be rewritten with allisbns.dataset.iterate_datasets() as follows:

from allisbns.dataset import iterate_datasets
md5, rgb = iterate_datasets(input_data, ["md5", "rgb"])

See more examples in iterate_datasets()’s docstring or this cookbook recipe.

Extend codes

By default, datasets are read as is and cover different numbers of ISBNs. However, it is sometimes necessary to bring datasets to the same size, for example, when you merge or plot them. To extend a dataset up to a given ISBN with a gap segment, use the fill_to_isbn argument:

>>> from allisbns.isbn import LAST_ISBN
>>>
>>> CodeDataset.from_file(
...     source="aa_isbn13_codes_20251118T170842Z.benc.zst",
...     collection="md5",
...     fill_to_isbn=LAST_ISBN,
... )
CodeDataset(array([     6,      1,      9, ...,  91739,      1, 531099],
  shape=(14737376,)), bounds=(978000000000, 979999999999))
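
The appended value follows from simple arithmetic on the bounds. The sketch below shows one way such an extension could be computed on a plain list of codes; extend_codes is a hypothetical helper, not the package's implementation, and it assumes non-empty codes.

```python
def extend_codes(codes, offset, fill_to_isbn):
    """Append a gap so that non-empty codes reach fill_to_isbn."""
    last_covered = offset + sum(codes) - 1
    missing = fill_to_isbn - last_covered
    if missing <= 0:
        return list(codes)  # already long enough
    if len(codes) % 2 == 0:
        # Codes already end with a gap: lengthen it
        return list(codes[:-1]) + [codes[-1] + missing]
    return list(codes) + [missing]


# Gap appended in the output above: new upper bound minus old one
print(979_999_999_999 - 979_999_468_900)  # 531099
print(extend_codes([3, 2, 1], 100, 109))  # [3, 2, 1, 4]
```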

Reframe datasets

After reading, it is possible to select a part of a dataset by reframing it. Reframing includes both cropping existing segments and extending codes with gap segments on either side.

One use case is limiting the output of some methods. For example, if you need to get all filled ISBNs with the ‘979’ prefix, you first need to reframe the dataset with reframe():

from allisbns.isbn import get_prefix_bounds
md5.reframe(*get_prefix_bounds("979")).get_filled_isbns()
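
Conceptually, reframing crops and pads at the same time. The sketch below shows the idea on an unpacked boolean mask for clarity; reframe_mask is a hypothetical helper, and the package itself presumably operates on packed codes rather than on a mask like this.

```python
def reframe_mask(mask, offset, start, end):
    """Return the mask for ISBNs start..end (inclusive).

    Positions outside the original mask are treated as gaps (False).
    """
    result = []
    for isbn in range(start, end + 1):
        index = isbn - offset
        inside = 0 <= index < len(mask)
        result.append(mask[index] if inside else False)
    return result


mask = [True, True, False, True]  # covers ISBNs 10..13
print(reframe_mask(mask, 10, 12, 15))  # [False, True, False, False]
print(reframe_mask(mask, 10, 8, 11))   # [False, False, True, True]
```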

Immutable datasets

By design, datasets are considered immutable after creation. If you need to modify codes for some reason, you can copy the codes, edit them, and create a new dataset. For example, here is an equivalent of the invert() method, which inverts a dataset:

>>> import numpy as np
>>>
>>> CodeDataset(np.concatenate([[0], md5.codes]), offset=md5.offset)
CodeDataset(array([     0,      6,      1, ...,  91739,      1],
  shape=(14737376,), dtype=int32), bounds=(978000000000, 979999468900))
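
Prepending a zero inverts the dataset because it inserts an empty leading streak, so every following segment swaps its role. A small self-contained check of that claim, using a hypothetical unpack helper on a toy list of codes:

```python
def unpack(codes):
    """Expand alternating streak/gap lengths into booleans."""
    mask, filled = [], True
    for length in codes:
        mask.extend([filled] * length)
        filled = not filled
    return mask


codes = [3, 2, 1]
inverted = [0] + codes  # empty leading streak shifts every role by one

# Every position flips between filled and unfilled
assert unpack(inverted) == [not value for value in unpack(codes)]
print(unpack(inverted))
# [False, False, False, True, True, False]
```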

Working with ISBNs

The package provides classes and functions to work with ISBNs of both numeric and string types.

To simplify things, the methods of CodeDataset accept only ISBNs represented as integer ISBN-12 values (without the check digit). We omit range validation there, but you can use ensure_isbn12() to make sure that your values are valid if needed.

To work with strings, we have two classes, CanonicalISBN and MaskedISBN.

Normalize ISBNs

Let’s say we have an ISBN from an arbitrary source.

isbn = "978-23-6590-117-X"

It might be delimited by hyphens (correctly or not), contain an incorrect check digit, or not be an ISBN at all.

First, let’s try to normalize it with normalize_isbn() and, if needed, complete it with the correct check digit.

>>> # Normalize the ISBN, keep the 'X' check digit
>>> canonical = normalize_isbn(isbn)
>>> canonical
CanonicalISBN(978236590117X)

>>> # Complete the ISBN with the check digit
>>> canonical = canonical.complete()
>>> canonical
CanonicalISBN(9782365901178)

Now we are sure that our ISBN value is valid. The result is a CanonicalISBN instance. We can then, for example, convert it to an ISBN-12 integer:

>>> isbn12 = canonical.to_isbn12()
>>> isbn12
978236590117

>>> # Can be safely used for querying the dataset
>>> md5.query(isbn12)
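
For reference, the check digit that replaces the invalid 'X' follows the standard ISBN-13 rule: weight the twelve digits alternately by 1 and 3, sum them, and take the complement modulo 10. The function below is an independent sketch with a hypothetical name, not the package's own code.

```python
def isbn13_check_digit(isbn12):
    """Compute the ISBN-13 check digit for a 12-digit integer."""
    digits = [int(char) for char in f"{isbn12:012d}"]
    # Weights alternate 1, 3, 1, 3, ... from the leftmost digit
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return (10 - total % 10) % 10


print(isbn13_check_digit(978236590117))  # 8
print(isbn13_check_digit(978000000000))  # 2
```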

Format ISBNs

The canonical ISBNs can be formatted with hyphens to separate their elements:

>>> canonical.hyphen()
'978-2-36590-117-8'

We have the MaskedISBN class underneath for that: it validates ISBN ranges and splits ISBNs into distinct elements. In most cases you do not need to initialize it directly, since creating it from canonical ISBNs is more practical:

>>> masked = MaskedISBN.from_canonical(canonical)
>>> masked
MaskedISBN(
    bookland='978',
    group='2',
    registrant='36590',
    publication='117',
    check_digit='8',
)

The masked ISBNs enable slicing to output formatted ISBNs in a granular manner:

>>> # Get the full publisher prefix
>>> masked[:3]
'978-2-36590'
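
One way such slicing could work is to keep the elements in order and join the selected ones with hyphens. The Elements class below is a hypothetical stand-in written for this illustration; how MaskedISBN handles a single integer index is not stated here, so returning the bare element is an assumption.

```python
class Elements:
    """Hypothetical stand-in for an ISBN split into five elements."""

    def __init__(self, *parts):
        self.parts = parts

    def __getitem__(self, key):
        selected = self.parts[key]
        # A slice yields several elements, an index yields one
        if isinstance(key, slice):
            return "-".join(selected)
        return selected


sketch = Elements("978", "2", "36590", "117", "8")
print(sketch[:3])  # 978-2-36590
print(sketch[:])   # 978-2-36590-117-8
```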

ISBN ranges

As you may have noticed, the formatted ISBN differs from our initial input string (‘978-23-6590-117-X’), which was hyphenated incorrectly. Masking ISBNs requires knowing the valid ISBN registration groups and registrant ranges. The valid ranges are available from the International ISBN Agency website. We store these values in the auto-generated allisbns.ranges module. Note that the ranges may change over time, so the stored values can become outdated and break validation: ranges that were not defined for use when the module was generated are still considered invalid.
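
The role of the ranges can be sketched as a lookup that decides where the registrant element ends. The rule table below is illustrative only: its single entry was chosen to match the example in this document and is not copied from the International ISBN Agency data or from allisbns.ranges. All names are hypothetical.

```python
# Illustrative rule only: chosen to match the example in this
# document, not copied from the International ISBN Agency data.
REGISTRANT_RULES = {
    "978-2": [("35000", "39999")],
}


def split_registrant(prefix, rest):
    """Split the digits after the group into registrant and publication."""
    for low, high in REGISTRANT_RULES.get(prefix, []):
        candidate = rest[: len(low)]
        # Equal-length digit strings compare like numbers
        if low <= candidate <= high:
            return candidate, rest[len(low):]
    raise ValueError(f"no registrant range defined for {prefix}-{rest}")


print(split_registrant("978-2", "36590117"))  # ('36590', '117')
```

A value that falls outside every known range raises an error, which mirrors the behavior described above for ranges that are undefined in the stored data.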

Plotting binned images

Our package also provides plotting functionality to visualize datasets as binned images. While you can plot your binned datasets yourself, for example, with Matplotlib’s imshow(), there are some ready-made plotters available. Plotters set up axes, rearrange bins in various ways (with the help of functions from allisbns.rearrange), are aware of coordinate conversion (handled by allisbns.plotting.CoordinateConverter), and draw images.

Available plotters

Currently, two plotters exist: RowBinnedPlotter (plots bins as rows of fixed width) and BlockBinnedPlotter (plots bins as vertical blocks of fixed size stacked horizontally).
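
As a rough mental model of the row layout, an ISBN can be mapped to a row and a position within that row by integer division. This sketch assumes rows are filled left to right and top to bottom; the actual conversion is handled by CoordinateConverter and may differ in details. The function name is hypothetical.

```python
def row_position(isbn, offset, width):
    """Map an ISBN-12 integer to (row, column) in a row layout."""
    relative = isbn - offset  # position in relative ISBNs
    return divmod(relative, width)


print(row_position(978_002_500_001, 978_000_000_000, 2_500_000))
# (1, 1)
```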

Simple example

Let’s draw a binned image with RowBinnedPlotter.

import matplotlib.pyplot as plt
from allisbns.plotting import RowBinnedPlotter

fig, ax = plt.subplots(figsize=(12, 12), dpi=100)

# Bin your dataset (any CodeDataset, e.g. md5 from the reading examples)
binned = dataset.bin(2500)

# Set up a plotter
plotter = RowBinnedPlotter(
    ax, width=int(2.5e6), bin_size=binned.bin_size
)

# Draw a binned image
plotter.plot_bins(binned)

plt.tight_layout()
plt.show()

Setting up the plotter fixes the image width to the row width (in relative ISBNs). The image height (number of rows), in contrast, is determined automatically during drawing and depends on the number of bins and the selected aspect (provided via imshow_kwargs of plot_bins()). Similarly, BlockBinnedPlotter fixes the image height according to the input block width and capacity values.

The plotter insists on working with a fixed bin size, which seems reasonable since plotting differently binned datasets (with different colormaps) breaks the comparison.
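
The relationship between bin size, row width, and the number of rows can be sketched with plain lists. Both helpers are hypothetical, and counting filled ISBNs per bin is an assumption about what a bin holds; the package's own binning and the functions in allisbns.rearrange may work differently.

```python
def bin_mask(mask, bin_size):
    """Count filled ISBNs in consecutive bins of bin_size."""
    return [
        sum(mask[start:start + bin_size])
        for start in range(0, len(mask), bin_size)
    ]


def to_rows(bins, bins_per_row):
    """Arrange bins into rows, padding the last row with zeros."""
    rows = []
    for start in range(0, len(bins), bins_per_row):
        row = bins[start:start + bins_per_row]
        rows.append(row + [0] * (bins_per_row - len(row)))
    return rows


# With bin_size=2500 and width=2.5e6, each row holds 1000 bins
print(int(2.5e6) // 2500)  # 1000
print(to_rows(bin_mask([True, True, False, True, False], 2), 2))
# [[2, 1], [0, 0]]
```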

Image extent

Once you have plotted bins with plot_bins() or an image with plot_image(), the extent is set and cannot be changed afterwards. This is deliberate behavior in the current versions.

The way to plot different images with different extents with one plotter is the following: (1) determine one overall size for all images, (2) create arrays with the corresponding common shape, (3) assign the desired region with your data, and (4) plot new images.
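
Steps (2) and (3) amount to pasting each image into a blank canvas of the common shape. The sketch below uses nested lists to stay dependency-free; with NumPy the same idea is a zero-filled array plus a slice assignment. The helper name place_on_canvas is hypothetical.

```python
def place_on_canvas(shape, image, top, left, fill=0):
    """Copy a 2D image into a blank canvas of the common shape."""
    rows, cols = shape
    canvas = [[fill] * cols for _ in range(rows)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            canvas[top + r][left + c] = value
    return canvas


print(place_on_canvas((3, 4), [[1, 2], [3, 4]], top=1, left=1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0]]
```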

Define extent without plotting

What if you want to draw a scatter plot of ISBNs while keeping the plotter’s axis decorations and the rest of its setup? The define_extent() method defines the extent from the ISBN bounds without plotting bins or an image:

>>> plotter = RowBinnedPlotter(
...     ax,
...     # Imitate the resolution of one ISBN
...     bin_size=1,
...     # Width is in relative ISBNs
...     width=int(2.5e6),
...     # Let's start our extent with this offset
...     offset=978_300_000_000,
...     # Adjust the image aspect to the new resolution, or set
...     # to 'auto' to follow the defined figure size
...     aspect=int(2.5e3),
... )

>>> # This will set all the required axis limits for the extent corresponding
>>> # to the ISBN range from 978300000000X (offset) to 978399999999X
>>> plotter.define_extent(end_isbn=978_399_999_999)
>>> plotter.extent
(0, 2500000, 160, 0)

>>> # Can be useful for further general plotting
>>> ax.scatter(...)