allisbns.dataset

Classes and functions to work with datasets of packed ISBN codes.

type allisbns.dataset.PackedCodes = ndarray[tuple[Any, ...], dtype[int32]]

Packed ISBN codes that represent ISBN availability.

type allisbns.dataset.UnpackedCodes = ndarray[tuple[Any, ...], dtype[bool]]

Unpacked representation of codes.
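
The relationship between the two representations can be sketched with plain NumPy. This is only an illustration, not the library's implementation: `unpack_sketch` is a hypothetical helper, and it assumes that packed codes are alternating run lengths that start with a streak (filled ISBNs) followed by a gap (absent ISBNs), as the `CodeDataset` examples in this reference suggest.

```python
import numpy as np


def unpack_sketch(codes):
    """Expand run lengths into one boolean per ISBN (hypothetical helper)."""
    codes = np.asarray(codes, dtype=np.int64)
    # ASSUMPTION: even positions are streaks (True), odd positions are gaps (False).
    is_streak = np.arange(codes.size) % 2 == 0
    return np.repeat(is_streak, codes)


print(unpack_sketch([1, 2, 3]).astype(int))  # [1 0 0 1 1 1]
```

Three packed values describe six ISBNs here, which is why the packed form is the compact one for long runs.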

allisbns.dataset.load_bencoded(source: IO[bytes]) → dict[bytes, bytes]

Opens a compressed source and reads the bencoded data from it.
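
For readers unfamiliar with bencoding, the decoding step might look roughly like the sketch below. It is a minimal standalone decoder for the standard bencode grammar (integers, byte strings, lists, dictionaries); `bdecode` and `_decode_at` are hypothetical names, the library's own decoder may differ, and decompression of the source is deliberately left out.

```python
def bdecode(data: bytes):
    """Decode a single bencoded value (hypothetical helper)."""
    value, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after the bencoded value")
    return value


def _decode_at(data: bytes, i: int):
    """Return (value, index just past the value) for the value starting at i."""
    lead = data[i:i + 1]
    if lead == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if lead == b"l":  # list: l<items>e
        items, i = [], i + 1
        while data[i:i + 1] != b"e":
            item, i = _decode_at(data, i)
            items.append(item)
        return items, i + 1
    if lead == b"d":  # dictionary: d<key><value>...e
        result, i = {}, i + 1
        while data[i:i + 1] != b"e":
            key, i = _decode_at(data, i)
            value, i = _decode_at(data, i)
            result[key] = value
        return result, i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length


print(bdecode(b"d3:md54:\x01\x00\x00\x00e"))  # {b'md5': b'\x01\x00\x00\x00'}
```

The decoded dictionary has the same `dict[bytes, bytes]` shape as the documented return type: collection names as keys, packed binary payloads as values.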

allisbns.dataset.unpack_data(data: bytes) → PackedCodes

Unpacks data that come from a bencoded source.
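
As a rough sketch, turning the raw bytes of one collection into an `int32` array could be done with `numpy.frombuffer`. The byte layout is an assumption here: this reference only states the resulting dtype, so the little-endian 32-bit encoding and the name `unpack_data_sketch` are illustrative, not confirmed.

```python
import struct

import numpy as np


def unpack_data_sketch(data: bytes):
    """Interpret raw bytes as packed codes (hypothetical helper)."""
    # ASSUMPTION: values are stored as little-endian 32-bit integers.
    return np.frombuffer(data, dtype="<i4").astype(np.int32)


raw = struct.pack("<3i", 3, 2, 1)  # stand-in for one collection's payload
print(unpack_data_sketch(raw))  # [3 2 1]
```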

class allisbns.dataset.BinnedArray(bins: ArrayLike, bin_size: int)

Bases: object

Represents the binned data.

bins: ArrayLike

Bins.

bin_size: int

The number of ISBNs in each bin.

__array__(dtype: DTypeLike | None = None, copy: bool | None = None) → ndarray[tuple[Any, ...], dtype[_ScalarT]]

Returns the underlying bins for NumPy when requested.
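
The `__array__` hook is what lets NumPy functions accept such an object directly. A minimal sketch of the mechanism, using a hypothetical stand-in class `BinnedSketch` rather than the real `BinnedArray`:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BinnedSketch:
    """Hypothetical stand-in that mimics the documented fields."""

    bins: object
    bin_size: int

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when it needs an ndarray view of the object.
        # The `copy` argument is accepted for NumPy 2 compatibility but is
        # not otherwise honored in this sketch.
        return np.asarray(self.bins, dtype=dtype)


binned = BinnedSketch(bins=[2, 1, 1], bin_size=2)
print(np.asarray(binned).sum())  # 4
```

With the hook in place, calls such as `np.asarray(binned)` work without unwrapping `bins` by hand.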

class allisbns.dataset.QueryResult(is_streak: bool, segment_index: int, position_in_segment: int)

Bases: NamedTuple

Represents the result of a query for an ISBN.

It shows whether the ISBN is filled (falls in a streak segment) or absent (in a gap segment) and the corresponding position in a segment.

is_streak: bool

Whether the ISBN falls in a streak (True) or in a gap (False).

segment_index: int

The index of the corresponding segment.

position_in_segment: int

The position in the corresponding segment.

class allisbns.dataset.CodeDataset(codes: ndarray[tuple[Any, ...], dtype[int32]], offset: ISBN12 = 978000000000, fill_to_isbn: dataclasses.InitVar[ISBN12 | None] = None)

Bases: object

Represents a dataset of packed ISBN codes.

Examples

Create a dataset from the input codes:

>>> dataset = CodeDataset(codes=[1, 2, 3])
>>> dataset
CodeDataset(array([1, 2, 3], dtype=int32), bounds=(978000000000,
978000000005))
>>> dataset.codes
array([1, 2, 3], dtype=int32)

With a custom offset, filling up to a given ISBN:

>>> CodeDataset(
...     codes=[1, 2, 3],
...     offset=979_000_000_000,
...     fill_to_isbn=979_999_999_999
... )
CodeDataset(array([ 1, 2, 3, 999999994]), bounds=(979000000000,
979999999999))

codes: ndarray[tuple[Any, ...], dtype[int32]]

Packed ISBN codes.

offset: ISBN12 = 978000000000

First ISBN in the dataset.

fill_to_isbn: dataclasses.InitVar[ISBN12 | None] = None

ISBN up to which to fill dataset codes.
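
The filling step can be sketched as appending (or extending) a trailing gap so that the codes reach the requested ISBN. This is an assumption-based illustration: `fill_sketch` is a hypothetical helper, and whether the library extends an existing trailing gap or appends a new code is not stated in this reference.

```python
import numpy as np


def fill_sketch(codes, offset, fill_to_isbn):
    """Pad packed codes with a gap up to fill_to_isbn (hypothetical helper)."""
    codes = np.array(codes, dtype=np.int64)  # np.array always copies
    missing = fill_to_isbn - (offset + int(codes.sum())) + 1
    if missing <= 0:
        return codes.astype(np.int32)  # already reaches the target ISBN
    if codes.size == 0:
        codes = np.array([0, missing])  # empty streak, then the gap
    elif codes.size % 2 == 1:
        codes = np.append(codes, missing)  # ends with a streak: add a gap
    else:
        codes[-1] += missing  # ends with a gap: extend it (ASSUMPTION)
    return codes.astype(np.int32)


print(fill_sketch([1, 2, 3], 979_000_000_000, 979_999_999_999))
```

For the inputs of the example above, the appended gap is 999999994, matching the codes shown in the `CodeDataset` output.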

bounds: ISBNBounds

First and last ISBNs in the dataset.

total_isbns: int

Total number of ISBNs encoded in the dataset.
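
Both values follow from the packed codes and the offset. A small sketch of the arithmetic, assuming the bounds are inclusive as the examples above suggest (`bounds_sketch` is a hypothetical helper):

```python
import numpy as np


def bounds_sketch(codes, offset):
    """Return ((first, last), total) for packed codes (hypothetical helper)."""
    # Sum in 64 bits: ISBN-sized values do not fit into int32.
    total = int(np.sum(codes, dtype=np.int64))
    return (offset, offset + total - 1), total


print(bounds_sketch([1, 2, 3], 978_000_000_000))
# ((978000000000, 978000000005), 6)
```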

classmethod from_file(source: str | Path | IO[bytes], collection: str, offset: ISBN12 = FIRST_ISBN, fill_to_isbn: ISBN12 | None = None) → Self

Creates a dataset from a source file or byte stream.

Parameters:
  • source – A path to a bencoded compressed file or a byte stream.

  • collection – A collection name to be read (e.g., ‘md5’, ‘rgb’, etc.). Refers to aarecord_id_prefix in the original data format.

  • offset – The first ISBN.

  • fill_to_isbn – An ISBN up to which to fill dataset codes.

Returns:

A dataset created from a source.

classmethod from_unpacked(unpacked_codes: UnpackedCodes, offset: ISBN12 = FIRST_ISBN) → Self

Creates a dataset from the unpacked codes.

Parameters:
  • unpacked_codes – An array of unpacked codes (boolean values).

  • offset – The first ISBN.

Returns:

A dataset created from the unpacked codes.
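
Packing is essentially run-length encoding of the boolean array. A minimal sketch, assuming packed codes always start with a streak so that a leading gap needs an explicit zero-length streak first (`pack_sketch` is a hypothetical helper, not the library's code):

```python
import numpy as np


def pack_sketch(unpacked):
    """Run-length encode booleans into packed codes (hypothetical helper)."""
    unpacked = np.asarray(unpacked, dtype=bool)
    if unpacked.size == 0:
        return np.array([], dtype=np.int32)
    # Indices where the value changes mark the start of a new segment.
    changes = np.flatnonzero(unpacked[1:] != unpacked[:-1]) + 1
    edges = np.concatenate(([0], changes, [unpacked.size]))
    runs = np.diff(edges)
    if not unpacked[0]:
        # ASSUMPTION: codes start with a streak, so prepend an empty one.
        runs = np.concatenate(([0], runs))
    return runs.astype(np.int32)


print(pack_sketch([True, True, True, False, False, True]))  # [3 2 1]
print(pack_sketch([False, False, True]))  # [0 2 1]
```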

reframe(start_isbn: ISBN12 | None, end_isbn: ISBN12 | None) → Self

Reframes the dataset to new bounds.

Reframing can crop or expand the existing bounds.

Parameters:
  • start_isbn – An ISBN to crop the dataset from. When None, the current start bound is used.

  • end_isbn – An ISBN to crop the dataset until. When None, the current end bound is used.

Returns:

A new reframed dataset.

Examples

Crop at both sides:

>>> dataset = CodeDataset([3, 2, 1], offset=978_000_000_000)
>>> dataset
CodeDataset(array([3, 2, 1], dtype=int32), bounds=(978000000000,
978000000005))
>>> dataset.reframe(978_000_000_001, 978_000_000_004)
CodeDataset(array([2, 2], dtype=int32), bounds=(978000000001,
978000000004))

Reframe with the default start bound:

>>> dataset.reframe(None, 978_000_000_100)
CodeDataset(array([ 3, 2, 1, 95], dtype=int32),
bounds=(978000000000, 978000000100))

Reframe to both start and end outside bounds:

>>> dataset.reframe(979_000_000_000, 979_999_999_999)
CodeDataset(array([ 0, 1000000000], dtype=int32),
bounds=(979000000000, 979999999999))

See also

__getitem__(): Reframe a dataset using slicing.
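
The semantics of reframing can be illustrated with a deliberately naive sketch that unpacks the codes, copies the overlap into a new frame, and repacks. It is not how the library does it (materializing one boolean per ISBN is wasteful for large frames); `reframe_sketch` is a hypothetical helper that assumes inclusive bounds, `end_isbn >= start_isbn`, and codes that start with a streak.

```python
import numpy as np


def reframe_sketch(codes, offset, start_isbn=None, end_isbn=None):
    """Naive reframe via unpack, slice/pad, repack (hypothetical helper)."""
    codes = np.asarray(codes, dtype=np.int64)
    old_end = offset + int(codes.sum()) - 1
    start = offset if start_isbn is None else start_isbn
    end = old_end if end_isbn is None else end_isbn

    # Unpack: even positions are streaks, odd positions are gaps.
    filled = np.repeat(np.arange(codes.size) % 2 == 0, codes)

    # The new frame is a gap by default; copy whatever overlaps the old one.
    frame = np.zeros(end - start + 1, dtype=bool)
    lo, hi = max(start, offset), min(end, old_end)
    if lo <= hi:
        frame[lo - start:hi - start + 1] = filled[lo - offset:hi - offset + 1]

    # Repack into run lengths, keeping the "streak first" convention.
    changes = np.flatnonzero(frame[1:] != frame[:-1]) + 1
    runs = np.diff(np.concatenate(([0], changes, [frame.size])))
    if not frame[0]:
        runs = np.concatenate(([0], runs))
    return runs.astype(np.int32), (start, end)


new_codes, bounds = reframe_sketch(
    [3, 2, 1], 978_000_000_000, 978_000_000_001, 978_000_000_004
)
print(new_codes, bounds)  # [2 2] (978000000001, 978000000004)
```

The sketch reproduces the cropped and expanded codes of the first two examples above; the third example spans a billion ISBNs, where a real implementation would operate on the run lengths directly.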

unpack_codes() → UnpackedCodes

Unpacks codes into boolean values.

Returns:

An array of unpacked codes.

invert() → Self

Inverts the dataset by turning streak segments into gap segments and vice versa.
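
On run-length codes, inversion needs no unpacking: shifting every segment by one position swaps the roles of streaks and gaps. A sketch under the assumption that codes start with a streak, which may have zero length (`invert_sketch` is a hypothetical helper):

```python
import numpy as np


def invert_sketch(codes):
    """Swap streaks and gaps in packed codes (hypothetical helper)."""
    codes = np.asarray(codes, dtype=np.int32)
    if codes.size and codes[0] == 0:
        # Dropping an empty leading streak turns the first gap into a streak.
        return codes[1:].copy()
    # Prepending an empty streak turns the first streak into a gap.
    return np.concatenate(([0], codes)).astype(np.int32)


print(invert_sketch([3, 2, 1]))  # [0 3 2 1]
print(invert_sketch([0, 3, 2, 1]))  # [3 2 1]
```

Applying the sketch twice returns the original codes.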

query_isbn(isbn: ISBN12) → QueryResult

Queries whether the ISBN is filled in the dataset and where it falls within its segment.

Parameters:

isbn – An ISBN to query for.

Returns:

A query result.

Raises:

ValueError – If the ISBN is outside of the dataset bounds.
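
One way such a lookup could work is a binary search over the cumulative segment ends. The sketch below is an illustration only: `QuerySketch` and `query_sketch` are hypothetical names, and the zero-based segment index and position are assumptions about the conventions used.

```python
from typing import NamedTuple

import numpy as np


class QuerySketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented QueryResult fields."""

    is_streak: bool
    segment_index: int
    position_in_segment: int


def query_sketch(codes, offset, isbn):
    """Locate an ISBN within packed codes (hypothetical helper)."""
    ends = np.cumsum(codes, dtype=np.int64)  # exclusive end of each segment
    rel = isbn - offset
    if rel < 0 or ends.size == 0 or rel >= ends[-1]:
        raise ValueError(f"ISBN {isbn} is outside of the dataset bounds")
    # side="right" skips zero-length segments that end exactly at `rel`.
    index = int(np.searchsorted(ends, rel, side="right"))
    start = int(ends[index - 1]) if index else 0
    # ASSUMPTION: even segment indices are streaks, positions are zero-based.
    return QuerySketch(index % 2 == 0, index, rel - start)


print(query_sketch([3, 2, 1], 978_000_000_000, 978_000_000_003))
# QuerySketch(is_streak=False, segment_index=1, position_in_segment=0)
```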

check_isbns(isbns: list[ISBN12]) → ndarray[tuple[Any, ...], dtype[bool]]

Checks if ISBNs are filled in the dataset or not.

Parameters:

isbns – A list with ISBNs to check.

Returns:

An array of boolean values.

get_filled_isbns() → ndarray[tuple[Any, ...], dtype[_ScalarT]]

Gets filled ISBNs in the dataset.

Returns:

An array of filled ISBNs.

Examples

To get unfilled ISBNs, invert the dataset first:

dataset.invert().get_filled_isbns()

count_filled_isbns() → int

Counts the number of filled ISBNs in the dataset.
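
Both operations can be expressed directly on the run lengths. In the sketch below, which assumes that streaks sit at even positions, the filled ISBNs are the concatenated ranges of every streak and the count is the sum of the streak lengths. The helper names are hypothetical.

```python
import numpy as np


def filled_isbns_sketch(codes, offset):
    """Enumerate ISBNs covered by streak segments (hypothetical helper)."""
    codes = np.asarray(codes, dtype=np.int64)
    starts = offset + np.cumsum(codes) - codes  # first ISBN of each segment
    streaks = [np.arange(s, s + n) for s, n in zip(starts[::2], codes[::2])]
    return np.concatenate(streaks) if streaks else np.array([], dtype=np.int64)


def count_filled_sketch(codes):
    """Sum the streak lengths (hypothetical helper)."""
    return int(np.asarray(codes, dtype=np.int64)[::2].sum())


print(filled_isbns_sketch([3, 2, 1], 978_000_000_000))
# [978000000000 978000000001 978000000002 978000000005]
print(count_filled_sketch([3, 2, 1]))  # 4
```

Counting never needs the enumeration, so it stays cheap even when the dataset covers billions of ISBNs.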

bin(bin_size: int = 2500, num_chunks: int = 4) → BinnedArray

Bins the codes into fixed-size bins.

Each bin value is the number of filled ISBNs in that bin.

Parameters:
  • bin_size – The number of ISBNs in one bin.

  • num_chunks – The number of chunks used to process the dataset.

Returns:

A binned array.
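
A naive version of the binning, for intuition only: unpack the codes, pad the tail so the length is a multiple of the bin size, and sum each row. `bin_sketch` is a hypothetical helper; it processes everything in one pass (no chunking), and treating a trailing partial bin as padded with gaps is an assumption.

```python
import numpy as np


def bin_sketch(codes, bin_size=2500):
    """Count filled ISBNs per fixed-size bin (hypothetical helper)."""
    codes = np.asarray(codes, dtype=np.int64)
    # Unpack: even positions are streaks, odd positions are gaps.
    filled = np.repeat(np.arange(codes.size) % 2 == 0, codes)
    # ASSUMPTION: a partial last bin is padded with unfilled ISBNs.
    padding = -filled.size % bin_size
    padded = np.pad(filled, (0, padding))
    return padded.reshape(-1, bin_size).sum(axis=1)


print(bin_sketch([3, 2, 1], bin_size=2))  # [2 1 1]
print(bin_sketch([3, 2, 1], bin_size=4))  # [3 1]
```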

is_subset(other: Self) → bool

Determines whether this dataset is a subset of another.
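
In terms of unpacked codes, the subset relation means that no ISBN is filled in one dataset while being absent from the other. A minimal sketch, assuming both datasets share the same bounds (how the library treats differing bounds is not stated here); `is_subset_sketch` is a hypothetical helper.

```python
import numpy as np


def _unpack(codes):
    codes = np.asarray(codes, dtype=np.int64)
    # Even positions are streaks (True), odd positions are gaps (False).
    return np.repeat(np.arange(codes.size) % 2 == 0, codes)


def is_subset_sketch(codes_a, codes_b):
    """Check that every ISBN filled in A is filled in B (hypothetical helper)."""
    a, b = _unpack(codes_a), _unpack(codes_b)
    if a.size != b.size:
        raise ValueError("this sketch assumes identical bounds")
    return not np.any(a & ~b)


print(is_subset_sketch([1, 5], [3, 2, 1]))  # True
print(is_subset_sketch([3, 2, 1], [1, 5]))  # False
```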

write_bencoded(file: str | Path | IO[bytes], prefix: str, normalize: bool = True) → None

Writes ISBN codes to a bencoded compressed file.

Parameters:
  • file – A file path or file-like object to which the codes will be written.

  • prefix – A dataset collection name. Refers to ‘aarecord_id_prefix’ in the original data format description.

  • normalize – Whether to normalize codes so that the ending gap code, if present, is omitted (default) or not.
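
The normalization described for `normalize` can be sketched as dropping the last code when it is a gap. This assumes codes alternate starting with a streak, so an even number of codes means the sequence ends with a gap; `normalize_sketch` is a hypothetical helper.

```python
import numpy as np


def normalize_sketch(codes):
    """Omit a trailing gap code, if present (hypothetical helper)."""
    codes = np.asarray(codes, dtype=np.int32)
    # ASSUMPTION: an even number of codes means the last one is a gap.
    if codes.size and codes.size % 2 == 0:
        return codes[:-1]
    return codes


print(normalize_sketch([3, 2, 1, 95]))  # [3 2 1]
print(normalize_sketch([3, 2, 1]))  # [3 2 1]
```

A trailing gap carries no filled ISBNs, so omitting it loses nothing except the end bound, which `fill_to_isbn` can restore when reading the data back.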

__getitem__(key: slice[ISBN12 | None, ISBN12 | None, None]) → Self

Reframes a dataset to new bounds using slicing.

Parameters:

key – A slice object with optional start and stop ISBNs. See reframe() for more info.

Examples

Reframe a dataset using start and stop ISBNs:

>>> dataset = CodeDataset([3, 2, 1], offset=978_000_000_000)
>>> dataset[978_000_000_001:978_000_000_004]
CodeDataset(array([2, 2], dtype=int32), bounds=(978000000001,
978000000004))

allisbns.dataset.iterate_datasets(data: dict[bytes, bytes], collections: list[str] | None = None, fill_to_isbn: ISBN12 | None = None) → Generator[CodeDataset]

Iterates over datasets created from the loaded bencoded data.

By default, iterates over all collections in the data.

Parameters:
  • data – Loaded bencoded data that contain collections to unpack.

  • collections – Collection names to unpack (e.g. ‘md5’, ‘rgb’, …). When None (default), iterates over all collections in the data.

  • fill_to_isbn – An ISBN up to which to fill dataset codes.

Returns:

Yields datasets for the selected collections.

Example

Iterate over all datasets and count filled ISBNs:

from allisbns.dataset import load_bencoded, iterate_datasets

with open("aa_isbn13_codes_20251118T170842Z.benc.zst", "rb") as f:
    input_data = load_bencoded(f)

filled_counts: dict[str, int] = {}

# The keys of the loaded data are collection names stored as bytes.
collections = [x.decode() for x in input_data.keys()]
dataset_iterator = iterate_datasets(input_data, collections)

# Pairing by position relies on the datasets being yielded in the same
# order as the requested collection names.
for collection, dataset in zip(collections, dataset_iterator):
    filled_counts[collection] = dataset.count_filled_isbns()