`allisbns.dataset`¶

Classes and functions to work with datasets of packed ISBN codes.

type allisbns.dataset.PackedCodes = ndarray[tuple[Any, ...], dtype[int32]]¶: Packed ISBN codes that represent ISBN availability.

type allisbns.dataset.UnpackedCodes = ndarray[tuple[Any, ...], dtype[bool]]¶: Unpacked representation of codes.
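As a rough illustration of how the two representations relate, the sketch below expands packed codes into booleans in plain Python. It assumes, based on the `CodeDataset` examples in this module, that packed codes are run lengths alternating between streaks (filled) and gaps (absent), starting with a streak. `unpack_run_lengths` is a hypothetical helper, not part of `allisbns`.

```python
def unpack_run_lengths(codes: list[int]) -> list[bool]:
    """Expand alternating streak/gap run lengths into one boolean per ISBN."""
    unpacked: list[bool] = []
    for index, length in enumerate(codes):
        # Assumption: even positions are streaks (filled), odd positions are gaps.
        is_streak = index % 2 == 0
        unpacked.extend([is_streak] * length)
    return unpacked


bits = unpack_run_lengths([1, 2, 3])
print(bits)  # [True, False, False, True, True, True]
```

Six booleans come out of the codes `[1, 2, 3]`, which agrees with the six-ISBN bounds shown in the `CodeDataset` example.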

allisbns.dataset.load_bencoded(source: IO[bytes]) → dict[bytes, bytes]¶: Opens a compressed source and reads the bencoded data it contains.

allisbns.dataset.unpack_data(data: bytes) → PackedCodes¶: Unpacks data that come from a bencoded source.
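The byte layout of a collection value is not described here. A minimal sketch, assuming each value is a flat sequence of little-endian unsigned 32-bit integers (an assumption about the upstream format, not something this page confirms), might decode it with the standard library as follows. `decode_run_lengths` is a hypothetical name.

```python
import struct


def decode_run_lengths(data: bytes) -> list[int]:
    """Decode bytes into integers, assuming little-endian uint32 values."""
    if len(data) % 4 != 0:
        raise ValueError("expected a multiple of 4 bytes")
    count = len(data) // 4
    # "<" selects little-endian byte order; "I" is an unsigned 32-bit integer.
    return list(struct.unpack(f"<{count}I", data))


sample = struct.pack("<3I", 1, 2, 3)
decoded = decode_run_lengths(sample)
print(decoded)  # [1, 2, 3]
```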

class allisbns.dataset.BinnedArray(bins: ArrayLike, bin_size: int)¶

Bases: object

Represents the binned data.

bins: ArrayLike¶: Bins.

bin_size: int¶: The number of ISBNs in each bin.

__array__(dtype: DTypeLike | None = None, copy: bool | None = None) → ndarray[tuple[Any, ...], dtype[_ScalarT]]¶: Returns the underlying bins for NumPy when requested.

class allisbns.dataset.QueryResult(is_streak: bool, segment_index: int, position_in_segment: int)¶

Bases: NamedTuple

Represents the result of a query for an ISBN.

It shows whether the ISBN is filled (falls in a streak segment) or absent (in a gap segment) and the corresponding position in a segment.

is_streak: bool¶: Whether the ISBN falls in a streak segment (True) or a gap segment (False).

segment_index: int¶: The index of the corresponding segment.

position_in_segment: int¶: The position in the corresponding segment.
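To make the three fields concrete, here is a sketch of how such a lookup could be computed from run-length codes with a cumulative sum and a binary search. It is not the library's implementation. It assumes codes alternate streak and gap lengths starting with a streak, and that `position_in_segment` is zero-based. `Located` and `locate` are hypothetical names.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import NamedTuple


class Located(NamedTuple):
    is_streak: bool
    segment_index: int
    position_in_segment: int


def locate(codes: list[int], offset: int, isbn: int) -> Located:
    """Find the segment that contains `isbn` and the position inside it."""
    ends = list(accumulate(codes))  # exclusive end of each segment
    total = ends[-1] if ends else 0
    relative = isbn - offset
    if not 0 <= relative < total:
        raise ValueError("ISBN is outside of the dataset bounds")
    # The first segment whose exclusive end is greater than `relative`.
    segment_index = bisect_right(ends, relative)
    segment_start = ends[segment_index - 1] if segment_index else 0
    # Assumption: even segments are streaks, odd segments are gaps.
    return Located(segment_index % 2 == 0, segment_index, relative - segment_start)


result = locate([3, 2, 1], 978_000_000_000, 978_000_000_004)
print(result)  # Located(is_streak=False, segment_index=1, position_in_segment=1)
```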

class allisbns.dataset.CodeDataset(codes: ndarray[tuple[Any, ...], dtype[int32]], offset: ISBN12 = 978000000000, fill_to_isbn: dataclasses.InitVar[ISBN12 | None] = None)¶

Bases: object

Represents a dataset of packed ISBN codes.

Examples

Create a dataset from the input codes:

>>> dataset = CodeDataset(codes=[1, 2, 3])
>>> dataset
CodeDataset(array([1, 2, 3], dtype=int32), bounds=(978000000000,
978000000005))
>>> dataset.codes
array([1, 2, 3], dtype=int32)

With a custom offset, filling up to a given ISBN:

>>> CodeDataset(
...     codes=[1, 2, 3],
...     offset=979_000_000_000,
...     fill_to_isbn=979_999_999_999
... )
CodeDataset(array([ 1, 2, 3, 999999994], dtype=int32), bounds=(979000000000,
979999999999))

codes: ndarray[tuple[Any, ...], dtype[int32]]¶: Packed ISBN codes.

offset: ISBN12 = 978000000000¶: First ISBN in the dataset.

fill_to_isbn: dataclasses.InitVar[ISBN12 | None] = None¶: ISBN up to which to fill dataset codes.

bounds: ISBNBounds¶: First and last ISBNs in the dataset.

total_isbns: int¶: Total number of ISBNs encoded in the dataset.
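The numbers in the examples above can be reproduced with simple arithmetic. The sketch below assumes that the codes are alternating streak/gap run lengths, that bounds are inclusive, and that filling appends (or extends) a trailing gap. `bounds_of` and `fill_codes` are hypothetical helpers used only for illustration.

```python
def bounds_of(codes: list[int], offset: int) -> tuple[int, int]:
    """Inclusive first and last ISBN covered by the run lengths."""
    return offset, offset + sum(codes) - 1


def fill_codes(codes: list[int], offset: int, fill_to_isbn: int) -> list[int]:
    """Pad the codes with a trailing gap so they reach `fill_to_isbn`."""
    filled = list(codes)
    missing = fill_to_isbn - (offset + sum(filled) - 1)
    if missing <= 0:
        return filled  # already reaches the requested ISBN
    if len(filled) % 2 == 1:
        filled.append(missing)  # last code is a streak: add a new gap
    elif filled:
        filled[-1] += missing  # last code is a gap: extend it
    else:
        filled = [0, missing]  # empty dataset: zero-length streak, then a gap
    return filled


print(bounds_of([1, 2, 3], 978_000_000_000))
# (978000000000, 978000000005)
print(fill_codes([1, 2, 3], 979_000_000_000, 979_999_999_999))
# [1, 2, 3, 999999994]
```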

classmethod from_file(source: str | Path | IO[bytes], collection: str, offset: ISBN12 = FIRST_ISBN, fill_to_isbn: ISBN12 | None = None) → Self¶

Creates a dataset from a source file or byte stream.

Parameters:

source – A path to a bencoded compressed file or a byte stream.
collection – A collection name to be read (e.g., ‘md5’, ‘rgb’, etc.). Refers to aarecord_id_prefix in the original data format.
offset – The first ISBN.
fill_to_isbn – An ISBN up to which to fill dataset codes.

Returns:

A dataset created from a source.

classmethod from_unpacked(unpacked_codes: UnpackedCodes, offset: ISBN12 = FIRST_ISBN) → Self¶

Creates a dataset from the unpacked codes.

Parameters:

unpacked_codes – An array of unpacked codes (boolean values).
offset – The first ISBN.

Returns:

A dataset created from the unpacked codes.

reframe(start_isbn: ISBN12 | None, end_isbn: ISBN12 | None) → Self¶

Reframes the dataset to new bounds.

Reframing can either crop or expand the existing bounds.

Parameters:

start_isbn – The first ISBN of the new frame. When None, the dataset's current start bound is used.
end_isbn – The last ISBN of the new frame (inclusive). When None, the dataset's current end bound is used.

Returns:

A new reframed dataset.

Examples

Crop at both sides:

>>> dataset = CodeDataset([3, 2, 1], offset=978_000_000_000)
>>> dataset
CodeDataset(array([3, 2, 1], dtype=int32), bounds=(978000000000,
978000000005))
>>> dataset.reframe(978_000_000_001, 978_000_000_004)
CodeDataset(array([2, 2], dtype=int32), bounds=(978000000001,
978000000004))

Reframe with the default start bound:

>>> dataset.reframe(None, 978_000_000_100)
CodeDataset(array([ 3, 2, 1, 95], dtype=int32),
bounds=(978000000000, 978000000100))

Reframe to both start and end outside bounds:

>>> dataset.reframe(979_000_000_000, 979_999_999_999)
CodeDataset(array([ 0, 1000000000], dtype=int32),
bounds=(979000000000, 979999999999))

See also

__getitem__(): Reframe a dataset using slicing.
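The first two examples above can be checked with a small pure-Python model of reframing. This is a sketch under the assumption that codes alternate streak and gap lengths starting with a streak, and that ISBNs outside the old bounds count as gaps. It expands every ISBN to a boolean, so it is only practical for small ranges and says nothing about how the library does it. All function names are hypothetical.

```python
def unpack(codes: list[int]) -> list[bool]:
    """Expand run lengths into one boolean per ISBN."""
    bits: list[bool] = []
    for index, length in enumerate(codes):
        bits.extend([index % 2 == 0] * length)
    return bits


def pack(bits: list[bool]) -> list[int]:
    """Collapse booleans back into run lengths; the first code is a streak."""
    codes: list[int] = []
    current, run = True, 0
    for bit in bits:
        if bit == current:
            run += 1
        else:
            codes.append(run)  # may be 0 when the data starts with a gap
            current, run = bit, 1
    codes.append(run)
    return codes


def reframe_sketch(codes, offset, start_isbn=None, end_isbn=None):
    bits = unpack(codes)
    start = offset if start_isbn is None else start_isbn
    end = offset + len(bits) - 1 if end_isbn is None else end_isbn
    framed = [
        # ISBNs outside the original bounds are treated as absent.
        bits[isbn - offset] if 0 <= isbn - offset < len(bits) else False
        for isbn in range(start, end + 1)
    ]
    return pack(framed), (start, end)


cropped = reframe_sketch([3, 2, 1], 978_000_000_000, 978_000_000_001, 978_000_000_004)
print(cropped)  # ([2, 2], (978000000001, 978000000004))
expanded = reframe_sketch([3, 2, 1], 978_000_000_000, None, 978_000_000_100)
print(expanded)  # ([3, 2, 1, 95], (978000000000, 978000000100))
```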

unpack_codes() → UnpackedCodes¶

Unpacks codes into boolean values.

Returns: An array of unpacked codes.

invert() → Self¶: Inverts the dataset, turning streak segments into gaps and gap segments into streaks.
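On run-length codes, inversion does not need to touch the lengths themselves. A minimal sketch, assuming the first code is always a streak (possibly of length zero), shifts the alternation by one position. `invert_codes` is a hypothetical helper, not the library's method.

```python
def invert_codes(codes: list[int]) -> list[int]:
    """Swap streaks and gaps by shifting the alternation by one position."""
    if not codes:
        return []
    if codes[0] == 0:
        # A zero-length leading streak: dropping it turns the first gap
        # into the first streak.
        return list(codes[1:])
    # Otherwise prepend a zero-length streak so every old streak becomes a gap.
    return [0, *codes]


print(invert_codes([1, 2, 3]))  # [0, 1, 2, 3]
print(invert_codes([0, 5, 2]))  # [5, 2]
```

Applying the helper twice returns the original codes, which is a quick sanity check for the idea.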

query_isbn(isbn: ISBN12) → QueryResult¶

Queries if the ISBN is filled in the dataset and its position.

Parameters: isbn – An ISBN to query for.
Returns: A query result.
Raises: ValueError – If the ISBN is outside of the dataset bounds.

check_isbns(isbns: list[ISBN12]) → ndarray[tuple[Any, ...], dtype[bool]]¶

Checks if ISBNs are filled in the dataset or not.

Parameters: isbns – A list of ISBNs to check.
Returns: An array of boolean values.

get_filled_isbns() → ndarray[tuple[Any, ...], dtype[_ScalarT]]¶

Gets filled ISBNs in the dataset.

Returns: An array of filled ISBNs.

Examples

To get unfilled ISBNs, invert the dataset first:

dataset.invert().get_filled_isbns()
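For intuition, filled ISBNs can be enumerated directly from run-length codes by walking the segments and keeping only the streaks. The sketch assumes alternating streak/gap lengths starting with a streak. `filled_isbns` and `count_filled` are hypothetical helpers, not the library's methods.

```python
def filled_isbns(codes: list[int], offset: int) -> list[int]:
    """List every ISBN that falls inside a streak segment."""
    filled: list[int] = []
    position = offset
    for index, length in enumerate(codes):
        if index % 2 == 0:  # assumption: even positions are streaks
            filled.extend(range(position, position + length))
        position += length
    return filled


def count_filled(codes: list[int]) -> int:
    """Count filled ISBNs by summing the streak lengths only."""
    return sum(codes[::2])


isbns = filled_isbns([1, 2, 3], 978_000_000_000)
print(isbns)
# [978000000000, 978000000003, 978000000004, 978000000005]
print(count_filled([1, 2, 3]))  # 4
```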

count_filled_isbns() → int¶: Counts the number of filled ISBNs in the dataset.

bin(bin_size: int = 2500, num_chunks: int = 4) → BinnedArray¶

Bins the codes into fixed-size bins.

Each bin's value is the number of filled ISBNs that fall into it.

Parameters:

bin_size – The number of ISBNs in one bin.
num_chunks – The number of chunks used to process the dataset.

Returns:

A binned array.
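The bin values can be illustrated with a tiny pure-Python model. It assumes alternating streak/gap run lengths starting with a streak, lets the last bin be partial, and ignores chunked processing (`num_chunks`) entirely. `bin_counts` is a hypothetical helper that is only suitable for small inputs.

```python
def bin_counts(codes: list[int], bin_size: int) -> list[int]:
    """Count filled ISBNs in consecutive bins of `bin_size` ISBNs."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    bits: list[bool] = []
    for index, length in enumerate(codes):
        bits.extend([index % 2 == 0] * length)  # even positions are streaks
    # Summing booleans counts the True values in each slice.
    return [sum(bits[i:i + bin_size]) for i in range(0, len(bits), bin_size)]


counts = bin_counts([3, 2, 1], bin_size=2)
print(counts)  # [2, 1, 1]
```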

is_subset(other: Self) → bool¶: Determines if a dataset is a subset of another.

write_bencoded(file: str | Path | IO[bytes], prefix: str, normalize: bool = True) → None¶

Writes ISBN codes to a bencoded compressed file.

Parameters:

file – A file path or file-like object to which the codes will be written.
prefix – A dataset collection name. Refers to ‘aarecord_id_prefix’ in the original data format description.
normalize – Whether to normalize codes so that the ending gap code, if present, is omitted (default) or not.
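One reading of the `normalize` option, sketched below, is that a trailing gap carries no information about filled ISBNs and can be dropped before writing. The sketch assumes codes alternate streak and gap lengths starting with a streak, so an even number of codes means the last one is a gap. `normalize_codes` is a hypothetical helper and may differ from what the library does.

```python
def normalize_codes(codes: list[int]) -> list[int]:
    """Drop the trailing gap code, if the codes end with one."""
    if codes and len(codes) % 2 == 0:
        return list(codes[:-1])  # even count: the last code is a gap
    return list(codes)  # odd count: the last code is a streak, keep it


print(normalize_codes([1, 2, 3, 999999994]))  # [1, 2, 3]
print(normalize_codes([1, 2, 3]))  # [1, 2, 3]
```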

__getitem__(key: slice[ISBN12 | None, ISBN12 | None, None]) → Self¶

Reframes a dataset to new bounds using slicing.

Parameters: key – A slice object with optional start and stop ISBNs. See reframe() for more info.

Examples

Reframe a dataset using start and stop ISBNs:

>>> dataset = CodeDataset([3, 2, 1], offset=978_000_000_000)
>>> dataset[978_000_000_001:978_000_000_004]
CodeDataset(array([2, 2], dtype=int32), bounds=(978000000001,
978000000004))

allisbns.dataset.iterate_datasets(data: dict[bytes, bytes], collections: list[str] | None = None, fill_to_isbn: ISBN12 | None = None) → Generator[CodeDataset]¶

Iterates over datasets created from the loaded bencoded data.

By default, iterates over all collections in the data.

Parameters:

data – Loaded bencoded data that contain collections to unpack.
collections – Collection names to unpack (e.g. ‘md5’, ‘rgb’, …). When None (default), iterates over all collections in the data.
fill_to_isbn – An ISBN up to which to fill dataset codes.

Returns:

Yields datasets for the selected collections.

Example

Iterate over all datasets and count filled ISBNs:

from allisbns.dataset import load_bencoded, iterate_datasets

with open("aa_isbn13_codes_20251118T170842Z.benc.zst", "rb") as f:
    input_data = load_bencoded(f)

filled_counts: dict[str, int] = {}

collections = [x.decode() for x in input_data.keys()]
dataset_iterator = iterate_datasets(input_data, collections)
for collection, dataset in zip(collections, dataset_iterator):
    filled_counts[collection] = dataset.count_filled_isbns()
