allisbns.dataset
Classes and functions to work with datasets of packed ISBN codes.
- type allisbns.dataset.PackedCodes = ndarray[tuple[Any, ...], dtype[int32]]
Packed ISBN codes that represent ISBN availability.
- type allisbns.dataset.UnpackedCodes = ndarray[tuple[Any, ...], dtype[bool]]
Unpacked representation of codes.
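The examples elsewhere on this page suggest that packed codes are run lengths that alternate between streaks (filled ISBNs) and gaps (absent ISBNs), starting with a streak. Under that assumption, the following is a minimal sketch of how the two representations relate. It uses plain Python lists instead of NumPy arrays, and `unpack_sketch` is a hypothetical helper, not part of the library:

```python
from itertools import chain, repeat


def unpack_sketch(codes: list[int]) -> list[bool]:
    """Expand alternating streak/gap run lengths into one flag per ISBN."""
    # Assumption: even positions (0, 2, ...) are streaks (filled),
    # odd positions are gaps (absent).
    return list(
        chain.from_iterable(
            repeat(index % 2 == 0, length) for index, length in enumerate(codes)
        )
    )


# One filled ISBN, then two absent, then three filled.
print(unpack_sketch([1, 2, 3]))
```

The unpacked form needs one value per ISBN, which is presumably why the packed form is the one stored on disk.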
- allisbns.dataset.load_bencoded(source: IO[bytes]) → dict[bytes, bytes]
Opens a compressed source and reads the bencoded data from it.
- allisbns.dataset.unpack_data(data: bytes) → PackedCodes
Unpacks data that come from a bencoded source.
- class allisbns.dataset.BinnedArray(bins: ArrayLike, bin_size: int)
Bases: object
Represents the binned data.
- bins: ArrayLike
Bins.
- class allisbns.dataset.QueryResult(is_streak: bool, segment_index: int, position_in_segment: int)
Bases: NamedTuple
Represents the result of a query for an ISBN.
It shows whether the ISBN is filled (falls in a streak segment) or absent (in a gap segment) and the corresponding position in a segment.
- class allisbns.dataset.CodeDataset(codes: ndarray[tuple[Any, ...], dtype[int32]], offset: ISBN12 = 978000000000, fill_to_isbn: dataclasses.InitVar[ISBN12 | None] = None)
Bases: object
Represents a dataset of packed ISBN codes.
Examples
Create a dataset from the input codes:
>>> dataset = CodeDataset(codes=[1, 2, 3])
>>> dataset
CodeDataset(array([1, 2, 3], dtype=int32), bounds=(978000000000, 978000000005))
>>> dataset.codes
array([1, 2, 3], dtype=int32)
With a custom offset and filling up to a given ISBN:
>>> CodeDataset(
...     codes=[1, 2, 3],
...     offset=979_000_000_000,
...     fill_to_isbn=979_999_999_999
... )
CodeDataset(array([ 1, 2, 3, 999999994]), bounds=(979000000000, 979999999999))
- fill_to_isbn: dataclasses.InitVar[ISBN12 | None] = None
ISBN up to which to fill dataset codes.
- bounds: ISBNBounds
First and last ISBNs in the dataset.
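The two constructor examples above are consistent with simple arithmetic: the last ISBN is the offset plus the sum of the codes minus one, and filling appends (or extends) a final gap that reaches `fill_to_isbn`. The sketch below reproduces that arithmetic under the assumption that codes alternate streak, gap, streak, and so on. `bounds_sketch` and `fill_sketch` are hypothetical names used only for illustration:

```python
def bounds_sketch(codes: list[int], offset: int) -> tuple[int, int]:
    """First and last ISBN covered by the codes."""
    return offset, offset + sum(codes) - 1


def fill_sketch(codes: list[int], offset: int, fill_to_isbn: int) -> list[int]:
    """Pad the codes with a trailing gap so they reach fill_to_isbn."""
    filled = list(codes)
    missing = fill_to_isbn - (offset + sum(filled) - 1)
    if missing <= 0:
        return filled  # the codes already reach fill_to_isbn
    if not filled:
        return [0, missing]  # zero-length streak, then one long gap
    if len(filled) % 2 == 0:
        filled[-1] += missing  # codes end with a gap: extend it
    else:
        filled.append(missing)  # codes end with a streak: add a new gap
    return filled


print(bounds_sketch([1, 2, 3], 978_000_000_000))
print(fill_sketch([1, 2, 3], 979_000_000_000, 979_999_999_999))
```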
- classmethod from_file(source: str | Path | IO[bytes], collection: str, offset: ISBN12 = FIRST_ISBN, fill_to_isbn: ISBN12 | None = None) → Self
Creates a dataset from a source file or byte stream.
- Parameters:
source – A path to a bencoded compressed file or a byte stream.
collection – A collection name to be read (e.g., ‘md5’, ‘rgb’). Refers to aarecord_id_prefix in the original data format.
offset – The first ISBN.
fill_to_isbn – An ISBN up to which to fill dataset codes.
- Returns:
A dataset created from a source.
- classmethod from_unpacked(unpacked_codes: UnpackedCodes, offset: ISBN12 = FIRST_ISBN) → Self
Creates a dataset from the unpacked codes.
- Parameters:
unpacked_codes – An array of unpacked codes (boolean values).
offset – The first ISBN.
- Returns:
A dataset created from the unpacked codes.
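Going from unpacked flags back to packed codes amounts to run-length encoding. The sketch below assumes the first code is always a streak, so a sequence that starts with an absent ISBN gets a zero-length leading streak (this matches the `array([0, 1000000000])` output shown in the `reframe()` examples). `pack_sketch` is a hypothetical helper working on plain lists, not the library's implementation:

```python
from itertools import groupby


def pack_sketch(flags: list[bool]) -> list[int]:
    """Run-length encode flags as alternating streak/gap lengths."""
    codes = [len(list(run)) for _, run in groupby(flags)]
    # Assumption: the first code is a streak, so flags that start with
    # an absent ISBN need a leading zero-length streak.
    if flags and not flags[0]:
        codes.insert(0, 0)
    return codes


print(pack_sketch([True, False, False, True, True, True]))
print(pack_sketch([False, False, True]))
```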
- reframe(start_isbn: ISBN12 | None, end_isbn: ISBN12 | None) → Self
Reframes the dataset to new bounds.
Reframing can crop or expand the existing bounds.
- Parameters:
start_isbn – An ISBN to crop the dataset from. When None, the start bound will be used.
end_isbn – An ISBN to crop the dataset until. When None, the end bound will be used.
- Returns:
A new reframed dataset.
Examples
Crop at both sides:
>>> dataset = CodeDataset([3, 2, 1], offset=978_000_000_000)
>>> dataset
CodeDataset(array([3, 2, 1], dtype=int32), bounds=(978000000000, 978000000005))
>>> dataset.reframe(978_000_000_001, 978_000_000_004)
CodeDataset(array([2, 2], dtype=int32), bounds=(978000000001, 978000000004))
Reframe with the default start bound:
>>> dataset.reframe(None, 978_000_000_100)
CodeDataset(array([ 3, 2, 1, 95], dtype=int32), bounds=(978000000000, 978000000100))
Reframe with both start and end outside the current bounds:
>>> dataset.reframe(979_000_000_000, 979_999_999_999)
CodeDataset(array([ 0, 1000000000], dtype=int32), bounds=(979000000000, 979999999999))
See also
__getitem__(): Reframe a dataset using slicing.
- unpack_codes() → UnpackedCodes
Unpacks codes into boolean values.
- Returns:
An array of unpacked codes.
- query_isbn(isbn: ISBN12) → QueryResult
Queries if the ISBN is filled in the dataset and its position.
- Parameters:
isbn – An ISBN to query for.
- Returns:
A query result.
- Raises:
ValueError – If the ISBN is outside of the dataset bounds.
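One way such a lookup can work is a binary search over the cumulative segment ends. The sketch below illustrates that idea on plain lists; whether the library numbers segments and positions exactly this way (zero-based, counting streaks and gaps together) is an assumption, and `QueryResultSketch` and `query_sketch` are hypothetical names:

```python
from bisect import bisect_right
from itertools import accumulate
from typing import NamedTuple


class QueryResultSketch(NamedTuple):
    is_streak: bool
    segment_index: int
    position_in_segment: int


def query_sketch(codes: list[int], offset: int, isbn: int) -> QueryResultSketch:
    """Locate an ISBN within alternating streak/gap segments."""
    position = isbn - offset
    ends = list(accumulate(codes))  # exclusive end position of each segment
    if not ends or not 0 <= position < ends[-1]:
        raise ValueError(f"ISBN {isbn} is outside of the dataset bounds")
    # First segment whose end lies beyond the position.
    segment_index = bisect_right(ends, position)
    segment_start = ends[segment_index - 1] if segment_index else 0
    # Assumption: even-indexed segments are streaks (filled).
    return QueryResultSketch(
        segment_index % 2 == 0, segment_index, position - segment_start
    )


print(query_sketch([3, 2, 1], 978_000_000_000, 978_000_000_004))
```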
- check_isbns(isbns: list[ISBN12]) → ndarray[tuple[Any, ...], dtype[bool]]
Checks if ISBNs are filled in the dataset or not.
- Parameters:
isbns – A list with ISBNs to check.
- Returns:
An array of boolean values.
- get_filled_isbns() → ndarray[tuple[Any, ...], dtype[_ScalarT]]
Gets filled ISBNs in the dataset.
- Returns:
An array of filled ISBNs.
Examples
To get unfilled ISBNs, invert the dataset first:
dataset.invert().get_filled_isbns()
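For intuition, both steps can be expressed directly on run lengths. The sketch below assumes streak-first alternating codes; under that assumption, inverting only requires shifting the alternation by one position. How the library's `invert()` is actually implemented is not shown on this page, and `filled_isbns_sketch` and `invert_sketch` are hypothetical helpers:

```python
def filled_isbns_sketch(codes: list[int], offset: int) -> list[int]:
    """List every ISBN that falls inside a streak segment."""
    isbns: list[int] = []
    position = offset
    for index, length in enumerate(codes):
        if index % 2 == 0:  # assumption: even-indexed codes are streaks
            isbns.extend(range(position, position + length))
        position += length
    return isbns


def invert_sketch(codes: list[int]) -> list[int]:
    """Swap streaks and gaps by shifting the alternation by one code."""
    if codes and codes[0] == 0:
        return list(codes[1:])  # drop the zero-length leading streak
    return [0, *codes]  # prepend one, so old streaks become gaps


codes = [1, 2, 3]
print(filled_isbns_sketch(codes, 100))
print(filled_isbns_sketch(invert_sketch(codes), 100))
```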
- bin(bin_size: int = 2500, num_chunks: int = 4) → BinnedArray
Groups the codes into fixed-size bins.
Each bin value is the number of filled ISBNs in that bin.
- Parameters:
bin_size – A number of ISBNs in one bin.
num_chunks – A number of chunks used to process the dataset.
- Returns:
A binned array.
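The bin values can be computed from the run lengths without unpacking them. The following sketch walks the segments and splits each one across bin boundaries. It ignores `num_chunks`, keeps a trailing partial bin (the library may handle that differently), and assumes streak-first alternating codes; `bin_sketch` is a hypothetical name:

```python
def bin_sketch(codes: list[int], bin_size: int) -> list[int]:
    """Count filled ISBNs per fixed-size bin."""
    bins: list[int] = []
    filled_in_bin = 0  # filled ISBNs counted in the bin being built
    room = bin_size  # ISBNs still needed to complete that bin
    for index, length in enumerate(codes):
        is_streak = index % 2 == 0  # assumption: even-indexed codes are streaks
        while length > 0:
            take = min(length, room)
            if is_streak:
                filled_in_bin += take
            length -= take
            room -= take
            if room == 0:  # bin complete: store it and start a new one
                bins.append(filled_in_bin)
                filled_in_bin, room = 0, bin_size
    if room < bin_size:
        bins.append(filled_in_bin)  # trailing partial bin
    return bins


# Flags T T T F F T split into bins of two ISBNs each.
print(bin_sketch([3, 2, 1], bin_size=2))
```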
- write_bencoded(file: str | Path | IO[bytes], prefix: str, normalize: bool = True) → None
Writes ISBN codes to a bencoded compressed file.
- Parameters:
file – A file path or file-like object to which the codes will be written.
prefix – A dataset collection name. Refers to ‘aarecord_id_prefix’ in the original data format description.
normalize – Whether to normalize codes so that the ending gap code, if present, is omitted (default) or not.
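As a rough illustration of what normalizing could mean here: if codes alternate streak, gap, streak, and so on, then an even number of codes implies that the last one is a gap, and dropping it loses no information about filled ISBNs. This is an assumption about the layout, and `normalize_sketch` is a hypothetical helper rather than the library's code:

```python
def normalize_sketch(codes: list[int]) -> list[int]:
    """Drop the trailing gap code, if there is one."""
    normalized = list(codes)
    # Assumption: streaks sit at even indices, so an even number of
    # codes means the sequence ends with a gap.
    if normalized and len(normalized) % 2 == 0:
        normalized.pop()
    return normalized


print(normalize_sketch([1, 2, 3, 999999994]))
```

This is the inverse of filling a dataset up to some ISBN, which adds a trailing gap.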
- __getitem__(key: slice[ISBN12 | None, ISBN12 | None, None]) → Self
Reframes a dataset to new bounds using slicing.
- Parameters:
key – A slice object with optional start and stop ISBNs. See reframe() for more info.
Examples
Reframe a dataset using start and stop ISBNs:
>>> dataset = CodeDataset([3, 2, 1], offset=978_000_000_000)
>>> dataset[978_000_000_001:978_000_000_004]
CodeDataset(array([2, 2], dtype=int32), bounds=(978000000001, 978000000004))
- allisbns.dataset.iterate_datasets(data: dict[bytes, bytes], collections: list[str] | None = None, fill_to_isbn: ISBN12 | None = None) → Generator[CodeDataset]
Iterates over datasets created from the loaded bencoded data.
By default, iterates over all collections in the data.
- Parameters:
data – Loaded bencoded data that contain collections to unpack.
collections – Collection names to unpack (e.g. ‘md5’, ‘rgb’, …). When None (default), iterates over all collections in the data.
fill_to_isbn – An ISBN up to which to fill dataset codes.
- Returns:
Yields datasets for the selected collections.
Example
Iterate over all datasets and count filled ISBNs:
from allisbns.dataset import load_bencoded, iterate_datasets

with open("aa_isbn13_codes_20251118T170842Z.benc.zst", "rb") as f:
    input_data = load_bencoded(f)

filled_counts: dict[str, int] = {}
collections = [x.decode() for x in input_data.keys()]
dataset_iterator = iterate_datasets(input_data, collections)
for collection, dataset in zip(collections, dataset_iterator):
    filled_counts[collection] = dataset.count_filled_isbns()