lazy_dataset


lazy_dataset is a helper to deal with large datasets that do not fit into memory. It allows you to define transformations that are applied lazily (e.g. a mapping function that reads data from disk). All transformations are applied only when you iterate over the dataset.
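The core idea can be illustrated with a small stdlib-only sketch (a simplified illustration, not lazy_dataset's actual implementation): transformations are only recorded when you call `map` or `filter`, and are executed example by example during iteration.

```python
# Simplified sketch of lazy transformations (illustration only,
# not lazy_dataset's real implementation).
class TinyLazyDataset:
    def __init__(self, examples, transforms=()):
        self._examples = examples            # dict: example_id -> example
        self._transforms = list(transforms)  # recorded, not yet applied

    def map(self, fn):
        # Nothing is computed here; the function is merely recorded.
        return TinyLazyDataset(self._examples, self._transforms + [('map', fn)])

    def filter(self, fn):
        return TinyLazyDataset(self._examples, self._transforms + [('filter', fn)])

    def __iter__(self):
        # The transformations run only now, one example at a time,
        # so the full transformed dataset never has to fit into memory.
        for example in self._examples.values():
            keep = True
            for kind, fn in self._transforms:
                if kind == 'map':
                    example = fn(example)
                elif kind == 'filter' and not fn(example):
                    keep = False
                    break
            if keep:
                yield example

ds = TinyLazyDataset({'a': 1, 'b': 2, 'c': 3})
ds = ds.map(lambda x: x * 10).filter(lambda x: x > 15)
print(list(ds))  # [20, 30]
```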

Supported transformations:

  • dataset.map(map_fn): Apply the function map_fn to each example (builtins.map).
  • dataset[2]: Get the example at index 2.
  • dataset['example_id']: Get the example with the example ID 'example_id'.
  • dataset[10:20]: Get a sub-dataset that contains only the examples in the slice 10 to 20.
  • dataset.filter(filter_fn, lazy=True): Drop examples for which filter_fn(example) is false (builtins.filter).
  • dataset.concatenate(*others): Concatenate two or more datasets (numpy.concatenate).
  • dataset.intersperse(*others): Combine two or more datasets such that the examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
  • dataset.zip(*others): Zip two or more datasets.
  • dataset.shuffle(reshuffle=False): Shuffle the dataset. When reshuffle is True, the dataset is reshuffled each time you iterate over it.
  • dataset.tile(reps, shuffle=False): Repeat the dataset reps times and concatenate the copies (numpy.tile).
  • dataset.cycle(): Repeat the dataset endlessly (itertools.cycle, but without caching).
  • dataset.groupby(group_fn): Group examples together. In contrast to itertools.groupby, no prior sort is necessary, as in pandas (itertools.groupby, pandas.DataFrame.groupby).
  • dataset.sort(key_fn, sort_fn=sorted): Sort the examples by the values key_fn(example) (list.sort).
  • dataset.batch(batch_size, drop_last=False): Batch batch_size examples together as a list. Usually followed by a map (tensorflow.data.Dataset.batch).
  • dataset.random_choice(): Get a random example (numpy.random.choice).
  • dataset.cache(): Cache the examples in RAM (similar to ESPnet's keep_all_data_on_mem).
  • dataset.diskcache(): Cache the examples in a cache directory on the local filesystem (useful on clusters with slow network filesystems).
  • ...
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
...     'example_id_1': {
...         'observation': [1, 2, 3],
...         'label': 1,
...     },
...     'example_id_2': {
...         'observation': [4, 5, 6],
...         'label': 2,
...     },
...     'example_id_3': {
...         'observation': [7, 8, 9],
...         'label': 3,
...     },
... }
>>> for example_id, example in examples.items():
...     example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
...     example['label'] *= 10
...     return example
>>> ds = ds.map(transform)
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
...     print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
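The groupby transformation listed above differs from itertools.groupby in that it does not require pre-sorted input. Its grouping semantics can be sketched with a plain dict (illustration only; lazy_dataset itself returns datasets, not lists):

```python
from collections import defaultdict

def groupby(examples, group_fn):
    """Group examples by group_fn(example) without sorting first."""
    groups = defaultdict(list)
    for example in examples:
        groups[group_fn(example)].append(example)
    return dict(groups)

examples = [{'label': 1}, {'label': 2}, {'label': 1}]
print(groupby(examples, lambda ex: ex['label']))
# {1: [{'label': 1}, {'label': 1}], 2: [{'label': 2}]}
```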

Comparison with PyTorch's DataLoader

See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.

Installation

Install it directly with pip if you just want to use it:

pip install lazy_dataset

If you want to make changes or need the most recent version, clone the repository and install it in editable mode:

git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .