
Refactor DataCollectorV0 and HDF5 dependencies isolation #133

Merged: 20 commits merged into Farama-Foundation:main on Oct 21, 2023

Conversation

@younik (Member) commented Aug 23, 2023

Motivation

In the future, we aim to consider storage methods other than HDF5 files for our datasets (see #98), as other backends can be faster.
Currently, several files depend on h5py; this PR isolates that dependency in MinariStorage.
If we add support for another storage backend, we will simply need to create a new MinariStorage implementation with the same interface.

Changes

To achieve the h5py isolation, we make DataCollectorV0, the functions in utils, and MinariDataset interface with MinariStorage instead of directly with h5py. Thus, MinariStorage should offer an API covering the most commonly used operations.
This gives the natural dependencies DataCollectorV0 -> MinariStorage and MinariDataset -> MinariStorage.

To avoid a MinariDataset -> DataCollectorV0 dependency, we change the current API for adding the buffer of a DataCollector to a MinariDataset, from:

minari_dataset.update_from_env_collector(data_collector)

to:

data_collector.add_to_dataset(minari_dataset)
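
For illustration, here is a minimal end-to-end sketch of the new flow (class and helper names are the existing Minari ones; the dataset id and the rollout loop are assumptions made for this example):

import gymnasium as gym
import minari
from minari import DataCollectorV0

# Wrap an environment so that transitions are buffered during rollouts.
env = DataCollectorV0(gym.make("CartPole-v1"))
for _ in range(10):
    env.reset()
    done = False
    while not done:
        _, _, terminated, truncated, _ = env.step(env.action_space.sample())
        done = terminated or truncated

# Append the collector's buffer to an existing dataset through the collector,
# instead of going through the dataset as before.
dataset = minari.load_dataset("cartpole-test-v0")  # assumed dataset id
env.add_to_dataset(dataset)  # new API (was: dataset.update_from_env_collector(env))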

Other notable user-facing changes:

  • MinariStorage no longer takes the path to the file, but the path to the directory containing the file(s).
  • The MinariStorage of a MinariDataset is now accessible via .storage.
  • The __init__ method of MinariStorage is for an existing dataset; to create a new storage, use the class method new (see the sketch after this list).
  • It is no longer possible to combine datasets without copying. We may want to add this back in the future (but it will likely be incompatible with different backends).
  • Remove the custom error when the env module is missing. There is already an informative error from Gymnasium (and if it is not informative enough, it should be changed there).
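
A minimal sketch of how the reworked storage surface might be used (the module path and the exact signature of new are assumptions; only the directory-path argument, the new class method, and the .storage attribute are described above):

import gymnasium as gym
import minari
from minari.dataset.minari_storage import MinariStorage  # assumed module path

# Open the storage of an existing dataset: pass the directory containing the
# data file(s), not the HDF5 file itself.
storage = MinariStorage("/path/to/datasets/cartpole-test-v0/data")

# Create a brand-new storage via the class method instead of __init__
# (the keyword arguments here are hypothetical).
ref_env = gym.make("CartPole-v1")
new_storage = MinariStorage.new(
    "/path/to/new-dataset",
    observation_space=ref_env.observation_space,
    action_space=ref_env.action_space,
)

# The storage backing a MinariDataset is now exposed directly.
dataset = minari.load_dataset("cartpole-test-v0")
print(dataset.storage)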

@younik younik marked this pull request as ready for review September 19, 2023 17:34
# Check that we get max(max_episode_steps) when there is no max_episode_steps=None
test_datasets.pop()
# testing without creating a copy
Collaborator

What is the reason for removing these checks?

Member Author

It is no longer possible to have copy=False, as it was causing problems (the episode ids were not consistent).

Collaborator

I see, so this test removal goes together with removing the copy optional argument? That seems reasonable, assuming we must remove that feature to isolate the h5py dependence to MinariStorage.

Member Author

It is not specific to the h5py isolation, but it is a bug in the current code: when you combine two datasets with copy=False, the episode ids of the second dataset are modified, i.e. they no longer start from 0.

@balisujohn (Collaborator) left a comment

I noticed there is still an h5py import in utils.py, and it is used in get_normalized_scores. Is this by design, and if so, what is blocking removing the h5py dependency there? Other than this and the other question I had, this seems good overall. It adds new tests and passes the existing tests.

Here are the profiling results:

[save_times and sample_times profiling plots]

@younik (Member Author) commented Oct 3, 2023

I noticed there is still an h5py import in utils.py, and it is used in get_normalized_scores. Is this by design?

Oh, thanks for spotting this, it wasn't by design; I fixed it.

@balisujohn (Collaborator) left a comment

Just had one more question, otherwise, once this is passing pre-commit, I think it's ready to merge :^)

@@ -547,7 +549,7 @@ def check_load_and_delete_dataset(dataset_id: str):

 def create_dummy_dataset_with_collecter_env_helper(
-    dataset_id: str, env: DataCollectorV0, num_episodes: int = 10
+    dataset_id: str, env: DataCollectorV0, num_episodes: int = 10, **kwargs
Collaborator

What is the purpose of the kwargs?

Member Author

It is needed for this new test; they are general kwargs that can be passed on to the create_dataset_from_collector_env function:

ref_min_score, ref_max_score = -1, 100
dataset = create_dummy_dataset_with_collecter_env_helper(
"cartpole-test-v0",
env,
num_episodes=num_episodes,
ref_min_score=ref_min_score,
ref_max_score=ref_max_score,
)
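
For context, ref_min_score and ref_max_score presumably end up in the dataset metadata used by get_normalized_scores, which conventionally computes normalized = (score - ref_min_score) / (ref_max_score - ref_min_score); with the values above, a raw return of 100 would map to (100 - (-1)) / (100 - (-1)) = 1.0.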

@balisujohn (Collaborator) left a comment

Ready to merge :^) but one last question: what if two datasets, each of which is itself the result of combining two datasets, are combined? Will this situation be handled correctly?

@younik (Member Author) commented Oct 21, 2023

Ready to merge :^) but one last question: what if two datasets, each of which is itself the result of combining two datasets, are combined? Will this situation be handled correctly?

Thanks! Yes, this should not create any issue, as the newly generated dataset is a perfectly normal dataset (everything is copied), just with different metadata.

@younik younik merged commit dd8406e into Farama-Foundation:main Oct 21, 2023
10 checks passed
@younik younik deleted the refactor-storage branch May 26, 2024 09:52