
[METRICS, breaking] Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics #518

Merged
sgugger merged 7 commits into master from cloudpickle on Aug 24, 2020

Conversation

thomwolf (Member) commented Aug 19, 2020:

Move the acquisition of the filelock to a later stage of metrics processing, so a metric can be pickled/cloudpickled after instantiation.

Also add tests on pickling, on concurrent but separate metric instances, and on concurrent, distributed metric instances.
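
A minimal sketch of what this enables, assuming the library's load_metric entry point and its accuracy metric (illustrative, not taken from the PR's test suite):

import pickle

import cloudpickle
from datasets import load_metric  # the library was still named `nlp` when this PR was opened

# Because the filelock is now acquired lazily, a freshly instantiated metric
# can be serialized with pickle or cloudpickle right after creation.
metric = load_metric("accuracy")
metric = pickle.loads(pickle.dumps(metric))
metric = cloudpickle.loads(cloudpickle.dumps(metric))

metric.add_batch(predictions=[0, 1, 1], references=[0, 1, 0])
print(metric.compute())  # expected: 2 of 3 predictions correct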

This significantly changes the caching behavior for metrics:

  • if the metric is used in a non-distributed setup (the most common case), we try to find a free cache file using a UUID instead of asking for an experiment_id when we can't lock the cache file; this allows several instances of the same metric to be used in parallel (see the sketch after this list).
  • if the metric is used in a distributed setup, we ask for an experiment_id when we can't lock the cache file (because all the nodes need related cache file names for the final sync).
  • after the computation, we release the locks and delete all the cache files.
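
A hypothetical sketch of the non-distributed strategy from the first point, using the filelock package; the helper name, file-name pattern, and retry count are illustrative and not the library's actual implementation:

import os
import uuid

from filelock import FileLock, Timeout

def find_free_cache_file(data_dir, experiment_id="default_experiment", max_attempts=10):
    # Try UUID-suffixed candidate names until one can be locked, so several
    # independent instances of the same metric can run in parallel without
    # the user having to provide an experiment_id.
    for _ in range(max_attempts):
        file_uuid = uuid.uuid4().hex[:8]
        file_path = os.path.join(data_dir, f"{experiment_id}-{file_uuid}-1-0.arrow")
        filelock = FileLock(file_path + ".lock")
        try:
            filelock.acquire(timeout=1)  # fail fast: we are only probing for a free file
        except Timeout:
            continue  # candidate already locked by another metric instance
        return file_path, filelock
    raise ValueError("Could not find a free cache file; consider providing an experiment_id.")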

Breaking: Some arguments for Metrics initialization have been removed for simplicity (version...) and some have been renamed for consistency with the rest of the library (in_memory => keep_in_memory).

Also remove the _has_transformers detection in utils to avoid importing transformers every time during loading.

@thomwolf thomwolf changed the title Cloudpickle Pickle/cloudpickle metrics and dataset Aug 19, 2020
@thomwolf thomwolf requested a review from lhoestq August 19, 2020 20:02
thomwolf (Member, Author):

(test failure is unrelated)

lhoestq (Member) left a comment:

Awesome!
Bonus test you could add: check that it's not possible to use two concurrent metric instances.

@thomwolf thomwolf changed the title Pickle/cloudpickle metrics and dataset [METRICS] Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics Aug 21, 2020
@thomwolf thomwolf changed the title [METRICS] Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics [METRICS, breaking] Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics Aug 21, 2020
@thomwolf thomwolf requested a review from lhoestq August 21, 2020 13:22
lhoestq (Member) left a comment:

Cool! I think I'll play with it a little bit. I wonder if we can run into user experience issues because of timeouts.

if self.hash:
    builder_data_dir = os.path.join(builder_data_dir, self.hash)
builder_data_dir = self._data_dir_root
builder_data_dir = os.path.join(builder_data_dir, self.name, self.config_name)
lhoestq (Member):

The hash is missing here?

thomwolf (Member, Author):

I removed it. It's not used anymore now that we create hashes on the fly.

lhoestq (Member):

OK! Maybe you can update the docstring that mentions self.hash, then.

            else:
                filelocks.append(filelock)

        return file_paths, filelocks

    def finalize(self, timeout=120):
lhoestq (Member):

It doesn't match the default 100-second timeout in _get_all_cache_files.
Maybe add more info about the timeout in the docstring?

thomwolf (Member, Author):

Yes, this is on purpose. When you open a new file, you want to quickly find a free file to use.
When you gather all the distributed files in a distributed setup, you want to give the nodes time to finish their work before failing.
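
Illustrative only, to make the asymmetry concrete; the helper name is hypothetical and the 100-second default mirrors _get_all_cache_files:

from filelock import FileLock, Timeout

def lock_node_cache_file(lock_path, timeout=100):
    # Gathering side of a distributed run: give the other nodes time to finish
    # writing before giving up, unlike the ~1 s probe used when looking for a
    # free cache file in the non-distributed case.
    filelock = FileLock(lock_path)
    try:
        filelock.acquire(timeout=timeout)
    except Timeout:
        raise ValueError(f"Cache file {lock_path} is still locked after {timeout}s; a node may not have finished.")
    return filelock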

thomwolf (Member, Author):

I'll add some info in the docstrings.

thomwolf (Member, Author) commented Aug 21, 2020:

Oh, I read too quickly; indeed 100 and 120 don't match. I'll update that.

            file_paths = [self.cache_file_name]
        else:
            file_paths = [
                os.path.join(self.data_dir, f"{self.experiment_id}-{self.num_process}-{process_id}.arrow")
lhoestq (Member):

What about the files with file_uuid like this?

file_path = os.path.join(
    self.data_dir, f"{self.experiment_id}-{file_uuid}-{self.num_process}-{self.process_id}.arrow"
)

thomwolf (Member, Author):

They are only used for single-process metrics (the previous branches of the if-else).
I'll add a comment.

sgugger (Contributor) commented Aug 24, 2020:

As discussed with @thomwolf, merging since hyperparameter search has been merged in transformers.

@sgugger sgugger merged commit c8fdfe3 into master Aug 24, 2020
@sgugger sgugger deleted the cloudpickle branch August 24, 2020 16:01