[METRICS, breaking] Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics #518
Conversation
(test failure is unrelated)
Awesome !
Bonus test you could add: check that it's not possible to use two concurrent metric instances
Cool ! I think I'll play with it a little bit. I wonder if we can have user experience issues because of timeouts
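The suggested bonus test could be sketched with stdlib-only locking, independent of the actual Metric API (`pick_cache_file` and the naming scheme here are hypothetical stand-ins for the real cache-file search):

```python
import os
import tempfile

def pick_cache_file(data_dir, experiment_id="exp"):
    """Search for a free cache file name, bumping a suffix whenever the
    lock for a candidate name is already taken -- a hypothetical sketch of
    'use a new experiment_id if we can't lock the cache file'."""
    for attempt in range(100):
        name = f"{experiment_id}-{attempt}.arrow"
        lock_path = os.path.join(data_dir, name + ".lock")
        try:
            # O_EXCL: creation fails if another instance holds this slot
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return os.path.join(data_dir, name)
        except FileExistsError:
            continue
    raise RuntimeError("no free cache file found")

# Two concurrent instances sharing a cache dir must get distinct files:
cache_dir = tempfile.mkdtemp()
first = pick_cache_file(cache_dir)
second = pick_cache_file(cache_dir)
assert first != second
```

A real test would additionally hold the lock from a second process rather than leaving the lock files behind, but the collision-avoidance assertion is the same.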
-        if self.hash:
-            builder_data_dir = os.path.join(builder_data_dir, self.hash)
+        builder_data_dir = self._data_dir_root
+        builder_data_dir = os.path.join(builder_data_dir, self.name, self.config_name)
The hash is missing here ?
I removed it. It's not used anymore now that we create hashes on the fly.
Ok ! Maybe you can update the docstring that mentions self.hash then
        else:
            filelocks.append(filelock)

        return file_paths, filelocks

    def finalize(self, timeout=120):
It doesn't match the default 100 sec timeout in _get_all_cache_files.
Maybe add more info about the timeout in the docstring ?
Yes, this is on purpose. When you open a new file you want to quickly find an open file to use.
When you gather all the distributed files in a distributed setup you want to give the nodes time to finish their work before failing.
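The two timeouts serve different purposes, as explained above. A stdlib-only sketch of the underlying pattern (`try_acquire` is a hypothetical stand-in for the filelock calls, not the library's API):

```python
import os
import tempfile
import time

def try_acquire(lock_path, timeout, poll=0.05):
    """Create the lock file exclusively, retrying until `timeout` seconds.
    A short timeout suits 'find me a free cache file quickly'; a long one
    suits 'wait for all distributed nodes to finish writing'."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists
            return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {lock_path!r}")
            time.sleep(poll)

lock = os.path.join(tempfile.mkdtemp(), "cache-0.arrow.lock")
fd = try_acquire(lock, timeout=1)  # lock is free: acquired immediately
os.close(fd)
```

With this shape, the writer path would call `try_acquire(..., timeout=short)` and move on to another candidate file on failure, while `finalize` would call it with the long timeout and only fail after the nodes had a real chance to finish.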
I'll add some info in the docstrings
Oh I read too quickly, indeed 100 and 120 don't match. I'll update that.
            file_paths = [self.cache_file_name]
        else:
            file_paths = [
                os.path.join(self.data_dir, f"{self.experiment_id}-{self.num_process}-{process_id}.arrow")
What about the files with file_uuid, like this?

    file_path = os.path.join(
        self.data_dir, f"{self.experiment_id}-{file_uuid}-{self.num_process}-{self.process_id}.arrow"
    )
They are only used for single-process metrics (the previous branches of the if-else).
I'll add a comment.
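The distributed naming pattern discussed here can be illustrated with a small sketch (`distributed_cache_files` is a hypothetical helper mirroring the f-string in the snippet above, not a function from the library):

```python
import os

def distributed_cache_files(data_dir, experiment_id, num_process):
    """Cache file expected from each node of a distributed run, following
    the f"{experiment_id}-{num_process}-{process_id}.arrow" pattern shown
    in the diff. (Per the reply, single-process metrics use uuid-suffixed
    names instead, so no uuid appears in this list.)"""
    return [
        os.path.join(data_dir, f"{experiment_id}-{num_process}-{process_id}.arrow")
        for process_id in range(num_process)
    ]

paths = distributed_cache_files("cache", "exp1", 2)
```

Because every node can derive the same list from `experiment_id` and `num_process` alone, the final sync knows exactly which files to wait for.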
As discussed with @thomwolf, merging since the hyperparameter-search has been merged in transformers.
Move the acquisition of the filelock to a later stage during metrics processing so that a metric can be pickled/cloudpickled after instantiation.
Also add tests on pickling, on concurrent but separate metric instances, and on concurrent and distributed metric instances.
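Why deferring the lock makes pickling work can be shown with a minimal stdlib sketch (the two classes are hypothetical illustrations, not the library's Metric; a `threading.Lock` stands in for a held filelock):

```python
import pickle
import threading

class EagerLockMetric:
    """Holds a live lock object from instantiation onward --
    such an instance cannot be pickled."""
    def __init__(self):
        self.filelock = threading.Lock()  # stand-in for a lock acquired in __init__

class LazyLockMetric:
    """Stores only the lock *path*; the lock itself is acquired later,
    at compute time, so the instance stays picklable/cloudpicklable."""
    def __init__(self, lock_path="metric.arrow.lock"):
        self.lock_path = lock_path

try:
    pickle.dumps(EagerLockMetric())
    eager_ok = True
except TypeError:  # "cannot pickle '_thread.lock' object"
    eager_ok = False

lazy_ok = isinstance(pickle.dumps(LazyLockMetric()), bytes)
print(eager_ok, lazy_ok)  # -> False True
```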
Changes significantly the caching behavior for the metrics:
- Use a new experiment_id if we can't lock the cache file; this allows to use several instances of the same metric in parallel.
- Distributed metrics don't change the experiment_id if we can't lock the cache file (because all the nodes need to have related cache file names for the final sync).

Breaking: Some arguments for Metrics initialization have been removed for simplicity (version, ...) and some have been renamed for consistency with the rest of the library (in_memory => keep_in_memory).

Also remove the _has_transformers detection in utils to avoid importing transformers every time during loading.