Further reduce the number of calls to head for cached objects #18871

Merged
sgugger merged 3 commits into main on Sep 6, 2022

Conversation

Collaborator

sgugger commented Sep 2, 2022

What does this PR do?

This PR completes #18534. It leverages the cache of files that do not exist at a given commit in a repo, introduced in the last release of huggingface_hub (by this PR), to further reduce the number of API calls made when loading configurations/models/tokenizers/pipelines. Only 1 call is made whenever the object is cached and the cached commit is the same as the one in the remote repo for the given revision.
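As a rough illustration of the idea described above, and not the code from this PR, the sketch below counts simulated network calls when several files are resolved for one revision. `FakeHub`, `resolve_files`, and the in-memory `cached` / `known_missing` dictionaries are hypothetical stand-ins; the real implementation works against the on-disk cache and the Hub API.

```python
# Hypothetical sketch: one remote call fetches the commit hash, then every
# file is resolved from local cache state keyed by that commit.


class FakeHub:
    """Stand-in for the remote repo; counts every simulated network call."""

    def __init__(self, commit_hash, files):
        self.commit_hash = commit_hash
        self.files = set(files)
        self.calls = 0

    def head_commit(self):
        self.calls += 1
        return self.commit_hash

    def file_exists(self, filename):
        self.calls += 1
        return filename in self.files


def resolve_files(hub, filenames, cached, known_missing):
    """Return {filename: bool} telling whether each file exists.

    `cached` and `known_missing` map a commit hash to a set of filenames.
    """
    commit = hub.head_commit()  # the single call that cannot be avoided
    present = cached.setdefault(commit, set())
    missing = known_missing.setdefault(commit, set())
    result = {}
    for name in filenames:
        if name in present:
            result[name] = True
        elif name in missing:
            result[name] = False  # recorded as absent at this commit
        else:
            exists = hub.file_exists(name)  # cache miss: ask the remote
            (present if exists else missing).add(name)
            result[name] = exists
    return result


hub = FakeHub("abc123", ["config.json", "pytorch_model.bin"])
cached, known_missing = {}, {}
wanted = ["config.json", "pytorch_model.bin", "tokenizer.json"]

first = resolve_files(hub, wanted, cached, known_missing)
calls_cold = hub.calls  # commit lookup plus one call per file
second = resolve_files(hub, wanted, cached, known_missing)
calls_warm = hub.calls - calls_cold  # commit lookup only
```

In this toy setup the second pass needs only the commit lookup, because both the files that exist and the one that does not are already recorded for that commit.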

cc @Narsil

Member

julien-c commented Sep 2, 2022

and also cc @Wauplin :)

@@ -244,6 +244,9 @@ def try_to_load_from_cache(cache_dir, repo_id, filename, revision=None, commit_h
    with open(os.path.join(model_cache, "refs", revision)) as f:
        commit_hash = f.read()

    if os.path.isfile(os.path.join(model_cache, ".no_exist", commit_hash, filename)):
        return -1

Collaborator Author

Not necessarily the cleanest here, but I need a return type that is not None (means file not found in cache) and not a string (I could put "no_exist" but I'm sure someone will end up naming a cached file like this just to spite me).

Member

No problem with this as long as there's a comment explaining why!

Contributor

> I could put "no_exist" but I'm sure someone will end up naming a cached file like this just to spite me

For sure 😁

For such a case, -1 is fine but for a more explicit return you can also create an empty object NO_EXIST and use it as a return value:

# src/transformers/utils/hub.py

# Return value when trying to load a file from cache but the file does not exist.
NO_EXIST = object() # or "_NO_EXIST"

(...)

def try_to_load_from_cache(cache_dir, repo_id, filename, revision=None, commit_hash=None):
    (...)
    if os.path.isfile(os.path.join(model_cache, ".no_exist", commit_hash, filename)):
        return NO_EXIST

(...)

def cached_file(...):
    (...)
            if resolved_file is not NO_EXIST:
                return resolved_file    
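
For illustration only, here is a self-contained sketch of the sentinel idea suggested above, run against a throwaway directory. The `lookup` and `touch` helpers, the made-up commit hash, and the `snapshots` sub-directory are assumptions chosen for the example; only the `.no_exist` check mirrors the diff under review.

```python
import os
import tempfile

# Hypothetical sentinel: a unique object compared by identity, so no file
# name (not even "no_exist") can ever be mistaken for it.
NO_EXIST = object()


def lookup(model_cache, commit_hash, filename):
    """Return a path if cached, NO_EXIST if recorded as missing, else None."""
    snapshot = os.path.join(model_cache, "snapshots", commit_hash, filename)
    if os.path.isfile(snapshot):
        return snapshot
    if os.path.isfile(os.path.join(model_cache, ".no_exist", commit_hash, filename)):
        return NO_EXIST
    return None  # nothing known locally; the caller has to ask the remote


def touch(path):
    """Create an empty file, making parent directories as needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


with tempfile.TemporaryDirectory() as cache:
    commit = "abc123"  # made-up commit hash
    touch(os.path.join(cache, "snapshots", commit, "config.json"))
    touch(os.path.join(cache, "snapshots", commit, "no_exist"))  # spiteful name
    touch(os.path.join(cache, ".no_exist", commit, "tokenizer.json"))

    found = lookup(cache, commit, "config.json")
    spiteful = lookup(cache, commit, "no_exist")
    missing = lookup(cache, commit, "tokenizer.json")
    unknown = lookup(cache, commit, "vocab.txt")
```

Callers would compare with `is` / `is not` rather than `==`, which keeps three outcomes apart: a real path (including one for a file literally named `no_exist`), `None` for "not cached", and `NO_EXIST` for "known not to exist at this commit".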

Contributor

In any case I agree with @LysandreJik to document it, especially the difference between "a file not existing in the cache (e.g. not cached)" and "a file not existing at all".

Collaborator Author

Addressed as suggested, let me know if you have more comments on the new version!

Contributor

Just reviewed it and it looks good to me 👍

Member

LysandreJik left a comment

Looks good, thank you!

It would be great to have @Wauplin's review before merging


Contributor

Wauplin left a comment

Looks good to me.
In general this check could live in huggingface_hub as well. I created an issue to track it: huggingface/huggingface_hub#1033.

Collaborator Author

sgugger commented Sep 6, 2022

Yes, my plan was to port this to huggingface_hub next, along with the commit_hash argument (which does not exist there yet), to then be able to use the function of huggingface_hub after the next release!

Thanks for the reviews, will address comments later this morning.

sgugger merged commit 71ff88f into main on Sep 6, 2022
sgugger deleted the more_cache branch on September 6, 2022 16:34
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
…face#18871)

* Further reduce the number of calls to head for cached models/tokenizers/pipelines

* Fix tests

* Address review comments