[TENTATIVE] Attempt to reduce number of HEAD calls during model warmup. #18429

Closed
wants to merge 4 commits

Conversation

Narsil
Contributor

@Narsil Narsil commented Aug 2, 2022

What does this PR do?

When doing model loading with the various tools within transformers, there are actually
a lot of duplicate HEAD calls that cost network time and duplicate usage in an unnecessary fashion.

For instance, when doing a simple pipeline(model="gpt2") you're getting:

Fetching https://huggingface.co/gpt2/resolve/main/config.json
Downloading: 100%|████████████████████████████████████████████████████████████████████████| 1.39k/1.39k [00:00<00:00, 1.20MB/s]
----
Fetching https://huggingface.co/gpt2/resolve/main/config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/pytorch_model.bin
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/vocab.json
----
Fetching https://huggingface.co/gpt2/resolve/main/merges.txt
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer.json
----
Fetching https://huggingface.co/gpt2/resolve/main/added_tokens.json
----
Fetching https://huggingface.co/gpt2/resolve/main/special_tokens_map.json
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/config.json

So you're doing a HEAD call 4 to 5 times for config.json and 3 times for tokenizer_config.json.
Each of these is a call on the Hub which requires loading some resources that could be avoided.
@XciD

In addition, it adds a lot of noise within our logs, since we're getting a lot of seemingly random duplicate
HEAD calls for the exact same code being run.

Fixing it "cleanly" is hard, since there are many pathways to load the various elements,
and checking every single path is impractical.

The proposed fix is to simply introduce a timed_cache wrapper on top of the requests.head
function.

We can keep a very short TTL since it's only meant to reduce duplicates when the model is unlikely to have changed.
We need to keep in mind Jupyter and other long-lived sessions, so we need a TTL short enough that model updates can still be seen and downloaded.
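
As a rough sketch (not the exact code in the PR), a timed_cache decorator along these lines would memoize the HEAD call for a few seconds:

import functools
import time

import requests


def timed_cache(ttl=10.0):
    """Cache a function's result per call arguments for `ttl` seconds."""

    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key on a string repr so unhashable kwargs (e.g. header dicts) still work.
            key = repr((args, sorted(kwargs.items())))
            now = time.monotonic()
            hit = cache.get(key)
            if hit is not None and now - hit[0] < ttl:
                return hit[1]  # fresh enough: skip the duplicate network call
            result = func(*args, **kwargs)
            cache[key] = (now, result)
            return result

        return wrapper

    return decorator


# Hypothetical usage: repeated HEAD calls for the same URL within 10 seconds
# reuse the first response instead of hitting the Hub again.
@timed_cache(ttl=10.0)
def cached_head(url, **kwargs):
    return requests.head(url, **kwargs)

The cache lives only in the current Python process, so long-lived sessions still pick up model updates once the TTL expires.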

In addition to that, it seems each code path calls the HEAD part with a different user-agent, which (afaik) makes
it harder to understand our users' usage.
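
For illustration only (the user-agent string below is made up, not the format transformers actually sends), the idea is that every code path would attach the same client information to its HEAD call:

import requests

# Hypothetical: build one user-agent string up front and reuse it for every
# HEAD request, so all code paths report consistent client information.
USER_AGENT = "transformers/x.y.z; python/3.10; pipeline/question-answering"

response = requests.head(
    "https://huggingface.co/gpt2/resolve/main/config.json",
    headers={"user-agent": USER_AGENT},
    allow_redirects=False,
    timeout=10,
)
print(response.status_code)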

This is a tentative PR, proposed to reduce redundant network calls. If this is thought to be a correct direction,
I will then add unit tests for this timed_cache function.

After the PR:

----
Fetching https://huggingface.co/gpt2/resolve/main/config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/pytorch_model.bin
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
----
Fetching https://huggingface.co/gpt2/resolve/main/vocab.json
----
Fetching https://huggingface.co/gpt2/resolve/main/merges.txt
----
Fetching https://huggingface.co/gpt2/resolve/main/tokenizer.json
----
Fetching https://huggingface.co/gpt2/resolve/main/added_tokens.json
----
Fetching https://huggingface.co/gpt2/resolve/main/special_tokens_map.json

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@sgugger

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Collaborator

@sgugger sgugger left a comment

There is no real point doing this right now as we are going to move away from the current implementation of the cache to one that is more git-based in huggingface_hub (working on it this week actually), and then only one call to head will be necessary when nothing has changed.

@Narsil
Contributor Author

Narsil commented Aug 2, 2022

Ok, marking as draft while the other PR is being worked on.

@Narsil Narsil marked this pull request as draft August 2, 2022 17:30
@Narsil
Contributor Author

Narsil commented Aug 8, 2022

@sgugger If you want to take a look at the tests.

Right now the tests are failing since the cache was written for the previous HEAD code.
We can do the cache here or in huggingface_hub, I am not sure which is the most appropriate.

This PR now holds a few (relatively hackish) attempts to preserve information from the user-agent.

The one thing that surprised me is that the pipeline loads the model using from_auto_class/False because it looks directly within the config to fetch the model with the correct head. While it's technically correct, I am not sure if that's correct "intentionally" or for telemetry purposes, since it actually uses a lot of magic.

WDYT ?

Collaborator

@sgugger sgugger left a comment

Thanks for fixing the telemetry!

The caching part should go more in huggingface_hub IMO, especially now that we rely on it for everything. But I also think people might have strong opinions on it (if a user just updated a model and doesn't see it downloaded, for instance, they'll be mad and won't necessarily understand there is some cache to clear).

In any case the number of calls will be reduced to just one in most cases after the work I'm starting today to leverage the new cache and commit shas :-)

@@ -502,7 +511,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
         config = kwargs.pop("config", None)
         kwargs["_from_auto"] = True

-        use_fast = kwargs.pop("use_fast", True)
+        use_fast = kwargs.get("use_fast", True)
Collaborator

This means the use_fast stays in the kwargs, which are then sent to classes like AutoConfig, which don't use it. So I'd leave it as pop here and if it's missing in a call down there, let's add it manually.
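
As a plain-Python illustration of the point (not the actual transformers code): pop consumes the keyword so it is not forwarded along with **kwargs, while get leaves it in place.

kwargs = {"use_fast": True, "revision": "main"}

# pop: reads the value and removes it, so downstream calls receiving **kwargs
# never see use_fast
use_fast = kwargs.pop("use_fast", True)
print(kwargs)  # {'revision': 'main'}

kwargs = {"use_fast": True, "revision": "main"}

# get: reads the value but leaves it in kwargs, so it would be forwarded to
# e.g. AutoConfig, which does not expect it
use_fast = kwargs.get("use_fast", True)
print(kwargs)  # {'use_fast': True, 'revision': 'main'}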

@Narsil
Contributor Author

Narsil commented Aug 8, 2022

The caching part should go more in huggingface_hub IMO, especially now that we rely on it for everything. But I also think people might have strong opinions on it (if a user just updated a model and doesn't see it downloaded, for instance, they'll be mad and won't necessarily understand there is some cache to clear).

I'll wait for your work on the cache, that should remove the need for it, and the tests here should cover it. I'll update this PR to make sure the user-agent is correct afterwards.

I had a TTL of 10s for the cache, which is largely enough in most cases (well, you would still see multiple HEAD calls if you were actually downloading the files, but ..).

Removing the need for the cache is the best solution, so let's go with that for now.

@ankrgyl
Contributor

ankrgyl commented Aug 9, 2022

@Narsil in case it's relevant, I was observing a significant perf disparity with the following repro (all files have been cached) depending on whether or not I specify TRANSFORMERS_OFFLINE=1:

transformers_test.py:

from transformers import pipeline
nlp = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real    0m1.759s
user    0m1.815s
sys     0m2.443s


$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json


real    0m6.187s
user    0m2.184s
sys     0m2.248s

I then searched around and stumbled upon your PR. I tried patching it down into a venv linked to my repo and saw that the time was roughly the same:

(venv) ankur-m1:~/projects/layoutlm ankur$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real    0m6.034s
user    0m2.019s
sys     0m2.346s

I may have completely misunderstood the purpose of this change, so please ignore me if this comment is irrelevant, but since I was poking around with perf in a similar fashion, I thought I'd share if helpful!

@Narsil
Contributor Author

Narsil commented Aug 9, 2022

@ankrgyl

This PR caches files for a very small amount of time (10s) because most of the time users will want the new models when they exist.

You can try to increase the TTL if you want.
Also, the cache isn't kept across different Python sessions.

You may want to check where the network overhead is occurring too; it could be DNS issues on your end, or just high latency to the HF servers.

@ankrgyl
Contributor

ankrgyl commented Aug 9, 2022

Ahh, okay, that makes sense. Let me dig around a bit further with my repro, and see if I can find anything useful. Thanks for the pointers.

@Narsil
Contributor Author

Narsil commented Aug 9, 2022

Btw @sgugger is working on a better fix; we should reduce the number of network calls to as close to 1 as possible.

@ankrgyl
Contributor

ankrgyl commented Aug 9, 2022

@sgugger if it's helpful to have a guinea pig test your PR in the wild, I'm happy to help! For background context, the reason I'm trying to optimize this cold start is that I'm trying to use transformers in a command line utility where cold start time matters quite a bit.

@sgugger
Collaborator

sgugger commented Aug 9, 2022

The PR is #18534, but nothing will beat using offline mode with the model cached, since you are then doing 0 calls to the API.

@ankrgyl
Contributor

ankrgyl commented Aug 9, 2022

Thanks for the pointer @sgugger. I agree; however, the disparity exists even if you pin the revision:

$ cat transformers_test.py
from transformers import pipeline
nlp = pipeline("question-answering", model='distilbert-base-cased-distilled-squad', revision="1b9d42b637aed70c9f3cd27e13b66ee9f847ed03")

$ time python transformers_test.py

real    0m5.680s
user    0m1.694s
sys     0m2.252s

$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py

real    0m1.321s
user    0m1.653s
sys     0m1.997s

While playing around with this, I noticed issue #18537, which might be leading to extra network calls (since the revision isn't pinned for the model) in my repro.

Apologies if I'm missing something obvious here, but I'd expect that (a) if the revision is specified and (b) it's cached, then there shouldn't be any network calls.

@sgugger
Collaborator

sgugger commented Aug 9, 2022

If the revision is specified as a commit sha, then yes, the cache should be used. This is not implemented by #18534 however, but could be some follow up work.

The only exception is for files that don't exist, as we don't know if they haven't been downloaded yet or if they are not present. That's why we still have extra calls in #18534 and would still have extra calls in this case as well.

@ankrgyl
Contributor

ankrgyl commented Aug 9, 2022

Got it, I pulled down your PR and ran the same test, and saw much better results:

$ time python transformers_test.py

real    0m2.384s
user    0m1.869s
sys     0m2.222s

$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py

real    0m1.588s
user    0m1.722s
sys     0m2.229s

I'd be happy to help with the follow-up work/exploration. I think you could theoretically handle the "all files downloaded" case too, by caching a file that simply marks that you've downloaded all files associated with a revision.
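
A possible sketch of that idea (the file name, layout, and helper names below are hypothetical, not the actual huggingface_hub implementation): since a commit sha is immutable, a small per-revision sentinel recording confirmed-absent files would never need invalidation.

import json
from pathlib import Path


def _sentinel_path(cache_dir: str, repo_id: str, revision: str) -> Path:
    # Hypothetical layout: one JSON file per (repo, revision) listing files
    # that a previous run confirmed do not exist on the Hub.
    return Path(cache_dir) / repo_id.replace("/", "--") / revision / "missing_files.json"


def record_missing(cache_dir: str, repo_id: str, revision: str, filename: str) -> None:
    sentinel = _sentinel_path(cache_dir, repo_id, revision)
    sentinel.parent.mkdir(parents=True, exist_ok=True)
    missing = set(json.loads(sentinel.read_text())) if sentinel.exists() else set()
    missing.add(filename)
    sentinel.write_text(json.dumps(sorted(missing)))


def is_known_missing(cache_dir: str, repo_id: str, revision: str, filename: str) -> bool:
    sentinel = _sentinel_path(cache_dir, repo_id, revision)
    return sentinel.exists() and filename in json.loads(sentinel.read_text())

With something like this, a load pinned to a commit sha could skip the HEAD call for files such as added_tokens.json that are already known to be absent at that revision.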

@sgugger
Collaborator

sgugger commented Aug 9, 2022

Yes there is probably some way to cache that the file does not exist for a given commit sha. Pinged a few people internally to see if they like that idea and will let you know when I hear back!

@github-actions

github-actions bot commented Sep 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Sep 12, 2022
@Narsil
Contributor Author

Narsil commented Sep 27, 2022

Unstale. I'll come back to this at some point.

@Narsil Narsil reopened this Sep 27, 2022
@sgugger
Collaborator

sgugger commented Sep 27, 2022

There is only one call to head now once the model is cached, @Narsil.

@Narsil
Contributor Author

Narsil commented Sep 27, 2022

You're right, closing.

@Narsil Narsil closed this Sep 27, 2022