[TENTATIVE] Attempt to reduce number of HEAD calls during model warmup. #18429
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
There is no real point doing this right now, as we are going to move away from the current implementation of the cache to one that is more git-based in huggingface_hub (working on it this week, actually), and then only one call to HEAD will be necessary when nothing has changed.
Ok, marking as draft while the other PR is being worked on.
@sgugger If you want to take a look at the tests: right now the tests are failing, since the cache was written for the previous HEAD code. This PR now holds a few (relatively hackish) attempts to preserve the cached information. The one thing that surprised me is how the pipeline loads the model. WDYT?
Thanks for fixing the telemetry!
The caching part should go more in huggingface_hub IMO, especially now that we rely on it for everything. But I also think people might have strong opinions on it (if a user just updated a model and doesn't see it downloaded, for instance, they'll be mad and won't necessarily understand there is some cache to clear).
In any case the number of calls will be reduced to just one in most cases after the work I'm starting today to leverage the new cache and commit shas :-)
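To make the "commit shas" idea above concrete, here is a minimal, hypothetical sketch (none of these helper names come from transformers or huggingface_hub): a single network round-trip resolves a ref like "main" to a commit sha, after which every file lookup is a pure local-disk check keyed by that sha.

```python
import os
import tempfile

# Hypothetical sketch of "resolve once, then hit the sha-keyed cache":
# one network call maps a ref to a commit sha, and every file lookup after
# that is local-only.

def resolve_ref(repo_id, ref, head_call):
    """One network round-trip: ref -> commit sha (head_call is injected)."""
    return head_call(repo_id, ref)

def cached_file(cache_dir, repo_id, sha, filename):
    """Local-only lookup: no network needed once the sha is known."""
    path = os.path.join(cache_dir, repo_id.replace("/", "--"), sha, filename)
    return path if os.path.exists(path) else None

# Demo with a fake HEAD call so we can count network round-trips.
calls = {"n": 0}

def fake_head(repo_id, ref):
    calls["n"] += 1
    return "abc123"  # pretend sha

with tempfile.TemporaryDirectory() as cache_dir:
    sha = resolve_ref("gpt2", "main", fake_head)
    target = os.path.join(cache_dir, "gpt2", sha)
    os.makedirs(target)
    open(os.path.join(target, "config.json"), "w").close()
    # Several file lookups afterwards, still only one network call total.
    for fname in ["config.json", "config.json", "missing.json"]:
        cached_file(cache_dir, "gpt2", sha, fname)

print(calls["n"])  # -> 1
```

The point of the sketch is the call count: any number of file lookups for the same resolved sha cost zero additional network traffic.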
@@ -502,7 +511,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
     config = kwargs.pop("config", None)
     kwargs["_from_auto"] = True

-    use_fast = kwargs.pop("use_fast", True)
+    use_fast = kwargs.get("use_fast", True)
This means use_fast stays in the kwargs, which are then sent to classes like AutoConfig, which don't use it. So I'd leave it as pop here, and if it's missing in a call down there, let's add it manually.
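A toy illustration of the point above (these functions are made up for the example, not the actual transformers classes): with get, the kwarg stays in the dict and leaks into whatever the remaining kwargs are forwarded to; with pop, it is consumed.

```python
# Toy illustration of pop vs get when leftover kwargs are forwarded downstream.

def downstream_config(**kwargs):
    # Stand-in for a class like AutoConfig that rejects unknown kwargs.
    allowed = {"vocab_size"}
    unexpected = set(kwargs) - allowed
    if unexpected:
        raise TypeError(f"unexpected kwargs: {unexpected}")
    return kwargs

def from_pretrained_with_get(**kwargs):
    use_fast = kwargs.get("use_fast", True)   # stays in kwargs...
    return downstream_config(**kwargs)        # ...and leaks downstream

def from_pretrained_with_pop(**kwargs):
    use_fast = kwargs.pop("use_fast", True)   # removed from kwargs
    return downstream_config(**kwargs)

print(from_pretrained_with_pop(use_fast=False, vocab_size=100))
try:
    from_pretrained_with_get(use_fast=False, vocab_size=100)
except TypeError as e:
    print("error:", e)
```

The pop version forwards only {'vocab_size': 100}; the get version raises because use_fast reaches a consumer that does not accept it.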
I'll wait for your work on the cache; that should clear the need for it, and the tests here should cover it. I'll update this PR to make sure the user-agent is correct afterwards. I had a TTL of 10s for the cache, which is largely enough in most cases (well, you would still have multiple hits if you were actually downloading the files, but...). Removing the need for the cache is the best solution, so let's go with that for now.
@Narsil in case it's relevant, I was observing a significant perf disparity with the following repro (all files have been cached), depending on whether I specify TRANSFORMERS_OFFLINE=1.

test.py:

from transformers import pipeline
nlp = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real 0m1.759s
user 0m1.815s
sys 0m2.443s

$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real 0m6.187s
user 0m2.184s
sys 0m2.248s

I then searched around and stumbled upon your PR. I tried patching it down into a venv linked to my repo and saw that the time was roughly the same:

(venv) ankur-m1:~/projects/layoutlm ankur$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real 0m6.034s
user 0m2.019s
sys 0m2.346s

I may have completely misunderstood the purpose of this change, so please ignore me if this comment is irrelevant, but since I was poking around with perf in a similar fashion, I thought I'd share in case it's helpful!
This PR caches files for a very small amount of time (10s), because most of the time users will want the new models when they exist. You can try to increase the TTL if you want. You may want to check where the network overhead is occurring too; it could be DNS issues on your end, or just high latency to the HF servers.
Ahh, okay, that makes sense. Let me dig around a bit further with my repro and see if I can find anything useful. Thanks for the pointers.
Btw, @sgugger is working on a better fix; we should reduce the number of network calls to as close to 1 as possible.
@sgugger if it's helpful to have a guinea pig test your PR in the wild, I'm happy to help! For background context, the reason I'm trying to optimize this cold start is that I'm trying to use transformers in a command line utility where cold start time matters quite a bit.
The PR is #18534, but nothing will beat using offline mode with the model cached, since you are then doing 0 calls to the API.
Thanks for the pointer @sgugger. I agree, however, the disparity exists even if you pin the revision:
While playing around with this, I noticed issue #18537, which might be leading to extra network calls (since the revision isn't pinned for the model) in my repro. Apologies if I'm missing something obvious here, but I'd expect that (a) if the revision is specified and (b) it's cached, then there shouldn't be any network calls.
If the revision is specified as a commit sha, then yes, the cache should be used. This is not implemented by #18534, however, but could be some follow-up work. The only exception is for files that don't exist, as we don't know if they haven't been downloaded yet or if they are not present. That's why we still have extra calls in #18534 and would still have extra calls in this case as well.
Got it, I pulled down your PR and ran the same test, and saw much better results:
I'd be happy to help with the follow-up work/exploration if helpful. I think you could theoretically handle the "all files downloaded" case too, by caching a file that simply marks that you've downloaded all files associated with a revision.
Yes, there is probably some way to cache that the file does not exist for a given commit sha. Pinged a few people internally to see if they like that idea and will let you know when I hear back!
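One way the "cache the nonexistence" idea could be sketched is with a marker-file scheme: alongside the files downloaded for a sha, store an empty marker meaning "this file is known not to exist at this commit", so a later lookup can answer "not present" without any network call. This is a hypothetical sketch under assumed names, not an API of transformers or huggingface_hub.

```python
import os
import tempfile

# Hypothetical "known-missing" marker scheme, keyed by commit sha.

def mark_missing(cache_dir, sha, filename):
    """Record that `filename` does not exist at commit `sha`."""
    marker_dir = os.path.join(cache_dir, "no_exist", sha)
    os.makedirs(marker_dir, exist_ok=True)
    open(os.path.join(marker_dir, filename), "w").close()

def is_known_missing(cache_dir, sha, filename):
    """Answer 'does this file exist at this sha?' with zero network calls."""
    return os.path.exists(os.path.join(cache_dir, "no_exist", sha, filename))

with tempfile.TemporaryDirectory() as cache:
    sha = "abc123"  # pretend commit sha
    mark_missing(cache, sha, "added_tokens.json")
    print(is_known_missing(cache, sha, "added_tokens.json"))  # True
    print(is_known_missing(cache, sha, "config.json"))        # False
```

Since a commit sha is immutable, a marker recorded once stays valid forever for that revision, which is what makes this safe where a TTL-based cache is not.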
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale. I'll come back to this at some point.
There is only one call to head now once the model is cached @Narsil |
You're right, closing. |
What does this PR do?
When doing model loading with the various tools within transformers, there are actually a lot of duplicate HEAD calls that cost network time and inflate usage counts unnecessarily.
For instance, when doing a simple pipeline(model="gpt2"), you end up doing a HEAD call 4 to 5 times on config.json, and 3 times on tokenizer_config.json. Each of these is a call to the Hub which requires loading some resources and could be avoided.
@XciD
In addition, it adds a lot of noise in our logs, since we get multiple HEAD calls for the same code being run.
Fixing it "cleanly" is difficult, since there are many pathways that load the various elements, and checking every single path is hard.
The proposed fix is to simply introduce a timed_cache wrapper on top of the request.head function.
We can keep a very short TTL, since it's only meant to deduplicate calls made within a window where the model is unlikely to have changed. We need to keep in mind Jupyter or other long-lived sessions, so a TTL is needed so that model updates can still be seen and downloaded.
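A minimal sketch of what such a timed_cache wrapper could look like (a hypothetical re-implementation for illustration, not this PR's actual code): memoize results keyed by the call arguments for ttl seconds, so back-to-back identical HEAD calls collapse into one, while stale entries naturally expire.

```python
import functools
import time

# Sketch of a TTL-based memoizer for a HEAD-like function.

def timed_cache(ttl=10.0):
    def decorator(fn):
        cache = {}  # args -> (timestamp, result)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < ttl:
                return hit[1]  # fresh enough: skip the network call
            result = fn(*args)
            cache[args] = (now, result)
            return result

        return wrapper
    return decorator

calls = {"n": 0}

@timed_cache(ttl=10.0)
def fake_head(url):
    # Stand-in for the real HEAD request; counts invocations.
    calls["n"] += 1
    return 200

for _ in range(5):
    fake_head("https://huggingface.co/gpt2/resolve/main/config.json")
print(calls["n"])  # -> 1
```

Within the TTL window, five identical calls result in a single underlying invocation; after the TTL expires, the next call would go through again, which is what keeps long-lived sessions able to pick up model updates.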
In addition to that, it seems each code path calls HEAD with a different user-agent, which (afaik) makes it harder to understand our users' usage.
This is a tentative PR, proposed to reduce redundant network calls. If this is thought to be a correct direction, I will then add unit testing for this timed_cache function.

After the PR:
Who can review?
@sgugger
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.