[TENTATIVE] Attempt to reduce number of HEAD calls during model warmup. #18429
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
There is no real point doing this right now, as we are going to move away from the current implementation of the cache to one that is more git-based in huggingface_hub (working on it this week, actually), and then only one call to HEAD will be necessary when nothing has changed.
Ok, marking as draft while the other PR is being worked on.
@sgugger If you want to take a look at the tests: right now the tests are failing, since the cache was written for the previous HEAD code. This PR now holds a few (relatively hackish) attempts to preserve the cached information. The one thing that surprised me is how the pipeline loads the model. WDYT?
Thanks for fixing the telemetry!
The caching part should go more in huggingface_hub IMO, especially now that we rely on it for everything. But I also think people might have strong opinions on it (if a user just updated a model and doesn't see it downloaded, for instance, they'll be mad and won't necessarily understand there is some cache to clear).
In any case the number of calls will be reduced to just one in most cases after the work I'm starting today to leverage the new cache and commit shas :-)
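To make the "commit shas" idea above concrete, here is a minimal, hypothetical sketch (none of these helper names come from transformers or huggingface_hub): a single network round-trip resolves a ref like "main" to a commit sha, after which every file lookup is a pure local-disk check keyed by that sha.

```python
import os
import tempfile

# Hypothetical sketch of "resolve once, then hit the sha-keyed cache":
# one network call maps a ref to a commit sha, and every file lookup after
# that is local-only.

def resolve_ref(repo_id, ref, head_call):
    """One network round-trip: ref -> commit sha (head_call is injected)."""
    return head_call(repo_id, ref)

def cached_file(cache_dir, repo_id, sha, filename):
    """Local-only lookup: no network needed once the sha is known."""
    path = os.path.join(cache_dir, repo_id.replace("/", "--"), sha, filename)
    return path if os.path.exists(path) else None

# Demo with a fake HEAD call so we can count network round-trips.
calls = {"n": 0}

def fake_head(repo_id, ref):
    calls["n"] += 1
    return "abc123"  # pretend sha

with tempfile.TemporaryDirectory() as cache_dir:
    sha = resolve_ref("gpt2", "main", fake_head)
    target = os.path.join(cache_dir, "gpt2", sha)
    os.makedirs(target)
    open(os.path.join(target, "config.json"), "w").close()
    # Several file lookups afterwards, still only one network call total.
    for fname in ["config.json", "config.json", "missing.json"]:
        cached_file(cache_dir, "gpt2", sha, fname)

print(calls["n"])  # -> 1
```

The point of the sketch is the call count: any number of file lookups for the same resolved sha cost zero additional network traffic.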
@@ -502,7 +511,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
     config = kwargs.pop("config", None)
     kwargs["_from_auto"] = True

-    use_fast = kwargs.pop("use_fast", True)
+    use_fast = kwargs.get("use_fast", True)
This means use_fast stays in the kwargs, which are then sent to classes like AutoConfig, which don't use it. So I'd leave it as pop here, and if it's missing in a call down there, let's add it manually.
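A toy illustration of the point above (these functions are made up for the example, not the actual transformers classes): with get, the kwarg stays in the dict and leaks into whatever the remaining kwargs are forwarded to; with pop, it is consumed.

```python
# Toy illustration of pop vs get when leftover kwargs are forwarded downstream.

def downstream_config(**kwargs):
    # Stand-in for a class like AutoConfig that rejects unknown kwargs.
    allowed = {"vocab_size"}
    unexpected = set(kwargs) - allowed
    if unexpected:
        raise TypeError(f"unexpected kwargs: {unexpected}")
    return kwargs

def from_pretrained_with_get(**kwargs):
    use_fast = kwargs.get("use_fast", True)   # stays in kwargs...
    return downstream_config(**kwargs)        # ...and leaks downstream

def from_pretrained_with_pop(**kwargs):
    use_fast = kwargs.pop("use_fast", True)   # removed from kwargs
    return downstream_config(**kwargs)

print(from_pretrained_with_pop(use_fast=False, vocab_size=100))
try:
    from_pretrained_with_get(use_fast=False, vocab_size=100)
except TypeError as e:
    print("error:", e)
```

The pop version forwards only {'vocab_size': 100}; the get version raises because use_fast reaches a consumer that does not accept it.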
I'll wait for your work on the cache; that should clear the need for it, and the tests here should cover it. I'll update this PR to make sure the user-agent is correct afterwards. I had a TTL of 10s for the cache, which is largely enough in most cases (well, you would still have multiple hits if you were actually downloading the files, but...). Removing the need for the cache is the best solution, so let's go with that for now.
@Narsil in case it's relevant, I was observing a significant perf disparity with the following repro (all files have been cached), depending on whether I specify TRANSFORMERS_OFFLINE=1.

test.py:

from transformers import pipeline
nlp = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

$ time TRANSFORMERS_OFFLINE=1 python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real 0m1.759s
user 0m1.815s
sys 0m2.443s

$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real 0m6.187s
user 0m2.184s
sys 0m2.248s

I then searched around and stumbled upon your PR. I tried patching it down into a venv linked to my repo and saw that the time was roughly the same:

(venv) ankur-m1:~/projects/layoutlm ankur$ time python transformers_test.py
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json

real 0m6.034s
user 0m2.019s
sys 0m2.346s

I may have completely misunderstood the purpose of this change, so please ignore me if this comment is irrelevant, but since I was poking around with perf in a similar fashion, I thought I'd share in case it's helpful!
This PR caches files for a very small amount of time (10s), because most of the time users will want the new models when they exist. You can try to increase the TTL if you want. You may want to check where the network overhead is occurring too; it could be DNS issues on your end, or just high latency to the HF servers.
Ahh, okay, that makes sense. Let me dig around a bit further with my repro and see if I can find anything useful. Thanks for the pointers.
Btw, @sgugger is working on a better fix; we should reduce the number of network calls to as close to 1 as possible.
@sgugger if it's helpful to have a guinea pig test your PR in the wild, I'm happy to help! For background context, the reason I'm trying to optimize this cold start is that I'm trying to use transformers in a command line utility where cold start time matters quite a bit.
The PR is #18534, but nothing will beat using offline mode with the model cached, since you are then doing 0 calls to the API.
Thanks for the pointer @sgugger. I agree, however, the disparity exists even if you pin the revision:
While playing around with this, I noticed issue #18537, which might be leading to extra network calls (since the revision isn't pinned for the model) in my repro. Apologies if I'm missing something obvious here, but I'd expect that (a) if the revision is specified and (b) it's cached, then there shouldn't be any network calls.
If the revision is specified as a commit sha, then yes, the cache should be used. This is not implemented by #18534, however, but could be some follow-up work. The only exception is for files that don't exist, as we don't know if they haven't been downloaded yet or if they are not present. That's why we still have extra calls in #18534 and would still have extra calls in this case as well.
Got it, I pulled down your PR and ran the same test, and saw much better results:
I'd be happy to help with the follow-up work/exploration if helpful. I think you could theoretically handle the "all files downloaded" case too, by caching a file that simply marks that you've downloaded all files associated with a revision.
Yes, there is probably some way to cache that the file does not exist for a given commit sha. Pinged a few people internally to see if they like that idea and will let you know when I hear back!
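One way the "cache the nonexistence" idea could be sketched is with a marker-file scheme: alongside the files downloaded for a sha, store an empty marker meaning "this file is known not to exist at this commit", so a later lookup can answer "not present" without any network call. This is a hypothetical sketch under assumed names, not an API of transformers or huggingface_hub.

```python
import os
import tempfile

# Hypothetical "known-missing" marker scheme, keyed by commit sha.

def mark_missing(cache_dir, sha, filename):
    """Record that `filename` does not exist at commit `sha`."""
    marker_dir = os.path.join(cache_dir, "no_exist", sha)
    os.makedirs(marker_dir, exist_ok=True)
    open(os.path.join(marker_dir, filename), "w").close()

def is_known_missing(cache_dir, sha, filename):
    """Answer 'does this file exist at this sha?' with zero network calls."""
    return os.path.exists(os.path.join(cache_dir, "no_exist", sha, filename))

with tempfile.TemporaryDirectory() as cache:
    sha = "abc123"  # pretend commit sha
    mark_missing(cache, sha, "added_tokens.json")
    print(is_known_missing(cache, sha, "added_tokens.json"))  # True
    print(is_known_missing(cache, sha, "config.json"))        # False
```

Since a commit sha is immutable, a marker recorded once stays valid forever for that revision, which is what makes this safe where a TTL-based cache is not.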
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale. I'll come back to this at some point.
There is only one call to head now once the model is cached @Narsil |
You're right, closing. |
What does this PR do?
When doing model loading with the various tools within transformers, there are actually a lot of duplicate HEAD calls that cost network time and inflate usage counts unnecessarily.
For instance, when doing a simple pipeline(model="gpt2"), you end up doing a HEAD call 4 to 5 times on config.json, and 3 times on tokenizer_config.json. Each of these is a call to the Hub which requires loading some resources and could be avoided.
@XciD
In addition, it adds a lot of noise in our logs, since we get multiple HEAD calls for the same code being run.
Fixing it "cleanly" is difficult, since there are many pathways that load the various elements, and checking every single path is hard.
The proposed fix is to simply introduce a timed_cache wrapper on top of the request.head function.
We can keep a very short TTL, since it's only meant to deduplicate calls made within a window where the model is unlikely to have changed. We need to keep in mind Jupyter or other long-lived sessions, so a TTL is needed so that model updates can still be seen and downloaded.
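A minimal sketch of what such a timed_cache wrapper could look like (a hypothetical re-implementation for illustration, not this PR's actual code): memoize results keyed by the call arguments for ttl seconds, so back-to-back identical HEAD calls collapse into one, while stale entries naturally expire.

```python
import functools
import time

# Sketch of a TTL-based memoizer for a HEAD-like function.

def timed_cache(ttl=10.0):
    def decorator(fn):
        cache = {}  # args -> (timestamp, result)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < ttl:
                return hit[1]  # fresh enough: skip the network call
            result = fn(*args)
            cache[args] = (now, result)
            return result

        return wrapper
    return decorator

calls = {"n": 0}

@timed_cache(ttl=10.0)
def fake_head(url):
    # Stand-in for the real HEAD request; counts invocations.
    calls["n"] += 1
    return 200

for _ in range(5):
    fake_head("https://huggingface.co/gpt2/resolve/main/config.json")
print(calls["n"])  # -> 1
```

Within the TTL window, five identical calls result in a single underlying invocation; after the TTL expires, the next call would go through again, which is what keeps long-lived sessions able to pick up model updates.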
In addition to that, it seems each code path calls HEAD with a different user-agent, which (afaik) makes it harder to understand our users' usage.
This is a tentative PR, proposed to reduce redundant network calls. If this is thought to be a correct direction, I will then add unit testing for this timed_cache function.

After the PR:
Who can review?
@sgugger
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.