oci_image based on pytorch fails with "could not parse reference" #436

Closed
faximan opened this issue Dec 13, 2023 · 9 comments
faximan commented Dec 13, 2023

I am trying to build a simple "empty" image based on pytorch:

# WORKSPACE
oci_pull(
    name = "pytorch_base",
    image = "docker.io/pytorch/pytorch",
    tag = "1.12.1-cuda11.3-cudnn8-devel",
)

# BUILD
oci_image(
    name = "pytorch_image",
    base = "@pytorch_base",
)

Building pytorch_image fails with an error I cannot decipher:

$ bazel-6.3.2 build //:pytorch_image --verbose_failures
ERROR: /usr/local/home/faximan/repo//BUILD:3:10: OCI Image //:pytorch_image failed: (Exit 1): image_pytorch_image.sh failed: error executing command (from target //:pytorch_image)
  (cd /usr/local/home/faximan/.cache/bazel/_bazel_faximan/aa7e559759091793a65d16afaac928ae/execroot/repo && \
  exec env - \
  bazel-out/k8-fastbuild/bin/image_pytorch_image.sh mutate oci:layout/bazel-out/k8-fastbuild/bin/external/pytorch_base_single/layout '--output=bazel-out/k8-fastbuild/bin/pytorch_image')
# Configuration: 080d533b09bde3a5731cc76453084f9861e149e4d6a9fc0dc59f983aed67055b
# Execution platform: //bazel/rbe_config/config:platform
2023/12/13 14:28:11 storing blobs in bazel-out/k8-fastbuild/bin//storage_pytorch_image
2023/12/13 14:28:11 serving on port 34765
2023/12/13 14:28:11 GET /v2/
2023/12/13 14:28:11 HEAD /v2/oci/layout/manifests/latest 404 NAME_UNKNOWN Unknown name
2023/12/13 14:28:11 HEAD /v2/oci/layout/blobs/sha256:8889d023c18ea53dc42fff5ad1d81de2858ca7990ccffb179850543b0877c5e9 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:11 HEAD /v2/oci/layout/blobs/sha256:40dd5be53814ae70b2898558673b7ea18d58bf7ab3433560b9ce3cb76d9ff0b1 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:11 HEAD /v2/oci/layout/blobs/sha256:74115353f57d71e0b7291d9e8c0ebee2b4068925574bf9fbc7119708286c5276 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:11 HEAD /v2/oci/layout/blobs/sha256:0645c48e225b632d6eadd5a34db26f3aa051a91ce43007b8189a1759f2c0758e 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:11 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:11 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:11 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:11 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:11 PATCH /v2/oci/layout/blobs/uploads/3407172736063258897
2023/12/13 14:28:12 PATCH /v2/oci/layout/blobs/uploads/1974255942134569984
2023/12/13 14:28:12 PATCH /v2/oci/layout/blobs/uploads/7645361527262089714
2023/12/13 14:28:12 PATCH /v2/oci/layout/blobs/uploads/3314782370907986409
2023/12/13 14:28:12 PUT /v2/oci/layout/blobs/uploads/3407172736063258897?digest=sha256%3A8889d023c18ea53dc42fff5ad1d81de2858ca7990ccffb179850543b0877c5e9
2023/12/13 14:28:12 HEAD /v2/oci/layout/blobs/sha256:9e0ea72fe76c77bca208f180d6349a80d0e8576ae9260542fd9c4892b2acf8df 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:12 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:12 PUT /v2/oci/layout/blobs/uploads/1974255942134569984?digest=sha256%3A40dd5be53814ae70b2898558673b7ea18d58bf7ab3433560b9ce3cb76d9ff0b1
2023/12/13 14:28:12 HEAD /v2/oci/layout/blobs/sha256:13f51f3f80fd8f1a9a131175cbaf33b432fcf723e22cb28d2c3bda0e0f52e8ad 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:12 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:12 PUT /v2/oci/layout/blobs/uploads/7645361527262089714?digest=sha256%3A0645c48e225b632d6eadd5a34db26f3aa051a91ce43007b8189a1759f2c0758e
2023/12/13 14:28:12 PUT /v2/oci/layout/blobs/uploads/3314782370907986409?digest=sha256%3A74115353f57d71e0b7291d9e8c0ebee2b4068925574bf9fbc7119708286c5276
2023/12/13 14:28:12 PATCH /v2/oci/layout/blobs/uploads/5511052013354117544
2023/12/13 14:28:12 HEAD /v2/oci/layout/blobs/sha256:8d9b8d71f868a6810fbb5ef2937b7460ff98cf3c9e8210be32105f1f1c0eed73 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:12 HEAD /v2/oci/layout/blobs/sha256:c4ba9103d17de47501645adba914b86b965ab37d0ae612efd74bfa98b7d6cc4e 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:12 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:12 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:25 PATCH /v2/oci/layout/blobs/uploads/5885446964979633047
2023/12/13 14:28:25 PUT /v2/oci/layout/blobs/uploads/5511052013354117544?digest=sha256%3A9e0ea72fe76c77bca208f180d6349a80d0e8576ae9260542fd9c4892b2acf8df
2023/12/13 14:28:25 PATCH /v2/oci/layout/blobs/uploads/6133444309932702718
2023/12/13 14:28:25 HEAD /v2/oci/layout/blobs/sha256:97badaa6a776948df87a229c0b38266af20a344470d5a7e199a7d98c8c0629b7 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:25 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:44 PATCH /v2/oci/layout/blobs/uploads/4534439631410548796
2023/12/13 14:28:44 PUT /v2/oci/layout/blobs/uploads/6133444309932702718?digest=sha256%3Ac4ba9103d17de47501645adba914b86b965ab37d0ae612efd74bfa98b7d6cc4e
2023/12/13 14:28:44 HEAD /v2/oci/layout/blobs/sha256:63041bc7286a0f20672fb30b4bfbd067cc313452d9a600f109add0ecc23e4a6e 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:44 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:49 PUT /v2/oci/layout/blobs/uploads/5885446964979633047?digest=sha256%3A13f51f3f80fd8f1a9a131175cbaf33b432fcf723e22cb28d2c3bda0e0f52e8ad
2023/12/13 14:28:49 PATCH /v2/oci/layout/blobs/uploads/7691740193715543102
2023/12/13 14:28:49 HEAD /v2/oci/layout/blobs/sha256:5f237510596a196844c02f4b78863c55ac8c414d0ab2f300c8538e356ad7e9d7 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:49 POST /v2/oci/layout/blobs/uploads/
2023/12/13 14:28:59 PUT /v2/oci/layout/blobs/uploads/4534439631410548796?digest=sha256%3A8d9b8d71f868a6810fbb5ef2937b7460ff98cf3c9e8210be32105f1f1c0eed73
2023/12/13 14:28:59 HEAD /v2/oci/layout/blobs/sha256:e84ba40a3ba2c73a98e93d3afcb03753d59e08088441657850086bb6d8342f7c 404 BLOB_UNKNOWN Unknown blob
2023/12/13 14:28:59 POST /v2/oci/layout/blobs/uploads/
/tmp/tmp.AqSNy40B4n
Error: pulling : parsing reference "": could not parse reference:
Target //:pytorch_image failed to build

This only happens with the pytorch image; every other base image we use builds fine.

I'd love to share a full reproduction, but so far I can only reproduce this in our custom RBE environment, which is not easy to share here. My hope is that somebody can tell me whether this is a problem with rules_oci, pytorch, or RBE.

Thanks so much!

@thesayyn
Collaborator

Could you please share the version of rules_oci you are using? pytorch is known to have huge layers, which might not fit into memory, so the process may be getting OOM killed.

thesayyn added the "need: investigation" label Dec 13, 2023
@faximan
Author

faximan commented Dec 13, 2023

Sorry, forgot: it is the latest (1.4.3).

Yes, layers are pretty large. Any way we can test this theory? I suppose I could try to find another big image to see if I get the same error.

@thesayyn
Collaborator

You can try another big image to see if it also fails.

@thesayyn
Collaborator

cross ref: bazelbuild/bazel#17368

@faximan
Author

faximan commented Dec 14, 2023

Yes, Sahin, you are right. I tested two things:

  • Generating a huge dummy image with RUN dd if=/dev/urandom of=/thicc bs=1M count=1000: the build fails with a similar error.
  • Doubling the RAM on my RBE worker pool (8 GB to 16 GB): the build succeeds.
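
The first experiment can be approximated outside Docker. Below is a minimal sketch, assuming `dd`, `tar`, and `mktemp` are available; the function name `make_dummy_layer` and the output file name are hypothetical. It writes random data to a file named `thicc` and wraps it in a tarball, which could serve as a large test layer when checking whether the failure tracks layer size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a tarball holding SIZE_MIB MiB of random data.
make_dummy_layer() {
  local size_mib="$1" out_tar="$2"
  local workdir
  workdir="$(mktemp -d)"
  # Random bytes do not compress, so the layer stays large even after gzip.
  # bs is given in bytes (1 MiB) so the call works with both GNU and BSD dd.
  dd if=/dev/urandom of="$workdir/thicc" bs=1048576 count="$size_mib" 2>/dev/null
  tar -cf "$out_tar" -C "$workdir" thicc
  rm -rf "$workdir"
}

# Usage mirroring the experiment above (about 1000 MiB):
#   make_dummy_layer 1000 dummy_layer.tar
```

Varying the size argument while keeping worker memory fixed is one way to find roughly where builds start to fail.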

I guess my only question at this point is whether the failure message could hint at this being the problem. It took way too much effort to figure this out.

@thesayyn
Collaborator

OOM kills are hard to detect, unfortunately. There are some efforts on the Bazel side to estimate resource usage for actions, but that's not fixed yet.

@thesayyn
Collaborator

We can do better here, though: we can print an error message to the log if the process gets killed for any reason and hint that it might be due to insufficient memory.
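
A minimal sketch of such a hint, assuming the image tool is launched from a shell wrapper. The function name `run_with_oom_hint` is hypothetical, and this is not the actual rules_oci implementation. It relies on the shell convention that a process terminated by a signal reports exit status 128 plus the signal number, so SIGKILL (signal 9, which the Linux OOM killer sends) shows up as 137.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command and, if it was terminated by SIGKILL,
# print a hint that the cause may be insufficient memory.
run_with_oom_hint() {
  local status=0
  "$@" || status=$?
  # 137 = 128 + 9 (SIGKILL). The OOM killer is one common sender of SIGKILL,
  # but not the only one, so this is a hint rather than a diagnosis.
  if [ "$status" -eq 137 ]; then
    echo "hint: '$1' was killed by SIGKILL (exit 137); the worker may have run out of memory" >&2
  fi
  # Preserve the original exit status for the caller.
  return "$status"
}
```

Because SIGKILL can come from sources other than the OOM killer (for example, a timeout enforced by the executor), the message is worded as a possibility only.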

thesayyn added the "performance" label and removed the "need: investigation" label Dec 23, 2023
@thesayyn
Collaborator

Will be fixed by #505 #484

@thesayyn
Collaborator

thesayyn commented May 8, 2024

Fixed by #560.

thesayyn closed this as completed May 8, 2024