Re-consider container caching strategy to use registries #303

thomaseizinger · 2023-09-13T23:51:18Z

Whilst working on the interop-tests and taking a closer look at our container caching in general, I had the following thoughts:

The current caching mechanism introduces a fair bit of complexity:

~ 20 lines of Makefiles per version: https://github.com/libp2p/test-plans/blob/master/multidim-interop/impl/rust/v0.51/Makefile
Custom logic of computing the cache key and up/downloading the cache to S3: https://github.com/libp2p/test-plans/blob/master/multidim-interop/helpers/cache.ts

The above script needs to be manually called:

test-plans/.github/actions/run-interop-ping-test/action.yml

Lines 79 to 104 in b8235c9

    - name: Load cache and build
      working-directory: ${{ steps.find-workdir.outputs.WORK_DIR }}
      run: npm run cache -- load
      shell: bash
    - name: Assert Git tree is clean.
      working-directory: ${{ steps.find-workdir.outputs.WORK_DIR }}
      shell: bash
      run: |
        if [[ -n "$(git status --porcelain)" ]]; then
          echo "Git tree is dirty. This means that building an impl generated something that should probably be .gitignore'd"
          git status
          exit 1
        fi
    - name: Push the image cache
      if: env.PUSH_CACHE == 'true'
      working-directory: ${{ steps.find-workdir.outputs.WORK_DIR }}
      env:
        AWS_BUCKET: ${{ inputs.s3-cache-bucket }}
        AWS_REGION: ${{ inputs.aws-region }}
        AWS_ACCESS_KEY_ID: ${{ inputs.s3-access-key-id }}
        AWS_SECRET_ACCESS_KEY: ${{ inputs.s3-secret-access-key }}
      run: npm run cache -- push
      shell: bash

We need to manage AWS credentials:

test-plans/.github/actions/run-interop-ping-test/action.yml

Lines 39 to 43 in b8235c9

    - name: Configure AWS credentials for S3 build cache
      if: inputs.s3-access-key-id != '' && inputs.s3-secret-access-key != ''
      run: |
        echo "PUSH_CACHE=true" >> $GITHUB_ENV
      shell: bash

We need to set custom caching options for each docker build: https://github.com/libp2p/test-plans/blob/master/multidim-interop/dockerBuildWrapper.sh
I am about to duplicate all of the above for the hole-punching tests 😭

The benefit we are getting from this is that we can pretty much plug any commit of a repository into the Makefile, hit make and we end up with a container. The cache is thus purely a performance optimisation.

Is this worth the complexity? Yesterday, I discovered a subtle bug in our setup that made us not cache a particular Rust image, see #301.

The test runner is already designed to work in phases:

Generate the permutations of test cases
Generate a docker-compose.yml file
Run a particular docker-compose.yml file
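
To make the first phase concrete, here is a minimal sketch of how test-case permutations could be generated. It is not the actual test runner code: the `Version` shape, the `generatePermutations` helper, and the `go-v0.29` entry are assumptions made up for illustration.

```typescript
// Hypothetical, simplified shape of an entry in versions.ts.
interface Version {
  id: string;
  transports: string[];
}

interface TestCase {
  name: string;
  dialer: string;
  listener: string;
  transport: string;
}

// Phase 1: every ordered (dialer, listener) pair, once per transport
// that both sides support.
function generatePermutations(versions: Version[]): TestCase[] {
  const cases: TestCase[] = [];
  for (const dialer of versions) {
    for (const listener of versions) {
      for (const transport of dialer.transports) {
        if (listener.transports.indexOf(transport) === -1) continue;
        cases.push({
          name: dialer.id + " x " + listener.id + " (" + transport + ")",
          dialer: dialer.id,
          listener: listener.id,
          transport: transport,
        });
      }
    }
  }
  return cases;
}

// Illustrative data only.
const exampleVersions: Version[] = [
  { id: "rust-v0.51", transports: ["tcp", "quic-v1"] },
  { id: "go-v0.29", transports: ["tcp"] },
];
const exampleCases = generatePermutations(exampleVersions);
```

Each resulting test case would then feed the second phase, which writes one docker-compose.yml per case.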

If we used container registries instead, we could delete all of the above code by just referencing image IDs in the versions.ts file.
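
As a rough sketch of what this could look like, the snippet below renders a compose file directly from registry references. The `containerImage` field, the `renderCompose` helper, and the `ghcr.io/example/...` image names are hypothetical; a real compose file for the test suite would need more settings than shown here.

```typescript
// Hypothetical entry shape: a registry reference replaces the Makefile build.
interface RegistryVersion {
  id: string;
  containerImage: string; // assumed field name, not the current versions.ts schema
}

// Renders a minimal compose file that only references pre-built images.
function renderCompose(dialer: RegistryVersion, listener: RegistryVersion): string {
  return [
    "services:",
    "  dialer:",
    "    image: " + dialer.containerImage,
    "  listener:",
    "    image: " + listener.containerImage,
    "",
  ].join("\n");
}

// Illustrative image names only.
const rustImpl: RegistryVersion = {
  id: "rust-v0.51",
  containerImage: "ghcr.io/example/rust-libp2p-test:v0.51",
};
const goImpl: RegistryVersion = {
  id: "go-v0.29",
  containerImage: "ghcr.io/example/go-libp2p-test:v0.29",
};
const composeYaml = renderCompose(rustImpl, goImpl);
```

With this shape, no build or cache step is needed before running a test case, because the images are fetched from the registry when the compose file is run.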

When debugging code, we could always generate the offending docker-compose.yml file first and swap the image reference for one that points at a Dockerfile, which would build the container locally instead. #282 already hints at this too.
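
A minimal sketch of that swap, assuming the compose service has already been parsed into a plain object: the `ComposeService` type, the `useLocalBuild` helper, and the paths are made up for illustration, while `image` and `build.context` are standard compose keys.

```typescript
// Simplified view of one service entry in a parsed docker-compose.yml.
interface ComposeService {
  image?: string;
  build?: { context: string };
  [key: string]: unknown;
}

// Returns a copy of the service that builds from a local checkout
// instead of referencing a registry image.
function useLocalBuild(service: ComposeService, context: string): ComposeService {
  const copy: ComposeService = { ...service };
  delete copy.image;
  copy.build = { context: context };
  return copy;
}

// Illustrative values only.
const fromRegistry: ComposeService = {
  image: "ghcr.io/example/rust-libp2p-test:v0.51",
  environment: { transport: "tcp" },
};
const fromSource = useLocalBuild(fromRegistry, "../rust-libp2p");
```

All other keys of the service are carried over unchanged, so the test case itself stays the same and only the origin of the container differs.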

Currently, the "contract" between libp2p/test-plans and the repositories is that they need to provide a Makefile which builds a container. The new contract would be that they need to provide a Dockerfile that builds a functioning version. This would likely be more useful because not every developer has all build-environments of other languages set up.

Have I missed anything? Opinions welcome.

thomaseizinger · 2023-09-14T21:36:59Z

cc @mxinden @achingbrain @marten-seemann

mxinden · 2023-10-16T07:48:32Z

Sorry for the late reply.

Generally I am in favor of the move to container registries for their simplicity.

That said, I don't have the capacity / priority today to change the existing transport interop caching strategy, nor to review all the necessary changes.

Thus, unless we have two people owning this change, I suggest keeping it as is.

What are other people's thoughts?

thomaseizinger · 2023-10-16T11:17:17Z

> That said, I don't have the capacity / priority today to change the existing transport interop caching strategy, nor to review all the necessary changes.
>
> Thus, unless we have two people owning this change, I suggest keeping it as is.

I am happy to implement the changes. Review should be trivial as we are just deleting code. Docker-compose implicitly pulls images that are not available locally, so we don't even have to do anything to make it work with registry-hosted images.

thomaseizinger · 2023-10-16T11:18:37Z

> When debugging code, we could always generate the offending docker-compose.yml file first and swap the container reference out to a point at a Dockerfile instead which would build a local container instead. #282 already hints at this too.

FWIW, in #304 I adopted this design and always generate the docker-compose.yml file as a build artifact. Thus, it is trivial to modify it to build a container instead.

achingbrain · 2023-10-16T14:36:58Z

I think the caching strategy was implemented as-is because building the artefacts required to run the tests was previously very slow - if memory serves it was over 25 minutes on GH CI.

I’m not against any of the proposed changes as long as it doesn’t increase the time taken to run the test suite.

thomaseizinger · 2023-10-17T00:46:04Z

> I’m not against any of the proposed changes as long as it doesn’t increase the time taken to run the test suite.

The idea is to decrease the time even further. At the moment, all images are pulled from the cache in sequence, whereas I am pretty sure docker compose would pull images in parallel.

> I think the caching strategy was implemented as-is

From memory, what was important to @MarcoPolo is that everything is reproducible from a given commit. See also the design paragraph at the top in https://github.com/libp2p/test-plans/tree/master/transport-interop#transport-interoperability-tests.

Whilst I think that is a noble goal, it isn't fully true anyway. For example, we still depend on all sorts of software being installed and we don't pin the version of these tools. Referencing docker images by hash gives us similarish guarantees. We can still build a docker container from the exact same Git hash. It might not be bit-for-bit the same container but we also don't have this guarantee at the moment.
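
For illustration, a small check that an image reference is pinned by content hash (the `name@sha256:<digest>` form) rather than by a mutable tag could look like the sketch below. The `isPinnedByDigest` helper and the image names are hypothetical, and the pattern only covers sha256 digests.

```typescript
// True if the reference ends in a sha256 digest, e.g. "name@sha256:<64 hex chars>".
function isPinnedByDigest(ref: string): boolean {
  return /@sha256:[0-9a-f]{64}$/.test(ref);
}

// Illustrative references only; the digest is a dummy value.
const dummyDigest = new Array(65).join("a"); // 64 times "a"
const pinnedRef = "ghcr.io/example/rust-libp2p-test@sha256:" + dummyDigest;
const taggedRef = "ghcr.io/example/rust-libp2p-test:v0.51";

const pinnedOk = isPinnedByDigest(pinnedRef);
const taggedOk = isPinnedByDigest(taggedRef);
```

A check like this could run over versions.ts in CI so that a tag, which can be moved to a different image later, never slips in by accident.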

thomaseizinger mentioned this issue Sep 26, 2023

feat(hole-punch): add hole-punch interoperability test suite #304

Merged
