Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let PR workflow check file sizes #4005

Merged

Conversation

bernt-matthias
Copy link
Contributor

Large test data should be avoided. So lets automatize this in the PR workflow. .. Sometimes it can even create problems: galaxyproject/galaxy#12604

At the moment we have 410 files > 500k and 225 files > 1M

We can argue if we want to fail determine-success if large files are found.

FOR CONTRIBUTOR:

  • - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • - License permits unrestricted use (educational + commercial)
  • - This PR adds a new tool or tool collection
  • - This PR updates an existing tool or tool collection
  • - This PR does something else (explain below)

@bernt-matthias bernt-matthias changed the title let PR workflow check file sizes Let PR workflow check file sizes Oct 4, 2021
.github/workflows/pr.yaml Outdated Show resolved Hide resolved
.github/workflows/pr.yaml Outdated Show resolved Hide resolved
.github/workflows/pr.yaml Outdated Show resolved Hide resolved
@nsoranzo
Copy link
Member

nsoranzo commented Oct 4, 2021

At the moment we have 410 files > 500k and 225 files > 1M

I get a bit different numbers:

$ find . -size +500k | grep -v '^./.git/' | wc -l
324
$ find . -size +1M | grep -v '^./.git/' | wc -l
153

These are the worst offenders:

$ find . -size +10M | grep -v '^./.git/' | sort | xargs ls -1sh
24M ./tools/feelnc/test-data/genome_chr38.fa
62M ./tools/gatk4/test-data/chr20.fa
15M ./tools/kallisto/test-data/cached_locally/sacCer2_chrX.kallistei
20M ./tools/khmer/test-data/test-abund-read-2.oxlicg
11M ./tools/maxbin2/test-data/3/out.reassem/out.reads.noclass
13M ./tools/maxbin2/test-data/interleavedPE_unmapped_Sample3_small.fasta
15M ./tools/nanoplot/test-data/alignment.bam
47M ./tools/pangolin/test-data/2021-04-21/data/decision_tree_rules.txt
46M ./tools/pangolin/test-data/2021-04-21/data/lineages.metadata.csv
14M ./tools/pygenometracks/test-data/Li_et_al_2015.h5
22M ./tools/ucsc_blat/test-data/amaVit1_Gallus/amaVit1.fa

Co-authored-by: Nicola Soranzo <nicola.soranzo@gmail.com>
@nsoranzo
Copy link
Member

nsoranzo commented Oct 5, 2021

Can you add a (temporary) commit adding a large test file to see it at work?

@bernt-matthias
Copy link
Contributor Author

Can you add a (temporary) commit adding a large test file to see it at work?

Excellent idea. I would expect to get reported at least the following files (which seemed to be the smallest (larger than 500k) and the largest in the repo):

  • ./tools/artic/test-data/SRR11410539_seqtk_sample_500_1.fastq
  • ./tools/gatk4/test-data/chr20.fa

@bernt-matthias
Copy link
Contributor Author

Looks good:

Files larger than 500k found
tools/artic/test-data/SRR11410539_seqtk_sample_500_1.fastq
tools/gatk4/test-data/chr20.fa
Error: Process completed with exit code 1.

If you like I could add the size for each file.

Also the artifact is produced.

.github/workflows/pr.yaml Outdated Show resolved Hide resolved
Co-authored-by: Nicola Soranzo <nicola.soranzo@gmail.com>
@nsoranzo
Copy link
Member

nsoranzo commented Oct 5, 2021

Nice!
Two additional points for discussion:

  • This would fail builds also for PRs where the offending large file is pre-existing and not changed.
  • Instead of failing the build, the CI job could add a comment to the PR?

@bernt-matthias
Copy link
Contributor Author

Nice! Two additional points for discussion:

* This would fail builds also for PRs where the offending large file is pre-existing and not changed.

* Instead of failing the build, the CI job could add a comment to the PR?

Indeed. I also could imagine that there are tools where it is hard/impossible to find smaller test data. Then we should probably still accept it.

Is there a way to still be able to merge if this job fails? Just remove it from Check workflow success?

Comment is also an idea.

.github/workflows/pr.yaml Outdated Show resolved Hide resolved
.github/workflows/pr.yaml Outdated Show resolved Hide resolved
.github/workflows/pr.yaml Outdated Show resolved Hide resolved
@nsoranzo
Copy link
Member

nsoranzo commented Oct 5, 2021

It may very well be that the comment won't work until this get merged.

@bernt-matthias
Copy link
Contributor Author

@nsoranzo any idea why the workflow isn't running anymore?

@bernt-matthias
Copy link
Contributor Author

@nsoranzo any idea why the workflow isn't running anymore?

Also this one does not run: #4010

So maybe its a problem at github?

@nsoranzo
Copy link
Member

nsoranzo commented Oct 5, 2021

It seems workflows are running now.

This reverts commit 56409e0.
@bernt-matthias
Copy link
Contributor Author

From https://github.com/peter-evans/create-or-update-comment: Note: In public repositories this action does not work in pull_request workflows when triggered by forks.

@bernt-matthias bernt-matthias merged commit a786695 into galaxyproject:master Oct 6, 2021
@bernt-matthias bernt-matthias deleted the topic/ci-file-sizes branch March 14, 2022 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants