Error: Internal error (invalid zip archive). Please try again #89

Closed
Rittika1 opened this issue Jan 10, 2022 · 13 comments

@Rittika1

Rittika1 commented Jan 10, 2022

Hi,

I'm trying to download protein sequences of all Vertebrates using this command. It's running on Red Hat 8.3.

./datasets download genome taxon Vertebrates --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds --filename Vertebrates.zip
It starts off downloading okay, but every time the file size reaches 5.2 GB, it fails and gives this error.
[Screenshot: datasetdownload error]

Why does this keep happening every time? I know there is enough space in my folder, so it's not running out of space. Any help would be appreciated.

I have one more question:
I want to download all the protein sequences of the Vertebrates in one file. The current command I'm using divides the output into several folders, each with its own protein.faa file, and I concatenate all of them to make one protein sequence file. Is there a way to download them all into one file or folder?

Thank you

@ericcox1
Collaborator

ericcox1 commented Jan 11, 2022

Hi Rittika,

Thanks for your feedback.

For such a large amount of data, I would recommend downloading a dehydrated package then rehydrating it. For more details, see: How to download large genome data packages.

You can get proteins for all annotated vertebrate genomes in 3 steps:

  1. Download a dehydrated package (note that I have added the --annotated flag because protein sequences are only available for annotated genomes)
    datasets download genome taxon Vertebrates --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds --annotated --dehydrated --filename vertebrate-proteins.zip
  2. Unzip the downloaded zip archive
    unzip vertebrate-proteins.zip -d vertebrate-proteins/
  3. Rehydrate to get all the data
    datasets rehydrate --directory vertebrate-proteins/
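
The three steps above can be chained in a small script so that a failed step stops the run instead of feeding a broken archive to the next step. This is only a sketch: `fetch_proteins` is a hypothetical helper name, the flags are copied from the commands above, and it assumes `datasets` exits with a non-zero status when a step fails, which has not been verified here.

```shell
# Sketch: run download, unzip and rehydrate in order, stopping at the first failure.
# fetch_proteins is a hypothetical helper, not part of the datasets CLI.
fetch_proteins() {
    taxon="$1"                 # e.g. Vertebrates
    outdir="$2"                # e.g. vertebrate-proteins
    zipfile="${outdir}.zip"

    # Step 1: download the dehydrated package (flags as in the command above).
    datasets download genome taxon "$taxon" \
        --exclude-rna --exclude-seq --exclude-gff3 --exclude-genomic-cds \
        --annotated --dehydrated --filename "$zipfile" || return 1

    # Step 2: unzip the archive into its own directory.
    unzip "$zipfile" -d "$outdir/" || return 1

    # Step 3: rehydrate to fetch the actual data files.
    datasets rehydrate --directory "$outdir/" || return 1
}

# Usage: fetch_proteins Vertebrates vertebrate-proteins
```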

Note that there is no way to download all the protein sequences as a single file--you'll have to download and then concatenate to get the desired result. We are looking into adding this feature in the future.
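
Until that feature exists, the concatenation step can be scripted. A minimal sketch, assuming the rehydrated package holds one protein.faa per genome somewhere under the unzipped directory; `concat_proteins` is a hypothetical helper name, and the script searches by file name rather than relying on an exact folder layout.

```shell
# Sketch: gather every protein.faa under a directory into one FASTA file.
# concat_proteins is a hypothetical helper, not part of the datasets CLI.
concat_proteins() {
    srcdir="$1"      # unzipped and rehydrated package directory
    outfile="$2"     # combined FASTA file to write
    : > "$outfile"   # start from an empty output file
    # Sort the paths so the output order is stable between runs.
    find "$srcdir" -type f -name 'protein.faa' | sort |
    while IFS= read -r faa; do
        cat "$faa" >> "$outfile"
    done
}

# Usage: concat_proteins vertebrate-proteins all-proteins.faa
```

Counting header lines with `grep -c '^>' all-proteins.faa` afterwards gives a quick check that the combined file contains the expected number of sequences.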

Best,
Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI
eric.cox@nih.gov

@Rittika1
Author

Thank you Eric, this worked, and I could download the datasets. It would be very helpful to roll out the feature to download all files as one.

@ebur053

ebur053 commented Feb 22, 2022

Hi there,
I am having the same issue, with the error "Error: Internal error (invalid zip archive). Please try again"
However, I am attempting to download much smaller datasets - just ortholog downloads.
This happens even for the example commands:
datasets download ortholog gene-id 59272
The ortholog summary function appears to work fine. Any help would be appreciated.
Thanks,
Erica

@ericcox1
Collaborator

Hi Erica,

Thanks for your feedback and sorry to hear that you're having trouble with the download.
I tried running the command datasets download ortholog gene-id 59272 a few times from my home computer and I was unable to reproduce the problem. You may have encountered a transient network issue.
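
If the cause is a transient network problem, simply retrying is often enough. A rough sketch of a generic retry wrapper follows; `retry` is a hypothetical helper name, and it assumes the wrapped command signals failure through a non-zero exit status.

```shell
# Sketch: run a command up to N times, pausing between attempts.
# retry is a hypothetical helper, not part of the datasets CLI.
retry() {
    attempts="$1"    # maximum number of attempts
    delay="$2"       # seconds to wait between attempts
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # Return success as soon as one attempt succeeds.
        "$@" && return 0
        echo "attempt $n of $attempts failed" >&2
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Usage: retry 3 30 datasets download ortholog gene-id 59272
```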

Best,
Eric


@yexiao-cheng

Hi,

The same issue happened when I downloaded all SARS-CoV-2 GenBank genomes.

[Screenshot 2022-06-14 17 54 13]

I noticed that the datasets download virus genome command does not have the dehydrated/rehydrate option. Do you know why this is happening and how to fix it?

Thanks,
Yexiao

@chriswyatt1

Thanks for making this great tool, super helpful.

I also have this issue. Is there a way to know when there is a network error?

I am running the example dataset (using the current datasets version, 13.40.0):

datasets download genome accession GCF_000001405.40 --dehydrated --exclude-rna --exclude-genomic-cds 
Downloading: ncbi_dataset.zip    2.15kB 14.4MB/s
Error: Internal error (invalid zip archive). Please try again
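
In a script, one way to notice a bad download is to test the archive before using it. The sketch below relies on the `-t` (test) option of the standard `unzip` tool; `zip_ok` is a hypothetical helper name. It can only tell that the archive is missing, empty, or corrupt, not whether the network was the cause.

```shell
# Sketch: succeed only if the file exists, is non-empty, and passes
# unzip's built-in integrity test.
# zip_ok is a hypothetical helper, not part of the datasets CLI.
zip_ok() {
    [ -s "$1" ] && unzip -tq "$1" > /dev/null 2>&1
}

# Usage:
#   if zip_ok ncbi_dataset.zip; then
#       echo "archive looks complete"
#   else
#       echo "archive is missing or corrupt; retry the download" >&2
#   fi
```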

@olearyna
Contributor

Hi,
Thanks for contacting us and sorry to hear you got an error. We suspect it's related to some technical glitches earlier today. It should be working now. Let us know if you are still getting errors.

Nuala

Nuala A. O'Leary, PhD
Product Owner, NCBI Datasets
National Center for Biotechnology Information, NLM, NIH, DHHS [Contractor]
Building 45 Room 6As.41
Bethesda, MD 20892
tel 301.402.1808

@chriswyatt1

Thanks. Yes, it's working perfectly again :)

@tshauck

tshauck commented Oct 5, 2022

Hi, thank you for this tool... perhaps those glitches have returned?

./datasets download genome taxon "Bacilli" --reference --exclude-gff3 --exclude-genomic-cds --exclude-protein --exclude-rna --dehydrated
Collecting 1,936 genome accessions [================================================] 100% 1936/1936
Downloading: ncbi_dataset.zip    1.05MB 49.1kB/s
Error: Internal error (invalid zip archive). Please try again

I've also had issues when not passing --dehydrated, with the tool attempting to download what seem to be much larger packages than the UI would indicate, to the point where it times out, leaving a corrupt zip.

@ericcox1
Collaborator

ericcox1 commented Oct 5, 2022

Hi Trent,

Thanks for your feedback. I was unable to reproduce the problem. It's possible that you may have encountered a temporary network glitch. We are going to look at our services to see if there are any issues.

About your second issue:

"I've also had issues when not passing --dehydrated, with the tool attempting to download what seem to be much larger packages than the UI would indicate, to the point where it times out, leaving a corrupt zip."

Would you mind sharing an example?

Best,
Eric


@MrAndrew

Facing the same issues, using the new alpha V2 executable (amd64 Linux version) downloaded via a curl request. No dehydration flag recognized, and I get an invalid zip file error for a 111 MB monkeypox download using the datasets download virus genome taxon monkeypox command. Is there a hardware requirement on the underlying processor or something?

@ericcox1
Collaborator

Hi @MrAndrew, Thanks for your report.

"No dehydration flag recognized"

We don't yet support the dehydration flag for download virus genome.

"invalid zip file error for a 111 MB monkeypox download using the datasets download virus genome taxon monkeypox command"

I wasn't able to reproduce this error. There is no special hardware requirement. Would you please try this again and let me know how it goes?

Best,
Eric


@skylarwalters

Hi! I'm trying to run:
datasets download virus genome taxon Viruses --complete-only --host human --geo-location Senegal --filename geo.zip
and get the same issue. I'm confident it's not a storage issue. Do you know why this may be happening? For reference, I was able to download all virus refseqs broadly with:
datasets download virus genome taxon Viruses --complete-only --refseq --filename viral_refseqs.zip
and had no issues.

9 participants