ZIP-File Corruption Issues

The Copernicus Open Access Hub moves data to a Long-Term Archive (LTA) after a certain period of time. Retrieving files from this offline archive is a two-step process:

  1. A request to the URL that would initialize the download for online products (Products('$UUID')/$value) instead initializes a data-retrieval request, which moves the archived file from offline to online storage.

  2. After several minutes or hours, when the product has finished moving to online storage, a request to the same URL initializes the download.
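The two-step interaction above can be sketched with a small helper. The UUID, the exact URL, and the status-code semantics below are assumptions (200 for an online product whose body is the download, 202 Accepted for a queued retrieval), not details taken from this document:

```python
# Hypothetical UUID -- a real request needs the product's actual identifier.
UUID = "00000000-0000-0000-0000-000000000000"
URL = f"https://scihub.copernicus.eu/dhus/odata/v1/Products('{UUID}')/$value"

def classify(status_code):
    """Assumed semantics: 200 = product online (response body is the file),
    202 = offline-retrieval request accepted and queued."""
    if status_code == 200:
        return "online"
    if status_code == 202:
        return "retrieval triggered"
    return "error"

# Usage sketch (requires the `requests` package and valid credentials):
# resp = requests.get(URL, auth=("user", "password"), stream=True)
# print(classify(resp.status_code))
```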

However, due to technical issues on which the Copernicus Open Access Hub issue channels did not provide additional information, some of the offline files were restored incorrectly. The MD5 checksum reported for the files as delivered matched the downloaded products, yet the products were incomplete or incorrectly encoded ZIP files.

The solution was to manually copy the file names into the search interface of the Open Access Hub and download the products from there.

This notebook describes a feasible process for identifying corrupted ZIP files and manually initializing their correct retrieval, provided the number of corrupted files is sufficiently small.

Identification Process

Using Unix command-line tools, the following command lists all files in a target folder in ascending order of size. The size is printed in human-readable format in the left column.

! ls -rSsh input/tempelhofer_feld/*.zip
 25M input/tempelhofer_feld/
 29M input/tempelhofer_feld/
 29M input/tempelhofer_feld/
 30M input/tempelhofer_feld/
 30M input/tempelhofer_feld/
 31M input/tempelhofer_feld/
 35M input/tempelhofer_feld/
 38M input/tempelhofer_feld/
 42M input/tempelhofer_feld/
 43M input/tempelhofer_feld/
723M input/tempelhofer_feld/
753M input/tempelhofer_feld/
761M input/tempelhofer_feld/
764M input/tempelhofer_feld/
766M input/tempelhofer_feld/
768M input/tempelhofer_feld/
771M input/tempelhofer_feld/
774M input/tempelhofer_feld/
789M input/tempelhofer_feld/
789M input/tempelhofer_feld/
802M input/tempelhofer_feld/
802M input/tempelhofer_feld/
809M input/tempelhofer_feld/
813M input/tempelhofer_feld/
819M input/tempelhofer_feld/
823M input/tempelhofer_feld/
823M input/tempelhofer_feld/
829M input/tempelhofer_feld/
845M input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.1G input/tempelhofer_feld/
1.2G input/tempelhofer_feld/

The first 10 files are significantly smaller than expected. Using pipes, the following command tries to extract the smallest of these files, which raises an error:

! ls -S input/tempelhofer_feld/*.zip | tail -n1 | xargs unzip
Archive:  input/tempelhofer_feld/
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of input/tempelhofer_feld/ or
        input/tempelhofer_feld/, and cannot find input/tempelhofer_feld/, period.
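The same check can be done programmatically rather than by eyeballing `ls` output and running `unzip`. A minimal sketch using Python's zipfile module, with the folder path from the example above:

```python
import zipfile
from pathlib import Path

def find_corrupt_zips(folder):
    """Return paths of *.zip files that are not readable ZIP archives."""
    corrupt = []
    for path in sorted(Path(folder).glob("*.zip")):
        if not zipfile.is_zipfile(path):
            # Missing end-of-central-directory signature, as in the unzip error.
            corrupt.append(path)
            continue
        try:
            with zipfile.ZipFile(path) as zf:
                # testzip() returns the name of the first bad member, or None.
                if zf.testzip() is not None:
                    corrupt.append(path)
        except zipfile.BadZipFile:
            corrupt.append(path)
    return corrupt

# corrupt = find_corrupt_zips("input/tempelhofer_feld")
```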

API Responses

Continuing with the file above, S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509, the downloaded file is compared to what the API indicates it should look like:

import os
import sentinelsat

api = sentinelsat.SentinelAPI(os.getenv('SCIHUB_USERNAME'), os.getenv('SCIHUB_PASSWORD'))
res = api.to_geodataframe(api.query(raw='S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509'))
res['size']
/opt/conda/lib/python3.8/site-packages/pyproj/crs/ FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes:
  return _prepare_from_string(" ".join(pjargs))
bedec483-5ee1-4264-8dfa-a3b53ce364f7    816.67 MB
Name: size, dtype: object

The size reported by the SciHub API is substantially larger than the downloaded file.
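This size comparison can be automated. The sketch below parses the human-readable size string returned by the API and flags downloads that deviate from it; the binary-unit interpretation (1 MB = 1024² bytes) and the 5% tolerance are assumptions, not documented Hub behavior:

```python
import os

# Assumption: the Hub reports binary units (1 MB = 1024**2 bytes).
UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}

def parse_size(text):
    """Parse a size string such as '816.67 MB' into bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]

def looks_truncated(path, api_size_text, tolerance=0.05):
    """Flag a download whose on-disk size deviates from the API-reported
    size by more than `tolerance` (5% is an arbitrary choice)."""
    expected = parse_size(api_size_text)
    return abs(os.path.getsize(path) - expected) > tolerance * expected

# looks_truncated("input/tempelhofer_feld/<product>.zip", res["size"].iloc[0])
```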

Verification through Repetition

To identify whether repeated downloads fail in identical ways, the products were downloaded again into a separate target folder, input/tempelhofer_feld_test.

Using the piped commands below, the MD5 checksum was calculated for all ZIP files below 500 MB, once for the files in the original download folder input/tempelhofer_feld and once for input/tempelhofer_feld_test.

The checksums being identical shows that both downloads retrieved the same (corrupted) files.

! find input/tempelhofer_feld -type f -size -500M  -name '*.zip' | xargs md5sum
9ca05754c4cc5ff9d2bddf99e2e9e753  input/tempelhofer_feld/
5424cf8c0dd4384382366b37af9ee995  input/tempelhofer_feld/
f2050867b04f8911dfcd1412846f5f0e  input/tempelhofer_feld/
5c41f18b6c9745df406dbca49c50b0c7  input/tempelhofer_feld/
8e9dc7b716056f702912d11197fab44c  input/tempelhofer_feld/
7241ca7fc6ccca5eb8935efe1b834697  input/tempelhofer_feld/
7d2b67dac6f36f1d8744ec2ef296445f  input/tempelhofer_feld/
b078b9d41e7be70a89961214d4adb72b  input/tempelhofer_feld/
f4a2910be181bd1c85fba14e05ce69b1  input/tempelhofer_feld/
53e1beb3f29dc1dc5b20745c3d66568e  input/tempelhofer_feld/
! find input/tempelhofer_feld_test -type f -size -500M  -name '*.zip' | xargs md5sum
9ca05754c4cc5ff9d2bddf99e2e9e753  input/tempelhofer_feld_test/
5424cf8c0dd4384382366b37af9ee995  input/tempelhofer_feld_test/
f2050867b04f8911dfcd1412846f5f0e  input/tempelhofer_feld_test/
5c41f18b6c9745df406dbca49c50b0c7  input/tempelhofer_feld_test/
8e9dc7b716056f702912d11197fab44c  input/tempelhofer_feld_test/
7241ca7fc6ccca5eb8935efe1b834697  input/tempelhofer_feld_test/
7d2b67dac6f36f1d8744ec2ef296445f  input/tempelhofer_feld_test/
b078b9d41e7be70a89961214d4adb72b  input/tempelhofer_feld_test/
f4a2910be181bd1c85fba14e05ce69b1  input/tempelhofer_feld_test/
53e1beb3f29dc1dc5b20745c3d66568e  input/tempelhofer_feld_test/
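The same comparison can be scripted in Python. A sketch using hashlib, with the 500 MB cutoff and folder names from the commands above:

```python
import hashlib
from pathlib import Path

def md5sums(folder, max_bytes=500 * 1024**2):
    """MD5 digests, keyed by file name, for all *.zip files below max_bytes."""
    digests = {}
    for path in sorted(Path(folder).glob("*.zip")):
        if path.stat().st_size >= max_bytes:
            continue
        md5 = hashlib.md5()
        with open(path, "rb") as fh:
            # Read in 1 MiB chunks to keep memory usage flat.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                md5.update(chunk)
        digests[path.name] = md5.hexdigest()
    return digests

# Identical digests => both downloads retrieved the same (corrupted) files:
# md5sums("input/tempelhofer_feld") == md5sums("input/tempelhofer_feld_test")
```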

Manual Download

Another approach was to explicitly use the download link provided in the API response and compare this manual download with the downloads initialized by the sentinelsat API above.

While the bad checksum in the Open Access Hub API response is already a strong indicator that the error is introduced server-side during the retrieval process, this manual verification tries to further rule out the sentinelsat module as a possible source of error.

Since the link initialized the download of a broken ZIP file as well, all indicators point toward a server-side error at the Copernicus Open Access Hub.



A temporary solution, as indicated above, is to manually copy the product names (the file names without the .zip extension) into the search mask of the Open Access Hub, which shows the products as Offline. The LTA retrieval can then be initialized manually and completes after several minutes.
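Collecting the product names to paste into the search mask can be scripted. A minimal sketch; the example path is illustrative, and the `trigger_offline_retrieval` alternative mentioned in the comments is an assumption about newer sentinelsat releases, not something verified in this notebook:

```python
from pathlib import Path

def product_names(zip_paths):
    """Product name = file name without the .zip extension."""
    return [Path(p).stem for p in zip_paths]

# Example (hypothetical list of corrupted downloads):
# corrupt = ["input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip"]
# product_names(corrupt)  -> names to paste into the Open Hub search mask
#
# Assumption: newer sentinelsat releases also expose
# api.trigger_offline_retrieval(uuid) to queue the LTA retrieval directly.
```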

While this approach restores the products without corruption, the Open Access Hub API is expected to resume operating as advertised once the issue has been escalated.