ZIP-File Corruption Issues

The Copernicus Open Access Hub has a policy of moving data to a Long-Term-Archive after a certain period of time. Retrieval of files from this offline archive is a two step process:

  1. A request to the URL that initializes the download for online products (https://scihub.copernicus.eu/dhus/odata/v1/Products('$UUID')/$value) instead initializes a data retrieval request which moves the archived file from offline to online storage.

  2. After several minutes or hours, when the product has finished moving to online storage, a request to the same URL initializes the download.

However, due to technical issues that the Copernicus Open Access Hub issue channels did not provide additional information on, some of the offline files were incorrectly restored. The MD5 checksum of the files as delivered was identical to the downloaded product, but products seemed to be incomplete or incorrectly encoded ZIP-files.

The solution for this was to manually copy the file names into the search interface of the Open Hub and retrieve a download there.

This notebook contains information about a feasible process to identify corrupted zip files and manually initialize their correct retrieval, given the number of corruptions is sufficiently small.

Identification Process

Using unix command line tools, the following command lists all files in a target folder in ascending order by size. The size is printed in a human-readable format in the left column.

! ls -rSsh input/tempelhofer_feld/*.zip
 25M input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip
 29M input/tempelhofer_feld/S2B_MSIL2A_20190512T102029_N0212_R065_T33UUU_20190512T134103.zip
 29M input/tempelhofer_feld/S2A_MSIL2A_20190216T102111_N0211_R065_T33UUU_20190216T130428.zip
 30M input/tempelhofer_feld/S2A_MSIL2A_20190424T101031_N0211_R022_T32UQD_20190424T162325.zip
 30M input/tempelhofer_feld/S2A_MSIL2A_20190404T101031_N0211_R022_T32UQD_20190404T174806.zip
 31M input/tempelhofer_feld/S2B_MSIL2A_20190419T101029_N0211_R022_T33UUU_20190419T132322.zip
 35M input/tempelhofer_feld/S2A_MSIL2A_20190613T101031_N0212_R022_T33UUU_20190614T125329.zip
 38M input/tempelhofer_feld/S2A_MSIL2A_20190822T101031_N0213_R022_T32UQD_20190822T143621.zip
 42M input/tempelhofer_feld/S2A_MSIL2A_20190603T101031_N0212_R022_T33UUU_20190603T114652.zip
 43M input/tempelhofer_feld/S2A_MSIL2A_20190407T102021_N0211_R065_T33UUU_20190407T134109.zip
723M input/tempelhofer_feld/S2A_MSIL2A_20190114T101351_N0211_R022_T32UQD_20190114T113404.zip
753M input/tempelhofer_feld/S2B_MSIL2A_20190409T101029_N0211_R022_T32UQD_20190409T134504.zip
761M input/tempelhofer_feld/S2B_MSIL2A_20190320T101029_N0211_R022_T33UUU_20190320T195148.zip
764M input/tempelhofer_feld/S2B_MSIL2A_20190218T101059_N0211_R022_T32UQD_20190218T161620.zip
766M input/tempelhofer_feld/S2B_MSIL2A_20190330T101029_N0211_R022_T33UUU_20190330T144328.zip
768M input/tempelhofer_feld/S2B_MSIL2A_20190529T101039_N0212_R022_T32UQD_20190529T130331.zip
771M input/tempelhofer_feld/S2B_MSIL2A_20190906T101029_N0213_R022_T32UQD_20190906T133832.zip
774M input/tempelhofer_feld/S2B_MSIL2A_20190728T101029_N0213_R022_T32UQD_20190728T134658.zip
789M input/tempelhofer_feld/S2B_MSIL2A_20190519T101039_N0212_R022_T33UUU_20190519T132053.zip
789M input/tempelhofer_feld/S2B_MSIL2A_20190827T101029_N0213_R022_T32UQD_20190827T134854.zip
802M input/tempelhofer_feld/S2A_MSIL2A_20191130T101401_N0213_R022_T33UUU_20191130T115440.zip
802M input/tempelhofer_feld/S2B_MSIL2A_20191205T101309_N0213_R022_T33UUU_20191205T122401.zip
809M input/tempelhofer_feld/S2A_MSIL2A_20191220T101431_N0213_R022_T33UUU_20191220T115219.zip
813M input/tempelhofer_feld/S2A_MSIL2A_20190921T101031_N0213_R022_T33UUU_20190921T130515.zip
819M input/tempelhofer_feld/S2A_MSIL2A_20191210T101411_N0213_R022_T33UUU_20191210T114322.zip
823M input/tempelhofer_feld/S2B_MSIL2A_20190718T101039_N0213_R022_T33UUU_20190718T131731.zip
823M input/tempelhofer_feld/S2A_MSIL2A_20190713T101031_N0213_R022_T33UUU_20190713T135651.zip
829M input/tempelhofer_feld/S2A_MSIL2A_20190911T101021_N0213_R022_T33UUU_20190911T143947.zip
845M input/tempelhofer_feld/S2A_MSIL2A_20190723T101031_N0213_R022_T33UUU_20190723T125722.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20191213T102421_N0213_R065_T33UUU_20191213T120011.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20190422T102029_N0211_R065_T32UQD_20190422T133643.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20191029T102039_N0213_R065_T32UQD_20191029T134629.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20190402T102029_N0211_R065_T33UUU_20190402T135010.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20190711T102029_N0213_R065_T33UUU_20190711T135545.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190417T102031_N0211_R065_T33UUU_20190417T130913.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190626T102031_N0212_R065_T33UUU_20190626T125319.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190726T102031_N0213_R065_T33UUU_20190726T125507.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20191014T102031_N0213_R065_T32UQD_20191014T130941.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190825T102031_N0213_R065_T33UUU_20190825T134836.zip
1.2G input/tempelhofer_feld/S2B_MSIL2A_20190601T102029_N0212_R065_T33UUU_20190601T135040.zip

The first 10 files have a file size that’s significantly lower than what would be expected. Using pipes the following command tries to extract one of the low-size files, which raises an error:

! ls -S input/tempelhofer_feld/*.zip | tail -n1 | xargs unzip
Archive:  input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip or
        input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip.zip, and cannot find input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip.ZIP, period.

API Responses

Continuing with the file above, S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509, the downloaded file is compared to what the API indicates it should look like:

import os
import sentinelsat

api = sentinelsat.SentinelAPI(os.getenv('SCIHUB_USERNAME'), os.getenv('SCIHUB_PASSWORD'))
res = api.to_geodataframe(api.query(raw='S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509'))
/opt/conda/lib/python3.8/site-packages/pyproj/crs/crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
res['size']
bedec483-5ee1-4264-8dfa-a3b53ce364f7    816.67 MB
Name: size, dtype: object

The size given by the scihub API is a lot larger than what has been downloaded.

Verification through Repetition

For the purpose of identifying whether repeated downloads fail in identical ways, the products have been downloaded into a separate target folder, input/tempelhofer_feld_test.

Using the piped commands below the MD5 checksum for all ZIP-files below 500MB was calculated, once for all files in the original download folder input/tempelhofer_feld and once for input/tempelhofer_feld_test.

These checksums being identical shows that both downloads retrieved the same (corrupted) file.

! find input/tempelhofer_feld -type f -size -500M  -name '*.zip' | xargs md5sum
9ca05754c4cc5ff9d2bddf99e2e9e753  input/tempelhofer_feld/S2A_MSIL2A_20190603T101031_N0212_R022_T33UUU_20190603T114652.zip
5424cf8c0dd4384382366b37af9ee995  input/tempelhofer_feld/S2A_MSIL2A_20190404T101031_N0211_R022_T32UQD_20190404T174806.zip
f2050867b04f8911dfcd1412846f5f0e  input/tempelhofer_feld/S2A_MSIL2A_20190216T102111_N0211_R065_T33UUU_20190216T130428.zip
5c41f18b6c9745df406dbca49c50b0c7  input/tempelhofer_feld/S2B_MSIL2A_20190419T101029_N0211_R022_T33UUU_20190419T132322.zip
8e9dc7b716056f702912d11197fab44c  input/tempelhofer_feld/S2A_MSIL2A_20190407T102021_N0211_R065_T33UUU_20190407T134109.zip
7241ca7fc6ccca5eb8935efe1b834697  input/tempelhofer_feld/S2B_MSIL2A_20190512T102029_N0212_R065_T33UUU_20190512T134103.zip
7d2b67dac6f36f1d8744ec2ef296445f  input/tempelhofer_feld/S2A_MSIL2A_20190613T101031_N0212_R022_T33UUU_20190614T125329.zip
b078b9d41e7be70a89961214d4adb72b  input/tempelhofer_feld/S2A_MSIL2A_20190424T101031_N0211_R022_T32UQD_20190424T162325.zip
f4a2910be181bd1c85fba14e05ce69b1  input/tempelhofer_feld/S2A_MSIL2A_20190822T101031_N0213_R022_T32UQD_20190822T143621.zip
53e1beb3f29dc1dc5b20745c3d66568e  input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip
! find input/tempelhofer_feld_test -type f -size -500M  -name '*.zip' | xargs md5sum
9ca05754c4cc5ff9d2bddf99e2e9e753  input/tempelhofer_feld_test/S2A_MSIL2A_20190603T101031_N0212_R022_T33UUU_20190603T114652.zip
5424cf8c0dd4384382366b37af9ee995  input/tempelhofer_feld_test/S2A_MSIL2A_20190404T101031_N0211_R022_T32UQD_20190404T174806.zip
f2050867b04f8911dfcd1412846f5f0e  input/tempelhofer_feld_test/S2A_MSIL2A_20190216T102111_N0211_R065_T33UUU_20190216T130428.zip
5c41f18b6c9745df406dbca49c50b0c7  input/tempelhofer_feld_test/S2B_MSIL2A_20190419T101029_N0211_R022_T33UUU_20190419T132322.zip
8e9dc7b716056f702912d11197fab44c  input/tempelhofer_feld_test/S2A_MSIL2A_20190407T102021_N0211_R065_T33UUU_20190407T134109.zip
7241ca7fc6ccca5eb8935efe1b834697  input/tempelhofer_feld_test/S2B_MSIL2A_20190512T102029_N0212_R065_T33UUU_20190512T134103.zip
7d2b67dac6f36f1d8744ec2ef296445f  input/tempelhofer_feld_test/S2A_MSIL2A_20190613T101031_N0212_R022_T33UUU_20190614T125329.zip
b078b9d41e7be70a89961214d4adb72b  input/tempelhofer_feld_test/S2A_MSIL2A_20190424T101031_N0211_R022_T32UQD_20190424T162325.zip
f4a2910be181bd1c85fba14e05ce69b1  input/tempelhofer_feld_test/S2A_MSIL2A_20190822T101031_N0213_R022_T32UQD_20190822T143621.zip
53e1beb3f29dc1dc5b20745c3d66568e  input/tempelhofer_feld_test/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip

Manual Download

Another approach was to explicitly use the link provided by the API response and compare this manual download to the downloads initialized by the sentinelsat API above.

While the fact that the API response from the Open Access Hub API contained a bad checksum is already a strong indicator that the error is introduced on the server-side during the retrieval process, this manual verification tries to further rule out the sentinelsat module as a possible source of error.

Seeing how the link initialized a download of a broken ZIP-file as well, all indicators point toward a server-side error on the side of the Copernicus Open Access Hub.

res['link'].iloc[0]
"https://scihub.copernicus.eu/apihub/odata/v1/Products('bedec483-5ee1-4264-8dfa-a3b53ce364f7')/$value"

Solution

A temporary solution, as indicated above, is to manually copy the product names - file names without the .zip-extension - into the Search Mask of the Open Hub, which will show the product as Offline. The LTA retrieval can then be initialized manually, which is completed after several minutes.

While this approach restores the products without corruption, it is to be expected that the Open Access Hub API will resume operation as advertised after having elevated the issue.