ZIP-File Corruption Issues¶
The Copernicus Open Access Hub has a policy of moving data to a Long-Term-Archive after a certain period of time. Retrieval of files from this offline archive is a two step process:
A request to the URL that initializes the download for online products (
https://scihub.copernicus.eu/dhus/odata/v1/Products('$UUID')/$value
) instead initializes a data retrieval request which moves the archived file from offline to online storage.After several minutes or hours, when the product has finished moving to online storage, a request to the same URL initializes the download.
However, due to technical issues that the Copernicus Open Access Hub issue channels did not provide additional information on, some of the offline files were incorrectly restored. The MD5 checksum of the files as delivered was identical to the downloaded product, but products seemed to be incomplete or incorrectly encoded ZIP-files.
The solution for this was to manually copy the file names into the search interface of the Open Hub and retrieve a download there.
This notebook contains information about a feasible process to identify corrupted zip files and manually initialize their correct retrieval, given the number of corruptions is sufficiently small.
Identification Process¶
Using unix command line tools, the following command lists all files in a target folder in ascending order by size. The size is printed in a human-readable format in the left column.
! ls -rSsh input/tempelhofer_feld/*.zip
25M input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip
29M input/tempelhofer_feld/S2B_MSIL2A_20190512T102029_N0212_R065_T33UUU_20190512T134103.zip
29M input/tempelhofer_feld/S2A_MSIL2A_20190216T102111_N0211_R065_T33UUU_20190216T130428.zip
30M input/tempelhofer_feld/S2A_MSIL2A_20190424T101031_N0211_R022_T32UQD_20190424T162325.zip
30M input/tempelhofer_feld/S2A_MSIL2A_20190404T101031_N0211_R022_T32UQD_20190404T174806.zip
31M input/tempelhofer_feld/S2B_MSIL2A_20190419T101029_N0211_R022_T33UUU_20190419T132322.zip
35M input/tempelhofer_feld/S2A_MSIL2A_20190613T101031_N0212_R022_T33UUU_20190614T125329.zip
38M input/tempelhofer_feld/S2A_MSIL2A_20190822T101031_N0213_R022_T32UQD_20190822T143621.zip
42M input/tempelhofer_feld/S2A_MSIL2A_20190603T101031_N0212_R022_T33UUU_20190603T114652.zip
43M input/tempelhofer_feld/S2A_MSIL2A_20190407T102021_N0211_R065_T33UUU_20190407T134109.zip
723M input/tempelhofer_feld/S2A_MSIL2A_20190114T101351_N0211_R022_T32UQD_20190114T113404.zip
753M input/tempelhofer_feld/S2B_MSIL2A_20190409T101029_N0211_R022_T32UQD_20190409T134504.zip
761M input/tempelhofer_feld/S2B_MSIL2A_20190320T101029_N0211_R022_T33UUU_20190320T195148.zip
764M input/tempelhofer_feld/S2B_MSIL2A_20190218T101059_N0211_R022_T32UQD_20190218T161620.zip
766M input/tempelhofer_feld/S2B_MSIL2A_20190330T101029_N0211_R022_T33UUU_20190330T144328.zip
768M input/tempelhofer_feld/S2B_MSIL2A_20190529T101039_N0212_R022_T32UQD_20190529T130331.zip
771M input/tempelhofer_feld/S2B_MSIL2A_20190906T101029_N0213_R022_T32UQD_20190906T133832.zip
774M input/tempelhofer_feld/S2B_MSIL2A_20190728T101029_N0213_R022_T32UQD_20190728T134658.zip
789M input/tempelhofer_feld/S2B_MSIL2A_20190519T101039_N0212_R022_T33UUU_20190519T132053.zip
789M input/tempelhofer_feld/S2B_MSIL2A_20190827T101029_N0213_R022_T32UQD_20190827T134854.zip
802M input/tempelhofer_feld/S2A_MSIL2A_20191130T101401_N0213_R022_T33UUU_20191130T115440.zip
802M input/tempelhofer_feld/S2B_MSIL2A_20191205T101309_N0213_R022_T33UUU_20191205T122401.zip
809M input/tempelhofer_feld/S2A_MSIL2A_20191220T101431_N0213_R022_T33UUU_20191220T115219.zip
813M input/tempelhofer_feld/S2A_MSIL2A_20190921T101031_N0213_R022_T33UUU_20190921T130515.zip
819M input/tempelhofer_feld/S2A_MSIL2A_20191210T101411_N0213_R022_T33UUU_20191210T114322.zip
823M input/tempelhofer_feld/S2B_MSIL2A_20190718T101039_N0213_R022_T33UUU_20190718T131731.zip
823M input/tempelhofer_feld/S2A_MSIL2A_20190713T101031_N0213_R022_T33UUU_20190713T135651.zip
829M input/tempelhofer_feld/S2A_MSIL2A_20190911T101021_N0213_R022_T33UUU_20190911T143947.zip
845M input/tempelhofer_feld/S2A_MSIL2A_20190723T101031_N0213_R022_T33UUU_20190723T125722.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20191213T102421_N0213_R065_T33UUU_20191213T120011.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20190422T102029_N0211_R065_T32UQD_20190422T133643.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20191029T102039_N0213_R065_T32UQD_20191029T134629.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20190402T102029_N0211_R065_T33UUU_20190402T135010.zip
1.1G input/tempelhofer_feld/S2B_MSIL2A_20190711T102029_N0213_R065_T33UUU_20190711T135545.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190417T102031_N0211_R065_T33UUU_20190417T130913.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190626T102031_N0212_R065_T33UUU_20190626T125319.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190726T102031_N0213_R065_T33UUU_20190726T125507.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20191014T102031_N0213_R065_T32UQD_20191014T130941.zip
1.1G input/tempelhofer_feld/S2A_MSIL2A_20190825T102031_N0213_R065_T33UUU_20190825T134836.zip
1.2G input/tempelhofer_feld/S2B_MSIL2A_20190601T102029_N0212_R065_T33UUU_20190601T135040.zip
The first 10 files have a file size that’s significantly lower than what would be expected. Using pipes the following command tries to extract one of the low-size files, which raises an error:
! ls -S input/tempelhofer_feld/*.zip | tail -n1 | xargs unzip
Archive: input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip or
input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip.zip, and cannot find input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip.ZIP, period.
API Responses¶
Continuing with the file above, S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509
, the downloaded file is compared to what the API indicates it should look like:
import os
import sentinelsat
api = sentinelsat.SentinelAPI(os.getenv('SCIHUB_USERNAME'), os.getenv('SCIHUB_PASSWORD'))
res = api.to_geodataframe(api.query(raw='S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509'))
/opt/conda/lib/python3.8/site-packages/pyproj/crs/crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
return _prepare_from_string(" ".join(pjargs))
res['size']
bedec483-5ee1-4264-8dfa-a3b53ce364f7 816.67 MB
Name: size, dtype: object
The size given by the scihub API is a lot larger than what has been downloaded.
Verification through Repetition¶
For the purpose of identifying whether repeated downloads fail in identical ways, the products have been downloaded into a separate target folder, input/tempelhofer_feld_test
.
Using the piped commands below the MD5 checksum for all ZIP-files below 500MB was calculated, once for all files in the original download folder input/tempelhofer_feld
and once for input/tempelhofer_feld_test
.
These checksums being identical shows that both downloads retrieved the same (corrupted) file.
! find input/tempelhofer_feld -type f -size -500M -name '*.zip' | xargs md5sum
9ca05754c4cc5ff9d2bddf99e2e9e753 input/tempelhofer_feld/S2A_MSIL2A_20190603T101031_N0212_R022_T33UUU_20190603T114652.zip
5424cf8c0dd4384382366b37af9ee995 input/tempelhofer_feld/S2A_MSIL2A_20190404T101031_N0211_R022_T32UQD_20190404T174806.zip
f2050867b04f8911dfcd1412846f5f0e input/tempelhofer_feld/S2A_MSIL2A_20190216T102111_N0211_R065_T33UUU_20190216T130428.zip
5c41f18b6c9745df406dbca49c50b0c7 input/tempelhofer_feld/S2B_MSIL2A_20190419T101029_N0211_R022_T33UUU_20190419T132322.zip
8e9dc7b716056f702912d11197fab44c input/tempelhofer_feld/S2A_MSIL2A_20190407T102021_N0211_R065_T33UUU_20190407T134109.zip
7241ca7fc6ccca5eb8935efe1b834697 input/tempelhofer_feld/S2B_MSIL2A_20190512T102029_N0212_R065_T33UUU_20190512T134103.zip
7d2b67dac6f36f1d8744ec2ef296445f input/tempelhofer_feld/S2A_MSIL2A_20190613T101031_N0212_R022_T33UUU_20190614T125329.zip
b078b9d41e7be70a89961214d4adb72b input/tempelhofer_feld/S2A_MSIL2A_20190424T101031_N0211_R022_T32UQD_20190424T162325.zip
f4a2910be181bd1c85fba14e05ce69b1 input/tempelhofer_feld/S2A_MSIL2A_20190822T101031_N0213_R022_T32UQD_20190822T143621.zip
53e1beb3f29dc1dc5b20745c3d66568e input/tempelhofer_feld/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip
! find input/tempelhofer_feld_test -type f -size -500M -name '*.zip' | xargs md5sum
9ca05754c4cc5ff9d2bddf99e2e9e753 input/tempelhofer_feld_test/S2A_MSIL2A_20190603T101031_N0212_R022_T33UUU_20190603T114652.zip
5424cf8c0dd4384382366b37af9ee995 input/tempelhofer_feld_test/S2A_MSIL2A_20190404T101031_N0211_R022_T32UQD_20190404T174806.zip
f2050867b04f8911dfcd1412846f5f0e input/tempelhofer_feld_test/S2A_MSIL2A_20190216T102111_N0211_R065_T33UUU_20190216T130428.zip
5c41f18b6c9745df406dbca49c50b0c7 input/tempelhofer_feld_test/S2B_MSIL2A_20190419T101029_N0211_R022_T33UUU_20190419T132322.zip
8e9dc7b716056f702912d11197fab44c input/tempelhofer_feld_test/S2A_MSIL2A_20190407T102021_N0211_R065_T33UUU_20190407T134109.zip
7241ca7fc6ccca5eb8935efe1b834697 input/tempelhofer_feld_test/S2B_MSIL2A_20190512T102029_N0212_R065_T33UUU_20190512T134103.zip
7d2b67dac6f36f1d8744ec2ef296445f input/tempelhofer_feld_test/S2A_MSIL2A_20190613T101031_N0212_R022_T33UUU_20190614T125329.zip
b078b9d41e7be70a89961214d4adb72b input/tempelhofer_feld_test/S2A_MSIL2A_20190424T101031_N0211_R022_T32UQD_20190424T162325.zip
f4a2910be181bd1c85fba14e05ce69b1 input/tempelhofer_feld_test/S2A_MSIL2A_20190822T101031_N0213_R022_T32UQD_20190822T143621.zip
53e1beb3f29dc1dc5b20745c3d66568e input/tempelhofer_feld_test/S2A_MSIL2A_20190623T101031_N0212_R022_T33UUU_20190623T132509.zip
Manual Download¶
Another approach was to explicitly use the link provided by the API response and compare this manual download to the downloads initialized by the sentinelsat
API above.
While the fact that the API response from the Open Access Hub API contained a bad checksum is already a strong indicator that the error is introduced on the server-side during the retrieval process, this manual verification tries to further rule out the sentinelsat
module as a possible source of error.
Seeing how the link initialized a download of a broken ZIP-file as well, all indicators point toward a server-side error on the side of the Copernicus Open Access Hub.
res['link'].iloc[0]
"https://scihub.copernicus.eu/apihub/odata/v1/Products('bedec483-5ee1-4264-8dfa-a3b53ce364f7')/$value"
Solution¶
A temporary solution, as indicated above, is to manually copy the product names - file names without the .zip
-extension - into the Search Mask of the Open Hub, which will show the product as Offline. The LTA retrieval can then be initialized manually, which is completed after several minutes.
While this approach restores the products without corruption, it is to be expected that the Open Access Hub API will resume operation as advertised after having elevated the issue.