[Esip-cloud] Cloud formats study
Jeff de La Beaujardiere
jeffdlb at ucar.edu
Wed Nov 6 17:07:02 EST 2019
Patrick-
We have started using the Zarr format for climate model outputs
hosted in the cloud. We don't yet have quantitative comparisons or
metrics to offer, but below are a few preliminary answers to your
questions.
> Data access performance to support common forms of analysis, including time series, shape-based averaging, regridding and data intercomparison.
The Zarr format breaks large, multidimensional datasets (e.g., 360
longitudes x 180 latitudes x 30 elevations x 3000 timesteps x 40
ensemble members) into smaller chunks, which can be read in parallel
using Dask for good performance. You can choose how and whether to
chunk along each dimension; general-purpose chunking along all
dimensions at once provides reasonable performance for analyses across
the time, space, and ensemble dimensions. Work by, I believe, the
Pangeo team suggests that chunks of approximately 100 MB are about
right.
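As a back-of-the-envelope sketch of that chunk sizing, here is how one might check a candidate chunk shape for the example dataset above against the ~100 MB target. The chunk shape chosen here is purely illustrative, not a recommendation:

```python
# Back-of-the-envelope chunk sizing for the example dataset above.
# The array shape and the ~100 MB target come from the text; the
# chunk shape is a hypothetical choice for illustration.

ITEMSIZE = 8  # bytes per float64 value

# Full array: 360 lon x 180 lat x 30 elev x 3000 time x 40 members
full_shape = (360, 180, 30, 3000, 40)

# One hypothetical chunking: the whole horizontal grid, all
# elevations, 6 timesteps, and 1 ensemble member per chunk.
chunk_shape = (360, 180, 30, 6, 1)

def nbytes(shape, itemsize=ITEMSIZE):
    """Uncompressed size in bytes of an array with the given shape."""
    n = itemsize
    for dim in shape:
        n *= dim
    return n

chunk_mb = nbytes(chunk_shape) / 1e6
print(f"chunk size: {chunk_mb:.1f} MB")  # ~93 MB, near the 100 MB target
```

With this chunking, any analysis that slices along time, space, or ensemble member touches only a modest number of ~93 MB objects rather than the full multi-terabyte array.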
> Compatibility with existing off the shelf tools, including Panoply, gdal, nco, Jupyter/xarray, ArcGIS and GIS.
Dask, Xarray, and Jupyter are our primary focus. We also use Intake
and Intake-ESM catalogs to provide access at the dataset level rather
than at the level of individual files.
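For a sense of what dataset-level access looks like, a minimal Intake catalog entry for a Zarr store might resemble the following. This is a hypothetical sketch: the source name, description, and S3 path are placeholders, and the `zarr` driver assumes the intake-xarray plugin is installed.

```yaml
# Hypothetical Intake catalog entry (names and path are placeholders)
sources:
  cesm_le_example:
    description: "Example CESM ensemble variable stored as Zarr"
    driver: zarr            # provided by the intake-xarray plugin
    args:
      urlpath: "s3://bucket-name/path/to/store.zarr"
      storage_options:
        anon: true          # public, unauthenticated S3 access
```

A user would then open the dataset by name from the catalog instead of tracking individual file URLs.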
> Ability to support fine-grained requests from S3 via range-get or other means.
We are not doing range-gets because the chunks themselves are already
small (<= 100 MB) and are fetched as whole objects.
> Ability to comply with community metadata conventions (e.g., CF)
Yes. Also, work is planned or in progress at Unidata to enable the
NetCDF API to use Zarr under the hood.
Our first published Zarr dataset will be improved to be more
CF-compliant than the original data.
> Availability of independent libraries to read the data in C/C++, Fortran, Python and R
Python at least.
> Comparative cost of data preparation, storage and analysis, adjusted for lossless compressibility as appropriate.
No comparisons at this point. We hope to have someone experiment with
modifying a climate model code to export data directly as Zarr.
Each chunk can be losslessly compressed.
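The per-chunk lossless compression can be illustrated with a stdlib round trip. Here zlib stands in for the codecs Zarr actually uses (e.g., Blosc via numcodecs); the fake chunk bytes are invented for the example:

```python
import zlib

# Each Zarr chunk is compressed independently before being stored.
# zlib here is a stand-in for Zarr's real codecs (e.g., Blosc).
chunk = bytes(range(256)) * 1000  # a fake 256 kB chunk of repetitive data

compressed = zlib.compress(chunk, level=5)
restored = zlib.decompress(compressed)

assert restored == chunk  # lossless: the round trip is exact
print(f"{len(chunk)} bytes -> {len(compressed)} bytes")
```

Because each chunk is compressed on its own, a reader can fetch and decompress only the chunks a query touches.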
> Ability to represent several different data types / structures including imagery, swath, trajectory, point cloud, Platte-Carre and Sinusoidal grids, in situ and airborne
Yes. Zarr actually originated in the genetics community but is being
used for Earth science data as well. It should work as long as the
coordinate axes can be clearly defined. For a network of in situ point
sensors, I think (though I am not sure) it would be possible to have
the two dimensions be sensor ID and time.
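A minimal sketch of that (sensor, time) layout, using plain nested lists in place of a 2-D Zarr array; the sensor IDs, times, and readings are invented for illustration:

```python
# Sketch of laying out in situ point-sensor data on two dimensions,
# (sensor, time), as suggested above. Nested lists stand in for a
# 2-D Zarr array; the IDs and values are hypothetical.

sensor_ids = ["stn_001", "stn_002", "stn_003"]  # coordinate for dim 0
times = [0, 1, 2, 3]                            # coordinate for dim 1

# values[i][j] = reading from sensor i at time j
values = [
    [14.2, 14.4, 14.1, 13.9],
    [15.0, 15.1, 15.3, 15.2],
    [13.5, 13.6, 13.4, 13.3],
]

def reading(sensor_id, t):
    """Look up one value by (sensor, time), mirroring 2-D indexing."""
    return values[sensor_ids.index(sensor_id)][times.index(t)]

print(reading("stn_002", 2))  # 15.3
```

A time series for one sensor is then just one row, and a snapshot across the network at one time is one column.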
> Ability to verify data integrity upon reformatting and ongoing
Verification is not built-in (as far as I know) but could be done
separately. Verifying integrity after changing formats is somewhat
non-trivial because a simple checksum/hash approach doesn't work: the
bytes change even when the values do not, so the decoded values
themselves must be compared.
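A small stdlib demonstration of why file hashes fail across formats while value-level comparison succeeds; JSON and packed binary stand in for the two serialization formats, and the values are invented:

```python
import hashlib
import json
import struct

# The same values serialized two ways produce different bytes, so a
# file-level hash cannot verify a format conversion. Comparing the
# decoded values does work. The formats and values are illustrative.
values = [273.15, 280.0, 291.5]

as_json = json.dumps(values).encode()    # "format A"
as_binary = struct.pack("<3d", *values)  # "format B"

# File-level hashes differ even though the data are identical:
assert hashlib.sha256(as_json).digest() != hashlib.sha256(as_binary).digest()

# Value-level comparison after decoding each format succeeds:
decoded_a = json.loads(as_json)
decoded_b = list(struct.unpack("<3d", as_binary))
assert decoded_a == decoded_b
```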
> Self-describability, i.e., ability to include complete sets of both descriptive and structural metadata
Yes. In the Zarr format, the descriptive and structural metadata are
encoded as JSON, while the actual data values are stored in separate
objects that are just bytes with no header. Other types of metadata
(e.g., an ISO record) are also possible.
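To make that concrete, here is roughly what the JSON metadata for one array looks like under the Zarr v2 spec. The keys follow the spec's `.zarray` and `.zattrs` objects; the particular shape, codec, and attribute values are illustrative:

```python
import json

# Structural metadata (the ".zarray" object in Zarr v2); keys per
# the spec, values illustrative.
zarray = {
    "zarr_format": 2,
    "shape": [180, 360],
    "chunks": [90, 90],
    "dtype": "<f8",
    "compressor": {"id": "zlib", "level": 5},
    "fill_value": "NaN",
    "order": "C",
    "filters": None,
}

# Descriptive (e.g., CF-style) attributes live in a separate
# ".zattrs" object:
zattrs = {
    "standard_name": "air_temperature",
    "units": "K",
}

print(json.dumps(zarray, indent=2))
```

The chunk objects themselves carry none of this; a reader locates and decodes them entirely from the JSON above.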
> Open specification
Yes: https://zarr.readthedocs.io/en/stable/
> Number of independent implementations of read/write API
Don't know.
> Standards-body approval (OGC, W3C, etc.)
Not as far as I know. Would be premature at this point.
> Do you have examples of test data sets for things like time series analysis that you can share?
~70 TB of the NCAR Community Earth System Model (CESM) Large Ensemble
is available in Zarr format on AWS, with S3 storage graciously
provided by the Amazon Sustainability Data Initiative and the AWS
Public Dataset Program:
https://doi.org/10.26024/wt24-5j82
We are still working to improve ancillary material such as the Jupyter
notebooks, Intake catalog, etc.
Regards,
Jeff DLB
Jeff de La Beaujardiere, PhD
Director, NCAR/CISL Information Systems Division
https://staff.ucar.edu/users/jeffdlb
https://orcid.org/0000-0002-1001-9210