[Esip-documentation] Let's get rid of spatial and temporal bounds in ACDD

Armstrong, Edward M (398M) Edward.M.Armstrong at jpl.nasa.gov
Fri Apr 25 12:54:42 EDT 2014


Hi John,

This is a great plan. I am working on (b) and (c) for internal tools at the PO.DAAC and encourage other developers and institutions to do likewise.

Hopefully we’ll see some progress in a year. And this will reduce the intensity as you say.


On Apr 24, 2014, at 2:04 PM, John Graybeal <john.graybeal at marinexplore.com> wrote:

So this thread identified recommendations on several fronts for which we may have consensus:
A) Label risks when using global geospatiotemporal attributes (text to be added to TBD guidance; needs to cover more than just geospatiotemporal attributes)
B) Improve server-side handling of computable dynamic attributes like global geospatiotemporal bounds (recommend changes to server software)
C) Document and improve utilities' handling of global attribute creation (recommend changes to utilities that create such files)

I think all these are mutually reinforcing, and see some useful next steps:
a) Propose text for
   a.1) the CF standard, providing guidance about these attributes for file creators, updaters, and users.
   a.2) the ACDD geospatiotemporal attributes, stating that if they exist, their validity should be confirmed before use (a minimal sketch of such a check follows this list).
b) Make change requests to authors of servers describing desired strategies for maintaining and serving current attributes.
c) Define current practices of utilities and recommend changes, for example by making change requests.
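
As a rough illustration of the kind of check meant in (a.2) -- a minimal sketch only, assuming Python with the netCDF4 library, coordinate variables named "lat" and "lon", and the usual ACDD attribute names; none of these assumptions are part of the proposal itself:

    # Sketch: confirm the ACDD global geospatial bounds against the
    # coordinate variables before trusting them.  Variable and attribute
    # names are illustrative assumptions.
    from netCDF4 import Dataset

    def bounds_are_consistent(path, tol=1e-5):
        with Dataset(path) as ds:
            lat = ds.variables["lat"][:]
            lon = ds.variables["lon"][:]
            expected = {
                "geospatial_lat_min": float(lat.min()),
                "geospatial_lat_max": float(lat.max()),
                "geospatial_lon_min": float(lon.min()),
                "geospatial_lon_max": float(lon.max()),
            }
            for attr, computed in expected.items():
                if attr in ds.ncattrs():
                    declared = float(ds.getncattr(attr))
                    if abs(declared - computed) > tol:
                        return False  # attribute disagrees with the data
        return True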

If there aren't major objections, I'm willing to tackle (a); and I have already created a tracking page for (c) [1] but could use some more authoritative help. (The tracking page also includes a description of options for producing accurate metadata when updating files.) Perhaps others in the community could pursue (b), either with or without further discussion in this list?

This finesses the heartfelt argument of the subject line: whether ACDD should include those attributes that can be derived from the data. Successfully pursuing the agreed (?) goals A-C should reduce the intensity of that issue, and may provide new perspective going forward.

John

[1] http://wiki.esipfed.org/index.php/NetCDF_Utilities_Metadata_Handling#Table_of_Data_Product_Utilities

On Mar 20, 2014, at 09:24, Steve Hankin <steven.c.hankin at noaa.gov> wrote:

Hi Ken, et. al.,

I am grateful for your suggestion that the topic be raised and discussed.  Now, can that discussion be broadened to include alternative strategies by which you might achieve your goals (and hopefully avoid the down sides)?

Consider the suggestion that Dave Neufeld just offered:  "server technologies [should] implement a caching/refresh strategy (Hyrax/THREDDS) for geospatial and temporal bounds".  This is a recognition that the static geo-positioning attributes are unreliable and redundant (can readily be recomputed).  Why not take this insight to the next step:  recommend that servers synthesize the global bounds.  Between ncISO, ncdump, and the major Web servers you'll be close to a clean sweep of the use cases where the global attributes contribute significant value.  (They make negligible contributions to users of desktop analysis and visualization tools.)
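
For concreteness, here is a minimal sketch of what such server-side synthesis might look like, assuming Python with the netCDF4 library and coordinate variables named "lat", "lon", and "time" carrying CF units (all assumptions; servers such as THREDDS or Hyrax would of course implement this internally):

    # Sketch: derive the global bounds from the coordinate variables
    # instead of trusting static attributes.  Names are assumptions.
    from netCDF4 import Dataset, num2date

    def synthesize_bounds(path):
        with Dataset(path) as ds:
            lat = ds.variables["lat"][:]
            lon = ds.variables["lon"][:]
            time = ds.variables["time"]
            t = time[:]
            return {
                "geospatial_lat_min": float(lat.min()),
                "geospatial_lat_max": float(lat.max()),
                "geospatial_lon_min": float(lon.min()),
                "geospatial_lon_max": float(lon.max()),
                # Convert numeric times to calendar dates via the CF units string.
                "time_coverage_start": str(num2date(t.min(), time.units)),
                "time_coverage_end": str(num2date(t.max(), time.units)),
            }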

There is of course a down side to this approach:  what is the leverage to force these software changes to be made?  The current strategy offers more leverage, because it breaks CF software, making the software look bad and the Web practitioners look sloppy.  (I've already railed against that.)  Following this strategy will require getting some major players on board through persuasion.

    - Steve

=====================================================

On 3/20/2014 4:17 AM, Kenneth S. Casey - NOAA Federal wrote:
Hi All -

Perhaps this group could lay out a simple "proposal" of sorts… that could be discussed and refined in this thread, and agreed to at ESIP Rocky Mountain High this summer if not sooner.  Perhaps that proposal would look something like:

"Dear Software Providers:  Please do the right thing with global attributes, and properly update spatial and temporal bounding attributes when you modify a netCDF file and either re-write or create a new one.  While you are at it, add some info to the history attribute too like you are supposed to.  In the meantime, dear community, be wary of global attributes that relate to coordinate variables.. trust the coordinate variables and if you notice a discrepancy with their corresponding global attributes SCREAM VERY LOUDLY at the provider of the software which generated that netCDF file."

Specific actions could then be requested of the big players to make the appropriate updates to their code.

I think we need global attributes in general, even ones relating to coordinate variables.  Everything said here about coordinate attributes actually applies more generally…  many, many of the global attributes can and should be updated depending on the provenance of the file and who did what to it.  The only difference is that the attributes relating to coordinate variables can actually be tested against the data.

I'd add one other point… while computing a max/min on the coordinate variables is not too terrible in itself, much of the time (especially with netCDF-3) you have to decompress the entire file first, and that is computationally terrible for large numbers of large, externally compressed files (like we have with GHRSST, for example… I love that the GHRSST Data Specification v2 now uses netCDF-4 with internal compression!).

Ken



On Mar 19, 2014, at 11:12 PM, "Signell, Richard" <rsignell at usgs.gov> wrote:

Gang,
I understand the importance of having the bounds information in metadata
-- in fact we start our workflows by querying catalog services that
use the bounding box information contained in the ISO metadata.  But that
ISO metadata was calculated by ncISO by reading the CF coordinate
variables via OPeNDAP, and the metadata points to the OPeNDAP service
endpoint, so I know that the bounds data is correct.

It would seem that NASA, OCEANSITES, and others could use this
approach as well, which would yield the same functionality as reading
metadata from the actual dataset, but without the drawbacks.

Having read all the arguments so far,  I'm going to continue
recommending that people not write these bounds attributes into their
datasets, because I remain convinced they do our community more harm
than good.  But I'll explain to them the arguments for and against.

-Rich

On Wed, Mar 19, 2014 at 5:43 PM, Armstrong, Edward M (398M)
<Edward.M.Armstrong at jpl.nasa.gov> wrote:
Hello,

Just to continue this thread on how a popular tool works… I checked the
output of LAS and it does not update any attributes; it just inherits them from
the dataset it natively subsetted.  It includes this global attribute:

:FERRET_comment = "File written via LAS. Attributes are inherited from
originating dataset";



On Mar 18, 2014, at 10:42 AM, Ted Habermann <thabermann at hdfgroup.org> wrote:

All,

Just wanted to point out that CF is really use metadata, i.e. metadata used by
tools that are actually reading the data. ACDD is a discovery convention, originally
motivated by the lack of discovery information included in CF. The opinion
of the CF mailing list is, therefore, not relevant in this discussion.

I agree with Ken and Ed... There are two sets of metadata because they serve
two communities. We need to do our best to make sure they are both correct.

Seems like the netCDF download service in THREDDS is a good place to
start... Does anyone know how it behaves with global attributes? What about
LAS?

Ted

On Mar 18, 2014, at 10:52 AM, Armstrong, Edward M (398M)
<Edward.M.Armstrong at jpl.nasa.gov> wrote:

Hi Steve,

I do agree with Ken and Ted on this issue… also from experience working with
granules in a large data center.

I think the crux of the counter-argument is that, from the perspective of the
data producer, we want them to add more metadata to the native granule, not
less.  And many people do use these global bounds in some context, mostly
just browsing the granules with a tool like ncdump or Panoply.
Space and time bounds are just a natural thing most people look at in a
granule when they first acquire it.

The key, then, is to recognize when the granule has been altered by a tool and
to treat the bounds with caution.  Nan had a great idea that tools should point
back to the original unaltered granule.  I think this is the first requirement
for any tool that modifies the granule via subsetting or similar
operations.
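
A subsetting tool could record that pointer with something like the sketch below (Python with the netCDF4 library; the "source_granule" attribute name and the comment wording are purely illustrative, not an agreed convention):

    # Sketch: when writing a subsetted granule, record where it came from so
    # readers know the inherited global attributes may describe the original.
    from netCDF4 import Dataset

    def tag_provenance(subset_path, original_url):
        with Dataset(subset_path, "a") as ds:
            ds.setncattr("source_granule", original_url)  # illustrative name
            ds.setncattr("comment",
                         "Subset of the granule named in source_granule; the "
                         "inherited bounds attributes may not match this subset.")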

And the next step is to encourage the updating of global bounds by any tool
performing subsetting or aggregation operations. I do agree that is a challenge.


On Mar 18, 2014, at 8:51 AM, Steve Hankin <steven.c.hankin at noaa.gov> wrote:

Hi Ken,

I think you are actually agreeing with me, rather than Ted.  The text that I
proposed did not say geo-bounds attributes should be forbidden.  It said
that they should be used with caution -- only in situations where they are
unlikely to lead to corruption.

It is recommended that 1) the use of these global attributes be restricted
to files whose contents are known to be completely stable -- i.e. files very
unlikely to be aggregated into larger collections; and 2) as a matter of
best practice, software reading CF files should ignore these global
attributes; instead it should compute the geo-spatial bounds by scanning the
coordinate ranges found within the CF dataset itself.

These words can certainly be adjusted and improved.  (The discussion that I
am still hoping will happen!)  The intent is to apply common-sense,
practical thinking so that our interoperability frameworks work in
practice.

You have argued that at NODC you *need* these geo-bounds attributes in order
to avoid the impossibly large processing burden of examining the data itself.
What would your strategy be if the attributes proved not to be trustworthy?
Perhaps within the community of satellite swath folks you can agree that you
will all maintain these attributes faithfully.  Great.  Do so.  That is in
the spirit of the discussion that we should be having.  But should you speak
for the modeling community, too?  (I presume the NODC granules must be swath
data, because if they were grids, then the processing required to determine
the bounds from the data would be negligible.)

Let's look at the context of this discussion.  What we are seeing here is a
collision between the priorities of two communities involved in developing
standards.  Our emails here are largely confined to one of those communities.
Envision taking this discussion topic to the CF email list.  I predict what
you would see is that the potential of redundant, static global attributes
to corrupt the datasets would fly up as a giant red flag.  It
is not only aggregation operations that will lead to corruption; it is also
subsetting operations -- the most routine of all netCDF operations, and ones
performed by generic utilities that are unaware of CF or ACDD.  These are
glaring, unsolved problems in the use of these geo-bounds.  There is such a
strong case for being more nuanced when standardizing attributes that we can
plainly see are going to lead to many corrupted datasets.  What is the
counter-argument for ignoring these self-evident problems?

   - Steve

===========================================


On 3/18/2014 3:48 AM, Kenneth S. Casey - NOAA Federal wrote:

Hi All -

I've been silent but following this thread carefully.  Time to jump in now.

I concur with Ted's statements below.  I would characterize his responses as
a good example of combating the "Tyranny of the Or".  It doesn't have to be
one solution or the other.  For us at NODC, processing literally tens and
maybe hundreds of millions of netCDF granules, having global attributes that
can be read easily and quickly is not a convenience.  It is a practical necessity.
We also like the idea of encouraging softwarians to write better software.
Corrupting attributes through negligence and inaction is not acceptable.
Fix it.  If they can't, we can encourage our users to stop using that
software.

I love the idea of building the congruence checker into the ACDD rubric and
catalog cleaner, and I think an "ncdump -bounds" option, where the result is
calculated from the actual coordinate values, is terrific too.  These kinds of
additions to existing tools would help encourage better practices and would
give us some simple tools to improve our management of netCDF data.

Ken




------------------------------------
John Graybeal
Marine Data Manager

M +1 408 675-5445
skype: graybealski
Marinexplore
920 Stewart Drive
Sunnyvale 94085
California, USA
www.marinexplore.com


-ed

Ed Armstrong
JPL Physical Oceanography DAAC
818 519-7607


