[Esip-documentation] Let's get rid of spatial and temporal bounds in ACDD

Steve Hankin steven.c.hankin at noaa.gov
Mon Mar 17 15:22:02 EDT 2014


Hi John, Rich,

The 'bananas' analogy fits right into the discussion below, so I'm not 
commenting on it specifically.    I like it as a good metaphor for the 
question we are debating.

On 3/17/2014 10:51 AM, John Graybeal wrote:
> Steve,
>
> Let me see if you take issue with my simplified version of what you said:
>
> *"NetCDF attributes should never be used to describe value-based or 
> processing-specific features of any data set, because data processors 
> (tools and people) can't help but corrupt derivative data sets with 
> that information. "*

I like this line of reasoning for understanding the issues, John. But 
the above is not a sufficiently nuanced statement to capture what I have 
advocated.  Try this:

    "_Broad_ standardization of NetCDF attributes that contain
    value-based or processing-derived information about a data set,
    should be avoided -- used only if there is no reasonable
    alternative.  Use of such attributes breaks the 'backwards
    compatibility' goals that evolving software standards should
    follow.  It 'breaks' existing systems and places a burden upon
    future systems that may modify or extend the contents of that dataset."

Standardization of such attributes for special communities that run 
custom systems presents no serious problems.  CF allows itself to be 
arbitrarily extended for such special purposes.  Each community will 
look after its own software quality.  But when you undertake a broad 
standardization of such attributes, you are placing a burden on software 
systems over which you have no knowledge and no control.

>
> Don't get me wrong; I think this presents a consistent philosophy in 
> response to today's practical realities. As of today, with 5 out of 6 
> tools failing the 'metadata consistency' test, it likely minimizes the 
> ratio of bad metadata in the wild.
Thank you for this.
>
> What it does *not* do is establish a mature, flexible, interoperable, 
> metadata-aware community of practice going forward. It also does not 
> fix the problem caused by these tools, it just eliminates the 
> most-likely-to-fail attributes (from any standard, forevermore). And 
> it directly undercuts the greatest value of a self-describing data 
> format, the easy-to-access description.
>
As Ted pointed out, the issue is much broader and deeper than this one 
particular use case.  Dynamically changing datasets (including the 
'virtual datasets' created through aggregation) create a class of 
problems that are in fundamental conflict with static metadata 
representations.   Do we agree that virtual datasets are not going 
to go away?   The class of datasets that CF is most centrally committed 
to are often extremely large and evolve over time (both the 'C' and the 
'F' in CF).
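
To make the conflict concrete, here is a minimal sketch in plain Python. 
It is not tied to any real netCDF library: the Dataset class, the 
naive_aggregate helper and the sample values are all hypothetical, and 
plain numbers stand in for timestamps.  Only the attribute names 
time_coverage_start and time_coverage_end are taken from ACDD.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a netCDF file: a time axis plus global attributes."""
    time: list
    attrs: dict = field(default_factory=dict)


def naive_aggregate(first, second):
    """Join two datasets along time, copying the first file's attributes unchanged.

    This mimics an aggregator that knows nothing about the bounds attributes.
    """
    return Dataset(time=first.time + second.time, attrs=dict(first.attrs))


def attrs_match_coordinates(ds):
    """True when the stored time bounds agree with the time coordinate itself."""
    return (ds.attrs.get("time_coverage_start") == min(ds.time)
            and ds.attrs.get("time_coverage_end") == max(ds.time))


# Two made-up files, each internally consistent when written.
jan = Dataset(time=[0, 10, 20, 30],
              attrs={"time_coverage_start": 0, "time_coverage_end": 30})
feb = Dataset(time=[40, 50],
              attrs={"time_coverage_start": 40, "time_coverage_end": 50})

combined = naive_aggregate(jan, feb)

print(attrs_match_coordinates(jan))       # True: written together with the data
print(attrs_match_coordinates(combined))  # False: the copied end bound is stale
print(combined.attrs["time_coverage_end"], max(combined.time))  # 30 50
```

The aggregator in this sketch did nothing unusual, yet the static 
attribute now contradicts the coordinates it was meant to summarize.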

> The fact that much software pre-dates the standards is a red herring.

Really?  We watch standards self-destruct time after time because they 
fail to address pragmatic considerations of this very type. Backwards 
compatibility deserves to be a paramount consideration. Do we really 
disagree on this?

> Software that takes any self-describing file, modifies the contents, 
> yet passes through the original description _without validation_ to 
> the output file, will almost always produce a broken output file. This 
> isn't a feature, nor inevitable; it's a bug. Our underlying challenge 
> is to fix that problem.
>
> And in the meantime, we still have to maintain ALL the metadata 
> affected by that problem, not just the CF coordinate bounds. So we 
> should fix all that software soon. (As I can think of 3 trivial fixes 
> off the top of my head, I'm not sure why previous beat-arounds didn't 
> induce change. It's time. And it's a lot easier than modifying all the 
> software to include statements in the files about how unreliable the 
> metadata is.)
>
I agree with you on this ... at least in principle:  Those responsible 
for the software systems should make a good faith effort to upgrade them 
in response to evolving standards.  But they may simply lack the 
resources.  Or they may be unable to fit the fixes into hard-pressed 
schedules for some time to come.  In the meantime we have corrupted 
metadata.  We are living in lean times for our community.  
Considerations like this have to be a two-way street: those responsible 
for evolving software standards need to minimize so-called 
'enhancements' that break existing software.  They need to look for 
alternatives first.

     - Steve

> John
>
> On Mar 17, 2014, at 09:23, Steve Hankin <steven.c.hankin at noaa.gov>
> wrote:
>
>> Greetings Ted!
>>
>> As rants go, the message you sent on Friday was pretty restrained.   
>> I particularly like your suggestion that the UAF Catalog Cleaner 
>> could detect and report corrupted ACDD geo-positioning attributes.  
>> We will see what we can do with that idea. (aside:  Is there a way to 
>> ask ncISO when it has found this form of corruption in a dataset?)
>>
>> It would be nice if we could all wrestle this topic to a workable 
>> compromise.  I agree that the problem should be "fixed at the 
>> source".   But characterizing the source of the problem as "sloppy 
>> data management" seems off the mark.   This data management problem 
>> didn't exist until we created the potential for easy corruption by 
>> _defining easily-corrupted, redundant information in the CF datasets_.
>>
>> The duplication of information between the global attributes and the 
>> CF coordinates is an 'attractive nuisance'.   No matter how much we 
>> exhort data managers to clean up their act, the problem is going to 
>> continue showing up over and over and over.   Some of the data 
>> management tools that stumble into this problem are not even 
>> CF-aware;  generic netCDF utilities like nco create this form of 
>> corruption.   Offhand I can think of 6 independent pieces of software 
>> that perform aggregations on CF datasets.  Five of these predate the 
>> use of the ACDD geo-bounds attributes within CF files;  all of these 
>> exhibit this corruption.  Ed's example is yet another.
>>
>> Our underlying challenge is to expose the CF coordinate bounds for 
>> easy use wherever that information is needed for purposes of data 
>> discovery.   The ncISO tool contributed by your group has addressed 
>> this very successfully for formal ISO metadata.  (A big thanks to 
>> you, Dave N. et al.)  There is similar excellent potential to 
>> address the more informal cases through software.  You mentioned the 
>> limitations of "ncdump  -h" as an illustration.  How about code 
>> contributed to Unidata to create "ncdump -bounds"?    This would be a 
>> smaller effort with a more robust outcome than to ask all current and 
>> future developers of CF aggregation techniques to make accommodation 
>> for the redundant, easily corrupted attributes in their datasets.
>>
>>     - Steve
>>
>> ===========================================
>>
>> On 3/14/2014 1:14 PM, Ted Habermann wrote:
>>> All,
>>>
>>> I agree with Ed and John, *this is a software tool problem that 
>>> should be fixed at the source.* The description of the history 
>>> attribute has always implied that it should be updated when a file 
>>> is processed (even though, IMHO, it is almost entirely unsuited for 
>>> doing that). The same is true for many others (listed by John G. 
>>> earlier in this thread). *The current practice is sloppy data 
>>> management* that, from the sound of this thread, is pervasive in the 
>>> community. Of course, ncISO provides a very easy way to identify 
>>> occurrences of this problem throughout a THREDDS catalog. The 
>>> "Catalog Cleaner" is another venue for quantifying the damage. CF is 
>>> a community standard. Maybe it is time for the community to 
>>> recommend providing correct metadata with the files and to avoid 
>>> developers and datasets that don't.
>>>
>>> A related problem is that the bounds calculated from the data are 
>>> only available if you read the data. Many users may not be equipped 
>>> to easily read the data during a data discovery process. They may 
>>> not want to go beyond ncdump -x -h (or something like that) before 
>>> they fire up the whole netCDF machine...
>>>
>>> BTW, this problem is trivial relative to that associated with 
>>> virtual datasets created through aggregation. In those cases, there 
>>> is no clear mechanism for providing meaningful metadata, although 
>>> the rich inventory we created several years ago comes close... That 
>>> situation is much more prone to mistakes as all semblance of the 
>>> historic record is wiped out.
>>>
>>> It's Friday, and spring... As Dave said last week, a good time for a 
>>> rant!
>>> Ted
>>>
>>>
>>> On Mar 14, 2014, at 1:44 PM, Steve Hankin <steven.c.hankin at noaa.gov>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm joining into this discussion from the wings.  The topic here -- 
>>>> the common tendency for the ACDD geo-spatio-temporal bounds 
>>>> attributes to get corrupted -- has been beaten around a number of 
>>>> times among different groups.  At this point it isn't clear that 
>>>> there is a "clean" resolution to the problem;  there are already so 
>>>> many files out there that contain these attributes that there may 
>>>> be no easy way to unwind the problem.  Might the best path forward 
>>>> be to see about adding some words of caution into the documents 
>>>> that suggest the use of these attributes?  Something along these lines:
>>>>
>>>>     *Caution:*  The encoding of geo-spatial bounds values as
>>>>     global attributes is a practice that should be used with
>>>>     caution or avoided.
>>>>
>>>>     The encoding of geo-spatial bounds values as global
>>>>     attributes introduces a high likelihood of corruption, because
>>>>     the attribute values duplicate information already contained in
>>>>     the self-describing coordinates of the dataset.   A number of
>>>>     data management operations that are common with netCDF files
>>>>     will invalidate the values stored as global attributes.  Such
>>>>     operations include extending the coordinate range of a netCDF
>>>>     file along its record axis;  aggregating a collection of netCDF
>>>>     files into a larger dataset (for example aggregating model
>>>>     outputs along their time axes); or appending files using
>>>>     file-based utilities (e.g. nco).
>>>>
>>>>     It is recommended that 1) the use of these global attributes
>>>>     be restricted to files whose contents are known to be
>>>>     completely stable -- i.e. files very unlikely to be aggregated
>>>>     into larger collections;  and 2) as a matter of best practice,
>>>>     software reading CF files should ignore these global
>>>>     attributes; instead it should compute the geo-spatial bounds by
>>>>     scanning the coordinate ranges found within the CF dataset
>>>>     itself.
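
Recommendation 2) above might look roughly like the following sketch. 
It assumes the latitude and longitude coordinate values have already 
been read into plain Python lists; the function names and sample values 
are hypothetical.  The geospatial_* attribute names are the ACDD ones. 
The simple min/max scan ignores longitude wrap-around at the dateline, 
which a real reader would need to handle.

```python
def bounds_from_coordinates(lat, lon):
    """Derive geo-spatial bounds by scanning the coordinate values themselves."""
    return {
        "geospatial_lat_min": min(lat),
        "geospatial_lat_max": max(lat),
        "geospatial_lon_min": min(lon),
        "geospatial_lon_max": max(lon),
    }


def stale_bounds(global_attrs, lat, lon, tol=1e-6):
    """Return the names of bounds attributes that disagree with the coordinates.

    Attributes that are absent are skipped; only contradictions are reported.
    """
    computed = bounds_from_coordinates(lat, lon)
    return sorted(
        name for name, value in computed.items()
        if name in global_attrs and abs(global_attrs[name] - value) > tol
    )


# Made-up subset of a global grid whose attributes were never updated.
attrs = {"geospatial_lat_min": -90.0, "geospatial_lat_max": 90.0,
         "geospatial_lon_min": -180.0, "geospatial_lon_max": 180.0}
lat = [10.0, 10.5, 11.0]
lon = [-45.0, -44.5, -44.0]

print(bounds_from_coordinates(lat, lon))
print(stale_bounds(attrs, lat, lon))
```

A reader built this way never trusts the attributes; a checker such as 
stale_bounds only reports where they have drifted.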
>>>>
>>>> Comments?
>>>>
>>>>     - Steve
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> From: Armstrong, Edward M (398M) <Edward.M.Armstrong at jpl.nasa.gov>
>>>> Date: Wed, Mar 12, 2014 at 6:03 PM
>>>> Subject: Re: [Esip-documentation] Let's get rid of spatial and
>>>> temporal bounds in ACDD
>>>> To: Nan Galbraith <ngalbraith at whoi.edu>
>>>> Cc: Cluster Documentation <esip-documentation at lists.esipfed.org>
>>>>
>>>>
>>>> I think that is an excellent idea.
>>>>
>>>> The output of the PO.DAAC HiTIDE subsetter that I mentioned in a
>>>> previous email does exactly that and includes some useful information
>>>> about the granule and the wrapped subsetting request via OPeNDAP.
>>>> Below is a snapshot of some of the global attributes from  a subsetted
>>>> AVHRR SST granule (look at the naiad_ attributes):
>>>>
>>>>   :southernmost_latitude = -89.72987f; // float
>>>>   :northernmost_latitude = 89.80405f; // float
>>>>   :westernmost_longitude = -179.9997f; // float
>>>>   :easternmost_longitude = 179.99994f; // float
>>>>   :file_quality_index = 1S; // short
>>>>   :comment = "none";
>>>>   :naiad_download_date = "2014-03-10 21:25:47";
>>>>   :naiad_granule_url =
>>>> "http://podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/L2P/AVHRR19_G/NAVO/2013/358/20131224-AVHRR19_G-NAVO-L2P-SST_s0827_e1009-v01.nc.bz2";
>>>>   :naiad_constraint_expression =
>>>> "lat[8000:1:9200][104:1:408],lon[8000:1:9200][104:1:408],time[0:1:0],sst_dtime[0:1:0][8000:1:9200][104:1:408],rejection_flag[0:1:0][8000:1:9200][104:1:408],SSES_bias_error[0:1:0][8000:1:9200][104:1:408],aod_dtime_from_sst[0:1:0][8000:1:9200][104:1:408],DT_analysis[0:1:0][8000:1:9200][104:1:408],brightness_temperature_11um[0:1:0][8000:1:9200][104:1:408],aerosol_optical_depth[0:1:0][8000:1:9200][104:1:408],sources_of_aod[0:1:0][8000:1:9200][104:1:408],confidence_flag[0:1:0][8000:1:9200][104:1:408],brightness_temperature_4um[0:1:0][8000:1:9200][104:1:408],SSES_standard_deviation_error[0:1:0][8000:1:9200][104:1:408],sea_surface_temperature[0:1:0][8000:1:9200][104:1:408],brightness_temperature_12um[0:1:0][8000:1:9200][104:1:408],proximity_confidence[0:1:0][8000:1:9200][104:1:408],satellite_zenith_angle[0:1:0][8000:1:9200][104:1:408]";
>>>> }
>>>>
>>>> I did misspeak earlier when I indicated that the spatial bounds were
>>>> also updated, in this case southernmost_latitude etc. They are the
>>>> original (global) bounds.  I have requested to the developer that
>>>> these bounds be updated for every subset request. Hopefully it will
>>>> get in the next version.
>>>>
>>>>
>>>>
>>>> On Mar 8, 2014, at 3:13 AM, Nan Galbraith <ngalbraith at whoi.edu> wrote:
>>>>
>>>> > Hows about adding an attribute that contains the URL of the data
>>>> > file to which the bounds apply? If your aggregator/sub-setter
>>>> > has misled you by failing to update the bounds attribute, he's also
>>>> > provided you with the link to the data you actually wanted.
>>>> >
>>>> > For programs that collect lots of data and don't molest it, this
>>>> > would let them continue to use the bounds atts; for programs that
>>>> > slice and dice, they'd be motivated to either update the fields
>>>> > dynamically or remove them.
>>>> >
>>>> > Cheers - Nan
>>>> >
>>>> > Is it really Friday? I'm at sea (yes, people still do that,
>>>> > sometimes) and I've lost all sense of time and place (no geospatial
>>>> > and temporal bounds information).
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On 3/7/14 6:47 PM, David Neufeld - NOAA Affiliate wrote:
>>>> >>
>>>> >> Ok, since it's Friday and we are in rant mode, I'm going to have
>>>> >> a little fun with this one...
>>>> >>
>>>> >> Recommended Attribute Disclaimer:
>>>> >>
>>>> >> Suggested text: "When using geospatial and temporal bounds
>>>> >> information in your global attributes, please know that it
>>>> >> introduces a likely source of error and that you are far better off
>>>> >> reading these values from the data stored in the file.  If you do
>>>> >> choose to use the attributes, please also include a global checksum
>>>> >> attribute that humans can look at to decide whether the file has
>>>> >> changed since you originally recorded these values."
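
The checksum idea in the suggested text could be sketched as follows. 
This is only an illustration under stated assumptions: the attribute 
name bounds_checksum is hypothetical (it is not part of ACDD or CF), 
the coordinates are plain Python lists, and the check is done by 
software rather than by a human reading the value.

```python
import hashlib
import struct


def coordinate_checksum(*axes):
    """SHA-256 over the coordinate values, packed as big-endian doubles."""
    digest = hashlib.sha256()
    for axis in axes:
        # Include each axis length so values cannot shift between axes unnoticed.
        digest.update(struct.pack(">I", len(axis)))
        digest.update(struct.pack(f">{len(axis)}d", *axis))
    return digest.hexdigest()


def write_bounds(attrs, lat, lon):
    """Record the bounds together with a checksum of the coordinates they describe."""
    attrs.update({
        "geospatial_lat_min": min(lat),
        "geospatial_lat_max": max(lat),
        "geospatial_lon_min": min(lon),
        "geospatial_lon_max": max(lon),
        "bounds_checksum": coordinate_checksum(lat, lon),  # hypothetical attribute
    })
    return attrs


def bounds_still_valid(attrs, lat, lon):
    """True only if the coordinates are unchanged since the bounds were recorded."""
    return attrs.get("bounds_checksum") == coordinate_checksum(lat, lon)


lat = [10.0, 10.5, 11.0]
lon = [-45.0, -44.5, -44.0]
attrs = write_bounds({}, lat, lon)
print(bounds_still_valid(attrs, lat, lon))           # unchanged coordinates

extended_lat = lat + [11.5]                          # e.g. after an append
print(bounds_still_valid(attrs, extended_lat, lon))  # coordinates have changed
```

A tool that copies the attributes through unchanged would copy the old 
checksum too, so a mismatch flags the bounds as untrustworthy.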
>>>> >>
>>>> >> On Fri, Mar 7, 2014 at 2:02 PM, Nan Galbraith
>>>> >> <ngalbraith at whoi.edu> wrote:
>>>> >>
>>>> >>
>>>> >>
>>>> >>    Maybe we should add some text about which attributes should be
>>>> >>    considered fragile and under what conditions they need to be
>>>> >>    recalculated or removed, but I'm not in favor of removing the
>>>> >>    terms from ACDD.
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> > --
>>>> > *******************************************************
>>>> > * Nan Galbraith (508) 289-2444 *
>>>> > * Upper Ocean Processes Group Mail Stop 29 *
>>>> > * Woods Hole Oceanographic Institution                *
>>>> > * Woods Hole, MA 02543                                *
>>>> > *******************************************************
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > Esip-documentation mailing list
>>>> > Esip-documentation at lists.esipfed.org
>>>> > http://www.lists.esipfed.org/mailman/listinfo/esip-documentation
>>>>
>>>> -ed
>>>>
>>>> Ed Armstrong
>>>> JPL Physical Oceanography DAAC
>>>> 818 519-7607
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Dr. Richard P. Signell   (508) 457-2229
>>>> USGS, 384 Woods Hole Rd.
>>>> Woods Hole, MA 02543-1598
>>>>
>>>
>>>
>>
>
> John Graybeal
> jbgraybeal at mindspring.com
>
