[Esip-documentation] Let's get rid of spatial and temporal bounds in ACDD

Steve Hankin steven.c.hankin at noaa.gov
Tue Mar 18 11:51:23 EDT 2014


Hi Ken,

I think you are actually agreeing with me, rather than Ted.  The text 
that I proposed did not say geo-bounds attributes should be forbidden.  
It said that they should be used with caution -- only in situations 
where they are unlikely to lead to corruption.

    It is recommended that 1) the use of these global attributes be
    restricted to files whose contents are known to be completely stable
    -- i.e. files very unlikely to be aggregated into larger
    collections; and 2) as a matter of best practice, software reading
    CF files should ignore these global attributes; instead it should
    compute the geo-spatial bounds by scanning the coordinate ranges
    found within the CF dataset itself.

These words can certainly be adjusted and improved.  (The discussion 
that I am still hoping will happen!)  The intent is to apply 
common-sense, practical thinking so that our interoperability 
framework works in practice.
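
For concreteness, here is a minimal sketch of what point 2) of the 
recommendation above could look like in practice -- Python with the 
netCDF4 package, with coordinate variables assumed to be named "lat" 
and "lon" (a CF-aware reader would locate the coordinates from their 
attributes rather than by name):

    from netCDF4 import Dataset

    def scan_geospatial_bounds(path, lat_name="lat", lon_name="lon"):
        """Derive geo-spatial bounds from the coordinate variables
        themselves, rather than trusting the global attributes."""
        with Dataset(path) as ds:
            lat = ds.variables[lat_name][:]
            lon = ds.variables[lon_name][:]
            return {
                "geospatial_lat_min": float(lat.min()),
                "geospatial_lat_max": float(lat.max()),
                "geospatial_lon_min": float(lon.min()),
                "geospatial_lon_max": float(lon.max()),
            }

    # e.g. scan_geospatial_bounds("some_granule.nc")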

You have argued that at NODC you *need* these geo-bounds attributes in 
order to avoid the impossibly large processing burden of examining the 
data itself.  What would your strategy be if the attributes proved not 
to be trustworthy?  Perhaps within the community of satellite swath 
folks you can agree that you will all maintain these attributes 
faithfully.  Great.  Do so.  That is in the spirit of the discussion 
that we should be having.  But should you speak for the modeling 
community, too?  (I presume the NODC granules must be swath data, 
because if they were grids, then the processing required to determine 
the bounds from the data would be negligible.)

Let's look at the context of this discussion.  What we are seeing here 
is a collision between the priorities of two communities involved in 
developing standards.  Our emails here are largely confined to one of 
those communities.  Envision taking this discussion topic to the CF 
email list.  I predict that the potential of redundant, static global 
attributes to corrupt datasets would fly up as a giant red flag.  It is 
not only aggregation operations that will lead to corruption; it is 
also subsetting operations -- the most routine of all netCDF 
operations, and ones performed by generic utilities that are unaware 
of CF or ACDD.  These are glaring, unsolved problems in the use of 
these geo-bounds attributes.  There is a strong case for being more 
nuanced when standardizing attributes that we can plainly see are 
going to lead to many corrupted datasets.  What is the counter-argument 
for ignoring these self-evident problems?
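
To make the failure mode concrete, here is a minimal sketch of a 
generic, CF-unaware subsetting step, written in Python with the 
netCDF4 package.  The file names, the "time" dimension name, the flat 
file layout, and the assumption that time is the leftmost axis of each 
time-dependent variable are all illustrative:

    from netCDF4 import Dataset

    src = Dataset("granule.nc")        # hypothetical input file
    dst = Dataset("subset.nc", "w")
    nkeep = 10                         # keep only the first 10 records

    for name, dim in src.dimensions.items():
        dst.createDimension(name, nkeep if name == "time" else dim.size)

    for name, var in src.variables.items():
        out = dst.createVariable(name, var.dtype, var.dimensions)
        # _FillValue can only be set at variable creation time
        out.setncatts({k: var.getncattr(k) for k in var.ncattrs()
                       if k != "_FillValue"})
        out[:] = var[:nkeep] if "time" in var.dimensions else var[:]

    # The step that does the damage: global attributes are copied
    # verbatim, so geospatial_* and time_coverage_* still describe the
    # original file's extent, not the subset just written.
    dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})

    src.close()
    dst.close()

Nothing here is exotic; it is the same copy-the-attributes pattern that 
generic utilities follow, and the stale bounds come along for the ride.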

     - Steve

===========================================


On 3/18/2014 3:48 AM, Kenneth S. Casey - NOAA Federal wrote:
> Hi All -
>
> I've been silent but following this thread carefully.  Time to jump in 
> now.
>
> I concur with Ted's statements below.  I would characterize his 
> responses as a good example of combating the "Tyranny of the Or".  It 
> doesn't have to be one solution or the other.  For us at NODC, 
> processing literally tens and maybe hundreds of millions of netCDF 
> granules, having global attributes that can be read easily and 
> quickly is not a convenience.  It is a practical necessity.  We also 
> like the idea of encouraging softwarians to write better software.  
> Corrupting attributes through negligence and inaction is not 
> acceptable.  Fix it.  If they can't, we can encourage our users to 
> stop using that software.
>
> I love the idea of building the congruence checker into the ACDD 
> rubric and catalog cleaner, and I think an "ncdump -bounds" option, 
> where the result is calculated from the actual bounds, is terrific 
> too.  These kinds of additions to existing tools would help encourage 
> better practices and would give us some simple tools to improve our 
> management of netCDF data.
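>
> A minimal sketch of such a congruence check -- Python with the netCDF4 
> package; the coordinate variable names and the tolerance are 
> illustrative only, not part of any existing tool:
>
>     from netCDF4 import Dataset
>
>     TOLERANCE = 1e-5  # illustrative; a real checker would be smarter
>
>     def check_geobounds(path, lat_name="lat", lon_name="lon"):
>         """List ACDD geo-bounds attributes that disagree with the
>         dataset's own coordinate ranges."""
>         problems = []
>         with Dataset(path) as ds:
>             lat = ds.variables[lat_name][:]
>             lon = ds.variables[lon_name][:]
>             computed = {
>                 "geospatial_lat_min": float(lat.min()),
>                 "geospatial_lat_max": float(lat.max()),
>                 "geospatial_lon_min": float(lon.min()),
>                 "geospatial_lon_max": float(lon.max()),
>             }
>             for attr, actual in computed.items():
>                 if attr in ds.ncattrs():
>                     stored = float(ds.getncattr(attr))
>                     if abs(stored - actual) > TOLERANCE:
>                         problems.append((attr, stored, actual))
>         return problems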
>
> Ken
>
>
>
> On Mar 17, 2014, at 4:35 PM, Ted Habermann <thabermann at hdfgroup.org> wrote:
>
>> All,
>>
>> This discussion is driving me bananas!
>>
>> I would argue (vehemently) against any recommendation that says 
>> anything like "don't add bounds elements to your global attributes". 
>> To encourage people to provide less metadata is just not acceptable. 
>> Much more palatable to:
>> 1) offer a general warning to users - all data and metadata have 
>> varying quality - test before you leap and know how the data you have 
>> was processed;
>> 2) start a concerted effort to encourage software developers to 
>> think about the metadata in the files they create; and
>> 3) add a congruence check into the ACDD rubric / catalog cleaner; it 
>> would be very interesting to know how many datasets in the "clean" 
>> catalog suffer from inconsistencies between the data and the global 
>> attributes.
>>
>> As I said earlier, there are much more significant problems in 
>> ACDD/CF metadata land than this one. Hopefully they will generate the 
>> same amount of interest as we move forward.
>>
>> Ted
>>
>>
>> On Mar 17, 2014, at 1:22 PM, Steve Hankin <steven.c.hankin at noaa.gov> wrote:
>>
>>> Hi John, Rich,
>>>
>>> The 'bananas' analogy fits right into the discussion below, so I'm 
>>> not commenting on it specifically.  I like it as a good metaphor 
>>> for the question we are debating.
>>>
>>> On 3/17/2014 10:51 AM, John Graybeal wrote:
>>>> Steve,
>>>>
>>>> Let me see if you take issue with my simplified version of what you 
>>>> said:
>>>>
>>>> "NetCDF attributes should never be used to describe value-based or 
>>>> processing-specific features of any data set, because data 
>>>> processors (tools and people) can't help but corrupt derivative 
>>>> data sets with that information."
>>>
>>> I like this line of reasoning for understanding the issues, John.  
>>> But the above is not a sufficiently nuanced statement to capture 
>>> what I have advocated.  Try this:
>>>
>>>     "Broad standardization of NetCDF attributes that contain
>>>     value-based or processing-derived information about a data set
>>>     should be avoided -- used only if there is no reasonable
>>>     alternative.  Use of such attributes breaks the 'backwards
>>>     compatibility' goals that evolving software standards should
>>>     follow.  It 'breaks' existing systems and places a burden upon
>>>     future systems that may modify or extend the contents of that
>>>     dataset."
>>>
>>> Standardization of such attributes for special communities that run 
>>> custom systems presents no serious problems.  CF allows itself to 
>>> be arbitrarily extended for such special purposes.  Each community 
>>> will look after its own software quality.  But when you undertake a 
>>> broad standardization of such attributes, you are placing a burden 
>>> on software systems of which you have no knowledge and over which 
>>> you have no control.
>>>
>>>>
>>>> Don't get me wrong; I think this presents a consistent philosophy 
>>>> in response to today's practical realities. As of today, with 5 out 
>>>> of 6 tools failing the 'metadata consistency' test, it likely 
>>>> minimizes the ratio of bad metadata in the wild.
>>> Thank you for this.
>>>>
>>>> What it does *not* do is establish a mature, flexible, 
>>>> interoperable, metadata-aware community of practice going forward. 
>>>> It also does not fix the problem caused by these tools, it just 
>>>> eliminates the most-likely-to-fail attributes (from any standard, 
>>>> forevermore). And it directly undercuts the greatest value of a 
>>>> self-describing data format, the easy-to-access description.
>>>>
>>> As Ted pointed out, the issue is much broader and deeper than this 
>>> one particular use case.  Dynamically changing datasets (including 
>>> the 'virtual datasets' created through aggregation) create a class 
>>> of problems that are in fundamental conflict with static metadata 
>>> representations.  Do we agree that virtual datasets are not going 
>>> to go away?  The datasets that CF is most centrally committed to 
>>> are often extremely large and evolve over time (both the 'C' and 
>>> the 'F' in CF).
>>>
>>>> The fact that much software pre-dates the standards is a red herring.
>>>
>>> Really?  We watch standards self-destruct time after time because 
>>> they fail to address pragmatic considerations of this very type.   
>>> Backwards compatibility deserves to be a paramount consideration.  
>>> Do we really disagree on this?
>>>
>>>> Software that takes any self-describing file, modifies the 
>>>> contents, yet passes through the original description _without 
>>>> validation_ to the output file, will almost always produce a broken 
>>>> output file. This isn't a feature, nor inevitable; it's a bug. Our 
>>>> underlying challenge is to fix that problem.
>>>>
>>>> And in the meantime, we still have to maintain ALL the metadata 
>>>> affected by that problem, not just the CF coordinate bounds. So we 
>>>> should fix all that software soon. (As I can think of 3 trivial 
>>>> fixes off the top of my head, I'm not sure why previous 
>>>> beat-arounds didn't induce change. It's time. And it's a lot easier 
>>>> than modifying all the software to include statements in the files 
>>>> about how unreliable the metadata is.)
>>>>
>>> I agree with you on this ... at least in principle:  Those 
>>> responsible for the software systems should make a good faith effort 
>>> to upgrade them in response to evolving standards.  But they may 
>>> simply lack the resources.  Or they may be unable to fit the fixes 
>>> into hard-pressed schedules for some time to come.  In the meantime 
>>> we have corrupted metadata.  We are living in lean times for our 
>>> community.  Considerations like this have to be a two-way street:  
>>> those responsible for evolving software standards need to minimize 
>>> so-called 'enhancements' that break existing software.  They need to 
>>> look for alternatives first.
>>>
>>>     - Steve
>>>
>>>> John
>>>>
>>>> On Mar 17, 2014, at 09:23, Steve Hankin <steven.c.hankin at noaa.gov> wrote:
>>>>
>>>>> Greetings Ted!
>>>>>
>>>>> As rants go, the message you sent on Friday was pretty restrained. 
>>>>>   I particularly like your suggestion that the UAF Catalog Cleaner 
>>>>> could detect and report corrupted ACDD geo-positioning 
>>>>> attributes.  We will see what we can do with that idea. (aside:  
>>>>> Is there a way to ask ncISO when it has found this form of 
>>>>> corruption in a dataset?)
>>>>>
>>>>> It would be nice if we could all wrestle this topic to a workable 
>>>>> compromise.  I agree that the problem should be "fixed at the 
>>>>> source".  But characterizing the source of the problem as "sloppy 
>>>>> data management" seems off the mark.  This data management 
>>>>> problem didn't exist until we created the potential for easy 
>>>>> corruption by defining easily corrupted, redundant information in 
>>>>> the CF datasets.
>>>>>
>>>>> The duplication of information between the global attributes and 
>>>>> the CF coordinates is an 'attractive nuisance'.   No matter how 
>>>>> much we exhort data managers to clean up their act, the problem is 
>>>>> going to continue showing up over and over and over. Some of the 
>>>>> data management tools that stumble into this problem are not even 
>>>>> CF-aware;  generic netCDF utilities like nco create this form of 
>>>>> corruption.   Offhand I can think of 6 independent pieces of 
>>>>> software that perform aggregations on CF datasets. Five of these 
>>>>> predate the use of the ACDD geo-bounds attributes within CF 
>>>>> files;  all of these exhibit this corruption.  Ed's example is yet 
>>>>> another.
>>>>>
>>>>> Our underlying challenge is to expose the CF coordinate bounds for 
>>>>> easy use wherever that information is needed for purposes of data 
>>>>> discovery.  The ncISO tool contributed by your group has 
>>>>> addressed this very successfully for formal ISO metadata.  (A big 
>>>>> thanks to you, Dave N. et al.)  There is similar excellent 
>>>>> potential to address the more informal cases through software.  
>>>>> You mentioned the limitations of "ncdump -h" as an illustration.  
>>>>> How about code contributed to Unidata to create "ncdump -bounds"?  
>>>>> This would be a smaller effort with a more robust outcome than 
>>>>> asking all current and future developers of CF aggregation 
>>>>> techniques to accommodate the redundant, easily corrupted 
>>>>> attributes in their datasets.
>>>>>
>>>>>     - Steve
>>>>>
>>>>> ===========================================
>>>>>
>>>>> On 3/14/2014 1:14 PM, Ted Habermann wrote:
>>>>>> All,
>>>>>>
>>>>>> I agree with Ed and John: this is a software tool problem that 
>>>>>> should be fixed at the source.  The description of the history 
>>>>>> attribute has always implied that it should be updated when a 
>>>>>> file is processed (even though, IMHO, it is almost entirely 
>>>>>> unsuited for doing that).  The same is true for many others 
>>>>>> (listed by John G. earlier in this thread).  The current practice 
>>>>>> is sloppy data management that, from the sound of this thread, is 
>>>>>> pervasive in the community.  Of course, ncISO provides a very easy 
>>>>>> way to identify occurrences of this problem throughout a THREDDS 
>>>>>> catalog.  The "Catalog Cleaner" is another venue for quantifying 
>>>>>> the damage.  CF is a community standard.  Maybe it is time for the 
>>>>>> community to recommend providing correct metadata with the files 
>>>>>> and to avoid developers and datasets that don't.
>>>>>>
>>>>>> A related problem is that the bounds calculated from the data are 
>>>>>> only available if you read the data. Many users may not be 
>>>>>> equipped to easily read the data during a data discovery process. 
>>>>>> They may not want to go beyond ncdump -x -h (or something like 
>>>>>> that) before they fire up the whole netCDF machine...
>>>>>>
>>>>>> BTW, this problem is trivial relative to that associated with 
>>>>>> virtual datasets created through aggregation. In those cases, 
>>>>>> there is no clear mechanism for providing meaningful metadata, 
>>>>>> although the rich inventory we created several years ago comes 
>>>>>> close... That situation is much more prone to mistakes as all 
>>>>>> semblance of the historic record is wiped out.
>>>>>>
>>>>>> It's Friday, and spring...  As Dave said last week, a good time 
>>>>>> for a rant!
>>>>>> Ted
>>>>>>
>>>>>>
>>>>>> On Mar 14, 2014, at 1:44 PM, Steve Hankin <steven.c.hankin at noaa.gov> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm joining this discussion from the wings.  The topic here 
>>>>>>> -- the common tendency for the ACDD geo-spatio-temporal bounds 
>>>>>>> attributes to get corrupted -- has been beaten around a number 
>>>>>>> of times among different groups.  At this point it isn't clear 
>>>>>>> that there is a "clean" resolution to the problem; there are 
>>>>>>> already so many files out there that contain these attributes 
>>>>>>> that there may be no easy way to unwind the problem.  Might the 
>>>>>>> best path forward be to see about adding some words of caution 
>>>>>>> to the documents that suggest the use of these attributes?  
>>>>>>> Something along these lines:
>>>>>>>
>>>>>>>     Caution: The encoding of geo-spatial bounds values as
>>>>>>>     global attributes is a practice that should be used with
>>>>>>>     caution or avoided.
>>>>>>>
>>>>>>>     The encoding of geo-spatial bounds values as global
>>>>>>>     attributes introduces a high likelihood of corruption,
>>>>>>>     because the attribute values duplicate information already
>>>>>>>     contained in the self-describing coordinates of the
>>>>>>>     dataset.  A number of data management operations that are
>>>>>>>     common with netCDF files will invalidate the values stored
>>>>>>>     as global attributes.  Such operations include extending the
>>>>>>>     coordinate range of a netCDF file along its record axis;
>>>>>>>     aggregating a collection of netCDF files into a larger
>>>>>>>     dataset (for example, aggregating model outputs along their
>>>>>>>     time axes); or appending files using file-based utilities
>>>>>>>     (e.g. nco).
>>>>>>>
>>>>>>>     It is recommended that 1) the use of these global
>>>>>>>     attributes be restricted to files whose contents are known
>>>>>>>     to be completely stable -- i.e. files very unlikely to be
>>>>>>>     aggregated into larger collections; and 2) as a matter of
>>>>>>>     best practice, software reading CF files should ignore these
>>>>>>>     global attributes; instead it should compute the geo-spatial
>>>>>>>     bounds by scanning the coordinate ranges found within the CF
>>>>>>>     dataset itself.
>>>>>>>
>>>>>>> Comments?
>>>>>>>
>>>>>>>     - Steve
