[Esip-documentation] Let's get rid of spatial and temporal bounds in ACDD

David Neufeld - NOAA Affiliate david.neufeld at noaa.gov
Thu Mar 20 11:17:13 EDT 2014


Hi Ed,

An additional software recommendation that we shouldn't lose track of is
the need for server technologies (Hyrax/THREDDS) to implement a
caching/refresh strategy for geospatial and temporal bounds.  This change
would greatly improve performance when crawling servers for metadata and
allow clients to make more intelligent decisions about when to crawl.
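
For illustration, a minimal sketch of the client-side payoff (in Python; the
dataset ID, timestamp, and cache file name are all hypothetical): a crawler
keeps a small local cache of harvested bounds and only re-reads a dataset's
coordinate variables when the server reports a newer modification time.

    import json
    from pathlib import Path

    CACHE_FILE = Path("bounds_cache.json")  # hypothetical local cache of harvested bounds

    def load_cache():
        return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

    def needs_recrawl(dataset_id, last_modified, cache):
        """True if we have never harvested this dataset, or the server reports a
        newer modification time than the one we cached (ISO 8601 strings compare
        correctly as text)."""
        entry = cache.get(dataset_id)
        return entry is None or entry["last_modified"] < last_modified

    def record_bounds(dataset_id, last_modified, bounds, cache):
        cache[dataset_id] = {"last_modified": last_modified, "bounds": bounds}
        CACHE_FILE.write_text(json.dumps(cache, indent=2))

    cache = load_cache()
    if needs_recrawl("sst/granule_20140320.nc", "2014-03-20T10:00:00Z", cache):
        bounds = {"lat": [-90.0, 90.0], "lon": [-180.0, 180.0]}  # placeholder values
        record_bounds("sst/granule_20140320.nc", "2014-03-20T10:00:00Z", bounds, cache)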

Dave


On Thu, Mar 20, 2014 at 9:06 AM, Armstrong, Edward M (398M) <
Edward.M.Armstrong at jpl.nasa.gov> wrote:

>  Hi Ken and all.
>
>  Good points.  I think we are near a consensus on recommending that
> software developers update and modify the appropriate global attributes.
>
>  Good point on the netCDF-3 compression: you need to uncompress the
> entire file to read anything in it.  Even with netCDF-4 you still need to
> uncompress the coordinate variables, which can be large (real*4 or
> similar).  That takes a very small amount of time compared to an entire
> externally compressed netCDF-3 file, but over many, many files this time
> is not insignificant either.
>
>  On Mar 20, 2014, at 4:17 AM, Kenneth S. Casey - NOAA Federal <
> kenneth.casey at noaa.gov> wrote:
>
>  Hi All -
>
>  Perhaps this group could lay out a simple "proposal" of sorts... that
> could be discussed and refined in this thread, and agreed to at ESIP Rocky
> Mountain High this summer if not sooner.  Perhaps that proposal would look
> something like:
>
>  "Dear Software Providers:  Please do the right thing with global
> attributes, and properly update spatial and temporal bounding attributes
> when you modify a netCDF file and either re-write or create a new one.
>  While you are at it, add some info to the history attribute too like you
> are supposed to.  In the meantime, dear community, be wary of global
> attributes that relate to coordinate variables... trust the coordinate
> variables and if you notice a discrepancy with their corresponding global
> attributes SCREAM VERY LOUDLY at the provider of the software which
> generated that netCDF file."
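>
>  As a concrete illustration of what "doing the right thing" could look
> like, here is a minimal sketch using the netCDF4-python library (the file
> name, tool name, and version are hypothetical): a subsetting tool
> recomputes the ACDD bounds from the coordinate variables it actually
> wrote, and appends a line to history.
>
>     from datetime import datetime, timezone
>     import netCDF4
>
>     with netCDF4.Dataset("subset.nc", "a") as nc:  # hypothetical subsetter output
>         lat, lon, time = nc["lat"][:], nc["lon"][:], nc["time"][:]
>         # Recompute ACDD bounds from the coordinates actually in the file.
>         nc.geospatial_lat_min, nc.geospatial_lat_max = float(lat.min()), float(lat.max())
>         nc.geospatial_lon_min, nc.geospatial_lon_max = float(lon.min()), float(lon.max())
>         units = nc["time"].units
>         nc.time_coverage_start = str(netCDF4.num2date(time.min(), units))
>         nc.time_coverage_end = str(netCDF4.num2date(time.max(), units))
>         # And add some info to the history attribute, like you are supposed to.
>         stamp = datetime.now(timezone.utc).isoformat()
>         nc.history = getattr(nc, "history", "") + f"\n{stamp}: subset by exampletool v1.0"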
>
>  Specific actions could then be requested of the big players to make the
> appropriate updates to their code.
>
>  I think we need global attributes in general, even ones relating to
> coordinate variables.  Everything said here about coordinate attributes
> actually applies more generally...  many, many of the global attributes can
> and should be updated depending on the provenance of the file and who did
> what to it.  The only difference is that the attributes relating to
> coordinate variables can actually be tested against the data.
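>
>  For what it is worth, that test is cheap to write.  A minimal sketch
> with netCDF4-python (the attribute names are ACDD; the file name and the
> tolerance are arbitrary choices):
>
>     import netCDF4
>
>     def check_lat_congruence(path, tol=1e-5):
>         """Compare the ACDD latitude bounds attributes against the lat coordinate."""
>         with netCDF4.Dataset(path) as nc:
>             lat = nc["lat"][:]
>             problems = []
>             for att, actual in (("geospatial_lat_min", float(lat.min())),
>                                 ("geospatial_lat_max", float(lat.max()))):
>                 claimed = getattr(nc, att, None)
>                 if claimed is not None and abs(float(claimed) - actual) > tol:
>                     problems.append(f"{att}={claimed} but the data say {actual}")
>             return problems
>
>     # Any non-empty result is a reason to scream very loudly at the provider.
>     print(check_lat_congruence("granule.nc"))  # hypothetical file name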
>
>  I'd add one other point... while computationally doing a max/min on the
> coordinate variables is not too terrible, much of the time (esp. with
> netCDF-3) you have to decompress the entire file first, and that is
> computationally terrible for large numbers of large files that are
> externally compressed (like we have with GHRSST, for example... loving that
> GHRSST Data Specification v2 now uses netCDF-4 with internal compression!).
>
>  Ken
>
>
>
>   On Mar 19, 2014, at 11:12 PM, "Signell, Richard" <rsignell at usgs.gov>
> wrote:
>
> Gang,
> I understand the importance of having the bounds information in metadata
> -- in fact we start our workflows by querying catalog services that use
> the bounding box information contained in the ISO metadata.  But this
> ISO metadata was calculated by ncISO by reading the CF coordinate
> variables via OPeNDAP, and the metadata points to the OPeNDAP service
> endpoint, so I know that the bounds data is correct.
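>
> For anyone unfamiliar with that pattern, a minimal sketch (the OPeNDAP
> URL is hypothetical, and netCDF4-python must be built with DAP support):
> only the coordinate variables cross the wire, not the full data arrays.
>
>     import netCDF4
>
>     url = "http://example.org/thredds/dodsC/some/dataset.nc"  # hypothetical endpoint
>     with netCDF4.Dataset(url) as nc:
>         lat, lon = nc["lat"][:], nc["lon"][:]
>         print("lat bounds:", float(lat.min()), float(lat.max()))
>         print("lon bounds:", float(lon.min()), float(lon.max()))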
>
> It would seem that NASA, OCEANSITES, and others could use this
> approach as well, which would yield the same functionality as reading
> metadata from the actual dataset, but without the drawbacks.
>
> Having read all the arguments so far,  I'm going to continue
> recommending that people not write these bounds attributes into their
> datasets, because I remain convinced they do our community more harm
> than good.  But I'll explain to them the arguments for and against.
>
> -Rich
>
> On Wed, Mar 19, 2014 at 5:43 PM, Armstrong, Edward M (398M)
> <Edward.M.Armstrong at jpl.nasa.gov> wrote:
>
> Hello,
>
> Just to continue this thread with the way a popular tool works... I
> checked the output of LAS and it does not update any attributes; it just
> inherits them from what it natively subsetted.  It includes this global
> attribute:
>
> :FERRET_comment = "File written via LAS. Attributes are inherited from
> originating dataset";
>
>
>
> On Mar 18, 2014, at 10:42 AM, Ted Habermann <thabermann at hdfgroup.org>
> wrote:
>
> All,
>
> Just wanted to point out that CF is really *use* metadata, used by tools
> that are actually reading the data. ACDD is a set of discovery conventions
> originally motivated by the lack of discovery information in CF. The
> opinion of the CF mailing list is, therefore, not relevant in this
> discussion.
>
> I agree with Ken and Ed... There are two sets of metadata because they
> serve
> two communities. We need to do our best to make sure they are both correct.
>
> Seems like the netCDF download service in THREDDS is a good place to
> start... Does anyone know how it behaves with global attributes? What about
> LAS?
>
> Ted
>
> On Mar 18, 2014, at 10:52 AM, Armstrong, Edward M (398M)
> <Edward.M.Armstrong at jpl.nasa.gov> wrote:
>
> Hi Steve,
>
> I do agree with Ken and Ted on this issue... also from experience working
> with granules in a large data center.
>
> I think the crux of the counterargument is that, from the perspective of
> the data producer, we want them to add more metadata, not less, to the
> native granule.  And many people do use these global bounds in some
> context, mostly just looking at the granules through a browser like
> ncdump or Panoply.  Space and time bounds are just a natural thing most
> people look at in a granule when they first acquire it.
>
> The key, then, is to recognize when the granule has been altered by a
> tool and to treat the bounds with caution.  Nan had a great idea that
> tools should point back to the original unaltered granule.  I think this
> is the first requirement for any tool that modifies the granule via
> subsetting or similar operations.
>
> And the next step is to encourage any tool or aggregation operation to
> update the global bounds. I do agree that is a challenge.
>
>
> On Mar 18, 2014, at 8:51 AM, Steve Hankin <steven.c.hankin at noaa.gov>
> wrote:
>
> Hi Ken,
>
> I think you are actually agreeing with me, rather than Ted.  The text that
> I
> proposed did not say geo-bounds attributes should be forbidden.  It said
> that they should be used with caution -- only in situations where they are
> unlikely to lead to corruption.
>
> It is recommended that 1) the use of these global attributes be restricted
> to files whose contents are known to be completely stable -- i.e. files
> very
> unlikely to be aggregated into larger collections;  and 2) as a matter of
> best practice, software reading CF files should ignore these global
> attributes; instead it should compute the geo-spatial bounds by scanning
> the
> coordinate ranges found within the CF dataset, itself.
>
> These words can certainly be adjusted and improved.  (That is the
> discussion I am still hoping will happen!)  The intent is to apply
> common-sense, practical thinking so that our interoperability frameworks
> work in practice.
>
> You have argued that at NODC you *need* these geo-bounds attributes in
> order to avoid the impossibly large processing burden of examining the
> data itself.
> What would your strategy be if  the attributes proved not to be
> trustworthy?
> Perhaps within the community of satellite swath folks you can agree that
> you
> will all maintain these attributes faithfully.  Great.  Do so.  That is in
> the spirit of the discussion that we should be having.  But should you
> speak
> for the modeling community, too?  (I presume the NODC granules must be
> swath
> data, because if they were grids, then the processing required to determine
> the bounds from the data is negligible.)
>
> Let's look at the context of this discussion.  What we are seeing here is
> a collision between the priorities of two communities involved in
> developing standards.  Our emails here are largely confined to one of the
> communities.  Envision taking this discussion topic to the CF email list.
> I predict what you would see is that the potential of redundant, static
> global attributes to corrupt datasets would fly up as a giant red flag.
> It is not only aggregation operations that lead to corruption; it is also
> subsetting operations -- the most routine of all netCDF operations, and
> ones performed by generic utilities that are unaware of CF or ACDD.  These
> are glaring, unsolved problems in the use of these geo-bounds.  There is
> such a strong case to be more nuanced when standardizing attributes that
> we can plainly see are going to lead to many corrupted datasets.  What is
> the counter-argument for ignoring these self-evident problems?
>
>    - Steve
>
> ===========================================
>
>
> On 3/18/2014 3:48 AM, Kenneth S. Casey - NOAA Federal wrote:
>
> Hi All -
>
> I've been silent but following this thread carefully.  Time to jump in now.
>
> I concur with Ted's statements below.  I would characterize his responses
> as a good example of combating the "Tyranny of the Or". It doesn't have
> to be one solution or the other.  For us at NODC, processing literally
> tens and maybe hundreds of millions of netCDF granules, having global
> attributes that can be read easily and quickly is not a convenience.  It
> is a practical necessity.  We also like the idea of encouraging
> softwarians to write better software.  Corrupting attributes through
> negligence and inaction is not acceptable.  Fix it.  If they can't, we
> can encourage our users to stop using that software.
>
> I love the idea of building the congruence checker into the ACDD rubric
> and catalog cleaner, and I think an "ncdump -bounds" option, where the
> result is calculated from the actual coordinate values, is terrific too.
> These kinds of additions to existing tools would help encourage better
> practices and would give us some simple tools to improve our management
> of netCDF data.
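>
> No such ncdump flag exists today, but a stand-in is only a few lines of
> Python (the coordinate variable names are assumed to be lat/lon/time; a
> real tool would locate them via the CF conventions):
>
>     import sys
>     import netCDF4
>
>     def print_bounds(path, coords=("lat", "lon", "time")):
>         """Print bounds computed from the coordinate variables themselves,
>         ignoring whatever the global attributes claim."""
>         with netCDF4.Dataset(path) as nc:
>             for name in coords:
>                 if name in nc.variables:
>                     data = nc[name][:]
>                     print(f"{name}: {float(data.min())} .. {float(data.max())}")
>
>     if __name__ == "__main__":
>         print_bounds(sys.argv[1])  # usage: python bounds.py granule.nc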
>
> Ken
>
>
>
> On Mar 17, 2014, at 4:35 PM, Ted Habermann <thabermann at hdfgroup.org>
> wrote:
>
> All,
>
> This discussion is driving me bananas!
>
> I would argue (vehemently) against any recommendation that says anything
> like "don't add bounds elements to your global attributes". Encouraging
> people to provide less metadata is just not acceptable. It would be much
> more palatable to
> 1) offer a general warning to users - all data and metadata have varying
> quality, so test before you leap and know how the data you have was
> processed;
> 2) start a concerted effort to encourage software developers to think
> about the metadata in the files they create; and
> 3) add a congruence check into the ACDD rubric / catalog cleaner - it
> would be very interesting to know how many datasets in the "clean"
> catalog suffer from inconsistencies between the data and the global
> attributes.
>
> As I said earlier, there are much more significant problems in ACDD/CF
> metadata land than this one. Hopefully they will generate the same amount
> of
> interest as we move forward.
>
> Ted
>
>
> On Mar 17, 2014, at 1:22 PM, Steve Hankin <steven.c.hankin at noaa.gov>
> wrote:
>
> Hi John, Rich,
>
> The 'bananas' analogy fits right into the discussion below, so I'm not
> commenting on it specifically.  I like it as a good metaphor for the
> question we are debating.
>
> On 3/17/2014 10:51 AM, John Graybeal wrote:
>
> Steve,
>
> Let me see if you take issue with my simplified version of what you said:
>
> "NetCDF attributes should never be used to describe value-based or
> processing-specific features of any data set, because data processors
> (tools
> and people) can't help but corrupt derivative data sets with that
> information. "
>
>
> I like this line of reasoning for understanding the issues, John.  But the
> above is not a sufficiently nuanced statement to capture what I have
> advocated.  Try this:
>
> "Broad standardization of NetCDF attributes that contain value-based or
> processing-derived information about a data set, should be avoided -- used
> only if there is no reasonable alternative.  Use of such attributes breaks
> the 'backwards compatibility' goals that evolving software standards should
> follow.  It 'breaks' existing systems and places a burden upon future
> systems that may modify or extend the contents of that dataset."
>
> Standardization of such attributes for special communities that run custom
> systems presents no serious problems.  CF allows itself to be arbitrarily
> extended for such special purposes.  Each community will look after its own
> software quality.  But when you undertake a broad standardization of such
> attributes, you are placing a burden on software systems over which you
> have
> no knowledge and no control.
>
>
> Don't get me wrong; I think this presents a consistent philosophy in
> response to today's practical realities. As of today, with 5 out of 6 tools
> failing the 'metadata consistency' test, it likely minimizes the ratio of
> bad metadata in the wild.
>
> thank you for this.
>
>
> What it does *not* do is establish a mature, flexible, interoperable,
> metadata-aware community of practice going forward. It also does not fix
> the problem caused by these tools; it just eliminates the
> most-likely-to-fail attributes (from any standard, forevermore). And it
> directly undercuts the greatest value of a self-describing data format:
> the easy-to-access description.
>
> As Ted pointed out, the issue is much broader and deeper than this one
> particular use case.  Dynamically changing datasets (including the 'virtual
> datasets' created through aggregation) create a class of problems that are
> in fundamental conflict with static metadata representations.  Do we agree
> that virtual datasets are not going to go away?  The class of datasets
> that CF is most centrally committed to are often extremely large and evolve
> over time (both the 'C' and the 'F' in CF).
>
> The fact that much software pre-dates the standards is a red herring.
>
>
> Really?  We watch standards self-destruct time after time because they fail
> to address pragmatic considerations of this very type.   Backwards
> compatibility deserves to be a paramount consideration.  Do we really
> disagree on this?
>
> Software that takes any self-describing file, modifies the contents, yet
> passes the original description through _without validation_ to the output
> file will almost always produce a broken output file. This isn't a feature,
> nor is it inevitable; it's a bug. Our underlying challenge is to fix that
> problem.
>
>
> And in the meantime, we still have to maintain ALL the metadata affected by
> that problem, not just the CF coordinate bounds. So we should fix all that
> software soon. (As I can think of 3 trivial fixes off the top of my head,
> I'm not sure why previous beat-arounds didn't induce change. It's time. And
> it's a lot easier than modifying all the software to include statements in
> the files about how unreliable the metadata is.)
>
> I agree with you on this ... at least in principle:  Those responsible for
> the software systems should make a good faith effort to upgrade them in
> response to evolving standards.  But they may simply lack the resources.
>  Or
> they may be unable to fit the fixes into hard-pressed schedules for some
> time to come.  In the meantime we have corrupted metadata.  We are living
> in
> lean times for our community.  Considerations like this have to be a
> two-way
> street:  those responsible for evolving software standards need to minimize
> so-called 'enhancements' that break existing software.  They need to look
> for alternatives first.
>
>    - Steve
>
> John
>
> On Mar 17, 2014, at 09:23, Steve Hankin <steven.c.hankin at noaa.gov> wrote:
>
> Greetings Ted!
>
> As rants go, the message you sent on Friday was pretty restrained.   I
> particularly like your suggestion that the UAF Catalog Cleaner could detect
> and report corrupted ACDD geo-positioning attributes.  We will see what we
> can do with that idea.  (aside:  Is there a way to ask ncISO when it has
> found this form of corruption in a dataset?)
>
> It would be nice if we could all wrestle this topic to a workable
> compromise.  I agree that the problem should be "fixed at the source".
>   But
> characterizing the source of the problem as "sloppy data management" seems
> off the mark.   This data management problem didn't exist until we created
> the potential for easy corruption by defining easily-corrupted, redundant
> information in the CF datasets.
>
> The duplication of information between the global attributes and the CF
> coordinates is an 'attractive nuisance'.   No matter how much we exhort
> data
> managers to clean up their act, the problem is going to continue showing up
> over and over and over.   Some of the data management tools that stumble
> into this problem are not even CF-aware;  generic netCDF utilities like nco
> create this form of corruption.   Offhand I can think of 6 independent
> pieces of software that perform aggregations on CF datasets.  Five of these
> predate the use of the ACDD geo-bounds attributes within CF files;  all of
> these exhibit this corruption.  Ed's example is yet another.
>
> Our underlying challenge is to expose the CF coordinate bounds for easy use
> wherever that information is needed for purposes of data discovery.   The
> ncISO tool contributed by your group has addressed this very successfully
> for formal ISO metadata.  (A big thanks to you, Dave N. et al.)  There is
> similar excellent potential to address the more informal cases through
> software.  You mentioned the limitations of "ncdump -h" as an
> illustration.  How about code contributed to Unidata to create
> "ncdump -bounds"?  This would be a smaller effort with a more robust
> outcome than asking all current and future developers of CF aggregation
> techniques to accommodate the redundant, easily corrupted attributes in
> their datasets.
>
>    - Steve
>
> ===========================================
>
> On 3/14/2014 1:14 PM, Ted Habermann wrote:
>
> All,
>
> I agree with Ed and John: this is a software tool problem that should be
> fixed at the source. The description of the history attribute has always
> implied that it should be updated when a file is processed (even though,
> IMHO, it is almost entirely unsuited for doing that). The same is true for
> many others (listed by John G. earlier in this thread). The current
> practice
> is sloppy data management that, from the sound of this thread, is pervasive
> in the community. Of course, ncISO provides a very easy way to identify
> occurrences of this problem throughout a THREDDS catalog. The "Catalog
> Cleaner" is another venue for quantifying the damage. CF is a community
> standard. Maybe it is time for the community to recommend providing correct
> metadata with the files and to avoid developers and datasets that don't.
>
> A related problem is that the bounds calculated from the data are only
> available if you read the data. Many users may not be equipped to easily
> read the data during a data discovery process. They may not want to go
> beyond ncdump -x -h (or something like that) before they fire up the whole
> netCDF machine...
>
> BTW, this problem is trivial relative to that associated with virtual
> datasets created through aggregation. In those cases, there is no clear
> mechanism for providing meaningful metadata, although the rich inventory we
> created several years ago comes close... That situation is much more prone
> to mistakes as all semblance of the historic record is wiped out.
>
> It's Friday, and spring... As Dave said last week, a good time for a rant!
> Ted
>
>
> On Mar 14, 2014, at 1:44 PM, Steve Hankin <steven.c.hankin at noaa.gov>
> wrote:
>
> Hi All,
>
> I'm joining this discussion from the wings.  The topic here -- the
> common tendency for the ACDD geo-spatio-temporal bounds attributes to get
> corrupted -- has been beaten around a number of times among different
> groups.  At this point it isn't clear that there is a "clean" resolution to
> the problem; there are already so many files out there that contain these
> attributes that there may be no easy way to unwind the problem.  Might the
> best path forward be to see about adding some words of caution to the
> documents that suggest the use of these attributes?  Something along these
> lines:
>
> Caution:   The encoding of geo-spatial bounds values as global attributes
> is
> a practice that should be used with caution or avoided.
>
> The encoding of geo-spatial bounds values as global attributes introduces a
> high likelihood of corruption, because the attribute values duplicate
> information already contained in the self-describing coordinates of the
> dataset.  A number of data management operations that are common with
> netCDF files will invalidate the values stored as global attributes.  Such
> operations include extending the coordinate range of a netCDF file along
> its record axis; aggregating a collection of netCDF files into a larger
> dataset (for example, aggregating model outputs along their time axes); or
> appending files using file-based utilities (e.g. nco).
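>
> To make the failure mode concrete, a minimal sketch with netCDF4-python
> (the file name, attribute, and values are hypothetical): appending records
> along the unlimited time dimension extends the coordinate range, but
> nothing touches the global attribute, which is now silently stale.
>
>     import netCDF4
>
>     with netCDF4.Dataset("aggregate.nc", "a") as nc:  # file with an unlimited 'time' dim
>         time = nc["time"]
>         n = time.shape[0]
>         # Append one more day of hourly records (a real tool would also
>         # extend the data variables).
>         time[n:n + 24] = [time[n - 1] + (i + 1) * 3600.0 for i in range(24)]
>         # The coordinate range now extends a day further, but the global
>         # attribute still reports the old range unless it is rewritten.
>         print("time_coverage_end says:", getattr(nc, "time_coverage_end", "<missing>"))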
>
> It is recommended that 1) the use of these global attributes be restricted
> to files whose contents are known to be completely stable -- i.e. files
> very
> unlikely to be aggregated into larger collections;  and 2) as a matter of
> best practice, software reading CF files should ignore these global
> attributes; instead it should compute the geo-spatial bounds by scanning
> the
> coordinate ranges found within the CF dataset, itself.
>
> Comments?
>
>    - Steve
>
>
>
>
>
> -ed
>
> Ed Armstrong
> JPL Physical Oceanography DAAC
> 818 519-7607
>
>
>
>
>
>
>
>
> --
> Dr. Richard P. Signell   (508) 457-2229
> USGS, 384 Woods Hole Rd.
> Woods Hole, MA 02543-1598
>
>
>   [NOTE: The opinions expressed in this email are those of the author
> alone and do not necessarily reflect official NOAA, Department of Commerce,
> or US government policy.]
>
>  Kenneth S. Casey, Ph.D.
> Technical Director
> NOAA National Oceanographic Data Center
> 1315 East-West Highway
> Silver Spring MD 20910
> 301-713-3272 x133
> http://www.nodc.noaa.gov
>
>
>
>  -ed
>
>  Ed Armstrong
> JPL Physical Oceanography DAAC
> 818 519-7607
>
>
>
>
> _______________________________________________
> Esip-documentation mailing list
> Esip-documentation at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-documentation
>
>


More information about the Esip-documentation mailing list