[Esip-documentation] Let's get rid of spatial and temporal bounds in ACDD

Signell, Richard rsignell at usgs.gov
Mon Mar 17 15:13:17 EDT 2014


You made my point exactly also:  there is nothing to prevent you from
removing a few bananas after you've weighed them and put the label on
(although the store would be happy -- you would just screw yourself).

You want to replace a functioning system with a broken system that may
work more efficiently in the long run (but no guarantees!)    I don't
think I'd want to take that proposal to the grocery stores...

-R

On Mon, Mar 17, 2014 at 3:00 PM, John Graybeal
<jbgraybeal at mindspring.com> wrote:
> Great analogy, it makes my points exactly: (a) after taking the two bananas,
> you're putting/leaving a sticker on the bananas that says they weigh more
> than they do; (b) weighing has a cost. For those dealing with large volumes
> of data, weighing has a really large cost; and (c) for those who can read
> the sticker but don't have a scale, weighing is not possible.
>
> (In Europe the bananas are weighed and priced at the fruit section, by the
> buyer. They wouldn't be pleased if you weighed them and printed the sticker,
> then removed the 2 bananas.)
>
> John
>
> On Mar 17, 2014, at 11:49, "Signell, Richard" <rsignell at usgs.gov> wrote:
>
> Gang,
>
> Since this is a cluster discussion, how about a bananas analogy.  ;-)
>
> When you buy a bunch of bananas, the checker weighs them and you pay the
> correct amount.
> If you take two bananas off the bunch, the checker weighs them and you pay
> the correct amount.
>
> ACDD wants to put a weight label on the bunch of bananas, hopefully by
> weighing them (but no guarantees!), and you *should* be good if you buy the
> whole amount (but no guarantees).    And if someone takes a few bananas, the
> labeled weight will be wrong.   So the system is broken!   Make sure all
> people correctly weigh the bananas they take and put on the right labels!!!!
>
> But why not just weigh the bananas at checkout?  It works. It's not broken.
>
> I'm not arguing that software shouldn't be upgraded to track metadata changes.
> It's just that in this case the information we are providing is already
> contained in the file.  I can see that it's useful in situations where you
> know the bounds are right because your software wrote them.  But I believe
> that in general, in the distributed geoscience infrastructure we are building,
> where we encourage people to subset all the time with a plethora of tools, the
> metadata bounds are wrong so often that it runs the risk of making people
> say "see, this metadata stuff *is* a waste of time, just like I thought it
> was."
>
> -Rich
>
>
>
>
> On Mon, Mar 17, 2014 at 2:09 PM, Ted Habermann <thabermann at hdfgroup.org>
> wrote:
>>
>> Steve et al.,
>>
>> The NCML that is generated by ncISO includes global attributes from the
>> file outside of any group and the CF bounds calculated from the data in the
>> CFMetadata group. Comparing these is straightforward... Would be interesting
>> (and probably helpful) to embed this comparison in the transform that makes
>> the rubric. I think we discussed this idea a while ago at NGDC, but it was not
>> implemented...
>>
>> See
>> http://www.ngdc.noaa.gov/thredds/ncml/relief/ETOPO1/thredds/ETOPO1_Ice_g_gmt4.nc?catalog=http%3A%2F%2Fwww.ngdc.noaa.gov%2Fthredds%2FbathyCatalog.html&dataset=etopo1Ice
>> for an example...
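The comparison Ted describes — declared global attributes versus bounds computed from the coordinate data — amounts to something like the following. This is a hypothetical sketch, not ncISO's actual logic; the attribute names follow ACDD, but the tolerance and the mocked inputs are assumptions.

```python
def bounds_mismatches(declared, computed, tol=1e-6):
    """Return the names of bounds attributes whose declared (global-attribute)
    value disagrees with the value computed from the coordinate data."""
    return sorted(k for k in computed
                  if k in declared and abs(declared[k] - computed[k]) > tol)

# Global attributes copied through from the parent file...
declared = {"geospatial_lat_min": -90.0, "geospatial_lat_max": 90.0}

# ...versus bounds actually computed from the coordinates
# (e.g. ncISO's CFMetadata group) after a subsetting operation.
computed = {"geospatial_lat_min": 10.0, "geospatial_lat_max": 30.0}

print(bounds_mismatches(declared, computed))
# ['geospatial_lat_max', 'geospatial_lat_min']
```

Embedding a check like this in the rubric transform would flag the corrupted datasets automatically instead of leaving them for users to discover.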
>>
>> Ted
>>
>> BTW, it looks like my clean catalog link has gone stale as well as the one
>> on https://www.nosc.noaa.gov/EDMC/swg.php... Is the clean catalog still
>> available...?
>>
>> On Mar 17, 2014, at 11:51 AM, John Graybeal <jbgraybeal at mindspring.com>
>> wrote:
>>
>> Steve,
>>
>> Let me see if you take issue with my simplified version of what you said:
>>
>> "NetCDF attributes should never be used to describe value-based or
>> processing-specific features of any data set, because data processors (tools
>> and people) can't help but corrupt derivative data sets with that
>> information. "
>>
>> Don't get me wrong; I think this presents a consistent philosophy in
>> response to today's practical realities. As of today, with 5 out of 6 tools
>> failing the 'metadata consistency' test, it likely minimizes the ratio of
>> bad metadata in the wild.
>>
>> What it does *not* do is establish a mature, flexible, interoperable,
>> metadata-aware community of practice going forward. It also does not fix the
>> problem caused by these tools, it just eliminates the most-likely-to-fail
>> attributes (from any standard, forevermore). And it directly undercuts the
>> greatest value of a self-describing data format, the easy-to-access
>> description.
>>
>> The fact that much software pre-dates the standards is a red herring.
>> Software that takes any self-describing file, modifies the contents, yet
>> passes through the original description _without validation_ to the output
>> file, will almost always produce a broken output file. This isn't a feature,
>> nor inevitable; it's a bug. Our underlying challenge is to fix that problem.
>>
>> And in the meantime, we still have to maintain ALL the metadata affected
>> by that problem, not just the CF coordinate bounds. So we should fix all
>> that software soon. (As I can think of 3 trivial fixes off the top of my
>> head, I'm not sure why previous beat-arounds didn't induce change. It's
>> time. And it's a lot easier than modifying all the software to include
>> statements in the files about how unreliable the metadata is.)
>>
>> John
>>
>> On Mar 17, 2014, at 09:23, Steve Hankin <steven.c.hankin at noaa.gov> wrote:
>>
>> Greetings Ted!
>>
>> As rants go, the message you sent on Friday was pretty restrained.   I
>> particularly like your suggestion that the UAF Catalog Cleaner could detect
>> and report corrupted ACDD geo-positioning attributes.  We will see what we
>> can do with that idea.  (aside:  Is there a way to ask ncISO when it has
>> found this form of corruption in a dataset?)
>>
>> It would be nice if we could all wrestle this topic to a workable
>> compromise.  I agree that the problem should be "fixed at the source".   But
>> characterizing the source of the problem as "sloppy data management" seems
>> off the mark.   This data management problem didn't exist until we created
>> the potential for easy corruption by defining easily-corrupted, redundant
>> information in the CF datasets.
>>
>> The duplication of information between the global attributes and the CF
>> coordinates is an 'attractive nuisance'.   No matter how much we exhort data
>> managers to clean up their act, the problem is going to continue showing up
>> over and over and over.   Some of the data management tools that stumble
>> into this problem are not even CF-aware;  generic netCDF utilities like nco
>> create this form of corruption.   Offhand I can think of 6 independent
>> pieces of software that perform aggregations on CF datasets.  Five of these
>> predate the use of the ACDD geo-bounds attributes within CF files;  all of
>> these exhibit this corruption.  Ed's example is yet another.
>>
>> Our underlying challenge is to expose the CF coordinate bounds for easy
>> use wherever that information is needed for purposes of data discovery.
>> The ncISO tool contributed by your group has addressed this very
>> successfully for formal ISO metadata.  (A big thanks to you, Dave N., et
>> al.)  There is similar excellent potential to address the more informal
>> cases through software.  You mentioned the limitations of "ncdump  -h" as an
>> illustration.  How about code contributed to Unidata to create "ncdump
>> -bounds"?    This would be a smaller effort with a more robust outcome than
>> to ask all current and future developers of CF aggregation techniques to
>> make accommodation for the redundant, easily corrupted attributes in their
>> datasets.
>>
>>     - Steve
>>
>> ===========================================
>>
>> On 3/14/2014 1:14 PM, Ted Habermann wrote:
>>
>> All,
>>
>> I agree with Ed and John, this is a software tool problem that should be
>> fixed at the source. The description of the history attribute has always
>> implied that it should be updated when a file is processed (even though,
>> IMHO, it is almost entirely unsuited for doing that). The same is true for
>> many others (listed by John G. earlier in this thread). The current practice
>> is sloppy data management that, from the sound of this thread, is pervasive
>> in the community. Of course, ncISO provides a very easy way to identify
>> occurrences of this problem throughout a THREDDS catalog. The "Catalog
>> Cleaner" is another venue for quantifying the damage. CF is a community
>> standard. Maybe it is time for the community to recommend providing correct
>> metadata with the files and to avoid developers and datasets that don't.
>>
>> A related problem is that the bounds calculated from the data are only
>> available if you read the data. Many users may not be equipped to easily
>> read the data during a data discovery process. They may not want to go
>> beyond ncdump -x -h (or something like that) before they fire up the whole
>> netCDF machine...
>>
>> BTW, this problem is trivial relative to that associated with virtual
>> datasets created through aggregation. In those cases, there is no clear
>> mechanism for providing meaningful metadata, although the rich inventory we
>> created several years ago comes close... That situation is much more prone
>> to mistakes as all semblance of the historic record is wiped out.
>>
>> It's Friday, and spring... As Dave said last week, a good time for a rant!
>> Ted
>>
>>
>> On Mar 14, 2014, at 1:44 PM, Steve Hankin <steven.c.hankin at noaa.gov>
>> wrote:
>>
>> Hi All,
>>
>> I'm joining into this discussion from the wings.  The topic here -- the
>> common tendency for the ACDD geo-spatio-temporal bounds attributes to get
>> corrupted -- has been beaten around a number of times among different
>> groups.  At this point it isn't clear that there is a "clean" resolution to
>> the problem;  there are already so many files out there that contain these
>> attributes that there may be no easy way to unwind the problem.  Might the
>> best path forward be to see about adding some words of caution into the
>> documents that suggest the use of these attributes?  Something along these
>> lines:
>>
>> Caution:   The encoding of geo-spatial bounds values as global attributes
>> is a practice that should be used with caution or avoided.
>>
>> The encoding of geo-spatial bounds values as global attributes introduces
>> a high likelihood of corruption, because the attribute values duplicate
>> information already contained in the self-describing coordinates of the
>> dataset.   A number of data management operations that are common with
>> netCDF files will invalidate the values stored as global attributes.  Such
>> operations include extending the coordinate range of a netCDF file along its
>> record axis;  aggregating a collection of netCDF files into a larger
>> dataset (for example, aggregating model outputs along their time axes); or
>> appending files using file-based utilities (e.g. nco).
>>
>> It is recommended that 1) the use of these global attributes be restricted
>> to files whose contents are known to be completely stable -- i.e. files very
>> unlikely to be aggregated into larger collections;  and 2) as a matter of
>> best practice, software reading CF files should ignore these global
>> attributes; instead it should compute the geo-spatial bounds by scanning the
>> coordinate ranges found within the CF dataset, itself.
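Recommendation (2) — scan the coordinate ranges rather than trust the attributes — might look like this in practice. A minimal sketch with mocked coordinate lists; a real reader would pull the lat/lon/time coordinate variables out of the CF dataset with a netCDF library, and the ACDD attribute names are the only part taken from the standard.

```python
def computed_bounds(lats, lons, times):
    """Bounds derived from the coordinates themselves; unlike copied-through
    global attributes, these are always consistent with the data."""
    return {
        "geospatial_lat_min": min(lats), "geospatial_lat_max": max(lats),
        "geospatial_lon_min": min(lons), "geospatial_lon_max": max(lons),
        "time_coverage_start": min(times), "time_coverage_end": max(times),
    }

# A subsetted granule: the coordinates now cover only a small region,
# no matter what the inherited global attributes claim.
lats, lons = [10.0, 20.0, 30.0], [-75.0, -70.0, -65.0]
times = ["2014-03-01T00:00:00Z", "2014-03-02T00:00:00Z"]

b = computed_bounds(lats, lons, times)
print(b["geospatial_lat_min"], b["geospatial_lat_max"])  # 10.0 30.0
```

This is also, in effect, what a hypothetical "ncdump -bounds" would print: a fresh scan of the coordinates at read time, immune to stale attributes.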
>>
>> Comments?
>>
>>     - Steve
>>
>> ________________________________
>>
>> From: Armstrong, Edward M (398M) <Edward.M.Armstrong at jpl.nasa.gov>
>> Date: Wed, Mar 12, 2014 at 6:03 PM
>> Subject: Re: [Esip-documentation] Let's get rid of spatial and
>> temporal bounds in ACDD
>> To: Nan Galbraith <ngalbraith at whoi.edu>
>> Cc: Cluster Documentation <esip-documentation at lists.esipfed.org>
>>
>>
>> I think that is an excellent idea.
>>
>> The output of the PO.DAAC HiTIDE subsetter that I mentioned in a
>> previous email does exactly that and includes some useful information
>> about the granule and the wrapped subsetting request via OPeNDAP.
>> Below is a snapshot of some of the global attributes from  a subsetted
>> AVHRR SST granule (look at the naiad_ attributes):
>>
>>   :southernmost_latitude = -89.72987f; // float
>>   :northernmost_latitude = 89.80405f; // float
>>   :westernmost_longitude = -179.9997f; // float
>>   :easternmost_longitude = 179.99994f; // float
>>   :file_quality_index = 1S; // short
>>   :comment = "none";
>>   :naiad_download_date = "2014-03-10 21:25:47";
>>   :naiad_granule_url =
>>
>> "http://podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/L2P/AVHRR19_G/NAVO/2013/358/20131224-AVHRR19_G-NAVO-L2P-SST_s0827_e1009-v01.nc.bz2";
>>   :naiad_constraint_expression =
>>
>> "lat[8000:1:9200][104:1:408],lon[8000:1:9200][104:1:408],time[0:1:0],sst_dtime[0:1:0][8000:1:9200][104:1:408],rejection_flag[0:1:0][8000:1:9200][104:1:408],SSES_bias_error[0:1:0][8000:1:9200][104:1:408],aod_dtime_from_sst[0:1:0][8000:1:9200][104:1:408],DT_analysis[0:1:0][8000:1:9200][104:1:408],brightness_temperature_11um[0:1:0][8000:1:9200][104:1:408],aerosol_optical_depth[0:1:0][8000:1:9200][104:1:408],sources_of_aod[0:1:0][8000:1:9200][104:1:408],confidence_flag[0:1:0][8000:1:9200][104:1:408],brightness_temperature_4um[0:1:0][8000:1:9200][104:1:408],SSES_standard_deviation_error[0:1:0][8000:1:9200][104:1:408],sea_surface_temperature[0:1:0][8000:1:9200][104:1:408],brightness_temperature_12um[0:1:0][8000:1:9200][104:1:408],proximity_confidence[0:1:0][8000:1:9200][104:1:408],satellite_zenith_angle[0:1:0][8000:1:9200][104:1:408]";
>> }
>>
>> I did misspeak earlier when I indicated that the spatial bounds were
>> also updated, in this case southernmost_latitude etc. They are the
>> original (global) bounds.  I have asked the developer to update these
>> bounds for every subset request.  Hopefully it will make it into the
>> next version.
>>
>>
>>
>> On Mar 8, 2014, at 3:13 AM, Nan Galbraith <ngalbraith at whoi.edu> wrote:
>>
>> > How's about adding an attribute that contains the URL of the data
>> > file to which the bounds apply? If your aggregator/sub-setter
>> > has misled you by failing to update the bounds attribute, he's also
>> > provided you with the link to the data you actually wanted.
>> >
>> > For programs that collect lots of data and don't molest it, this would
>> > let them continue to use the bounds atts; for programs that slice
>> > and dice, they'd be motivated to either update the fields dynamically
>> > or remove them.
>> >
>> > Cheers - Nan
>> >
>> > Is it really Friday? I'm at sea (yes, people still do that, sometimes)
>> > and I've lost all sense of time and place (no geospatial and temporal
>> > bounds information).
>> >
>> >
>> >
>> >
>> > On 3/7/14 6:47 PM, David Neufeld - NOAA Affiliate wrote:
>> >>
>> >> Ok, since it's Friday and we are in rant mode, I'm going to have a
>> >> little fun with this one...
>> >>
>> >> Recommended Attribute Disclaimer:
>> >>
>> >> Suggested text: "When using geospatial and temporal bounds information
>> >> in your global attributes, please know that it introduces a likely source of
>> >> error and that you are far better off reading these values from the data
>> >> stored in the file.  If you do choose to use the attributes please also
>> >> include a global checksum attribute that humans can look at to decide
>> >> whether the file has changed since you originally recorded these values."
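The checksum idea in Dave's (tongue-in-cheek) disclaimer is easy to sketch: hash the coordinate values when the bounds attributes are written, and recompute the hash at read time to see whether the data changed underneath them. The attribute name `bounds_checksum` is made up for illustration; nothing like it exists in ACDD.

```python
import hashlib
import struct

def coord_checksum(lats, lons):
    """Hash the coordinate values so a reader can tell whether the data
    has changed since the bounds attributes were recorded."""
    h = hashlib.sha256()
    for v in lats + lons:
        h.update(struct.pack("<d", v))  # each coordinate as a little-endian double
    return h.hexdigest()

lats, lons = [10.0, 20.0, 30.0], [-75.0, -70.0, -65.0]

# At write time: record the checksum next to the bounds attributes.
attrs = {"geospatial_lat_min": 10.0,
         "bounds_checksum": coord_checksum(lats, lons)}

# Later, after someone drops a coordinate, the recomputed checksum no
# longer matches, so the bounds attributes should not be trusted.
subset_lats = lats[:2]
print(coord_checksum(subset_lats, lons) == attrs["bounds_checksum"])  # False
```

Of course this only tells a reader that the bounds are stale; it still takes a coordinate scan to get the correct values.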
>> >>
>> >> On Fri, Mar 7, 2014 at 2:02 PM, Nan Galbraith <ngalbraith at whoi.edu
>> >> <mailto:ngalbraith at whoi.edu>> wrote:
>> >>
>> >>
>> >>
>> >>    Maybe we should add some text about which attributes should be
>> >>    considered fragile and under what conditions they need to be
>> >>    recalculated or removed, but I'm not in favor of removing the
>> >>    terms from ACDD.
>> >>
>> >>
>> >
>> >
>> > --
>> > *******************************************************
>> > * Nan Galbraith                        (508) 289-2444 *
>> > * Upper Ocean Processes Group            Mail Stop 29 *
>> > * Woods Hole Oceanographic Institution                *
>> > * Woods Hole, MA 02543                                *
>> > *******************************************************
>> >
>> >
>> > _______________________________________________
>> > Esip-documentation mailing list
>> > Esip-documentation at lists.esipfed.org
>> > http://www.lists.esipfed.org/mailman/listinfo/esip-documentation
>>
>> -ed
>>
>> Ed Armstrong
>> JPL Physical Oceanography DAAC
>> 818 519-7607
>>
>>
>>
>>
>>
>> --
>> Dr. Richard P. Signell   (508) 457-2229
>> USGS, 384 Woods Hole Rd.
>> Woods Hole, MA 02543-1598
>>
>>
>>
>>
>>
>>
>>
>> John Graybeal
>> jbgraybeal at mindspring.com
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
> --
> Dr. Richard P. Signell   (508) 457-2229
> USGS, 384 Woods Hole Rd.
> Woods Hole, MA 02543-1598
>
>
> John Graybeal
> jbgraybeal at mindspring.com
>
>
>



-- 
Dr. Richard P. Signell   (508) 457-2229
USGS, 384 Woods Hole Rd.
Woods Hole, MA 02543-1598

