[Esip-preserve] [ESIP-all] Please review Draft ESIP Data Citation Guidelines

Fri Aug 19 13:07:45 EDT 2011

On 18 Aug 2011, at 9:54 AM, Russ Rew wrote:

> I have a (perhaps) minor contribution I'd like to work on, one of these
> days when I get time.  It's about the need for having canonical versions
> of cited datasets, so that an SHA1 digest (for example) of the canonical
> version of a dataset identifies it uniquely, globally, and for all time.
> 
> If such dataset signatures were practical, it would be relatively easy
> to determine whether two copies of a dataset represented the same data,
> even if there were gratuitous differences that were unimportant in terms
> of use of the data in analysis, visualization, or scientific papers.
> 
> Some obvious obstacles to the practicality of this idea are:
> 
>  - tool support to read a dataset in a particular format and output a
>    canonical copy, usable for creating digest signatures
> 
>  - dealing with small differences in the last few bits of
>    floating-point representations in a way that has scientific
>    integrity
> 
>  - clarifying exactly what is meant by "gratuitous" differences in
>    derived data products that should theoretically be the same but
>    differ due to order of processing steps or non-deterministic parallel
>    computations
> 
> I've written an nccopy tool that makes it possible to experiment with an
> option to canonicalize data that would specify canonical orders for
> objects in the dataset, canonical chunking and compression options, and
> canonical ways to handle other differences that are transparent to
> readers of the data.  I hope to finish this in the next year and see if
> the idea is really of any practical use.  You may have already
> considered this and decided it's not feasible, and I may end up
> concluding the same thing, but it's on my ToDo list ...
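
As a rough illustration of the canonical-digest idea in the quoted message, here is a minimal Python sketch. It is not the nccopy option described above. The dataset is a hypothetical stand-in (a dict mapping variable names to lists of floats), and the function name `canonical_digest` and the `decimals` parameter are assumptions made for the example. It fixes a canonical variable order and rounds values before hashing, so that reordering or last-bit floating-point noise does not change the SHA1 digest.

```python
import hashlib
import struct


def canonical_digest(dataset, decimals=6):
    """Return a SHA-1 hex digest of a canonical serialization of `dataset`.

    `dataset` is a hypothetical in-memory stand-in for a real file:
    a dict mapping variable names to lists of floats.
    """
    h = hashlib.sha1()
    for name in sorted(dataset):  # canonical order of variables
        values = dataset[name]
        encoded = name.encode("utf-8")
        # Length prefixes keep the byte stream unambiguous.
        h.update(struct.pack(">Q", len(encoded)))
        h.update(encoded)
        h.update(struct.pack(">Q", len(values)))
        for v in values:
            # Round away last-bit noise; adding 0.0 turns -0.0 into 0.0.
            q = round(v, decimals) + 0.0
            # Fixed big-endian IEEE 754 doubles, independent of platform.
            h.update(struct.pack(">d", q))
    return h.hexdigest()


a = {"temp": [1.0, 2.5], "pres": [1013.25]}
b = {"pres": [1013.25 + 1e-12], "temp": [1.0, 2.5000000001]}
print(canonical_digest(a) == canonical_digest(b))
```

Rounding is only a stand-in for the floating-point obstacle listed above: two values that straddle a rounding boundary still produce different digests, so this does not settle what counts as a "gratuitous" difference.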

Ah, Russ, you hit on an issue the cluster went round and round on. The idea of canonical versions, and of determining which data sets are scientifically equivalent, proves to be quite challenging. Our guidelines do not really address this area, and it does indeed demand further research. The Duerr et al. paper referenced in the Guidelines discusses it a bit, and Bruce Barkstrom and Curt Tilmes have some ideas on the topic as well.

Cheers,

-m.