[Esip-preserve] [ESIP-all] Please review Draft ESIP Data Citation Guidelines

Thu Aug 18 11:54:46 EDT 2011

> Thanks Russ. I think I read this a few years back. I reread it all now
> and I think our proposed guidelines address all the issues Lawrence
> raises as best as possible. His last post on Oct. 20, 2006 was the most
> challenging. The cluster had a lot of discussion on what is a
> "publisher". Ultimately, we kinda side-stepped the issue. The primary
> purpose of citation is to refer unambiguously to the exact data used.
> The "publisher", if indeed one exists, may be valuable in assessing the
> quality of the data, but that is a secondary consideration to the
> primary need for precise reference. Data quality and data peer-review
> are critical considerations in data stewardship, but they are secondary
> issues to data citation. After all, we can cite all forms of literature,
> but it is up to reviewers and readers to assess the quality of those
> citations. It's the same with data.
> 
> Regardless, if you think the guidelines don't address Lawrence's
> concerns please tell me specifically where they are failing. It has
> indeed been a slow process. I remember grappling with this issue more
> than a decade ago. It is encouraging, however, that the issue now has
> greater prominence.

I'm glad to hear that your proposed guidelines address the most
important issues raised by Bryan Lawrence.  Before I opined that not
much progress had been made in the last few years, I should have taken a
more careful look at your proposal.

I have a (perhaps) minor contribution I'd like to work on, one of these
days when I get time.  It's about the need for having canonical versions
of cited datasets, so that an SHA1 digest (for example) of the canonical
version of a dataset identifies it uniquely, globally, and for all time.

If such dataset signatures were practical, it would be relatively easy
to determine whether two copies of a dataset represented the same data,
even if there were gratuitous differences that were unimportant in terms
of use of the data in analysis, visualization, or scientific papers.
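
A minimal sketch of the signature idea, under loud assumptions: the
dataset is modeled as a plain Python dict mapping variable names to lists
of floats, and canonical_digest is a hypothetical helper, not part of
nccopy or any netCDF library. It only shows how a fixed serialization
order and a fixed byte encoding can make the digest independent of the
order in which the variables happen to be stored.

```python
import hashlib
import struct


def canonical_digest(dataset):
    """Return a SHA-1 hex digest over a canonical serialization.

    dataset: dict mapping variable name (str) to a list of floats.
    This in-memory model is an assumption made for illustration only.
    """
    h = hashlib.sha1()
    for name in sorted(dataset):                # canonical variable order
        values = dataset[name]
        encoded = name.encode("utf-8")
        h.update(struct.pack(">I", len(encoded)))   # length-prefix the name
        h.update(encoded)
        h.update(struct.pack(">I", len(values)))    # length-prefix the data
        for v in values:
            h.update(struct.pack(">d", v))      # big-endian IEEE 754 double
    return h.hexdigest()


# Two copies that differ only in the order the variables were written.
copy_a = {"temperature": [280.5, 281.0], "pressure": [1013.25, 1009.0]}
copy_b = {"pressure": [1013.25, 1009.0], "temperature": [280.5, 281.0]}

print(canonical_digest(copy_a) == canonical_digest(copy_b))  # True
```

A real canonical form would also have to cover attributes, dimensions,
and data types, which this sketch leaves out.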

Some obvious obstacles to the practicality of this idea are:

  - tool support to read a dataset in a particular format and output a
    canonical copy, usable for creating digest signatures

  - dealing with small differences in the last few bits of
    floating-point representations in a way that has scientific
    integrity

  - clarifying exactly what is meant by "gratuitous" differences in
    derived data products that should theoretically be the same but
    differ due to order of processing steps or non-deterministic parallel
    computations
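
To make the floating-point obstacle concrete, here is one possible (and
deliberately naive) treatment: zero the lowest few mantissa bits of each
double before it is hashed. The helper name truncate_mantissa and the
default of 8 dropped bits are assumptions chosen for illustration, not a
recommendation.

```python
import struct


def truncate_mantissa(x, drop_bits=8):
    """Zero the lowest drop_bits bits of the mantissa of a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    mask = ~((1 << drop_bits) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack(">d", struct.pack(">Q", bits & mask))
    return y


a = 0.1 + 0.2   # differs from 0.3 in the last bit of the mantissa
b = 0.3

print(a == b)                                        # False
print(truncate_mantissa(a) == truncate_mantissa(b))  # True
```

This also shows why the problem is hard: two values that sit on either
side of a truncation boundary still end up different after masking, so
bit truncation alone cannot guarantee that "nearly equal" data produce
the same digest.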

I've written an nccopy tool that makes it possible to experiment with a
canonicalization option, one that would specify canonical orders for
objects in the dataset, canonical chunking and compression options, and
canonical ways to handle other differences that are transparent to
readers of the data.  I hope to finish this in the next year and see if
the idea is really of any practical use.  You may have already
considered this and decided it's not feasible, and I may end up
concluding the same thing, but it's on my ToDo list ...

Anyway, thanks for the response, and now I have more reason to actually
read your proposed citation guidelines!

--Russ

> Cheers,
> 
> -m.
> On 17 Aug 2011, at 10:16 AM, Russ Rew wrote:
> 
> > Mark,
> >
> > When I wrote:
> >
> >> Bryan Lawrence, who leads the CEDA (the STFC Centre for Environmental
> >> Data Archival, which includes the British Atmospheric Data Centre, the
> >> NERC Earth Observation Data Centre and the UK Solar System Data Centre)
> >> has a long-standing interest in data citation, and wrote an interesting
> >> blog entry about the most important issues:
> >>
> >>  http://home.badc.rl.ac.uk/lawrence/blog/2005/12/22/data_citation
> >>
> >> I've gone back and reread this, and his summary of the issues still seems
> >> relevant, showing we haven't made much progress in the five years since
> >> this was written.
> >
> > I didn't realize that was part one of a three-part blog on data citation
> > at the BADC, all of which are linked from here:
> >
> >  http://home.badc.rl.ac.uk/lawrence/blog/2006/10/20/citation_hosting_and_publication
> >
> > You may have already seen this in your research, but if not, I think
> > Bryan's ideas about the issues may be useful.
> >
> > --Russ