[Esip-citationguidelines] DOIs for data at multiple repositories

Wed Jan 30 12:56:58 EST 2019

Bruce, good points about the different purposes for citation leading to
different needs for granularity. In library-land, this is the classic
work/edition/item challenge. Do you want to cite/find the work, e.g. Don
Quixote, or the edition, e.g. the 2003 translation of Don Quixote published
by Ecco, or the item, e.g. the copy of the 2003 translation of Don Quixote
that I have sitting on my bookshelf at home?

The answer is, it depends. Sometimes you just want the work, sometimes you
want a specific edition, and sometimes you want an exact item. I personally
like these two examples of why these levels matter sometimes.
Cloud Atlas - https://doi.org/10.17613/M63S4H
Wicked Bible - https://doi.org/10.1177/004057368003700311 (paywalled, but
you can see the first page and get the idea)

Ideally the metadata and retrieval systems can allow disambiguation at all
levels. But this is really hard. Library catalogers have struggled with
this for 100+ years. There was a big update to the standard cataloging
rules around 2010 to try to do this better, and nobody was really
satisfied.

For data, in my view, starting simple is the way to go. Having people cite
the data that they used, from wherever they got them, is a useful first
step.

Matt

On Wed, Jan 30, 2019 at 6:16 AM Wilson, Bruce E. via
Esip-citationguidelines <esip-citationguidelines at lists.esipfed.org> wrote:

> Location versus identity
> Citation for credit vs citation for reproducibility
> Series vs Instance
>
> These are, IMO, different aspects of an overlapping set of problems, and
> one where current systems do not provide a good answer.  The answer I wish
> worked was for there to be a single DOI, referencing the concept of the
> dataset, and which allowed for a clear trail should the data be
> superseded.  However, I think there are practical issues that prevent that
> unless the two data centers have a very good working relationship and the
> data in the two locations is bit-for-bit identical.
>
> So, my opinion is that the least bad answer is that each repository has
> its own DOI and landing page.
>
> The pattern that DataONE used (starting in V2 of the API) is very relevant
> here, which is that there can be a Series identifier (which essentially
> represents the concept of the dataset) and an Instance identifier (which
> represents a very specific instantiation).  And an Instance can have
> multiple locations (supporting multiple copies for preservation).  The hash
> is a required attribute of an Instance, so there is a mathematical
> guarantee that different copies of an instance are identical.  This pattern
> could support what Jessica is talking about, including cases where the two
> copies are bitwise identical (same Instance) and are not bitwise identical
> (different instances of the same Series), though DataONE does assume that
> there is exactly one current Instance of a given Series (a head node
> model), so that resolving the Series Identifier goes to the current
> Instance.  However, that pattern doesn’t exist in any universal system —
> only within the DataONE environment.
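
The Series/Instance pattern described above can be sketched roughly as
follows. This is only an illustration, not the DataONE API: the class names,
method names, and the choice of SHA-256 are assumptions made for the example,
as is treating the most recently added Instance as the current (head) one.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """One specific instantiation of a dataset (hypothetical model)."""
    instance_id: str
    sha256: str                      # the hash is a required attribute
    locations: List[str] = field(default_factory=list)

    def matches(self, content: bytes) -> bool:
        # A copy belongs to this Instance only if its hash is identical.
        return hashlib.sha256(content).hexdigest() == self.sha256


@dataclass
class Series:
    """The concept of the dataset, spanning all of its Instances."""
    series_id: str
    instances: List[Instance] = field(default_factory=list)

    def add(self, content: bytes, instance_id: str, location: str) -> Instance:
        digest = hashlib.sha256(content).hexdigest()
        for inst in self.instances:
            if inst.sha256 == digest:
                # Bitwise identical: same Instance, just another location.
                # The instance_id passed in is ignored in this case.
                inst.locations.append(location)
                return inst
        # Not bitwise identical: a different Instance of the same Series.
        inst = Instance(instance_id, digest, [location])
        self.instances.append(inst)
        return inst

    def head(self) -> Instance:
        # Assumption: the newest Instance is the current one, so resolving
        # the Series identifier returns it.
        return self.instances[-1]
```

With this sketch, two repositories holding identical bytes end up as two
locations of one Instance, while a copy that differs by even one bit becomes
a second Instance of the same Series.
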
>
> This gets to an issue that I struggle with how to express and I don’t know
> where it has been expressed, which is the tension in the purposes of
> citation: citation for credit and citation for reproducibility.  In the
> case of journal articles, I would argue that the citation is pointing to
> the concept.  Whether someone reads the PDF version or HTML version doesn’t
> matter.  And there’s a long tradition that minor changes and updates to the
> article (adding an author ORCID, for example) don’t change the DOI.  And
> articles don’t get updated in the same way that a time series of data gets
> updated or a computer software package gets updated.  It’s only when we
> move to those types of research objects, where scientifically significant
> changes to the object happen over its lifetime that we run into this issue
> of potentially wanting to point to both the concept of the research object
> (as a matter of credit) and to a very specific copy of that object at a
> point in time (for scientific reproducibility).  In Jessica’s example, the
> credit side of the tension argues for a single DOI.  The reproducibility
> side of the tension argues for different DOIs, because we don’t presently
> have a mechanism to identify an instance.
>
> Yaxing Wei demonstrated a concept several years ago which is an
> interesting hack, effectively leveraging the fact that DOIs are referenced
> as HTTP URLs and the potential for GET parameters, along the lines of
> https://dx.doi.org/10.12345/identifier1234?instance=someguid That
> provides a mechanism for indicating both the concept and specific
> instance.  It does not (directly) include the location, though the same
> protocols which are used for content distribution networks can be used for
> that location aspect.  However, this embeds some aspects of current
> technology into persistent identifiers, which is problematic.  As a simple
> example, https://doi.org/10.3334/ORNLDAAC/1324 and
> http://doi.org/10.3334/ORNLDAAC/1324 are the same object.  And there are
> some reasons to use and refer to the https version (tamper resistance for
> one).  It requires an understanding of present technology to recognize that
> fact, versus the older (and apparently deprecated) DOI reference of the
> form doi:10.3334/ORNLDAAC/1324 [1]
>
> [1] no need to repeat the sermon that embedding the organization name in
> the DOI was a bad idea.  I can’t change what we did in the past.
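
To make the point about the different DOI forms concrete, here is a minimal
sketch that reduces the http, https, and doi: forms to the same bare DOI and
pulls out an optional instance value. It assumes the instance is carried as a
query parameter named "instance" after a "?", along the lines of the hack
described above. The function name and the list of resolver hosts are
assumptions for the example, not part of any DOI specification.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Assumption: these are the only resolver hosts we want to recognize.
RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def parse_doi_reference(ref):
    """Return (doi, instance) for a DOI reference; instance may be None."""
    ref = ref.strip()
    if ref.lower().startswith("doi:"):
        # Older form, e.g. doi:10.3334/ORNLDAAC/1324
        return ref[4:].strip(), None
    parts = urlsplit(ref)
    if parts.scheme not in ("http", "https"):
        raise ValueError("not a recognized DOI reference: " + ref)
    if parts.netloc.lower() not in RESOLVER_HOSTS:
        raise ValueError("not a recognized DOI resolver: " + parts.netloc)
    # The scheme and host are location details; the DOI itself is the path.
    doi = unquote(parts.path.lstrip("/"))
    # Hypothetical "instance" parameter naming a specific instance.
    instance = parse_qs(parts.query).get("instance", [None])[0]
    return doi, instance
```

The http and https forms of 10.3334/ORNLDAAC/1324 and the doi: form all
reduce to the same bare DOI here, which is the knowledge of present
technology that a reader would otherwise have to bring along.
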
>
>
> ================================================
> Bruce E. Wilson (wilsonbe at ornl.gov)
> Manager, ORNL Distributed Active Archive Center for Biogeochemical Dynamics
> Group Leader, Remote Sensing and Environmental Informatics
> Oak Ridge National Laboratory
>
>
>
> On Jan 29, 2019, at 11:13 PM, Parsons, Mark via Esip-citationguidelines <
> esip-citationguidelines at lists.esipfed.org> wrote:
>
> Oh dear. This is the old location vs. identity problem.
>
> Fundamentally, DOIs and handles are locators. They only work as
> identifiers when there is appropriate human due diligence.
>
> For your situation, a true identifier, like a content-based identifier
> could be useful if the two objects are identical at the bit level. But
> there is still a question of how that identifier should resolve.  Even with
> “identical” objects the question remains of whether you have accessed the
> authoritative object with relevant provenance or just an older, dated copy.
>
> I think this means the two repos have no choice but to collaborate on how
> they want citation and related processes to work because the downstream
> user (or citation) needs to know not only if they have the right object but
> how it may have changed before and after the time of citation.
>
> It could be useful but probably difficult to specify guidelines on how to
> do this collaboration (i.e. cross links like others suggest). I don’t see
> an obvious technical solution.
>
> cheers,
>
> -m.
>
> On 29 Jan 2019, at 11:19, Hausman, Jessica (398G) via
> Esip-citationguidelines <esip-citationguidelines at lists.esipfed.org> wrote:
>
> Hi
> So this is probably more of a PID discussion, but wanted to get people’s
> opinions on this, I’m probably welcoming a tsunami by doing this. We keep
> running into this problem of having a dataset that lives in 2 different
> repositories. While the data should be identical, the provenance isn’t. So
> should each repository register its own DOI, or try to share one, which
> can be logistically difficult? This would also make for some useful
> guidelines if they don’t already exist somewhere else.
>
>
> --
> Jessica Hausman
> Jet Propulsion Lab
> 4800 Oak Grove Dr.
> MS 158-242
> Pasadena, CA 91109
> http://orcid.org/0000-0002-1861-1526
> Tel: +1 818-354-4588
> _______________________________________________
> Esip-citationguidelines mailing list
> Esip-citationguidelines at lists.esipfed.org
> https://lists.esipfed.org/mailman/listinfo/esip-citationguidelines