[Esip-citationguidelines] DOIs for data at multiple repositories

Wilson, Bruce E. wilsonbe at ornl.gov
Wed Jan 30 08:15:18 EST 2019


Location versus identity
Citation for credit vs citation for reproducibility
Series vs Instance

These are, IMO, different aspects of an overlapping set of problems, and ones for which current systems do not provide a good answer.  The answer I wish worked is a single DOI that references the concept of the dataset and allows for a clear trail should the data be superseded.  However, I think there are practical issues that prevent that unless the two data centers have a very good working relationship and the data in the two locations is bit-for-bit identical.

So, my opinion is that the least bad answer is that each repository has its own DOI and landing page.

The pattern that DataONE used (starting in V2 of the API) is very relevant here: there can be a Series identifier (which essentially represents the concept of the dataset) and an Instance identifier (which represents a very specific instantiation).  An Instance can have multiple locations (supporting multiple copies for preservation).  The hash is a required attribute of an Instance, so there is a mathematical guarantee that different copies of an Instance are identical.  This pattern could support what Jessica is talking about, including cases where the two copies are bitwise identical (same Instance) and cases where they are not (different Instances of the same Series).  DataONE does assume, though, that there is exactly one current Instance of a given Series (a head node model), so that resolving the Series identifier goes to the current Instance.  However, that pattern doesn’t exist in any universal system, only within the DataONE environment.
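
As a rough illustration of the Series/Instance pattern described above, here is a minimal Python sketch.  It is not DataONE's actual API or schema; the class names, field names, and the `same_instance` helper are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """One specific instantiation of a dataset (illustrative, not the DataONE schema)."""
    instance_id: str
    checksum: str                                       # required attribute of an Instance
    locations: List[str] = field(default_factory=list)  # copies held for preservation


@dataclass
class Series:
    """The concept of the dataset, with exactly one current Instance."""
    series_id: str
    instances: List[Instance] = field(default_factory=list)

    def publish(self, instance: Instance) -> None:
        # The most recently published Instance becomes the head.
        self.instances.append(instance)

    def resolve(self) -> Instance:
        # Head-node model: the Series identifier resolves to the current Instance.
        if not self.instances:
            raise LookupError(f"series {self.series_id!r} has no instances")
        return self.instances[-1]


def same_instance(a: Instance, b: Instance) -> bool:
    # Equal checksums are taken to mean the two copies hold identical bytes.
    return a.checksum == b.checksum
```

In this sketch, two repositories holding bitwise-identical copies would appear as two entries in `locations` of one Instance, while copies whose checksums differ would be separate Instances published under the same Series.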

This gets to an issue that I struggle to express, and I don’t know where else it has been expressed: the tension in the purposes of citation, which are citation for credit and citation for reproducibility.  In the case of journal articles, I would argue that the citation is pointing to the concept.  Whether someone reads the PDF version or HTML version doesn’t matter.  And there’s a long tradition that minor changes and updates to the article (adding an author ORCID, for example) don’t change the DOI.  And articles don’t get updated in the same way that a time series of data gets updated or a computer software package gets updated.  It’s only when we move to those types of research objects, where scientifically significant changes to the object happen over its lifetime, that we run into this issue of potentially wanting to point to both the concept of the research object (as a matter of credit) and to a very specific copy of that object at a point in time (for scientific reproducibility).  In Jessica’s example, the credit side of the tension argues for a single DOI.  The reproducibility side of the tension argues for different DOIs, because we don’t presently have a mechanism to identify the instance.

Yaxing Wei demonstrated a concept several years ago which is an interesting hack, effectively leveraging the fact that DOIs are referenced as HTTP URLs and the potential for GET parameters, along the lines of <https://dx.doi.org/10.12345/identifier1234?instance=someguid>.  That provides a mechanism for indicating both the concept and the specific instance.  It does not (directly) include the location, though the same protocols which are used for content distribution networks can be used for that location aspect.  However, this embeds some aspects of current technology into persistent identifiers, which is problematic.  As a simple example, https://doi.org/10.3334/ORNLDAAC/1324 and http://doi.org/10.3334/ORNLDAAC/1324 are the same object.  And there are some reasons to use and refer to the https version (tamper resistance for one).  It requires an understanding of present technology to recognize that fact, versus the older (and apparently deprecated) DOI reference of the form doi:10.3334/ORNLDAAC/1324. [1]

[1] No need to repeat the sermon that embedding the organization name in the DOI was a bad idea.  I can’t change what we did in the past.
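
To make the point about the different DOI forms concrete, here is a hedged Python sketch that reduces the http, https, and doi: forms to the same DOI string, and also picks up an instance hint.  The `instance` query parameter follows the hack described above and is not a standard feature of the DOI resolver; the function name and the set of resolver hosts are assumptions for this example, and the parameter is assumed to be attached with an ordinary `?` query string.

```python
from urllib.parse import urlsplit, parse_qs, unquote

# Resolver hosts this sketch recognizes (an assumption, not an exhaustive list).
RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def parse_doi_reference(ref: str):
    """Return (doi, instance) from a 'doi:' string or a resolver URL.

    `instance` is the value of the hypothetical `instance` query
    parameter, or None when it is absent.
    """
    ref = ref.strip()
    if ref.lower().startswith("doi:"):
        return ref[4:].strip(), None

    parts = urlsplit(ref)
    if parts.scheme not in ("http", "https") or parts.hostname not in RESOLVER_HOSTS:
        raise ValueError(f"not a recognized DOI reference: {ref!r}")

    # The DOI is the URL path without the leading slash.
    doi = unquote(parts.path.lstrip("/"))
    # parse_qs maps each parameter name to a list of values; take the first.
    instance = parse_qs(parts.query).get("instance", [None])[0]
    return doi, instance


for ref in (
    "https://doi.org/10.3334/ORNLDAAC/1324",
    "http://doi.org/10.3334/ORNLDAAC/1324",
    "doi:10.3334/ORNLDAAC/1324",
):
    print(parse_doi_reference(ref))
```

All three references print the same DOI, which is exactly the knowledge of present technology that a reader otherwise has to bring to the URL forms.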


================================================
Bruce E. Wilson (wilsonbe at ornl.gov)
Manager, ORNL Distributed Active Archive Center for Biogeochemical Dynamics
Group Leader, Remote Sensing and Environmental Informatics
Oak Ridge National Laboratory



On Jan 29, 2019, at 11:13 PM, Parsons, Mark via Esip-citationguidelines <esip-citationguidelines at lists.esipfed.org> wrote:

Oh dear. This is the old location vs. identity problem.

Fundamentally, DOIs and handles are locators. They only work as identifiers when there is appropriate human due diligence.

For your situation, a true identifier, such as a content-based identifier, could be useful if the two objects are identical at the bit level. But there is still a question of how that identifier should resolve.  Even with “identical” objects, the question remains whether you have accessed the authoritative object with relevant provenance or just an older, dated copy.
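
A minimal Python sketch of what a content-based identifier could look like, assuming SHA-256 over the raw bytes and a made-up `sha256:<hex>` naming scheme (neither is specified in this thread):

```python
import hashlib
import io


def content_identifier(stream, chunk_size: int = 1 << 20) -> str:
    """Return a hypothetical 'sha256:<hex>' identifier for a binary stream."""
    digest = hashlib.sha256()
    # Read in chunks so large data files need not fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return "sha256:" + digest.hexdigest()


# Two in-memory stand-ins for the copies held by two repositories.
copy_at_repo_a = io.BytesIO(b"time,value\n0,1.5\n")
copy_at_repo_b = io.BytesIO(b"time,value\n0,1.5\n")

print(content_identifier(copy_at_repo_a) == content_identifier(copy_at_repo_b))  # True
```

Matching identifiers only say that the bytes agree.  They say nothing about which copy is authoritative or what provenance it carries, which is the open question here.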

I think this means the two repos have no choice but to collaborate on how they want citation and related processes to work because the downstream user (or citation) needs to know not only if they have the right object but how it may have changed before and after the time of citation.

It could be useful, but probably difficult, to specify guidelines on how to do this collaboration (e.g., cross links like others suggest). I don’t see an obvious technical solution.

cheers,

-m.

On 29 Jan 2019, at 11:19, Hausman, Jessica (398G) via Esip-citationguidelines <esip-citationguidelines at lists.esipfed.org> wrote:

Hi
So this is probably more of a PID discussion, but I wanted to get people’s opinions on this; I’m probably welcoming a tsunami by doing this. We keep running into this problem of having a dataset that lives in 2 different repositories. While the data should be identical, the provenance isn’t. So should each repository register its own DOI, or should they try to share one, which can be logistically difficult? This would also make for some useful guidelines if they don’t already exist somewhere else.


--
Jessica Hausman
Jet Propulsion Lab
4800 Oak Grove Dr.
MS 158-242
Pasadena, CA 91109
http://orcid.org/0000-0002-1861-1526
Tel: +1 818-354-4588
_______________________________________________
Esip-citationguidelines mailing list
Esip-citationguidelines at lists.esipfed.org
https://lists.esipfed.org/mailman/listinfo/esip-citationguidelines
