[Esip-preserve] a new relation type for subset citations

Mon Jul 20 17:59:53 EDT 2015

On Mon, 20 Jul 2015, Jeff de La Beaujardiere - NOAA via Esip-preserve wrote:

> Hello Andreas-
>
> My primary concern is asking the data provider, rather than the user, to
> record additional information about data requests. Even assigning DOIs
> carefully is a time-consuming process. Managing and retaining query
> information, perhaps for the life of the data, is prohibitive and, I
> believe, not particularly useful for the reasons stated in my previous
> message of July 16. I have no objection to any guidance you may wish to
> give data users regarding how to cite or describe the manipulations
> performed on the overall dataset accessible via DOI.

Jeff,

Sorry to get in on this late, but I think that I can explain part of the 
situation --

Although yes, there is the need for the deeper methods document, part of 
the issue was in having a reliable way to track a specific version of the 
data that was used, as the researcher might not be aware that the data 
could change.

The idea of the 'receipt' on download (but before analysis) allows for 
benefits which we don't have if we wait for the method section of the 
paper:

1. The researcher has an identifier (or list of identifiers) that they can
    use to determine if the data changes ... which they'd hopefully do
    before they submit their article for peer review.

2. The archive has a record that the data is being used by someone, so
    that they can either ensure that version gets stored for the long
    term, or notify the researcher that they've had to replace it (and why
    it was replaced).
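
To make the 'receipt' idea concrete, here is a minimal sketch in Python. It assumes the receipt is nothing more than an identifier plus a SHA-256 checksum of the bytes that were downloaded. The function names, the field names, and the example identifier are all made up for illustration; none of this is taken from the RDA recommendations.

```python
import hashlib
from datetime import datetime, timezone


def make_receipt(identifier: str, data: bytes) -> dict:
    """Build a download receipt (hypothetical format) for the bytes retrieved."""
    return {
        "identifier": identifier,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "retrieved": datetime.now(timezone.utc).isoformat(),
    }


def data_unchanged(receipt: dict, data: bytes) -> bool:
    """Return True if the data still matches the checksum in the receipt."""
    return hashlib.sha256(data).hexdigest() == receipt["sha256"]


# Toy data standing in for a downloaded subset; the identifier is a placeholder.
original = b"time,temp\n2015-07-01,21.4\n2015-07-02,22.0\n"
receipt = make_receipt("doi:10.0000/example-subset", original)

# Later (say, before submitting the article) the researcher fetches it again.
revised = original.replace(b"22.0", b"22.1")
print(data_unchanged(receipt, original))
print(data_unchanged(receipt, revised))
```

The researcher could run the same check before submission, and a reviewer could run it against whatever the archive hands back for the cited identifier.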

So ... this might make it so that a researcher can be more specific in 
their methods section / document about what data was obtained, and from 
where, and give us something that's short enough to put into a citation 
string.  It is *not* intended to be a replacement for well-described 
methods.

The closest analog is that it's intended to be the equivalent of giving 
the edition & page number in a book.*

Ideally, we make it so it's relatively easy for other researchers 
(especially the peer reviewers) to obtain the same subset of data that was 
used in the analysis, so they can then apply the same methods to verify the 
analysis.  Automation of analysis is another issue that will likely 
involve the software citation & provenance communities.

-Joe

* Only in this case it comes with checksums, so it's verifiable, and it
   allows us to give oddly specific subsets like 'the 4th word from every
   3rd sentence of the left-hand pages from pp. 120-130 ... unless that
   word contained the letter q.'  Although you could also just say
   'pp. 120-130', and then describe in the methods what additional
   filtering you applied.

> On Fri, Jul 17, 2015 at 7:32 AM, Andreas Rauber <rauber at ifs.tuwien.ac.at>
> wrote:
>
>> Hi Jeff,
>>
>> Thanks a lot for your comment (and thanks a lot, Ruth, for already
>> answering most of the issues raised!) I'd love to discuss some of the
>> issues raised in more detail, either at RDA P6 in Paris or some other
>> dedicated meeting, but just as a quick feedback:
>>
>> I fully agree with the need to go beyond only identifying the specific
>> subset of data, moving much deeper into capturing and describing the
>> process. (I currently have a team of 3 PhD students plus others working on
>> that) However, the challenges in this are not to be underestimated once we
>> move beyond the conceptual perspective towards the practicalities of doing
>> and deploying it - and guaranteeing correct re-execution!
>> We are working on capturing workflows, the organizational, SW and HW
>> context, workflow/process instance data for verification and validation,
>> etc.
>> Service-neutral languages and abstractions are extremely helpful in this -
>> but solve only parts of the problem when it comes to actual repeatability
>> as studies have shown.
>> We have discussed these issues a lot within the WG, but decided, for the
>> recommendations, to refrain from including even generalized views, as they
>> can cause problems that cannot easily be resolved in practice.
>>
>> As Ruth already pointed out, the primary goal was to find a solution to one
>> part of the challenge, and this has to be a solution that can be
>> practically deployed without prohibitive effort or changes to existing
>> operations.
>> Note, further, that even approaches going further, i.e. capturing process
>> descriptions etc. will require the proposed data citation solution: if data
>> is dynamic and one wants to be able to go back to earlier states of a DB,
>> timestamping/versioning changes is required.
>> If a process selects a subset of data, that selection needs to be
>> re-executed, i.e. stored as part of the process description in the process
>> instance, the workflow model, etc. So it is mostly the question of who
>> stores the corresponding "query", in which form and where, and how far we dare
>> to go from the mere select/project form of subset identification via
>> generalized views towards increasingly more complex transformations towards
>> capturing entire processes.
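
To illustrate the timestamping point above, here is a minimal sketch using Python's built-in sqlite3 module. It assumes each row carries a validity interval and that the stored "query" is just the select/project SQL plus the time at which it was first run. The table layout, column names, and integer timestamps are assumptions made for this example, not part of the WG recommendations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obs (station TEXT, value REAL, "
    "valid_from INTEGER, valid_to INTEGER)"
)

# Rows are never overwritten: a correction closes the old row (valid_to)
# and adds a new one. valid_to = NULL marks the current state.
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?, ?)",
    [
        ("A", 1.0, 1, 5),     # original value, superseded at time 5
        ("A", 1.5, 5, None),  # corrected value
        ("B", 2.0, 2, None),
        ("C", 3.0, 7, None),  # station added after the query was first run
    ],
)

# The stored query: the select/project statement plus its execution time.
stored_query = {
    "sql": (
        "SELECT station, value FROM obs "
        "WHERE value >= :min_value "
        "AND valid_from <= :as_of "
        "AND (valid_to IS NULL OR valid_to > :as_of) "
        "ORDER BY station"
    ),
    "params": {"min_value": 1.0, "as_of": 3},
}


def rerun(connection, query):
    """Re-execute a stored query against the state of the data at 'as_of'."""
    return connection.execute(query["sql"], query["params"]).fetchall()


# Same subset as at time 3, despite the later correction and addition.
print(rerun(conn, stored_query))

# The same selection against the data as of time 10, for comparison.
later = {**stored_query, "params": {**stored_query["params"], "as_of": 10}}
print(rerun(conn, later))
```

Who keeps `stored_query` (the archive, the researcher, or a third party) is exactly the open question in this thread; the sketch only shows that the query plus a timestamp is enough to get the earlier subset back when changes are versioned.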
>>
>> I fully agree with the final goal, and we have discussed how far we dare
>> to move, deciding to stay at the select/project query level to ensure the
>> solution will work generically. (As I said, the practicalities arising
>> when going into more detail are mind-boggling.) The current recommendations
>> address a necessary sub-component of any more far-reaching solution that
>> already solves many of the issues arising, can be deployed with comparatively
>> little effort immediately, and can be expanded to include more extensive
>> processing at any point in time.
>> They do not replace the methods section of a paper. We'd be happy to
>> discuss further WG activities moving further in that direction.
>>
>> I'd be very happy to discuss these things in more detail, either during a
>> telephone conference or a dedicated meeting.
>>
>> Best regards, Andreas
>>
>>
>>> Am 17.07.2015 um 07:08 schrieb Ruth Duerr <ruth.duerr3 at gmail.com>:
>>>
>>> Hi Jeff,
>>>
>>> I can tell you that very long discussions of the "what does the subset
>> cover" issue were held in very many places.  The general agreement was that
>> the purpose of a subset specifier was not to replace the methods portion of
>> a paper; that the very best a repository could do was to allow explicit
>> identification of what subset the user actually obtained - what they did
>> with that data after that was their issue to describe in their paper - so
>> this was definitely not meant to address workflow.
>>>
>>> I am sure that he would be happy to talk to you about this (actually he
>> has been trying pretty hard to get folks to comment and he has quite a few
>> pilot implementations at this point).  I took the liberty of adding him to
>> this conversation.
>>>
>>> Ruth
>>>
>>>
>>>> On Jul 16, 2015, at 4:48 PM, Fontaine, Kathy via Esip-preserve <
>> esip-preserve at lists.esipfed.org> wrote:
>>>>
>>>> Hi all - one more thing....
>>>>
>>>> Please note that the RDA outputs are designed to solve one particular
>> problem that was identified and scoped by the Working Group.  These should
>> not ever be viewed as _the_ universal answer to all related problems.
>>>>
>>>> With that in mind, if the issue you are describing, Jeff, is not
>> addressed in the initial conditions, that's why you see what you see.  The
>> comment, then, might be addressed in future or follow-on work.
>>>>
>>>> I don't know, but just wanted to put that caveat out there.
>>>>
>>>> Thanks
>>>>
>>>> K
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> Dr. Kathleen Fontaine
>>>> Managing Director, Research Data Alliance/US (RDA/US)
>>>>
>>>> Amos Eaton Building, Room 211
>>>> Rensselaer Polytechnic Institute (RPI)
>>>> 110 8th Street
>>>> Troy, NY 12180-3590
>>>>
>>>> Cell:  410-991-6728
>>>> Office:  518-276-2829
>>>>
>>>> Email:  fontak at rpi.edu
>>>> Skype:  ksfontaine
>>>>
>>>> ________________________________________
>>>> From: Fox, Peter
>>>> Sent: Thursday, July 16, 2015 7:30 PM
>>>> To: Jeff de La Beaujardiere - NOAA
>>>> Cc: Greg Janée; ESIP Preserve List; Fontaine, Kathy
>>>> Subject: Re: [Esip-preserve] a new relation type for subset citations
>>>>
>>>> Jeff - go here https://rd-alliance.org/groups/data-citation-wg.html
>> (need to register/ login) and can add comments (cc: Kathy F - who is here
>> at ESIP for more info on commenting).
>>>> ---Peter.
>>>>
>>>>> On 16 Jul 2015, at 19:25 , Jeff de La Beaujardiere - NOAA via
>> Esip-preserve <esip-preserve at lists.esipfed.org> wrote:
>>>>>
>>>>> I strongly disagree with the RDA recommendation to include subset
>> specifiers in citations and to require the provider to record them
>> permanently. Besides the huge burden on the providers, subsetting is only
>> one part of the workflow. Scientific papers need to describe the work they
>> did including subsetting, mathematical operations, assumptions, et cetera,
>> so merely capturing the subset information is nearly worthless. If the
>> workflow is to be captured in a machine-readable fashion, then a
>> service-neutral language such as (but not necessarily) the OGC Web Coverage
>> Processing Service grammar should be referenced in the paper using a URL
>> maintained by the author or publisher of the paper.
>>>>>
>>>>> I would like to register this objection with RDA but am not sure where
>> to do so. I have CCed Mark Parsons as a start.
>>>>>
>>>>> Regards,
>>>>> Jeff DLB
>>>>>
>>>>>
>>>>> Jeff de La Beaujardiere, PhD
>>>>> NOAA Data Management Architect
>>>>> 1335 East-West Hwy, Silver Spring MD 20910 USA
>>>>> +1 301 713 7175 (NESDIS/ACIO-S - SSMC1/5236)
>>>>> ORCID: http://orcid.org/0000-0002-1001-9210
>>>>>
>>>>> On Thu, Jul 16, 2015 at 9:54 AM, Greg Janée <
>> esip-preserve at lists.esipfed.org> wrote:
>>>>> The RDA data citation working group has recommended that subsets of
>> datasets (more broadly, queries against datasets) be persistently
>> identified upon request; cf.
>> https://www.rd-alliance.org/group/data-citation-wg/wiki/wgdc-recommendations.html
>> .
>>>>> `
>>>>> For this to work, queries have to be stored *somewhere*.  One approach
>> (this appears to be the RDA group's working assumption) is for the provider
>> to take on the burden of permanently storing queries, and from there it can
>> issue PIDs for those queries by whatever means it has available.  Another
>> approach is for the provider to support a query API of some kind, e.g.,
>> query URLs (think http://dataset?query=this+that+and+the+other).  This
>> may result in lengthy URLs, but an external identifier system can be used
>> to assign short, opaque PIDs that redirect to those query URLs.
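
The second approach above (query URLs plus short, opaque PIDs from an external identifier system) can be sketched as follows. Everything here is hypothetical: the base URL, the `PidRegistry` class, and the idea of deriving the PID from a hash of the URL are stand-ins chosen so the example is self-contained, and a real identifier service would mint and resolve PIDs in its own way.

```python
import hashlib
from urllib.parse import urlencode

BASE_URL = "http://dataset.example.org/"  # placeholder, not a real service


def query_url(params: dict) -> str:
    """Build a query URL; parameters are sorted so equal queries give equal URLs."""
    return BASE_URL + "?" + urlencode(sorted(params.items()))


class PidRegistry:
    """Toy stand-in for an external identifier system (in-memory only)."""

    def __init__(self, prefix: str = "pid:example/"):
        self.prefix = prefix
        self._targets = {}

    def mint(self, url: str) -> str:
        # Assumption for this sketch: the short PID is the first 12 hex
        # digits of the URL's SHA-256, so minting the same URL twice
        # returns the same PID.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
        pid = self.prefix + digest
        self._targets[pid] = url
        return pid

    def resolve(self, pid: str) -> str:
        """Return the query URL that the PID redirects to."""
        return self._targets[pid]


registry = PidRegistry()
url = query_url({"query": "this that and the other"})
pid = registry.mint(url)
print(url)
print(pid, "->", registry.resolve(pid))
```

In this arrangement the provider only has to keep the query API stable; the mapping from PID to query URL lives in the identifier system.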
>>>>>
>>>>> Regardless of the approach taken, the net result is multiple, related
>> PIDs: one PID for the dataset as a whole, and then multiple PIDs, one per
>> stored query.  It would be beneficial to record the relationship between
>> these identifiers, particularly in the case when an external identifier
>> system is being used.  The DataCite metadata schema (
>> http://schema.datacite.org) lists a number of possibilities, but none
>> quite fit:
>>>>>
>>>>> - IsPartOf/HasPart: "A IsPartOf B" implies that B can be broken down
>> into some disjoint pieces, and A is one of those pieces.  But a cited
>> subset is not a disjoint part of a whole.
>>>>>
>>>>> - IsCitedBy/Cites: "A IsCitedBy B" implies that B mentions A in some
>> way, and is possibly intellectually derived from A, but not necessarily.
>> This is intentionally a pretty vague relationship (as vague as publication
>> citations, right?), whereas a cited subset has a very specific relationship
>> to the whole.
>>>>>
>>>>> - IsReferencedBy/References: same thing.
>>>>>
>>>>> - IsMemberOf/HasMember (a newly proposed relation): "A IsMemberOf B"
>> implies that B has some rules or standards for inclusion, and A satisfies
>> those rules.  Doesn't seem applicable in this case.
>>>>>
>>>>> So I'm wondering if we need a new relation, which I'll provisionally
>> call "IsCitedSubsetOf".  "A IsCitedSubsetOf B" would mean that a reference
>> to A is really a reference to B, but only a subset of B was actually used.
>> Note that I didn't call it IsSubsetOf, for that would bring up the same
>> issues that IsPartOf has.  It seems important to record that the purpose of
>> these query PIDs is for citation and nothing else.
>>>>>
>>>>> Thoughts?
>>>>> -Greg
>>>>>
>>>>> _______________________________________________
>>>>> Esip-preserve mailing list
>>>>> Esip-preserve at lists.esipfed.org
>>>>> http://lists.deltaforce.net/mailman/listinfo/esip-preserve
>>>>>
>>>>
>>>
>>
>