[Esip-preserve] a new relation type for subset citations

Wed Jul 22 09:25:18 EDT 2015

Hi Jeff,

> My primary concern is asking the data provider, rather than the user, to
> record additional information about data requests.

Here I tend to disagree with you: it is not only about "recording"
information on data requests, but about keeping them accessible and usable.
Data citation, and resolving references to the data being used, is a key
infrastructure service:

We cannot rely on users to keep their queries accessible over long periods
of time, particularly if those users may not even be aware of the
concept of a query because they create their subset via some workbench
interface. Furthermore, if the data representation or underlying technology
ever changes, how are users supposed to migrate those stored
queries when they are usually shielded from the technical complexity of
the underlying data representation? If we want this to be robust
and sustainable, it needs to be provided by the research
infrastructure/data centers.

> Even assigning DOIs
> carefully is a time-consuming process.

This is true in principle, but note that in these settings, the minting
and the provisioning of all required metadata would be fully automatic.

> Managing and retaining query
> information, perhaps for the life of the data, is prohibitive and,

I'd be very interested in learning more about the reasons why you think
this is the case!

In virtually all pilots that we discussed, once we went into the details, it
turned out that the queries themselves were not problematic to store and
retain at all. We had one setting where an enormous number of queries was
continuously fired against a system (similar to, say, the Google search
engine), where this might have turned into an issue. But if one maintains
only "relevant" subset queries, i.e. queries designed by tools or
researchers where one would need to go back to the respective subset (and
which were not programmatically created, in which case one would simply
keep the query generation process rather than trillions of queries), the
actual storage and management of the queries is usually relatively
trivial:
It's a small table/database storing the query PID, the query string
(including the timestamp), one or two hash keys, maybe a pointer to the
superset, some user information if desired, and that's about it. It is
usually really small compared to the data.
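
To illustrate how little is involved, here is a minimal sketch of such a
query store in Python with SQLite. It is only an illustration under stated
assumptions: the table and column names, the "hypothetical-pid:" identifier
format, the use of SHA-256, and the whitespace-only query normalization are
all invented for this example and are not prescribed by the WG
recommendations or taken from any pilot.

```python
import hashlib
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema: one row per stored subset query.
SCHEMA = """
CREATE TABLE IF NOT EXISTS query_store (
    query_pid     TEXT PRIMARY KEY,  -- persistent identifier of the query
    query_string  TEXT NOT NULL,     -- the query as executed
    executed_at   TEXT NOT NULL,     -- execution timestamp (UTC, ISO 8601)
    query_hash    TEXT NOT NULL,     -- hash of the normalized query text
    result_hash   TEXT NOT NULL,     -- hash of the result set
    superset_pid  TEXT,              -- optional pointer to the full dataset
    requested_by  TEXT               -- optional user information
)
"""


def sha256_hex(text):
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def store_query(conn, query_string, result_rows,
                superset_pid=None, requested_by=None):
    """Store a query and return its PID.

    If the same (normalized) query with the same result was stored
    before, the existing PID is returned instead of a new one.
    result_rows is assumed to arrive in a stable, well-defined order.
    """
    # Crude normalization (assumption): collapse runs of whitespace.
    query_hash = sha256_hex(" ".join(query_string.split()))
    result_hash = sha256_hex("\n".join(repr(tuple(r)) for r in result_rows))

    row = conn.execute(
        "SELECT query_pid FROM query_store "
        "WHERE query_hash = ? AND result_hash = ?",
        (query_hash, result_hash),
    ).fetchone()
    if row is not None:
        return row[0]

    # Placeholder identifier; a real system would mint a DOI, Handle, etc.
    query_pid = "hypothetical-pid:" + uuid.uuid4().hex
    conn.execute(
        "INSERT INTO query_store VALUES (?, ?, ?, ?, ?, ?, ?)",
        (query_pid, query_string,
         datetime.now(timezone.utc).isoformat(),
         query_hash, result_hash, superset_pid, requested_by),
    )
    conn.commit()
    return query_pid
```

A caller would open a connection (for instance sqlite3.connect(":memory:")),
run conn.execute(SCHEMA) once, and then call store_query() each time a user
asks for a citable subset. Each row holds a few short strings, which is why
the store stays tiny relative to the data it describes.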

> I
> believe, not particularly useful for the reasons stated in my previous
> message of July 16.

I hope I was able to clarify that, while going "bigger" in terms of PIDing
an entire process is definitely desirable, I do not see (1) how that
can be guaranteed to work given our current technology, and (2) how that
could be made to work without also supporting exactly the kind of data
citation recommended by the WG: any process would then also need to
store the queries it executes, so it is the same thing (as is the need for
data versioning if you need to support going back to any such previous
version).
So, the proposed bigger solution will require the recommended "smaller"
solution. If you have a solution that would solve this issue
without it, I'd be very interested to learn about it.

> I have no objection to any guidance you may wish to
> give data users regarding how to cite or describe the manipulations
> performed on the overall dataset accessible via DOI.

great :) But I'd really want to make sure it can work, in principle, in a
range of settings, and satisfy the needs that users have, so I am very
happy about any issues being raised to see whether they can be addressed!

thanks a lot!

Andreas

> Regards,
> Jeff DLB
> 
> 
> Jeff de La Beaujardiere, PhD
> NOAA Data Management Architect 
> 1335 East-West Hwy, Silver Spring MD 20910 USA
> +1 301 713 7175 (NESDIS/ACIO-S - SSMC1/5236)
> ORCID: http://orcid.org/0000-0002-1001-9210
> 
> On Fri, Jul 17, 2015 at 7:32 AM, Andreas Rauber <rauber at ifs.tuwien.ac.at>
> wrote:
>       Hi Jeff,
>
>       Thanks a lot for your comment (and thanks a lot, Ruth, for
>       already answering most of the issues raised!) I'd love to
>       discuss some of the issues raised in more detail, either at
>       RDA P6 in Paris or some other dedicated meeting, but just as
>       a quick feedback:
>
>       I fully agree with the need to go beyond only identifying the
>       specific subset of data, moving much deeper into capturing
>       and describing the process. (I currently have a team of 3 PhD
>       students plus others working on that) However, the challenges
>       in this are not to be underestimated once we move beyond the
>       conceptual perspective towards the practicalities of doing
>       and deploying it - and guaranteeing correct re-execution!
>       We are working on capturing workflows, the organizational, SW
>       and HW context, workflow/process instance data for
>       verification and validation, etc.
>       Service-neutral languages and abstractions are extremely
>       helpful in this - but solve only parts of the problem when it
>       comes to actual repeatability as studies have shown.
>       We have discussed these issues a lot within the WG, but
>       decided, for the recommendations, to refrain from including
>       even generalized views, as they can cause problems that
>       cannot easily be resolved in practice.
>
>       As Ruth already pointed out, the primary goal was to find a
>       solution to one part of the challenge, and this has to be a
>       solution that can be practically deployed without prohibitive
>       effort or changes to existing operations.
>       Note, further, that even approaches going further, i.e.
>       capturing process descriptions etc. will require the proposed
>       data citation solution: if data is dynamic and one wants to
>       be able to go back to earlier states of a DB,
>       timestamping/versioning changes is required.
>       If a process selects a subset of data, that selection needs
>       to be re-executed, i.e. stored as part of the process
>       description in the process instance, the workflow model, etc.
>       So it is mostly the question of who stores the according
>       "query", in which form and where, and how far we dare to go
>       from the mere select/project form of subset identification
>       via generalized views towards increasingly more complex
>       transformations towards capturing entire processes.
>
>       I fully agree with the final goal, and we have discussed how
>       far we dare to move, deciding to stay at the select/project
>       query level to ensure the solution will work
>       generically. (As I said, the practicalities arising when
>       going into more detail are mind-boggling.) The current
>       recommendations address a necessary sub-component of any more
>       far-reaching solution that already solve many issues arising
>       and can be deployed with comparatively little effort
>       immediately, and can be expanded to include more extensive
>       processing at any point in time.
>       They do not replace the methods section of a paper. We'd be
>       happy to discuss further WG activities moving further in that
>       direction.
>
>       I'd be very happy to discuss  these things in more detail,
>       either during a telephone conference or a dedicated meeting.
>
>       Best regards, Andreas
> 
>
>       > Am 17.07.2015 um 07:08 schrieb Ruth Duerr
>       <ruth.duerr3 at gmail.com>:
>       >
>       > Hi Jeff,
>       >
>       > I can tell you that very long discussions of the "what does
>       the subset cover" issue were held in very many places.  The
>       general agreement was that the purpose of a subset specifier
>       was not to replace the methods portion of a paper; that the
>       very best a repository could do was to allow explicit
>       identification of what subset the user actually obtained -
>       what they did with that data after that, was their issue to
>       describe in their paper - so this was definitely not
>       meant to address workflow.
>       >
>       > I am sure that he would be happy to talk to you about this
>       (actually he has been trying pretty hard to get folks to
>       comment and he has quite a few pilot implementations at this
>       point).  I took the liberty of adding him to this
>       conversation.
>       >
>       > Ruth
>       >
>       >
>       >> On Jul 16, 2015, at 4:48 PM, Fontaine, Kathy via
>       Esip-preserve <esip-preserve at lists.esipfed.org> wrote:
>       >>
>       >> Hi all - one more thing....
>       >>
>       >> Please note that the RDA outputs are designed to solve one
>       particular problem that was identified and scoped by the
>       Working Group.  These should not ever be viewed as _the_
>       universal answer to all related problems.
>       >>
>       >> With that in mind, if the issue you are describing, Jeff,
>       is not addressed in the initial conditions, that's why you
>       see what you see.  The comment, then, might be addressed in
>       future or follow-on work.
>       >>
>       >> I don't know, but just wanted to put that caveat out
>       there.
>       >>
>       >> Thanks
>       >>
>       >> K
>       >>
>       >>
>       >>
>       >> ________________________________
>       >> Dr. Kathleen Fontaine
>       >> Managing Director, Research Data Alliance/US (RDA/US)
>       >>
>       >> Amos Eaton Building, Room 211
>       >> Rensselaer Polytechnic Institute (RPI)
>       >> 110 8th Street
>       >> Troy, NY 12180-3590
>       >>
>       >> Cell:  410-991-6728
>       >> Office:  518-276-2829
>       >>
>       >> Email:  fontak at rpi.edu
>       >> Skype:  ksfontaine
>       >>
>       >> ________________________________________
>       >> From: Fox, Peter
>       >> Sent: Thursday, July 16, 2015 7:30 PM
>       >> To: Jeff de La Beaujardiere - NOAA
>       >> Cc: Greg Janée; ESIP Preserve List; Fontaine, Kathy
>       >> Subject: Re: [Esip-preserve] a new relation type for
>       subset citations
>       >>
>       >> Jeff - go here
>       https://rd-alliance.org/groups/data-citation-wg.html (need to
>       register/ login) and can add comments (cc: Kathy F - who is
>       here at ESIP for more info on commenting).
>       >> ---Peter.
>       >>
>       >>> On 16 Jul 2015, at 19:25 , Jeff de La Beaujardiere - NOAA
>       via Esip-preserve <esip-preserve at lists.esipfed.org> wrote:
>       >>>
>       >>> I strongly disagree with the RDA recommendation to
>       include subset specifiers in citations and to require the
>       provider to record them permanently. Besides the huge burden
>       on the providers, subsetting is only one part of the
>       workflow. Scientific papers need to describe the work they
>       did including subsetting, mathematical operations,
>       assumptions, et cetera, so merely capturing the subset
>       information is nearly worthless. If the workflow is to be
>       captured in a machine-readable fashion, then a
>       service-neutral language such as (but not necessarily) the
>       OGC Web Coverage Processing Service grammar should be
>       referenced in the paper using a URL maintained by the author
>       or publisher of the paper.
>       >>>
>       >>> I would like to register this objection with RDA but am
>       not sure where to do so. I have CCed Mark Parsons as a start.
>       >>>
>       >>> Regards,
>       >>> Jeff DLB
>       >>>
>       >>>
>       >>> Jeff de La Beaujardiere, PhD
>       >>> NOAA Data Management Architect
>       >>> 1335 East-West Hwy, Silver Spring MD 20910 USA
>       >>> +1 301 713 7175 (NESDIS/ACIO-S - SSMC1/5236)
>       >>> ORCID: http://orcid.org/0000-0002-1001-9210
>       >>>
>       >>> On Thu, Jul 16, 2015 at 9:54 AM, Greg Janée
>       <esip-preserve at lists.esipfed.org> wrote:
>       >>> The RDA data citation working group has recommended that
>       subsets of datasets (more broadly, queries against datasets)
>       be persistently identified upon request; cf.
>       https://www.rd-alliance.org/group/data-citation-wg/wiki/wgdc-recommendations.html.
>       >>>
>       >>> For this to work, queries have to be stored *somewhere*. 
>       One approach (this appears to be the RDA group's working
>       assumption) is for the provider to take on the burden of
>       permanently storing queries, and from there it can issue PIDs
>       for those queries by whatever means it has available. 
>       Another approach is for the provider to support a query API
>       of some kind, e.g., query URLs (think
>       http://dataset?query=this+that+and+the+other).  This may
>       result in lengthy URLs, but an external identifier system can
>       be used to assign short, opaque PIDs that redirect to those
>       query URLs.
>       >>>
>       >>> Regardless of the approach taken, the net result is
>       multiple, related PIDs: one PID for the dataset as a whole,
>       and then multiple PIDs, one per stored query.  It would be
>       beneficial to record the relationship between these
>       identifiers, particularly in the case when an external
>       identifier system is being used.  The DataCite metadata
>       schema (http://schema.datacite.org) lists a number of
>       possibilities, but none quite fit:
>       >>>
>       >>> - IsPartOf/HasPart: "A IsPartOf B" implies that B can be
>       broken down into some disjoint pieces, and A is one of those
>       pieces.  But a cited subset is not a disjoint part of a
>       whole.
>       >>>
>       >>> - IsCitedBy/Cites: "A IsCitedBy B" implies that B
>       mentions A in some way, and is possibly intellectually
>       derived from A, but not necessarily.  This is intentionally a
>       pretty vague relationship (as vague as publication citations,
>       right?), whereas a cited subset has a very specific
>       relationship to the whole.
>       >>>
>       >>> - IsReferencedBy/References: same thing.
>       >>>
>       >>> - IsMemberOf/HasMember (a newly proposed relation): "A
>       IsMemberOf B" implies that B has some rules or standards for
>       inclusion, and A satisfies those rules.  Doesn't seem
>       applicable in this case.
>       >>>
>       >>> So I'm wondering if we need a new relation, which I'll
>       provisionally call "IsCitedSubsetOf".  "A IsCitedSubsetOf B"
>       would mean that a reference to A is really a reference to B,
>       but only a subset of B was actually used.  Note that I didn't
>       call it IsSubsetOf, for that would bring up the same issues
>       that IsPartOf has.  It seems important to record that the
>       purpose of these query PIDs is for citation and nothing else.
>       >>>
>       >>> Thoughts?
>       >>> -Greg
>       >>>
>       >>> _______________________________________________
>       >>> Esip-preserve mailing list
>       >>> Esip-preserve at lists.esipfed.org
>       >>>
>       http://lists.deltaforce.net/mailman/listinfo/esip-preserve
>       >>>
>       >>
>       >
> 
> 
> 
>