[Esip-preserve] a new relation type for subset citations

Joe Hourcle oneiros at grace.nascom.nasa.gov
Thu Jul 16 13:54:56 EDT 2015



On Thu, 16 Jul 2015, Greg Janée via Esip-preserve wrote:

> The RDA data citation working group has recommended that subsets of datasets (more broadly, queries against datasets) be persistently identified upon request; cf. https://www.rd-alliance.org/group/data-citation-wg/wiki/wgdc-recommendations.html.
>
> For this to work, queries have to be stored *somewhere*.  One approach (this appears to be the RDA group's working assumption) is for the provider to take on the burden of permanently storing queries, and from there it can issue PIDs for those queries by whatever means it has available.  Another approach is for the provider to support a query API of some kind, e.g., query URLs (think http://dataset?query=this+that+and+the+other).  This may result in lengthy URLs, but an external identifier system can be used to assign short, opaque PIDs that redirect to those query URLs.
>
> Regardless of the approach taken, the net result is multiple, related PIDs: one PID for the dataset as a whole, and then multiple PIDs, one per stored query.  It would be beneficial to record the relationship between these identifiers, particularly in the case when an external identifier system is being used.  The DataCite metadata schema (http://schema.datacite.org) lists a number of possibilities, but none quite fit:
>
> - IsPartOf/HasPart: "A IsPartOf B" implies that B can be broken down 
> into some disjoint pieces, and A is one of those pieces.  But a cited 
> subset is not a disjoint part of a whole.
>
> - IsCitedBy/Cites: "A IsCitedBy B" implies that B mentions A in some 
> way, and is possibly intellectually derived from A, but not necessarily. 
> This is intentionally a pretty vague relationship (as vague as 
> publication citations, right?), whereas a cited subset has a very 
> specific relationship to the whole.
>
> - IsReferencedBy/References: same thing.
>
> - IsMemberOf/HasMember (a newly proposed relation): "A IsMemberOf B" 
> implies that B has some rules or standards for inclusion, and A 
> satisfies those rules.  Doesn't seem applicable in this case.
>
> So I'm wondering if we need a new relation, which I'll provisionally 
> call "IsCitedSubsetOf".  "A IsCitedSubsetOf B" would mean that a 
> reference to A is really a reference to B, but only a subset of B was 
> actually used.  Note that I didn't call it IsSubsetOf, for that would 
> bring up the same issues that IsPartOf has.  It seems important to 
> record that the purpose of these query PIDs is for citation and nothing 
> else.
>
> Thoughts?

Don't forget that there could be a many-to-many relationship between the 
query results & the datasets for some data systems, so it's not always a 
simple relationship.
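
To make the many-to-many point concrete, here is a minimal Python sketch, 
not taken from any real system. The identifiers are hypothetical 
placeholders; the only idea shown is that links between query PIDs and 
dataset PIDs have to be looked up in both directions.

```python
from collections import defaultdict

# Hypothetical (query PID, dataset PID) pairs, for illustration only.
links = [
    ("query:Q1", "dataset:A"),
    ("query:Q1", "dataset:B"),  # one query can span several datasets
    ("query:Q2", "dataset:B"),  # one dataset can be hit by several queries
]

# Index the links in both directions.
datasets_for_query = defaultdict(set)
queries_for_dataset = defaultdict(set)
for query_pid, dataset_pid in links:
    datasets_for_query[query_pid].add(dataset_pid)
    queries_for_dataset[dataset_pid].add(query_pid)

print(sorted(datasets_for_query["query:Q1"]))
print(sorted(queries_for_dataset["dataset:B"]))
```

A single "parent dataset" field on the query record could not hold the 
first lookup, which is why a list of relation pairs is used here.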

For example: we serve a number of different types of processing from HMI 
(Doppler velocity, magnetic field, intensity, etc.), but a single query 
could get a time slice across several of them.  Using the VSO IDL client:

 	IDL> a = vso_search( inst='hmi', near='2014-01-01' );
 	Records Returned : JSOC : 1/1
 	Records Returned : JSOC : 1/1
 	Records Returned : JSOC : 1/1
 	Records Returned : JSOC : 1/1
 	Records Returned : JSOC : 1/1
 	IDL> print, rotate(a.info,1)
 	45sec. Continuum intensity
 	45sec. Magnetogram
 	720sec. Magnetogram
 	12min. IQUV
 	45sec. Dopplergram
 	IDL> print, rotate(a.fileid,1)
 	hmi__ic_45s:14726401:14726401
 	hmi__m_45s:14726401:14726401
 	hmi__m_720s:920400:920400
 	hmi__s_720s:920400:920400
 	hmi__v_45s:14726401:14726401
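
As a rough illustration of what recording those relations might look like, 
the Python sketch below takes the five fileids printed above and emits one 
relation per data series. It assumes the series name is whatever precedes 
the first colon in a fileid (a guess from the output, not a documented 
format), uses a made-up query identifier, and borrows the provisional 
"IsCitedSubsetOf" name from the quoted proposal.

```python
# The fileids returned by the query above.
fileids = [
    "hmi__ic_45s:14726401:14726401",
    "hmi__m_45s:14726401:14726401",
    "hmi__m_720s:920400:920400",
    "hmi__s_720s:920400:920400",
    "hmi__v_45s:14726401:14726401",
]

def series_of(fileid):
    # Assumption: the series name is everything before the first colon.
    return fileid.split(":", 1)[0]

QUERY_PID = "query:example"  # hypothetical identifier for the stored query

# One relation per distinct series touched by the query.
series = sorted({series_of(f) for f in fileids})
relations = [(QUERY_PID, "IsCitedSubsetOf", s) for s in series]

for subject, relation, target in relations:
    print(subject, relation, target)
```

One stored query ends up with five relations here, one for each series it 
drew from.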

The VSO web client doesn't support the 'time near' syntax, but here's an 
example of a 10-minute window of HMI data:

 	http://sdac.virtualsolar.org/cgi-bin/cartui.pl?sc_id=VSO-SDAC-150716-059

(yes, yes, the header jumping around thing is annoying, I know)

-Joe

