[Esip-preserve] [Infusion] Suggestion for tech infusion activity vis a vis MEaSUREs

Wed Apr 14 10:04:32 EDT 2010

On 03/23/2010 02:35 PM, Wilson, Brian D (335G) wrote:
> We will need to formulate this consensus recommendation quickly.
> 
> I suggest two features:
> 
> 1) Publish the MEASUREs datasets as a dataset paper in an appropriate
> journal so the *dataset* has a reference-able DOI.

In the ESIP Preservation cluster identifiers group, we've begun to
distinguish the concept of "Data Type" (what EOS calls an ESDT) from
"Dataset", which is a specific version (in EOS parlance, a
'Collection') of that Data Type.

I put some strawman terms and definitions here: (up for discussion!)
http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Identifiers#Definitions

I think each of those concepts needs a referenceable identifier from
which we can construct data citations.

For example, consider ESDT FOO.  It is archived in DAAC MyOrg
(CrossRef DOI Org 10.12345), which has archived data from ESDT FOO for
collection 1 (a "Closed Data Set") and is currently archiving
collection 2 (an "Open Data Set" still being processed from current
data).

We need a citation for the general data type:

Smith, John. "Some Earth Science Data", FOO, DOI: 10.12345/FOO.

and a citation for each data set (each version of the data type).
Rather than registering a new DOI for each new version (collection),
I'm inclined to advise reusing the data type DOI:

Smith, John. "Some Earth Science Data", FOO, DOI: 10.12345/FOO,
Collection 1.

This "datatype DOI" could also be the 'published paper describing the
dataset' DOI, but I guess I'd be inclined to have separate DOIs, one
for the paper, and one for the datatype.  Then a paper could reference
either or both as appropriate to the nature of the use.

Alternatively, we could register distinct DOIs for each new version:

Smith, John. "Some Earth Science Data", FOO, DOI: 10.12345/FOO.1,
Collection 1.

For the "Open Data Set" case, I think we must precisely qualify the
citation to reference the specific granule membership of the dataset.
There are a few ways to do this, but I think the cleanest is a
date/time stamp:

Smith, John. "Some Earth Science Data", FOO, DOI: 10.12345/FOO,
Collection 2, 2010-04-01T14:00:00.
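
To make the three citation forms above concrete, here is a rough
Python sketch that assembles them from their parts. The function name
format_citation and its parameters are made up for illustration, and
it assumes the "reuse the data type DOI" option rather than a DOI per
collection; nothing here is an agreed format.

```python
from datetime import datetime
from typing import Optional


def format_citation(author: str, title: str, esdt: str, doi: str,
                    collection: Optional[int] = None,
                    as_of: Optional[datetime] = None) -> str:
    """Build a citation for a data type, a closed data set, or an open data set.

    Hypothetical helper: the field order simply mirrors the strawman
    examples in this message.
    """
    if as_of is not None and collection is None:
        # A timestamp only makes sense when qualifying a specific collection.
        raise ValueError("as_of requires a collection number")

    parts = [f'{author}. "{title}", {esdt}, DOI: {doi}']
    if collection is not None:
        parts.append(f"Collection {collection}")
    if as_of is not None:
        # Timestamp pins down granule membership of an open data set.
        parts.append(as_of.strftime("%Y-%m-%dT%H:%M:%S"))
    return ", ".join(parts) + "."


# Data type, closed data set (collection 1), open data set (collection 2).
print(format_citation("Smith, John", "Some Earth Science Data", "FOO",
                      "10.12345/FOO"))
print(format_citation("Smith, John", "Some Earth Science Data", "FOO",
                      "10.12345/FOO", collection=1))
print(format_citation("Smith, John", "Some Earth Science Data", "FOO",
                      "10.12345/FOO", collection=2,
                      as_of=datetime(2010, 4, 1, 14, 0, 0)))
```

Switching to the "distinct DOI per version" alternative would only
change the doi argument passed in (e.g. 10.12345/FOO.1); the rest of
the citation stays the same.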

> 2) Serve the dataset granules from permanent (as possible) URL's
> from the origin sites and the receiving DAAC's.  The grabbed real
> estate, the root of the URL, should reference MEASUREs and the
> institution, and not contain the name of a computer (or something
> else that is dumb).
>
> 3) As far as truly permanent URI's, I don't know what to say.  I
> don't think either the handle system, XRI's, or any other system has
> gotten traction (a large market share).  This is mostly the fault of
> the W3C, which thinks the entire problem has been solved by existing
> URLs and URNs.  Hogwash.

I like including both identifiers, datatype and dataset.  I'm leaning
toward using DOIs for the datatype and PURLs for the precise data
specification and locator:

Smith, John. "Some Earth Science Data", FOO, DOI: 10.12345/FOO,
Collection 2, http://purl.org/NET/MyOrg/data/FOO/2/2010-04-01T14:00:00.
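
As a sketch of how such a locator could be built and read back, the
Python below assumes the PURL path layout is exactly the one in the
example above (NET/<org>/data/<ESDT>/<collection>/<timestamp>). That
layout, and the names build_purl and parse_purl, are only my strawman,
not anything registered or agreed.

```python
from datetime import datetime
from typing import Tuple
from urllib.parse import urlsplit

PURL_ROOT = "http://purl.org/NET"   # root taken from the example citation
STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"  # same timestamp form as the citation


def build_purl(org: str, esdt: str, collection: int, as_of: datetime) -> str:
    """Assemble a locator for one collection as of a given time (assumed layout)."""
    stamp = as_of.strftime(STAMP_FORMAT)
    return f"{PURL_ROOT}/{org}/data/{esdt}/{collection}/{stamp}"


def parse_purl(url: str) -> Tuple[str, str, int, datetime]:
    """Recover (org, esdt, collection, as_of) from a locator built above."""
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "NET" or parts[2] != "data":
        raise ValueError(f"not in the assumed layout: {url}")
    _, org, _, esdt, collection, stamp = parts
    return org, esdt, int(collection), datetime.strptime(stamp, STAMP_FORMAT)


url = build_purl("MyOrg", "FOO", 2, datetime(2010, 4, 1, 14, 0, 0))
print(url)  # http://purl.org/NET/MyOrg/data/FOO/2/2010-04-01T14:00:00
print(parse_purl(url))
```

Keeping the collection number and timestamp in the PURL path means the
citation's locator carries the same qualifiers as its text, so a
resolver could in principle redirect to the matching granule set.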

(Though, as Ruth points out, ARKs are nice too and have their own
benefits.)

Curt