[Esip-preserve] [FOO] Dataset Identifiers in federated mirrors

Curt Tilmes Curt.Tilmes at nasa.gov
Wed Oct 20 10:20:05 EDT 2010


On 10/18/10 09:55, alicebarkstrom at frontier.com wrote:
 > One might want to be careful about assuming that there's just one
 > "authentic" archive.  For example, Orbital Sciences could provide a
 > "data buy" to several different customers - each of whom gets the
 > same data.  NOAA gets data from one of these customers and NASA gets
 > data from another.
 >
 > Does this approach preclude federated mirroring?

Ok, let's try to build this into the FOO scenario.

Archive "US" has all the data I've discussed to date, but their
bandwidth has been overwhelmed by the demand for the FOO data products.
They've arranged to set up a mirror with the "THEM" archive.

THEM has surveyed users and determined that the collection 1 data is
obsolete, so they aren't going to bother with it.  They decide to grab
all of the FOOL2.002 data.  They also set up an active mirror,
subscribing to new data for that dataset.  As new granules enter the
US archive, THEM will pull them to their archive as well.

They start their "mirror" process up on 2001-02-01.

For those of you playing along at home, here are the precise granules
they pull as of that date:

FOOL2.002:

FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (from corrupt data)
FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
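
As a rough sketch only (not anything either archive actually runs), the
granule names above can be split into their parts in Python.  The layout
<product>.v<collection>.<sequence>.<uuid> is an assumption inferred from
the names in this thread, and Granule / parse_granule are hypothetical
names introduced for illustration:

```python
import re
from typing import NamedTuple
from uuid import UUID

# Assumed layout, inferred from the granule names listed in this thread:
#   <product>.v<collection>.<sequence>.<uuid>
GRANULE_RE = re.compile(
    r"^(?P<product>[A-Z0-9]+)\.v(?P<collection>\d+)"
    r"\.(?P<seq>\d+)\.(?P<uuid>[0-9a-f-]{36})$"
)


class Granule(NamedTuple):
    product: str      # e.g. "FOOL2"
    collection: int   # e.g. 2 for ".v2"
    seq: int          # the 01..12 field, whatever it denotes
    uuid: UUID        # the per-granule unique identifier


def parse_granule(name: str) -> Granule:
    """Split a granule name into its parts, or raise ValueError."""
    m = GRANULE_RE.match(name)
    if m is None:
        raise ValueError(f"not a granule identifier: {name!r}")
    return Granule(m["product"], int(m["collection"]),
                   int(m["seq"]), UUID(m["uuid"]))


g = parse_granule("FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204")
print(g.product, g.collection, g.seq, g.uuid)
```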


First, let's address the "Dataset Identifier" issue.  This, as you'll
recall, is the term I've chosen for what ECS calls { ESDT+Collection }.
(I don't think we have consensus on that definition yet...)

For a closed, or static dataset, that would refer to a precise set of
granules, but for an open, or dynamic dataset (like the FOO case), it
is a general identifier, but not a precise set of granules.

We have assigned a DOI to that dataset, and it currently points to the
information page at the "US" archive site for that dataset:
doi:10.9999/US/FOOL2.v2.

Now an end user can obtain the physical files for that dataset from
either US or THEM.

I would argue that for a strict mirror (identical physical bits) we
should maintain our DOIs as they are.  The US site can add a link to
their page "mirror available over there!", but we still have 1
dataset, with 2 mirrors, not 2 datasets.

On 2001-02-01 David downloads dataset FOOL2.002 from THEM and writes
his paper.  He includes in his citation the same old DOI as Alice, Bob
and Charlie: doi:10.9999/US/FOOL2.v2.

Someone reading their respective data citations can clearly see that
they are talking about the same data.

As I've described this, the data is created at one place, then
transferred to another.  We could also do the same thing if the data
started at some source and went to multiple "equal" federated
archives.  Assign one dataset identifier, and refer to it from each
archive, and you cite the data identically wherever you got it.  This
is for the same physical bits.
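
One way to picture "1 dataset, with 2 mirrors" is a single record that
carries the DOI once and lists the archives holding the bits.  This is a
minimal sketch under that assumption; the Dataset class and its method
names are hypothetical, not part of any archive's software:

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One dataset identifier, any number of archives holding the same bits."""
    doi: str
    archives: list[str] = field(default_factory=list)

    def add_mirror(self, archive: str) -> None:
        # Adding a mirror changes where the data lives, not how it is cited.
        if archive not in self.archives:
            self.archives.append(archive)

    def citation_doi(self, downloaded_from: str) -> str:
        # The DOI to cite is the same no matter which archive served the files.
        if downloaded_from not in self.archives:
            raise KeyError(f"{downloaded_from} does not hold {self.doi}")
        return self.doi


fool2 = Dataset("doi:10.9999/US/FOOL2.v2", archives=["US"])
fool2.add_mirror("THEM")

# Alice pulls from US, David pulls from THEM; both cite the same DOI.
print(fool2.citation_doi("US"))
print(fool2.citation_doi("THEM"))
```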


HOWEVER, THEM has a bunch of users who desperately want the FOO data,
but find the current format (US Data Format -- UDF) of the data very
awkward.  They ask for THEM to convert it to the THEM Data Format
(TDF), which they are much more familiar with.

THEM complies, establishing a new dataset, "FOOL2TDF.002", which is a
simple reformatting of each granule from "FOOL2.002" into TDF.  They
tag each granule with a new UUID so they will have unique identifiers.
Even though the data have identical scientific data content to the
original data set, they have been transformed.
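
The renaming step might look something like the sketch below.  It only
builds the new granule names and remembers which source granule each one
came from; the actual UDF-to-TDF conversion is not shown, and
reformatted_name is a hypothetical helper:

```python
import uuid


def reformatted_name(source_name: str, new_product: str = "FOOL2TDF") -> str:
    """Build the name of the reformatted granule for a FOOL2 source granule.

    The product changes and a fresh UUID is assigned; the collection and
    sequence fields are carried over unchanged.
    """
    _product, collection, seq, _old_uuid = source_name.split(".", 3)
    return f"{new_product}.{collection}.{seq}.{uuid.uuid4()}"


sources = [
    "FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204",
    "FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419",
]

# Map each new granule back to the granule it was reformatted from, so the
# "transformed from" relationship is recorded rather than lost.
derived_from = {reformatted_name(src): src for src in sources}

for new, old in derived_from.items():
    print(new, "<-", old)
```

Because uuid4() is random, the UUIDs printed here will differ from the
ones listed in this note.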

As of 2001-02-01, they produce these granules:

FOOL2TDF.002:
FOOL2TDF.v2.01.3e53e4b7-bd24-478c-9c7e-5fdc4cfd1003
FOOL2TDF.v2.02.94ab3aca-06f6-4350-b953-ea274aeaa880
FOOL2TDF.v2.03.5e6dbf44-c9de-4c2b-8357-45e718220051
FOOL2TDF.v2.04.bc1e65ed-213b-452f-bb27-f1bdd691e043
FOOL2TDF.v2.05.5bb65101-77c0-4af0-9d57-f1eff3498943
FOOL2TDF.v2.06.91a18132-ac08-4fee-a1fa-c6e79a60d7ce
FOOL2TDF.v2.07.1856d966-4b8c-4611-bf08-b05696c43cf5
FOOL2TDF.v2.08.21170d7f-82b3-4136-bdbf-0f3a45120b57
FOOL2TDF.v2.09.d3ad9556-72a6-44cb-a95d-4237fade2d31
FOOL2TDF.v2.10.ea1ba4e3-8c7c-44a4-8201-acd046db4469 (from corrupt data)
FOOL2TDF.v2.11.1e64fc65-4f3b-4f7d-b27d-7e812f366878
FOOL2TDF.v2.12.bcf108eb-3888-4131-b04b-1c33bbe4737f

Since they want users of that data to cite it directly, they assign a
new DOI for the reformatted dataset, "doi:10.9998/THEM/FOOL2TDF.v2".
Even though the science content of these two datasets is identical, we
still assign distinct DOIs and cite them differently.  (Contentious
point here?)

If you resolve that DOI by going through dx.doi.org, you'll end up on
the THEM dataset information page about "FOOL2TDF.002", which
describes the fact that this data is scientifically equivalent to
FOOL2.002, which is also available within the THEM archive.


Ok, here's something even weirder.  Getting the FOOL2.002 dataset
mirror set up was such a pain that, rather than mirroring the FOOL3.002
dataset, THEM decides "Hey, we've got all the input data here already,
let's just make 'FOOL3.002' ourselves rather than mirroring it from
US".

Recall that as of 2001-02-01, "US.FOOL3.002" (I think I'll start
referring to datasets with the archive name prepended) had in it a
single granule:
     FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768
which had been produced from these granules:
     FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
     FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
     FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
     FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
     FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
     FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
     FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
     FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
     FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
     FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310
     FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
     FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7

So, THEM re-runs that production, using the identical input
files, with the identical algorithm, producing this granule:
FOOL3.v2.01.801b85a4-6658-4ff2-98e2-adaf99f6fb85

Now, (assuming THEM did everything right, and the algorithm was
portable, not sensitive to the environment, they used the right
version of the algorithm, the exact same input files and
parameters, etc.),
the granule in the "US.FOOL3.002" archive:
     FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768
should be totally scientifically equivalent to the one that THEM
just made:
     FOOL3.v2.01.801b85a4-6658-4ff2-98e2-adaf99f6fb85

How should THEM refer to that dataset?  What DOI/citation should
people who get that dataset use?

I think in the "identical case", (US.FOOL2.002 vs. THEM.FOOL2.002)
where the exact physical bits are mirrored and we maintain the exact
same granule identifiers, we should call them "mirrors of the same
dataset", and use 1 DOI.

In the "scientifically equivalent case", (US.FOOL2.002
vs. THEM.FOOL2TDF.002 where the data are scientifically equivalent,
but in a different format, with different granule identifiers), or in
the reproduced dataset (US.FOOL3.002 vs. THEM.FOOL3.002, same format,
but different granule identifiers) we should assign a distinct DOI and
cite the data differently.
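
The rule in the two paragraphs above can be sketched as a small check.
It assumes each archive can produce a map of granule identifier to a
checksum of the file; SHA-256 and the function names are choices made
for this illustration, and the byte strings stand in for real files:

```python
import hashlib


def digest(data: bytes) -> str:
    """Checksum used to compare physical bits (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def is_strict_mirror(origin: dict[str, str], copy: dict[str, str]) -> bool:
    """True if every granule in the copy has the same identifier AND the
    same bits as a granule in the origin.  The copy may lag behind."""
    if not copy:
        return False
    return all(origin.get(gid) == d for gid, d in copy.items())


def doi_for_copy(origin_doi: str, new_doi: str,
                 origin: dict[str, str], copy: dict[str, str]) -> str:
    """Same DOI for a strict mirror, a distinct DOI for anything else."""
    return origin_doi if is_strict_mirror(origin, copy) else new_doi


# Placeholder file contents, keyed by granule identifier.
us_fool3 = {"FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768": digest(b"bits")}
mirrored = dict(us_fool3)  # same identifier, same bits
reproduced = {"FOOL3.v2.01.801b85a4-6658-4ff2-98e2-adaf99f6fb85": digest(b"bits")}

print(doi_for_copy("doi:10.9999/US/FOOL3.v2", "doi:10.9998/THEM/FOOL3.v2",
                   us_fool3, mirrored))
print(doi_for_copy("doi:10.9999/US/FOOL3.v2", "doi:10.9998/THEM/FOOL3.v2",
                   us_fool3, reproduced))
```

Note that the reproduced granule gets a distinct DOI even if its bits
happen to match, because its granule identifier differs.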

So, in our strawman, THEM makes a new dataset:

THEM.FOOL3.002, that, as of 2001-02-01 has a single granule:
     FOOL3.v2.01.801b85a4-6658-4ff2-98e2-adaf99f6fb85

They assign a new DOI, doi:10.9998/THEM/FOOL3.v2, that resolves to the THEM
dataset information page for THEM.FOOL3.002.  That information page
(in human readable form, and in machine readable RDF (semantic web))
includes the assertion that this data is scientifically equivalent to
the data in doi:10.9999/US/FOOL3.v2.
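
For the machine readable side, the assertion could be as small as one
RDF triple.  The sketch below writes it in N-Triples syntax with plain
string formatting.  The predicate URI is purely hypothetical (no
vocabulary has been agreed on), and using dx.doi.org URLs as the
resource names is likewise an assumption:

```python
# Hypothetical predicate; no agreed vocabulary exists for this relation.
EQUIV = "http://example.org/preserve#scientificallyEquivalentTo"


def doi_uri(doi: str) -> str:
    """Turn 'doi:10.9999/X' into a resolver URL usable as an RDF resource."""
    return "http://dx.doi.org/" + doi.removeprefix("doi:")


def equivalence_triple(subject_doi: str, object_doi: str) -> str:
    """One N-Triples line asserting that subject is equivalent to object."""
    return f"<{doi_uri(subject_doi)}> <{EQUIV}> <{doi_uri(object_doi)}> ."


print(equivalence_triple("doi:10.9998/THEM/FOOL3.v2",
                         "doi:10.9999/US/FOOL3.v2"))
```

The triple is emitted in one direction only, since it is THEM making the
assertion about their own dataset.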


Now if Alice cites doi:10.9999/US/FOOL3.v2, and David cites
doi:10.9998/THEM/FOOL3.v2, even though these data are scientifically
equivalent, they have distinct identifiers and someone looking at the
two citations couldn't determine automatically that they were
referencing scientifically equivalent data.


Note: In my example, I've included the "FOOL3.v2" part as the local
differentiator part of the DOI, but that isn't really a requirement of
DOIs, and perhaps not even desirable.  Even the "US" / "THEM" part
could be misleading since even if you downloaded data from the THEM
mirror of FOOL2.002, you would still cite it with the common
doi:10.9999/US/FOOL2.v2.  We want to allow any dataset to move to any
archive without ever changing that dataset identifier.  Some argue for
that reason you shouldn't try to make them meaningful as I have done
here.  It might be better for example, that the DOI be something like
"doi:10.9999/725" or "doi:10.9999/au5ah7".  Kunze [1] describes a
scheme for the CDL that carefully avoids putting any semantics into
the identifier.  Others have argued that including semantics can be
useful at times.
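
To make the opaque option concrete, here is a minimal sketch of minting
a meaningless suffix and keeping the "where does it live now" knowledge
in a lookup table instead of in the identifier.  This is not Kunze's
scheme; the alphabet, the length, the mint function and the example.org
URLs are all assumptions for illustration:

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def mint(prefix: str = "10.9999", length: int = 6) -> str:
    """Mint a DOI whose suffix carries no meaning at all."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"doi:{prefix}/{suffix}"


# The resolver table, not the identifier, records the current home.
registry: dict[str, str] = {}

fool2_doi = mint()
registry[fool2_doi] = "http://us.example.org/FOOL2.002"

# The dataset moves to another archive; the identifier never changes.
registry[fool2_doi] = "http://them.example.org/FOOL2.002"

print(fool2_doi, "->", registry[fool2_doi])
```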


I could see a case for including an identifier for the original
"scientific origin" of the data in the citation, but I think that will
lead to more and more complicated citations.  More complicated
citations are a burden on the users, where I'd rather keep the burden
on the archive, since I think that will lead to more likely compliance
with our recommendations.

Suppose we wanted to do it anyway.

Let's say Alice downloads US.FOOL3.002 and cites it like this:

     "... doi:10.9999/US/FOOL3.v2 ..."

and David downloads THEM.FOOL3.002 and cites it like this:

     "... doi:10.9998/THEM/FOOL3.v2, SE:doi:10.9999/US/FOOL3.v2 ..."

to indicate that the dataset he downloaded was scientifically
equivalent to some other dataset.  Just seems awkward to me, but you
could look at the two citations and see that the datasets were
asserted (not proven) to be scientifically equivalent.
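
If citations did carry the "SE:" form, comparing two of them could be
automated along these lines.  The regular expression is a simplification
that assumes a DOI ends at whitespace, a comma or a quote, and the
function names are hypothetical:

```python
import re

# Simplified: a DOI runs until whitespace, a comma or a double quote.
DOI_RE = re.compile(r'(SE:)?doi:(10\.\d+/[^\s,"]+)')


def cited_dois(citation: str) -> tuple[set[str], set[str]]:
    """Return (DOIs cited directly, DOIs cited as scientifically equivalent)."""
    own: set[str] = set()
    equivalent: set[str] = set()
    for se_flag, doi in DOI_RE.findall(citation):
        (equivalent if se_flag else own).add(doi)
    return own, equivalent


def asserted_equivalent(citation_a: str, citation_b: str) -> bool:
    """True if the two citations share a DOI, directly or via an SE: entry.

    This reflects what the citations assert, not a proof of equivalence.
    """
    own_a, se_a = cited_dois(citation_a)
    own_b, se_b = cited_dois(citation_b)
    return bool((own_a | se_a) & (own_b | se_b))


alice = "... doi:10.9999/US/FOOL3.v2 ..."
david = "... doi:10.9998/THEM/FOOL3.v2, SE:doi:10.9999/US/FOOL3.v2 ..."
david_plain = "... doi:10.9998/THEM/FOOL3.v2 ..."

print(asserted_equivalent(alice, david))
print(asserted_equivalent(alice, david_plain))
```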


Ok, that was "Dataset Identifiers".  Next note will address
"DatasetInstance Identifiers".

Curt

