[Esip-preserve] [FOO] Dataset Identifiers in federated mirrors

Wed Oct 20 11:20:37 EDT 2010

Bravo!  Great work based on a concrete example.

Your note on prepending the archive is one that I believe
Ruth and I disagreed with, since I used a field for that
purpose in a specific naming convention using OIDs.
It's in a discussion that's on the ESIP web site for
my work with a considerably more complex production
scenario [I don't remember the exact URL - although
I've mentioned it in previous e-mail traffic.  Something
involving oceandis.]

I think the approach you're taking suggests that tracking
scientifically identical data collections might be well
served by using an approach similar to the way we transfer
property.  Each piece of property (think an identifiable
data collection) is identified in a survey that is referenced
in a "deed" (think "identifier").  When we want to reference
a scientifically identical property, we create a new map
from a new survey.  The mapping is verified by a survey.
This makes the registration authority responsible for
maintaining the maps that transfer one record of the
property to another - with an auditable (empirically
verifiable) record.  This is a much more robust way
of ensuring transfers are valid than merely relying on
the assurance of the authority.  With only that assurance,
the whole scheme is quite susceptible to "man in the
middle" attacks.  [Imagine that a rich, right (or left) wing extremist
bought out your favorite registrar and hired "true believer"
programmers to modify the data and send users off to
strange sites.]
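
The survey-and-deed idea above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the Registry class
and its method names are hypothetical, and a SHA-256 digest stands in for
the "survey".  A bit-level digest can only confirm identical bits; checking
scientific equivalence after a reformatting would need a survey of the
content itself, which this sketch does not attempt.

```python
import hashlib


def survey(data: bytes) -> str:
    """The 'survey': a fingerprint anyone can recompute from the bits."""
    return hashlib.sha256(data).hexdigest()


class Registry:
    """Hypothetical registration authority keeping deeds and transfer maps."""

    def __init__(self):
        self.deeds = {}      # identifier -> digest recorded at registration
        self.transfers = []  # audit trail of (old_id, new_id, digest)

    def register(self, identifier: str, data: bytes) -> None:
        self.deeds[identifier] = survey(data)

    def record_transfer(self, old_id: str, new_id: str, new_data: bytes) -> bool:
        """Map old_id to new_id only if a fresh survey matches the deed."""
        digest = survey(new_data)
        if self.deeds.get(old_id) != digest:
            return False  # the new copy does not match the recorded survey
        self.deeds[new_id] = digest
        self.transfers.append((old_id, new_id, digest))
        return True

    def audit(self, identifier: str, data: bytes) -> bool:
        """Let any user check a copy without relying on the registry's word."""
        return self.deeds.get(identifier) == survey(data)
```

The point of the design is that audit() can be run by the data user, so a
compromised registrar that redirects users to altered data would be caught
by anyone who recomputes the survey.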

I am including these kinds of thoughts in the paper I'm
finishing.  I think it might be wise to separate data citations
intended to allow data users to replicate results from citations
intended to give credit for contributions to developing data
collections.  I've got one case from attempts to determine
whether there are trends in the solar constant that has three
different instrument models and uses data from more than
half a dozen instruments.  It appears that the differences between
the three reconstructions of the solar constant involve selections
that use individual days - and there are a lot of those in a thirty-year
record.

The other case of providing credit to individuals based on their
contributions to developing data collections is, perhaps, more
difficult.  One contributing factor is IPR.  Another is the strong
probability of the use of proprietary software in instantiating the
algorithms into code (which may become an issue in the case
of ad hoc production using Open Source or proprietary tools).
As an example, I've been wondering who we would cite in the
case of the Hurricane Ike Damage Assessment image collection,
in which the images themselves are now in the public domain
(nobody owns them), while the software that creates the images
from the camera is probably the property of the company that
NOAA contracted to take the images.

The next issue to tackle in your exploration is probably the
one about subdividing the collections in your example into
Data Sets (assume a second instrument launch whose data
collection overlaps the first - and then add an instrument failure
that adds a gap in the record, followed by transferring the
instrument and data production to another agency) - with the
Data Sets having Versions whose uncertainties are different.

At any rate, keep up the exploration.

Bruce B.

On Wed, Oct 20, 2010 at 10:20 AM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> On 10/18/10 09:55, alicebarkstrom at frontier.com wrote:
> > One might want to be careful about assuming that there's just one
> > "authentic" archive.  For example, Orbital Sciences could provide a
> > "data buy" to several different customers - each of whom gets the
> > same data.  NOAA gets data from one of these customers and NASA gets
> > data from another.
> >
> > Does this approach preclude federated mirroring?
>
> Ok, Let's try to build this into the FOO scenario.
>
> Archive "US" has all the data I've discussed to date, but their
> bandwidth has been overwhelmed by the demand for the FOO data products.
> They've arranged to set up a mirror with the "THEM" archive.
>
> THEM has surveyed users and determined that the collection 1 data is
> obsolete, so they aren't going to bother with it.  They decide to grab
> all of the FOOL2.002 data.  They also set up an active mirror,
> subscribing to new data for that dataset.  As new granules enter the
> US archive, THEM will pull them to their archive as well.
>
> They start their "mirror" process up on 2001-02-01.
>
> For those of you playing along at home, here are the precise granules
> they pull as of that date:
>
> FOOL2.002:
>
> FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
> FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
> FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
> FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
> FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
> FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
> FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
> FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
> FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
> FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (from corrupt data)
> FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
> FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
>
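The granule names listed above follow a regular pattern, so a tool could
split them into their parts.  A minimal Python sketch, with assumptions: the
field names are invented for illustration, the third field is treated as a
sequence number (its meaning is not stated in the example), and the mapping
from "v2" to the ".002" collection label is inferred from the example names.

```python
import re
import uuid
from typing import NamedTuple


class Granule(NamedTuple):
    product: str           # e.g. "FOOL2"
    version: int           # e.g. 2, taken from "v2"
    sequence: int          # e.g. 1, taken from "01" (meaning assumed)
    granule_id: uuid.UUID  # the per-granule UUID


_PATTERN = re.compile(
    r"^(?P<product>[A-Z0-9]+)\.v(?P<version>\d+)\.(?P<seq>\d+)\.(?P<uuid>[0-9a-f-]{36})$"
)


def parse_granule(name: str) -> Granule:
    """Split a granule name such as 'FOOL2.v2.01.<uuid>' into its fields."""
    m = _PATTERN.match(name.strip())
    if m is None:
        raise ValueError(f"not a granule name: {name!r}")
    return Granule(m["product"], int(m["version"]), int(m["seq"]), uuid.UUID(m["uuid"]))


def dataset_label(g: Granule) -> str:
    """Return the ESDT+Collection style label, e.g. 'FOOL2.002'."""
    return f"{g.product}.{g.version:03d}"
```

For example, parse_granule("FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204")
yields product "FOOL2", version 2, sequence 1, and dataset_label() of that
result gives "FOOL2.002".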
>
> First, let's address the "Dataset Identifier" issue.  This, as you'll
> recall, is the term I've chosen for what ECS calls { ESDT+Collection }.
> (I don't think we have consensus on that definition yet...)
>
> For a closed, or static dataset, that would refer to a precise set of
> granules, but for an open, or dynamic dataset (like the FOO case), it
> is a general identifier, but not a precise set of granules.
>
> We have assigned a DOI to that dataset, and it currently points to the
> information page at the "US" archive site for that dataset:
> doi:10.9999/US/FOOL2.v2.
>
> Now an end user can obtain the physical files for that dataset from
> either US or THEM.
>
> I would argue that for a strict mirror (identical physical bits) we
> should maintain our DOIs as they are.  The US site can add a link to
> their page "mirror available over there!", but we still have 1
> dataset, with 2 mirrors, not 2 datasets.
>
> On 2001-02-01 David downloads dataset FOOL2.002 from THEM and writes
> his paper.  He includes in his citation the same old DOI as Alice, Bob
> and Charlie: doi:10.9999/US/FOOL2.v2.
>
> Someone reading their respective data citations can clearly see that
> they are talking about the same data.
>
> As I've described this, the data is created at one place, then
> transferred to another.  We could also do the same thing if the data
> started at some source and went to multiple "equal" federated
> archives.  Assign one dataset identifier, and refer to it from each
> archive, and you cite the data identically wherever you got it.  This
> is for the same physical bits.
>
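The strict-mirror rule above (one dataset, several mirrors, one DOI) can be
sketched as a small record type.  The Dataset class, its field names, and the
example.org URLs are hypothetical, chosen only to illustrate that the
citation does not depend on which archive supplied the files.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    doi: str
    landing_page: str  # where the DOI resolves (hypothetical URL)
    mirrors: dict = field(default_factory=dict)  # archive name -> download URL

    def add_mirror(self, archive: str, url: str) -> None:
        self.mirrors[archive] = url

    def cite(self, downloaded_from: str) -> str:
        """Return the citation identifier; it is the same for every mirror."""
        if downloaded_from not in self.mirrors:
            raise KeyError(f"{downloaded_from} does not hold this dataset")
        return f"doi:{self.doi}"


fool2 = Dataset("10.9999/US/FOOL2.v2", "https://us.example.org/FOOL2.002")
fool2.add_mirror("US", "https://us.example.org/data/FOOL2.002/")
fool2.add_mirror("THEM", "https://them.example.org/data/FOOL2.002/")
```

Here fool2.cite("US") and fool2.cite("THEM") return the same string, so a
reader comparing two citations sees one dataset.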
>
> HOWEVER, THEM has a bunch of users who desperately want the FOO data,
> but find the current format (US Data Format -- UDF) of the data very
> awkward.  They ask THEM to convert it to the THEM Data Format (TDF),
> with which they are much more familiar.
>
> THEM complies, establishing a new dataset, "FOOL2TDF.002", which is a
> simple reformatting of each granule from "FOOL2.002" into TDF.  They
> tag each granule with a new UUID so they will have unique identifiers.
> Even though the data have identical scientific data content to the
> original data set, they have been transformed.
>
> As of 2001-02-01, they produce these granules:
>
> FOOL2TDF.002:
> FOOL2TDF.v2.01.3e53e4b7-bd24-478c-9c7e-5fdc4cfd1003
> FOOL2TDF.v2.02.94ab3aca-06f6-4350-b953-ea274aeaa880
> FOOL2TDF.v2.03.5e6dbf44-c9de-4c2b-8357-45e718220051
> FOOL2TDF.v2.04.bc1e65ed-213b-452f-bb27-f1bdd691e043
> FOOL2TDF.v2.05.5bb65101-77c0-4af0-9d57-f1eff3498943
> FOOL2TDF.v2.06.91a18132-ac08-4fee-a1fa-c6e79a60d7ce
> FOOL2TDF.v2.07.1856d966-4b8c-4611-bf08-b05696c43cf5
> FOOL2TDF.v2.08.21170d7f-82b3-4136-bdbf-0f3a45120b57
> FOOL2TDF.v2.09.d3ad9556-72a6-44cb-a95d-4237fade2d31
> FOOL2TDF.v2.10.ea1ba4e3-8c7c-44a4-8201-acd046db4469 (from corrupt data)
> FOOL2TDF.v2.11.1e64fc65-4f3b-4f7d-b27d-7e812f366878
> FOOL2TDF.v2.12.bcf108eb-3888-4131-b04b-1c33bbe4737f
>
> Since they want users of that data to cite it directly, they assign a
> new DOI for the reformatted dataset, "doi:10.9998/THEM/FOOL2TDF.v2".
> Even though the science content of these two datasets is identical, we
> still assign distinct DOIs and cite them differently.  (Contentious
> point here?)
>
> If you resolve that DOI by going through dx.doi.org, you'll end up on
> the THEM dataset information page about "FOOL2TDF.002", which
> describes the fact that this data is scientifically equivalent to
> FOOL2.002, which is also available within the THEM archive.
>
>
> Ok, here's something even weirder.  Getting the FOOL2.002 dataset
> mirror set up was such a pain that, rather than mirroring the FOOL3.002
> dataset, THEM decides "Hey, we've got all the input data here already,
> let's just make 'FOOL3.002' ourselves rather than mirroring it from
> US".
>
> Recall that as of 2001-02-01, "US.FOOL3.002" (I think I'll start
> referring to datasets with the archive name prepended) had in it a
> single granule:
>    FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768
> which had been produced from these granules:
>    FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
>    FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
>    FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
>    FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
>    FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
>    FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
>    FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
>    FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
>    FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
>    FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310
>    FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
>    FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
>
> So, THEM re-runs that production, using the identical input
> files, with the identical algorithm, producing this granule:
> FOOL3.v2.01.801b85a4-6658-4ff2-98e2-adaf99f6fb85
>
> Now, (assuming THEM did everything right, and the algorithm was
> portable, not sensitive to the environment, they used the right
> version of the algorithm, the exact same input files and
> parameters, etc.),
> the granule in the "US.FOOL3.002" archive:
>    FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768
> should be totally scientifically equivalent to the one that THEM
> just made:
>    FOOL3.v2.01.801b85a4-6658-4ff2-98e2-adaf99f6fb85
>
> How should THEM refer to that dataset?  What DOI/citation should
> people who get that dataset use?
>
> I think in the "identical case", (US.FOOL2.002 vs. THEM.FOOL2.002)
> where the exact physical bits are mirrored and we maintain the exact
> same granule identifiers, we should call them "mirrors of the same
> dataset", and use 1 DOI.
>
> In the "scientifically equivalent case", (US.FOOL2.002
> vs. THEM.FOOL2TDF.002 where the data are scientifically equivalent,
> but in a different format, with different granule identifiers), or in
> the reproduced dataset (US.FOOL3.002 vs. THEM.FOOL3.002, same format,
> but different granule identifiers) we should assign a distinct DOI and
> cite the data differently.
>
> So, in our strawman, THEM makes a new dataset:
>
> THEM.FOOL3.002, that, as of 2001-02-01 has a single granule:
>    FOOL3.v2.01.801b85a4-6658-4ff2-98e2-adaf99f6fb85
>
> They assign a new DOI, doi:10.9998/THEM/FOOL3.v2, that resolves to the THEM
> dataset information page for THEM.FOOL3.002.  That information page
> (in human readable form, and in machine readable RDF (semantic web))
> includes the assertion that this data is scientifically equivalent to
> the data in doi:10.9999/US/FOOL3.v2.
>
>
> Now if Alice cites doi:10.9999/US/FOOL3.v2, and David cites
> doi:10.9998/THEM/FOOL3.v2, even though these data are scientifically
> equivalent, they have distinct identifiers and someone looking at the
> two citations couldn't determine automatically that they were
> referencing scientifically equivalent data.
>
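The citations alone do not carry the link, but the equivalence assertion on
the information page does.  A sketch of how a tool might use it, assuming
the assertions have already been harvested into (subject, predicate, object)
tuples.  The predicate name "scientificallyEquivalentTo" is hypothetical, and
treating the assertion as symmetric and transitive is an assumption of this
sketch, not something the scenario specifies.

```python
from collections import defaultdict, deque

SE = "scientificallyEquivalentTo"  # hypothetical predicate name


def asserted_equivalent(triples, a: str, b: str) -> bool:
    """True if a chain of SE assertions links a and b (treated as symmetric)."""
    graph = defaultdict(set)
    for subj, pred, obj in triples:
        if pred == SE:
            graph[subj].add(obj)
            graph[obj].add(subj)
    # Breadth-first search from a, following assertions in either direction.
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


triples = [
    ("doi:10.9998/THEM/FOOL3.v2", SE, "doi:10.9999/US/FOOL3.v2"),
    ("doi:10.9998/THEM/FOOL2TDF.v2", SE, "doi:10.9999/US/FOOL2.v2"),
]
```

With these triples, the US and THEM FOOL3 DOIs are reported as asserted
equivalent, while the US FOOL3 and US FOOL2 DOIs are not.  The result is
only as good as the assertions: it shows what was claimed, not what was
verified.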
>
> Note: In my example, I've included the "FOOL3.v2" part as the local
> differentiator part of the DOI, but that isn't really a requirement of
> DOIs, and perhaps not even desirable.  Even the "US" / "THEM" part
> could be misleading since even if you downloaded data from the THEM
> mirror of FOOL2.002, you would still cite it with the common
> doi:10.9999/US/FOOL2.v2.  We want to allow any dataset to move to any
> archive without ever changing that dataset identifier.  Some argue for
> that reason you shouldn't try to make them meaningful as I have done
> here.  It might be better for example, that the DOI be something like
> "doi:10.9999/725" or "doi:10.9999/au5ah7".  Kunze [1] describes a
> scheme for the CDL that carefully avoids putting any semantics into
> the identifier.  Others have argued that including semantics can be
> useful at times.
>
>
> I could see a case for including an identifier for the original
> "scientific origin" of the data in the citation, but I think that will
> lead to more and more complicated citations.  More complicated
> citations are a burden on the users, where I'd rather keep the burden
> on the archive, since I think that will lead to more likely compliance
> with our recommendations.
>
> Suppose we wanted to do it anyway.
>
> Let's say Alice downloads US.FOOL3.002 and cites it like this:
>
>    "... doi:10.9999/US/FOOL3.v2 ..."
>
> and David downloads THEM.FOOL3.002 and cites it like this:
>
>    "... doi:10.9998/THEM/FOOL3.v2, SE:doi:10.9999/US/FOOL3.v2 ..."
>
> to indicate that the dataset he downloaded was scientifically
> equivalent to some other dataset.  Just seems awkward to me, but you
> could look at the two citations and see that the datasets were
> asserted (not proven) to be scientifically equivalent.
>
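Comparing two citations written in the "SE:" style above could look like the
following Python sketch.  The function names and the regular expression are
assumptions made for illustration; the "SE:" qualifier syntax is taken
directly from the example citations, and a DOI followed immediately by a
period would need extra handling that is not shown.

```python
import re

_DOI = r"doi:10\.\d+/[^\s,]+"
_CITATION = re.compile(rf"(?P<primary>{_DOI})(?:,\s*SE:(?P<se>{_DOI}))?")


def cited_identifiers(citation: str) -> set:
    """Return the DOI cited plus any DOI named in an SE: qualifier."""
    m = _CITATION.search(citation)
    if m is None:
        raise ValueError("no DOI found in citation")
    return {d for d in (m["primary"], m["se"]) if d}


def same_science(citation_a: str, citation_b: str) -> bool:
    """True if the citations share a DOI directly or through an SE: qualifier."""
    return bool(cited_identifiers(citation_a) & cited_identifiers(citation_b))


alice = "... doi:10.9999/US/FOOL3.v2 ..."
david = "... doi:10.9998/THEM/FOOL3.v2, SE:doi:10.9999/US/FOOL3.v2 ..."
```

same_science(alice, david) is true because David's SE: qualifier names the
DOI that Alice cites.  As with the citations themselves, this only detects
an asserted equivalence.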
>
> Ok, that was "Dataset Identifiers".  Next note will address
> "DatasetInstance Identifiers".
>
> Curt