[Esip-preserve] ESIP Citation Guidelines

Curt Tilmes Curt.Tilmes at nasa.gov
Mon Oct 18 09:22:54 EDT 2010


On 10/15/10 17:02, Ruth Duerr wrote:
> On the other hand, I do believe that Curt's definition of an ESDT +
> Collection (i.e., dataset along with version) as the object that is
> assigned a unique DOI is sound, especially when accompanied by a
> date of access; but that later bit drifts off of the identifier
> discussion to citation guidelines - two different discussions
> (though I agree identifiers are foundational to both).

Recall my other message pointing out that while those three pieces of
information (ESDT+Collection+DateTimeStamp) are sufficient to
determine precise granule membership, they don't represent a unique
identifier for that granule aggregation (i.e. datasets), since all the
timestamps between two granule changes (add/delete) map to the same
granule set.

I think we are looking for an identifier (and citation) that provides
two properties:

1. The identifier can be used to determine precise granule membership.

2. Two identifiers can be compared to determine if they are
    referencing the same granule set.

(Again setting aside the issues of "equivalent" granules -- let's just
address the specific granules in the authoritative archive.)

Let's start by assuming we have a unique identifier for each granule.
The identifiers paper recommends using UUIDs for that (and my FOO case
study follows that recommendation).  There are some other approaches,
but I like the UUID recommendation.  What I describe below works
equally well with any other scheme, as long as it produces unique
identifiers.

I will propose here a specific scheme for computing an identifier for
data set membership in a way that provides both properties above.
Similar to the UUID case, there are many other ways to do so.  I think
the important point is not that every data center compute them the
same way, with the same algorithm, but that they each provide
*something* that preserves those properties for their users.  I think
the identifier (though I will describe a possible way to compute one)
can be considered totally opaque from the user's perspective,
retaining meaning only to the archive.  The user simply cut/pastes the
identifier (or hopefully the whole citation) from the archive's web
site.  They don't have to know or care how it was produced.

Here is a possible scheme, from the FOO example:
https://spreadsheets.google.com/ccc?key=0AjtPCL0EZx_3dHVEWUxudC1FZVRGbHdPOW44UWRXTnc

Taking the "FOOL2.002" dataset, doi:10.9999/US/FOOL2.v2
We have these granules, with the date they were added to the database,
and the date they were removed (this looks best with a fixed-width
font):

LocalgranuleID                                    IngestDateTime  Deleted
------------------------------------------------  --------------  ----------
FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204  2001-01-02
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419  2001-01-02
FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e  2001-01-02
FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7  2001-01-02
FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14  2001-01-02
FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233  2001-01-02
FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e  2001-01-02
FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc  2001-01-02
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57  2001-01-02
FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310  2001-01-02      2001-03-01
FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80  2001-01-02
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7  2001-01-03
FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb  2001-02-03
FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc  2001-03-03
FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171  2001-03-03

Assume we have an ACID[1] database -- All those granules with an
"IngestDateTime" of 2001-01-02 are not there prior to instant
"2001-01-02", but as of the instant "2001-01-02" they are all there.

Define a digital signature scheme something like this: The data set as
of an instant is identified by a hash.  Start with the MD5 hash of the
dataset prior to that instant (or nothing, for the first instant),
append an asciibetically sorted list of the localgranuleids added at
that instant, then the localgranuleids removed at that instant, each
prefixed with "-".  Separate all of these with newlines and take the
MD5 (or any other) hash of the resulting text.
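
As a rough sketch of one step of that scheme in Python (the function
name next_identifier is just for illustration, and the exact byte
layout is my assumption here: ASCII text, no trailing newline, removed
IDs also sorted), so the digests it prints need not match the ones
listed in this message:

```python
import hashlib


def next_identifier(previous, added, removed):
    """Return the dataset identifier after one change instant.

    previous: hex digest in effect before the change, or None if the
              dataset did not exist yet.
    added, removed: local granule IDs added / removed at this instant.
    """
    lines = []
    if previous:
        lines.append(previous)
    # sorted() on ASCII strings gives the "asciibetical" canonical order.
    lines.extend(sorted(added))
    # Assumption: removals are sorted too, so the text is canonical.
    lines.extend("-" + gid for gid in sorted(removed))
    text = "\n".join(lines)
    return hashlib.md5(text.encode("ascii")).hexdigest()


first = next_identifier(
    None,
    [
        "FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419",
        "FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204",
    ],
    [],
)
second = next_identifier(
    first, ["FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7"], []
)
print(first)
print(second)
```

Because the added IDs are sorted before hashing, the order in which
the archive happens to list them does not affect the result.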

So, for time "2001-01-02", we add these granules:
FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310
FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80

and get hash: 3718fb5714e5e4da709dfc230286236c

At time "2001-01-03", we add one new granule,
3718fb5714e5e4da709dfc230286236c
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7

and get hash: 124926a96f2fb6b8176608a28baa714b

At time "2001-02-03", we add one new granule,
124926a96f2fb6b8176608a28baa714b
FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb

and get hash: 93eaef81f4db7ab28e3980add13c9e77

At time "2001-03-01", we remove one granule,
93eaef81f4db7ab28e3980add13c9e77
-FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310

and get hash: f21d39e77e7ccf12493d5a432b2660c4

At time "2001-03-03", we add two new granules,
f21d39e77e7ccf12493d5a432b2660c4
FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc
FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171

and get hash: 212ee7065d20defb1b92ecd92361c92c

These hashes can be stored in a "dataset table" something like this:

Timestamp   DatasetIdentifier
2001-01-02  3718fb5714e5e4da709dfc230286236c
2001-01-03  124926a96f2fb6b8176608a28baa714b
2001-02-03  93eaef81f4db7ab28e3980add13c9e77
2001-03-01  f21d39e77e7ccf12493d5a432b2660c4
2001-03-03  212ee7065d20defb1b92ecd92361c92c

Since we only altered the dataset at 5 times, we have only 5 distinct
sets of granules we are trying to identify, so we have 5 distinct
DatasetIdentifiers.
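
A sketch of how such a table might be built from the granule table, in
Python.  The row layout, the name build_dataset_table and the byte
layout of the hashed text are assumptions for illustration, and only a
subset of the FOOL2.v2 granules is included, so the digests will differ
from those above:

```python
import hashlib
from collections import defaultdict

# (LocalgranuleID, IngestDateTime, Deleted) rows; None means not deleted.
GRANULES = [
    ("FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310", "2001-01-02", "2001-03-01"),
    ("FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80", "2001-01-02", None),
    ("FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7", "2001-01-03", None),
    ("FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb", "2001-02-03", None),
    ("FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc", "2001-03-03", None),
    ("FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171", "2001-03-03", None),
]


def build_dataset_table(granules):
    """Return [(timestamp, identifier)], one row per change instant."""
    added = defaultdict(list)
    removed = defaultdict(list)
    for gid, ingested, deleted in granules:
        added[ingested].append(gid)
        if deleted is not None:
            removed[deleted].append(gid)

    table = []
    previous = None
    # ISO dates sort chronologically when sorted as strings.
    for stamp in sorted(set(added) | set(removed)):
        lines = [previous] if previous else []
        lines += sorted(added[stamp])
        lines += ["-" + gid for gid in sorted(removed[stamp])]
        previous = hashlib.md5("\n".join(lines).encode("ascii")).hexdigest()
        table.append((stamp, previous))
    return table


for stamp, identifier in build_dataset_table(GRANULES):
    print(stamp, identifier)
```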

Note: I add the newlines and "-" more for people than for mathematics.
This algorithm would perform acceptably -- preserving the properties
above -- if we just used the raw granuleids.  We just need some chunk
of bits that are unique for each operation, adding/removing granules
from the set.  The canonical order is required, so that the same change
always hashes to the same value.  I also prefer to keep a "running
total" of the digital signature, hashing in the previous value rather
than listing every granule.  Listing every granule is perhaps more
intuitive, but then every update costs O(n) in the number of granules
in the dataset, which I prefer to avoid.

The scheme as I describe it is very easy to process during any dataset
manipulation (adding/removing granules) and just keep up to date.

Going back to my previous example:

>Consider two researchers who download all the granules from
>FOOL2.002, one on the date 2001-01-04, and one on the date
>2001-01-05.  They will get an identical list of granules.  (Since the
>next granule didn't get added until 2001-02-03).
>
> If one cites the dataset with { doi:10.9999/US/FOOL2.v2,
>"2001-01-04" } and the other cites the dataset with {
>doi:10.9999/US/FOOL2.v2, "2001-01-05" }, they have different
>information included in their citations even though in reality they
>used an identical set of granules.

Now, when they download the dataset, they also cut/paste the dataset
identifier from the archive's web page.

Alice's cite includes: doi:10.9999/US/FOOL2.v2, 2001-01-04,
dataset:124926a96f2fb6b8176608a28baa714b

and Bob's cite includes doi:10.9999/US/FOOL2.v2, 2001-01-05,
dataset:124926a96f2fb6b8176608a28baa714b

Now, we can see the date they downloaded the dataset, but can also
clearly determine that they used the exact same set of granules.
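
On the archive side, picking the identifier to show a user could be as
simple as finding the latest change at or before the access date.  A
sketch in Python, using the dataset table from above (the function name
identifier_as_of is hypothetical):

```python
from bisect import bisect_right

# The "dataset table" from this message, in timestamp order.
DATASET_TABLE = [
    ("2001-01-02", "3718fb5714e5e4da709dfc230286236c"),
    ("2001-01-03", "124926a96f2fb6b8176608a28baa714b"),
    ("2001-02-03", "93eaef81f4db7ab28e3980add13c9e77"),
    ("2001-03-01", "f21d39e77e7ccf12493d5a432b2660c4"),
    ("2001-03-03", "212ee7065d20defb1b92ecd92361c92c"),
]


def identifier_as_of(table, access_date):
    """Return the identifier in effect on access_date, or None if the
    dataset did not exist yet."""
    stamps = [stamp for stamp, _ in table]
    # Index of the first change strictly after access_date.
    i = bisect_right(stamps, access_date)
    if i == 0:
        return None
    return table[i - 1][1]


alice = identifier_as_of(DATASET_TABLE, "2001-01-04")
bob = identifier_as_of(DATASET_TABLE, "2001-01-05")
print(alice)
print(bob)
print(alice == bob)
```

Every access date between two changes maps to the same identifier,
which is the second property listed at the top of this message.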

Continuing to another question: "If the dataset identifiers are
different, how different are they?"

Suppose Charlie downloaded this dataset on "2001-03-02".  His cite
includes: doi:10.9999/US/FOOL2.v2, 2001-03-02,
dataset:f21d39e77e7ccf12493d5a432b2660c4.

We can look at his citation compared to Alice/Bob and see that they
are clearly different datasets.

Using the two tables above, it is easy to work backwards from the
dataset identifier to the date of the change (well, easy also since
they've actually included the dates in the citation, but I would still
rely on the dataset identifier over the date since it has less chance
of being wrong).

You can determine the granule membership of a dataset by querying the
granule database for granules ingested on or before the dataset
timestamp and either not deleted, or deleted after that timestamp.
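
That query might look something like this in Python.  I am treating a
change as taking effect at its timestamp (per the ACID assumption
above), so a granule counts as a member if it was ingested at or before
the dataset timestamp and not deleted at or before it.  The row layout
and the name members_as_of are assumptions, and only a subset of the
granules is listed:

```python
# (LocalgranuleID, IngestDateTime, Deleted) rows; None means not deleted.
GRANULES = [
    ("FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310", "2001-01-02", "2001-03-01"),
    ("FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80", "2001-01-02", None),
    ("FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7", "2001-01-03", None),
    ("FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb", "2001-02-03", None),
    ("FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc", "2001-03-03", None),
    ("FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171", "2001-03-03", None),
]


def members_as_of(granules, stamp):
    """Return the sorted granule IDs in the dataset as of stamp."""
    return sorted(
        gid
        for gid, ingested, deleted in granules
        # ISO date strings compare chronologically.
        if ingested <= stamp and (deleted is None or deleted > stamp)
    )


for gid in members_as_of(GRANULES, "2001-03-01"):
    print(gid)
```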

Charlie's dataset identifier maps to date "2001-03-01" and Alice's
dataset identifier maps to date "2001-01-03".  The difference between
the datasets is equivalent to a query on the granule table for the
changes between those dates.  You can easily see granule additions,
granule removals and granule updates.
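
A sketch of that comparison in Python: look up each identifier in the
dataset table to get its timestamp, compute the membership at each
timestamp, and take the set differences.  The helper names and row
layout are assumptions, and only a subset of the granules is listed:

```python
# Dataset identifier -> timestamp, from the dataset table above.
TIMESTAMP_OF = {
    "3718fb5714e5e4da709dfc230286236c": "2001-01-02",
    "124926a96f2fb6b8176608a28baa714b": "2001-01-03",
    "93eaef81f4db7ab28e3980add13c9e77": "2001-02-03",
    "f21d39e77e7ccf12493d5a432b2660c4": "2001-03-01",
    "212ee7065d20defb1b92ecd92361c92c": "2001-03-03",
}

# (LocalgranuleID, IngestDateTime, Deleted) rows; None means not deleted.
GRANULES = [
    ("FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310", "2001-01-02", "2001-03-01"),
    ("FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80", "2001-01-02", None),
    ("FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7", "2001-01-03", None),
    ("FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb", "2001-02-03", None),
    ("FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc", "2001-03-03", None),
    ("FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171", "2001-03-03", None),
]


def members(stamp):
    """Set of granule IDs in the dataset as of stamp."""
    return {
        gid
        for gid, ingested, deleted in GRANULES
        if ingested <= stamp and (deleted is None or deleted > stamp)
    }


def diff(identifier_a, identifier_b):
    """Return (added, removed) going from dataset a to dataset b."""
    before = members(TIMESTAMP_OF[identifier_a])
    after = members(TIMESTAMP_OF[identifier_b])
    return sorted(after - before), sorted(before - after)


alice = "124926a96f2fb6b8176608a28baa714b"
charlie = "f21d39e77e7ccf12493d5a432b2660c4"
added, removed = diff(alice, charlie)
print("added:  ", added)
print("removed:", removed)
```

Going from Alice's dataset to Charlie's, this reports granule 13 as
added and the original granule 10 as removed.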

Curt

[1] http://en.wikipedia.org/wiki/ACID

