[Esip-preserve] [FOO] Foo Project moves to Google Spreadsheets

Tue Oct 12 10:53:12 EDT 2010

On 10/11/2010 04:57 PM, alicebarkstrom at frontier.com wrote:
> Do we have any analysis that quantifies the probability that the
> identifiers will be disconnected from the file contents?
> [...]
> Since the usual DOI/URL/PURL/.../UUID's do not have direct ties to
> the file (or larger collections), I don't think you could rely on
> these to detect changes.

On 10/12/2010 09:39 AM, Kenneth S. Casey wrote:
> On Oct 12, 2010, at 9:21 AM, alicebarkstrom at frontier.com
> <mailto:alicebarkstrom at frontier.com> wrote:
>> The suggestion that Ken has of embedding UUIDs in the file at least
>> makes them unique and independent of cryptographic digests. That
>> means as long as a copy of the original file is readable, it can be
>> uniquely identified.

UUIDs are not resolvable, so I think embedding them in the file makes
a lot of sense.

DOIs on the other hand are resolvable.  The curator of the archive
should make a commitment to maintain the tie to the data collection.

Not only that, I think we have a strong case that the curator should
develop and maintain the ability to resolve the DOI to a specific set
of granules given a particular date/time stamp.  (see example below)

>> What happens after the original file becomes unreadable (physical
>> deterioration, software obsolescence, hardware obsolescence being
>> prime suspects) is not so clear.

I think this is a whole separate category of "preservation."  I like
to wrap this in a box labeled "curator commitment."  If you have the
resources to prevent those problems, you can.  If your data value
warrants the resources to support that commitment, you can maintain
it.  If your data value doesn't warrant those resources, you lose it.
Hopefully over time (and efforts of groups like this one), technology
advances to make the resource requirements cheaper and cheaper.  As
long as that wave stays ahead of us, we can keep everything forever.

For example, a couple years ago, the ozone group went back to our
earliest data (circa 1970s) and pulled all the data (off of the old
tapes) into the modern archive, updated metadata to current standards
and reprocessed everything with the latest version of the algorithms.
I have every expectation that will happen again and again, but it
definitely depends on ongoing funding for that group.

> As for defining collections, that is probably not so easy to do in
> general but can certainly be done on a project basis... in GHRSST we
> know exactly what a collection is and what a granule is. Between
> checksums and UUIDs and DOIs, plus the date someone makes a
> citation, I feel pretty good that we could track back
> unambiguously exactly which granules were used. And if not, we make
> adjustments in the next version of GHRSST and move on.

For example, in the FOO case, Alice's paper can include both the DOI
doi:10.9999/US/FOOL3.v2 and the date "2001-01-05" and it would
unambiguously indicate that she used granule
FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768 (which entered that
dataset on 2001-01-04 and was deleted on 2001-03-01) and not granule
FOOL3.v2.01.2a365058-fb52-4559-ab4b-085cb5ac0b73 (which entered the
dataset on 2001-03-01).

Looking at the FOOL2.002 dataset, a citation that included
doi:10.9999/US/FOOL2.v2 and the date 2001-01-05 resolves to these
specific granules:

FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310  (bad granule later deleted)
FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7

and a citation that included doi:10.9999/US/FOOL2.v2 and the date
2001-03-04 resolves to these specific granules:

FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171  (replaced old granule)
FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb  (added new granule)
FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc  (added new granule)

(I'm simplifying to just the date, but it should really be date+time
with enough resolution to remain unambiguous.  Also, always use
ISO 8601 date/time formats.)
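
The date-based resolution above can be sketched roughly as follows.
This is only an illustration in Python, not how any archive actually
implements it: the GranuleRecord type, the ARCHIVE table, and the
utc() and resolve() functions are hypothetical names.  I'm also
assuming that entry and deletion happen at midnight UTC on the dates
given in the FOOL3 example, and that a granule no longer belongs to
the dataset at the instant it is deleted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class GranuleRecord:
    """Hypothetical archive record: when a granule entered and left a dataset."""
    granule_id: str
    entered: datetime
    deleted: Optional[datetime] = None  # None means still in the dataset


def utc(text: str) -> datetime:
    """Parse an ISO 8601 date or date-time without an offset and treat it as UTC."""
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


# Hypothetical index keyed by DOI, filled with the FOOL3 example.
# Midnight UTC is an assumption; the example only gives dates.
ARCHIVE: Dict[str, List[GranuleRecord]] = {
    "doi:10.9999/US/FOOL3.v2": [
        GranuleRecord(
            "FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768",
            entered=utc("2001-01-04"),
            deleted=utc("2001-03-01"),
        ),
        GranuleRecord(
            "FOOL3.v2.01.2a365058-fb52-4559-ab4b-085cb5ac0b73",
            entered=utc("2001-03-01"),
        ),
    ],
}


def resolve(doi: str, as_of: datetime) -> List[str]:
    """Return the IDs of granules that were in the dataset at time as_of."""
    return sorted(
        rec.granule_id
        for rec in ARCHIVE[doi]
        if rec.entered <= as_of and (rec.deleted is None or as_of < rec.deleted)
    )


print(resolve("doi:10.9999/US/FOOL3.v2", utc("2001-01-05")))
print(resolve("doi:10.9999/US/FOOL3.v2", utc("2001-03-04")))
```

With the 2001-01-05 citation date this returns only the 07aa9ae3
granule, and with 2001-03-04 only the 2a365058 granule.  Treating the
interval as closed at entry and open at deletion is what keeps a
replacement made at a single instant from resolving to both granules.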

>> I've no objection to trying DOI's - but what we mean by a
>> collection needs clarification.

This is a big concern of mine.  The way our OMI group uses
"collection" is slightly different from the way the ECS used it.  We
have a concept called "ArchiveSet", which I think is more precise.
This is related to "Data Version", which is a more complicated issue
than it appears at first blush.  I might try to explore that a bit
more in the example.  Consider what happens when the LUT version
changes, but the L2 algorithm version doesn't change.  We get two
different versions of the L2 data and need to distinguish them.
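
One way to picture the LUT case is to treat the data version as a pair
rather than a single number.  This is a minimal sketch under that
assumption, not our actual ArchiveSet scheme: the DataVersion type,
its field names, the label format, and the version strings are all
made up for illustration.

```python
from typing import NamedTuple


class DataVersion(NamedTuple):
    """Hypothetical data version: algorithm version plus LUT version."""
    algorithm_version: str
    lut_version: str

    def label(self) -> str:
        # Illustrative label format only.
        return f"alg{self.algorithm_version}-lut{self.lut_version}"


# Same L2 algorithm, different lookup table (made-up version strings).
before = DataVersion(algorithm_version="1.2", lut_version="7")
after = DataVersion(algorithm_version="1.2", lut_version="8")

print(before.label(), after.label())
print(before == after)
```

The two values share an algorithm version but compare as different,
which is the distinction a single algorithm version number would lose.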

Curt