[Esip-preserve] [FOO] DatasetInstance Identifiers in federated mirrors

Curt Tilmes Curt.Tilmes at nasa.gov
Wed Oct 20 10:21:55 EDT 2010


Now let's talk about what I'll call "DatasetInstance
Identifiers".  I'll define one as something we can resolve to a
specific set of granules, even for an open or dynamic dataset.

If THEM maintains similar "DatasetInstance Identifiers" with a table
like the one in my other note, and computes that identifier with the
algorithm previously described, we get something like this:

Dataset   Timestamp   DatasetInstanceIdentifier
FOOL2.002 2001-02-01  2dd3d82d3926fe552274b452cc5662c4

Note that since I'm using the "running total" MD5 scheme, we end up
with different DatasetInstance Identifiers between the US archive and
the THEM archive.

US added granules 1..11 at once, got one identifier, then added
granule 12 and got a second identifier for granules 1..12.

THEM added granules 1..12 at once, and got a different identifier for
granules 1..12.

Since our desired property for "DatasetInstance Identifiers" is that
two users citing identical sets of granules end up with the same
identifier, this scheme loses right out of the gate.


One possible algorithm for generating identical identifiers would be
to simply take the whole list of granule identifiers (sorted
canonically) and hash that.
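
In Python, that might look something like this (a sketch only; the
newline-joined UTF-8 byte format is an assumption -- any fixed
convention works, as long as every archive uses the same one):

import hashlib

def dataset_instance_id(granule_ids):
    # Hash the entire canonically sorted list of granule identifiers.
    # Any archive holding the same set computes the same digest, but
    # the whole list must be re-hashed every time the set changes.
    blob = "\n".join(sorted(granule_ids))
    return hashlib.md5(blob.encode("utf-8")).hexdigest()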

This has the advantage of being 'correct' in that each archive could
compute it separately and would always get the same answer, but it
seems like a lot of work to keep up to date every time a new granule
hits the archive.  For our toy case it is easy, but look at MODIS:
10 years * 365 days/year * 288 granules/day is over a million
granules, and at ~50 chars/identifier that is roughly 50 MB to
re-hash on every update.  For NPP/JPSS, they are talking about
30-second granules, so the total number of granules is obscenely
high.  (Though we do aggregate them up a lot -- we could provide
DatasetInstance Identifiers only for the aggregated datasets, not for
the raw data.)

Another alternative in the "mirror" case would be for the mirror
archive to also transfer the "running total" DatasetInstance
Identifiers from the primary archive.  That would be possible, but
would require a very high degree of synchronization that may prove
difficult in practice.

Perhaps we could try to do something with granule-by-granule running
totals, and simply roll back to the point of divergence when you get
an out-of-order granule.

The algorithm goes something like this:
For the first granule, calculate the md5sum of its unique granule
identifier.
For each additional granule,
     calculate the md5sum of the previous md5sum
                         + the unique granule identifier of the next granule.
For each removed granule,
     roll back to the point of divergence in the md5sum sequence and
     recalculate forward.
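
Roughly, in Python (a sketch only -- the DatasetInstanceLog name and
the exact bytes fed to md5, newline-joined UTF-8 here, are my own
assumptions, and this version simply recomputes the whole chain on any
out-of-order change rather than restarting from a stored intermediate
digest):

import hashlib
from bisect import insort

class DatasetInstanceLog:
    def __init__(self):
        self.granules = []   # current granule ids, canonically sorted
        self.history = []    # (timestamp, DatasetInstance Identifier)

    @staticmethod
    def _step(prev_hex, granule_id):
        # One link of the chain: md5sum of the previous md5sum plus
        # the unique granule identifier (byte format is an assumption).
        data = (granule_id if prev_hex is None
                else prev_hex + "\n" + granule_id)
        return hashlib.md5(data.encode("utf-8")).hexdigest()

    def _full_chain(self):
        # Rewind to the beginning and recalculate forward.  A smarter
        # version would restart from a stored digest just before the
        # point of divergence.
        h = None
        for gid in self.granules:
            h = self._step(h, gid)
        return h

    def add(self, ids, timestamp):
        ids = sorted(ids)
        # Fast path: every new granule sorts after everything we hold,
        # so just extend the running total from the latest identifier.
        extend = bool(self.granules) and ids[0] > self.granules[-1]
        h = self.history[-1][1] if extend else None
        for gid in ids:
            insort(self.granules, gid)
        if extend:
            for gid in ids:
                h = self._step(h, gid)
        else:
            h = self._full_chain()
        self.history.append((timestamp, h))
        return h

    def remove(self, gid, timestamp):
        # Removed granule: roll back and recalculate forward.
        self.granules.remove(gid)
        h = self._full_chain()
        self.history.append((timestamp, h))
        return h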

What does that really mean?  FOO Example time...

As US ingests the granules into FOOL2.002, it calculates
DatasetInstance Identifiers like so:

LocalgranuleID                                   Ingested   Deleted
------------------------------------------------ ---------- -------
FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204 2001-01-02
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419 2001-01-02
FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e 2001-01-02
FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7 2001-01-02
FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14 2001-01-02
FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233 2001-01-02
FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e 2001-01-02
FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc 2001-01-02
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57 2001-01-02
FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 2001-01-02 2001-03-01
FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80 2001-01-02
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7 2001-01-03
FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb 2001-02-03
FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc 2001-03-03
FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171 2001-03-03

On 2001-01-02, you add the first 11 granules above.  Start with
FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204,
take md5sum of that: f869b254eb75be5a2736cdb28b30eba0,
add to that the next granule, so we take an md5sum of this:
f869b254eb75be5a2736cdb28b30eba0
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
and get: de2c970d4c035550b7880403ef52be6d
continuing to add later granules, we get these:
+ FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
=> 905e08c6999bc0c9d4a4f662c2566d93
+ FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
=> 552e64b7de31866d335ae49e5fa388c5
+ FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
=> 177194dac82f85646a913334edfd2ea8
+ FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
=> 2e816b406fae56cc578f9f49612b7005
+ FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
=> 5f4bafcdd8187e4b6f32a908e3297afc
+ FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
=> 9c681dfe89be66ca2c14a2803cc911ff
+ FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
=> 242eba08c8fd2ac386b3797d43a26331
+ FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310
=> d2d541e2776128a74eef16eca84e4be4
+ FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
=> 7fb1e8ba9b0c9888858b66f6a1732d2c

We store that identifier, associated with time 2001-01-02 (and just
discard the intermediate ones).

At time 2001-01-03, we add granule
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7,
so we take the "latest DatasetInstance Identifier" and add the new
granule identifier to it, hashing
7fb1e8ba9b0c9888858b66f6a1732d2c
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
to get 763122197bfb3ffbf0da14adbfb1b13b, which we store.
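
In code, that incremental step is a single link of the chain (using
the hypothetical _step helper from the sketch above; with my assumed
byte format the digest may not reproduce the value shown, but the
shape of the operation is the point):

prev = "7fb1e8ba9b0c9888858b66f6a1732d2c"
new_granule = "FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7"
latest = DatasetInstanceLog._step(prev, new_granule)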

At time 2001-02-03, we add
FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb
to get a8e677fcc63016b49be72f202f4ba760

At time 2001-03-01, we remove granule
FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (the bad one).
So we rewind to the last DatasetInstance Identifier prior to the
ingest of that granule.  (It just happens in this case that we rewind
to the beginning of the mission, but in real-world cases, the
likelihood of a given granule being removed goes down over time.)

So, restart from the beginning and recalculate all the identifiers:
FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
=> f869b254eb75be5a2736cdb28b30eba0
+ FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
=> de2c970d4c035550b7880403ef52be6d
+ FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
=> 905e08c6999bc0c9d4a4f662c2566d93
+ FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
=> 552e64b7de31866d335ae49e5fa388c5
+ FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
=> 177194dac82f85646a913334edfd2ea8
+ FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
=> 2e816b406fae56cc578f9f49612b7005
+ FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
=> 5f4bafcdd8187e4b6f32a908e3297afc
+ FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
=> 9c681dfe89be66ca2c14a2803cc911ff
+ FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
=> 242eba08c8fd2ac386b3797d43a26331
+ FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80  (note we skip 10)
=> 3563a5830ba63ff0633024894df46168
+ FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
=> 7d214181a4db9ef9f5677c86400164c8
+ FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb
=> c552aca58d871920702c6948c7c0bbe1
and store that associated with time 2001-03-01.

At time 2001-03-03, we add two more granules:
FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171
FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc
We sort that list, take the first one, search the table for the
granule just before that one:
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
look up its ingest date: 2001-01-02,
find the DatasetInstance Identifier prior to that, and again we have
to rewind to the beginning of the mission.  (I guess you could store
those intermediate results to reuse them, but I think it really
wouldn't be worth it in practice.)
[... same as above up to FOOL2.v2.09 ...]
242eba08c8fd2ac386b3797d43a26331
+ FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171
=> 4e41c3b6e990884d24c8c1f7fb50c600
+ FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
=> eb5058e46c419a06d68e37bbe372461e
+ FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
=> a72c246574cd40669f5531996433b1ab
+ FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb
=> a9e88329c184061b5ef0e71cd6f19fcd
+ FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc
=> 4cc346c87b70c9457b8cdfb1a3bb6ae4
which we store for time 2001-03-03.
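
Replaying the whole timeline with the sketch above looks something
like this (abbreviated, hypothetical granule ids stand in for the full
LocalGranuleIDs, so the digests will differ from the ones shown; the
history ends up the same shape as the table below):

log = DatasetInstanceLog()
log.add(["FOOL2.v2.%02d" % i for i in range(1, 12)], "2001-01-02")
log.add(["FOOL2.v2.12"], "2001-01-03")
log.add(["FOOL2.v2.13"], "2001-02-03")
log.remove("FOOL2.v2.10", "2001-03-01")                 # rewind, recalculate
log.add(["FOOL2.v2.10a", "FOOL2.v2.14"], "2001-03-03")  # out-of-order add
for ts, di in log.history:
    print(ts, di)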

So (if I did my math right...) this is what we have for dataset
US.FOOL2.002:

Timestamp  DatasetInstance Identifier
---------  --------------------------
2001-01-02 7fb1e8ba9b0c9888858b66f6a1732d2c
2001-01-03 763122197bfb3ffbf0da14adbfb1b13b
2001-02-03 a8e677fcc63016b49be72f202f4ba760
2001-03-01 c552aca58d871920702c6948c7c0bbe1
2001-03-03 4cc346c87b70c9457b8cdfb1a3bb6ae4


Now let's take a look at the mirror THEM.FOOL2.002:

At time 2001-02-01, they grab the dataset from US.FOOL2.002, and
get these granules:

FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (from corrupt data)
FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7

They calculate their DatasetInstance Identifier like this:

FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
=> f869b254eb75be5a2736cdb28b30eba0
+ FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
=> de2c970d4c035550b7880403ef52be6d
+ FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
=> 905e08c6999bc0c9d4a4f662c2566d93
+ FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
=> 552e64b7de31866d335ae49e5fa388c5
+ FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
=> 177194dac82f85646a913334edfd2ea8
+ FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
=> 2e816b406fae56cc578f9f49612b7005
+ FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
=> 5f4bafcdd8187e4b6f32a908e3297afc
+ FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
=> 9c681dfe89be66ca2c14a2803cc911ff
+ FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
=> 242eba08c8fd2ac386b3797d43a26331
+ FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310
=> d2d541e2776128a74eef16eca84e4be4
+ FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
=> 7fb1e8ba9b0c9888858b66f6a1732d2c
+ FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
=> 763122197bfb3ffbf0da14adbfb1b13b

which they store in their table, with date 2001-02-01:

Timestamp  DatasetInstance Identifier
---------  --------------------------
2001-02-01 763122197bfb3ffbf0da14adbfb1b13b


Now suppose Alice downloads "US.FOOL2.002" on 2001-01-05 and Bob
downloads "THEM.FOOL2.002" on 2001-02-03.

Alice's data citation includes these: { doi:10.9999/US/FOOL2.v2,
2001-01-05, DI:763122197bfb3ffbf0da14adbfb1b13b }.  Bob's citation
includes these: { doi:10.9999/US/FOOL2.v2, 2001-02-03,
DI:763122197bfb3ffbf0da14adbfb1b13b }. (DI = DatasetInstance?)

Again we can look at the citations and see that they are citing
identical sets of granules.
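
Mechanically, that check is just a comparison of the DI fields
(hypothetical citation records mirroring the example above):

alice = {"doi": "doi:10.9999/US/FOOL2.v2", "date": "2001-01-05",
         "di": "763122197bfb3ffbf0da14adbfb1b13b"}
bob = {"doi": "doi:10.9999/US/FOOL2.v2", "date": "2001-02-03",
       "di": "763122197bfb3ffbf0da14adbfb1b13b"}
# Equal DatasetInstance Identifiers imply identical granule sets,
# even though the downloads came from different archives on
# different dates.
assert alice["di"] == bob["di"]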

What do you think?

Compared to the previous approach, this algorithm maintains
consistent DatasetInstance Identifiers regardless of the order in
which you add or remove granules, so it handles the federated mirror
case well.  Adding individual granules remains very cheap (a running
total rather than a reset to the beginning in every case), and it
still provides a way (though slow) to handle out-of-order or removed
granules.  The time it takes to handle those cases is proportional to
how far in the past the divergence occurred.

Curt


