[Esip-preserve] [FOO] DatasetInstance Identifiers in federated mirrors

Bruce Barkstrom brbarkstrom at gmail.com
Wed Oct 20 11:30:01 EDT 2010


Ah - life in the simulated world gets complicated - and more
realistic.

Another case I don't think has been included yet is one where
there was an operator error that placed wrong files in the place
where one might expect a particular member of a sequence of
files.  Note that I expect most users will be looking at the date
of observation, rather than the "date of publication" - a rather
important point for data access.  For access by date of observation,
it would seem important to ensure that the operator error places
the files in the expected collating sequence.  The same would be
true if the failure involved a hardware error (e.g. a router failure
that corrupted the bits in the file).   This kind of question raises
the issue of whether or not an archive should ever be allowed
to delete anything.  Deep Space doesn't.  I'm inclined to think
deletions should be allowed - but be difficult to carry out and
require some very careful procedural constraints.

Keep it up!

Bruce B.

On Wed, Oct 20, 2010 at 10:21 AM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> Now let's talk about what I'll now call "DatasetInstance
> Identifiers". I'll define that as something we can resolve to a
> specific set of granules, even for an open or dynamic dataset.
>
> If THEM maintains similar "DatasetInstance Identifiers" with a table
> like my other note, and computes that identifier with the algorithm
> previously described, we get something like this:
>
> Dataset   Timestamp   DatasetInstanceIdentifier
> FOOL2.002 2001-02-01  2dd3d82d3926fe552274b452cc5662c4
>
> Note that since I'm using the "running total" MD5 scheme, we end up
> with different DatasetInstance Identifiers between the US archive and
> the THEM archive.
>
> US added granules 1..11 at once, got one identifier, then added
> granule 12 and got a second identifier for granules 1..12.
>
> THEM added granules 1..12 at once, and got a different identifier for
> granules 1..12.
>
> Since our desired property for "DatasetInstance Identifiers" is that
> two users citing identical sets of granules end up with the same
> identifier, this loses right out of the gate.
>
>
> One possible algorithm for generating identical identifiers would be
> to simply take the whole list of granule identifiers (sorted
> canonically) and hash that.
>
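
A minimal sketch of the sorted-list approach, in Python.  The function
name, the newline separator, and the UTF-8 encoding are assumptions made
for illustration; the note only says to sort the identifiers canonically
and hash the result.

```python
import hashlib


def sorted_list_identifier(granule_ids):
    """Hash the whole granule list after sorting it canonically."""
    # Assumption: plain string sort, one identifier per line, UTF-8 bytes.
    canonical = "\n".join(sorted(granule_ids))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Two archives holding the same granules, ingested in different orders.
us_granules = [
    "FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204",
    "FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419",
    "FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e",
]
them_granules = list(reversed(us_granules))

print(sorted_list_identifier(us_granules) == sorted_list_identifier(them_granules))  # True
```

Because the input is sorted before hashing, ingest order cannot affect
the result, but every change to the archive means hashing the full list
again.
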
> This has the advantage of being 'correct' in that each archive could
> compute it separately and would always get the same answer, but it
> seems like a lot of work to keep up to date every time a new granule
> hits the archive.  For our toy case, it is easy, but looking at MODIS,
> 10 years * 365 days/year * 288 granules/day * 50 chars/identifier?, it
> gets harder and harder.  For the NPP/JPSS, they are talking about 30
> second granules, so the total number of granules is obscenely high.
> (Though we do aggregate them up a lot -- we could provide
> DatasetInstance Identifiers only for the aggregated datasets, not for
> the raw data.)
>
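
To put rough numbers on that estimate, here is the arithmetic as a small
Python sketch.  The function name is made up, 288 granules/day is read as
5-minute granules, and applying the same ten-year span to 30-second
granules is an assumption made only for comparison.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def identifier_list_size(years, granule_seconds, chars_per_identifier=50):
    """Return (granule count, total identifier characters) for a mission."""
    granules_per_day = SECONDS_PER_DAY // granule_seconds
    count = years * 365 * granules_per_day
    return count, count * chars_per_identifier


# 288 granules/day corresponds to 5-minute (300 s) granules.
print(identifier_list_size(10, 300))  # (1051200, 52560000)
# 30-second granules over the same ten years (assumed span).
print(identifier_list_size(10, 30))   # (10512000, 525600000)
```

So the ten-year list is about a million identifiers and roughly 50 MB of
text to sort and hash, and ten times that for 30-second granules.
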
> Another alternative in the "mirror" case would be for the mirror
> archive to also transfer the "running total" DatasetInstance
> Identifiers from the primary archive.  That would be possible, but
> would require a very high degree of synchronization that may prove
> difficult in practice.
>
> Perhaps we could try to do something with granule by granule running
> totals, and simply roll back to the point of divergence when you get
> an out of order granule.
>
> Algorithm something like this:
> For first granule, calculate md5sum of unique granule identifier.
> For each additional granule
>    calculate md5sum of previous md5sum
>                        + unique granule identifier of next granule.
> For each removed granule,
>    roll back to point of divergence in md5sum sequence and
>    recalculate forward.
>
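
The steps above can be sketched in Python roughly as follows.  The
helper names, the newline between the previous md5sum and the next
identifier, and the shortened stand-in identifiers are all assumptions;
the exact bytes fed to md5sum are not spelled out in the note, so the
digests here will not match the ones quoted in the example.

```python
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def extend(previous_digest, granule_id):
    """Fold one more granule identifier into the running total."""
    # Assumption: previous digest and identifier are joined by a newline.
    if previous_digest is None:
        return md5_hex(granule_id)
    return md5_hex(previous_digest + "\n" + granule_id)


def chain_identifier(granule_ids):
    """Run the chain over the granules in sorted order, from the start."""
    digest = None
    for granule_id in sorted(granule_ids):
        digest = extend(digest, granule_id)
    return digest


# Stand-in identifiers; real ones would carry the UUID suffix as well.
granules = ["FOOL2.v2.%02d" % n for n in range(1, 13)]

# US: granules 1..11 at once, then granule 12 appended to the running total.
us_day1 = chain_identifier(granules[:11])
us_day2 = extend(us_day1, granules[11])

# THEM: granules 1..12 at once.
them = chain_identifier(granules)
print(us_day2 == them)  # True

# Removing granule 10 forces a recalculation without it.
without_10 = chain_identifier(g for g in granules if g != "FOOL2.v2.10")
print(without_10 != them)  # True
```

This version recalculates the whole chain on a removal; rolling back
only to the point of divergence is an optimization on top of it.
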
> What does that really mean?  FOO Example time...
>
> As US ingests the granules into FOOL2.002, it calculates a
> DatasetInstance like so:
>
> LocalgranuleID                                   Ingested   Deleted
> ------------------------------------------------ ---------- -------
> FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204 2001-01-02
> FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419 2001-01-02
> FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e 2001-01-02
> FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7 2001-01-02
> FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14 2001-01-02
> FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233 2001-01-02
> FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e 2001-01-02
> FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc 2001-01-02
> FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57 2001-01-02
> FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 2001-01-02 2001-03-01
> FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80 2001-01-02
> FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7 2001-01-03
> FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb 2001-02-03
> FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc 2001-03-03
> FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171 2001-03-03
>
> On 2001-01-02, you add the first 11 granules above.  Start with
> FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204,
> take md5sum of that: f869b254eb75be5a2736cdb28b30eba0,
> add to that the next granule, so we take an md5sum of this:
> f869b254eb75be5a2736cdb28b30eba0
> FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
> and get: de2c970d4c035550b7880403ef52be6d
> continuing to add later granules, we get these:
> + FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
> => 905e08c6999bc0c9d4a4f662c2566d93
> + FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
> => 552e64b7de31866d335ae49e5fa388c5
> + FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
> => 177194dac82f85646a913334edfd2ea8
> + FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
> => 2e816b406fae56cc578f9f49612b7005
> + FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
> => 5f4bafcdd8187e4b6f32a908e3297afc
> + FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
> => 9c681dfe89be66ca2c14a2803cc911ff
> + FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
> => 242eba08c8fd2ac386b3797d43a26331
> + FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310
> => d2d541e2776128a74eef16eca84e4be4
> + FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
> => 7fb1e8ba9b0c9888858b66f6a1732d2c
>
> That identifier we store, associated with time 2001-01-02 (just
> discard the others).
>
> At time 2001-01-03, we add granule
> FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
> so we get the "latest DatasetInstance Identifier" and add the new
> granule identifier to it, hashing
> 7fb1e8ba9b0c9888858b66f6a1732d2c
> FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
> to get 763122197bfb3ffbf0da14adbfb1b13b, which we store.
>
> At time 2001-02-03, we add
> FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb
> to get a8e677fcc63016b49be72f202f4ba760
>
> At time 2001-03-01, we remove granule
> FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (the bad one).
> So we rewind to the last DatasetInstance Identifier prior to the
> ingest of that granule.  (Just happens for this case that we rewind to
> the beginning of the mission, but for real world cases, the
> likelihood of a given granule being removed goes down over time.)
>
> So, restart from the beginning, and recalculate all the identifiers:
> FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
> => f869b254eb75be5a2736cdb28b30eba0
> + FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
> => de2c970d4c035550b7880403ef52be6d
> + FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
> => 905e08c6999bc0c9d4a4f662c2566d93
> + FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
> => 552e64b7de31866d335ae49e5fa388c5
> + FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
> => 177194dac82f85646a913334edfd2ea8
> + FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
> => 2e816b406fae56cc578f9f49612b7005
> + FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
> => 5f4bafcdd8187e4b6f32a908e3297afc
> + FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
> => 9c681dfe89be66ca2c14a2803cc911ff
> + FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
> => 242eba08c8fd2ac386b3797d43a26331
> + FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80  (note we skip 10)
> => 3563a5830ba63ff0633024894df46168
> + FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
> => 7d214181a4db9ef9f5677c86400164c8
> + FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb
> => c552aca58d871920702c6948c7c0bbe1
> and store that associated with time 2001-03-01.
>
> At time 2001-03-03, we add two more granules:
> FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171
> FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc
> We sort that list, take the first one, search the table for the
> granule just before that one:
> FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
> look up its ingest date: 2001-01-02,
> find the dataset identifier prior to that, and again we have to rewind
> to the beginning of the mission.  (I guess you could store those
> intermediate results to reuse them, but I think it really wouldn't be
> worth it in practice.)
> [... same as above up to FOOL2.v2.09 ...]
> 242eba08c8fd2ac386b3797d43a26331
> + FOOL2.v2.10.6e58a410-60e7-4956-aeaf-37f76a16b171
> => 4e41c3b6e990884d24c8c1f7fb50c600
> + FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
> => eb5058e46c419a06d68e37bbe372461e
> + FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
> => a72c246574cd40669f5531996433b1ab
> + FOOL2.v2.13.f8f9564d-cc2a-4760-b1bc-13f1ef5cbdcb
> => a9e88329c184061b5ef0e71cd6f19fcd
> + FOOL2.v2.14.4814ed46-0e41-4e3f-8f73-33d0cd2ef0bc
> => 4cc346c87b70c9457b8cdfb1a3bb6ae4
> which we store for time 2001-03-03
>
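
If the intermediate results were stored, as the parenthetical above
suggests, the rewind could restart at the point of divergence instead of
at the beginning of the mission.  A hedged Python sketch of that
variant: the class, its method names, the newline separator, and the
shortened identifiers are assumptions for illustration, not part of the
scheme as described.

```python
import bisect
import hashlib


class RunningChain:
    """Sorted granule list plus the running digest after each granule."""

    def __init__(self):
        self.ids = []      # granule identifiers, kept sorted
        self.digests = []  # digests[i] covers ids[0] through ids[i]

    def _recompute_from(self, start):
        # Digests before `start` are still valid; redo everything after.
        previous = self.digests[start - 1] if start > 0 else None
        del self.digests[start:]
        for gid in self.ids[start:]:
            # Assumption: previous digest and identifier joined by newline.
            text = gid if previous is None else previous + "\n" + gid
            previous = hashlib.md5(text.encode("utf-8")).hexdigest()
            self.digests.append(previous)

    def add(self, gid):
        pos = bisect.bisect_left(self.ids, gid)
        if pos < len(self.ids) and self.ids[pos] == gid:
            return  # already present, nothing changes
        self.ids.insert(pos, gid)
        self._recompute_from(pos)

    def remove(self, gid):
        pos = bisect.bisect_left(self.ids, gid)
        if pos == len(self.ids) or self.ids[pos] != gid:
            raise KeyError(gid)
        del self.ids[pos]
        self._recompute_from(pos)

    @property
    def identifier(self):
        return self.digests[-1] if self.digests else None


# US history: granules 1..13, drop the bad granule 10, then add its
# replacement and granule 14 (".a"/".b" stand in for the UUID suffixes).
us = RunningChain()
for n in range(1, 14):
    us.add("FOOL2.v2.%02d.a" % n)
us.remove("FOOL2.v2.10.a")
us.add("FOOL2.v2.10.b")
us.add("FOOL2.v2.14.a")

# A mirror that receives the final set in a completely different order.
them = RunningChain()
for gid in reversed(us.ids):
    them.add(gid)

print(us.identifier == them.identifier)  # True
```

The cost is one stored digest per granule; whether that is worth it
depends on how often late or removed granules show up in practice.
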
> So (if I did my math right...) this is what we have for dataset
> US.FOOL2.002:
>
> Timestamp  DatasetInstance Identifier
> ---------  --------------------------
> 2001-01-02 7fb1e8ba9b0c9888858b66f6a1732d2c
> 2001-01-03 763122197bfb3ffbf0da14adbfb1b13b
> 2001-02-03 a8e677fcc63016b49be72f202f4ba760
> 2001-03-01 c552aca58d871920702c6948c7c0bbe1
> 2001-03-03 4cc346c87b70c9457b8cdfb1a3bb6ae4
>
>
> Now let's take a look at the mirror THEM.FOOL2.002:
>
> At time 2001-02-01, they grab the dataset from US.FOOL2.002, and
> get these granules:
>
> FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
> FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
> FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
> FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
> FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
> FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
> FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
> FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
> FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
> FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (from corrupt data)
> FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
> FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
>
> They calculate their DatasetInstance Identifier like this:
>
> FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
> => f869b254eb75be5a2736cdb28b30eba0
> + FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
> => de2c970d4c035550b7880403ef52be6d
> + FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
> => 905e08c6999bc0c9d4a4f662c2566d93
> + FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
> => 552e64b7de31866d335ae49e5fa388c5
> + FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
> => 177194dac82f85646a913334edfd2ea8
> + FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
> => 2e816b406fae56cc578f9f49612b7005
> + FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
> => 5f4bafcdd8187e4b6f32a908e3297afc
> + FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
> => 9c681dfe89be66ca2c14a2803cc911ff
> + FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
> => 242eba08c8fd2ac386b3797d43a26331
> + FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310
> => d2d541e2776128a74eef16eca84e4be4
> + FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
> => 7fb1e8ba9b0c9888858b66f6a1732d2c
> + FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
> => 763122197bfb3ffbf0da14adbfb1b13b
>
> which they store in their table, with date 2001-02-01:
>
> Timestamp  DatasetInstance Identifier
> ---------  --------------------------
> 2001-02-01 763122197bfb3ffbf0da14adbfb1b13b
>
>
> Now suppose Alice downloads "US.FOOL2.002" on 2001-01-05 and Bob
> downloads "THEM.FOOL2.002" on 2001-02-03.
>
> Alice's data citation includes these: { doi:10.9999/US/FOOL2.v2,
> 2001-01-05, DI:763122197bfb3ffbf0da14adbfb1b13b }.  Bob's citation
> includes these: { doi:10.9999/US/FOOL2.v2, 2001-02-03,
> DI:763122197bfb3ffbf0da14adbfb1b13b }. (DI = DatasetInstance?)
>
> Again we can look at the citations and see that they are citing
> identical sets of granules.
>
> What do you think?
>
> Compared to the previous approach, this algorithm yields the same
> DatasetInstance Identifiers regardless of the order in which you
> add/remove your granules, so it handles the federated mirror case well.
> Adding individual granules in order remains very easy (a running total
> rather than a reset to the beginning in every case), and there is still
> a way (though slow) to handle out-of-order or removed granules.  The
> time it takes to handle those cases is proportional to how far in the
> past the divergence occurred.
>
> Curt
>