[Esip-preserve] On Earth Science Data File Uniqueness
Lynnes, Christopher S. (GSFC-6102)
christopher.s.lynnes at nasa.gov
Wed Feb 9 13:27:17 EST 2011
On Feb 9, 2011, at 1:14 PM, Curt Tilmes wrote:
> On 02/09/11 13:08, Lynnes, Christopher S. (GSFC-6102) wrote:
>
>> Your use case may be plausible, but does not seem to be very common
>> in the wild. I suspect any recommendation to use UUIDs that goes out
>> for review by us more thick-skulled practitioners is going to need a
>> more compelling use case, esp. when put up against the clear
>> practical benefit of checksums. In other words, you will need to
>> answer the question: why is a UUID a better unique identifier of
>> contents than the contents' SHA-1 or MD5 checksum? And what
>> practical benefit does it buy me?
>
> It is not an identifier for the content at all. It is an identifier
> for the object.
>
> If two granules have the same object identifier, you are talking about
> the same object (two copies of the same object). If you want to verify
> the fixity of the content, the UUID won't give you that. You still
> need to use SHA-1 or MD5 or whatever to verify integrity/fixity.
This speaks directly to practical usefulness. If I have two objects that are "the same" by UUID but have different contents, then they aren't really the same from a practical standpoint. For any operation I am going to apply to them (e.g., science processing, analysis), I need to treat them as different, and the UUID comparison has not helped in the least. On the other hand, if the contents are identical but the UUIDs are different, I can still proceed with most of the operations I would want to do under the assumption that they are the same. Only in a few rare cases (e.g., formal attribution) is this risky, and even those cases presume that I also have thorough provenance to go along with it.
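
To make the two cases concrete, here is a rough sketch of the comparison. Everything in it is assumed for illustration: the Granule class, its field names, and the sample byte strings are hypothetical and not drawn from any real archive system. SHA-1 is used only because it is one of the checksums named in this thread.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class Granule:
    """Hypothetical stand-in for a data file: an object UUID plus its bytes."""
    content: bytes
    object_id: uuid.UUID = field(default_factory=uuid.uuid4)

    def checksum(self) -> str:
        # Hex digest of the content; any other digest would serve the same role.
        return hashlib.sha1(self.content).hexdigest()


def same_object(a: Granule, b: Granule) -> bool:
    """True when both granules carry the same object identifier."""
    return a.object_id == b.object_id


def same_content(a: Granule, b: Granule) -> bool:
    """True when both granules have byte-identical content."""
    return a.checksum() == b.checksum()


original = Granule(b"radiance values v1")

# Case 1: same UUID, different bytes (e.g., a copy damaged in transfer).
damaged_copy = Granule(b"radiance values v1 (truncated)",
                       object_id=original.object_id)

# Case 2: different UUID, identical bytes (e.g., an independent rerun).
independent_rerun = Granule(b"radiance values v1")

for label, other in [("damaged copy", damaged_copy),
                     ("independent rerun", independent_rerun)]:
    print(label,
          "same_object =", same_object(original, other),
          "same_content =", same_content(original, other))
```

In the first case the UUID match says nothing about whether the file is safe to process; in the second, the checksum match is what lets processing proceed despite the differing UUIDs.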
>
>
> Some data models choose to use a hash of the content as an identifier
> for the object. We could choose to do that as well. I think that is
> a valid and useful approach. It does, however, impose certain
> constraints. It assumes that if the content of two objects is
> identical then the objects are identical. It precludes the
> possibility of making two distinct objects with the same content. If
> that is an acceptable constraint, then we could propose to use one of
> the digital signature schemes as our recommendation for data granule
> identifiers. Since one of our goals is reproducibility -- striving to
> make data granules the same way with equivalent content -- we may be
> at cross purposes with ourselves.
>
> Curt
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
--
Dr. Christopher Lynnes NASA/GSFC, Code 610.2 phone: 301-614-5185