[Esip-preserve] On Earth Science Data File Uniqueness

Lynnes, Christopher S. (GSFC-6102) christopher.s.lynnes at nasa.gov
Wed Feb 9 13:27:17 EST 2011


On Feb 9, 2011, at 1:14 PM, Curt Tilmes wrote:

> On 02/09/11 13:08, Lynnes, Christopher S. (GSFC-6102) wrote:
> 
>> Your use case may be plausible, but does not seem to be very common
>> in the wild.  I suspect any recommendation to use UUIDs that goes out
>> for review by us more thick-skulled practitioners is going to need a
>> more compelling use case, esp. when put up against the clear
>> practical benefit of checksums.  In other words, you will need to
>> answer the question: why is a UUID a better unique identifier of
>> contents than the contents' SHA-1 or MD5 checksum?  And what
>> practical benefit does it buy me?
> 
> It is not an identifier for the content at all.  It is an identifier
> for the object.
> 
> If two granules have the same object identifier, you are talking about
> the same object (two copies of the same object).  If you want to verify
> the fixity of the content, the UUID won't give you that.  You still need
> to use SHA-1 or MD5 or whatever to verify integrity/fixity.

This speaks directly to the practical usefulness.  If I have two objects that are "the same" by UUID, but have different contents, then they aren't really the same from a practical standpoint.  For any of the operations I am going to apply to them (e.g., science processing, analysis), I need to treat them as different, and the UUID comparison has not helped in the least.  On the other hand, if the contents are identical, but the UUIDs are different, I can still proceed with most of the operations I would want to do under the assumption that they are the same.  Only in a few rare cases (e.g., formal attribution) is this risky.  And even those cases presume that I also have a thorough provenance record to go along with it.
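
To make the two comparisons concrete, here is a minimal sketch in Python.  The Granule class, its field names, and the sample byte strings are hypothetical and only for illustration; nothing here is an existing ESIP or archive API.  The object identifier is a random UUID, and the content check is a SHA-1 digest of the bytes.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Granule:
    """Hypothetical granule record: raw content plus an object identifier."""
    content: bytes
    # Assumption: a random (version 4) UUID is assigned when the object is made.
    object_id: uuid.UUID = field(default_factory=uuid.uuid4)

    def checksum(self) -> str:
        # SHA-1 digest of the content, used only for the fixity comparison.
        return hashlib.sha1(self.content).hexdigest()


def same_object(a: Granule, b: Granule) -> bool:
    """Identity in the UUID sense: both records name the same object."""
    return a.object_id == b.object_id


def same_content(a: Granule, b: Granule) -> bool:
    """Identity in the checksum sense: both records hold the same bytes."""
    return a.checksum() == b.checksum()


original = Granule(b"radiance values v1")
# A faithful copy: same identifier, same bytes.
copy = Granule(original.content, original.object_id)
# A damaged copy: same identifier, different bytes.
corrupted = Granule(b"radiance values v1 (bit rot)", original.object_id)
# An independent reproduction: new identifier, same bytes.
reproduced = Granule(original.content)

for name, other in [("copy", copy), ("corrupted", corrupted), ("reproduced", reproduced)]:
    print(f"{name:10s} same_object={same_object(original, other)} "
          f"same_content={same_content(original, other)}")
```

The "corrupted" case is the one where the UUID match tells me nothing useful for processing, and the "reproduced" case is the one where I can usually proceed despite the UUID mismatch.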

> 
> 
> Some data models choose to use a hash of the content as an identifier
> for the object.  We could choose to do that as well.  I think that is
> a valid and useful approach.  It does, however, impose certain
> constraints.  It assumes that if the content of two objects is
> identical then the objects are identical.  It precludes the
> possibility of making two distinct objects with the same content.  If
> that is an acceptable constraint, then we could propose to use one of
> the digital signature schemes as our recommendations for data granule
> identifiers.  Since one of our goals is reproducibility -- striving to
> make data granules the same way with equivalent content -- we may be
> at cross purposes with ourselves.
> 
> Curt
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve

--
Dr. Christopher Lynnes     NASA/GSFC, Code 610.2    phone: 301-614-5185



