[Esip-preserve] On Earth Science Data File Uniqueness

Lynnes, Christopher S. (GSFC-6102) christopher.s.lynnes at nasa.gov
Wed Feb 9 13:08:00 EST 2011


Curt,
   Your use case may be plausible, but does not seem to be very common in the wild.  I suspect any recommendation to use UUIDs that goes out for review by us more thick-skulled practitioners is going to need a more compelling use case, esp. when put up against the clear practical benefit of checksums.  In other words, you will need to answer the question:  why is a UUID a better unique identifier of contents than the contents' SHA-1 or MD5 checksum?  And what practical benefit does it buy me?

On Feb 9, 2011, at 12:39 PM, Curt Tilmes wrote:

> On 02/09/11 12:19, Lynnes, Christopher S. (GSFC-6102) wrote:
> 
>> I should have phrased that: has someone asserted that data items A
>> and B are bitwise identical, i.e., by assigning a UUID.  BTW, I
>> thought our preferred method for assigning UUIDs was to derive them
>> from the SHA-1, anyway?
> 
> SHA-1 can be part of making UUIDs, but it isn't really related to the
> content of the object being identified.
> 
> Content based identifiers are very useful for some things, but that
> isn't what we are talking about here (perhaps we should be?)
> 
>>> UUID is just a way to make an identifier that is globally unique
>>> forever [1] and easily recognizable as a UUID.
> 
>> OK, if it doesn't answer the question, are they identical, what
>> question does the UUID answer???  Just having a unique identifier in
>> and of itself is not intrinsically useful.
> 
> That's Bruce's argument.
> 
> 
>>> They can be assigned to the object arbitrarily without regard to
>>> content.
>> 
>> in that case, we have a misuse of UUIDs by the UUID creator.
> 
> perhaps...
> 
> 
> An object's identity is related but distinct from its content.
> 
> I can make a chunk of data, pull a UUID out of the air
> ("bf8847e8-1291-4c4b-be99-a0080183a62c") and label my data with that
> identifier.
> 
> Suppose the content is "5\n".  To maintain the integrity/fixity of
> that object, I take an MD5 digest
> ("1dcca23355272056f04fe8bf20edfce0") and put that in my metadata.
> 
> If someone gets a copy of object bf8847e8-1291-4c4b-be99-a0080183a62c
> they can re-run the MD5 algorithm on the content and verify they've
> got the same content I thought I sent them.
> 
> 
> You can make another object, and by some off chance (maybe you are
> recreating my research, and have a perfect copy of my environment) you
> make an object with the same content.  You create a UUID for your
> object (beabc557-43ce-4204-822b-59b7ed7692e5).  Since the content of
> your object matches the content of my object, the MD5 of your object
> just happens to be the same as mine.
> 
> Now what is the provenance of object
> bf8847e8-1291-4c4b-be99-a0080183a62c?  It was created by "Curt" on
> host "curthost" at time 12:30pm, etc. etc.
> 
> What is the provenance of object beabc557-43ce-4204-822b-59b7ed7692e5?
> It was created by "Chris" on host "chrishost" at time 12:35pm.
> 
> They are different objects that happen to have the same content.
> 
> This is a data model we are constructing; it is our choice to treat
> them as different objects, but I think it is reasonable to think of it
> that way.  If you say they are the same object, the provenance issue,
> for example, can get confused.
> 
> BTW, without distinct identifiers for these two objects I'm discussing
> in this simple example, I couldn't even write those paragraphs.
> That's why I think they are useful.  For my simple example, I could
> easily call them 'x' and 'y', but UUID is really nice in that it can
> give you globally unique forever identifiers.  Someone, somewhere will
> use 'x' or 'y' to identify something else, but no one ever will use
> those two UUIDs again.[1]
> 
> Curt
> 
> [1] without copying them from this message or otherwise maliciously
>    creating them by some manner other than following the UUID
>    creation requirements.
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
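
The distinction drawn in the quoted message (an identifier assigned without
regard to content, plus a checksum kept in metadata for fixity) can be sketched
roughly as follows. This is only an illustration under stated assumptions: the
DataObject class, its field names, and the example.org name are hypothetical
and are not part of any ESIP recommendation. It uses Python's standard uuid and
hashlib modules.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """Hypothetical record that keeps identity separate from content."""
    content: bytes
    creator: str
    host: str
    # Random (version 4) UUID: assigned without regard to the content.
    identifier: uuid.UUID = field(default_factory=uuid.uuid4)

    @property
    def md5(self) -> str:
        # Content-derived checksum, stored in metadata for fixity checks.
        return hashlib.md5(self.content).hexdigest()


# Two objects created independently that happen to hold the same bytes.
curt = DataObject(b"5\n", creator="Curt", host="curthost")
chris = DataObject(b"5\n", creator="Chris", host="chrishost")

same_content = curt.md5 == chris.md5               # checksums agree
same_object = curt.identifier == chris.identifier  # identifiers differ

# A name-based (version 5) UUID does use SHA-1, but it hashes a namespace
# plus a name, not the data bytes. The name below is a made-up example.
by_name = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/granule/1")

print(same_content, same_object, by_name.version)
```

In this sketch the checksum answers "are the contents identical?", while the
UUID only answers "which object, with which provenance, is being discussed?".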

--
Dr. Christopher Lynnes     NASA/GSFC, Code 610.2    phone: 301-614-5185



