[Esip-preserve] On Earth Science Data File Uniqueness
Lynnes, Christopher S. (GSFC-6102)
christopher.s.lynnes at nasa.gov
Wed Feb 9 13:08:00 EST 2011
Curt,
Your use case may be plausible, but does not seem to be very common in the wild. I suspect any recommendation to use UUIDs that goes out for review by us more thick-skulled practitioners is going to need a more compelling use case, esp. when put up against the clear practical benefit of checksums. In other words, you will need to answer the question: why is a UUID a better unique identifier of contents than the contents' SHA-1 or MD5 checksum? And what practical benefit does it buy me?
On Feb 9, 2011, at 12:39 PM, Curt Tilmes wrote:
> On 02/09/11 12:19, Lynnes, Christopher S. (GSFC-6102) wrote:
>
>> I should have phrased that: has someone asserted that data items A
>> and B are bitwise identical, i.e., by assigning a UUID. BTW, I
>> thought our preferred method for assigning UUIDs was to derive them
>> from the SHA-1, anyway?
>
> SHA-1 can be part of making UUIDs, but it isn't really related to the
> content of the object being identified.
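
To make the SHA-1 point concrete, here is a minimal sketch using Python's standard uuid module. It assumes the "SHA-1 can be part of making UUIDs" remark refers to name-based (version 5) UUIDs; the namespace choice and the granule URL below are hypothetical, picked only for illustration.

```python
import uuid

# Version 4: built from random bits, with no input at all.
random_id = uuid.uuid4()

# Version 5: SHA-1 over a namespace UUID plus a *name* string.
# The hash input is the name, not the bytes of the data file.
# (Hypothetical name, for illustration only.)
granule_name = "http://example.org/granule/42"
name_based_id = uuid.uuid5(uuid.NAMESPACE_URL, granule_name)

# The same namespace and name always give the same UUID,
# no matter what the file's content is or later becomes.
repeat_id = uuid.uuid5(uuid.NAMESPACE_URL, granule_name)

print(random_id.version, name_based_id.version)
print(name_based_id == repeat_id)
```

In this sketch the version 5 identifier is stable under any change to the data itself, which is one way to read the claim that the SHA-1 step "isn't really related to the content".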
>
> Content based identifiers are very useful for some things, but that
> isn't what we are talking about here (perhaps we should be?)
>
>>> UUID is just a way to make an identifier that is globally unique
>>> forever [1] and easily recognizable as a UUID.
>
>> OK, if it doesn't answer the question, are they identical, what
>> question does the UUID answer??? Just having a unique identifier in
>> and of itself is not intrinsically useful.
>
> That's Bruce's argument.
>
>
>>> They can be assigned to the object arbitrarily without regard to
>>> content.
>>
>> in that case, we have a misuse of UUIDs by the UUID creator.
>
> perhaps...
>
>
> An object's identity is related to but distinct from its content.
>
> I can make a chunk of data, pull a UUID out of the air
> ("bf8847e8-1291-4c4b-be99-a0080183a62c") and label my data with that
> identifier.
>
> Suppose the content is "5\n". To maintain the integrity/fixity of
> that object, I take an MD5 checksum
> ("1dcca23355272056f04fe8bf20edfce0") and put that in my metadata.
>
> If someone gets a copy of object bf8847e8-1291-4c4b-be99-a0080183a62c
> they can re-run the MD5 algorithm on the content and verify they've
> got the same content I thought I sent them.
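
The fixity check described here can be sketched in a few lines of Python with the standard hashlib module. The content and the recorded MD5 value are taken from the message; the function name verify_fixity is a hypothetical helper, not anything from an existing tool.

```python
import hashlib

# Content and recorded checksum as given in the message.
content = b"5\n"
recorded_md5 = "1dcca23355272056f04fe8bf20edfce0"


def verify_fixity(data: bytes, expected_md5: str) -> bool:
    """Recompute the MD5 of the received bytes and compare it
    with the checksum stored in the metadata."""
    return hashlib.md5(data).hexdigest() == expected_md5


# The recipient runs the same check on their copy.
print(verify_fixity(content, recorded_md5))

# A copy that lost its trailing newline no longer matches.
print(verify_fixity(b"5", recorded_md5))
```

Note that the check says nothing about which object the bytes belong to; it only confirms that the bytes are the ones the sender recorded.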
>
>
> You can make another object, and by some off chance (maybe you are
> recreating my research, and have a perfect copy of my environment) you
> make an object with the same content. You create a UUID for your
> object (beabc557-43ce-4204-822b-59b7ed7692e5). Since the content of
> your object matches the content of my object, the MD5 of your object
> just happens to be the same as mine.
>
> Now what is the provenance of object
> bf8847e8-1291-4c4b-be99-a0080183a62c? It was created by "Curt" on
> host "curthost" at time 12:30pm, etc. etc.
>
> What is the provenance of object beabc557-43ce-4204-822b-59b7ed7692e5?
> It was created by "Chris" on host "chrishost" at time 12:35pm.
>
> They are different objects that happen to have the same content.
>
> This is a data model we are constructing; it is our choice to treat
> them as different objects, but I think it is reasonable to think of it
> that way. If you say they are the same object, the provenance issue,
> for example, can get confused.
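
One possible way to write down the data model described above, as a rough Python sketch. The UUIDs, creators, hosts, and times come from the example in the message; the DataObject class and its field names are assumptions made for illustration, not an agreed schema.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataObject:
    """Hypothetical record: identity and provenance kept apart from content."""
    content: bytes
    creator: str
    host: str
    created: str
    identifier: uuid.UUID = field(default_factory=uuid.uuid4)

    @property
    def md5(self) -> str:
        # Content-based checksum, used for fixity only.
        return hashlib.md5(self.content).hexdigest()


curt_obj = DataObject(b"5\n", "Curt", "curthost", "12:30pm",
                      uuid.UUID("bf8847e8-1291-4c4b-be99-a0080183a62c"))
chris_obj = DataObject(b"5\n", "Chris", "chrishost", "12:35pm",
                       uuid.UUID("beabc557-43ce-4204-822b-59b7ed7692e5"))

# Same bytes, so the checksums agree ...
same_content = curt_obj.md5 == chris_obj.md5
# ... but the identifiers, and the provenance hung on them, differ.
same_object = curt_obj.identifier == chris_obj.identifier

print(same_content, same_object)
```

Keying provenance on the identifier rather than on the checksum is what keeps the two creation histories from being merged in this sketch.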
>
> BTW, without distinct identifiers for these two objects I'm discussing
> in this simple example, I couldn't even write those paragraphs.
> That's why I think they are useful. For my simple example, I could
> easily call them 'x' and 'y', but UUID is really nice in that it can
> give you globally unique forever identifiers. Someone, somewhere will
> use 'x' or 'y' to identify something else, but no one ever will use
> those two UUIDs again.[1]
>
> Curt
>
> [1] without copying them from this message or otherwise maliciously
> creating them by some manner other than following the UUID
> creation requirements.
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
--
Dr. Christopher Lynnes NASA/GSFC, Code 610.2 phone: 301-614-5185