[Esip-preserve] On Earth Science Data File Uniqueness
Curt Tilmes
Curt.Tilmes at nasa.gov
Wed Feb 9 12:39:35 EST 2011
On 02/09/11 12:19, Lynnes, Christopher S. (GSFC-6102) wrote:
> I should have phrased that: has someone asserted that data items A
> and B are bitwise identical, i.e., by assigning a UUID. BTW, I
> thought our preferred method for assigning UUIDs was to derive them
> from the SHA-1, anyway?
SHA-1 can be part of making UUIDs (a version 5 UUID hashes a namespace
and a name with SHA-1), but that hash isn't really related to the
content of the object being identified.
Content-based identifiers are very useful for some things, but that
isn't what we are talking about here (perhaps we should be?)
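To make the distinction concrete, here is a minimal Python sketch of three
kinds of identifier. The URL passed to uuid5 is a made-up name used only
for illustration; nothing in this thread prescribes it.

```python
import hashlib
import uuid

content = b"5\n"

# Version 4 UUID: random bits, assigned without looking at the content.
random_id = uuid.uuid4()

# Version 5 UUID: SHA-1 over a namespace plus a *name*.
# The name below is a hypothetical label, not the object's content.
name_id = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/data/granule-001")

# Content-based identifier: a digest of the bytes themselves.
content_id = hashlib.sha1(content).hexdigest()

print(random_id, name_id, content_id, sep="\n")
```

Only the last value changes when the bytes change; the first two identify
the object regardless of what it contains.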
>> UUID is just a way to make an identifier that is globally unique
>> forever [1] and easily recognizable as a UUID.
> OK, if it doesn't answer the question, are they identical, what
> question does the UUID answer??? Just having a unique identifier in
> and of itself is not intrinsically useful.
That's Bruce's argument.
>> They can be assigned to the object arbitrarily without regard to
>> content.
>
> in that case, we have a misuse of UUIDs by the UUID creator.
Perhaps...
An object's identity is related to, but distinct from, its content.
I can make a chunk of data, pull a UUID out of the air
("bf8847e8-1291-4c4b-be99-a0080183a62c") and label my data with that
identifier.
Suppose the content is "5\n". To maintain the integrity/fixity of
that object, I compute an MD5 checksum
("1dcca23355272056f04fe8bf20edfce0") and put that in my metadata.
If someone gets a copy of object bf8847e8-1291-4c4b-be99-a0080183a62c
they can re-run the MD5 algorithm on the content and verify they've
got the same content I thought I sent them.
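The fixity check just described could be sketched in Python roughly as
follows. The function names and the metadata dictionary layout are
assumptions made for this illustration, not an agreed format.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the MD5 checksum of the content as a hex string."""
    return hashlib.md5(content).hexdigest()


def verify_fixity(content: bytes, recorded_md5: str) -> bool:
    """True if the received bytes match the checksum stored in the metadata."""
    return md5_hex(content) == recorded_md5.lower()


# Hypothetical metadata record: the identifier is assigned arbitrarily,
# the checksum is computed from the content.
metadata = {
    "id": "bf8847e8-1291-4c4b-be99-a0080183a62c",
    "md5": "1dcca23355272056f04fe8bf20edfce0",
}

received = b"5\n"
print(verify_fixity(received, metadata["md5"]))
```

The identifier says which object you were sent; the checksum says whether
the bytes arrived intact.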
You can make another object, and by some off chance (maybe you are
recreating my research, and have a perfect copy of my environment) you
make an object with the same content. You create a UUID for your
object (beabc557-43ce-4204-822b-59b7ed7692e5). Since the content of
your object matches the content of my object, the MD5 of your object
is the same as mine.
Now what is the provenance of object
bf8847e8-1291-4c4b-be99-a0080183a62c? It was created by "Curt" on
host "curthost" at time 12:30pm, etc. etc.
What is the provenance of object beabc557-43ce-4204-822b-59b7ed7692e5?
It was created by "Chris" on host "chrishost" at time 12:35pm.
They are different objects that happen to have the same content.
This is a data model we are constructing, so it is our choice to treat
them as different objects, and I think it is reasonable to think of it
that way. If you say they are the same object, the provenance, for
example, can get confused.
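One way to picture this data model is the Python sketch below. The
DataObject class, its field names, and the two helper functions are
hypothetical, chosen only to mirror the example in this message.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    identifier: str   # UUID assigned without regard to content
    content: bytes
    creator: str      # provenance fields
    host: str
    created_at: str

    @property
    def md5(self) -> str:
        return hashlib.md5(self.content).hexdigest()


def same_content(a: DataObject, b: DataObject) -> bool:
    """Compare what the objects contain."""
    return a.md5 == b.md5


def same_object(a: DataObject, b: DataObject) -> bool:
    """Compare which objects they are."""
    return a.identifier == b.identifier


curt_obj = DataObject("bf8847e8-1291-4c4b-be99-a0080183a62c",
                      b"5\n", "Curt", "curthost", "12:30pm")
chris_obj = DataObject("beabc557-43ce-4204-822b-59b7ed7692e5",
                       b"5\n", "Chris", "chrishost", "12:35pm")

print(same_content(curt_obj, chris_obj), same_object(curt_obj, chris_obj))
```

Keying provenance on the identifier rather than on the checksum keeps the
two creation histories separate even though the bytes are identical.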
BTW, without distinct identifiers for these two objects I'm discussing
in this simple example, I couldn't even write those paragraphs.
That's why I think they are useful. For my simple example, I could
easily call them 'x' and 'y', but a UUID is really nice in that it gives
you an identifier that is globally unique forever. Someone, somewhere will
use 'x' or 'y' to identify something else, but no one will ever use
those two UUIDs again.[1]
Curt
[1] without copying them from this message or otherwise maliciously
creating them by some manner other than following the UUID
creation requirements.