[Esip-preserve] On Earth Science Data File Uniqueness

Mon Feb 14 15:34:44 EST 2011

On 02/09/11 14:58, Ruth Duerr wrote:

Caveat: I'm not a UUID expert (though I'm considering reading up more
on them...)

> I've heard good compelling arguments for two competing best
> practices for using UUID's (Chris' use of a message digest form of
> UUID and Curt's pure Unique Identifier form of UUID).

There are variants of the UUID algorithm that incorporate message
digest (hash) algorithms, including MD5 (UUID version 3) and SHA-1
(UUID version 5).  Those algorithms hash a namespace and a name to
make the UUID, so the result is still unrelated to the actual content
of the object being identified.
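
A rough sketch of that point, using Python's standard uuid module.
The name string below is made up for illustration; nothing here is
meant as anyone's actual scheme.  A version 3 or 5 UUID is repeatable
for a given namespace and name, but it says nothing about the bytes
of the object unless you choose to feed the content in as the name.

```python
import uuid

# Hypothetical name for an object (an assumption, not a real archive URL).
name = "http://example.org/data/a.hdf"

v3 = uuid.uuid3(uuid.NAMESPACE_URL, name)  # MD5-based, UUID version 3
v5 = uuid.uuid5(uuid.NAMESPACE_URL, name)  # SHA-1-based, UUID version 5

# The same namespace and name always produce the same UUID.
v5_again = uuid.uuid5(uuid.NAMESPACE_URL, name)

print(v3, v5, v5 == v5_again)
```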

There are other approaches to distinguishing objects strictly by their
content that use a message digest of the content to make a unique
identifier for the object.  Such schemes can only be used if there is
never a need to distinguish objects with the same content.

Message digests of the content are a good way to verify the
integrity/fixity of the content.  Such a use is orthogonal to whether
or not that digest is the identifier of the object.
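
To make the "orthogonal" part concrete, here is a minimal sketch in
which the identifier and the fixity digest are two separate fields.
The record layout and the is_intact helper are assumptions for
illustration only.

```python
import hashlib
import uuid

content = b"example file content"  # stand-in for the real file bytes

record = {
    "id": str(uuid.uuid4()),                  # identifier, unrelated to content
    "md5": hashlib.md5(content).hexdigest(),  # fixity value, derived from content
}

def is_intact(data: bytes, rec: dict) -> bool:
    """Return True if data still matches the digest stored in rec."""
    return hashlib.md5(data).hexdigest() == rec["md5"]

print(is_intact(content, record))
```

The digest answers "has this content changed?" and the identifier
answers "which object is this?"; neither depends on the other.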

For me, the question boils down to one issue.  Will distinct objects
under the scope of this data model ever have the same content or not?

To illustrate:

Suppose I have two HDF files, "a" and "b".

Suppose we have a process "P" that gets a subset of an HDF data file
and produces a small image file from it.

I apply P to "a" and produce image a.png.  Call this "Job 1".

I apply P to "b" and produce image b.png.  Call this "Job 2".

Just by chance, process P happens to pick an area of the data that
happens to be all black.  The content of a.png is equal to the content
of b.png.

Now we're going to try to make identifiers for the two image files and
store them in our database.

With UUIDs, I make two identifiers:
  5b849030-d964-44ef-a2a1-e3e20cd18637
  d4cd7a92-9418-440d-9c39-a359d2b55944

I record the fact that 5b849030-d964-44ef-a2a1-e3e20cd18637 was
generated by "Job 1", using file "a" as an input.  (Well, file "a"
would have a UUID too, but you get the picture.)

I record the fact that d4cd7a92-9418-440d-9c39-a359d2b55944 was
generated by "Job 2", using file "b" as an input.

I can clearly distinguish the two files by their unique identifiers.
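
A small sketch of the UUID case, assuming a plain dictionary stands in
for the database and a byte string stands in for the all-black image:

```python
import uuid

# Stand-in bytes; both jobs happen to produce identical content.
a_png = b"stand-in bytes for an all-black image"
b_png = b"stand-in bytes for an all-black image"

provenance = {}  # hypothetical store: identifier -> how it was made

id_a = str(uuid.uuid4())
provenance[id_a] = {"job": "Job 1", "input": "a"}

id_b = str(uuid.uuid4())
provenance[id_b] = {"job": "Job 2", "input": "b"}

# Two records, one per object, even though the content is the same.
print(len(provenance), a_png == b_png)
```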

With an identifier derived from the content, I get an MD5 (e.g.) for
the first object, 3da607d21285eb08f7d40ef8dd028d35, and store the fact
that 3da607d21285eb08f7d40ef8dd028d35 was generated by "Job 1", using
file "a" as an input.

When I make the second file, I get the same object, with the same
identifier.  I can't put it in my database.  I can't associate that
object with "Job 2".  (Well, I can, but the data model gets really
messy -- the DAGs aren't right any more.)  Then try to query "Who
created the object?"  -- You get two answers!  Tomorrow you may get
three answers!
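
Here is a sketch of the messy variant, again with a dictionary
standing in for the database.  The record_object helper and the choice
to keep a list of creators per digest are assumptions made only to
show where the ambiguity comes from.

```python
import hashlib

black_png = b"stand-in bytes for an all-black image"

provenance = {}  # hypothetical store: content digest -> list of creators

def record_object(content: bytes, job: str, input_file: str) -> str:
    """Key the record by the MD5 of the content and return that key."""
    key = hashlib.md5(content).hexdigest()
    provenance.setdefault(key, []).append({"job": job, "input": input_file})
    return key

key_a = record_object(black_png, "Job 1", "a")
key_b = record_object(black_png, "Job 2", "b")

# "Who created the object?" now has more than one answer.
creators = [entry["job"] for entry in provenance[key_a]]
print(key_a == key_b, creators)
```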

If we all agree that no process ever run under the scope of our data
model will ever create distinct objects with the same content
(including my old friend d41d8cd98f00b204e9800998ecf8427e -- the MD5
of an empty file), then we can accept that a message digest of the
content is sufficient to distinguish the objects.  If we allow for the
possibility of duplicating the content in distinct objects, then we
need some identifier other than the message digest of the content.
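
The empty-file case is easy to check with Python's hashlib.  Every
empty file, whatever job produced it, gets the same digest:

```python
import hashlib

# MD5 of zero bytes of input, i.e. an empty file.
empty_md5 = hashlib.md5(b"").hexdigest()
print(empty_md5)
```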

Curt