[Esip-preserve] On Earth Science Data File Uniqueness

Mon Feb 14 17:00:26 EST 2011

On Feb 14, 2011, at 4:55 PM, Ruth Duerr wrote:

> Chris,
> 
> Only if you want the recommendation to stray beyond the boundaries of identifiers.  While I think that could be very appropriate for NASA TIWG, best ESIP data management practices, etc. it clearly was beyond the scope of that poor not-so-little paper.
> 
> - Ruth
> 

Well, it's up to you all, of course, but you should at least allude to the question that a UUID does not answer (whether two files are bitwise identical) and note that digital signatures would be required to answer it.

(I personally would consider a digital signature to be a kind of identifier, or at least a part of the identifier.  One could even propose a hybrid identifier, consisting of a UUID and a digital signature in concert...)
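
To make the hybrid idea concrete, here is a minimal sketch, not a proposal for any particular format. The names HybridId, make_hybrid_id, same_object and same_content are hypothetical, and the choice of a random (version 4) UUID and SHA-256 is an assumption; the thread itself only mentions MD5 and SHA-1.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridId:
    """Hypothetical pairing of an object identifier with a content digest."""
    object_id: uuid.UUID   # answers "is this the same object?"
    content_digest: str    # answers "are the bytes the same?"


def make_hybrid_id(content: bytes) -> HybridId:
    # Assumption: a random UUID for identity, SHA-256 for the content digest.
    return HybridId(uuid.uuid4(), hashlib.sha256(content).hexdigest())


def same_object(x: HybridId, y: HybridId) -> bool:
    return x.object_id == y.object_id


def same_content(x: HybridId, y: HybridId) -> bool:
    return x.content_digest == y.content_digest


# Two distinct files that happen to hold identical bytes.
file_a = make_hybrid_id(b"\x00" * 16)
file_b = make_hybrid_id(b"\x00" * 16)
```

With this pairing, same_object(file_a, file_b) is false while same_content(file_a, file_b) is true, so each half answers the question the other cannot.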

> 
> On Feb 14, 2011, at 2:46 PM, Lynnes, Christopher S. (GSFC-6102) wrote:
> 
>> Curt,
>> This is a good example of why hash signatures are not enough to identify data, though I would simplify it for any recommendation. Essentially, the fact that it is easy to create two PNG or even flat binary files (without internal provenance metadata) that are both all fill values with the same array sizes is enough to demonstrate that signatures alone do not do the trick.
>> However, the earlier argument that UUIDs alone cannot answer whether File A = File B (bitwise) suggests that any recommendation should be that both UUID and digital signature must be used together, yes?
>> 
>> On Feb 14, 2011, at 3:34 PM, Curt Tilmes wrote:
>> 
>>> On 02/09/11 14:58, Ruth Duerr wrote:
>>> 
>>> Caveat: I'm not a UUID expert (though I'm considering reading up more
>>> on them...)
>>> 
>>>> I've heard good compelling arguments for two competing best
>>>> practices for using UUIDs (Chris' use of a message digest form of
>>>> UUID and Curt's pure Unique Identifier form of UUID).
>>> 
>>> There are variants of the UUID algorithm that incorporate digital
>>> signature algorithms, including MD5 (UUID version 3) and SHA-1 (UUID
>>> version 5).  Those algorithms are used to make UUIDs, but are still
>>> unrelated to the actual content of the object being identified.
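
A small sketch of that point, using Python's standard uuid module. The name string is a made-up example; the version 3 and version 5 UUIDs are computed from a namespace and a name, so the bytes of the file never enter into it.

```python
import uuid

# Hypothetical name for the object; any stable string would do.
name = "http://example.org/data/a.png"

v3 = uuid.uuid3(uuid.NAMESPACE_URL, name)  # MD5 over namespace + name
v5 = uuid.uuid5(uuid.NAMESPACE_URL, name)  # SHA-1 over namespace + name
v4 = uuid.uuid4()                          # random, no hashing of a name

# The same namespace and name always give the same UUID,
# whatever bytes the file called a.png happens to contain.
repeat_v5 = uuid.uuid5(uuid.NAMESPACE_URL, name)

print(v3.version, v5.version, v4.version)  # prints: 3 5 4
```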
>>> 
>>> There are other approaches to distinguishing objects strictly by their
>>> content that use a digital signature of the content to make a unique
>>> identifier for the object.  Such schemes can only be used if there is
>>> never a need to distinguish objects with the same content.
>>> 
>>> Digital signatures of the content are a good way to verify the
>>> integrity/fixity of the content.  Such a use is orthogonal to whether
>>> or not that digital signature is the identifier of the object.
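
As an illustration of the fixity use, here is a minimal sketch assuming the digest was recorded somewhere when the file was archived. The function names file_digest and verify_fixity are hypothetical, and SHA-1 is only a default chosen for the example.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Hex digest of a file's bytes, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_fixity(path, recorded_digest, algorithm="sha1"):
    """True if the file still matches the digest recorded earlier."""
    return file_digest(path, algorithm) == recorded_digest
```

Nothing here depends on how the file is identified; the recorded digest is just an attribute stored alongside whatever identifier is in use.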
>>> 
>>> 
>>> For me, the question boils down to one issue.  Will distinct objects
>>> under the scope of this data model ever have the same content or not?
>>> 
>>> 
>>> To illustrate:
>>> 
>>> Suppose I have two HDF files, "a" and "b".
>>> 
>>> Suppose we have a process "P" that gets a subset of an HDF data file
>>> and produces a small image file from it.
>>> 
>>> I apply P to "a" and produce image a.png.  Call this "Job 1".
>>> 
>>> I apply P to "b" and produce image b.png.  Call this "Job 2".
>>> 
>>> Just by chance, process P happens to pick an area of the data that
>>> happens to be all black.  The content of a.png is equal to the content
>>> of b.png.
>>> 
>>> Now we're going to try to make identifiers for the two image files and
>>> store them in our database.
>>> 
>>> With UUIDs, I make two identifiers:
>>> 5b849030-d964-44ef-a2a1-e3e20cd18637
>>> d4cd7a92-9418-440d-9c39-a359d2b55944
>>> 
>>> I record the fact that 5b849030-d964-44ef-a2a1-e3e20cd18637 was
>>> generated by "Job 1", using file "a" as an input.  (Well, file "a"
>>> would have a UUID too, but you get the picture.)
>>> 
>>> I record the fact that d4cd7a92-9418-440d-9c39-a359d2b55944 was
>>> generated by "Job 2", using file "b" as an input.
>>> 
>>> I can clearly distinguish the two files by their unique identifiers.
>>> 
>>> With an identifier derived from the content, I get an MD5 (e.g.) for
>>> the first object, 3da607d21285eb08f7d40ef8dd028d35, and store the fact
>>> that 3da607d21285eb08f7d40ef8dd028d35 was generated by "Job 1", using
>>> file "a" as an input.
>>> 
>>> When I make the second file, I get the same object, with the same
>>> identifier.  I can't put it in my database.  I can't associate that
>>> object with "Job 2".  (Well you can, but the data model gets really
>>> messy -- Your DAGs aren't right any more.)  Then try to query "Who
>>> created the object?"  -- You get two answers!  Tomorrow you may get
>>> three answers!
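
The scenario can be reproduced in a few lines. This is a sketch under stated assumptions: the table names and columns are invented, the byte strings merely stand in for the two identical PNG files, and an in-memory SQLite database plays the part of "our database".

```python
import hashlib
import sqlite3
import uuid

# Stand-ins for a.png and b.png: assumed byte-identical, as in the all-black case.
a_png = b"\x00" * 64
b_png = b"\x00" * 64


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE by_digest (id TEXT PRIMARY KEY, job TEXT)")
db.execute("CREATE TABLE by_uuid (id TEXT PRIMARY KEY, digest TEXT, job TEXT)")

# Content-derived identifier: the second insert collides with the first.
db.execute("INSERT INTO by_digest VALUES (?, ?)", (md5_hex(a_png), "Job 1"))
try:
    db.execute("INSERT INTO by_digest VALUES (?, ?)", (md5_hex(b_png), "Job 2"))
    digest_collision = False
except sqlite3.IntegrityError:
    digest_collision = True

# Random UUID as identifier: both rows go in, with the digest kept as an attribute.
for data, job in ((a_png, "Job 1"), (b_png, "Job 2")):
    db.execute(
        "INSERT INTO by_uuid VALUES (?, ?, ?)",
        (str(uuid.uuid4()), md5_hex(data), job),
    )

uuid_rows = db.execute("SELECT COUNT(*) FROM by_uuid").fetchone()[0]
print(digest_collision, uuid_rows)  # prints: True 2
```

Dropping the primary key on by_digest would let the second row in, but then a query on that digest returns both jobs, which is the "two answers" problem described above.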
>>> 
>>> 
>>> If we all agree that no process ever run under the scope of our data
>>> model will ever create distinct objects with the same content
>>> (including my old friend d41d8cd98f00b204e9800998ecf8427e, the MD5
>>> of an empty file), then we can accept that a digital signature of the
>>> content is sufficient to distinguish the objects.  If we allow for the
>>> possibility of duplicating the content in distinct objects, then we
>>> need some identifier other than the digital signature of the content.
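
One way to test that assumption against a real holding is to look for distinct objects that share a digest. The sketch below is illustrative only: find_shared_content is a hypothetical helper, and the object names and contents are invented, with two empty outputs standing in for the empty-file case.

```python
import hashlib
from collections import defaultdict


def find_shared_content(objects):
    """Return {md5 digest: [names]} for every digest shared by two or more objects."""
    by_digest = defaultdict(list)
    for name, data in objects.items():
        by_digest[hashlib.md5(data).hexdigest()].append(name)
    return {d: names for d, names in by_digest.items() if len(names) > 1}


# Invented example holding: two empty outputs and one non-empty file.
objects = {"job1.log": b"", "job2.log": b"", "a.hdf": b"granule a"}
shared = find_shared_content(objects)
```

If the result is non-empty, as it is here for the two empty files, the digest alone cannot serve as the identifier for that holding.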
>>> 
>>> Curt
>>> _______________________________________________
>>> Esip-preserve mailing list
>>> Esip-preserve at lists.esipfed.org
>>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>> 
>> --
>> Dr. Christopher Lynnes     NASA/GSFC, Code 610.2    phone: 301-614-5185
>> 
>> 
> 

--
Dr. Christopher Lynnes     NASA/GSFC, Code 610.2    phone: 301-614-5185