[Esip-preserve] ESIP Citation Guidelines - A Demonstration of Cryptographic Digest Divergence on Scientifically Identical Files Using Real Data

alicebarkstrom at frontier.com
Tue Oct 12 21:14:49 EDT 2010


The attached file provides a writeup of creating cryptographic digests on
a very small subset of data from NOAA's Global Historical Climate Network
monthly average precipitation data and then perturbing the arrangement of
the data elements in the data and metadata files.  I believe the study
clearly demonstrates that example (2) below can be proven for data that
are not in the same arrangement, provided that the original data
representation isn't tampered with.  I believe we can accept that the
demonstration would extend to cases in which floats or reals were used in
one of the two files, instead of just rearranging the order of the ASCII
encoding of the numerical values of the data elements.
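
As a rough illustration of the digest divergence described above, here is a
minimal Python sketch.  The records are made-up placeholders, not values
from the GHCN subset or the attached writeup, and SHA-256 is an assumption;
the digest algorithm used in the study is not named in this message.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of an ASCII text file's contents."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


# Hypothetical records: station id, year, three monthly values.
original = "STN001 1950 12 30 45\nSTN002 1950 7 0 19\n"
# The same two records with their order swapped.
reordered = "STN002 1950 7 0 19\nSTN001 1950 12 30 45\n"

# The byte streams differ, so the digests differ.
digests_differ = sha256_hex(original) != sha256_hex(reordered)

# Ignoring line order, the two files hold exactly the same records.
same_records = sorted(original.splitlines()) == sorted(reordered.splitlines())

print("digests differ:", digests_differ)
print("same records:  ", same_records)
```

The digest only sees the bytes, so any rearrangement changes it even though
the set of records is unchanged.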

I believe there is a fairly stringent computation that can answer example
conjecture (3), and I have at least prototype software that illustrates the
algorithm.  The statement below in (4) is a bit imprecise.  In principle
one could extend the algorithm for file-by-file intercomparison to conclude
whether the datasets are the same, although I'm not sure this is a pleasant
solution because of the computational resources involved.  The assumption
that if the identifiers are the same, the files must be the same does not
seem like a very reliable approach in a world where thieves change car
licence plates or steal credit card identities.
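
To make the idea of a content-level comparison concrete, here is one
possible sketch in Python.  It is not the prototype software mentioned
above, whose details are not given in this message; the record layout
(station, year, values) and the function names are assumptions made only
for illustration.

```python
def parse_records(text):
    """Map (station, year) -> tuple of values, ignoring line order
    and the textual form of each number (e.g. "12" vs "12.0")."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        station, year, *values = fields
        table[(station, int(year))] = tuple(float(v) for v in values)
    return table


def same_content(text_a, text_b):
    """True if both files hold the same keyed records with equal values."""
    return parse_records(text_a) == parse_records(text_b)


# Hypothetical files: b reorders a and writes the values as reals;
# c changes one data value.
file_a = "STN001 1950 12 30 45\nSTN002 1950 7 0 19\n"
file_b = "STN002 1950 7.0 0.0 19.0\nSTN001 1950 12.0 30.0 45.0\n"
file_c = "STN001 1950 12 30 46\nSTN002 1950 7 0 19\n"

print(same_content(file_a, file_b))
print(same_content(file_a, file_c))
```

This sketch compares values for exact equality; real data would probably
need an agreed tolerance, and every element of both files has to be read,
which is where the computational cost comes from.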

I'll note that this example appears (to me at least) to effectively negate
the UNF approach suggested in Altman and King's paper on scholarly citation
of quantitative numeric data.  While Altman's UNF paper is reasonably good
on the numeric properties of the method, it makes what I regard as an
unacceptable assumption - namely that there will be a practical way to 
negotiate a "canonical" arrangement of representations and data element
ordering that applies to all Earth science data.  The UNF relies on 
cryptographic digests of the files, although the computed digests are 
modified to create the UNFs.  On the other hand, Ken's note on embedding
the UUIDs in the files appears to stand - and might be made more robust
by including with the distributed file a cryptographic digest computed
after the UUID is embedded.
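
A minimal sketch of that combination, in Python, might look like the
following.  The header-line format, the function names, the use of a
random (version 4) UUID, and the choice of SHA-256 are all assumptions for
illustration; none of them comes from Ken's note.

```python
import hashlib
import uuid


def embed_uuid(payload: bytes, file_id: uuid.UUID) -> bytes:
    """Prepend a hypothetical '# UUID: ...' header line to the file body."""
    header = f"# UUID: {file_id}\n".encode("ascii")
    return header + payload


def digest_of(data: bytes) -> str:
    """SHA-256 digest of the file exactly as it will be distributed."""
    return hashlib.sha256(data).hexdigest()


# Made-up file body, not real GHCN data.
payload = b"STN001 1950 12 30 45\n"
file_id = uuid.uuid4()

# The digest is computed AFTER the UUID is embedded, and is published
# alongside the file (not inside it).
distributed = embed_uuid(payload, file_id)
published_digest = digest_of(distributed)

# A recipient recomputes the digest over the bytes received.
received_ok = digest_of(distributed) == published_digest

# Changing a data value without touching the UUID is still detected.
tampered = distributed.replace(b"45", b"46")
tamper_detected = digest_of(tampered) != published_digest

print(received_ok, tamper_detected)
```

The UUID says which file this claims to be, and the digest lets a
recipient check that the bytes carrying that UUID are unchanged.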

Any comments on this work will be gratefully received.

Bruce B.
----- Original Message -----
From: "Christopher S. Lynnes (GSFC-6102)" <christopher.s.lynnes at nasa.gov>
To: esip-preserve at lists.esipfed.org
Sent: Tuesday, October 12, 2010 8:00:11 AM
Subject: Re: [Esip-preserve] ESIP Citation Guidelines

I'm only half Scandinavian, so my view is not so pessimistic as Bruce's. However, as a would-be practitioner watching the debate, it looks like it has gone too far down in the weeds to have practical value to Joe Data Manager.  My suggestion is to divide and conquer by coming back to specific questions that can (and cannot yet) be answered.

For example:
(1) Can I prove that File A and B are the same file?  A: a cryptographic hash can do this (most of the time)
(2) Can I prove that File A and B contain the same data?  A: yes, if they are the same file. But see next question...
(3) Can I prove that File A and B do NOT contain the same data?  A: much more difficult, due to reformatting, reordering, etc.
(4) Are Dataset A and B the same?  A:  yes, if they have the same dataset identifier (e.g., a DOI)
(5) Did researcher A and B use the same data from datasets A and B?  A:  much more difficult to determine

What I have seen of the debate revolves mostly around questions 3 and 5.  Even though questions 1, 2, and 4 may seem too simple, degenerate or incomplete to be interesting from an academic standpoint, they do have some practical value in today's world of data management.  Perhaps you can couch any recommendations in terms of the questions that can be answered easily vs. those that are difficult to answer?
--
Dr. Christopher Lynnes    NASA/GSFC, Code 610.2, Greenbelt, MD 20771
Phone: 301-614-5185

-------------- next part --------------
A non-text attachment was scrubbed...
Name: GHCN_Example.pdf
Type: application/pdf
Size: 76463 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20101013/30687164/attachment-0001.pdf>