[Esip-preserve] ESIP Citation Guidelines - A Demonstration of Cryptographic Digest Divergence on Scientifically Identical Files Using Real Data

Curt Tilmes Curt.Tilmes at nasa.gov
Wed Oct 13 08:33:37 EDT 2010


On 10/12/10 21:14, alicebarkstrom at frontier.com wrote:
> The attached file provides a writeup of creating cryptographic
> digests on a very small subset of data from NOAA's Global Historical
> Climate Network monthly average precipitation data and then
> perturbing the arrangement of the data elements in the data and
> metadata files.

I can summarize the paper with the definition of a cryptographic hash:
if you take the hash of two different sets of bits, you get a
different hash (or, strictly speaking, you get one with very high
probability).

With all due respect, this is not surprising to anyone in the field...


But suppose I went through each of your transformations and proposed a
canonicalization that rendered the same set of resulting bits from
each of the scientifically equivalent representations of the data.

For example:

Consider the starting file as a collection of logical pieces of
information:

X = { x0, x1, x2, x3, ... }

You are proposing various transformations with two properties:
1. They don't affect the scientific equivalence class of the data.
2. They do affect the physical bits.

So a transformed file looks like this:

Y = { y0, y1, y2, y3, ... }

where yi = t(xi) for a transformation t with those two properties,
and more generally Y = t(X).

Then you take a hash of X and Y and not surprisingly get different
hashes:

H(X) = { H(x0), H(x1), H(x2), ... }
H(Y) = { H(y0), H(y1), H(y2), ... }

Naturally, since we've assumed the property that t(x) changes the
bits, H(X) != H(Y).
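
To make that concrete, here is a minimal Python sketch.  The choice of
SHA-256, the second station name, and the use of str.title as the
transformation t are my own stand-ins for illustration, not anything
taken from the paper:

```python
import hashlib

def H(piece: str) -> str:
    """Hash one logical piece of information (SHA-256 is an assumed choice)."""
    return hashlib.sha256(piece.encode("utf-8")).hexdigest()

def t(piece: str) -> str:
    """Hypothetical transformation: changes the bits (letter case) only."""
    return piece.title()

# X = { x0, x1, ... }; the second value is made up for illustration.
X = ["MOUNT SHASTA", "CRATER LAKE"]
Y = [t(x) for x in X]

H_X = [H(x) for x in X]
H_Y = [H(y) for y in Y]

# Every piece changed its bits, so every per-piece hash differs.
for x, y, hx, hy in zip(X, Y, H_X, H_Y):
    print(f"{x!r} -> {y!r}: hashes equal? {hx == hy}")
```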


Now suppose we add a canonicalization function with this property:
1. Data in the same equivalence class result in the same bits.

Then C(x0) = C(y0) and H(C(x0)) = H(C(y0)) and H(C(X)) = H(C(Y))


Take your first case.  Suppose certain portions of X are text fields
whose scientific meaning we define to be case insensitive.  The
canonicalization function for those portions could simply be to
convert them to upper case.

So, if x17 = "MOUNT SHASTA" and y17 = "Mount Shasta",
C(x17) = "MOUNT SHASTA" and C(y17) = "MOUNT SHASTA".  Since
C(x17) = C(y17), H(C(x17)) = H(C(y17)).
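
The same example as a short Python sketch.  The function names C and H
mirror the notation above; SHA-256 is an assumed hash choice, and
uppercasing is only one possible canonicalization for case-insensitive
text:

```python
import hashlib

def C(field: str) -> str:
    """Canonicalize a case-insensitive text field by uppercasing it."""
    return field.upper()

def H(field: str) -> str:
    """Hash the UTF-8 bytes of a field (SHA-256 is an assumed choice)."""
    return hashlib.sha256(field.encode("utf-8")).hexdigest()

x17 = "MOUNT SHASTA"
y17 = "Mount Shasta"

raw_match = H(x17) == H(y17)              # different bits, different hash
canonical_match = H(C(x17)) == H(C(y17))  # same canonical bits, same hash

print("raw hashes equal:      ", raw_match)
print("canonical hashes equal:", canonical_match)
```

For text outside plain ASCII, str.casefold() may be a safer basis for
case-insensitive comparison than str.upper().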


> I'll note that this example appears (to me at least) to effectively
> negate the UNF approach suggested in Altman and King's paper on
> scholarly citation of quantitative numeric data.  While Altman's UNF
> paper is reasonably good on the numeric properties of the method, it
> makes what I regard as an unacceptable assumption - namely that
> there will be a practical way to negotiate a "canonical" arrangement
> of representations and data element ordering that applies to all
> Earth science data.  The UNF relies on cryptographic digests of the
> files, although the computed digests are modified to create the
> UNFs.

You can argue that coming up with a C(x) canonicalization isn't
practical for our data (I won't even disagree :-) I sure don't want to
do it myself), but your paper doesn't present that argument, or even
address the point.  Your conclusion simply assumes it is true.

As Altman demonstrates for his field, it is certainly conceivable.

I'm also not certain that we have to develop something that "applies
to all Earth science data" to be useful.  Perhaps we can come up with
something reasonable for a subset, for example, annotated files in one
of the self-describing formats (HDF/NetCDF/etc.) where the annotations
can contribute to the canonicalization process (i.e. you tag text
fields with a property that says "case-insensitive canonicalization of
this field will maintain scientific equivalence").
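
A rough Python sketch of what annotation-driven canonicalization might
look like.  Everything here is hypothetical: the "case_insensitive"
attribute name, the dict-of-fields stand-in for a self-describing
file, the name=value serialization, and the assumption that field
order carries no scientific meaning.  No real HDF or NetCDF API is
used:

```python
import hashlib

def canonical_bytes(fields: dict) -> bytes:
    """Serialize fields in a fixed order, honoring per-field annotations.

    `fields` maps a field name to (value, attributes).  Only fields
    tagged with the hypothetical "case_insensitive" attribute are
    uppercased; all other values are kept bit-for-bit.
    """
    lines = []
    for name in sorted(fields):  # assumes field order is not meaningful
        value, attrs = fields[name]
        if attrs.get("case_insensitive"):
            value = value.upper()
        lines.append(f"{name}={value}")
    return "\n".join(lines).encode("utf-8")

def canonical_digest(fields: dict) -> str:
    return hashlib.sha256(canonical_bytes(fields)).hexdigest()

tagged = {"case_insensitive": True}
file_a = {"station_name": ("MOUNT SHASTA", tagged), "units": ("mm", {})}
# Same content with different case in the tagged field and different order.
file_b = {"units": ("mm", {}), "station_name": ("Mount Shasta", tagged)}
# Case change in an untagged field: not declared equivalent.
file_c = {"station_name": ("Mount Shasta", tagged), "units": ("MM", {})}

print(canonical_digest(file_a) == canonical_digest(file_b))
print(canonical_digest(file_a) == canonical_digest(file_c))
```

The point of the tag is that the data producer, not the hashing tool,
declares which differences are scientifically irrelevant.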

Again, I'm not saying that a practical canonicalization exists; I'm
simply pointing out that the example and arguments presented in your
paper don't demonstrate or prove that one doesn't.

> On the other hand, Ken's note on embedding the UUID's in the files
> appears to stand - and might be made more robust by including a
> cryptographic digest of the file after the UUID is embedded with the
> distributed file.

We do that for OMI files. It is difficult to embed the digest of the
whole file into the actual file since it would alter the file's
content and therefore the digest.
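
A small Python illustration of why the naive approach fails.  The byte
string is a made-up stand-in for a file's contents, and appending the
hex digest is just one hypothetical way of "embedding" it:

```python
import hashlib

# Stand-in for the bytes of a data file (made up for illustration).
content = b"example data file contents\n"

digest_before = hashlib.md5(content).hexdigest()

# Naively embed the digest by appending it to the file's bytes.
embedded = content + digest_before.encode("ascii")

# The file now has different bits, so its digest is different too,
# and the embedded value no longer describes the file that carries it.
digest_after = hashlib.md5(embedded).hexdigest()
self_consistent = digest_before == digest_after

print("embedded digest still matches:", self_consistent)
```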

We calculate the RFC 1321 MD5 of the whole data file at the point of
creation, and store it in the ECS Inventory Metadata
ECSDATAGRANULE.LOCALVERSIONID field in the associated .met file.
Every time we transfer the file around or use it, we re-verify the
bit-for-bit contents against that MD5.
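
In outline, the check looks something like the Python sketch below.
This is not our production code: the function names and the temporary
file are hypothetical, and the recorded digest is held in a variable
rather than read from a .met file:

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, recorded_md5):
    """True if the file's current bits still match the recorded MD5."""
    return md5_of_file(path) == recorded_md5.lower()

with tempfile.TemporaryDirectory() as workdir:
    granule = os.path.join(workdir, "granule.dat")  # hypothetical name
    with open(granule, "wb") as handle:
        handle.write(b"example granule contents\n")

    # Record the digest once, at the point of creation.
    recorded = md5_of_file(granule)

    # Re-verify after a transfer or before use.
    ok_after_transfer = verify(granule, recorded)

    # Simulate corruption in transit by appending one stray byte.
    with open(granule, "ab") as handle:
        handle.write(b"\x00")
    ok_after_corruption = verify(granule, recorded)

print(ok_after_transfer, ok_after_corruption)
```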

Curt

