[Esip-preserve] ESIP Citation Guidelines - A Demonstration of Cryptographic Digest Divergence on Scientifically Identical Files Using Real Data

alicebarkstrom at frontier.com
Wed Oct 13 12:03:11 EDT 2010


In formal language, you've approximated the argument - and I appreciate that.

The issue with canonicalization is a practical one and intimately tied to
the politics of the communities involved.  

Consider radiosonde archives.
I'm aware that NCAR and GSFC have them, and I believe there's one in Europe.
Each has a slight variant of the raw records because they've used different
editing criteria for inclusion.  Those variants have persisted for many years
and show no signs of going away.

I know CERES produced two separate format variants: one in HDF for
external use, a second in a "native" format for internal use.  They are identical
for practical purposes.  I haven't checked, but both format variants may be
available through ASDC.

I remember trying to get an agreement on regional data formats for EOSDIS.
The parties gave up on reaching a common grid format after an agonizing
two-year effort.

CERES has a grid for regions that starts at the North Pole and wends southward;
ISCCP (and the Reynolds SST product) starts at the South Pole and wends northward.
[I don't know the convention for Ken's SST.]

Text preparation products have a number of long-standing format disagreements:
MS-Word (choose your version), Open Office, TeX, LaTeX (for which there are publisher-specific
variant templates).  Ditto for journal preferences on bibliographic reference formats
(kind of an odd comment on the hopes for increasing the use of citations).

Image files have their own variants (JPG, GIF, TIFF, PostScript, SVG), all slightly
different and not necessarily easy to map from one to another.

My conclusion is that the prospect of canonicalization is nil on the kind of scale
we need - particularly if the desire is to facilitate interdisciplinary data use.
Software to translate one community's conventions into another's is, of course, feasible.
I strongly suspect that community-by-community translations done on an as-needed
basis are cheaper than trying to achieve some sort of global agreement.

There are a number of rather subtle issues involved in the translation problem.

For one thing, it is important to decide what gets included in the transforms.
In HDF, array dimensions and element orderings are included in the file itself.
If the format is handled in FORTRAN, they are likely contained in the read routine,
being embedded in its FORMAT statements.  Are they critical?  If I don't have
them, but have decided that the geolocation elements are part of the scientific data,
then one can simply place the radiance data at their longitude and latitude and
reconstruct the image on that basis, as sketched below.
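
As a minimal sketch of that point (in Python, with made-up records and a grid
layout chosen purely for illustration): if each record carries its own latitude
and longitude, the reconstructed image is the same no matter how the records
were ordered in the file.

    import numpy as np

    # Hypothetical (latitude, longitude, radiance) records, in arbitrary order.
    records = [
        (41.0, -122.0, 215.3),
        (40.0, -122.0, 198.7),
        (41.0, -121.0, 220.1),
        (40.0, -121.0, 201.4),
    ]

    # A 1-degree grid chosen here, not read from the file, so the element
    # ordering in the file is irrelevant to the reconstruction.
    lat0, lon0 = 40.0, -122.0
    image = np.full((2, 2), np.nan)

    for lat, lon, radiance in records:
        i = int(lat - lat0)         # row index from geolocation
        j = int(lon - lon0)         # column index from geolocation
        image[i, j] = radiance

    # Shuffling 'records' before the loop produces the identical 'image'.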

I am also not sure what to make of case sensitivity.  If I were dealing with
books, the character case of the text would matter a great deal.  Much of the time
I don't think it does in scientific data.  You can make this issue more complex by having to deal
with "white space": is "MOUNT SHASTA" equivalent to "MOUNTSHASTA" or "Mount_Shasta"
or "MountShasta"?

As a kind of aside on this issue, your suggestion for handling the translation
is to map to one of the formats using what looks like a fairly constrained version of
the mapping (like the binary representation of 1.0 being translated into the
ASCII-encoded character string "1.0" if the original were in that form).  In the algorithm
I've been working on, I can demonstrate one-to-one mappings between the
highest-precision forms of the quantity (where the ASCII character string "1.0" gets mapped
to a double-precision value of 1.0 and compared with a similar conversion from a
single-precision value).  In this case, the mapping is an index into an array
whose elements record the value and the representation for the first form of the
data, and a second index that provides a "pointer" to the second form.  Similar
ideas, but different implementations.
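
A rough Python sketch of that bookkeeping (the table structure and names below
are mine, chosen for illustration, not taken from the algorithm itself):

    import struct

    # Each entry records the quantity promoted to double precision, along with
    # the representation it came from; indices into this table stand in for
    # the data elements themselves.
    table = []

    def register(raw, representation):
        if representation == "ascii":
            value = float(raw)                           # "1.0" -> double 1.0
        elif representation == "single":
            # round-trip through 32 bits to get the single-precision value
            value = struct.unpack("f", struct.pack("f", raw))[0]
        else:
            value = float(raw)                           # already double
        table.append((value, representation))
        return len(table) - 1                            # the index is the mapping

    i = register("1.0", "ascii")      # first form of the quantity
    j = register(1.0, "single")       # second form; j acts as a "pointer" to it
    print(table[i][0] == table[j][0])                    # True for 1.0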

I think it is fairly straightforward to embed the UUID in the file (perhaps using
the RFC 4122 Version 3 algorithm, i.e., building the UUID from the file name)
and then compute the cryptographic digest of the file with the embedded UUID.
If the UUID and the digest are registered at the same time, that could provide
a fairly secure way of identifying the file and ensuring that it hasn't been
altered.
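
A rough sketch of that sequence in Python; the file name is hypothetical, and
the "embedding" here is a naive append, where a real format would carry the
UUID in its own metadata fields:

    import hashlib
    import uuid

    filename = "ghcn_precip_subset.dat"      # hypothetical file name

    # RFC 4122 Version 3: a name-based UUID built from the file name.
    file_uuid = uuid.uuid3(uuid.NAMESPACE_URL, filename)

    with open(filename, "rb") as f:
        contents = f.read()

    # Embed the UUID (naively appended here) and digest the result.
    embedded = contents + b"\nUUID: " + str(file_uuid).encode("ascii")
    digest = hashlib.sha256(embedded).hexdigest()

    # Registering (file_uuid, digest) together ties the identifier to exactly
    # these bits; any later alteration of the file changes the digest.
    print(file_uuid, digest)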

Finally, it's true that anyone familiar with the properties of cryptographic
digests would reach the obvious conclusion I did.  However, this portion of the paper
is intended to be taken in the same spirit as the very useful demonstration
you've provided of the complexities of production and versioning.  The examples
in the text can provide test cases to demonstrate mapping algorithms, just as
your examples are quite useful for testing the sensitivity of identifier schemas
to perturbations.

Bruce B.

----- Original Message -----
From: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
To: esip-preserve at lists.esipfed.org
Sent: Wednesday, October 13, 2010 8:33:37 AM
Subject: Re: [Esip-preserve] ESIP Citation Guidelines - A Demonstration of Cryptographic Digest Divergence on Scientifically Identical Files Using Real Data

On 10/12/10 21:14, alicebarkstrom at frontier.com wrote:
> The attached file provides a writeup of creating cryptographic
> digests on a very small subset of data from NOAA's Global Historical
> Climate Network monthly average precipitation data and then
> perturbing the arrangement of the data elements in the data and
> metadata files.

I can summarize the paper with the definition of a cryptographic hash:
If you take the hash of two different sets of bits, you get a
different hash.  (well, a very high probability of getting a different
hash anyway)
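
For concreteness, a minimal Python illustration of that property (made-up byte
strings that differ only in formatting):

    import hashlib

    a = b"PRCP 1981 01  123"       # made-up record
    b = b"PRCP 1981 01 123 "       # same values, whitespace shifted

    print(hashlib.sha256(a).hexdigest())
    print(hashlib.sha256(b).hexdigest())    # different digest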

With all due respect, this is not surprising to anyone in the field...


But suppose I went through each of your transformations and proposed a
canonicalization that rendered the same set of resulting bits from
each of the scientifically equivalent representations of the data.

For example:

Consider the starting file as a collection of logical pieces of
information:

X = { x0, x1, x2, x3, ... }

You are proposing various transformations with two properties:
1. They don't affect the scientific equivalence class of the data.
2. They do affect the physical bits.

So a transformed file looks like this:

Y = { y0, y1, y2, y3, ... }

where yi = t(xi), with those properties holding for t(x),
and more generally Y = t(X).

Then you take a hash of X and Y and not surprisingly get different
hashes:

H(X) = { H(x0), H(x1), H(x2), ... }
H(Y) = { H(y0), H(y1), H(y2), ... }

Naturally, since we've assumed the property that t(x) changes the
bits, H(X) != H(Y).


Now suppose we add a canonicalization function with this property:
1. Data in the same equivalence class result in the same bits.

Then C(x0) = C(y0) and H(C(x0)) = H(C(y0)) and H(C(X)) = H(C(Y))


Take your first case: say certain portions of X are text fields
whose scientific meaning we define to be case-insensitive.
The canonicalization function for those portions
could simply be to always capitalize them.

So, if x17 = "MOUNT SHASTA" and y17 = "Mount Shasta",
C(x17) = "MOUNT SHASTA" and C(y17) = "MOUNT SHASTA".  Since
C(x17) = C(y17), H(C(x17)) = H(C(y17)).
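
In code, a minimal sketch of that case (Python, with SHA-256 standing in for H):

    import hashlib

    def C(text):
        # canonicalization for case-insensitive text fields: always capitalize
        return text.upper()

    def H(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    x17 = "MOUNT SHASTA"
    y17 = "Mount Shasta"

    assert H(x17) != H(y17)          # raw bits differ, so digests differ
    assert H(C(x17)) == H(C(y17))    # canonical forms agree, so digests agree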


> I'll note that this example appears (to me at least) to effectively
> negate the UNF approach suggested in Altman and King's paper on
> scholarly citation of quantitative numeric data.  While Altman's UNF
> paper is reasonably good on the numeric properties of the method, it
> makes what I regard as an unacceptable assumption - namely that
> there will be a practical way to negotiate a "canonical" arrangement
> of representations and data element ordering that applies to all
> Earth science data.  The UNF relies on cryptographic digests of the
> files, although the computed digests are modified to create the
> UNFs.

You can argue that coming up with a C(x) canonicalization isn't
practical for our data (I won't even disagree :-) I sure don't want to
do it myself), but your paper doesn't present that argument, or even
address the point.  Your conclusion simply assumes it is true.

As Altman demonstrates for his field, it is certainly conceivable.

I'm also not certain that we have to develop something that "applies
to all Earth science data" to be useful.  Perhaps we can come up with
something reasonable for a subset, for example, annotated files in one
of the self-describing formats (HDF/NetCDF/etc.) where the annotations
can contribute to the canonicalization process (i.e. you tag text
fields with a property that says "case-insensitive canonicalization of
this field will maintain scientific equivalence").

Again, I'm not saying that it isn't true, I'm simply pointing out that
the example and arguments presented in your paper don't demonstrate or
prove that it isn't true.

> On the other hand, Ken's note on embedding the UUID's in the files
> appears to stand - and might be made more robust by including a
> cryptographic digest of the file after the UUID is embedded with the
> distributed file.

We do that for OMI files. It is difficult to embed the digest of the
whole file into the actual file since it would alter the file's
content and therefore the digest.

We calculate the RFC 1321 MD5 of the whole data file at the point of
creation, and store it in the ECS Inventory Metadata
ECSDATAGRANULE.LOCALVERSIONID field in the associated .met file.
Every time we transfer the file around or use it, we re-verify the
bit-for-bit contents against that MD5.
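
A rough Python sketch of that kind of check (the file name is a placeholder and
the .met bookkeeping is omitted; this is not the actual ECS interface):

    import hashlib

    def md5_of(path):
        # RFC 1321 MD5 of the whole file, streamed in chunks.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    data_file = "OMI_granule.he5"            # placeholder file name
    recorded = md5_of(data_file)             # computed at creation and stored

    # After any transfer, re-verify the bit-for-bit contents.
    if md5_of(data_file) != recorded:
        raise RuntimeError("file contents changed since creation")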

Curt
_______________________________________________
Esip-preserve mailing list
Esip-preserve at lists.esipfed.org
http://www.lists.esipfed.org/mailman/listinfo/esip-preserve

