[Esip-preserve] ESIP Citation Guidelines

alicebarkstrom at frontier.com alicebarkstrom at frontier.com
Mon Oct 11 19:50:08 EDT 2010


See below.  I've tried to insert some indication of my comments,
using a quasi-XML tag starting with <brb> and ending with </brb>.

Bruce B.
----- Original Message -----
From: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
To: esip-preserve at lists.esipfed.org
Sent: Monday, October 11, 2010 5:22:54 PM
Subject: Re: [Esip-preserve] ESIP Citation Guidelines

On 10/11/2010 05:03 PM, alicebarkstrom at frontier.com wrote:
> At least from my perspective (probably gloomily of Scandinavian
> genetic predisposition), until we've got a threat analysis that
> moves in the direction of quantifying the probability of identifiers
> "coming loose" from the data itself, as well as the probability of
> detecting changes, and some approach to auditing for corruption, our
> job on this is far from done.

Yes.  We agree the job is far from done.

> I'll also note that I don't think we've done an adequate job of
> taking into account the difficulties of dealing with format and data
> order rearrangements.  I am quite certain that it is unfeasible to
> provide a draconian standardization of data formats and data file
> interpretations.  As a result, cryptographic digests only protect
> against tampering with the bits in a file - but they don't deal with
> the question of being able to uniquely identify two files with
> scientifically identical data that have different cryptographic
> digests (or bit-by-bit intercomparisons).

Yes, these are related to the OAIS "Fixity" requirement.
We need work there too.

<brb>If by "Fixity" you mean "can we verify that a file sent from an
archive has been unaltered since the last check on the cryptographic
digest", the answer is "yes" with a very high degree of probability
(even recognizing that MD5 and SHA-1 have known collision weaknesses).
However, that question is NOT the same as "does this file (or collection)
contain the same scientific values".  The answer to that question requires
something very different from cryptographic digests - or any other technique
that requires bit-by-bit agreement between two data collections.

If you require "bit-for-bit" comparisons, then the file being compared
to the "original" must be identical to it in both the data representation
and the ordering of numeric values.  To put it another way, you cannot
alter the data representation (from byte integers to ints to long ints;
from floats to doubles or even ASCII encodings of float values) nor can
you allow any reordering of the numerical values.  Violation of either of
these conditions will cause the cryptographic digests to be different.

In addition, encodings for missing or "odd" values must be identical
between the two representations.  Further, if some of the values (like
months in an annual array) are tacit, then comparisons might require
that the tacit values be made explicit - changing the storage requirements
and changing the cryptographic digests in a way that would prevent scientifically
identical values from having the same cryptographic digest.</brb>
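
A minimal sketch of the point above, in Python.  The four values and the
five encodings are made up for illustration, and SHA-256 is an arbitrary
choice of digest.  The same numbers are stored as 16-bit integers, 32-bit
integers, doubles, five-character ASCII fields, and in reversed order; every
encoding yields a different digest even though each one decodes back to the
same values.

```python
import hashlib
import struct

# Hypothetical measurements; any small integers would do.
values = [12, 0, 7, 305]

def sha256_hex(data: bytes) -> str:
    """Digest of the raw bytes, i.e. what a fixity check actually sees."""
    return hashlib.sha256(data).hexdigest()

# The same scientific values in five different physical representations.
encodings = {
    "int16": struct.pack("<4h", *values),
    "int32": struct.pack("<4i", *values),
    "float64": struct.pack("<4d", *map(float, values)),
    "ascii": "".join(f"{v:5d}" for v in values).encode("ascii"),
    "int16_reversed": struct.pack("<4h", *reversed(values)),
}

# How to read each representation back into a list of numbers.
decoders = {
    "int16": lambda b: list(struct.unpack("<4h", b)),
    "int32": lambda b: list(struct.unpack("<4i", b)),
    "float64": lambda b: list(struct.unpack("<4d", b)),
    "ascii": lambda b: [int(b[i:i + 5]) for i in range(0, len(b), 5)],
    "int16_reversed": lambda b: list(reversed(struct.unpack("<4h", b))),
}

digests = {name: sha256_hex(raw) for name, raw in encodings.items()}
decoded = {name: decoders[name](raw) for name, raw in encodings.items()}

for name in encodings:
    print(f"{name:15s} {digests[name][:16]}  same values: {decoded[name] == values}")
```

The digest answers "are these the same bits?", while the decoded comparison
answers "are these the same numbers?", and only the second survives a change
of representation or order.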

Do you think it is possible to adapt the UFN approach previously
mentioned to our earth science data?  It addresses (some, but not all
of) the things you discuss here.

<brb>Absolutely NOT!  The UFN approach starts by assuming that data
can be arranged in a "canonical" sequence of values and held to
a single specified precision and representation.  There isn't
anyone to play "pope" to provide a "canon" of data formats.  That
means that there isn't anyone who can identify the "canonical"
representation of a numeric data collection.

Here are a couple of examples:

1.  The NOAA GHCN adjusted precipitation data separates the geolocation
data from the actual precip data - which is arranged in single year
arrays (Jan as first month, Dec as the last month) using an ASCII encoding
of five characters to represent an integer.  I don't think most of us
would accept the notion that converting the ASCII to, say, double
precision floats in memory (or in a file) would suddenly render the data
"inauthentic".

2.  The MODIS MOD02 data product contains the lat and long values for each
location in the 1 km data (if I remember what we had to deal with on CERES).
If someone takes the spectral channels for 1 km res data and extracts the
lats and longs, and uses that for an analysis, I don't think most of us
would assume that they've created "inauthentic" data.  So - is the
"authentic" data the spectral radiances without the geolocation - or does
the geolocation data have to accompany the spectral radiances?  If the
answer is the spectral radiances, does the identifier have to refer just to
that data?  If the answer is both, what identifier should someone quote who
wants to use just the spectral data?

As additional indicators of the difficulty, you can take the different
formats available for images, with the format differences persisting over
multiple decades (bmp, jpg, tiff, ps, and eps).  Likewise, do you really
expect NASA, NOAA, and DOD to agree on exactly the format and representation
they'll use in common?  Or, for that matter, that NASA and ESA will agree on
identical data formats and sequential order in files of the "same" data?</brb>
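
The GHCN example above can be sketched roughly as follows, in Python.  The
record layout here is a simplification and an assumption, not the actual
GHCN file format: one year is taken to be twelve five-character ASCII
integer fields, with -9999 assumed as the missing-value code.  The sketch
converts the ASCII to double precision floats in memory and then writes the
values back out, showing that the conversion loses nothing.

```python
from typing import List, Optional

FIELD_WIDTH = 5      # five ASCII characters per monthly value
MONTHS = 12          # Jan first, Dec last; the month index is tacit
MISSING = -9999      # assumed missing-value code, for illustration only

def parse_year(line: str) -> List[Optional[float]]:
    """Convert one year of fixed-width ASCII integers to doubles (None = missing)."""
    out: List[Optional[float]] = []
    for m in range(MONTHS):
        raw = int(line[m * FIELD_WIDTH:(m + 1) * FIELD_WIDTH])
        out.append(None if raw == MISSING else float(raw))
    return out

def format_year(values: List[Optional[float]]) -> str:
    """Write the in-memory values back as fixed-width ASCII integers."""
    fields = []
    for v in values:
        n = MISSING if v is None else int(v)
        fields.append(f"{n:{FIELD_WIDTH}d}")
    return "".join(fields)

# A made-up year of precipitation values, encoded the way the file would hold it.
raw_ints = [12, 0, 305, MISSING, 44, 18, 0, 0, 7, 91, 120, 63]
line = "".join(f"{n:{FIELD_WIDTH}d}" for n in raw_ints)

in_memory = parse_year(line)       # doubles, with None for the missing month
round_trip = format_year(in_memory)

print(in_memory)
print(round_trip == line)
```

Under these assumptions the doubles in memory carry exactly the information
in the ASCII record, so an identifier scheme that treats the converted data
as "inauthentic" is rejecting a change of representation, not a change of
content.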

Additionally, I think reproducibility through complete provenance
capture helps address this (though I acknowledge it doesn't solve it).

<brb>Don't agree at all!</brb>

  > This line of reasoning strongly suggests that the notion of a
  > "unique authentic version" of a file is impossible.

We have typically relied on a trusted curator to manage and affirm
this.  We can't prove it (especially in the case of a malicious
curator), but we log cryptographic digests as we produce data, and
distribute them with the data files.  We (EOSDIS) dictate formats for
standard data products for the "authoritative" version and that is
what gets archived and distributed.  As you point out, this only
checks the physical bits.
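
A rough sketch, in Python, of logging digests at production time and checking
them later.  This is not the EOSDIS implementation; the file name, the
in-memory manifest, and the choice of SHA-256 are assumptions made for
illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List

def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(paths: Iterable[Path]) -> Dict[str, str]:
    """Record a digest per file at production time."""
    return {p.name: file_digest(p) for p in paths}

def verify(paths: Iterable[Path], manifest: Dict[str, str]) -> List[str]:
    """Return the names of files whose bytes no longer match the manifest."""
    return [p.name for p in paths if file_digest(p) != manifest.get(p.name)]

with tempfile.TemporaryDirectory() as tmp:
    granule = Path(tmp) / "granule_001.dat"      # hypothetical data file
    granule.write_bytes(b"\x00\x01\x02\x03")

    manifest = build_manifest([granule])         # logged when the file is produced
    clean = verify([granule], manifest)          # checked after distribution

    granule.write_bytes(b"\x00\x01\x02\x07")     # simulate corruption of one byte
    altered = verify([granule], manifest)

print(clean)
print(altered)
```

The check detects any change to the bytes, but it still depends on trusting
whoever holds the manifest, and it says nothing about whether two differently
formatted files hold the same scientific values.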

<brb>Why should we rely on unverifiable curation?  What techniques do we
have to audit the claim of identity?  I do think auditing is possible
and can be highly reliable.  However, it means that the curating
authority has to present a chain of evidence that will somehow
allow independent verification of the claim of unaltered scientific
identity.  I think we have a responsibility to avoid claims that cannot
be substantiated - and I don't think the current state of our claims
of "unique identifiers" can be verified independently.  If we go back
to "Applied Cryptography", we don't have a verifiable trust model.
Anyone in the security business can (and should) shoot at what we're
claiming.  I do not believe our current position fits with long-term
preservation.</brb>

In some cases, like the MODIS "process on demand" L1B, we can't do
that.  We assert that we have the ability to reproduce an equivalent
file (although with the current implementation, it actually performs
what I call "reprocessing" rather than "reproducing" -- The difference
being that reprocessing can use better versions of ancillary data
files, or later versions of the algorithms rather than trying to apply
a faithful attempt to make the same file.)

<brb>In short - it's not reproducible at all!  Bad guarantee of fixity!
Indeed, it sounds like a case of near-fraud, misrepresenting a reproduction
as a (possibly bad) replica!</brb>

Curt
_______________________________________________
Esip-preserve mailing list
Esip-preserve at lists.esipfed.org
http://www.lists.esipfed.org/mailman/listinfo/esip-preserve

