[Esip-preserve] Stewardship Best Practices - Identifiers

Bruce Barkstrom brbarkstrom at gmail.com
Thu Oct 7 15:01:51 EDT 2010


First, I'm attaching six very short text files containing an example of
the kinds of data rearrangement that will prevent cryptographic digests
from working.  The data are drawn from two files that are available from
an ftp site at NCDC - the location from which I got them is in the file
identified as Procedural_Documentation.txt.  The two files with "INV" in
the file name contain geolocation data for rain gauge stations (lat,
long, altitude).  The files with "PRCP" in the title are simple
rearrangements of ASCII text files that contain data - which appear to
have been written with a FORTRAN I5 format for each of the monthly
averaged precipitation values.  You'll also find my calculation of MD5
checksums for each file.  If I haven't made errors and my program has no
bugs, you should be able to verify that the files you receive with this
note are unchanged from what I've done.
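
For anyone who wants to reproduce the checksums, a minimal sketch in
Python along the following lines should be enough; the file names here
are placeholders standing in for the attachments, not the actual names.

    import hashlib

    def md5_of_file(path):
        """Compute the MD5 digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Placeholder names standing in for the attached files.
    for name in ["PRCP_original.txt", "PRCP_rearranged.txt"]:
        print(name, md5_of_file(name))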

Second, as a note on scientific equality: although the data values in
the files have been created using only text "cut-and-paste" (and checked
visually), the simple act of rearranging the order of the values gives
the files different checksums - even though I think it is reasonable to
believe that the data are both authentic (in the sense that they are
what NCDC put into them) and identical in the sense that, when pairs of
corresponding elements in the two files are examined together, they have
the same scientific value.  It is easy to contrive more complex examples
if we convert the ASCII integer values to, say, double-precision floats.
At that point, we'd have to turn to a program that directly compares the
numeric values of the data.  Note, for example, that an ASCII '1'
character actually has a binary value of 49.  Thus, a bit-by-bit
comparison of the ASCII text in these files with the float values will
not produce equality.
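
To make the ASCII-versus-binary point concrete, here is a small
illustration (not drawn from the attached files) of why a bit-for-bit
comparison cannot succeed while a numeric comparison does:

    import struct

    text_value = "1"                   # an ASCII digit as an I5 format writes it
    print(ord(text_value))             # 49 - the byte value of the character '1'

    numeric_value = float(text_value)  # 1.0 as a double-precision float
    packed = struct.pack(">d", numeric_value)
    print(packed.hex())                # 3ff0000000000000 - nothing like 0x31

    # The representations differ bit for bit; the scientific values agree.
    print(text_value.encode() == packed)          # False
    print(int(text_value) == int(numeric_value))  # True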

Third, there is some complexity involved in what we might mean by
"scientifically identical".  Probably the simplest case involves
equality only of the data values.  As noted in the
Procedural_Documentation file (near the bottom), these are only the data
values contained in the files with PRCP - and the comparison would not
include the station identifiers.  The next simplest case requires
grouping the data values with the geolocation values (or, in more
general terms, the spatio-temporal sampling pattern).  The strongest
case I can think of now would require the data values, the sampling
pattern, and the statistical distribution of error to match for every
point.  At this point, I am not prepared to discuss computational
feasibility or storage cost.  As a practical matter, even the simplest
case is tedious enough to be worth automating.
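
As a rough sketch of what automating even the simplest case might look
like - assuming, purely for illustration, that the values from each file
have already been read into sequences placed in corresponding order, and
that the INV geolocations can be read the same way - the comparison
would run roughly as follows:

    def scientifically_identical(values_a, values_b,
                                 locations_a=None, locations_b=None):
        """Compare corresponding data values from two files.

        values_*    : sequences of precipitation values, already placed
                      in corresponding order
        locations_* : optional sequences of (lat, long, altitude) tuples
                      giving the sampling pattern for each value

        The simplest case compares only the data values; supplying the
        location sequences adds the sampling-pattern check.
        """
        if len(values_a) != len(values_b):
            return False
        if any(a != b for a, b in zip(values_a, values_b)):
            return False
        if locations_a is not None and locations_b is not None:
            if list(locations_a) != list(locations_b):
                return False
        return True

The strongest case would add a comparison of the error distributions at
each point, which I have deliberately left out of the sketch.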

A suggestion or two on how to proceed:

a.  I have some strong suspicions that the appropriate way of dealing
with traceability (or authenticity) is to regard the problem as
equivalent to being able to request a deed abstract that shows all of
the comparisons of scientific equivalency that the abstract preparer is
aware of, preferably including at least one comparison with the original
file prepared by the original producer.  It will take me a bit of time
(meaning maybe a day) to write up an outline of what this would entail.
The WG needs to digest the material above (and - hopefully - the files
attached) and have a discussion.
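
As a first, purely illustrative cut at what one entry in such an
abstract might carry - none of these field names are settled, and the
outline will go into more detail - something like:

    from dataclasses import dataclass

    @dataclass
    class EquivalenceComparison:
        """One entry in a hypothetical 'deed abstract' of equivalence checks."""
        file_a: str        # identifier of the first file
        file_b: str        # identifier of the second file
        compared_by: str   # who (or what program) performed the comparison
        date: str          # when the comparison was made
        criterion: str     # e.g. "data values only", "values + sampling pattern"
        result: bool       # did the files match under that criterion?
        notes: str = ""    # anything else the preparer wants to record

    # The abstract itself is then just the list of such comparisons the
    # preparer is aware of, ideally including at least one against the
    # original producer's file.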

b.  The paper, as it now stands, is an interesting collection of
information about various approaches to registering data.  I have mixed
feelings about it - particularly about whether it would lead the
community to place undue confidence in the registration process as a
long-term way of preserving data.  As with suggestion a, I think we need
to poll the members of the WG about their feelings on it - after
considering the discussion we've been having.

c.  As an interim approach to identifiers, I think it would be
straightforward to use the current naming conventions and hook them up
to some of the identifier schemas identified in the paper.  DOIs that
resolve to file or collection names are probably as good as anything
else if we're looking for simple identification.  However, I do not
think this kind of approach should be sold as a method of concocting
permanent, unique identifiers.  At best, scientifically identical data
collections form an equivalence class - with no special status for any
one member of the class.
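
To make the equivalence-class point a little more concrete, one could
imagine grouping files by a digest of some agreed canonical rendering of
their data values; the canonicalization shown here is only a
placeholder, not a proposal:

    import hashlib
    from collections import defaultdict

    def canonical_digest(values):
        """Digest of a canonical text rendering of the data values
        (placeholder scheme: values sorted by their sampling key)."""
        canonical = "\n".join(f"{key}:{value}"
                              for key, value in sorted(values.items()))
        return hashlib.md5(canonical.encode()).hexdigest()

    def equivalence_classes(files):
        """Group files whose data values share a canonical digest.

        files : dict mapping file name -> dict of sampling key -> value
        """
        classes = defaultdict(list)
        for name, values in files.items():
            classes[canonical_digest(values)].append(name)
        return classes  # each value is one class; no member is privileged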

I will see what I can do to put together an outline of what I'd suggest
for authenticity and auditing.  I do have documentation and code for a
computational approach to establishing the scientific identity of two
data sets.  I'll also note that the HDF mapping approach that I heard
about (for the first time) at the HDF workshop looks to me like a pretty
solidly based piece of work - although it fits within the constraint of
working only with HDF files (at least if my understanding hasn't led me
astray).  The mapping of one file into an equivalent file is also part
of the work that I'm doing, although I've started with the layered
information model in Annex E of the OAIS RM.  The content of my mappings
is also different from the XML approach - my mappings consist of pairs
of indexes that refer to array elements in what Annex E identifies as
the Data Element layer.  The mapping concept seems like the most
fruitful approach from my perspective.
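
For what it's worth, the shape of the mappings I have in mind is nothing
more elaborate than pairs of indexes into the Data Element layers of the
two files - roughly:

    def apply_mapping(source_elements, index_pairs):
        """Rearrange one file's Data Element layer into another's ordering.

        source_elements : list of data elements from file A
        index_pairs     : list of (i_a, i_b) pairs asserting that element
                          i_a of file A corresponds to element i_b of file B
        """
        target = [None] * len(source_elements)
        for i_a, i_b in index_pairs:
            target[i_b] = source_elements[i_a]
        return target

    # If the mapping is complete, the rearranged elements should compare
    # equal, element by element, to file B's Data Element layer.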

Hope this is helpful.  We've got a lot of listening to each other to do.

Bruce B.
On Thu, Oct 7, 2010 at 10:41 AM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> On 10/07/2010 10:23 AM, alicebarkstrom at frontier.com wrote:
>
>> The alternative is to be able to verify that files hold scientifically
>> identical data by computing whether the alternatives have the same
>> values.
>>
>
> The Altman paper Ruth cited on page 19 discusses this a bit:
> http://www.springerlink.com/content/j13u6pwh837q2711/
> "A Fingerprint Method for Scientific Data Verification".
>
> Basically producing a hash of a canonical representation of
> the data.  Regardless of the format, the prescribed canonical
> representation is the same, so the hashes are comparable.
>
> For numerous reasons (you point out several), that isn't sufficient
> for our needs, but with some more work, it could be adapted to help us
> perform a similar function.
>
>
> I've been working on a comparable method, taking hashes of a canonical
> representation of the provenance of a file and using that as a
> fingerprint to compare two files.
>
>
> I think we need to work on both approaches.  Ways to identify,
> distinguish and compare content and ways to identify, distinguish and
> compare provenance.
>
>
> Curt
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>