[Esip-preserve] A Concern

Wed Sep 15 19:47:33 EDT 2010

I'll try to develop some concrete, doable tasks the
WG could do.  

I have two very concrete examples of
files that illustrate the problem(s) - one taken from
a recent paper I have had accepted in ESI that had
a map of byte values in a 2D array.  The byte values
could be converted to ASCII strings (of variable length)
that represented 18 ecosystem types.  Thus, one could
encode a small set of integers (18, to be exact) as
ASCII text - with no loss in scientific identity of
the data.  Furthermore, the order of the array could
be transposed without losing scientific identity.
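
A minimal sketch of this first example, in Python. The code table below is a hypothetical stand-in (three placeholder names, not the 18 ecosystem types from the paper), and the function names are illustrative only. It shows that recoding the bytes as ASCII names and transposing the array can both be undone exactly:

```python
# Hypothetical code table: placeholder names, not the 18 ecosystem
# types from the paper.
CODE_TO_NAME = {0: "water", 1: "tundra", 2: "boreal forest"}
NAME_TO_CODE = {name: code for code, name in CODE_TO_NAME.items()}


def to_names(grid):
    """Replace each byte code in a 2D grid with its ASCII name."""
    return [[CODE_TO_NAME[c] for c in row] for row in grid]


def to_codes(named_grid):
    """Inverse of to_names: map ASCII names back to byte codes."""
    return [[NAME_TO_CODE[n] for n in row] for row in named_grid]


def transpose(grid):
    """Swap rows and columns; applying it twice restores the grid."""
    return [list(col) for col in zip(*grid)]


byte_map = [[0, 1, 2], [2, 1, 0]]
as_text = to_names(byte_map)

# Both transforms are reversible, so no scientific content is lost.
assert to_codes(as_text) == byte_map
assert transpose(transpose(byte_map)) == byte_map
```

The recoding is only reversible because the code table itself is one-to-one; two codes sharing a name would break the inverse.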

The second one is drawn from work I've done with the
NCDC GHCN monthly averaged precipitation data.  The
original files have data from about 2100 rain gauge
stations and are encoded as ASCII integers, with twelve
monthly values per year - identifying the stations
with an ASCII label and keeping the numerical values
in order of the months.  It's actually easy to rearrange
the scientific data so that all of the values for a given
station are put into a single array - where now year and
month are encoded by position in the array.  No data is
lost in the transform, although the "tacit" ordering of
the data is part of the context information that's needed
to interpret the data.  Note that the order of the stations
can be permuted without losing scientific information.
The individual values could be encoded as floating point
values instead of scaled integers.  Furthermore, one could
even separate the station IDs from the station records
(as one might in a normalized database) and still have the
scientific equivalence even though the total data source
resides in two files (or tables), not just one.
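
A rough Python sketch of these rearrangements follows. The record layout, station labels, and scale factor are all assumptions made for illustration; this is not the actual GHCN file format. Each transform is paired with an inverse so the round trip can be checked:

```python
# Simplified, hypothetical records: (station_id, year, twelve monthly
# values as scaled integers). This is not the actual GHCN file layout.
records = [
    ("STN_A", 2000, [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]),
    ("STN_B", 2000, [5] * 12),
    ("STN_A", 2001, [11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 111, 121]),
]

SCALE = 10  # assumed scale factor for the scaled integers


def by_station(recs):
    """One flat array per station, stored with its first year.

    Year and month are then implied by position in the array.
    Assumes each station's years are consecutive with no gaps.
    """
    grouped = {}
    for sid, year, months in sorted(recs, key=lambda r: (r[0], r[1])):
        _start, values = grouped.setdefault(sid, (year, []))
        values.extend(months)
    return grouped


def to_records(grouped):
    """Inverse of by_station: rebuild (station, year, months) records."""
    out = []
    for sid, (start, values) in grouped.items():
        for i in range(0, len(values), 12):
            out.append((sid, start + i // 12, values[i:i + 12]))
    return out


def normalize(grouped):
    """Split into two tables: station IDs, and data keyed by row index."""
    ids = sorted(grouped)
    return ids, {k: grouped[sid] for k, sid in enumerate(ids)}


def denormalize(ids, data):
    """Inverse of normalize: rejoin the two tables."""
    return {ids[k]: v for k, v in data.items()}


def to_float(v):
    """Scaled integer to floating point."""
    return v / SCALE


def from_float(f):
    """Floating point back to scaled integer (rounding is required)."""
    return round(f * SCALE)


grouped = by_station(records)
# Station order carries no information, so compare as sorted lists.
assert sorted(to_records(grouped)) == sorted(records)
assert denormalize(*normalize(grouped)) == grouped
assert all(from_float(to_float(v)) == v for _, _, m in records for v in m)
```

The "no gaps" assumption in `by_station` is exactly the kind of tacit context mentioned above: without it, position in the array no longer determines year and month.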

There are a lot of variants on this theme.  My sense is
that the best we can do is to recognize that a particular
file (or "data set") belongs to a family of representationally
equivalent instances - with no one instance being the "unique"
touchstone.  I think the question of identifying whether a
particular instance belongs to a particular family is important
for auditing claims of the scientific identity of two files and
also that this claim will need to be supported by explicit
(mathematical) mappings between the two instances that are
one-to-one and onto - meaning that both a forward and an inverse
mapping are required to demonstrate equality.
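
To make the auditing idea concrete, here is a small Python sketch. The two instances, the JSON serialization, and the `forward`/`inverse` names are hypothetical choices for illustration. It shows two layouts whose cryptographic digests differ, while a forward and an inverse mapping recover each instance from the other:

```python
import hashlib
import json


def digest(obj):
    """SHA-256 of a JSON serialization of one instance's layout."""
    return hashlib.sha256(json.dumps(obj).encode("ascii")).hexdigest()


# Two hypothetical instances of the same data: original and transposed.
instance_a = [[1, 2, 3], [4, 5, 6]]


def forward(a):
    """Map instance A's layout to instance B's (transpose)."""
    return [list(col) for col in zip(*a)]


def inverse(b):
    """Map instance B's layout back to instance A's."""
    return [list(row) for row in zip(*b)]


instance_b = forward(instance_a)

# The digests differ because the serialized layouts differ...
assert digest(instance_a) != digest(instance_b)

# ...but round trips in both directions recover each instance, which
# is the evidence offered for membership in the same family.
assert inverse(forward(instance_a)) == instance_a
assert forward(inverse(instance_b)) == instance_b
```

A round trip on one pair of files only audits that pair; showing that the mapping is one-to-one and onto for the whole family is a separate argument.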

I'll have more to say on this issue at the HDF and HDF-EOS
workshop in a couple of weeks.  I believe I've been able to
convert Annex E of the OAIS RM into a more formal form that
I can use (in Ada) to create a programmable form for this
claim.

Also, the larger issue of potentially non-unique data collections
spread across a federation of archival sites should probably be
tackled by starting with a threat analysis based on the LOCKSS
paper on the subject.  That would be a useful item of work for
the WG - and would move us in the direction of leading work on
reliability analysis for long-term preservation.

Note also - my e-mail service at Verizon.net has been bought 
out by Frontier.com.  I have opened an account at
brbarkstrom at gmail.com
which I hope will be more reliable and preservable into the future.
By the end of the year, the Verizon account will be gone - and I
don't intend to continue with Frontier.

Finally, this coming weekend is another of our football "excursions"
up to Illinois.  Thus, I'll be out of touch from late tomorrow 
afternoon until Tuesday.

Bruce B.
----- Original Message -----
From: "Ruth Duerr" <rduerr at nsidc.org>
To: alicebarkstrom at verizon.net
Cc: "Ruth Duerr" <rduerr at nsidc.org>, esip-preserve at rtpnet.org
Sent: Wednesday, September 15, 2010 12:33:24 PM
Subject: Re: A Concern

Hi Bruce, 

On Sep 14, 2010, at 9:45 AM, alicebarkstrom at verizon.net wrote: 

In the course of working on a presentation on the role of 
data formats in information preservation, I found that it 
is possible to create several different groupings of numerical 
data that have identical scientific content but that 
are different in ways that would prevent a cryptographic 
digest from identifying them as identical.  As a result, 
I'm reasonably certain that for Earth science data, files 
that have the same content are not unique in their form. 
A simple example arises from data that could be stored 
in a database but that has one instance that is normalized 
and another instance (of the same data) that is not normalized. 
  I think this is the issue of scientific identity or equivalence that Curt has previously brought up.  I do note some interesting work being done in the social sciences on Unique Numerical Fingerprints ( http://thedata.org/citation/standard ) that has relevance here.  Unfortunately it looks like considerable effort needs to go into developing a "fingerprint" for each kind of data. 

I'll also note that in the likely event of transformational 
migration, it seems probable that there will be multiple 
locations that can be identified as holders of authentic 
data, although the files in the collections may be quite 
different in their layout.  LOCKSS is, of course, a prime 
example of this dispersal of authenticity. 
  Actually LOCKSS is interesting in this regard since while they do hold voting to ensure that all copies are the same, they do explicitly have a single source (the "authentic, authorized" version) where you can get copies from.  This is similar to NSIDC's concepts of primary archive and backup archives...  Our responsibilities are different if we are primary vs backup... 

I think this suggests some discussion is needed regarding 
what we mean by uniqueness and authenticity, as well as 
some work regarding reliability of survival of information. 
  If you can define concrete, doable tasks that the group could tackle, that would be great! 

Ruth