[Esip-preserve] Stewardship Best Practices - Identifiers

Wed Oct 6 16:05:18 EDT 2010

Owing to e-mail confusions this morning, I'm going
to repeat a message concerning "unique identifiers":

I now believe I have a proof that, under a fairly
stringent definition of scientific data equality,
collections of data are not unique in the sense that
a single bit-by-bit replication of the collection
content can be identified.  In other words, there is a
computable definition of scientific data equality
under which two files (or collections) have identical
scientific content but do not share a cryptographic
digest, because they are not bit-by-bit identical.
There are at least three sources of difficulty:

1.  Differences in the formatting of individual data
values, e.g. a byte value may be identical with an 
int value and may be identical with a long int as well
as an ASCII character string that spells out the integer
value.

2.  Differences in the serialization of arrays or lists
of values, as in C array order versus FORTRAN order, where
the values are the same but the order in which the values
appear differs.  A rearrangement of the rows in a relational
database table would have the same effect.  

Because a cryptographic digest depends on the order of bits, such
rearrangements will in practice produce different
digests, although the data elements themselves are unchanged.
Note that this case requires an explicit mapping that can
identify which element in the second collection corresponds
to an element in the first file, so that corresponding
elements can be tested for equality.  
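
To make the ordering point concrete, here is a minimal sketch in
Python.  The 2 x 3 array, the choice of SHA-256, and the helper names
(serialize, c_offset, f_offset) are assumptions made for illustration;
they are not taken from the identifiers paper or any particular file
format.  The sketch writes the same values in C order and in FORTRAN
order, digests both byte streams, and then uses an explicit index
mapping to test corresponding elements for equality.

```python
import hashlib
import struct

# Hypothetical 2 x 3 array of measurements.
ROWS, COLS = 2, 3
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]

def serialize(values):
    """Pack a flat list of floats as little-endian 8-byte doubles."""
    return struct.pack(f"<{len(values)}d", *values)

# C (row-major) order: the last index varies fastest.
c_order = [grid[i][j] for i in range(ROWS) for j in range(COLS)]
# FORTRAN (column-major) order: the first index varies fastest.
f_order = [grid[i][j] for j in range(COLS) for i in range(ROWS)]

c_digest = hashlib.sha256(serialize(c_order)).hexdigest()
f_digest = hashlib.sha256(serialize(f_order)).hexdigest()

def c_offset(i, j):
    """Position of element (i, j) in the row-major stream."""
    return i * COLS + j

def f_offset(i, j):
    """Position of element (i, j) in the column-major stream."""
    return j * ROWS + i

# The explicit mapping pairs each element of one stream with its
# counterpart in the other, so the pairs can be compared directly.
elements_match = all(
    c_order[c_offset(i, j)] == f_order[f_offset(i, j)]
    for i in range(ROWS) for j in range(COLS)
)

print("digests equal: ", c_digest == f_digest)
print("elements match:", elements_match)
```

Without the offset functions there is no way to know which element of
the second stream to compare against a given element of the first,
which is why the mapping has to be stated explicitly.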

This concern over rearrangements also applies to cases in which some parts
of a data collection are placed in different files.  Such
a case occurs when geolocation data (lat, long, and altitude)
are placed in one file, while observations are placed in
another.  That happens with MODIS: MOD 03 contains the geolocation
data for the 1 km channels, while the calibrated
spectral radiances (MOD 02) are placed in one or more other files.
It also happens with the monthly average precipitation
data from NOAA's Global Historical Climatology Network (GHCN).
I have created a single file that keeps the geolocation
and the precip records for a station in the equivalent of
a single relational database record, although the original
data were in separate files.  I also converted the ASCII
five-character integer values for precipitation into 
double precision floats and put all of the years of precip
into one array - without affecting the numerical values
to any perceptible degree.
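
A rough sketch of that kind of restructuring follows.  The two text
layouts, the station id S0001, and all of the numbers are invented for
illustration and are NOT the actual GHCN file formats; the function
names are likewise hypothetical.  The sketch joins a station's location
with its precipitation records, converts five-character ASCII integers
into double precision floats, and puts all years into one array.

```python
import hashlib

# Hypothetical stand-ins for two source files; NOT the actual GHCN layout.
STATION_TEXT = "S0001 40.10 -75.30 12.0\n"
PRECIP_TEXT = (
    "S0001 1990  123  456   78\n"
    "S0001 1991   90  210  333\n"
)

def parse_stations(text):
    """Map station id -> (lat, lon, altitude)."""
    stations = {}
    for line in text.splitlines():
        sid, lat, lon, alt = line.split()
        stations[sid] = (float(lat), float(lon), float(alt))
    return stations

def parse_precip(text, width=5):
    """Map station id -> {year: [values]} from fixed-width ASCII integers."""
    series = {}
    for line in text.splitlines():
        sid, year, body = line[:5], int(line[6:10]), line[10:]
        fields = [body[k:k + width] for k in range(0, len(body), width)]
        series.setdefault(sid, {})[year] = [float(int(f)) for f in fields]
    return series

def merge(stations, series):
    """One record per station: location plus all years in a single array."""
    merged = {}
    for sid, location in stations.items():
        years = sorted(series[sid])
        values = [v for y in years for v in series[sid][y]]
        merged[sid] = {"location": location,
                       "first_year": years[0],
                       "precip": values}
    return merged

records = merge(parse_stations(STATION_TEXT), parse_precip(PRECIP_TEXT))

# In this sketch the conversion is exact, because integers of this size
# are exactly representable as double precision floats.
original = [int(tok) for line in PRECIP_TEXT.splitlines()
            for tok in line[10:].split()]
values_preserved = records["S0001"]["precip"] == [float(v) for v in original]

source_digest = hashlib.sha256(
    (STATION_TEXT + PRECIP_TEXT).encode("ascii")).hexdigest()
merged_digest = hashlib.sha256(repr(records).encode("ascii")).hexdigest()

print("values preserved:", values_preserved)
print("digests equal:   ", source_digest == merged_digest)
```

The merged record carries the same numbers as the two source texts, yet
no digest of the merged form can be expected to match a digest of the
originals.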

3.  There may be tacit data in such objects as array bin
values that act as implicit numerical values.  If these
tacit data are converted to explicit values, two data
collections will differ in their bit-by-bit values, although
they contain identical scientific data.
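
One way to picture tacit data is a histogram whose bin centers are
implied by a start value, a bin width, and the array index.  The sketch
below is only an illustration under that assumption; the byte layouts
and the names expand_implicit and read_explicit are hypothetical.

```python
import hashlib
import struct

# Hypothetical histogram: the bin centers are tacit, implied by a start
# value, a bin width, and each count's position in the array.
START, WIDTH = 0.5, 1.0
counts = [4, 9, 2]

# Implicit form: only the two header values and the counts are stored.
implicit_bytes = struct.pack("<2d3i", START, WIDTH, *counts)

# Explicit form: every bin center is written out next to its count.
explicit_pairs = [(START + k * WIDTH, c) for k, c in enumerate(counts)]
explicit_bytes = b"".join(
    struct.pack("<di", center, c) for center, c in explicit_pairs)

def expand_implicit(raw):
    """Recover (bin center, count) pairs from the implicit layout."""
    start, width, *cs = struct.unpack("<2d3i", raw)
    return [(start + k * width, c) for k, c in enumerate(cs)]

def read_explicit(raw):
    """Read (bin center, count) pairs from the explicit layout."""
    return list(struct.iter_unpack("<di", raw))

same_content = expand_implicit(implicit_bytes) == read_explicit(explicit_bytes)
same_digest = (hashlib.sha256(implicit_bytes).hexdigest()
               == hashlib.sha256(explicit_bytes).hexdigest())

print("same scientific content:", same_content)
print("same digest:            ", same_digest)
```

Both layouts decode to the same list of (bin center, count) pairs, but
the stored bytes, and therefore the digests, differ.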

The identifiers paper does contain wording that suggests
that use case 1 is based on bit-by-bit replication of values
to determine uniqueness.  I believe this is not supportable.
Scientific identity creates equivalence classes, not unique
objects.
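
As a small illustration of what "equivalence classes" means here, the
following sketch groups several encodings of integer data by their
decoded values.  The format tags, the collection names a through d, and
the functions values_of and scientifically_equal are assumptions made
for this example, not a proposed standard.

```python
import hashlib
import struct

# Hypothetical collections as (format tag, raw bytes) pairs.
collections = {
    "a": ("bytes", bytes([7, 11, 13])),             # one byte per value
    "b": ("int32", struct.pack("<3i", 7, 11, 13)),  # 4-byte integers
    "c": ("ascii", b"7 11 13"),                     # spelled-out text
    "d": ("ascii", b"7 11 14"),                     # differs in one value
}

def values_of(fmt, raw):
    """Decode a collection to a tuple of Python integers."""
    if fmt == "bytes":
        return tuple(raw)
    if fmt == "int32":
        return struct.unpack(f"<{len(raw) // 4}i", raw)
    if fmt == "ascii":
        return tuple(int(tok) for tok in raw.decode("ascii").split())
    raise ValueError(f"unknown format: {fmt}")

def scientifically_equal(x, y):
    """Equality of decoded values, ignoring how they were stored."""
    return values_of(*x) == values_of(*y)

# Group collection names into equivalence classes by decoded content.
classes = {}
for name, coll in collections.items():
    classes.setdefault(values_of(*coll), []).append(name)

digests = {name: hashlib.sha256(raw).hexdigest()
           for name, (_, raw) in collections.items()}

print("equivalence classes:", sorted(classes.values()))
print("distinct digests:   ", len(set(digests.values())))
```

Collections a, b, and c fall into one class and d into another, while
all four digests are distinct, so a digest distinguishes members of the
same class that are scientifically identical.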

There are related problems with the notion of authenticity
in the paper, although I won't discuss these in the same
level of detail now.

I think this situation requires a fairly substantive
discussion by our working group.

Bruce B.

----- Original Message -----
From: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
To: "ESIP Preservation cluster" <esip-preserve at rtpnet.org>
Sent: Wednesday, October 6, 2010 7:25:58 AM
Subject: [Esip-preserve] Stewardship Best Practices

Looking over this a bit:

http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Principles

Here are some, perhaps controversial, proposals for addition:

Data Creators:

will release all software used in the processing of data used in
scientific research. Even if the rights to use that software are
restricted it should at least be available for inspection.

will strive for portability in software used in the processing of data
used in scientific research to enable independent verification and
reproducibility of results.

Data Intermediaries:

will assign persistent resolvable identifiers for data. (link to
identifiers paper)

will maintain metadata for cited datasets to preserve the integrity of
persistent identifiers, even if the data themselves are deleted due to
obsolescence.

Data Users:

   "will follow any restrictions on redistribution of data that were
    indicated by the data intermediaries."

add software to that one:

will follow any restrictions on usage of software or redistribution of
data that were indicated by the data intermediaries.

Curt