[Esip-preserve] Stewardship Best Practices - Identifiers

alicebarkstrom at frontier.com
Thu Oct 7 10:23:13 EDT 2010


The first sentence is correct.  The point of the proof is that
there is no way at all to find a unique identifier based on
bit-for-bit identity.  If we were working with texts, like plays
of Shakespeare, bit-for-bit identity would be appropriate because
the text can't be rearranged without losing meaning.  However, because we
can rearrange numbers in data collections without changing their
values, there's no way to establish a unique identifier.
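
To make the point concrete, here is a minimal sketch in Python.  The
2 x 3 table, its values, and the three serializations are assumptions
invented for illustration, not taken from any real product.  It writes
the same numbers in C order, in FORTRAN order, and as ASCII text, then
shows that the SHA-256 digests disagree even though every encoding
decodes back to the same values.

```python
import hashlib
import struct

# Hypothetical 2 x 3 table of integer measurements (invented values).
ROWS = [[1, 2, 3], [4, 5, 6]]
N_ROWS, N_COLS = 2, 3


def digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


# Row-major (C order) serialization as 32-bit little-endian integers.
c_values = [v for row in ROWS for v in row]
c_bytes = struct.pack("<6i", *c_values)

# Column-major (FORTRAN order) serialization of the same table.
f_values = [ROWS[r][c] for c in range(N_COLS) for r in range(N_ROWS)]
f_bytes = struct.pack("<6i", *f_values)

# ASCII text serialization, values separated by spaces, in row-major order.
ascii_bytes = " ".join(str(v) for v in c_values).encode("ascii")


def table_from_c(payload: bytes) -> list:
    """Decode a row-major binary payload back into a list of rows."""
    vals = struct.unpack("<6i", payload)
    return [list(vals[r * N_COLS:(r + 1) * N_COLS]) for r in range(N_ROWS)]


def table_from_f(payload: bytes) -> list:
    """Decode a column-major binary payload back into a list of rows."""
    vals = struct.unpack("<6i", payload)
    return [[vals[c * N_ROWS + r] for c in range(N_COLS)] for r in range(N_ROWS)]


def table_from_ascii(payload: bytes) -> list:
    """Decode the ASCII payload back into a list of rows."""
    vals = [int(tok) for tok in payload.decode("ascii").split()]
    return [vals[r * N_COLS:(r + 1) * N_COLS] for r in range(N_ROWS)]


digests = {digest(c_bytes), digest(f_bytes), digest(ascii_bytes)}
decoded = [table_from_c(c_bytes), table_from_f(f_bytes), table_from_ascii(ascii_bytes)]

print("distinct digests:", len(digests))
print("all decoded tables equal:", all(t == ROWS for t in decoded))
```

Note that each decoder has to know the layout it is reading.  That
knowledge is the explicit mapping between elements that a digest alone
cannot supply.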

The alternative is to be able to verify that files hold scientifically
identical data by computing whether the alternatives have the same
values.  This means that establishing "authenticity" is roughly equivalent
to obtaining the abstract for a piece of property that lists all of the
transfers back to the original.  If there's doubt about the property
description, we do a survey to make sure the property fits the map
in the recorder of deeds office.  Or, to put it another way, we need
an algorithm that will allow us to verify that two files contain the
same scientific data.  If we can establish identity of values between
one file and another, then we can either trace that identity back by
a series of tests for equality until we arrive at a test on equality
with the original file the producer created - or, if that no longer
exists, create a very strong presumption of equality by finding
agreement with a large number of files we believe contain the same
values.  Note that in the latter case, the chain of tests for identity
is also not unique because we could test different collections of files
in two traces - and the equality tests might not be in the same order.
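
As a rough sketch of what such a trace could look like, assume we keep
a record of every pairwise test for value equality that has passed.
The file names and the list of tests below are hypothetical, and the
functions are only one way to set this up.  Treating each passed test
as a link between two files, a trace back to the original is a path
through those links, and there can be more than one.

```python
from collections import defaultdict

# Hypothetical record of pairwise value-equality tests that passed.
# Each pair says "these two files were tested and hold the same values".
VERIFIED_EQUAL = [
    ("original", "copy_a"),
    ("copy_a", "copy_b"),
    ("original", "copy_c"),
    ("copy_c", "copy_b"),
]


def build_graph(pairs):
    """Build an undirected graph: equality tests are symmetric."""
    graph = defaultdict(list)
    for left, right in pairs:
        graph[left].append(right)
        graph[right].append(left)
    return graph


def all_traces(graph, start, target, path=None):
    """Return every chain of passed tests linking start to target.

    Each chain visits a file at most once, so the search terminates.
    """
    path = (path or []) + [start]
    if start == target:
        return [path]
    traces = []
    for neighbor in graph.get(start, []):
        if neighbor not in path:
            traces.extend(all_traces(graph, neighbor, target, path))
    return traces


graph = build_graph(VERIFIED_EQUAL)
traces = all_traces(graph, "copy_b", "original")

for trace in traces:
    print(" = ".join(trace))
```

With this record, copy_b reaches the original through copy_a and also
through copy_c, so an audit can succeed along either chain.  If the
original were missing from the record, no trace would exist, and we
would be left with agreement among the surviving copies.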

I think the current paper is deficient in that it doesn't discuss how
to verify claims of authenticity, should the authority come into question.
This is important for being able to audit the chains of transformations
that an archive may need in order to keep data useable.

Bruce B.
----- Original Message -----
From: "Mark A. Parsons" <parsonsm at nsidc.org>
To: "Bruce Barkstrom" <brbarkstrom at gmail.com>
Cc: "ESIP Preservation cluster" <esip-preserve at rtpnet.org>
Sent: Thursday, October 7, 2010 7:36:41 AM
Subject: Re: [Esip-preserve] Stewardship Best Practices - Identifiers

I confess to not following this closely, but I believe Bruce is saying that currently it is impossible to have a truly unique identifier. Therefore, we should hold off on making any concrete recommendations on the use of current imperfect schemes. There is logic in this rationale, and with my personally conflicted feelings about the use of identifiers, I feel some empathy. But I am adamantly opposed to prohibiting baby steps of almost any kind. We have to test the waters. Citation has never been perfect but it has always been more useful than no citation. We cannot be so presumptuous as to think that we will figure out a problem that has existed since the dawn of science. All we can hope to do is steer the community in the right direction. I do not see how adopting certain identifiers for certain purposes leads us to a precipice or costs us more than applying no identifier. We may make some false starts that cost us in the long term, but I can't see how it will cost us more than doing nothing. Nor do I see how it increases the risk of data loss. At least we're recording things in a formal way for later consideration in the "perfect" system.

-m.

On 6 Oct 2010, at 5:59 PM, Bruce Barkstrom wrote:

> From my perspective, the issue is whether the "baby steps"
> eventually lead us over a precipice that costs us much more
> later and exposes us to extraordinary dangers of losing
> information.  I think that's demonstrably the case.
>  
> Furthermore, I think the mapping approach that you're already
> exploring with the HDF Group is a much more viable approach,
> despite the limitations of that work to HDF files.  The suggestion
> I'm making is not as extreme as the one you are pushing us
> into with the notion of "unique identifiers".
>  
> Bruce B.
> On Wed, Oct 6, 2010 at 4:15 PM, Ruth Duerr <rduerr at nsidc.org> wrote:
> Hi Bruce,
> 
> I think you are still confusing two definitions of identity.  The first is whether this file is the same file (format, content, etc.) as that file.  The second is whether this data is the same as that data.  As I've noted innumerable times now, the two cases are distinct and both are important.  In fact, that is exactly the distinction between Use Case 1 and Use Case 3.
> 
> The fact that we don't have an identifier solution for Use Case 3 just makes the fact that we can do Use Case 1 more important.  Yes, it isn't a perfect solution - but it is better than no solution at all.
> 
> I'd rather take baby steps and accomplish something meaningful and useful than take no steps at all.
> 
> Ruth
> 
> On Oct 6, 2010, at 2:05 PM, alicebarkstrom at frontier.com wrote:
> 
> > Owing to e-mail confusions this morning, I'm going
> > to repeat a message concerning "unique identifiers":
> >
> > I now believe I've got a proof that under a fairly
> > stringent definition of scientific data equality,
> > collections of data are not unique in the sense that
> > one can identify a single bit-by-bit replication of
> > collection content.  In other words, there is a
> > computable definition of scientific data equality
> > such that two files (or collections) have identical
> > scientific content, but cannot have a cryptographic
> > digest metric of bit-by-bit identity.  There are at
> > least three sources of difficulty:
> >
> > 1.  Differences in the formatting of individual data
> > values, e.g. a byte value may be identical with an
> > int value and may be identical with a long int as well
> > as an ASCII character string that spells out the integer
> > value.
> >
> > 2.  Differences in the serialization of arrays or lists
> > of values, as in C array order versus FORTRAN order, where
> > the values are the same but the order in which the values
> > appear differs.  A rearrangement of the rows in a relational
> > database table would have the same effect.
> >
> > Because a cryptographic digest depends on the order of bits, such
> > rearrangements will in practice create different
> > digests, although the number of data elements is the same.
> > Note that this case requires an explicit mapping that can
> > identify which element in the second collection corresponds
> > to an element in the first file, so that corresponding
> > elements can be tested for equality.
> >
> > This concern over rearrangements also applies to cases in which some parts
> > of a data collection are placed in different files.  Such
> > a case occurs when geolocation data (lat, long, and altitude)
> > are placed in one file, while observations are placed in
> > another.  That happens with MOD 02 (which contains geolocation
> > data for the 1 km MODIS channels), while the calibrated
> > spectral radiances are placed in one or more other files.
> > It also happens with the monthly average precipitation
> > data from NOAA's Global Historical Climate Network data.
> > I have created a single file that keeps the geolocation
> > and the precip records for a station in the equivalent of
> > a single relational database record, although the original
> > data were in separate files.  I also converted the ASCII
> > five-character integer values for precipitation into
> > double precision floats and put all of the years of precip
> > into one array - without affecting the numerical values
> > to any perceptible degree.
> >
> > 3.  There may be tacit data in such objects as array bin
> > values that act as implicit numerical values.  If these
> > tacit data are converted to explicit values, two data
> > collections will differ in their bit-by-bit values, although
> > they contain identical scientific data.
> >
> > The identifiers paper does contain wording that suggests
> > that use case 1 is based on bit-by-bit replication of values
> > to determine uniqueness.  I believe this is not supportable.
> > Scientific identity creates equivalence classes, not unique
> > objects.
> >
> > There are related problems with the notion of authenticity
> > in the paper, although I won't discuss these in the same
> > level of detail now.
> >
> > I think this situation requires a fairly substantive
> > discussion by our working group.
> >
> > Bruce B.
> >
> > ----- Original Message -----
> > From: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
> > To: "ESIP Preservation cluster" <esip-preserve at rtpnet.org>
> > Sent: Wednesday, October 6, 2010 7:25:58 AM
> > Subject: [Esip-preserve] Stewardship Best Practices
> >
> > Looking over this a bit:
> >
> > http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Principles
> >
> > Here are some, perhaps controversial, proposals for addition:
> >
> > Data Creators:
> >
> > will release all software used in the processing of data used in
> > scientific research. Even if the rights to use that software are
> > restricted it should at least be available for inspection.
> >
> > will strive for portability in software used in the processing of data
> > used in scientific research to enable independent verification and
> > reproducibility of results.
> >
> > Data Intermediaries:
> >
> > will assign persistent resolvable identifiers for data. (link to
> > identifiers paper)
> >
> > will maintain metadata for cited datasets to preserve the integrity of
> > persistent identifiers, even if the data themselves are deleted due to
> > obsolescence.
> >
> > Data Users:
> >
> >   "will follow any restrictions on redistribution of data that were
> >    indicated by the data intermediaries."
> >
> > add software to that one:
> >
> > will follow any restrictions on usage of software or redistribution of
> > data that were indicated by the data intermediaries.
> >
> > Curt
> > _______________________________________________
> > Esip-preserve mailing list
> > Esip-preserve at lists.esipfed.org
> > http://www.lists.esipfed.org/mailman/listinfo/esip-preserve