[Esip-preserve] Stewardship Best Practices - Identifiers

Wed Oct 6 16:15:28 EDT 2010

Hi Bruce,

I think you are still confusing two definitions of identity.  The first is whether this file is the same file (format, content, etc.) as that file.  The second is whether this data is the same as that data.  As I've noted innumerable times now, the two cases are distinct and both are important.  That is exactly the distinction between Use Case 1 and Use Case 3.

The fact that we don't have an identifier solution for Use Case 3 just makes it more important that we can do Use Case 1.  Yes, it isn't a perfect solution - but it is better than no solution at all.  

I'd rather take baby steps and accomplish something meaningful and useful than take no steps at all.

Ruth

On Oct 6, 2010, at 2:05 PM, alicebarkstrom at frontier.com wrote:

> Owing to e-mail confusions this morning, I'm going
> to repeat a message concerning "unique identifiers":
> 
> I now believe I've got a proof that under a fairly
> stringent definition of scientific data equality,
> collections of data are not unique in the sense that
> one can identify a single bit-by-bit replication of
> collection content.  In other words, there is a 
> computable definition of scientific data equality
> such that two files (or collections) have identical
> scientific content, but cannot have a cryptographic
> digest metric of bit-by-bit identity.  There are at
> least three sources of difficulty:
> 
> 1.  Differences in the formatting of individual data
> values, e.g. a byte value may be identical with an 
> int value and may be identical with a long int as well
> as an ASCII character string that spells out the integer
> value.
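
A minimal sketch of point 1, assuming SHA-256 as the digest and little-endian packing (neither choice comes from the message above; the value 42 and all names are made up for illustration).  It encodes one integer as a byte, a 4-byte int, an 8-byte long, and an ASCII string, then checks that every encoding decodes to the same number while the digests all differ.

```python
import hashlib
import struct

value = 42  # illustrative value only

# Four encodings of the same integer (little-endian where byte order applies).
encodings = {
    "byte": struct.pack("<B", value),     # 1 byte
    "int32": struct.pack("<i", value),    # 4 bytes
    "int64": struct.pack("<q", value),    # 8 bytes
    "ascii": str(value).encode("ascii"),  # the characters "42"
}

# Bit-by-bit identity: one digest per encoding.
digests = {name: hashlib.sha256(raw).hexdigest() for name, raw in encodings.items()}


def decode(name, raw):
    """Recover the integer from one of the encodings above."""
    if name == "ascii":
        return int(raw.decode("ascii"))
    fmt = {"byte": "<B", "int32": "<i", "int64": "<q"}[name]
    return struct.unpack(fmt, raw)[0]


# Scientific identity: the decoded numbers.
decoded = {name: decode(name, raw) for name, raw in encodings.items()}
```

Comparing `decoded` shows one value; comparing `digests` shows four different strings.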
> 
> 2.  Differences in the serialization of arrays or lists
> of values, as in C array order versus FORTRAN order, where
> the values are the same but the order in which the values
> appear differs.  A rearrangement of the rows in a relational
> database table would have the same effect.  
> 
> Because a cryptographic digest depends on the order of bits, such
> rearrangements will in practice create different 
> digests, although the number of data elements is the same.
> Note that this case requires an explicit mapping that can
> identify which element in the second collection corresponds
> to an element in the first collection, so that corresponding
> elements can be tested for equality.  
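
The serialization case in point 2 can be sketched as follows.  This is only an illustration under stated assumptions: a made-up 2 x 3 array, SHA-256 as the digest, and 8-byte little-endian doubles as the on-disk form.  The two index functions play the role of the explicit mapping between corresponding elements.

```python
import hashlib
import struct

rows, cols = 2, 3
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]  # illustrative values only

# C (row-major) order: the last index varies fastest.
c_order = [matrix[i][j] for i in range(rows) for j in range(cols)]
# FORTRAN (column-major) order: the first index varies fastest.
f_order = [matrix[i][j] for j in range(cols) for i in range(rows)]


def digest(values):
    """SHA-256 over the values packed as little-endian doubles."""
    return hashlib.sha256(struct.pack(f"<{len(values)}d", *values)).hexdigest()


def c_index(i, j):
    """Position of element (i, j) in the row-major stream."""
    return i * cols + j


def f_index(i, j):
    """Position of element (i, j) in the column-major stream."""
    return j * rows + i


# Element-by-element comparison through the explicit mapping.
same_elements = all(
    c_order[c_index(i, j)] == f_order[f_index(i, j)]
    for i in range(rows)
    for j in range(cols)
)
```

Without `c_index` and `f_index` there is no way to know which positions to compare, which is why the mapping has to be recorded somewhere alongside the data.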
> 
> This concern over rearrangements also applies to cases in which some parts
> of a data collection are placed in different files.  Such
> a case occurs when geolocation data (lat, long, and altitude)
> are placed in one file, while observations are placed in
> another.  That happens with MOD 03 (which contains geolocation
> data for the 1 km MODIS channels), while the calibrated
> spectral radiances are placed in one or more other files.
> It also happens with the monthly average precipitation
> data from NOAA's Global Historical Climatology Network.
> I have created a single file that keeps the geolocation
> and the precip records for a station in the equivalent of
> a single relational database record, although the original
> data were in separate files.  I also converted the ASCII
> five-character integer values for precipitation into 
> double precision floats and put all of the years of precip
> into one array - without affecting the numerical values
> to any perceptible degree.
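
A rough sketch of the restructuring described above, with entirely hypothetical inputs: the station ID, coordinates, and precipitation fields are invented and are not GHCN data, and the two dictionaries merely stand in for the two original files.  It merges the pieces into one record, converts the five-character ASCII integers to doubles, and checks that the numbers survive while the bytes do not.

```python
import hashlib
import struct

# Hypothetical stand-ins for two separate files, keyed by a made-up station ID.
geolocation_file = {"STN001": (35.5, -78.9, 120.0)}    # lat, lon, altitude
precip_file = {"STN001": ["  123", "   45", "    0"]}  # five-character ASCII integers


def merge(station_id):
    """Combine geolocation and precipitation into one record, as doubles."""
    lat, lon, alt = geolocation_file[station_id]
    precip = [float(field) for field in precip_file[station_id]]
    return {"id": station_id, "lat": lat, "lon": lon, "alt": alt, "precip": precip}


record = merge("STN001")

# Byte content before and after the conversion.
original_bytes = "".join(precip_file["STN001"]).encode("ascii")
merged_bytes = struct.pack(f"<{len(record['precip'])}d", *record["precip"])

# The numerical values are unchanged...
values_preserved = (
    [int(field) for field in precip_file["STN001"]]
    == [int(v) for v in record["precip"]]
)
# ...but a bit-level digest no longer matches.
digests_differ = (
    hashlib.sha256(original_bytes).digest() != hashlib.sha256(merged_bytes).digest()
)
```

Five-digit integers are represented exactly by a double, so the conversion loses nothing numerically even though every byte changes.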
> 
> 3.  There may be tacit data in such objects as array bin
> values that act as implicit numerical values.  If these
> tacit data are converted to explicit values, two data
> collections will differ in their bit-by-bit values, although
> they contain identical scientific data.
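
One way to picture point 3, assuming a hypothetical histogram whose bin edges are implied by the array index plus a documented start and width (the numbers and layout are invented for illustration).  Writing the edges out explicitly changes the bytes but not the information.

```python
import hashlib
import struct

# Hypothetical histogram: only counts are stored.  The bin edges are tacit,
# implied by the array index together with a documented start and width.
start, width = 0.0, 0.5
counts = [3, 7, 2, 0]

implicit_bytes = struct.pack(f"<{len(counts)}i", *counts)

# The same histogram with each bin's lower edge written out explicitly.
explicit = [(start + i * width, count) for i, count in enumerate(counts)]
explicit_bytes = b"".join(struct.pack("<di", edge, count) for edge, count in explicit)

# Going back from the explicit form recovers the original counts and edges.
recovered_counts = [count for _, count in explicit]
recovered_edges = [edge for edge, _ in explicit]

implicit_digest = hashlib.sha256(implicit_bytes).hexdigest()
explicit_digest = hashlib.sha256(explicit_bytes).hexdigest()
```

The two digests differ even though either form can be reconstructed from the other given the start and width.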
> 
> The identifiers paper does contain wording that suggests
> that use case 1 is based on bit-by-bit replication of values
> to determine uniqueness.  I believe this is not supportable.
> Scientific identity creates equivalence classes, not unique
> objects.
> 
> There are related problems with the notion of authenticity
> in the paper, although I won't discuss these in the same
> level of detail now.
> 
> I think this situation requires a fairly substantive
> discussion by our working group.
> 
> Bruce B.
> 
> ----- Original Message -----
> From: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
> To: "ESIP Preservation cluster" <esip-preserve at rtpnet.org>
> Sent: Wednesday, October 6, 2010 7:25:58 AM
> Subject: [Esip-preserve] Stewardship Best Practices
> 
> Looking over this a bit:
> 
> http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Principles
> 
> Here are some, perhaps controversial, proposals for addition:
> 
> Data Creators:
> 
> will release all software used in the processing of data used in
> scientific research. Even if the rights to use that software are
> restricted it should at least be available for inspection.
> 
> will strive for portability in software used in the processing of data
> used in scientific research to enable independent verification and
> reproducibility of results.
> 
> Data Intermediaries:
> 
> will assign persistent resolvable identifiers for data. (link to
> identifiers paper)
> 
> will maintain metadata for cited datasets to preserve the integrity of
> persistent identifiers, even if the data themselves are deleted due to
> obsolescence.
> 
> Data Users:
> 
>   "will follow any restrictions on redistribution of data that were
>    indicated by the data intermediaries."
> 
> add software to that one:
> 
> will follow any restrictions on usage of software or redistribution of
> data that were indicated by the data intermediaries.
> 
> Curt
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve