[Esip-preserve] Stewardship Best Practices - Identifiers - UNF's

alicebarkstrom at frontier.com
Thu Oct 7 17:41:23 EDT 2010


I managed to find a copy of the paper on the UNF algorithm.  At its
core, it relies on reducing the resolution of a cryptographic digest
of the file after "canonicalizing" the order of the numeric elements.
I don't think this is sufficient to come up with a unique identifier.

There are two propositions to prove regarding the non-uniqueness of
identifiers for scientifically identical files.  The first statement
that needs proof is that

1.  Files with scientifically identical contents
can have different identifiers.

I think the examples from earlier this afternoon should be
understandable.  At the heart of these examples is proof by
counter-example - and we only need one counter-example to complete
the proof.  However, to do the formal proof, I expect it will be
necessary to move to symbolic logic.  For me, that probably won't make
the content more understandable - but it may provide some intellectual
satisfaction to individuals with more formal math skills (I'm just an
applied math person - I've seen people who practice number theory,
and I doubt I'll ever be competent at that kind of theorem proving).

Also note that the examples from earlier this afternoon have scientific
content that could be expressed within a single file or spread over two.
That means that we're not just dealing with identity of scientific
content in one file - but will have to deal with what is probably a
combinatorial expansion in the number of ways individuals group data.

An example of this is the earlier example that took the original GHCN
data (which had one year of data on each line) and rearranged it so that
four years of data were on a single line of text.  I did something very
similar, except I took the whole history of observations for a station
and assembled it into a single array (with about 2,400 elements),
meaning that the whole data collection had about 2,100 arrays of 2,400
elements each.  I could then sort each station's array and use a
Kolmogorov-Smirnov test to decide whether the precipitation statistics
were normally distributed (they weren't).  That would have been much
harder to do if I couldn't have rearranged the data.

I could also sort the lines of text by different criteria.  For example,
I might put the station with the smallest median rainfall first and the
station with the largest median rainfall last.  The content wouldn't
change, but a cryptographic digest of the file would be different.
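As a quick illustration of why reordering defeats a digest-based
identifier, here is a sketch in Python (the station names and rainfall
values are invented for the example):

```python
import hashlib

# Hypothetical station records: "station_id median_rainfall_mm"
records = ["ITHACA 914", "MT_SHASTA 978", "TUCSON 287"]

# Two orderings of the same scientific content.
by_name = sorted(records)
by_rainfall = sorted(records, key=lambda r: int(r.split()[1]))

digest_a = hashlib.sha256("\n".join(by_name).encode()).hexdigest()
digest_b = hashlib.sha256("\n".join(by_rainfall).encode()).hexdigest()

print(digest_a == digest_b)  # False: identical content, different digests
```

The two files contain exactly the same records, yet any byte-level
digest assigns them different identifiers.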

2.  Files with identical identifiers may have different scientific
content.

The easiest example of this proposition is probably to think of the
possible interpretations of a single byte's value.  Suppose the
bit pattern in the byte has the value '00000001'.  If this sequence
of bits is interpreted as an array of bit flags, it has a different
meaning than if it is interpreted as an ASCII character (SOH, or Start
of Heading).  If it is interpreted as a one-byte integer, it would
have a numeric value of 1.  Each of these interpretations would have
the same cryptographic digest.  To remove the ambiguity, the repository
would have to specify how a program that reads the file has to interpret
the input sequence of bits.  That's where the CASPAR project's Representation
Network Registry begins to make sense.
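The single-byte ambiguity can be sketched directly; the three
interpretations below are the ones named above, nothing more:

```python
import hashlib
import struct

byte = b"\x01"  # the bit pattern 00000001

# Three interpretations of the same bits:
as_flags = [bool(byte[0] >> i & 1) for i in range(8)]  # bit flags: only flag 0 set
as_ascii = byte.decode("ascii")                        # SOH (Start of Heading) control character
as_int = struct.unpack("B", byte)[0]                   # unsigned one-byte integer: 1

# The digest sees only the bits, never the interpretation,
# so all three readings share one digest.
print(hashlib.sha256(byte).hexdigest())
```

Nothing in the digest distinguishes the three readings; that
information has to come from outside the file.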

In either of these propositions, you could replace equality of cryptographic
digests by a bit-by-bit comparison of the bit-array representation of the
files.  While the single byte value example is easy to see, I am pretty
sure the ambiguity extends to larger files or collections of files.

To summarize, the best I can come up with right now is that we can
find sample files for which we can computationally demonstrate that
their contents are scientifically identical.  I believe mathematicians
who do formal proofs would assign all of the files that satisfy this
property to an "equivalence class".  The class could be assigned a
unique identifier that would stick.  In practical terms, it means that
one could demonstrate that two files belong to the same class by
computing whether they have scientifically identical contents.  As a
result, one can assemble a chain of comparisons that will tie together
members of the class.  If you think about this a bit, it lends itself
to the notion that you could treat identity as a very high probability
that a file belongs to the class, because it's been through enough
comparisons to make it nearly impossible to believe it isn't a member.
This kind of approach suggests the model of the county recorder of
deeds, who keeps surveying records as well as ownership records.  If
there's a dispute, doing a new survey should establish the boundaries,
just as a comparison can establish whether two files are scientifically
identical.  The cost of this kind of proof is outside the scope of this
e-mail - although it seems to me that it is very similar to what LOCKSS
does with its voting.
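The comparison-based membership test can be sketched as follows.  The
canonicalization rule here (sort the numeric values) is only an assumed
stand-in for whatever definition of "scientifically identical" a
repository would actually publish:

```python
def canonical(values):
    # Order-independent representation of a file's numeric content.
    # The sorting rule is an assumption for illustration only.
    return sorted(values)

def scientifically_identical(file_a, file_b):
    # Two files are in the same equivalence class if their
    # canonical forms match.
    return canonical(file_a) == canonical(file_b)

file_1 = [914, 287, 978]   # original order
file_2 = [287, 914, 978]   # same values, rearranged
file_3 = [287, 914, 999]   # genuinely different content

print(scientifically_identical(file_1, file_2))  # True
print(scientifically_identical(file_1, file_3))  # False
```

A chain of such pairwise comparisons is what ties the members of the
class together, without privileging any one member as "the" original.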

Hope this adds some clarity to my point of view - although I'll certainly
admit to it being a pretty abstract one.

Bruce B.
----- Original Message -----
From: "Mark A. Parsons" <parsonsm at nsidc.org>
To: "Bruce Barkstrom" <brbarkstrom at gmail.com>
Cc: esip-preserve at lists.esipfed.org
Sent: Thursday, October 7, 2010 3:56:03 PM
Subject: Re: [Esip-preserve] Stewardship Best Practices - Identifiers

Bruce,

I don't believe anyone denies your fundamental conclusions. The question is what are their implications, especially in regards to the conclusions of the paper.

I would suggest that our final conclusion is that there is currently no fully robust and truly unique identification scheme, but there are some current practices that can imperfectly aid in ensuring scientific reproducibility. We suggest what may be seen as interim best practices that address three of our four use cases, notably data citation. Data Managers should adopt these practices while adhering to other best practices of data stewardship and being mindful of further research addressing mechanisms to trace and ensure data authenticity.

To make this conclusion, we need to beef up the discussion of use case 4 on scientific uniqueness, by including some of what you have been describing (perhaps with a reference to Barkstrom (in prep.)). We may also want to add an assessment of the Universal Numeric Fingerprint to our list.

I think if we make those changes, the rest of the paper stands (after some reorganization and editing Ruth and I have discussed).

Cheers,

-m. 

p.s. I was tickled that you picked my home town (Ithaca) and just outside my other childhood home (Mt. Shasta) for your data examples.


On 7 Oct 2010, at 12:35 PM, Bruce Barkstrom wrote:

> c.  As an interim approach to identifiers, I think it would be straightforward to
> use the current naming conventions and hook them up to some of the identifier
> schemas identified in the paper.  DOI's that resolve to file or collection names
> are probably as good as anything else if we're looking for simple identification.
> However, I do not think this kind of approach should be sold as a method of
> concocting permanent, unique identifiers.  At best, scientifically identical data
> collections form an equivalence class - with no special status for one unique
> member of the class.

_______________________________________________
Esip-preserve mailing list
Esip-preserve at lists.esipfed.org
http://www.lists.esipfed.org/mailman/listinfo/esip-preserve