[Esip-preserve] On Earth Science Data File Uniqueness

Mon Feb 7 08:13:27 EST 2011

On 01/28/11 16:02, Bruce Barkstrom wrote:
> While I'm not quite done with the paper that describes work I've
> been doing on file uniqueness, I think we should open a discussion
> about whether files that contain Earth science data, particularly
> numeric data, can have unique identifiers associated with their
> content.
> [ Lots of discussion about content identification ]

On 01/31/11 16:26, Ruth Duerr wrote:
> I tend to agree with your conclusions though not how you are stating
> them.  For example, a data file is an "object" and there is a point
> to giving it a "unique id" even though that says nothing about
> whether we have mechanisms to tell whether it is in the same
> "equivalence class" as another "object" a file in this case that
> contains the same content differently formatted.

We've been attacking the identifiers problem for a while from a few
different directions.  I think this is a major issue for the group,
and will continue to be so.  We've identified a few very specific,
very important use cases for identifiers, and proposed some schemes
and best practices for them.  I think the identifiers paper going out
the door now is an important first step, and will hopefully help us
move from current practice (the woefully inadequate citations and
current data identification illustrated in Ruth/Mark's EOS article and
the talks at recent meetings) to step 1.  Now we are looking at more
nuanced use cases.  For identifiers paper #2, we need to attack use
case #4 ("scientific equivalence"), and possibly look beyond.

I think Ruth/Bruce's discussions get to the heart of identification
and equivalence.  Objects in the same equivalence class can share an
identifier.  Objects are in many different equivalence classes.  We
need enough different identifiers to distinguish those equivalence
classes that are useful (i.e. fit into our Earth Science use cases).

Some discussion:

I capture some data and put it in a file.  I'll name it "a". Following
our recommendation to use unique global identifiers, I'll make a uuid
and assign it to "a" -- 83fce562-8838-4eb5-b0d1-41aa4c2f3bbd.
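
As a minimal sketch of that step (Python is an assumption here; nothing
in this thread prescribes a language), a random UUID of the kind shown
above could be minted with the standard uuid module.  The helper name
mint_identifier is hypothetical.

```python
import uuid

def mint_identifier() -> str:
    """Return a new random (version 4) UUID in its canonical string form."""
    return str(uuid.uuid4())

# The value differs on every run; it carries no information about the file.
file_a_id = mint_identifier()
print(file_a_id)
```

The identifier is assigned to the object, not derived from it, so it
says nothing by itself about the file's content.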

Provenance is important, so I'll make a record in my provenance store
(to be defined..), with some info about that object.  It has
assertions (facts) about that object saying who captured the data,
where it came from, etc.

Now I put that file in a directory "/mydata".  One identifier for that
object is the filename "a", which is a "local identifier".  Another
identifier is its path on my host: "/mydata/a".  Another identifier
might be a reference to the file on my web site:
"http://mysite/mydata/a".  Another identifier is its UUID.

It is important data so I make a backup over in "/backup/a".  Now I
have two files on my disk, "/mydata/a" and "/backup/a".  Those are two
different files, and have two different identifiers.  In one sense
there now exist two different objects.  In another sense, we are
still discussing just one object, with one UUID.

Now Al wants to take a look at my data, so he logs into my server or
whatever and downloads the file.  He puts it on his disk in
"/home/al/a".  It is still the same object with the same UUID, but it
now has another identifier.  The UUID makes it really easy for us to
maintain the "equivalence class" of that unique object no matter how
many copies of it get passed around.
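
One way to picture this, as a rough sketch only: a small registry that
maps each UUID to every location known to hold a copy.  The registry,
the function names, and the host label "al-host" are all hypothetical;
the paths and the UUID are the ones used in this message.

```python
from collections import defaultdict

# Maps a UUID to every location (path or URL) known to hold a copy.
locations = defaultdict(set)

def register_copy(object_id, location):
    """Record that a copy of the object lives at the given location."""
    locations[object_id].add(location)

def same_object(loc_1, loc_2):
    """True if both locations are registered under a common UUID."""
    return any(loc_1 in locs and loc_2 in locs for locs in locations.values())

A_ID = "83fce562-8838-4eb5-b0d1-41aa4c2f3bbd"
register_copy(A_ID, "/mydata/a")
register_copy(A_ID, "/backup/a")
register_copy(A_ID, "http://mysite/mydata/a")
register_copy(A_ID, "al-host:/home/al/a")  # "al-host" is a made-up label

print(same_object("/mydata/a", "al-host:/home/al/a"))  # True
```

The many local identifiers collapse to one equivalence class simply
because they were all registered under the same UUID.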

Provenance is important to Al too, so he enters some records into
his provenance store.  It says things like "UUID
83fce562-8838-4eb5-b0d1-41aa4c2f3bbd was transferred from host 'mysite'
on a certain date/time by a certain agent", etc.  Provenance for that
object from his perspective is different from provenance from my
perspective.
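
To illustrate the two perspectives, here is a hypothetical sketch in
which each of us keeps our own store of assertions keyed by the same
UUID.  The provenance store is still "to be defined", so the class, the
field names, and the values are all invented for illustration.

```python
class ProvenanceStore:
    """A toy store: UUID -> list of assertions (plain dicts)."""

    def __init__(self, owner):
        self.owner = owner
        self.records = {}

    def assert_fact(self, object_id, **fact):
        """Append one assertion about the object with the given UUID."""
        self.records.setdefault(object_id, []).append(fact)

A_ID = "83fce562-8838-4eb5-b0d1-41aa4c2f3bbd"

# My view: I captured the data.
mine = ProvenanceStore(owner="curt")
mine.assert_fact(A_ID, event="captured", agent="curt")

# Al's view: he transferred a copy from my host.
als = ProvenanceStore(owner="al")
als.assert_fact(A_ID, event="transferred", source_host="mysite", agent="al")

print(mine.records[A_ID])
print(als.records[A_ID])
```

Both stores talk about the same object, but the assertions they hold
about it differ.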

Ok, let's do some processing.  I apply algorithm X, version 1.0 to
file "a", creating file "b", which I'll assign UUID
177eda2f-ec0a-4762-a525-4f9d25701127.

Al thinks that was kind of interesting, so he gets a copy of the same
version of the same algorithm and applies it to his copy of "a",
creating file "c", which he assigns UUID
7de7b0ad-a5e1-45c8-bd82-020795d79ebc.

Now one of a few things could have happened:

1. Our compilers/compiler options/library
   versions/environment/etc. were close enough that he reproduced the
   process identically and got the same content, bit-for-bit.

2. Our compilers/compiler options/library
   versions/environment/etc. were close enough that he mostly
   reproduced the process: the answers were slightly off, but still
   within the right error bars, or within the same "probability
   distribution" or whatever.  The content is scientifically
   equivalent.  The two files could be used in the same scientific
   analysis and would yield the same conclusions.

3. Something in the compilers/compiler options/library
   versions/environment/etc. was off enough that the content of "b"
   and "c" is not scientifically equivalent.

Case #1 is the easy one.  I can compare digital signatures (checksums)
of the two files and easily determine whether that is the case.
Unfortunately, that is also the rarest case and occurs so seldom in
the real world it probably isn't even worth considering.
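
A minimal sketch of the bit-for-bit check, assuming SHA-256 as the
digest (the choice of algorithm is an assumption; any strong checksum
would serve).  The function names are hypothetical, and the temporary
files only stand in for "b" and "c".

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def bit_identical(path_1, path_2):
    """True if the two files have the same digest."""
    return file_digest(path_1) == file_digest(path_2)

# Demo with throwaway files standing in for "b" and "c".
with tempfile.TemporaryDirectory() as tmp:
    b_path = os.path.join(tmp, "b")
    c_path = os.path.join(tmp, "c")
    for path in (b_path, c_path):
        with open(path, "wb") as f:
            f.write(b"1.0 2.5 300.0\n")
    same = bit_identical(b_path, c_path)
    with open(c_path, "ab") as f:
        f.write(b"\n")  # a single extra byte changes the digest
    changed = bit_identical(b_path, c_path)

print(same, changed)  # True False
```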

Case #2 vs. #3 is harder.  It seems to take long, expensive manual
analysis to determine which is the case.  What we usually do is assume
case 2, until something weird pops up, then we consider the
possibility of case 3.
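
An automated check cannot replace that analysis, but a crude first
screen is possible.  The sketch below compares two sequences of values
element by element against a tolerance.  The function name and the
tolerance values are assumptions chosen for illustration; a real
threshold would have to come from the error bars of the data.

```python
import math

def within_tolerance(b_values, c_values, rel_tol=1e-6, abs_tol=0.0):
    """True if both sequences have equal length and every pair of
    corresponding values agrees within the given tolerances."""
    if len(b_values) != len(c_values):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(b_values, c_values)
    )

b = [1.0, 2.5, 300.0]
c = [1.0000001, 2.5, 299.99999]  # slightly off, e.g. different compiler
d = [1.0, 2.6, 300.0]            # clearly different

print(within_tolerance(b, c))  # True
print(within_tolerance(b, d))  # False
```

Passing this screen does not establish scientific equivalence; failing
it is a hint that case #3 deserves a look.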

Let's assume that the process is reproducible.  That means that I can
convey sufficient details about the provenance of "b" that someone
else can perform the same process and get an answer that is equivalent
to mine.  (If I can't convey sufficient details about the provenance
so that someone can make an equivalent file, then the process is not
reproducible.)

What we are looking for is an identifier that can refer to an
equivalence class that includes both "b" and "c" (and also future
instances of that object).  (This is the use case from the identifiers
paper for which we have no scheme.)

We've discussed two approaches to coming up with that identifier:

1. Content Equivalence

   This is the Altman UNF approach, and some of Bruce's work.  Is
   there an algorithm that can consider the content of a file and come
   up with a unique identifier that will be the same for objects in
   the same "content equivalence class"?

2. Provenance Equivalence

   This is my approach.  It goes something like this:

   If a process is reproducible then I can convey sufficient creation
   provenance details to someone else to make an equivalent file.

   If someone follows those provenance details to re-create the
   object, the resulting object will be equivalent to the one I made.
   (If that isn't true, then my process isn't reproducible.)

   If I can enumerate/list sufficient creation provenance details to
   make an equivalent file, then I can describe an algorithm (similar
   to the content equivalence algorithm above -- take a digital
   signature of a canonical representation of the information) to
   produce an identifier that will be the same for files that match
   those provenance details.
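
The provenance approach can be sketched roughly as follows, assuming
the creation details fit in a simple dictionary, that JSON with sorted
keys is an acceptable canonical representation, and that SHA-256 is the
digest.  All three are assumptions, as are the field names.

```python
import hashlib
import json

def provenance_id(details):
    """Hash a canonical JSON rendering of the creation provenance."""
    canonical = json.dumps(details, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

A_ID = "83fce562-8838-4eb5-b0d1-41aa4c2f3bbd"

# How I made "b" and how Al made "c": same algorithm, version, and input.
mine = {"algorithm": "X", "version": "1.0", "inputs": [A_ID]}
als = {"inputs": [A_ID], "version": "1.0", "algorithm": "X"}

# A later run with a different algorithm version.
other = {"algorithm": "X", "version": "1.1", "inputs": [A_ID]}

print(provenance_id(mine) == provenance_id(als))    # True
print(provenance_id(mine) == provenance_id(other))  # False
```

The identifier is only as good as the list of details: anything that
affects the result but is left out of the dictionary will not be
reflected in it.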

I think both of these approaches are interesting and useful.

There is also the question of format changes.  For some purposes, the
format is important.  (If my program wants to read an HDF file, it
won't do me any good to feed it an ASCII file with the same content.)
For others, the content is important, while the format is irrelevant.

That "format irrelevant content equivalence" is what Altman tried to
solve for his community with UNF.
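
To make the idea concrete, here is a simplified illustration.  It is
not the UNF algorithm, only a sketch in the same spirit: read the
numbers out of whatever format they arrive in, render each one in a
fixed notation with a fixed number of significant digits, and hash the
result.  The function names, the 7-digit default, and the sample data
are assumptions.

```python
import csv
import hashlib
import io
import json

def canonical_number(x, digits=7):
    """Render a number in scientific notation with a fixed precision."""
    return format(float(x), f".{digits - 1}e")

def content_id(values, digits=7):
    """Hash the canonical rendering of a sequence of numbers."""
    canonical = "\n".join(canonical_number(v, digits) for v in values)
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()

# The same three values, written in two different formats.
csv_text = "1.5,2.25,300\n"
json_text = "[1.5, 2.250, 3e2]"

from_csv = [float(v) for row in csv.reader(io.StringIO(csv_text)) for v in row]
from_json = json.loads(json_text)

print(content_id(from_csv) == content_id(from_json))  # True
```

A scheme like this handles format changes, but values that straddle a
rounding boundary can still land in different classes, so it only
partly addresses the "slightly off" case.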

I've described a few different types of equivalence classes:

1. Same UUID, different copies at my site.

2. Same UUID, different copies at different sites.

3. Same content, different UUID.

4. Equivalent, but slightly different content, different UUID.

Curt