[Esip-preserve] On Earth Science Data File Uniqueness

Mon Feb 7 10:45:12 EST 2011

Here's another wrinkle: if a UUID is formed in the field but no one
registers it someplace, what good is the identifier?  It's like a birth
certificate.  If there isn't some social recognition of whose family
produced the child, how do you get a family history?  [I was tempted
to create a story about a fellow who walks into a Western bar and
ends up being shot by the sheriff because he didn't have any identification
and got nasty about having to produce any.  Consult Knuth's Art of
Computer Programming for a version of the story produced by random
simulation.]

Also, I think we need to be a lot more sensitive to the kinds of "objects"
we're trying to identify.  I'll send a copy of a note I sent to Ruth last
week.

Nice write-up of the scenarios that go with this discussion.

Bruce B.

On Mon, Feb 7, 2011 at 8:13 AM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> On 01/28/11 16:02, Bruce Barkstrom wrote:
> > While I'm not quite done with the paper that describes work I've
> > been doing on file uniqueness, I think we should open a discussion
> > about whether files that contain Earth science data, particularly
> > numeric data, can have unique identifiers associated with their
> > content.
> > [ Lots of discussion about content identification ]
>
> On 01/31/11 16:26, Ruth Duerr wrote:
> > I tend to agree with your conclusions though not how you are stating
> > them.  For example, a data file is an "object" and there is a point
> > to giving it a "unique id" even though that says nothing about
> > whether we have mechanisms to tell whether it is in the same
> > "equivalence class" as another "object" (a file, in this case) that
> > contains the same content differently formatted.
>
>
>
> We've been attacking the identifiers problem for a while from a few
> different directions.  I think this is a major issue for the group,
> and will continue to be so.  We've identified a few very specific,
> very important use cases for identifiers, and proposed some schemes
> and best practices for them.  I think the identifiers paper going out
> the door now is an important first step, and will hopefully help us
> move from current practice (the woefully inadequate citations and
> current data identification illustrated in Ruth/Mark's EOS article and
> the talks at recent meetings) to step 1.  Now we are looking at more
> nuanced use cases.  For identifiers paper #2, we need to attack that
> use case #4 ("scientific equivalence"), and possibly look beyond.
>
>
> I think Ruth/Bruce's discussions get to the heart of identification
> and equivalence.  Objects in the same equivalence class can share an
> identifier.  Objects are in many different equivalence classes.  We
> need enough different identifiers to distinguish those equivalence
> classes that are useful (i.e. fit into our Earth Science use cases).
>
> Some discussion:
>
> I capture some data and put it in a file.  I'll name it "a". Following
> our recommendation to use unique global identifiers, I'll make a uuid
> and assign it to "a" -- 83fce562-8838-4eb5-b0d1-41aa4c2f3bbd.
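
A minimal sketch of minting such an identifier, assuming Python's standard uuid module (the message does not say which tool produced the UUIDs). The example UUID above has the form of a random, version 4 UUID, so that is what the sketch generates; the helper name new_identifier is hypothetical.

```python
import uuid

def new_identifier() -> str:
    """Return a random (version 4) UUID as a lowercase, hyphenated string."""
    return str(uuid.uuid4())

# Mint an identifier for a newly captured file such as "a".
file_id = new_identifier()
print(file_id)

# The UUID quoted in the message parses as a version 4 UUID.
example = uuid.UUID("83fce562-8838-4eb5-b0d1-41aa4c2f3bbd")
print(example.version)  # prints 4
```

Nothing here registers the identifier anywhere; it only creates it.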
>
> Provenance is important, so I'll make a record in my provenance store
> (to be defined..), with some info about that object.  It has
> assertions (facts) about that object saying who captured the data,
> where it came from, etc.
>
> Now I put that file in a directory "/mydata".  One identifier for that
> object is the filename "a", which is a "local identifier".  Another
> identifier is its path on my host: "/mydata/a".  Another identifier
> might be a reference to the file on my web site:
> "http://mysite/mydata/a".  Another identifier is its UUID.
>
> It is important data so I make a backup over in "/backup/a".  Now I
> have two files on my disk, "/mydata/a" and "/backup/a".  Those are two
> different files, and have two different identifiers.  In one sense
> there now exist two different objects.  In another sense, we are
> still discussing just one object, with one UUID.
>
> Now Al wants to take a look at my data, so he logs into my server or
> whatever and downloads the file.  He puts it on his disk in
> "/home/al/a".  It is still the same object with the same UUID, but it
> now has another identifier.  The UUID makes it really easy for us to
> maintain the "equivalence class" of that unique object no matter how
> many copies of it get passed around.
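
One way to picture that equivalence class is a small registry that maps a UUID to every location where a copy is known to live. This is only a sketch: the CopyRegistry class and the host name "al-host" are hypothetical, and a real registry would need to be shared between sites rather than held in one process.

```python
from collections import defaultdict

class CopyRegistry:
    """Track the known locations of each object, keyed by its UUID."""

    def __init__(self):
        self._locations = defaultdict(set)

    def register(self, object_id: str, location: str) -> None:
        """Record that a copy of object_id exists at location."""
        self._locations[object_id].add(location)

    def locations(self, object_id: str) -> list:
        """Return all known locations of object_id, sorted for stable output."""
        return sorted(self._locations.get(object_id, set()))

A_UUID = "83fce562-8838-4eb5-b0d1-41aa4c2f3bbd"

registry = CopyRegistry()
registry.register(A_UUID, "mysite:/mydata/a")    # the original
registry.register(A_UUID, "mysite:/backup/a")    # the backup
registry.register(A_UUID, "al-host:/home/al/a")  # Al's download (host name assumed)

for place in registry.locations(A_UUID):
    print(place)
```

Three paths, one UUID: the paths are local identifiers for copies, while the UUID names the object they all represent.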
>
> So provenance is important to Al too, so he enters some records into
> his provenance store.  They say things like "UUID
> 83fce562-8838-4eb5-b0d1-41aa4c2f3bbd was transferred from host 'mysite'
> on a certain date/time by a certain agent", etc.  Provenance from
> his perspective is different from provenance for that object from my
> perspective.
>
>
> Ok, let's do some processing.  I apply algorithm X, version 1.0 to
> file "a", creating file "b", which I'll assign UUID
> 177eda2f-ec0a-4762-a525-4f9d25701127.
>
> Al thinks that was kind of interesting, so he gets a copy of the same
> version of the same algorithm and applies it to his copy of "a",
> creating file "c", which he assigns UUID
> 7de7b0ad-a5e1-45c8-bd82-020795d79ebc.
>
>
> Now one of a few things could have happened:
>
> 1. Our compilers/compiler options/library
>   versions/environment/etc. were close enough that he reproduced the
>   process identically and got the same content, bit-for-bit.
>
> 2. Our compilers/compiler options/library
>   versions/environment/etc. were close enough that he mostly reproduced
>   the process: the answers were slightly off, but still within
>   the right error bars, or within the same "probability distribution"
>   or whatever.  The content is scientifically equivalent.  The two
>   files could be used in the same scientific analysis and would yield
>   the same conclusions.
>
> 3. Something in the compilers/compiler options/library
>   versions/environment/etc. was off enough that the content is not
>   scientifically equivalent.  "b" and "c" are not scientifically
>   equivalent.
>
>
> Case #1 is the easy one.  I can compare a digital signature of the two
> files and easily determine if that is the case.  Unfortunately, that
> is also the rarest case and occurs so seldom in the real world it
> probably isn't even worth considering.
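
The bit-for-bit check in Case #1 can be sketched as below. It assumes a SHA-256 digest from Python's hashlib stands in for the "digital signature" mentioned above (strictly a checksum, not a signed value); the function names and the throwaway demo files are hypothetical.

```python
import hashlib
import os
import tempfile

def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_bits(path_a: str, path_b: str) -> bool:
    """True when the two files have identical digests (Case #1)."""
    return file_digest(path_a) == file_digest(path_b)

# Demo with throwaway files standing in for "b" and "c".
with tempfile.TemporaryDirectory() as tmp:
    path_b = os.path.join(tmp, "b")
    path_c = os.path.join(tmp, "c")
    for path in (path_b, path_c):
        with open(path, "wb") as f:
            f.write(b"1.0 2.0 3.0\n")
    print(same_bits(path_b, path_c))  # prints True
```

A mismatch only rules out Case #1; it says nothing about whether the files fall under Case 2 or Case 3.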
>
> Case 2 vs. 3 is harder.  It seems to take long, expensive manual
> analysis to determine which is the case.  What we usually do is assume
> case 2, until something weird pops up, then we consider the
> possibility of case 3.
>
>
> Let's assume that the process is reproducible.  That means that I can
> convey sufficient details about the provenance of "b" that someone
> else can perform the same process and get an answer that is equivalent
> to mine.  (If I can't convey sufficient details about the provenance
> so that someone can make an equivalent file, then the process is not
> reproducible.)
>
> What we are looking for is an identifier that can refer to an
> equivalence class that includes both "b" and "c" (and also future
> instances of that object).  (This is the use case from the identifiers
> paper for which no scheme was proposed.)
>
> We've discussed two approaches to coming up with that identifier:
>
> 1. Content Equivalence
>
>   This is the Altman UNF approach, and some of Bruce's work.  Is
>   there an algorithm that can consider the content of a file and come
>   up with a unique identifier that will be the same for objects in
>   the same "content equivalence class"?
>
> 2. Provenance Equivalence
>
>   This is my approach.  It goes something like this:
>
>   If a process is reproducible then I can convey sufficient creation
>   provenance details to someone else to make an equivalent file.
>
>   If someone follows those provenance details to re-create the
>   object, the resulting object will be equivalent to the one I made.
>   (If that isn't true, then my process isn't reproducible.)
>
>   If I can enumerate/list sufficient creation provenance details to
>   make an equivalent file, then I can describe an algorithm (similar
>   to the content equivalence algorithm above -- take a digital
>   signature of a canonical representation of the information) to
>   produce an identifier that will be the same for files that match
>   those provenance details.
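
The provenance-equivalence idea can be sketched as follows, assuming the creation details can be written down as a simple record. The field names (algorithm, version, inputs) and the choice of canonical JSON plus SHA-256 are assumptions made for illustration; deciding which details are "sufficient" is the open question and is not settled here.

```python
import hashlib
import json

def provenance_id(details: dict) -> str:
    """Hash a canonical JSON form of the creation-provenance details."""
    canonical = json.dumps(details, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

A_UUID = "83fce562-8838-4eb5-b0d1-41aa4c2f3bbd"

# My record for "b" and Al's record for "c": same details, different key order.
mine = {"algorithm": "X", "version": "1.0", "inputs": [A_UUID]}
als = {"inputs": [A_UUID], "version": "1.0", "algorithm": "X"}

print(provenance_id(mine) == provenance_id(als))  # prints True

# A different algorithm version yields a different identifier.
later = {"algorithm": "X", "version": "1.1", "inputs": [A_UUID]}
print(provenance_id(mine) == provenance_id(later))  # prints False
```

Sorting the keys is what makes the representation canonical in this sketch, so two sites that list the same details in a different order still arrive at the same identifier.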
>
> I think both of these approaches are interesting and useful.
>
> There is also the question of format changes.  For some purposes, the
> format is important.  (If my program wants to read an HDF file, it
> won't do me any good to feed it an ASCII file with the same content.)
> For others, the content is important, while the format is irrelevant.
>
> That "format irrelevant content equivalence" is what Altman tried to
> solve for his community with UNF.
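
A rough sketch of a format-independent content fingerprint, in the spirit of that idea but not the actual UNF algorithm: each value is rounded to a fixed number of significant digits, written in one canonical text form, and the result is hashed. The function name and the default of 7 digits are assumptions.

```python
import hashlib

def content_fingerprint(values, digits: int = 7) -> str:
    """Hash values after normalizing each to `digits` significant digits."""
    h = hashlib.sha256()
    for v in values:
        # Canonical text form, e.g. 2.5 -> "2.500000e+00" for digits=7.
        text = format(float(v), f".{digits - 1}e")
        h.update(text.encode("ascii") + b"\n")
    return h.hexdigest()

# The same content as it might arrive from a binary file and from ASCII text.
from_binary = [1.0, 2.5, -3.25]
from_ascii = ["1.00000001", "2.5", "-3.25"]

print(content_fingerprint(from_binary) == content_fingerprint(from_ascii))  # prints True
```

The weakness of any rounding scheme is that two nearly equal values can fall on opposite sides of a rounding boundary and produce different fingerprints, so a match is meaningful but a mismatch does not prove the content is scientifically different.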
>
>
> I've described a few different types of equivalence classes:
>
> 1. Same UUID, different copies at my site.
>
> 2. Same UUID, different copies at different sites.
>
> 3. Same content, different UUID.
>
> 4. Equivalent, but slightly different content, different UUID.
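
The four classes above can be told apart mechanically only up to a point. The sketch below assumes each copy is described by a UUID, a site name, and a content digest (the FileCopy record and the placeholder digests are hypothetical); it can recognize classes 1 through 3, but for differing content it can only report that equivalence is undetermined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileCopy:
    uuid: str    # global identifier assigned to the object
    site: str    # where this copy lives
    digest: str  # content digest of this copy

def relation(a: FileCopy, b: FileCopy) -> str:
    """Name the relationship between two copies, as far as it can be decided."""
    if a.uuid == b.uuid:
        if a.site == b.site:
            return "same UUID, same site"
        return "same UUID, different sites"
    if a.digest == b.digest:
        return "same content, different UUID"
    return "different content, different UUID (equivalence undetermined)"

A = "83fce562-8838-4eb5-b0d1-41aa4c2f3bbd"
B = "177eda2f-ec0a-4762-a525-4f9d25701127"
C = "7de7b0ad-a5e1-45c8-bd82-020795d79ebc"

print(relation(FileCopy(A, "mysite", "d1"), FileCopy(A, "mysite", "d1")))
print(relation(FileCopy(A, "mysite", "d1"), FileCopy(A, "al-host", "d1")))
print(relation(FileCopy(B, "mysite", "d2"), FileCopy(C, "al-host", "d2")))
print(relation(FileCopy(B, "mysite", "d2"), FileCopy(C, "al-host", "d3")))
```

The last branch is where a content-based or provenance-based identifier would have to do the work that a plain digest cannot.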
>
> Curt
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>