[Esip-preserve] On Earth Science Data File Uniqueness

Curt Tilmes Curt.Tilmes at nasa.gov
Wed Feb 9 13:01:07 EST 2011


This will repeat some of what I've written previously, but I'll strive
to lay out a summary of my position/concept/use case/scenario as
clearly as I can.

Data Granules are among the data objects we need to manage.

Each data granule has certain characteristics:

1. Identity

2. Content

3. Provenance

   I break Provenance up into two sets that I call "essential" and
   "non-essential" which I define like this:

   a. essential - those aspects of provenance that, if matched for a
      reproducible process, will result in the creation of a
      scientifically equivalent data granule (things like how it was
      made -- algorithm, version of the input data, etc.)

   b. non-essential - other interesting details about the history of
      the object that don't contribute directly to the scientific
      equivalence of the granules (things like who made it and when
      they made it).

We are talking about at least three identifiers, representing
equivalence classes of data granules that share common
characteristics.

For identity, we need something globally unique that we can attach to
the object at creation, that will follow it forever, and that will
allow us to distinguish it from every other granule.  UUID has some
nice characteristics that make it a strong contender for this role.
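
As a minimal sketch of the identity idea, assuming Python and its
standard uuid module (the function name new_granule_id is
hypothetical, and a random version 4 UUID is only one of several UUID
variants that could be used):

```python
import uuid


def new_granule_id() -> str:
    """Mint a random (version 4) UUID to attach to a granule at creation."""
    return str(uuid.uuid4())


# Two granules get distinct identifiers even if their content and
# provenance happen to be identical.
granule_a = new_granule_id()
granule_b = new_granule_id()
print(granule_a)
print(granule_b)
```

The identifier says nothing about content or provenance; it only
distinguishes one granule object from every other.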

For content, we've talked about two concepts: exact content
equivalence and loose content equivalence.  We have really good ways
to identify exact content equivalence: hashes such as MD5 and SHA-1.
We have a start at identifying loose content equivalence with things
like Altman's UNF and Bruce's work.
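
To illustrate the two kinds of content identifier, here is a rough
sketch in Python using the standard hashlib module.  The function
names are hypothetical.  The "loose" function is NOT UNF or Bruce's
method; it is just an assumed stand-in that rounds numeric values to a
fixed number of significant digits before hashing, to show the general
idea.

```python
import hashlib


def exact_content_id(data: bytes, algorithm: str = "sha256") -> str:
    """Hash the raw bytes; any single-bit difference changes the result."""
    return hashlib.new(algorithm, data).hexdigest()


def loose_content_id(values, digits: int = 7) -> str:
    """Hash values after rounding each to `digits` significant digits.

    Illustrative only: a simple normalization, not an implementation
    of UNF or any other published scheme.
    """
    normalized = ",".join(f"{v:.{digits - 1}e}" for v in values)
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


# Values that differ only far below the chosen precision match loosely,
# even though their exact byte representations would not.
same = loose_content_id([1.0, 2.5]) == loose_content_id([1.0000000001, 2.5])
print(same)
```

Passing "md5" or "sha1" as the algorithm gives the hashes named above,
where the local Python build supports them.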

For provenance equivalence, I've proposed a couple of schemes for
capturing the "essential" provenance into a canonical serialization
and hashing that to make an identifier that will match for any
granules created in the same way.
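
A minimal sketch of the general approach, assuming Python, JSON with
sorted keys as the canonical serialization, and SHA-256 as the hash.
These choices, the function name provenance_id, and the field names in
the example records are all assumptions for illustration, not the
schemes I proposed.

```python
import hashlib
import json


def provenance_id(essential: dict) -> str:
    """Serialize essential provenance canonically, then hash it."""
    # Sorted keys and fixed separators make the serialization
    # independent of the order in which the fields were recorded.
    canonical = json.dumps(essential, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical records of the same process, written in different orders.
run_1 = {
    "algorithm": "example_retrieval",
    "algorithm_version": "1.2",
    "inputs": ["input-granule-A", "input-granule-B"],
}
run_2 = {
    "inputs": ["input-granule-A", "input-granule-B"],
    "algorithm_version": "1.2",
    "algorithm": "example_retrieval",
}

print(provenance_id(run_1) == provenance_id(run_2))
```

Non-essential provenance (who made it, when) is left out of the record
on purpose, so it cannot change the identifier.  Note that list order
still matters in this sketch, so a real scheme would need a rule for
ordering the inputs.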


What we really care about is scientific equivalence, but that's really
hard to determine without manpower/analysis/etc.  The real purpose of
both content equivalence and provenance equivalence is to serve as a
proxy for scientific equivalence: something we can calculate
mechanically that gives us confidence (not proof -- except in the case
of exact content equivalence, an exceedingly rare case) that we're
talking about the same granule.

Curt

