[Esip-preserve] [FOO] Scientific Equivalence

alicebarkstrom at frontier.com alicebarkstrom at frontier.com
Thu Oct 21 10:50:46 EDT 2010


I think we're substantially improving our concept
of what's going on.  Here's an additional wrinkle:

Over a long-term preservation scenario (say 200 years),
we will have serious technology changes.  Back when I 
started, we used 60-bit Control Data machines - which
don't exist now in any practical sense.  Our original
code was FORTRAN 77 - an unlikely language now.  I
believe we used FORTRAN 90 for CERES.  Now the team
has redone the code in C (or - God help us - C++),
with maybe a touch of Ada.  We're now grappling with
the move to multicore CPUs.  The large scale storage
still uses mag tape - but every three years the vendors
come out with a new drive that crams three times as much
storage in the same size tape cartridge - plus there's
linear drive technology.  As a rough approximation, 
assume that there's a serious technology upgrade to
an archive's contents every ten years.  Over two hundred
years, that means that the archive will be forced to
create twenty new versions of the data in order to avoid
loss due to hardware and software obsolescence.  Each of
those transformations is a new process that would need
to be tracked in the chain of provenance.  Given the
reality of organizational mortality and funding priority
changes, it seems to me that the most robust system
will have multiple data centers - each with its own
policy regarding updates - and with independent provenance
tracks.  Trying to keep the same format basically introduces
a "single point of failure" in the preservation scenario.
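
As a rough illustration of the arithmetic above, here is a minimal Python sketch of two data centers migrating on their own schedules, each keeping an independent provenance track. The class names, the "format-vN" labels, and the 7-year interval for the second center are hypothetical, chosen only to show how the number of tracked transformations grows.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationEvent:
    """One transformational migration recorded in a provenance track."""
    year: int
    from_format: str
    to_format: str


@dataclass
class ProvenanceTrack:
    """Independent chain of migration events kept by one data center."""
    data_center: str
    events: list = field(default_factory=list)

    def migrate(self, year, to_format):
        # The source format is whatever the previous migration produced.
        from_format = self.events[-1].to_format if self.events else "original"
        self.events.append(MigrationEvent(year, from_format, to_format))


def simulate(data_center, start_year, horizon_years, upgrade_interval):
    """Record one migration per technology upgrade over the whole horizon."""
    track = ProvenanceTrack(data_center)
    for n in range(1, horizon_years // upgrade_interval + 1):
        track.migrate(start_year + n * upgrade_interval, f"format-v{n}")
    return track


# Hypothetical policies: center A upgrades every 10 years, center B every 7.
track_a = simulate("center-A", 2010, 200, 10)
track_b = simulate("center-B", 2010, 200, 7)
print(len(track_a.events))  # 20
print(len(track_b.events))  # 28
```

The two tracks share a starting point but diverge immediately, so a failed or mistaken migration at one center does not propagate to the other.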

I'll grant that most of our experience as data producers
has occurred under the scenario in which we control the
data production.  However, for long-term preservation,
we really need to think about a situation in which the
production is distributed to people and organizations we
don't control, but who still need to do transformational
migration in order to preserve the information.

Incidentally, I think the tail end of your discussion is
quite useful in dealing with the core of this preservation
issue.

Bruce B.
----- Original Message -----
From: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
To: "ESIP Preservation cluster" <esip-preserve at rtpnet.org>
Sent: Wednesday, October 20, 2010 3:59:13 PM
Subject: Re: [Esip-preserve] [FOO] Scientific Equivalence

On 10/20/10 15:36, Bruce Barkstrom wrote:
> It is much more direct to identify corresponding data elements in
> each collection (noting that one data collection being compared may
> come in a number of files, whereas the other collection might be
> grouped in a single file), identify which elements constitute the
> ones to use for scientific identity, and compare the values
> directly.

Note that in part of my example, the archive produces the granule,
then produces the SEI, then deletes the granule.  Someone else later
attempts to reproduce the workflow captured within the provenance
information, obtaining a distinct granule, with a distinct granule
identifier and distinct provenance, but with the same SEI.

The granules themselves never existed at the same time, and their
values couldn't have been compared.
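
A minimal Python sketch of that scenario follows. The thread does not spell out how the SEI is computed, so this assumes it is a hash over a chosen subset of the provenance; the field names, the SEI_FIELDS selection, and the use of SHA-256 are all hypothetical.

```python
import hashlib
import json

# Hypothetical choice of provenance fields that feed the SEI. Fields such as
# the granule identifier and production time are deliberately left out.
SEI_FIELDS = ("algorithm", "algorithm_version", "inputs")


def compute_sei(provenance):
    """Hash a canonical serialization of the selected provenance fields."""
    subset = {key: provenance[key] for key in SEI_FIELDS}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The archive produces a granule, stores its SEI, then deletes the granule.
original = {
    "granule_id": "G-0001",
    "produced": "2010-10-20T15:59:13Z",
    "algorithm": "example-alg",
    "algorithm_version": "1.0",
    "inputs": ["input-A", "input-B"],
}
stored_sei = compute_sei(original)

# Someone later reruns the captured workflow and gets a distinct granule.
rerun = dict(original, granule_id="G-0002", produced="2012-03-01T00:00:00Z")

print(original["granule_id"] != rerun["granule_id"])
print(compute_sei(rerun) == stored_sei)
```

Only the stored SEI has to survive for the later comparison to work; the original granule's values are never needed.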


> There may be a philosophical issue lurking here: do we require the
> same provenance to get two measurements to have scientifically
> identical values?  I suspect not, but that's a question that
> requires a bit more deliberate thinking.  Clearly, human beings with
> exactly the same provenance must be identical (indeed, that
> identifies a unique individual).

I think I need a better term for what I am doing.

"Scientific Equivalence" I think should be reserved for a comparison
of content (like you are doing) or the fingerprinting of content (like
that other guy).

Cavanaugh and Graham [1] break down equivalence into three terms:

Exact Equivalence - exactly the same content (what I prefer to call
identical)

Strong Provenance Equivalence - things where all provenance
information is the same.  With my definition of provenance, as you
point out above, it is impossible to have two entities with identical
total provenance that aren't exactly equal.

Weak Provenance Equivalence - Some specified subset of provenance
matches.  This is what I am doing.
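
The three terms can be sketched as simple predicates in Python. This follows only the summary given above, not the paper's formal definitions; the record layout (a dict with "content" and "provenance") and the key names are hypothetical.

```python
def exactly_equivalent(a, b):
    """Exact equivalence: the content itself is identical."""
    return a["content"] == b["content"]


def strongly_provenance_equivalent(a, b):
    """Strong provenance equivalence: all provenance information matches."""
    return a["provenance"] == b["provenance"]


def weakly_provenance_equivalent(a, b, keys):
    """Weak provenance equivalence: a specified subset of provenance matches."""
    pa, pb = a["provenance"], b["provenance"]
    return all(k in pa and k in pb and pa[k] == pb[k] for k in keys)


# Two hypothetical granules from separate runs of the same workflow.
first = {
    "content": b"\x00\x01\x02",
    "provenance": {"granule_id": "G-0001", "algorithm_version": "1.0"},
}
second = {
    "content": b"\x00\x01\x02",
    "provenance": {"granule_id": "G-0002", "algorithm_version": "1.0"},
}

print(exactly_equivalent(first, second))
print(strongly_provenance_equivalent(first, second))
print(weakly_provenance_equivalent(first, second, ["algorithm_version"]))
```

The weak test is the only one that works without the content in hand, which is why it fits the case where the first granule has already been deleted.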

> However, we can affirm that two measurements with very different
> provenance chains are scientifically identical if we have a
> trustworthy method of comparison.  Perhaps we can call that
> measurement validation.

I agree.  If two independent satellites measured some geophysical
property with two different instruments operating in two different
manners using two different algorithms, and each were independently
validated against a third (known better) dataset, we could still say
that those datasets were scientifically equivalent.  As you point out,
that is a totally different concept than what I am trying to
accomplish.
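
For contrast, a minimal Python sketch of that validation idea: two datasets with unrelated provenance are each compared, element by element, against a reference. The sample values, the 0.5 tolerance, and the function name are hypothetical; a real validation would use whatever comparison method the community trusts.

```python
def validated_against(measured, reference, tolerance):
    """True if every corresponding element is within tolerance of the reference."""
    if len(measured) != len(reference):
        return False
    return all(abs(m - r) <= tolerance for m, r in zip(measured, reference))


# Hypothetical reference ("known better") values and two independent datasets.
reference = [280.0, 285.0, 290.0]
satellite_a = [280.2, 284.9, 290.1]
satellite_b = [279.8, 285.3, 289.9]

TOLERANCE = 0.5  # assumed acceptable difference, in the same units as the data

both_validated = (
    validated_against(satellite_a, reference, TOLERANCE)
    and validated_against(satellite_b, reference, TOLERANCE)
)
print(both_validated)
```

No provenance appears anywhere in this check, which is the point: it compares values, while the SEI approach compares how the values were produced.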

Curt

[1] "Apples and Apple-shaped Oranges: Equivalence of Data Returned on
Subsequent Queries with Provenance Information",
http://people.cs.uchicago.edu/~yongzh/papers/apples-oranges.ps