[Esip-preserve] Possible Workaround for data identity non-uniqueness?

Wed Oct 13 12:18:00 EDT 2010

A third use case that is important for preservation:

3.  The original file is lost or unreadable, but I've got copies
     in a variant format and with rearrangements.  I've also got
     mappings that we believe demonstrate scientific equivalence
     between the original and one or more of these copies.  In this
     case, we might not be able to recreate an exact copy of the
     original file, because the process of demonstrating membership
     in the same scientific equivalence class (SEC) did not preserve
     some elements of the original file.  For example, the original
     was created in FORTRAN, while the only copy we've got is in
     HDF and contains array dimensions that weren't in the original
     FORTRAN.

     This case can be extended to multiple mappings, where the
     intermediate copies are also lost or unreadable.  Maybe the third
     copy is in XML, and the tags were not part of either the first or
     second copies.  Note that if the chain of copies gets long
     enough, there are likely to be multiple paths by which to
     establish the equivalence.  Equivalence may then become a
     stochastic variable, where we ask "how many comparisons do I have
     to make to establish that the probability of my not having an
     authentic copy is below some threshold T?"
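
To make that question concrete, here is a minimal sketch under a
deliberately simple model that is my assumption, not an established
method: each comparison is independent, and a non-authentic copy slips
past any single comparison with the same probability p.  It then
passes all n comparisons with probability p**n, so we want the
smallest n with p**n <= T.  The function name and both parameters are
hypothetical.  Comparisons along different mapping paths are unlikely
to be truly independent, so treat the result as optimistic.

```python
import math


def comparisons_needed(miss_probability: float, threshold: float) -> int:
    """Smallest n such that miss_probability ** n <= threshold.

    Assumes (hypothetically) independent comparisons that each fail to
    detect a non-authentic copy with the same probability.
    """
    if not 0.0 < miss_probability < 1.0:
        raise ValueError("miss_probability must be strictly between 0 and 1")
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    # p**n <= T  <=>  n >= log(T) / log(p), since log(p) is negative.
    n = math.ceil(math.log(threshold) / math.log(miss_probability))
    # At least one comparison is always required.
    return max(n, 1)


if __name__ == "__main__":
    # Each comparison misses half the time; we tolerate a 1% residual risk.
    print(comparisons_needed(0.5, 0.01))
```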

Bruce B.
----- Original Message -----
From: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
To: esip-preserve at lists.esipfed.org
Sent: Wednesday, October 13, 2010 9:11:09 AM
Subject: Re: [Esip-preserve] Possible Workaround for data identity non-uniqueness?

On 10/13/10 08:56, Lynnes, Christopher S. (GSFC-6102) wrote:
> Is there perhaps a workaround where the reformatting agent simply
> asserts that they are equivalent?  That is, to add a metadata
> attribute that says, "this file is scientifically equivalent to this
> other file (e.g., identified by uuid)"?

Then we have to start tagging them with "Justification" and "Trust"
facts as well...
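
As a rough sketch of what such an assertion might carry, assuming a
simple in-memory record rather than any agreed metadata schema: the
class and every field name below are hypothetical, chosen only to show
the equivalence claim travelling together with its justification and
with the identity of the asserting agent (the hook a trust decision
would hang on).

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class EquivalenceAssertion:
    """A claim that one file is scientifically equivalent to another."""

    file_uuid: uuid.UUID      # the new (e.g. reformatted) file
    equivalent_to: uuid.UUID  # the file it is asserted to be equivalent to
    justification: str        # why the asserting agent believes this
    asserted_by: str          # who makes the claim; basis for trust


def assert_equivalent(source_uuid: uuid.UUID, justification: str,
                      asserted_by: str) -> EquivalenceAssertion:
    """Mint a UUID for the new file and record the equivalence claim."""
    return EquivalenceAssertion(
        file_uuid=uuid.uuid4(),
        equivalent_to=source_uuid,
        justification=justification,
        asserted_by=asserted_by,
    )


if __name__ == "__main__":
    original = uuid.uuid4()
    claim = assert_equivalent(original, "validated reformatting process",
                              "reformatting-agent")
    print(claim)
```

Whether a consumer accepts the claim is a separate decision; the
record only makes the claim and its stated basis explicit.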

I see (at least) two use cases we are concerned with for scientific
equivalence:

1. The reformatting case.  I have data from some authoritative source,
    and I want to do a transformation that maintains what we are
    calling the "scientific equivalence class" (SEC).

    As you propose, we could use the "authoritative source" UUID as a
    SEC identifier, and keep that with the transformed data.

    My justification could be that I validated my transformation
    process and assert that it does maintain that property.

2. The reproduction case. I have a granule and I want to repeat the
    processing in such a way that the resulting file is in the same SEC
    as the original.

    My justification could be that I have replicated the processing
    steps sufficiently to maintain that property.

    For example, consider "process on demand" where the original file
    was deleted, but the producer maintains sufficient provenance
    information to re-make a new file (with a distinct UUID) that
    should be in the same SEC.

    Or a web service transformation.  I can store a
    WCS/WFS/WMS/etc. REST URL with all the parameters used to produce a
    file.  If I call it with those parameters and you call it with
    identical parameters, we should get files in the same SEC.
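
For the web service case, one way to decide that two calls used
"identical parameters" is to reduce each request URL to a canonical
form and compare digests.  The sketch below is only an illustration:
the function name and the example URL are hypothetical, and it assumes
that parameter order and parameter-name case do not matter while
parameter values do.  A matching key shows that the requests were the
same, not that the service is deterministic or unchanged between
calls.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit


def request_key(url: str) -> str:
    """Return a digest of the request URL in a canonical form."""
    parts = urlsplit(url)
    # Lowercase parameter names and sort the pairs so that ordering
    # and name case do not affect the key.  Values are left untouched.
    params = sorted(
        (name.lower(), value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    )
    canonical = "{}://{}{}?{}".format(
        parts.scheme.lower(), parts.netloc.lower(), parts.path,
        urlencode(params),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    mine = "http://example.org/wcs?SERVICE=WCS&REQUEST=GetCoverage&BBOX=0,0,10,10"
    yours = "http://example.org/wcs?request=GetCoverage&bbox=0,0,10,10&service=WCS"
    print(request_key(mine) == request_key(yours))
```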

Curt
_______________________________________________
Esip-preserve mailing list
Esip-preserve at lists.esipfed.org
http://www.lists.esipfed.org/mailman/listinfo/esip-preserve