[Esip-preserve] [FOO] Scientific Equivalence

Curt Tilmes Curt.Tilmes at nasa.gov
Wed Oct 20 14:23:58 EDT 2010


Ok, now let's take a look into scientific equivalence.

Suppose we break down the components of provenance like this:

Granule g is the result of inputs x1, x2, ... where xi is 'some'
component of provenance.

What does that include?  Let's say everything that any of us could
think of: Clearly we start with the algorithm itself (and the specific
version thereof), and clearly the input files, but also include the
compiler I used to compile the algorithm, libraries, the machine I ran
it on, the person who ran it, what time I ran it, the phase of the
moon, etc.

Let's define scientific equivalence as the extent to which two
granules can affect the result of a scientific analysis.  If you have
some scientific analysis a that uses g1, and you perform the same
scientific analysis with granule g2 and get the same
results/conclusion, then granules g1 and g2 are scientifically
equivalent for analysis a.  If g1 and g2 are scientifically equivalent
for all scientific analyses then we can say they are scientifically
equivalent (SE).
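
As a rough illustration of that definition, here is a minimal Python
sketch.  Everything in it is an assumption made for the example: a
granule is stood in for by a list of numbers, and the two "analyses"
(rounded_mean and maximum) are made up.  It only shows that
equivalence is relative to the analysis you run.

```python
from typing import Callable, Iterable, Sequence

Granule = Sequence[float]                 # stand-in for real granule contents
Analysis = Callable[[Granule], object]    # any function from granule to result


def equivalent_for(analysis: Analysis, g1: Granule, g2: Granule) -> bool:
    """True if the analysis gives the same result with either granule."""
    return analysis(g1) == analysis(g2)


def equivalent_for_all(analyses: Iterable[Analysis],
                       g1: Granule, g2: Granule) -> bool:
    """Approximates full SE by checking a finite set of analyses."""
    return all(equivalent_for(a, g1, g2) for a in analyses)


# Hypothetical analyses, invented for this example.
def rounded_mean(g: Granule) -> float:
    return round(sum(g) / len(g), 1)


def maximum(g: Granule) -> float:
    return max(g)


g1 = [1.00, 2.00, 3.00]
g2 = [1.01, 2.00, 2.99]   # small numerical differences from g1

print(equivalent_for(rounded_mean, g1, g2))               # same rounded mean
print(equivalent_for(maximum, g1, g2))                    # maxima differ
print(equivalent_for_all([rounded_mean, maximum], g1, g2))
```

The two granules are equivalent for the rounded-mean analysis but not
for the maximum, so they are not SE in the "all analyses" sense.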

Now break down the set of provenance xi into two subsets, one subset
(y) that is likely (yeah, rigorous I know) to affect scientific
equivalence and one subset (z) that is unlikely to affect scientific
equivalence.  I call the former set (y) "essential" provenance. (Some
call this lineage or pedigree).  These are the things that contribute
materially to the scientific content of the resulting granule.

Now things like the raw L0 granule fed into the process are clearly
essential.  If you don't include that, you are very unlikely to make an
SE granule.  Other things like the name of the machine you ran the
process on are hopefully not essential.  Some things, like the version
of the HDF library, may or may not have an effect.  I'm going to hand
wave that away for now and just assume that you can perfectly sort
your provenance into those two categories.
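
A minimal Python sketch of that sorting step, assuming provenance is
recorded as a simple dictionary.  The key names and the sample values
for compiler, machine, operator and run time are hypothetical; the
choice of which keys count as essential is exactly the judgment call
being hand-waved here.

```python
from typing import Dict, Tuple

# Hypothetical choice of which provenance keys are "essential" (subset y).
ESSENTIAL_KEYS = {"algorithm", "calibration", "level0"}


def split_provenance(provenance: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split provenance into (essential y, non-essential z)."""
    essential = {k: v for k, v in provenance.items() if k in ESSENTIAL_KEYS}
    other = {k: v for k, v in provenance.items() if k not in ESSENTIAL_KEYS}
    return essential, other


# Sample record; compiler/machine/operator/run_time values are invented.
record = {
    "algorithm": "APP_L1B v1.0",
    "calibration": "FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4",
    "level0": "FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407",
    "compiler": "example-cc 1.2",
    "machine": "example-host",
    "operator": "alice",
    "run_time": "2010-10-20T14:23:58",
}

y, z = split_provenance(record)
print(sorted(y))   # keys treated as essential
print(sorted(z))   # keys treated as incidental
```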

Let's define a "Scientific Equivalence Identifier" (SEI).  (Or perhaps
a better term would be "Scientific Equivalence Indicator", since it
represents an indication that two granules are equivalent, not a proof
that they are.)

Back to FOO.

Take each granule of "FOOL0.001". I can't reproduce those granules.
Let's define the SEI for anything I can't reproduce as the MD5 digest
(hash) of the unique identifier of that granule.

If I reformat one of those granules in a manner that preserves the
scientific content (don't worry about how I do that, just assume I
can), then I can assert (yes, I agree prove/audit is different --
we'll deal with that later) that the new granule is SE to the
original, so it can copy the SEI from the original granule.  (This is
similar to Chris' "keep a reference to the original UUID", but I want
to make it look like the hashes I'm going to use later for higher
level products.)

For now, let's do the same with the "FOOCAL.001" (that's a little more
complicated since presumably they came from somewhere, so they have
their own provenance, but for now, let's just call that stuff "out of
scope" and hand wave it away).

For our simple example, let's say that the set of provenance
components of L1B that is likely to affect the SE property of the
resulting granule includes the following:
     Algorithm + Version
     Calibration file
     Level 0 data

I'm not saying these are the only things that could possibly affect
the SE property for anything, just that in this particular example, we
accept that those are the only things likely to.  That means that:

1) If you re-run the process and use components SE to those things, you
are likely to produce a granule that is SE to the original.
2) If you re-run the process and use components that are not SE to
those things, you are not likely to produce a granule that is SE to
the original.

Now let's define a canonical serialization of those items.  Let's just
list them textually, sorting asciibetically if we have more than one
thing in a category.
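
One way to sketch that serialization in Python, assuming one item per
line, a trailing newline after every line, the algorithm line first,
and categories in the order listed above.  The function name and
argument layout are made up for illustration.  Note that this sketch
sorts within each category as stated here, whereas the L3 list later
in this message appears in granule order, so no claim is made that it
reproduces that particular digest.

```python
from typing import List


def canonical_serialization(algorithm: str, *categories: List[str]) -> str:
    """Algorithm line first, then each category's items in ASCII order."""
    lines = [algorithm]
    for items in categories:
        # sorted() on str compares code points, which is ASCII order
        # for plain ASCII identifiers ("asciibetical").
        lines.extend(sorted(items))
    return "".join(line + "\n" for line in lines)


text = canonical_serialization(
    "APP_L1B v1.0",
    ["06f5554083cdfe71c087d1b5bd95cb33"],   # calibration category
    ["47ed3c20426ec661125e0c41caf143ea"],   # Level 0 category
)
print(text, end="")
```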

For the example, let's take the first granule in FOOL1B.002:
     FOOL1B.v2.01.d615b4f6-5e35-49f0-834a-ee199db7597c
That was produced by applying algorithm APP_L1B version 1.0 to
     FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407,
using calibration
     FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4

If I hash the unique identifiers for the L0 and CAL, I get these SEIs:
(for folks playing along at home, I'm actually appending newlines to
the identifiers -- doesn't matter if you do or don't, just that we all
use the same canonical serialization)

FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407
SEI: 47ed3c20426ec661125e0c41caf143ea

and

FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4
SEI: 06f5554083cdfe71c087d1b5bd95cb33
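
For anyone who wants to try this, a minimal Python sketch of the leaf
case, assuming ASCII encoding and a single appended newline as
described above.  The function name leaf_sei is made up, and the
sketch has not been checked against the digests printed in this
message; the exact bytes hashed are what determine whether they match.

```python
import hashlib


def leaf_sei(identifier: str) -> str:
    """MD5 hex digest of a granule's unique identifier plus a newline."""
    data = (identifier + "\n").encode("ascii")
    return hashlib.md5(data).hexdigest()


l0_sei = leaf_sei("FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407")
cal_sei = leaf_sei("FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4")
print(l0_sei)
print(cal_sei)
```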

Then, let's define a canonical serialization of the list of essential
provenance of FOOL1B.v2.01.d615b4f6-5e35-49f0-834a-ee199db7597c:
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
47ed3c20426ec661125e0c41caf143ea

Then we hash that list and produce an SEI for that granule
of 9571dd5de3ef0bba5e85be0b80f85c91

Any granule that includes those provenance items will have the same
canonical serialization of the list, and therefore the same resulting
hash and the same SEI.
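
A minimal Python sketch of that step, assuming the list is joined one
item per line with a trailing newline and hashed with MD5.  The
function name derived_sei is hypothetical, and the input SEIs are
taken in the order the caller supplies them.  Whether this yields the
exact digest quoted above depends on serialization details (such as
trailing newlines) that would need to be agreed on.

```python
import hashlib
from typing import List


def derived_sei(algorithm: str, input_seis: List[str]) -> str:
    """Hash the essential provenance list: algorithm line, then input SEIs."""
    text = "".join(line + "\n" for line in [algorithm, *input_seis])
    return hashlib.md5(text.encode("ascii")).hexdigest()


inputs = [
    "06f5554083cdfe71c087d1b5bd95cb33",   # SEI of the calibration file
    "47ed3c20426ec661125e0c41caf143ea",   # SEI of the Level 0 granule
]

first = derived_sei("APP_L1B v1.0", inputs)
again = derived_sei("APP_L1B v1.0", list(inputs))
other = derived_sei("APP_L1B v1.1", inputs)   # hypothetical newer version

print(first == again)   # same essential provenance, same SEI
print(first == other)   # different algorithm version, different SEI
```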


Let me again emphasize.  I'm not trying to prove scientific
equivalence -- you could look at the actual contents for that (or try
some of the fingerprinting techniques we've already discussed) or look
at independent validation, etc.  This is simply a shorthand
representation of the elements of provenance essential for basic
reproducibility.

Saying that two granules have the same SEI is asserting that to some
extent (possibly not perfect, or even correct) you've replicated the
circumstances (essential provenance) that led to their creation.


Back to Bob trying to reproduce Alice's research several messages ago.

You'll recall Alice was using granule
     FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768

which is no longer available in the archive
(having been replaced by corrected granule
FOOL3.v2.01.2a365058-fb52-4559-ab4b-085cb5ac0b73)

Let's calculate some "SEI"s:

FOOL0.001:

FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407
SEI: 47ed3c20426ec661125e0c41caf143ea

FOOL0.02.8a29012d-be8d-4af1-a158-783cfdcfc7fc
SEI: 644b6ad8e8cf663751e853ec8ad96c09

FOOL0.03.90463be2-a1a4-4d63-ad70-2f1f3c09798e
SEI: 635891855737afae680d07b027a3cb19

FOOL0.04.c121f001-6851-433f-9400-8e3acaa0229a
SEI: 1311c0bff3054252bb28f7c32327a877

FOOL0.05.2558aa9a-ce8c-47df-ac05-0fc982568462
SEI: 292634dea5bd916445175fa67dac9891

FOOL0.06.7223358c-f92d-42d0-bff4-1cd6140d4a89
SEI: 26c77385347ac6879a992862ffd310fb

FOOL0.07.e2e145cd-899e-483f-989a-fbf1017a3df8
SEI: 8c2960e1c526ffc606a605fe45c01cc8

FOOL0.08.d0e94ea8-08b3-4bdf-af3f-7af72f2f9221
SEI: 70c578e0ea392efb1d75353cb7ec6b82

FOOL0.09.ddc94379-0086-4817-9595-5fbe378c5a29
SEI: 89b2e866ebd3c2fe13d6a83b6bc2388b

FOOL0.10.0b337185-82af-4662-89b0-419bfd3e5db7 (corrupt L0 data)
SEI: 5f99075ac49960a8ed35a0345a2ee4a3

FOOL0.11.27d94c01-51ed-45c4-8c50-e19ca7f20882
SEI: 54dd6684e6e97fcc0971b8d7808b7516

FOOL0.12.0af9c3a5-6ee5-4435-8437-d96ebfa36625
SEI: 388288e589c190c1a65cbc23b2c5a7b3

FOOL0.13.76e9b680-2d06-4a55-ad2c-79b533ce86ca
SEI: 454b0db6bdc60e3cd5dbffd4f350617b

FOOL0.14.f6bf9378-4215-41b5-9d3e-96dc0c2e7eeb
SEI: 878aafea700de9f06a8031c4e7e2157c

FOOL0.10.3adf6dea-06af-478f-8216-2bbcdb0caad2 (new updated file)
SEI: 34b3c13c470ddf83c1fa00d57611740a

----------------------------------------------------------------------

Continuing to FOOL1B.002, canonicalizing the list
similar to the example above:

FOOL1B.v2.01.d615b4f6-5e35-49f0-834a-ee199db7597c
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
47ed3c20426ec661125e0c41caf143ea
SEI: 9571dd5de3ef0bba5e85be0b80f85c91

FOOL1B.v2.02.6c1a5a3b-55ab-4b53-8659-982d591cc744
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
644b6ad8e8cf663751e853ec8ad96c09
SEI: 78ef6c2dc982397b40f4cf2245f9c659

FOOL1B.v2.03.58575454-3a4e-46af-8cb5-27c6aa4321cc
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
635891855737afae680d07b027a3cb19
SEI: dcf2d2305aac7e059780032706d80718

FOOL1B.v2.04.09124f68-3f14-446b-a1aa-d7f57e7f1603
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
1311c0bff3054252bb28f7c32327a877
SEI: 2dc241b67be6ca6877ae107c0e133a22

FOOL1B.v2.05.6091c13e-5ea4-44c3-92cc-f20823249421
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
292634dea5bd916445175fa67dac9891
SEI: 0b3981a865fbfdd59c6cab2d4346c620

FOOL1B.v2.06.fcf1bb4c-51ea-464e-9935-b7653354ef73
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
26c77385347ac6879a992862ffd310fb
SEI: a3233fd6e39b03b6de225bc5f2c6b2e2

FOOL1B.v2.07.424aa11d-fdfb-4a63-9b1b-0a386d23b1fa
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
8c2960e1c526ffc606a605fe45c01cc8
SEI: 3e57734417a87a8b627fb76b422a6795

FOOL1B.v2.08.e3660e2e-4248-43ea-b91c-64799f3a1e74
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
70c578e0ea392efb1d75353cb7ec6b82
SEI: 550b94e28cd52b99ebb4e6266b405d7b

FOOL1B.v2.09.9ca12548-2a6b-47e4-9f7d-732013984ec1
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
89b2e866ebd3c2fe13d6a83b6bc2388b
SEI: 9369eee55d0cf604eea4cfbf59cfb545

FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e  (deleted)
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
5f99075ac49960a8ed35a0345a2ee4a3
SEI: 287af1b0f499184bd9500a55e32d49ef

FOOL1B.v2.11.6cfec73e-bf2e-42cb-a427-2d30694f43e8
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
54dd6684e6e97fcc0971b8d7808b7516
SEI: 7ea19d0a5b55af4fb2590e343f824f30

FOOL1B.v2.12.5cb6c2d9-386c-4d87-a103-56f0e459a26f
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
388288e589c190c1a65cbc23b2c5a7b3
SEI: fa01ee12f5ca87503d64a07fcc9ae491

----------------------------------------------------------------------

FOOLUT:

FOOLUT.1.5e3ef918-0216-4a58-9daf-5495dbf4a364
SEI: 175f8bf4340d730e22e955d03493b2de

----------------------------------------------------------------------

FOOL2.002:

FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
9571dd5de3ef0bba5e85be0b80f85c91
SEI: 809acb2b1ed53368a270b4d52c14dd76

FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
78ef6c2dc982397b40f4cf2245f9c659
SEI: 3461d265a7787cc8769df16b420f5f34

FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
dcf2d2305aac7e059780032706d80718
SEI: dd77c4fcfacb380c46ce02a02e59a09a

FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
2dc241b67be6ca6877ae107c0e133a22
SEI: 348997d808f129fed6b9903d02d25732

FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
0b3981a865fbfdd59c6cab2d4346c620
SEI: c8074e135dc5d31698e3b699e1e48461

FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
a3233fd6e39b03b6de225bc5f2c6b2e2
SEI: d475a1b902d589c59f077a818ac1f842

FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
3e57734417a87a8b627fb76b422a6795
SEI: 23cd1ea68d8f2b190834a7aaeac51139

FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
550b94e28cd52b99ebb4e6266b405d7b
SEI: 1ff32375c73b8cdbe3d024ed47ede196

FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
9369eee55d0cf604eea4cfbf59cfb545
SEI: 614507502fbc743aa0cc8533f98f0a62

FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (deleted)
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
287af1b0f499184bd9500a55e32d49ef
SEI: c6ff8dd6ffe4801594bda868fa943f56

FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
7ea19d0a5b55af4fb2590e343f824f30
SEI: 1442167e55c216ef83b637a272ed3480

FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
fa01ee12f5ca87503d64a07fcc9ae491
SEI: 5bb60a358a38723e26930984c5f2a8d7

----------------------------------------------------------------------

FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768 (deleted)
APP_L3 v1.0
809acb2b1ed53368a270b4d52c14dd76
3461d265a7787cc8769df16b420f5f34
dd77c4fcfacb380c46ce02a02e59a09a
348997d808f129fed6b9903d02d25732
c8074e135dc5d31698e3b699e1e48461
d475a1b902d589c59f077a818ac1f842
23cd1ea68d8f2b190834a7aaeac51139
1ff32375c73b8cdbe3d024ed47ede196
614507502fbc743aa0cc8533f98f0a62
c6ff8dd6ffe4801594bda868fa943f56
1442167e55c216ef83b637a272ed3480
5bb60a358a38723e26930984c5f2a8d7
SEI: 41351ab8a3482b33f5412b6cd2c66171

----------------------------------------------------------------------

Ok, back to Bob trying to reproduce Alice's work:

She used granule: FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768,
which we can see above has SEI: 41351ab8a3482b33f5412b6cd2c66171.

Bob needs FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e (deleted),
which is no longer available.  It has SEI: 287af1b0f499184bd9500a55e32d49ef.

Bob gets the L0 file and remakes this file:
FOOL1B.v2.10.c911b994-91fb-4d5c-b9e1-642c0a9c46a3

Note, it has a distinct UUID and has distinct provenance (it was made
by Bob at a different date/time, on a different machine), but the
essential provenance fields (which we've accepted to be this set) are
the same:
APP_L1B v1.0
06f5554083cdfe71c087d1b5bd95cb33
5f99075ac49960a8ed35a0345a2ee4a3

The hash of those, and therefore the SEI for Bob's new file, is the
same as the SEI of the original file:
SEI: 287af1b0f499184bd9500a55e32d49ef

He then takes his L1B file and runs APP_L2 on it, producing this:
FOOL2.v2.10.2c09ed89-57cf-40ed-910b-16c1aafcd947
with this essential provenance list:
APP_L2 v1.0
175f8bf4340d730e22e955d03493b2de
287af1b0f499184bd9500a55e32d49ef
and resulting SEI: c6ff8dd6ffe4801594bda868fa943f56

Then he can use that together with the other L2 files he can get from
the archive and re-make the L3:
FOOL3.v2.01.52562fbd-5969-4572-a757-47ff3f92dda4
APP_L3 v1.0
809acb2b1ed53368a270b4d52c14dd76
3461d265a7787cc8769df16b420f5f34
dd77c4fcfacb380c46ce02a02e59a09a
348997d808f129fed6b9903d02d25732
c8074e135dc5d31698e3b699e1e48461
d475a1b902d589c59f077a818ac1f842
23cd1ea68d8f2b190834a7aaeac51139
1ff32375c73b8cdbe3d024ed47ede196
614507502fbc743aa0cc8533f98f0a62
c6ff8dd6ffe4801594bda868fa943f56
1442167e55c216ef83b637a272ed3480
5bb60a358a38723e26930984c5f2a8d7
SEI: 41351ab8a3482b33f5412b6cd2c66171

Again, he calculates the same SEI, indicating that his file is SE to
the original, missing file.
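
The whole chain can be sketched in a few lines of Python.  This is an
illustration under assumptions, not a definitive implementation: the
Granule class and md5_hex helper are made up, leaves are hashed from
their identifier plus a newline, and derived granules hash the
algorithm line followed by their inputs' SEIs in the order given.  The
point is only that a granule's own identifier never enters its SEI.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("ascii")).hexdigest()


@dataclass
class Granule:
    identifier: str                    # unique per file (includes a UUID)
    algorithm: Optional[str] = None    # None marks an irreproducible leaf
    inputs: List["Granule"] = field(default_factory=list)

    def sei(self) -> str:
        if self.algorithm is None:
            # Leaf: hash the unique identifier itself.
            return md5_hex(self.identifier + "\n")
        # Derived: hash the algorithm line plus the inputs' SEIs.
        lines = [self.algorithm] + [g.sei() for g in self.inputs]
        return md5_hex("".join(line + "\n" for line in lines))


cal = Granule("FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4")
l0 = Granule("FOOL0.10.0b337185-82af-4662-89b0-419bfd3e5db7")
lut = Granule("FOOLUT.1.5e3ef918-0216-4a58-9daf-5495dbf4a364")

original_l1b = Granule("FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e",
                       "APP_L1B v1.0", [cal, l0])
bobs_l1b = Granule("FOOL1B.v2.10.c911b994-91fb-4d5c-b9e1-642c0a9c46a3",
                   "APP_L1B v1.0", [cal, l0])

original_l2 = Granule("FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310",
                      "APP_L2 v1.0", [lut, original_l1b])
bobs_l2 = Granule("FOOL2.v2.10.2c09ed89-57cf-40ed-910b-16c1aafcd947",
                  "APP_L2 v1.0", [lut, bobs_l1b])

print(original_l1b.sei() == bobs_l1b.sei())   # distinct UUIDs, same SEI
print(original_l2.sei() == bobs_l2.sei())     # carries through to L2
```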

Each of his newly re-created files has a unique UUID, and distinct
provenance, but their "essential" provenance (lineage, pedigree) can
be compared by just looking at their SEI, not as a proof of their
equivalence, but as an indicator of it.

If the archive maintains the SEI metadata field, even in the event
that they delete the data content of the granule, someone can attempt
to reproduce the granule in the same way and get a good check to see
if they did it right.
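
A minimal sketch of that check in Python, assuming the archive exposes
a lookup from granule identifier to recorded SEI.  The ARCHIVE_SEI
table and the function name are hypothetical; the two entries are the
deleted granules and SEIs from the example above.

```python
from typing import Dict, Optional

# SEI metadata a hypothetical archive keeps even after deleting the data.
ARCHIVE_SEI: Dict[str, str] = {
    "FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e":
        "287af1b0f499184bd9500a55e32d49ef",
    "FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768":
        "41351ab8a3482b33f5412b6cd2c66171",
}


def matches_archived_sei(original_id: str, candidate_sei: str) -> Optional[bool]:
    """Compare a re-created granule's SEI with the archived one.

    Returns None if the archive has no SEI on record for original_id.
    A match is an indication of equivalence, not a proof of it.
    """
    recorded = ARCHIVE_SEI.get(original_id)
    if recorded is None:
        return None
    return recorded == candidate_sei.strip().lower()


print(matches_archived_sei(
    "FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e",
    "287af1b0f499184bd9500a55e32d49ef"))
```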

Similarly, process on demand systems (or ephemeral web services) can
include the SEI with granules they give out, and when they give a
granule to the next person, that person can simply compare SEIs for an
indication that they are getting the granule they thought they were.

Curt

