[Esip-preserve] [FOO] Scientific Equivalence
Bruce Barkstrom
brbarkstrom at gmail.com
Wed Oct 20 15:36:58 EDT 2010
From my perspective, this is a pretty complex procedure and involves a
number of conjectures. It is much more direct to identify corresponding
data elements in each collection (noting that one data collection being
compared may come in a number of files, whereas the other collection
might be grouped in a single file), identify which elements constitute
the ones to use for scientific identity, and compare the values directly.
I'm writing down the details and will get them out in a format suitable
for publication in the next day or two. The comparison is
straightforward - and the mapping involves just an array of pairs of
indices that identify the correspondence between the elements. One hard
part is selecting which data elements to compare (just data; data plus
circumstances of measurement, like geolocation; or data plus
circumstances plus the conditional error distribution for finding the
true value given the measurement). We can call this the stringency of
the test, if you like. The second hard part is identifying the data
element correspondences. That requires some knowledge of both the format
and the semantic content of both data sets. For example, a conversion
from Celsius to Fahrenheit is a semantic transformation, but doesn't
really change the scientific equality of two temperature values.
Likewise, the conversion from Oct. 20, 2010 at 3 pm in the Eastern Time
Zone over to Astronomical Julian Date and Universal Time (UTC) doesn't
change the scientific equivalence of these two date-time values.
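A minimal sketch of the direct comparison described above, assuming each
collection has already been read into a flat list and the correspondence
is given as an array of index pairs. The function names, the choice of
kelvin as the common unit, and the tolerance are illustrative assumptions,
not part of the procedure being written up.

```python
import math


def celsius_to_kelvin(c: float) -> float:
    return c + 273.15


def fahrenheit_to_kelvin(f: float) -> float:
    return (f - 32.0) * 5.0 / 9.0 + 273.15


def scientifically_equal(a, b, pairs, normalize_a, normalize_b, abs_tol=1e-6):
    """Compare a[i] with b[j] for every (i, j) in pairs.

    Both values are first mapped to a common unit, so a purely semantic
    transformation (here a change of temperature scale) does not count
    as a difference. abs_tol is an assumed tolerance for round-off.
    """
    return all(
        math.isclose(normalize_a(a[i]), normalize_b(b[j]), abs_tol=abs_tol)
        for i, j in pairs
    )


# Collection A stores Celsius; collection B stores Fahrenheit in another order.
celsius = [0.0, 100.0, 37.0]
fahrenheit = [98.6, 32.0, 212.0]

# Array of index pairs: element i of A corresponds to element j of B.
pairs = [(0, 1), (1, 2), (2, 0)]

same = scientifically_equal(
    celsius, fahrenheit, pairs, celsius_to_kelvin, fahrenheit_to_kelvin
)
```

This covers only the lowest stringency (data values alone). Comparing
geolocation or error distributions as well would mean adding more element
pairs and more normalization functions.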
There may be a philosophical issue lurking here: do we require the same
provenance to get two measurements to have scientifically identical
values? I suspect not, but that's a question that requires a bit more
deliberate thinking. Clearly, human beings with exactly the same
provenance must be identical (indeed, that identifies a unique
individual). However, we can affirm that two measurements with very
different provenance chains are scientifically identical if we have a
trustworthy method of comparison. Perhaps we can call that measurement
validation.
Again, great to see the independent exploration of the issues involved.
Bruce B.
On Wed, Oct 20, 2010 at 2:23 PM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:
> Ok, now let's take a look into scientific equivalence.
>
> Suppose we break down the components of provenance like this:
>
> Granule g is the result of inputs x1, x2, ... where xi is 'some'
> component of provenance.
>
> What does that include? Let's say everything that any of us could
> think of: Clearly we start with the algorithm itself (and the specific
> version thereof), and clearly the input files, but also include the
> compiler I used to compile the algorithm, libraries, the machine I ran
> it on, the person who ran it, what time I ran it, the phase of the
> moon, etc.
>
> Let's define scientific equivalence as the extent to which two
> granules can affect the result of a scientific analysis. If you have
> some scientific analysis a that uses g1, and you perform the same
> scientific analysis with granule g2 and get the same
> results/conclusion, then granules g1 and g2 are scientifically
> equivalent for analysis a. If g1 and g2 are scientifically equivalent
> for all scientific analyses then we can say they are scientifically
> equivalent (SE).
>
> Now break down the set of provenance xi into two subsets, one subset
> (y) that is likely (yeah, rigorous I know) to affect scientific
> equivalence and one subset (z) that is unlikely to affect scientific
> equivalence. I call the former set (y) "essential" provenance. (Some
> call this lineage or pedigree). These are the things that contribute
> materially to the scientific content of the resulting granule.
>
> Now things like the raw L0 granule fed into the process are clearly
> essential. If you don't include that, you are very unlikely to make an
> SE granule. Other things like the name of the machine you ran the
> process on are hopefully not essential. Some things like the version
> of the HDF library may have an effect, or may not. I'm going to hand
> wave that away for now and just assume that you can perfectly sort
> your provenance into those two categories.
>
> Let's define a "Scientific Equivalence Identifier" (SEI). (Or perhaps
> a better term would be "Scientific Equivalence Indicator", since it
> represents an indication that two granules are equivalent, not a proof
> that they are.)
>
> Back to FOO.
>
> Take each granule of "FOOL0.001". I can't reproduce those granules.
> Let's define the SEI for anything I can't reproduce as the MD5 digital
> signature of the unique identifier of that granule.
>
> If I reformat one of those granules in a manner that preserves the
> scientific content (don't worry about how I do that, just assume I
> can), then I can assert (yes, I agree prove/audit is different --
> we'll deal with that later) that the new granule is SE to the
> original, so it can copy the SEI from the original granule. (similar
> to Chris' "keep a reference to the original UUID", but I want to make
> it look like the hashes I'm going to use later for higher level
> products.)
>
> For now, let's do the same with the "FOOCAL.001" (that's a little more
> complicated since presumably they came from somewhere, so they have
> their own provenance, but for now, let's just call that stuff "out of
> scope" and hand wave it away).
>
> For our simple example, let's say that the set of provenance
> components of L1B that is likely to affect the SE property of the
> resulting granule includes the following:
> Algorithm + Version
> Calibration file
> Level 0 data
>
> I'm not saying these are the only things that could possibly affect
> the SE property for anything, just that in this particular example, we
> accept that those are the only things likely to. That means that:
>
> 1) If you re-run the process and use components SE to those things, you
> are likely to produce a granule that is SE to the original.
> 2) If you re-run the process and use components that are not SE to
> those things, you are not likely to produce a granule that is SE to
> the original.
>
> Now let's define a canonical serialization of those items. Let's just
> list them textually, sorting asciibetically if we have more than one
> thing in a category.
>
> For the example, let's take the first granule in FOOL1B.002:
> FOOL1B.v2.01.d615b4f6-5e35-49f0-834a-ee199db7597c
> That was produced by applying algorithm APP_L1B version 1.0 to
> FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407,
> using calibration
> FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4
>
> If I hash the unique identifiers for the L0 and CAL, I get these SEIs:
> (for folks playing along at home, I'm actually appending newlines to
> the identifiers -- doesn't matter if you do or don't, just that we all
> use the same canonical serialization)
>
> FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407
> SEI: 47ed3c20426ec661125e0c41caf143ea
>
> and
>
> FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4
> SEI: 06f5554083cdfe71c087d1b5bd95cb33
>
> Then, let's define a canonical serialization of the list of essential
> provenance of FOOL1B.v2.01.d615b4f6-5e35-49f0-834a-ee199db7597c
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 47ed3c20426ec661125e0c41caf143ea
>
> Then we hash that list and produce an SEI for that granule
> of 9571dd5de3ef0bba5e85be0b80f85c91
>
> Any granule that includes those provenance items will have the same
> canonical serialization of the list, and therefore the same resulting
> hash and the same SEI.
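
A minimal sketch of the SEI construction described in the quoted message,
using Python's hashlib. The function names (leaf_sei, derived_sei) are
hypothetical. The byte-level serialization (UTF-8, one item per line, each
line newline-terminated, categories kept in a fixed order and sorted
within a category) is an assumption, so the digests it produces are not
guaranteed to match the ones quoted.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def leaf_sei(identifier: str) -> str:
    """SEI of a granule that cannot be reproduced: MD5 of its identifier.

    A trailing newline is appended, following the convention mentioned
    in the message.
    """
    return md5_hex(identifier + "\n")


def derived_sei(algorithm: str, categories) -> str:
    """SEI of a derived granule.

    algorithm  -- e.g. "APP_L1B v1.0"
    categories -- list of lists of input SEIs, one inner list per
                  category (calibration, L0, ...), in a fixed order.
    Items inside a category are sorted so the result does not depend
    on the order in which inputs happen to be listed.
    """
    lines = [algorithm]
    for seis in categories:
        lines.extend(sorted(seis))
    return md5_hex("".join(line + "\n" for line in lines))


cal = leaf_sei("FOOCAL.2.d2a14052-f426-4d2e-a506-ec052fdb69d4")
l0 = leaf_sei("FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407")
l1b = derived_sei("APP_L1B v1.0", [[cal], [l0]])
```

Because a derived SEI is built from the SEIs of its inputs, a change
anywhere in the essential provenance propagates to every downstream
granule.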
>
>
> Let me again emphasize. I'm not trying to prove scientific
> equivalence -- you could look at the actual contents for that (or try
> some of the fingerprinting techniques we've already discussed) or look
> at independent validation, etc. This is simply a shorthand
> representation of the elements of provenance essential for basic
> reproducibility.
>
> Saying that two granules have the same SEI is asserting that to some
> extent (possibly not perfect, or even correct) you've replicated the
> circumstances (essential provenance) that led to their creation.
>
>
> Back to Bob trying to reproduce Alice's research several messages ago.
>
> You'll recall Alice was using granule
> FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768
>
> which is no longer available in the archive
> (having been replaced by corrected granule
> FOOL3.v2.01.2a365058-fb52-4559-ab4b-085cb5ac0b73)
>
> Let's calculate some "SEI"s:
>
> FOOL0.001:
>
> FOOL0.01.fa9cf2e0-b60b-43b7-baee-d18cc185b407
> SEI: 47ed3c20426ec661125e0c41caf143ea
>
> FOOL0.02.8a29012d-be8d-4af1-a158-783cfdcfc7fc
> SEI: 644b6ad8e8cf663751e853ec8ad96c09
>
> FOOL0.03.90463be2-a1a4-4d63-ad70-2f1f3c09798e
> SEI: 635891855737afae680d07b027a3cb19
>
> FOOL0.04.c121f001-6851-433f-9400-8e3acaa0229a
> SEI: 1311c0bff3054252bb28f7c32327a877
>
> FOOL0.05.2558aa9a-ce8c-47df-ac05-0fc982568462
> SEI: 292634dea5bd916445175fa67dac9891
>
> FOOL0.06.7223358c-f92d-42d0-bff4-1cd6140d4a89
> SEI: 26c77385347ac6879a992862ffd310fb
>
> FOOL0.07.e2e145cd-899e-483f-989a-fbf1017a3df8
> SEI: 8c2960e1c526ffc606a605fe45c01cc8
>
> FOOL0.08.d0e94ea8-08b3-4bdf-af3f-7af72f2f9221
> SEI: 70c578e0ea392efb1d75353cb7ec6b82
>
> FOOL0.09.ddc94379-0086-4817-9595-5fbe378c5a29
> SEI: 89b2e866ebd3c2fe13d6a83b6bc2388b
>
> FOOL0.10.0b337185-82af-4662-89b0-419bfd3e5db7 (corrupt L0 data)
> SEI: 5f99075ac49960a8ed35a0345a2ee4a3
>
> FOOL0.11.27d94c01-51ed-45c4-8c50-e19ca7f20882
> SEI: 54dd6684e6e97fcc0971b8d7808b7516
>
> FOOL0.12.0af9c3a5-6ee5-4435-8437-d96ebfa36625
> SEI: 388288e589c190c1a65cbc23b2c5a7b3
>
> FOOL0.13.76e9b680-2d06-4a55-ad2c-79b533ce86ca
> SEI: 454b0db6bdc60e3cd5dbffd4f350617b
>
> FOOL0.14.f6bf9378-4215-41b5-9d3e-96dc0c2e7eeb
> SEI: 878aafea700de9f06a8031c4e7e2157c
>
> FOOL0.10.3adf6dea-06af-478f-8216-2bbcdb0caad2 (new updated file)
> SEI: 34b3c13c470ddf83c1fa00d57611740a
>
> ----------------------------------------------------------------------
>
> Continuing to FOOL1B.002, canonicalizing the list
> similar to the example above:
>
> FOOL1B.v2.01.d615b4f6-5e35-49f0-834a-ee199db7597c
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 47ed3c20426ec661125e0c41caf143ea
> SEI: 9571dd5de3ef0bba5e85be0b80f85c91
>
> FOOL1B.v2.02.6c1a5a3b-55ab-4b53-8659-982d591cc744
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 644b6ad8e8cf663751e853ec8ad96c09
> SEI: 78ef6c2dc982397b40f4cf2245f9c659
>
> FOOL1B.v2.03.58575454-3a4e-46af-8cb5-27c6aa4321cc
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 635891855737afae680d07b027a3cb19
> SEI: dcf2d2305aac7e059780032706d80718
>
> FOOL1B.v2.04.09124f68-3f14-446b-a1aa-d7f57e7f1603
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 1311c0bff3054252bb28f7c32327a877
> SEI: 2dc241b67be6ca6877ae107c0e133a22
>
> FOOL1B.v2.05.6091c13e-5ea4-44c3-92cc-f20823249421
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 292634dea5bd916445175fa67dac9891
> SEI: 0b3981a865fbfdd59c6cab2d4346c620
>
> FOOL1B.v2.06.fcf1bb4c-51ea-464e-9935-b7653354ef73
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 26c77385347ac6879a992862ffd310fb
> SEI: a3233fd6e39b03b6de225bc5f2c6b2e2
>
> FOOL1B.v2.07.424aa11d-fdfb-4a63-9b1b-0a386d23b1fa
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 8c2960e1c526ffc606a605fe45c01cc8
> SEI: 3e57734417a87a8b627fb76b422a6795
>
> FOOL1B.v2.08.e3660e2e-4248-43ea-b91c-64799f3a1e74
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 70c578e0ea392efb1d75353cb7ec6b82
> SEI: 550b94e28cd52b99ebb4e6266b405d7b
>
> FOOL1B.v2.09.9ca12548-2a6b-47e4-9f7d-732013984ec1
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 89b2e866ebd3c2fe13d6a83b6bc2388b
> SEI: 9369eee55d0cf604eea4cfbf59cfb545
>
> FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e (deleted)
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 5f99075ac49960a8ed35a0345a2ee4a3
> SEI: 287af1b0f499184bd9500a55e32d49ef
>
> FOOL1B.v2.11.6cfec73e-bf2e-42cb-a427-2d30694f43e8
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 54dd6684e6e97fcc0971b8d7808b7516
> SEI: 7ea19d0a5b55af4fb2590e343f824f30
>
> FOOL1B.v2.12.5cb6c2d9-386c-4d87-a103-56f0e459a26f
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 388288e589c190c1a65cbc23b2c5a7b3
> SEI: fa01ee12f5ca87503d64a07fcc9ae491
>
> ----------------------------------------------------------------------
>
> FOOLUT:
>
> FOOLUT.1.5e3ef918-0216-4a58-9daf-5495dbf4a364
> SEI: 175f8bf4340d730e22e955d03493b2de
>
> ----------------------------------------------------------------------
>
> FOOL2.002:
>
> FOOL2.v2.01.bba34792-f256-4c54-81dd-9977e432c204
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 9571dd5de3ef0bba5e85be0b80f85c91
> SEI: 809acb2b1ed53368a270b4d52c14dd76
>
> FOOL2.v2.02.2fd12da6-a3e2-4e50-8140-3ac645882419
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 78ef6c2dc982397b40f4cf2245f9c659
> SEI: 3461d265a7787cc8769df16b420f5f34
>
> FOOL2.v2.03.29bda893-765d-476d-851b-8b9acd7f140e
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> dcf2d2305aac7e059780032706d80718
> SEI: dd77c4fcfacb380c46ce02a02e59a09a
>
> FOOL2.v2.04.57509ddb-3d40-4d60-8204-da4b99867fc7
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 2dc241b67be6ca6877ae107c0e133a22
> SEI: 348997d808f129fed6b9903d02d25732
>
> FOOL2.v2.05.0e8604fa-fb4e-4cfb-b412-5364ca12cf14
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 0b3981a865fbfdd59c6cab2d4346c620
> SEI: c8074e135dc5d31698e3b699e1e48461
>
> FOOL2.v2.06.0eb26b4e-b718-41c5-bbf8-c83d3d79c233
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> a3233fd6e39b03b6de225bc5f2c6b2e2
> SEI: d475a1b902d589c59f077a818ac1f842
>
> FOOL2.v2.07.43079ea6-43b5-4622-b492-bcdb824a818e
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 3e57734417a87a8b627fb76b422a6795
> SEI: 23cd1ea68d8f2b190834a7aaeac51139
>
> FOOL2.v2.08.590fd64c-ec12-44a5-9b14-0042d19ed3dc
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 550b94e28cd52b99ebb4e6266b405d7b
> SEI: 1ff32375c73b8cdbe3d024ed47ede196
>
> FOOL2.v2.09.226173b9-4ef7-49e8-8b9e-701b892a8f57
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 9369eee55d0cf604eea4cfbf59cfb545
> SEI: 614507502fbc743aa0cc8533f98f0a62
>
> FOOL2.v2.10.533b2a95-d57f-4f75-9b7d-914d3d220310 (deleted)
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 287af1b0f499184bd9500a55e32d49ef
> SEI: c6ff8dd6ffe4801594bda868fa943f56
>
> FOOL2.v2.11.af235d11-777c-4bf1-a5e6-15273a5e5d80
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 7ea19d0a5b55af4fb2590e343f824f30
> SEI: 1442167e55c216ef83b637a272ed3480
>
> FOOL2.v2.12.bdc9dc33-38bd-403c-991e-48dcd4762ca7
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> fa01ee12f5ca87503d64a07fcc9ae491
> SEI: 5bb60a358a38723e26930984c5f2a8d7
>
> ----------------------------------------------------------------------
>
> FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768 (deleted)
> APP_L3 v1.0
> 809acb2b1ed53368a270b4d52c14dd76
> 3461d265a7787cc8769df16b420f5f34
> dd77c4fcfacb380c46ce02a02e59a09a
> 348997d808f129fed6b9903d02d25732
> c8074e135dc5d31698e3b699e1e48461
> d475a1b902d589c59f077a818ac1f842
> 23cd1ea68d8f2b190834a7aaeac51139
> 1ff32375c73b8cdbe3d024ed47ede196
> 614507502fbc743aa0cc8533f98f0a62
> c6ff8dd6ffe4801594bda868fa943f56
> 1442167e55c216ef83b637a272ed3480
> 5bb60a358a38723e26930984c5f2a8d7
> SEI: 41351ab8a3482b33f5412b6cd2c66171
>
> ----------------------------------------------------------------------
>
> Ok, back to Bob trying to reproduce Alice's work:
>
> She used granule: FOOL3.v2.01.07aa9ae3-9c3e-4508-b027-890dae11b768,
> which we can see above has SEI: 41351ab8a3482b33f5412b6cd2c66171.
>
> Bob needs FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e (deleted)
> which is no longer available. It has SEI: 287af1b0f499184bd9500a55e32d49ef.
>
> Bob gets the L0 file and remakes this file:
> FOOL1B.v2.10.c911b994-91fb-4d5c-b9e1-642c0a9c46a3
>
> Note, it has a distinct UUID and has distinct provenance (it was made
> by Bob at a different date/time, on a different machine), but the
> essential provenance fields (which we've accepted to be this set) are
> the same:
> APP_L1B v1.0
> 06f5554083cdfe71c087d1b5bd95cb33
> 5f99075ac49960a8ed35a0345a2ee4a3
>
> The hash of those, and therefore the SEI for Bob's new file, is the
> same as the SEI of the original file:
> SEI: 287af1b0f499184bd9500a55e32d49ef
>
> He then takes his L1B file and runs APP_L2 on it, producing this:
> FOOL2.v2.10.2c09ed89-57cf-40ed-910b-16c1aafcd947
> with this essential provenance list:
> APP_L2 v1.0
> 175f8bf4340d730e22e955d03493b2de
> 287af1b0f499184bd9500a55e32d49ef
> and resulting SEI: c6ff8dd6ffe4801594bda868fa943f56
>
> Then he can use that together with the other L2 files he can get from
> the archive and re-make the L3:
> FOOL3.v2.01.52562fbd-5969-4572-a757-47ff3f92dda4
> APP_L3 v1.0
> 809acb2b1ed53368a270b4d52c14dd76
> 3461d265a7787cc8769df16b420f5f34
> dd77c4fcfacb380c46ce02a02e59a09a
> 348997d808f129fed6b9903d02d25732
> c8074e135dc5d31698e3b699e1e48461
> d475a1b902d589c59f077a818ac1f842
> 23cd1ea68d8f2b190834a7aaeac51139
> 1ff32375c73b8cdbe3d024ed47ede196
> 614507502fbc743aa0cc8533f98f0a62
> c6ff8dd6ffe4801594bda868fa943f56
> 1442167e55c216ef83b637a272ed3480
> 5bb60a358a38723e26930984c5f2a8d7
> SEI: 41351ab8a3482b33f5412b6cd2c66171
>
> Again, he calculates the same SEI, indicating that his file is SE to
> the original, missing file.
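
The reproduction step can be sketched as follows, assuming the SEI is
computed only from the essential provenance while the UUID and other
details (who ran the process, on which machine) stay outside the hash. The
Granule class and its operator and host fields are hypothetical; the input
SEI strings are the ones listed in the quoted message, and the
newline-terminated serialization is an assumption.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Granule:
    algorithm: str    # essential provenance
    input_seis: tuple  # essential provenance, already in canonical order
    operator: str     # non-essential provenance
    host: str         # non-essential provenance
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def sei(self) -> str:
        # Only the essential provenance is serialized and hashed.
        lines = (self.algorithm, *self.input_seis)
        return md5_hex("".join(line + "\n" for line in lines))


CAL_SEI = "06f5554083cdfe71c087d1b5bd95cb33"
CORRUPT_L0_SEI = "5f99075ac49960a8ed35a0345a2ee4a3"
UPDATED_L0_SEI = "34b3c13c470ddf83c1fa00d57611740a"

# The deleted archive granule and Bob's re-creation of it.
original = Granule("APP_L1B v1.0", (CAL_SEI, CORRUPT_L0_SEI), "archive", "host-a")
rebuilt = Granule("APP_L1B v1.0", (CAL_SEI, CORRUPT_L0_SEI), "bob", "host-b")

# The archive's corrected granule, made from the updated L0 file.
corrected = Granule("APP_L1B v1.0", (CAL_SEI, UPDATED_L0_SEI), "archive", "host-a")
```

Here original and rebuilt carry different UUIDs but the same SEI, while
corrected differs in SEI because one of its essential inputs changed.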
>
> Each of his newly re-created files has a unique UUID, and distinct
> provenance, but their "essential" provenance (lineage, pedigree) can
> be compared by just looking at their SEI, not as a proof of their
> equivalence, but as an indicator of it.
>
> If the archive maintains the SEI metadata field, even in the event
> that they delete the data content of the granule, someone can attempt
> to reproduce the granule in the same way and get a good check to see
> if they did it right.
>
> Similarly, process on demand systems (or ephemeral web services) can
> include the SEI with granules they give out, and when they give a
> granule to the next person, you can simply compare SEIs for an
> indication that you are getting the granule you thought you were.
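
A small sketch of the check an archive could support if it keeps the SEI
metadata after deleting a granule's content. The CatalogEntry structure
and the check_reproduction helper are hypothetical; the identifier and
SEI are the ones given for the deleted L1B granule in the quoted message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    sei: str
    content: Optional[bytes]  # None once the archive has deleted the data


catalog = {
    "FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e": CatalogEntry(
        sei="287af1b0f499184bd9500a55e32d49ef", content=None
    ),
}


def check_reproduction(catalog, original_id: str, reproduced_sei: str) -> bool:
    """Return True if a re-created granule has the SEI recorded for the original.

    A match is an indication that the essential provenance was
    replicated, not a proof that the contents are identical.
    """
    entry = catalog.get(original_id)
    return entry is not None and entry.sei == reproduced_sei


ok = check_reproduction(
    catalog,
    "FOOL1B.v2.10.2f269e5e-cce7-41e4-8a83-baad1e087c8e",
    "287af1b0f499184bd9500a55e32d49ef",
)
```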
>
> Curt