[Esip-preserve] ESIP Citation Guidelines
Mark A. Parsons
parsonsm at nsidc.org
Wed Oct 13 10:28:00 EDT 2010
I'm losing the thread on this, but I note all the discussion to date has been about files. What about database records and aggregations or collections?
Cheers,
-m.
On 13 Oct 2010, at 7:58 AM, Curt Tilmes wrote:
> On 10/11/10 19:50, alicebarkstrom at frontier.com wrote:
>> Do you think it is possible to adapt the UFN approach previously
>> mentioned to our earth science data? It addresses (some, but not
>> all of) the things you discuss here.
>>
>> <BRB>Absolutely NOT! The UFN approach starts by assuming that data
>> can be arranged in a "canonical" sequence of values and held to a
>> single specified precision and representation. There isn't anyone to
>> play "pope" to provide a "canon" of data formats. That means that
>> there isn't anyone who can identify the "canonical" representation
>> of a numeric data collection.
>
> Not for all data, or not for any subset of the data?
>
>> 1. The NOAA GHCN adjusted precipitation data separates the
>> geolocation data from the actual precip data - which is arranged in
>> single year arrays (Jan as first month, Dec as the last month) using
>> an ASCII encoding of five characters to represent an integer. I
>> don't think most of us would accept the notion that the data in
>> memory (or in a file) that converted the ASCII to, say, double
>> precision floats would suddenly render the data in memory
>> "inauthentic".
>
> So the canonicalization for that says "always compare it like this".
>
> Why is that impossible?
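
One way to picture the "always compare it like this" rule for the precipitation example is sketched below. It is only an illustration under stated assumptions: the record layout (twelve 5-character ASCII integers per year, with no station or year prefix) is a simplification and not the real GHCN file format, and the names parse_ascii_year and canonical_digest are hypothetical. The point is that the digest is taken over one agreed form of the values, so the ASCII text and an in-memory array of floats give the same result.

```python
import hashlib
import struct

FIELD_WIDTH = 5   # assumption: each monthly value is a 5-character ASCII integer
MONTHS = 12       # Jan first, Dec last, as described in the thread


def parse_ascii_year(line):
    """Split one fixed-width line into 12 monthly integer values."""
    fields = [line[i:i + FIELD_WIDTH]
              for i in range(0, MONTHS * FIELD_WIDTH, FIELD_WIDTH)]
    return [int(field) for field in fields]


def canonical_digest(values):
    """Hash the values in one agreed form: big-endian 32-bit signed integers."""
    ints = [int(round(v)) for v in values]
    payload = struct.pack(">%di" % len(ints), *ints)
    return hashlib.sha256(payload).hexdigest()


# Hypothetical sample values, written out the way the ASCII file would hold them.
monthly = [12, 34, 0, 105, 87, -1, 45, 60, 33, 20, 15, 9]
ascii_line = "".join("%5d" % v for v in monthly)

from_ascii = parse_ascii_year(ascii_line)
as_doubles = [float(v) for v in from_ascii]   # the "converted in memory" copy

# Both representations reduce to the same canonical bytes, so the digests match.
same = canonical_digest(from_ascii) == canonical_digest(as_doubles)
```

This only covers one small, well-defined layout; it says nothing about whether a comparable rule can be agreed for other products.
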
>
>
>> 2. The MODIS MOD02 data product contains the lat and long values
>> for each location in the 1 km data (if I remember what we had to
>> deal with on CERES). If someone takes the spectral channels for 1
>> km res data and extracts the lats and longs, and uses that for an
>> analysis, I don't think most of us would assume that they've created
>> "inauthentic" data. So - is the "authentic" data the spectral
>> radiances without the geolocation - or does the geolocation data
>> have to accompany the spectral radiances? If the answer is the
>> spectral radiances, does the identifier have to refer just to that
>> data? If the answer is both, what identifier should someone quote
>> who wants to use just the spectral data?
>
> Now you are getting into subsets. I'd prefer to simply postpone a
> subset discussion by simply citing the whole file, even if you use
> only part of it. (Think of citing a fact from a paper. I just cite
> the DOI of the paper as a whole).
>
> If you did have a valid canonicalization for the file as a whole, and
> an identifier scheme that can distinguish subsets of the file, then
> you could identify a valid canonicalization of the subset of the file.
>
> We could also play with the "SEC" identifier (UUID of the
> authoritative file) and tie that into identifiers for subsets.
>
> But, as I said, let's put off identifiers for subsets of files until
> we can at least get identifiers for the files themselves.
>
>> As additional indicators of the difficulty, you can take the
>> different formats available for images, with the format differences
>> persisting over multiple decades (bmp, jpg, tiff, ps, and eps).
>> Likewise, do you really expect NASA, NOAA, and DOD to agree on
>> exactly the format and representation they'll use in common? Or,
>> for that matter that NASA and ESA will agree on identical data
>> formats and sequential order in files of the "same" data?
>
> Like I said, I don't think we need to have a solution that big to have
> something useful.
>
> Right now, we have a universe of files and no way to assert SEC for
> any of them, even if they are scientifically equivalent.
>
> If we can come up with a scheme to identify scientific equivalence for
> some small corner of the universe, it could be useful.
>
>> Additionally, I think reproducibility through complete provenance
>> capture helps address this (though I acknowledge it doesn't solve
>> it).
>>
>> <brb>Don't agree at all!</brb>
>
> Need more information. What don't you agree with? That capturing
> provenance is useful at all? That provenance information doesn't help
> address reproducibility? Or that reproducibility is simply a lost
> cause and we shouldn't even aim for it?
>
>
>> We have typically relied on a trusted curator to manage and affirm
>> this. We can't prove it, (especially in the case of a malicious
>> curator), but we log cryptographic digests as we produce data, and
>> distribute them with the data files. We (EOSDIS) dictate formats for
>> standard data products for the "authoritative" version and that is
>> what gets archived and distributed. As you point out, this only
>> checks the physical bits.
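
A minimal sketch of the digest logging described above, assuming SHA-256 (the thread does not say which digest is used) and using hypothetical names (sha256_of_file, verify_file) and a made-up file name. It also shows the limitation noted in the same paragraph: a digest of the file confirms only the physical bits, so rewriting the same values in another representation fails the check.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, recorded_digest):
    """True if the file's current bytes match the digest logged at production."""
    return sha256_of_file(path) == recorded_digest


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "granule.dat")   # hypothetical file name

    # "Produce" the file and log its digest alongside it.
    with open(path, "wb") as handle:
        handle.write(b"12 34 0 105\n")
    recorded = sha256_of_file(path)
    unchanged_ok = verify_file(path, recorded)

    # Same values, different representation: the byte-level check now fails.
    with open(path, "wb") as handle:
        handle.write(b"12.0 34.0 0.0 105.0\n")
    reformatted_ok = verify_file(path, recorded)
```

Here unchanged_ok comes out true and reformatted_ok false, even though a reader would regard the two versions as holding the same numbers.
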
>>
>> <brb>Why should we rely on unverifiable curation? What techniques
>> do we have to audit the claim of identity? I do think auditing is
>> possible and can be highly reliable. However, it means that the
>> curating authority has to present a chain of evidence that will
>> somehow allow independent verification of the claim of unaltered
>> scientific identity. I think we have a responsibility to avoid
>> claims that cannot be substantiated - and I don't think the current
>> state of our claims of "unique identifiers" can be verified
>> independently. If we go back to "Applied Cryptography", we don't
>> have a verifiable trust model. Anyone in the security business can
>> (and should) shoot at what we're claiming. I do not believe our
>> current position fits with long-term preservation.</brb>
>
> As opposed to what we do now? We constantly make unsubstantiated
> claims. Audits are rare in this world.
>
> I think you and I agree substantiation and audits would be good and
> useful.
>
> Today we have no good identifiers and no substantiation, and our
> goal is to have both good identifiers and substantiation. Are you
> objecting to proposing some identifiers prior to setting up a system
> for substantiation? We can certainly head there next, but I'd be
> happy for self-certifying compliance with a standard as a first step.
> Then we can move to independent audits, formal certification, etc.
>
> I don't see failure to jump to a perfect end point as an argument
> against taking the first baby step.
>
>> In some cases, like the MODIS "process on demand" L1B, we can't do
>> that. We assert that we have the ability to reproduce an equivalent
>> file (although with the current implementation, it actually performs
>> what I call "reprocessing" rather than "reproducing" -- The
>> difference being that reprocessing can use better versions of
>> ancillary data files, or later versions of the algorithms rather
>> than trying to apply a faithful attempt to make the same file.)
>>
>> <brb>In short - it's not reproducible at all! Bad guarantee of
>> fixity! Indeed, it sounds like a case of near-fraud,
>> misrepresenting a reproduction as a (possibly bad) replica!</brb>
>
> I misspoke -- I shouldn't have used the word equivalent there -- they
> aren't really claiming to be reproducing the original files, no fixity
> here. They simply make the file when you ask for it. The processing
> step will make the best file they know how to make (not necessarily an
> equivalent file to what they made last time.)
>
> Curt
>