[Esip-preserve] ESIP Citation Guidelines

Wed Oct 13 10:28:00 EDT 2010

I'm losing the thread on this, but I note all the discussion to date has been about files. What about database records and aggregations or collections?

Cheers,

-m. 
On 13 Oct 2010, at 7:58 AM, Curt Tilmes wrote:

> On 10/11/10 19:50, alicebarkstrom at frontier.com wrote:
>> Do you think it is possible to adapt the UFN approach previously
>> mentioned to our earth science data?  It addresses (some, but not
>> all of) the things you discuss here.
>> 
>> <BRB>Absolutely NOT!  The UFN approach starts by assuming that data
>> can be arranged in a "canonical" sequence of values and held to a
>> single specified precision and representation.  There isn't anyone to
>> play "pope" to provide a "canon" of data formats.  That means that
>> there isn't anyone who can identify the "canonical" representation
>> of a numeric data collection.
> 
> Not for all data, or not for any subset of the data?
> 
>> 1.  The NOAA GHCN adjusted precipitation data separates the
>> geolocation data from the actual precip data - which is arranged in
>> single year arrays (Jan as first month, Dec as the last month) using
>> an ASCII encoding of five characters to represent an integer.  I
>> don't think most of us would accept the notion that the data in
>> memory (or in a file) that converted the ASCII to, say, double
>> precision floats would suddenly render the data in memory
>> "inauthentic".
> 
> So the canonicalization for that says "always compare it like this".
> 
> Why is that impossible?
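
[A minimal sketch of what "always compare it like this" could mean for the precipitation example above. It assumes one year of data is twelve five-character ASCII integer fields, Jan first and Dec last, which simplifies the real GHCN layout. The names parse_year_record and canonical_digest, the sample values, and the choice of big-endian 32-bit integers as the agreed form are all hypothetical, not something proposed in this thread.]

```python
import hashlib
import struct

FIELD_WIDTH = 5   # five ASCII characters per value (assumed layout)
MONTHS = 12       # Jan first, Dec last

def parse_year_record(text):
    """Split one 60-character year record into 12 integers."""
    if len(text) != FIELD_WIDTH * MONTHS:
        raise ValueError("expected %d characters" % (FIELD_WIDTH * MONTHS))
    return [int(text[i:i + FIELD_WIDTH])
            for i in range(0, len(text), FIELD_WIDTH)]

def canonical_digest(values):
    """Hash the values in one agreed form: big-endian signed 32-bit ints.

    Accepts ints or integral floats, so an in-memory double-precision
    copy of the data compares equal to the ASCII original.
    """
    ints = []
    for v in values:
        if v != int(v):
            raise ValueError("non-integral value: %r" % (v,))
        ints.append(int(v))
    packed = struct.pack(">%di" % len(ints), *ints)
    return hashlib.sha256(packed).hexdigest()

# Made-up sample values, encoded the way the description above suggests.
sample = [12, 34, 0, 105, -1, 56, 78, 90, 11, 22, 33, 44]
ascii_record = "".join("%5d" % v for v in sample)

from_ascii = parse_year_record(ascii_record)
as_doubles = [float(v) for v in from_ascii]

digest_ascii = canonical_digest(from_ascii)
digest_doubles = canonical_digest(as_doubles)
```

[Under these assumptions the two digests match even though the bytes in memory differ, while changing any single monthly value changes the digest. Whether such a rule can be agreed on beyond one product is the open question here.]
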
> 
> 
>> 2.  The MODIS MOD02 data product contains the lat and long values
>> for each location in the 1 km data (if I remember what we had to
>> deal with on CERES).  If someone takes the spectral channels for 1
>> km res data and extracts the lats and longs, and uses that for an
>> analysis, I don't think most of us would assume that they've created
>> "inauthentic" data.  So - is the "authentic" data the spectral
>> radiances without the geolocation - or does the geolocation data
>> have to accompany the spectral radiances?  If the answer is the
>> spectral radiances, does the identifier have to refer just to that
>> data?  If the answer is both, what identifier should someone quote
>> who wants to use just the spectral data?
> 
> Now you are getting into subsets.  I'd prefer to postpone a
> subset discussion by simply citing the whole file, even if you use
> only part of it.  (Think of citing a fact from a paper.  I just cite
> the DOI of the paper as a whole).
> 
> If you did have a valid canonicalization for the file as a whole, and
> an identifier scheme that can distinguish subsets of the file, then
> you could identify a valid canonicalization of the subset of the file.
> 
> We could also play with the "SEC" identifier (UUID of the
> authoritative file) and tie that into identifiers for subsets.
> 
> But, as I said, let's put off identifiers for subsets of files until
> we can at least get identifiers for the files themselves.
> 
>> As additional indicators of the difficulty, you can take the
>> different formats available for images, with the format differences
>> persisting over multiple decades (bmp, jpg, tiff, ps, and eps).
>> Likewise, do you really expect NASA, NOAA, and DOD to agree on
>> exactly the format and representation they'll use in common?  Or,
>> for that matter that NASA and ESA will agree on identical data
>> formats and sequential order in files of the "same" data?
> 
> Like I said, I don't think we need to have a solution that big to have
> something useful.
> 
> Right now, we have a universe of files and no way to assert SEC for
> any of them, even if they are scientifically equivalent.
> 
> If we can come up with a scheme to identify scientific equivalence for
> some small corner of the universe, it could be useful.
> 
>> Additionally, I think reproducibility through complete provenance
>> capture helps address this (though I acknowledge it doesn't solve
>> it).
>> 
>> <brb>Don't agree at all!</brb>
> 
> Need more information.  What don't you agree with?  That capturing
> provenance is useful at all?  That provenance information doesn't help
> address reproducibility?  Or that reproducibility is simply a lost
> cause and we shouldn't even aim for it?
> 
> 
>> We have typically relied on a trusted curator to manage and affirm
>> this.  We can't prove it, (especially in the case of a malicious
>> curator), but we log cryptographic digests as we produce data, and
>> distribute them with the data files.  We (EOSDIS) dictate formats for
>> standard data products for the "authoritative" version and that is
>> what gets archived and distributed.  As you point out, this only
>> checks the physical bits.
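
[A minimal sketch of the digest practice described above, assuming SHA-256 and a digest logged at production time. The thread does not say which algorithm EOSDIS uses; the file names and the helpers file_digest and verify are hypothetical.]

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, logged_digest):
    """True only if the file's bits match the digest logged earlier."""
    return file_digest(path) == logged_digest

with tempfile.TemporaryDirectory() as tmp:
    # "Produce" a file and log its digest, as the producer would.
    original = os.path.join(tmp, "granule.dat")
    with open(original, "wb") as f:
        f.write(b"   12   34")
    logged = file_digest(original)
    same_bits_ok = verify(original, logged)

    # Same two numbers in a different encoding: the bit-level check fails.
    reencoded = os.path.join(tmp, "granule_reencoded.dat")
    with open(reencoded, "wb") as f:
        f.write(b"12,34")
    reencoded_ok = verify(reencoded, logged)
```

[The second check fails although the values are unchanged, which is the sense in which a digest only checks the physical bits.]
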
>> 
>> <brb>Why should we rely on unverifiable curation?  What techniques
>> do we have to audit the claim of identity?  I do think auditing is
>> possible and can be highly reliable.  However, it means that the
>> curating authority has to present a chain of evidence that will
>> somehow allow independent verification of the claim of unaltered
>> scientific identity.  I think we have a responsibility to avoid
>> claims that cannot be substantiated - and I don't think the current
>> state of our claims of "unique identifiers" can be verified
>> independently.  If we go back to "Applied Cryptography", we don't
>> have a verifiable trust model.  Anyone in the security business can
>> (and should) shoot at what we're claiming.  I do not believe our
>> current position fits with long-term preservation.</brb>
> 
> As opposed to what we do now?  We constantly make unsubstantiated
> claims.  Audits are rare in this world.
> 
> I think you and I agree substantiation and audits would be good and
> useful.
> 
> Today we have no good identifiers and no substantiation, and our
> goal is to have both good identifiers and substantiation.  Are you
> objecting to proposing some identifiers prior to setting up a system
> for substantiation?  We can certainly head there next, but I'd be
> happy for self-certifying compliance with a standard as a first step.
> Then we can move to independent audits, formal certification, etc.
> 
> I don't see failure to jump to a perfect end point as an argument
> against taking the first baby step.
> 
>> In some cases, like the MODIS "process on demand" L1B, we can't do
>> that.  We assert that we have the ability to reproduce an equivalent
>> file (although with the current implementation, it actually performs
>> what I call "reprocessing" rather than "reproducing" -- The
>> difference being that reprocessing can use better versions of
>> ancillary data files, or later versions of the algorithms rather
>> than trying to apply a faithful attempt to make the same file.)
>> 
>> <brb>In short - it's not reproducible at all!  Bad guarantee of
>> fixity!  Indeed, it sounds like a case of near-fraud,
>> misrepresenting a reproduction as a (possibly bad) replica!</brb>
> 
> I misspoke -- I shouldn't have used the word equivalent there -- they
> aren't really claiming to be reproducing the original files, no fixity
> here.  They simply make the file when you ask for it.  The processing
> step will make the best file they know how to make (not necessarily an
> equivalent file to what they made last time.)
> 
> Curt
> 