[Esip-preserve] [FOO] Foo Project moves to Google Spreadsheets

Tue Oct 12 09:50:37 EDT 2010

Nice, Ken. Practical approaches to complex problems. We learn as we go and, hopefully, get better at addressing the tougher problems.

Cheers,

-m. 
On 12 Oct 2010, at 7:39 AM, Kenneth S. Casey wrote:

> Bruce,
> 
> On Oct 12, 2010, at 9:21 AM, alicebarkstrom at frontier.com wrote:
> 
>> The suggestion that Ken has of embedding UUIDs in the file at least
>> makes them unique and independent of cryptographic digests.  That means
>> as long as a copy of the original file is readable, it can be uniquely
>> identified.  What happens after the original file becomes unreadable
>> (physical deterioration, software obsolescence, hardware obsolescence
>> being prime suspects) is not so clear.
> 
> At the risk of sounding defensive, which I am not, GHRSST has always accounted for these other risks through:
> 
> 1. Checksumming - everything gets checksummed on every transfer... from Regional Data Assembly Center (RDAC) to Global Data Assembly Center (GDAC) to Long Term Stewardship and Reanalysis Facility (LTSRF, at NODC).
> 
> 2. Formal archiving - following OAIS principles, to the best of our ability, at the LTSRF.  Plus most or all RDACs keep local archives of their own products.
> 
> 3. Use of a community-accepted file format: netCDF.  As we saw at the ESIP summer meeting at UCSB, Unidata is working actively to ensure netCDF remains viable for the long term.  This does not mean forever, of course, but it does mean that the expected lifetime of the format is pretty long.  And conversion to the "next format" is extremely "do-able" since it is all standardized.
> 
> 4. Use of structured metadata - collection level (FGDC and GCMD DIF) and file level (CF), plus now ISO.
> 
> 5. Versioning - for now anyway, we keep all the versions of GHRSST data.  Someday we may retire older versions, but for now we don't need to... most providers are not providing multiple versions anyway.
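The "checksum on every transfer" safeguard in item 1 can be sketched as below. This is only an illustration, not GHRSST's actual procedure: the thread does not say which digest algorithm or tooling is used, so SHA-256 and the function names file_digest, transfer_with_check, and demo are assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)  # SHA-256 is an assumption, not from the thread
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def transfer_with_check(src, dst):
    """Copy src to dst, then verify the copy against the source digest."""
    expected = file_digest(src)
    shutil.copyfile(src, dst)
    actual = file_digest(dst)
    if actual != expected:
        raise IOError(f"checksum mismatch for {dst}: {actual} != {expected}")
    return expected


def demo():
    """Simulate one hop (for example RDAC to GDAC) with a throwaway file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "granule.nc"
        dst = Path(tmp) / "granule_copy.nc"
        src.write_bytes(b"placeholder bytes, not a real netCDF file")
        digest = transfer_with_check(src, dst)
        return digest, file_digest(dst)
```

In a real chain the digest computed at the source would travel with the file, so each receiving center can recompute and compare it without access to the original.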
> 
> I guess the point is that since there are a variety of risks, any program should know to implement a variety of safeguards... from checksums to multiple copies to standard formats to UUIDs to DOIs to....  As has been said, we can poke holes in any one of these approaches, but that should not stop us from acting; the strength comes from the multiple "layers" of safeguards and tracking mechanisms.
> 
> As for defining collections, that is probably not so easy to do in general but can certainly be done on a project basis... in GHRSST we know exactly what a collection is and what a granule is.  Between checksums and UUIDs and DOIs, plus the date someone makes a citation, I feel pretty good that we could track back unambiguously exactly which granules were used.  And if not, we make adjustments in the next version of GHRSST and move on.
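One way to picture the "DOI plus citation date" trace-back described above is the sketch below. It assumes the archive records, for each granule, its embedded UUID, its collection DOI, a checksum, and the date it was archived. The Granule class, the granules_cited function, and every value in CATALOG are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Granule:
    uuid: str       # identifier embedded in the file
    doi: str        # DOI of the collection the granule belongs to
    checksum: str   # digest recorded at ingest
    archived: date  # date the granule entered the archive


def granules_cited(catalog, doi, cited_on):
    """Granules in collection `doi` that already existed on the citation date."""
    return [g for g in catalog if g.doi == doi and g.archived <= cited_on]


# Hypothetical catalog; all identifiers and dates are made up for illustration.
CATALOG = [
    Granule("uuid-a", "10.9999/example-sst", "aa11", date(2010, 9, 1)),
    Granule("uuid-b", "10.9999/example-sst", "bb22", date(2010, 10, 5)),
    Granule("uuid-c", "10.9999/example-sst", "cc33", date(2010, 11, 20)),
    Granule("uuid-d", "10.9999/other", "dd44", date(2010, 9, 1)),
]

cited = granules_cited(CATALOG, "10.9999/example-sst", date(2010, 10, 12))
print([g.uuid for g in cited])  # ['uuid-a', 'uuid-b']
```

This only holds while every version is kept, as in item 5 above. Once older versions are retired, the catalog would also need a retirement date per granule for the same query to stay unambiguous.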
> 
> Ken
> 
>> 
>> I've no objection to trying DOIs - but what we mean by a collection
>> needs clarification.  I agree that Curt's simple example is very useful
>> and adds to our understanding of that issue.  The critical issue here
>> is the number of entities required for precise citation.  As a concrete
>> set of questions, we need to state whether a citation needs to be
>> sensitive to
>> 1.  The archive from which the data have been obtained (I'm curious
>>      as to whether Ken's example on this will have multiple locations
>>      where data will be stored.)
>> 2.  The Data Product or ESDT - meaning very generic collections
>> 3.  The Data Source - particularly if there are multiple sources,
>>      such as instruments; in the case of data collections that have
>>      many kinds of input data, is identifying each source critical?
>> 4.  The Version of the algorithms (as well as input parameters, such
>>      as calibration coefficients)
>> 5.  A carefully selected set of files that were used in such instances
>>      as validation by intercomparison with a field experiment
>> 6.  Data inside files - as might be the case where the use of the data
>>      involved a small geographic or temporal sample within a larger
>>      region
>> 7.  Specific identification of data inside a large number of files
>>      (this comes up in attempts to construct records of the "solar
>>       constant" from which the authors of the reconstruction have
>>       been attempting to determine whether or not there is a trend
>>       in that value)
>> I'll note the paper on scholarly citation of quantitative data
>> (Altman and King) appears to me to squash all of these levels of
>> precision together in an unacceptable way.
>> 
>> I'll have some additional material - particularly on Chris' comments,
>> which I think begin to help provide some rigor to the questions we're
>> going to have to ask ourselves.
>> 
>> Bruce B.
>> ----- Original Message -----
>> From: "Kenneth S. Casey" <Kenneth.Casey at noaa.gov>
>> To: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
>> Cc: "ESIP Preservation cluster" <esip-preserve at rtpnet.org>
>> Sent: Tuesday, October 12, 2010 6:54:37 AM
>> Subject: Re: [Esip-preserve] [FOO] Foo Project moves to Google Spreadsheets
>> 
>> 
>> Curt - I've been quiet on this list owing to lack of time to respond, but I did want to say that I think your FOO analysis is excellent and I've been following it closely!  Thanks so much. I think this approach is (1) extremely informative and (2) very comforting... comforting because the Group for High Resolution SST ( http://ghrsst.org ), a big international effort I participate in that produces standardized SST data from multiple satellites around the world, just decided to use UUIDs embedded in a netCDF attribute for every single granule. It also plans to use DOIs for the collections being generated by its worldwide network of data providers.  The UUID usage is part of the new "GHRSST Data Specification Version 2 (GDS2)", which was just published on October 1st.  GDS version 1 produced about 30 collections and 1.5 million netCDF files, and there will be even more GDS2 data in just the next few years... so, we'll see how well this approach works...
>> 
>> 
>> Ken 
>> 
>> On Oct 11, 2010, at 2:52 PM, Curt Tilmes wrote: 
>> 
>> This is a read-only link: 
>> 
>> https://docs.google.com/leaf?id=0BztPCL0EZx_3NWI4OTQwN2ItMjU3OC00ZGIwLWFlNjUtNWY1OTE3MGJjNDUw&hl=en 
>> 
>> Mostly cut/pasted from my earlier emails. 
>> 
>> I did add one additional file, a "FOOLUT" lookup table that is an 
>> input to APP_L2.  We can change the version of the APP independently 
>> from the version of the LUT and explore the various provenance graphs. 
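The idea of varying the APP version and the LUT version independently can be sketched as a small provenance table. Only APP_L2 and FOOLUT come from the scenario; the granule names, version numbers, and the lineage and differences helpers are hypothetical.

```python
# Each output records the application run that made it and the inputs it read.
# Names other than APP_L2 and FOOLUT are hypothetical.
PROVENANCE = {
    "L2_granule_v1": {"app": ("APP_L2", "1.0"), "inputs": ["L1_granule", "FOOLUT_v1"]},
    "L2_granule_v2": {"app": ("APP_L2", "1.0"), "inputs": ["L1_granule", "FOOLUT_v2"]},
    "L2_granule_v3": {"app": ("APP_L2", "1.1"), "inputs": ["L1_granule", "FOOLUT_v2"]},
}


def lineage(item, provenance):
    """Return every ancestor of `item` (inputs, inputs of inputs, and so on)."""
    seen = []
    stack = list(provenance.get(item, {}).get("inputs", []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(provenance.get(node, {}).get("inputs", []))
    return seen


def differences(a, b, provenance):
    """Describe how the provenance of two outputs differs."""
    pa, pb = provenance[a], provenance[b]
    return {
        "app_changed": pa["app"] != pb["app"],
        "inputs_changed": sorted(set(pa["inputs"]) ^ set(pb["inputs"])),
    }
```

Comparing v1 with v2 reports only a LUT change, and v2 with v3 only an APP change, which is the kind of distinction a provenance graph has to preserve.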
>> 
>> I'm attaching a basic data flow diagram.  (I also pasted an SVG 
>> version of this diagram into the spreadsheet, but it doesn't come 
>> through every browser.) 
>> 
>> I'm also trying to always include the "[FOO]" tag so you can filter 
>> your ESIP-Preserve list if you aren't following this scenario and 
>> want to trim down the clutter. 
>> 
>> So far, this simple scenario has: 
>> 
>> 1. Shown how DOI works well to identify and locate "ESDT+Collection". 
>> 
>> 2. Shown how DOI doesn't precisely identify sets of granules. 
>> 
>> 3. Shown how UUID can be used to unambiguously refer to individual granules. 
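The three findings above can be illustrated with a toy archive. This is a sketch under stated assumptions: a granule is modeled as a plain dict of global attributes rather than a real netCDF file, the attribute name "uuid" and the DOI value are made up, and new_granule, by_doi, and by_uuids are hypothetical helpers.

```python
import uuid

COLLECTION_DOI = "10.9999/example-collection"  # made-up DOI for illustration


def new_granule(name):
    """Model a granule as a dict of global attributes with an embedded UUID."""
    return {"name": name, "doi": COLLECTION_DOI, "uuid": str(uuid.uuid4())}


def by_doi(archive, doi):
    """Everything currently in the collection the DOI points to."""
    return [g for g in archive if g["doi"] == doi]


def by_uuids(archive, uuids):
    """Exactly the granules whose embedded UUIDs were recorded."""
    wanted = set(uuids)
    return [g for g in archive if g["uuid"] in wanted]


archive = [new_granule("day1"), new_granule("day2")]
used_in_study = [g["uuid"] for g in archive]  # recorded at time of use

archive.append(new_granule("day3"))  # the collection keeps growing

whole_collection = by_doi(archive, COLLECTION_DOI)  # now three granules
exact_set = by_uuids(archive, used_in_study)        # still the original two
```

The DOI lookup returns whatever the collection holds today, while the UUID list pins the set that was actually used.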
>> 
>> Curt 
>> <flow.png>
>> _______________________________________________ 
>> Esip-preserve mailing list 
>> Esip-preserve at lists.esipfed.org 
>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve 
>> 
>> [NOTE: The opinions expressed in this email are those of the author alone and do not necessarily reflect official NOAA, Department of Commerce, or US government policy.] 
>> 
>> 
>> Kenneth S. Casey, Ph.D. 
>> Technical Director 
>> NOAA National Oceanographic Data Center 
>> 1315 East-West Highway 
>> Silver Spring MD 20910 
>> 301-713-3272 ext 133 
>> http://www.nodc.noaa.gov/ 