[Esip-preserve] [FOO] Foo Project moves to Google Spreadsheets

Tue Oct 12 12:35:08 EDT 2010

All of this sounds good - and quite specific.

I think this list could be a good starting point
for making approaches to risks more transparent.

Bruce B.
----- Original Message -----
From: "Kenneth S. Casey" <Kenneth.Casey at noaa.gov>
To: alicebarkstrom at frontier.com
Cc: "ESIP Preservation cluster" <esip-preserve at rtpnet.org>, "Curt Tilmes" <Curt.Tilmes at nasa.gov>
Sent: Tuesday, October 12, 2010 9:39:02 AM
Subject: Re: [Esip-preserve] [FOO] Foo Project moves to Google Spreadsheets

Bruce, 

On Oct 12, 2010, at 9:21 AM, alicebarkstrom at frontier.com wrote: 

The suggestion that Ken has of embedding UUIDs in the file at least 
makes them unique and independent of cryptographic digests.  That means 
as long as a copy of the original file is readable, it can be uniquely 
identified.  What happens after the original file becomes unreadable 
(physical deterioration, software obsolescence, hardware obsolescence 
being prime suspects) is not so clear. 

At the risk of sounding defensive, which I am not, GHRSST has always accounted for these other risks through: 

1. Checksumming - everything gets checksummed on every transfer, from Regional Data Assembly Center (RDAC) to Global Data Assembly Center (GDAC) to Long Term Stewardship and Reanalysis Facility (LTSRF, at NODC). 

2. Formal archiving - following OAIS principles, to the best of our ability, at the LTSRF.  Plus most or all RDACs keep local archives of their own products. 

3. Use of a community-accepted file format: netCDF.  As we saw at the ESIP summer meeting at UCSB, Unidata is working actively to ensure netCDF remains viable for the long term.  This does not mean forever, of course, but it does mean that the expected lifetime of the format is pretty long.  And conversion to the "next format" is extremely doable since it is all standardized. 

4. Use of structured metadata - collection level (FGDC and GCMD DIF) and file level (CF), plus now ISO. 

5. Versioning - for now anyway, we keep all the versions of GHRSST data. Someday we may retire older versions, but for now we don't need to... most providers are not providing multiple versions anyway.  
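
A minimal sketch of the "checksum on every transfer" safeguard from item 1 might look like the following. It is only an illustration: the choice of SHA-256, the function names, and the file names are assumptions made for this example and are not taken from the GHRSST specification.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_with_check(source: Path, destination: Path) -> str:
    """Copy a file, then raise if the copy's digest differs from the source's."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"checksum mismatch for {destination}")
    return actual


# Hypothetical hand-off between two sites, simulated in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    rdac_copy = Path(workdir) / "granule_at_rdac.nc"
    gdac_copy = Path(workdir) / "granule_at_gdac.nc"
    rdac_copy.write_bytes(b"stand-in for netCDF bytes")
    digest = transfer_with_check(rdac_copy, gdac_copy)
    print("verified transfer, sha256 =", digest)
```

The same check would be repeated at each hop (RDAC to GDAC, GDAC to LTSRF), so a corrupted copy is caught at the step where it happened.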

I guess the point is that since there are a variety of risks, any program should know to implement a variety of safeguards... from checksums to multiple copies to standard formats to UUIDs to DOIs and so on.  As has been said, we can poke holes in any one of these approaches, but that should not stop us from using them; the strength comes from the multiple "layers" of safeguards and tracking mechanisms. 

As for defining collections, that is probably not so easy to do in general but can certainly be done on a project basis... in GHRSST we know exactly what a collection is and what a granule is.  Between checksums and UUIDs and DOIs, plus the date someone makes a citation, I feel pretty good that we could track back unambiguously exactly which granules were used.  And if not, we make adjustments in the next version of GHRSST and move on. 

Ken 

I've no objection to trying DOIs - but what we mean by a collection 
needs clarification.  I agree that Curt's simple example is very useful 
and adds to our understanding of that issue.  The critical issue here 
is the number of entities required for precise citation.  As a concrete 
set of questions, we need to state whether a citation needs to be 
sensitive to 
1.  The archive from which the data have been obtained (I'm curious 
     as to whether Ken's example on this will have multiple locations 
     where data will be stored.) 
2.  The Data Product or ESDT - meaning very generic collections 
3.  The Data Source - particularly if there are multiple sources, 
     such as instruments; in the case of data collections that have 
     many kinds of input data, is identifying each source critical? 
4.  The Version of the algorithms (as well as input parameters, such 
     as calibration coefficients) 
5.  A carefully selected set of files that were used in such instances 
     as validation by intercomparison with a field experiment 
6.  Data inside files - as might be the case where the use of the data 
     involved a small geographic or temporal sample within a larger 
     region 
7.  Specific identification of data inside a large number of files 
     (this comes up in attempts to construct records of the "solar 
      constant", where the authors of the reconstruction have 
      been attempting to determine whether or not there is a trend 
      in that value) 
I'll note the paper on scholarly citation of quantitative data 
(Altman and King) appears to me to squash all of these levels of 
precision together in an unacceptable way. 

I'll have some additional material - particularly on Chris' comments, 
which I think begin to help provide some rigor to the questions we're 
going to have to ask ourselves. 

Bruce B. 
----- Original Message ----- 
From: "Kenneth S. Casey" < Kenneth.Casey at noaa.gov > 
To: "Curt Tilmes" < Curt.Tilmes at nasa.gov > 
Cc: "ESIP Preservation cluster" < esip-preserve at rtpnet.org > 
Sent: Tuesday, October 12, 2010 6:54:37 AM 
Subject: Re: [Esip-preserve] [FOO] Foo Project moves to Google Spreadsheets 

Curt - I've been quiet on this list owing to lack of time to respond, but I did want to say that I think your FOO analysis is excellent and I've been following it closely!  Thanks so much. I think this approach is (1) extremely informative and (2) very comforting. It is comforting because the Group for High Resolution SST ( http://ghrsst.org ), a big international effort I participate in that produces standardized SST data from multiple satellites around the world, just decided to use UUIDs embedded in a netCDF attribute for every single granule. It also plans to use DOIs for the collections being generated by its network of data providers.  The UUID usage is part of the new "GHRSST Data Specification Version 2 (GDS2)", which was just published on October 1st.  GDS version 1 produced about 30 collections and 1.5 million netCDF files, and there will be even more GDS2 data in just the next few years... so, we'll see how well this approach works...   
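
To illustrate why an embedded UUID is independent of a cryptographic digest, here is a small sketch. A plain dictionary stands in for a netCDF file's global attributes, and the attribute name "uuid", the helper names, and the payload are assumptions for this example rather than details taken from GDS2.

```python
import hashlib
import uuid


def new_granule(payload: bytes) -> dict:
    """Build a toy granule: a dict of global attributes plus a data payload."""
    return {
        # Random (version 4) UUID assigned once, when the granule is created.
        "attributes": {"uuid": str(uuid.uuid4())},
        "data": payload,
    }


def content_digest(granule: dict) -> str:
    """Digest of the data payload only; it changes whenever the bytes change."""
    return hashlib.sha256(granule["data"]).hexdigest()


first = new_granule(b"identical SST values")
second = new_granule(b"identical SST values")

# Same bytes give the same digest, yet each granule keeps its own identity.
print(content_digest(first) == content_digest(second))
print(first["attributes"]["uuid"] == second["attributes"]["uuid"])
```

The digest answers "are these bytes the same?", while the UUID answers "is this the same granule?", which is why the two are complementary rather than interchangeable.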

Ken 

On Oct 11, 2010, at 2:52 PM, Curt Tilmes wrote: 

This is a read/only link: 

https://docs.google.com/leaf?id=0BztPCL0EZx_3NWI4OTQwN2ItMjU3OC00ZGIwLWFlNjUtNWY1OTE3MGJjNDUw&hl=en 

Mostly cut/pasted from my earlier emails. 

I did add one additional file, a "FOOLUT" lookup table that is an 
input to APP_L2.  We can change the version of the APP independently 
from the version of the LUT and explore the various provenance graphs. 
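
As a rough sketch of what "exploring the various provenance graphs" could mean, the snippet below enumerates every combination of APP_L2 version and FOOLUT version for one input. The version labels, the input granule name, and the record layout are hypothetical and are not copied from the spreadsheet.

```python
from itertools import product

# Hypothetical version labels; the APP and the LUT are versioned independently.
app_versions = ["1", "2"]
lut_versions = ["1", "2"]


def provenance(app_version: str, lut_version: str, input_granule: str) -> dict:
    """Describe one output granule by what produced it and what it was derived from."""
    return {
        "output": f"L2[{input_granule}|APP_L2 v{app_version}|FOOLUT v{lut_version}]",
        "generated_by": f"APP_L2 v{app_version}",
        "derived_from": [input_granule, f"FOOLUT v{lut_version}"],
    }


# One small provenance graph per (APP version, LUT version) pair.
graphs = [
    provenance(app, lut, "L1-granule")
    for app, lut in product(app_versions, lut_versions)
]

for graph in graphs:
    print(graph["output"], "<-", graph["generated_by"], "+", graph["derived_from"])
```

Two APP versions and two LUT versions give four distinct outputs from the same input, which is the reason an output's identity has to record both.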

I'm attaching a basic data flow diagram.  (I also pasted an SVG 
version of this diagram into the spreadsheet, but it doesn't come 
through every browser.) 

I'm also trying to always include the "[FOO]" tag so you can filter 
your ESIP-Preserve list if you aren't following this scenario and 
want to trim down the clutter. 

So far, this simple scenario has: 

1. Shown how DOI works well to identify and locate "ESDT+Collection". 

2. Shown how DOI doesn't precisely identify sets of granules. 

3. Shown how UUID can be used to unambiguously refer to individual granules. 
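
The three points above can be sketched as a toy catalog lookup. The DOI string, the class names, and the catalog contents are placeholders invented for this illustration.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Granule:
    uuid: str
    collection_doi: str


@dataclass(frozen=True)
class Citation:
    collection_doi: str
    # Empty means the citation names only the collection.
    granule_uuids: frozenset = frozenset()


def granules_for(citation: Citation, catalog: list) -> list:
    """Return the granules a citation refers to, as precisely as it allows."""
    in_collection = [g for g in catalog if g.collection_doi == citation.collection_doi]
    if not citation.granule_uuids:
        return in_collection
    return [g for g in in_collection if g.uuid in citation.granule_uuids]


EXAMPLE_DOI = "10.0000/example-collection"  # placeholder, not a registered DOI
catalog = [Granule(str(uuid.uuid4()), EXAMPLE_DOI) for _ in range(3)]

doi_only = Citation(EXAMPLE_DOI)
doi_plus_uuid = Citation(EXAMPLE_DOI, frozenset({catalog[1].uuid}))

print(len(granules_for(doi_only, catalog)))       # whole collection
print(len(granules_for(doi_plus_uuid, catalog)))  # exactly the cited granule
```

A DOI-only citation resolves to everything in the collection at lookup time, while adding granule UUIDs pins down the exact subset that was used.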

Curt 
<flow.png> 
_______________________________________________ 
Esip-preserve mailing list 
Esip-preserve at lists.esipfed.org 
http://www.lists.esipfed.org/mailman/listinfo/esip-preserve 

[NOTE: The opinions expressed in this email are those of the author alone and do not necessarily reflect official NOAA, Department of Commerce, or US government policy.] 

Kenneth S. Casey, Ph.D. 
Technical Director 
NOAA National Oceanographic Data Center 
1315 East-West Highway 
Silver Spring MD 20910 
301-713-3272 ext 133 
http://www.nodc.noaa.gov/ 
