[Esip-preserve] Identifiers Use Cases

Tue Apr 13 09:25:13 EDT 2010

A useful note.  It would be interesting to know what the production scenario
was and how the "duplicate" data files differed from the "original" ones.
Was it poor configuration management or were there real differences?
There's an interesting issue with some of the archive designs (e.g. DSpace)
that don't allow any file deletion.

I did look up Curt's PowerPoint slides on how ESDTs lie behind the
nomenclature.  The notation is a pretty elegant approach to representing
the selection of subsets, although I think the cases represented don't
have all of the layers needed.  For example, I think the notion that an
ESDT is distinguished by having an ATBD is probably not sufficient for
preserving the details associated with versions - I'm pretty sure the
CERES team didn't redo the ATBDs when we produced a new version -
although the on-line documentation for each version does include caveats.
Likewise, Curt's schema doesn't seem to me to represent the differences
between data sources (meaning data that use only MODIS Terra versus only
MODIS Aqua versus data that might be built on a combination of the two).
Ditto for possible differences with respect to machine configuration
during production.  Curt's note on the fact that Earth science data
versioning needs to represent the influence of causative factors other
than just the author is correct.

More later.

Thanks for the note.

Bruce B.

At 12:49 AM 4/13/2010, Ruth Duerr wrote:
>Just a quick note Bruce about "classic" EOSDIS products.  In general
>you may be correct; but for MODIS we received "duplicate" data files
>(i.e., newly produced copies of data from ostensibly the same
>version) often enough that we had to work out a strategy for dealing
>with them...
>
>;-) Ruth
>On Apr 12, 2010, at 7:09 AM, Alice Barkstrom wrote:
>
>>I think it would be helpful to have some more concrete production
>>and data use scenarios.  The Web site I've placed on line at oceandis
>>has a whole collection of these with a table of contents at
>>
>>http://www.oceandis.com/metadata/Text_Documentation/Example/example_index.html
>>
>>Chapter 10 in this collection of pdf documents has a couple of figures
>>(10.2 and 10.3) that show the buildup over time of a number of versions
>>(DTXs in Curt's nomenclature).  The production scenarios in these pdf
>>chapters are simple compared with MODIS or CERES, but they have a fair
>>amount of reality based on the CERES production for Level 1 data.
>>
>>For the largest component of Earth science data, the extensions of the
>>2.6 PB in the "classic" EOSDIS data centers, production is hardly a
>>random update process.  Rather, it proceeds pretty systematically.  It
>>also seems true that once a file is produced, it is not likely to be
>>changed - unless there was a cataloging error.  If a scientific error is
>>identified, the file will not be modified although the producer may
>>reduce such an error when he/she creates a new version of the data set.
>>
>>For data produced operationally, e.g. GOES images, radiosondes, HCN or
>>other surface networks, the producers are under such time pressure
>>that they do not have time to go back and revise the software or
>>coefficients - meaning that they do not produce new versions, although
>>they may have changes in the coefficients or code - sometimes noted and
>>documented.
>>
>>In validation campaigns, of course, the selection of data in subsets
>>that refer to particular times and places might have a number of interim
>>files that have been worked with different processes.  These should
>>probably not be regarded as "published" in the normal sense - they are
>>not "peer-reviewed", but are steps along the way.
>>
>>In short, having some well-documented production and data use scenarios
>>with dates of data collection, dates of production, and dates of data
>>use is critical to getting to the bottom of these issues.  It would also
>>be wise to do the appropriate "production engineering" to ensure we are
>>dealing with "high probability" scenarios first, rather than
>>unrepresentative, extreme cases.  I rather suspect that the scenarios
>>divide into two classes: one being the highly "regular" cases that
>>provide the bulk of the data and a second being the cases that cause a
>>lot more work for the archives.
>>
>>Bruce B.
>>
>>At 06:45 AM 4/12/2010, Curt Tilmes wrote:
>>>Ruth Duerr wrote:
>>> > However, I think it is the citation that needs that, not the
>>> > identifier for the data set.
>>>
>>>Yes.  That is one of the differences in the examples I showed for DOI
>>>vs. PURL.  It is trivial to produce thousands of PURL identifiers, so
>>>it makes sense to put the full qualification in the identifiers.  For
>>>DOI, not so much, so I added an additional qualifier (in my proposed
>>>case, the date/time) to the citation.  You distinguished identifiers
>>>from citations with better wording on the main page.
>>>
>>> > I also think that, as you suggest in your use cases, the time of
>>> > access is one possible mechanism for doing that
>>>
>>>We also looked at some hashing schemes or even arbitrary identifiers
>>>that mapped to sets of granules, but nothing was as clean and easy to
>>>use (and understand) for users or implementers as date/time.
>>>
>>> > (and that is probably the simplest mechanism from a citation
>>> > standpoint though not necessarily from a user standpoint if for no
>>> > other reason than it might have taken the user a month to download
>>> > all the data they used and the data set may have undergone a whole
>>> > host of updates over that time period).
>>>
>>>Ok, take that case.  How should we propose to handle it?
>>>
>>>In my scheme, the date/time in the citation is a point in time, so you
>>>could either:
>>>
>>>1. grab the original set of granules that were existing at the time
>>>you start that long month of downloads and cite that date/time.
>>>
>>>or 2. double check the data set and grab any updates and cite the
>>>later date/time.
>>>
>>>How else could we approach it and still maintain the precision of
>>>citation?
>>>
>>>Curt
>>>_______________________________________________
>>>Esip-preserve mailing list
>>>Esip-preserve at lists.esipfed.org
>>>http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>>
>>
>
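
The point-in-time citation idea quoted above (cite a date/time, and let
that date/time select the set of granules that existed then) can be
sketched roughly as follows.  This is only an illustration, not anything
proposed in the thread: the GranuleRecord type, the granules_as_of
function, the file names, and the dates are all hypothetical, and the
sketch assumes the archive keeps a record of when each file was added and
when (if ever) it was superseded.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class GranuleRecord:
    """Hypothetical archive bookkeeping entry for one data file."""
    granule_id: str                      # made-up file name
    added: datetime                      # when the file entered the archive
    removed: Optional[datetime] = None   # when it was superseded, if ever


def granules_as_of(records, when):
    """Return the sorted IDs of granules that existed in the archive at `when`."""
    return sorted(
        r.granule_id
        for r in records
        if r.added <= when and (r.removed is None or when < r.removed)
    )


# Made-up history: g002.hdf is replaced by a reprocessed copy on 20 March.
records = [
    GranuleRecord("g001.hdf", datetime(2010, 3, 1)),
    GranuleRecord("g002.hdf", datetime(2010, 3, 1), removed=datetime(2010, 3, 20)),
    GranuleRecord("g002_r1.hdf", datetime(2010, 3, 20)),
]

download_start = datetime(2010, 3, 5)
download_end = datetime(2010, 4, 5)

# Option 1: cite the start of the month-long download.
option_1 = granules_as_of(records, download_start)

# Option 2: pick up any updates and cite the later date/time.
option_2 = granules_as_of(records, download_end)
```

Under these assumptions the two options resolve to different granule
sets, which is the precision question raised in the thread.  The sketch
only works if superseded files stay on record, so an archive that never
deletes files would be able to answer such a query directly.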