[Esip-preserve] Identifiers Use Cases

Alice Barkstrom alicebarkstrom at verizon.net
Mon Apr 12 09:09:53 EDT 2010


I think it would be helpful to have some more concrete production
and data use scenarios.  The Web site I've placed on line at oceandis
has a whole collection of these with a table of contents at

http://www.oceandis.com/metadata/Text_Documentation/Example/example_index.html

Chapter 10 in this collection of pdf documents has a couple of figures
(10.2 and 10.3) that show the buildup over time of a number of versions
(DTX's in Curt's nomenclature).  The production scenarios in these pdf
chapters are simple compared with MODIS or CERES, but they have a fair
amount of reality, being based on the CERES production for Level 1 data.

For the largest component of Earth science data, the extensions of the
2.6 PB in the "classic" EOSDIS data centers, production is hardly a
random update process.  Rather, it proceeds pretty systematically.  It
also seems true that once a file is produced, it is not likely to be
changed - unless there was a cataloging error.  If a scientific error is
identified, the file will not be modified, although the producer may
reduce such an error when he/she creates a new version of the data set.

For data produced operationally, e.g. GOES images, radiosondes, HCN or
other surface networks, the producers are under such time pressure that
they do not have time to go back and revise the software or coefficients -
meaning that they do not produce new versions, although they may have
changes in the coefficients or code - sometimes noted and documented.

In validation campaigns, of course, the selection of data in subsets that
refer to particular times and places might have a number of interim files
that have been worked with different processes.  These should probably
not be regarded as "published" in the normal sense - they are not
"peer-reviewed", but are steps along the way.

In short, having some well-documented production and data use scenarios
with dates of data collection, dates of production, and dates of data use
is critical to getting to the bottom of these issues.  It would also be wise
to do the appropriate "production engineering" to ensure we are dealing with
"high probability" scenarios first, rather than unrepresentative, extreme
cases.  I rather suspect that the scenarios divide into two classes: one being
the highly "regular" cases that provide the bulk of the data and a second being
the cases that cause a lot more work for the archives.

Bruce B.

At 06:45 AM 4/12/2010, Curt Tilmes wrote:
>Ruth Duerr wrote:
> > However, I think it is the citation that needs that, not the
> > identifier for the data set.
>
>Yes.  That is one of the differences in the examples I showed for DOI
>vs. PURL.  It is trivial to produce thousands of PURL identifiers, so
>it makes sense to put the full qualification in the identifiers.  For
>DOI, not so much, so I added an additional qualifier (in my proposed
>case, the date/time) to the citation.  You distinguished identifiers
>from citations with better wording on the main page.
>
> > I also think that, as you suggest in your use cases, the time of
> > access is one possible mechanism for doing that
>
>We also looked at some hashing schemes or even arbitrary identifiers
>that mapped to sets of granules, but nothing was as clean and easy to
>use (and understand) for users or implementers as date/time.
>
> > (and that is probably the simplest mechanism from a citation
> > standpoint though not necessarily from a user standpoint if for no
> > other reason than it might have taken the user a month to download
> > all the data they used and the data set may have undergone a whole
> > host of updates over that time period).
>
>Ok, take that case.  How should we propose to handle it?
>
>In my scheme, the date/time in the citation is a point in time, so you
>could either:
>
>1. grab the original set of granules that were existing at the time
>you start that long month of downloads and cite that date/time.
>
>or 2. double check the data set and grab any updates and cite the
>later date/time.
>
>How else could we approach it and still maintain the precision of
>citation?
>
>Curt
>_______________________________________________
>Esip-preserve mailing list
>Esip-preserve at lists.esipfed.org
>http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
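
As a minimal sketch of the point-in-time citation Curt describes above,
and assuming each granule record carries a production time and, if it was
later replaced, a supersession time, resolving a cited date/time to a set
of granules might look like the following.  The names Granule,
granules_as_of, and the sample catalog are hypothetical, made up for
illustration; they do not reflect any actual archive's interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Granule:
    """Hypothetical catalog record for one file in a data set."""
    name: str
    produced: datetime                      # when the file entered the archive
    superseded: Optional[datetime] = None   # when a replacement took over, if ever


def granules_as_of(catalog: List[Granule], cited: datetime) -> List[Granule]:
    """Return the granules that existed and were current at the cited date/time."""
    return [
        g for g in catalog
        if g.produced <= cited and (g.superseded is None or cited < g.superseded)
    ]


# Made-up catalog: g1 is replaced by a new version on 1 April 2010.
catalog = [
    Granule("g1.v1", datetime(2010, 3, 1), superseded=datetime(2010, 4, 1)),
    Granule("g1.v2", datetime(2010, 4, 1)),
    Granule("g2.v1", datetime(2010, 3, 5)),
]

download_start = datetime(2010, 3, 10)   # option 1: cite the start of the download
recheck_time = datetime(2010, 4, 12)     # option 2: re-check and cite the later time

at_start = [g.name for g in granules_as_of(catalog, download_start)]
at_recheck = [g.name for g in granules_as_of(catalog, recheck_time)]
```

In the month-long download case, citing the start date/time would return
the original set of granules (option 1), while citing the later date/time
would pick up the replacement (option 2).  Either way the citation stays
precise only if files are never modified in place, which matches the
production behavior described earlier in this message.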



