[Esip-preserve] An Interesting Wrinkle on Collection Volatility

Mon Apr 19 09:26:48 EDT 2010

In trying to work up a table that would allow us to see what we're
referring to, it has occurred to me that we're trying to deal with
two distinct kinds of volatility in open collections.

First, I'll assume that we're only trying to deal with citations to files,
not to subsets of files.  In the last couple of weeks, I've been working
with the precipitation records from NCDC and have just three files
in the collection.  One is an inventory of stations and such things
as lat and long.  The second consists of monthly precip values
for about 20,000 stations.  The third consists of "adjusted" monthly
precip values at about 2,300 stations.  NCDC apparently updates
the contents of the latter two when they get new measurements.
In that sense, these files are rather like a database that gets updated
every so often - although they're available to the users as discrete
files in ASCII.  These facts mean that the file contents are volatile
and suggest that citations need to contain a time stamp for the
creation date of the file.
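
The time-stamp point above can be sketched in a few lines of Python.
This is only an illustration: the class, its field names, the citation
layout, and the example file name and dates are all assumptions of
mine, not an NCDC or ESIP convention.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FileCitation:
    """Hypothetical citation record for one file in a volatile collection."""
    producer: str
    file_name: str
    created: datetime  # creation date of the cited file instance

    def as_text(self) -> str:
        # Put the time stamp in the citation so that two instances of the
        # same file name stay distinguishable after the producer updates it.
        stamp = self.created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return f"{self.producer}. {self.file_name} [file created {stamp}]."


# Hypothetical file name and creation dates, for illustration only.
earlier = FileCitation("NCDC", "monthly_precip.txt",
                       datetime(2010, 3, 1, tzinfo=timezone.utc))
later = FileCitation("NCDC", "monthly_precip.txt",
                     datetime(2010, 4, 1, tzinfo=timezone.utc))
```

The two records name the same file but produce different citation
text, which is the property a citation to a volatile file needs.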

Note that one kind of data use would work with a subset of one of these
files.  For example the time series for a single station might be of
interest.  In this case, the citation would probably need to be to the
time-stamped file from NCDC.

At the same time, it would be useful for provenance tracking purposes
to be able to identify the original source of the data, since this file is
assembled with data from several thousand data collections.  This
particular scenario is the kind of editing problem that has bedeviled
the IPCC and "climategate" folks, so I think it is a highly relevant,
practical problem.

Second, it's important to be able to distinguish between "open" and
"closed" collections.  Closed collections are ones that have eternally
fixed sets of files.  Open collections are ones that may have items
added.  [Curt's suggestion on this nomenclature was a good and useful
addition to our vocabularies on this issue.]
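
A minimal sketch of the open/closed distinction, assuming a collection
is nothing more than a named set of file names.  The class and method
names here are hypothetical and only illustrate the vocabulary.

```python
class ClosedCollectionError(Exception):
    """Raised when something tries to add a file to a closed collection."""


class Collection:
    """Hypothetical collection: a named set of file names, open or closed."""

    def __init__(self, name, files, is_open):
        self.name = name
        self._files = set(files)
        self.is_open = is_open

    @property
    def files(self):
        # Return a read-only snapshot of the current membership.
        return frozenset(self._files)

    def add_file(self, file_name):
        # A closed collection has a fixed set of files, so refuse additions.
        if not self.is_open:
            raise ClosedCollectionError(f"{self.name} is closed")
        self._files.add(file_name)
```

A citation to a closed collection can rely on the membership never
changing; a citation to an open one cannot.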

It is probably helpful to note that open collections can change because
they add new sub-collections or because they add new files within an
existing collection.

For example, if the producer decides to add an entirely new collection
with different parameters than he/she had produced before, that would
constitute one kind of volatility.  Concretely, a producer working with
multi-spectral data might have decided to add a new data product
containing aerosol concentration, whereas before his/her team had only
done vegetation products.

In other cases, a producer might create a new sub-collection (I'll call
that a Data Set) when there's data from a new instrument that gets
processed in (nearly) the same way as data from previous instruments.
In EOS terms, MODIS and CERES have had this happen several times.

Likewise, a producer can add a new Version by changing algorithms
(whether documented in an ATBD or in other locations) while using the
same sources as before.  That would be a sub-sub-sub
collection in the hierarchy I published some time ago.

The appropriate citation for these kinds of collection changes may depend
on how precisely the data user wants to specify the data of interest.

Sometimes, a user may want to refer to all of the data from a particular
producer - as in a generic reference to a kind of data that serves to
place this user's work in a broad context.

Sometimes, a user may want to compare the general properties of
one version against the properties of another.  Such a case might come
up in dealing with whether the new version had sufficient global
accuracy to detect climate change whereas the previous version
did not.

Sometimes, a user may need to be highly specific, referring to a
(small) subset of a version that was used in intercomparing one
set of data over a field experiment that lasted two weeks in some
portion of the globe.

Finally, we have the kind of collection changes that Curt discussed
last week, in which the producer is sending out new instances of
files that are nearly certain to be "replacements" for previous files
in the same collection.  If I recall correctly, DSpace would require
a new identifier for each instance - and no deletion of previous files.
Other archiving schemes would allow replacements.
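
To make the difference between the two archiving behaviors concrete,
here is a rough sketch.  Both classes are hypothetical; neither models
the API of any real repository software, only the two policies as
described above (a new identifier per instance with no deletion, versus
replacement in place).

```python
class AppendOnlyArchive:
    """Every deposited instance gets a new identifier; nothing is deleted."""

    def __init__(self):
        self._instances = {}  # identifier -> (file_name, content)
        self._next_id = 1

    def deposit(self, file_name, content):
        # Mint a fresh identifier even if the file name was seen before.
        identifier = f"id-{self._next_id}"
        self._next_id += 1
        self._instances[identifier] = (file_name, content)
        return identifier

    def get(self, identifier):
        return self._instances[identifier][1]


class ReplacingArchive:
    """A new instance overwrites the previous one under the same file name."""

    def __init__(self):
        self._current = {}  # file_name -> content

    def deposit(self, file_name, content):
        # The file name is the only identifier, so older content is lost.
        self._current[file_name] = content
        return file_name

    def get(self, file_name):
        return self._current[file_name]
```

Under the first policy an old citation still resolves to the instance
that was actually used; under the second it silently resolves to the
replacement.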

The upshot - I think we need concrete examples of the structure
of collections.  We also need concrete examples of the kinds of
citations we're thinking people will need.  It would also be useful
to have some empirical census of kinds of citations, rather than
cases we invent out of our hypotheses about what users do.
Some users may not care about which subsets of which version
of data were used.  On the other hand, the user might
have to deal with legal questioning about whether he or she
had "deleted" data from his/her conclusions, a situation that calls
for an entirely different level of care about the precision of citations.

I've started a table, although after some experimentation I did not find
the material in the NAS report on collections that Ruth had mentioned
last week to be particularly useful.

I'll get the table out as soon as I can, even if only in a preliminary form.

Bruce B.