[Esip-preserve] Identifiers Use Cases

Curt Tilmes Curt.Tilmes at nasa.gov
Wed Apr 14 09:25:09 EDT 2010


Alice Barkstrom wrote:
> It also seems true that once a file is produced, it is not likely to
> be changed - unless there was a cataloging error.  If a scientific
> error is identified, the file will not be modified although the
> producer may reduce such an error when he/she creates a new version
> of the data set.

This is true of the 'good' data sets: the long-term, reprocessed,
cleaned-up data.  We try to keep a nice, consistent version of
everything across the whole dataset.

Operational realities of the 'forward processing' stream have to
accommodate changing data: broken deliveries, partial packets later
resent with more data in them, etc.  We can either delay producing
data until we know it is good, introducing latency, or make it as soon
as we can and remake it if we need to.

We are often torn between trying to produce and keep a nice, consistent
data set and the realization that we are making crap.  We can either
1) keep making crap (at least it is consistent), or 2) make better data
in the future by delivering new code.  Then we have the choice to keep
the old data or remake it.

As you note (and your examples describe) the reprocessed data sets are
generally very systematic and nice.  The operational, forward
processing stream has occasional hiccups and brokenness that we try to
fix on the fly, even if it involves remaking broken data.

In the ACPS (and to some extent MODAPS), we distinguish these two
cases, and handle them differently.  Each stream (roughly what I'm
calling Data Version on the use case page) is processed in a separate
ArchiveSet.

Each 'major' reprocessing (i.e. a distinct version of a data set we
want to keep around) happens in a new ArchiveSet.  The ArchiveSet +
ESDT + TimeStamp (=DataSet) is sufficient to precisely describe the
granule membership at a given instant in time.
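
To make that concrete, here is a minimal sketch (the record and field
names are invented for illustration; this is not the actual ACPS
schema) of how a DataSet reference resolves to granule membership:

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Granule:
    granule_id: str
    archive_set: int
    esdt: str
    inserted: datetime               # when the granule entered the ArchiveSet
    superseded: Optional[datetime]   # when a remade granule replaced it (None if current)

def dataset_members(granules: List[Granule], archive_set: int,
                    esdt: str, timestamp: datetime) -> List[Granule]:
    """Granule membership of DataSet (archive_set, esdt) as of timestamp."""
    return [g for g in granules
            if g.archive_set == archive_set
            and g.esdt == esdt
            and g.inserted <= timestamp
            and (g.superseded is None or g.superseded > timestamp)]

Two different TimeStamps over the same ArchiveSet + ESDT can therefore
name two different, but each perfectly precise, sets of granules.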

If there is 'minor' reprocessing (i.e. we get a call from EDOS that
they sent a broken packet and a new, fixed one is on the way), we just
reprocess it inside the same ArchiveSet.  I still maintain complete
and precise provenance information about the broken packet.  If you
refer to the DataSet with a timestamp after I made the bad data, but
before I made the good data, your referenced DataSet happens to
include the bad granule.  (I can even use your referenced identifier to
point out to you that it includes bad data and that you probably want
to update to a later instance.)
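
Reusing the hypothetical Granule record from the sketch above, spotting
that situation is just a check for members that were superseded after
the referenced TimeStamp:

def stale_members(granules: List[Granule], archive_set: int,
                  esdt: str, timestamp: datetime) -> List[Granule]:
    """Members of the referenced DataSet that have since been remade."""
    return [g for g in granules
            if g.archive_set == archive_set
            and g.esdt == esdt
            and g.inserted <= timestamp
            and g.superseded is not None
            and g.superseded > timestamp]

If that list is non-empty for your reference, your DataSet instance
still resolves to the bad granule and a later instance exists.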

I don't think we should replace bad data within a referenced dataset
without updating the reference, nor do I think we should keep the bad
data in the 'current' version of the dataset.  Our scheme strikes a
happy medium where the references are always precise and consistent,
but the bad data do get fixed.

> It would also be wise to do the appropriate "production engineering"
> to ensure we are dealing with "high probability" scenarios first,
> rather than unrepresentative, extreme cases.  I rather suspect that
> the scenarios divide into two classes: one being the highly
> "regular" cases that provide the bulk of the data and a second being
> the cases that cause a lot more work for the archives.

Agreed, but I think a comprehensive scheme will eventually accommodate
both scenarios.

Curt
