[Esip-preserve] Identifiers Use Cases

Alice Barkstrom alicebarkstrom at verizon.net
Wed Apr 14 10:42:10 EDT 2010


This discussion is a useful prelude to classifying different
production paradigms.  At one extreme of high-throughput production,
you have a paradigm driven by operational considerations that require
the lowest latency possible - and which embeds variations in quality
and completeness into the stream of data.  At the other extreme, you
have a paradigm driven by error homogeneity.  This makes for a nice
classification that includes not only the NASA experience that Curt
describes below, but also the NOAA operational experience.  In the
latter case, the satellite and in situ data are processed as well as
possible with whatever is available within two to six hours.  However,
NCDC also has a number of data sets that take those operational data
streams and reprocess them for quality and homogeneity.  I suspect
that this distinction between low latency and high homogeneity is
enough of a feature of the data environment that we should begin to
recognize it in our identifier typologies.

A more formal approach would be to separate the change frequencies
into frequency bands (a notion that I'm borrowing from work on Manufacturing
Systems Engineering - S. Gershwin).  This works something like the
following:
Basic File Time Interval:          one file per 5 min [MODIS] or one file
                                   per hour or per day [CERES]
Operational Perturbations:         one change in attitude/orbit per two
                                   weeks; similar for revised raw data
Calibration Changes:               might be once per 6 months (if
                                   vicarious) or less frequently
Production Configuration Changes:  once per month to once per year
                                   (depending on acquisition frequency
                                   for software or hardware updates)
Major Algorithm Updates/Revisions: once per year (early - might get
                                   longer intervals later)
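
A rough sketch of how those bands might be written down follows (Python;
the band names and nominal periods are just the illustrative figures
above, not a standard of any kind):

from datetime import timedelta

# Nominal change period for each frequency band, using the example
# figures above.  These values are illustrative only.
CHANGE_BANDS = {
    "basic file time interval":         timedelta(minutes=5),   # MODIS; hourly or daily for CERES
    "operational perturbations":        timedelta(weeks=2),     # attitude/orbit, revised raw data
    "calibration changes":              timedelta(days=180),    # ~6 months if vicarious, often longer
    "production configuration changes": timedelta(days=30),     # once per month to once per year
    "major algorithm updates":          timedelta(days=365),    # roughly once per year, early on
}

# Example: calibration changes are much slower than the basic file
# time interval.
assert CHANGE_BANDS["calibration changes"] > CHANGE_BANDS["basic file time interval"]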

This separation operates (in the manufacturing-systems environment,
although data production partakes of it) so that if changes are
happening faster than the band under consideration, all you can do is
describe their statistics; if they are happening more slowly, you
assume nothing changes within that band.  Thus, calibration changes
across a basic file time interval are assumed to be negligible.
Likewise, the attitude/orbit environment is assumed to be constant
across the basic file time interval.  This approach should give a
"natural" way to group collections of files - and should probably be
recognized in the identifier schemas.  You can see this separation
appearing in Curt's next e-mail - with DOIs for large collections and
PURLs, ARKs (and OIDs) for the more rapidly changing material.
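
A minimal sketch of that rule, again in Python - the "describe
statistically" vs. "assume constant" split, plus a deliberately crude
illustration of matching identifier weight to band (the one-year
threshold is my own placeholder, not a recommendation):

from datetime import timedelta

def treatment(change_period, band_period):
    """How to handle a change process when working at a given band:
    changes faster than the band can only be described statistically;
    changes slower than the band are assumed constant within it."""
    if change_period < band_period:
        return "describe statistically within the band"
    return "assume constant within the band"

def suggested_identifier(band_period):
    """Crude illustration only: citation-grade identifiers (DOIs) for
    slowly changing collections; lighter-weight identifiers
    (PURL/ARK/OID) for the more rapidly changing material."""
    return "DOI" if band_period >= timedelta(days=365) else "PURL/ARK/OID"

# Calibration (roughly 180 days) seen from the 5-minute file band is
# effectively constant; 5-minute file-level variability seen from the
# calibration band can only be summarized statistically.
print(treatment(timedelta(days=180), timedelta(minutes=5)))
print(treatment(timedelta(minutes=5), timedelta(days=180)))
print(suggested_identifier(timedelta(days=365)))   # DOI
print(suggested_identifier(timedelta(minutes=5)))  # PURL/ARK/OID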

Note that this view also provides a useful basis for explaining to data
users what happened, as well as for recommending which data collections
might suit a particular kind of activity.

Bruce B.


At 09:25 AM 4/14/2010, Curt Tilmes wrote:
>Alice Barkstrom wrote:
> > It also seems true that once a file is produced, it is not likely to
> > be changed - unless there was a cataloging error.  If a scientific
> > error is identified, the file will not be modified although the
> > producer may reduce such an error when he/she creates a new version
> > of the data set.
>
>This is true of the 'good' data sets, the long term, reprocessed,
>cleaned up data.  We try to keep a nice consistent version of
>everything across the whole dataset.
>
>Operational realities of the 'forward processing' stream have to
>accommodate changing data.  Broken deliveries, partial packets later
>resent with more data in them, etc.  We can either delay producing
>data until we know it is good, introducing latency, or make it as soon
>as we can and remake it if we need to.
>
>We are often torn between trying to produce and keep a nice consistent
>data set and the realization that we are making crap.  We can either
>1) keep making crap (at least it is consistent) or 2) make better data
>in the future by delivering new code.  Then we have the choice to keep
>the old data, or remake it.
>
>As you note (and your examples describe) the reprocessed data sets are
>generally very systematic and nice.  The operational, forward
>processing stream has occasional hiccups and brokenness that we try to
>fix on the fly, even if it involves remaking broken data.
>
>In the ACPS (and to some extent MODAPS), we distinguish these two
>cases, and handle them differently.  Each stream (roughly what I'm
>calling Data Version on the use case page) is processed in a separate
>ArchiveSet.
>
>Each 'major' reprocessing (i.e. a distinct version of a data set we
>want to keep around) happens in a new ArchiveSet.  The ArchiveSet +
>ESDT + TimeStamp (=DataSet) is sufficient to precisely describe the
>granule membership at a given instant in time.
>
>If there is 'minor' reprocessing (i.e. we get a call from EDOS that
>they sent a broken packet and a new, fixed one is on the way), we just
>reprocess it inside the same ArchiveSet.  I still maintain complete
>and precise provenance information about the broken packet.  If you
>refer to the DataSet with a timestamp after I made the bad data, but
>before I made the good data, your referenced DataSet happens to
>include the bad granule. (I can even use your referenced identifier to
>point out to you that it includes bad data and you probably want to
>update to a later instance.)
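
A minimal sketch of how I picture that kind of reference resolving - a
toy in-memory catalog with per-granule insert times and supersession
links.  The class and field names (and the ArchiveSet/ESDT values) are
made up for illustration, not the actual ACPS interfaces:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Granule:
    granule_id: str
    inserted_at: datetime
    superseded_by: Optional[str] = None   # set when a minor reprocess replaces it

@dataclass
class Catalog:
    # Toy catalog: lists of granules keyed by (archive_set, esdt).
    granules: dict = field(default_factory=dict)

    def add(self, archive_set, esdt, granule):
        self.granules.setdefault((archive_set, esdt), []).append(granule)

    def dataset(self, archive_set, esdt, timestamp):
        """ArchiveSet + ESDT + TimeStamp: the granules that were members
        at that instant (whatever had been inserted by then)."""
        return [g for g in self.granules.get((archive_set, esdt), [])
                if g.inserted_at <= timestamp]

    def stale_members(self, archive_set, esdt, timestamp):
        """Granules in the referenced DataSet that were later replaced
        inside the same ArchiveSet - the 'you probably want to update
        to a later instance' case."""
        return [g for g in self.dataset(archive_set, esdt, timestamp)
                if g.superseded_by is not None]

# A broken packet yields granule G1; EDOS resends, and the minor
# reprocess yields G2 inside the same ArchiveSet.
cat = Catalog()
cat.add("AS51", "MOD021KM", Granule("G1", datetime(2010, 4, 1, 6, 0)))
cat.add("AS51", "MOD021KM", Granule("G2", datetime(2010, 4, 2, 6, 0)))
cat.granules[("AS51", "MOD021KM")][0].superseded_by = "G2"

# A reference made between the two still resolves to a membership that
# includes the bad granule, and the catalog can flag it as superseded.
ref_time = datetime(2010, 4, 1, 12, 0)
print([g.granule_id for g in cat.dataset("AS51", "MOD021KM", ref_time)])         # ['G1']
print([g.granule_id for g in cat.stale_members("AS51", "MOD021KM", ref_time)])   # ['G1']
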
>
>I don't think we should replace bad data within a referenced dataset
>without updating the reference, nor do I think we should keep the bad
>data in the 'current' version of the dataset.  Our scheme strikes a happy
>medium where the references are always precise and consistent, but the
>bad data do get fixed.
>
> > It would also be wise to do the appropriate "production engineering"
> > to ensure we are dealing with "high probability" scenarios first,
> > rather than unrepresentative, extreme cases.  I rather suspect that
> > the scenarios divide into two classes: one being the highly
> > "regular" cases that provide the bulk of the data and a second being
> > the cases that cause a lot more work for the archives.
>
>Agreed, but I think a comprehensive scheme will eventually accommodate
>both scenarios.
>
>Curt



