[Esip-preserve] Provenance and Context Content Spreadsheet

Bruce Barkstrom brbarkstrom at gmail.com
Sun Mar 6 13:30:34 EST 2011


I'll probably have some longer comments later, but here's a bit more on the context of the NASA project environment that may help explain the tabulation. First, the concern John Moses has been wrestling with is what to do with items at the end of a government project. The instrument developer is usually a contractor, and the documentation items need to be specifically identified as deliverables in the contract. They may consist of documents or collections of data, which might take the form of files or of databases developed by the contractor. The full collections may be provided to the government institution that oversees the contract. The spreadsheet is intended to assist, first, in dealing with items in the contracts that will be disposed of if no action is taken very soon. It is not clear that the relevant institutions or organizations that deal with the metadata and data produced by the project have any knowledge of this contractually deliverable information. Indeed, the term 'metadata', and the finer distinction between documents, data, and the contextual information we usually call metadata, may not have any meaning to the instrument project or the instrument contractor.

In my experience (where the dialect I'm accustomed to may not be equivalent to that of other communities), QA and QC would usually refer to automated elements of the production software that produce reports (usually files with text and numerical values) generated routinely as part of each file's production job. The number of these reports can be quite large - as might be expected when production rates on NASA EOS missions run from 5,000 to 100,000 jobs per day over ten to twenty years. It is not clear that the science teams that provide the software or the EOS data centers have ever made any plans to archive these reports. QA and QC might also include the diagnostic work and production changes associated with discovering anomalies in the reports and developing fixes. While these anomalies and fixes are likely to appear in action item lists prepared by the production teams, those items are also likely to be discarded, or to be left off the archive accession lists for permanent retention.
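
Just to put a rough number on that report volume, here is a back-of-the-envelope sketch; the job rates and mission length are only the illustrative figures from the paragraph above, not actuals for any particular mission:

    # Rough count of QA/QC report files, assuming one report per production job.
    # Job rates and mission length are illustrative, taken from the ranges above.
    jobs_per_day_low, jobs_per_day_high = 5_000, 100_000
    mission_years = 15                    # somewhere in the ten-to-twenty-year range
    days = mission_years * 365

    low_estimate = jobs_per_day_low * days      # about 27 million reports
    high_estimate = jobs_per_day_high * days    # about 550 million reports
    print(f"{low_estimate:,} to {high_estimate:,} reports over {mission_years} years")

Even at the low end, that is tens of millions of small files that, as far as I know, nobody has planned to accession.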

'Validation' and 'Calibration' also have many variant meanings. In my experience, calibration refers to the activities and procedures used to convert the raw data into geophysical units. For example, a conversion from digital counts to calibrated radiances would use calibration coefficients, as well as an algorithm embedded in source code that may also perform statistical filtering to remove "bad values". The calibration data for the instruments I'm familiar with may come from pre-flight facilities on the ground or from in-flight sources that are part of the instrument. To complicate the discussion further, some kinds of calibration algorithms use fairly complex radiative transfer models that combine information on surface reflection and atmospheric constituent distributions to model the radiances arriving at the instrument in orbit. In all three of these calibration methods there are complex chains of processing, and fairly large amounts of data and documentation may be involved.
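
As a concrete (if oversimplified) illustration of the first kind of step, a counts-to-radiance conversion often amounts to applying gain and offset coefficients and screening out-of-range values. The sketch below is not the algorithm of any particular instrument; the linear form, the units, and the screening thresholds are assumptions for illustration only:

    import numpy as np

    def counts_to_radiance(counts, gain, offset, valid_range=(0.0, 500.0)):
        # Convert raw digital counts to radiance (W m^-2 sr^-1, say), assuming
        # a simple linear calibration: L = gain * counts + offset.
        # Values outside valid_range are treated as "bad" and set to NaN.
        radiance = gain * np.asarray(counts, dtype=float) + offset
        bad = (radiance < valid_range[0]) | (radiance > valid_range[1])
        radiance[bad] = np.nan          # crude filter for "bad values"
        return radiance

    # Example call; the coefficients here are made up, not from any real
    # calibration file.
    L = counts_to_radiance([120, 4095, 980], gain=0.25, offset=-3.0)

In practice the coefficients themselves come from the pre-flight or in-flight sources mentioned above, and recording which coefficient set produced which files is exactly the kind of provenance that tends to get lost.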

In contrast, validation is more likely to involve the activities, processes, and data that compare geophysical parameters from different measurement sources, and it is highly likely to involve the more complex processes that convert the calibrated values (of the previous paragraph) into other kinds of geophysical quantities. For the experiments with which I've been involved, calibrated radiances are the lowest physically useful kind of data. Most of the validation involved processes that attempted to compare fluxes from our instruments with fluxes from other instruments (radiance being a quantity that describes energy transport along a light ray, while a flux describes energy being transported through an area from all directions), or with determinations of cloud cover or cloud properties. In addition, validation often involves areal and temporal averages - say, going from 20 km diameter footprints to 2.5 degree regions, and from instantaneous observations to monthly averages. Again, the processes, data, and documentation for these validation exercises may be voluminous, with as much storage volume as the data itself.
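
(For the radiance/flux distinction, the simplest case may help: for an isotropic radiance L, the flux through a surface from one hemisphere of directions is just F = pi * L, the integral of L weighted by the cosine of the viewing angle.) To make the averaging step a little more concrete, here is a minimal sketch of binning footprint values onto a 2.5 degree grid for a month of observations. The grid size, the variable names, and the unweighted mean are assumptions for illustration; real production code also deals with angular models, weighting, and quality flags:

    import numpy as np

    def monthly_regional_mean(lats, lons, values, grid_deg=2.5):
        # Average footprint-level values (e.g., fluxes in W m^-2) into
        # grid_deg x grid_deg latitude-longitude boxes for one month of data.
        # Unweighted mean; boxes with no observations come back as NaN.
        lats = np.asarray(lats, dtype=float)
        lons = np.asarray(lons, dtype=float)
        values = np.asarray(values, dtype=float)
        nlat, nlon = int(180 / grid_deg), int(360 / grid_deg)
        sums = np.zeros((nlat, nlon))
        counts = np.zeros((nlat, nlon))
        ilat = np.clip(((lats + 90.0) / grid_deg).astype(int), 0, nlat - 1)
        ilon = np.clip(((lons % 360.0) / grid_deg).astype(int), 0, nlon - 1)
        np.add.at(sums, (ilat, ilon), values)
        np.add.at(counts, (ilat, ilon), 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(counts > 0, sums / counts, np.nan)

Even this toy version hints at why the validation record grows: every comparison carries its own gridded fields, match-up criteria, and documentation of the averaging choices.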

Of course, these comments reflect my experience with a couple of large projects. Other communities may have different mental models and different vocabularies for describing them.

As for the source code question, the governmental and project context may govern the response. In some projects, particularly at NOAA, the data reduction software may be written by government contractors, and it may be difficult (not to mention expensive) to get the contractor to agree to formally release the software for more public use. In other cases (NASA EOS being one), the data production software may have been required to be made available as part of the agreement by the producers that create the data and the documentation. Source code is difficult enough (some of the code bases are substantial fractions of a million lines of code, or even more); we'd also have to get the procedures and production histories that may (or may not) have been part of the contractual agreements. What used to be NPOESS may or may not have included these items in the contract, for example. As to higher-level products versus lower-level code, we'd have to see whether the contractor regarded the code as proprietary. I do recall having a great deal of difficulty getting the source code out of the ESDIS contractor to deal with the geolocation algorithms that used DEMs.

Anyway, hope these comments are useful in understanding the context. It isn't easy to provide all of the context when you're trying to engage in an information rescue mission, as John Moses is right now.

Bruce B.

On Wed, Mar 2, 2011 at 6:59 PM, Siri Jodha Khalsa <sjsk at nsidc.org> wrote:
> Rama and John,
>
> Good work.  Some questions and comments:
> 1. I didn't see a mention of the data and metadata itself.  This archive
> package would be separate from the data?  The division between documentation
> and metadata is blurred, of course, e.g. production history is included in
> the table, but is in some cases saved as metadata. Some QA is metadata, some
> is in the product files. The spec needs to address how data and metadata are
> preserved.
> 2. Normalization of the entries is needed - there's lots of
> duplication/overlap, and some similar or identical items appear under
> different categories.
> 3. Some descriptions refer to "project artifacts" - the reference should be
> to specific items in the table. Likewise, what are "developer's design
> documents"?
> 4. There needs to be a separate column for "alternate" sources for an item,
> rather than using the rationale column, which should eventually be populated
> for each item.
> 5.  Aren't items that refer to accessing the data, like NOAA item 36,
> irrelevant? The physical location of the data may change.
> 6. Error analysis under documentation belongs under Quality (and I think QA
> is too narrow of a term.  Should be just quality or quality measures).
> Validation belongs under Quality also, don't you think?
> 7. Why source code only for higher-level products?  Why not all source code
> involved in product generation?
>
> Cheers,
> SiriJodha
>
> --
> Siri-Jodha Singh KHALSA, Ph.D., SMIEEE
> National Snow and Ice Data Center
> University of Colorado
> Boulder, CO 80309-0449 Phone: 1-303-492-1445 GV: 1-303-736-9976
> http://cires.colorado.edu/~khalsa
>
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>

