[Esip-preserve] Identifiers and Automated Inventory Audits - Restatement of Problem

Alice Barkstrom alicebarkstrom at verizon.net
Thu Nov 26 09:12:01 EST 2009

After mulling over the responses, I think I was not precise enough in
my statement about auditing.  The responses appear to deal with
auditing in the sense of assuring that a given set of files has not been
altered.  The audit I was interested in has two elements:

1.  In some cases, I have an expectation regarding a pattern of file
instances.  For example, I have a data set based on daily in situ
observations.  That means I expect to have one file per day.  The
question I want to ask is whether there are missing elements in
the collection - and if there are, which are missing.  A more involved
version of this question arises when I want to compare the completeness
of file instances for two versions of the same file collection.  As a
concrete example of this kind of query, I want to know which file
instances missing from the original version were filled in by a more
complete second version.

2.  Another example arises when I consider operator errors in
entering data.  As a concrete example, CERES production used
templates for file instances going into monthly averaging data
products that were developed by cutting and pasting text from
a previous example.  Because the number of days in a month
varies from month to month, the operators could enter the wrong
template (say, using January's template for February without
changing the number of days in the month).  I'd like an
"audit" to detect these operator errors in the collection.  A more
complex version of this kind of problem arises when migrating
files from an existing collection on obsolete technology to a new
one, where you'd like to have more automation in the file
collation sequence.
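
To make the first kind of audit concrete, here is a minimal sketch in
Python.  It assumes the inventory has already been reduced to a list of
the dates for which a file instance exists; the function names and the
sample dates are hypothetical and are not taken from any of the tools
mentioned in this thread.

```python
from datetime import date, timedelta

def expected_days(start, end):
    """Yield every date from start to end inclusive (the expected pattern)."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)

def missing_days(present, start, end):
    """Return, in date order, the expected dates that have no file instance."""
    have = set(present)
    return [d for d in expected_days(start, end) if d not in have]

def filled_in_by(v1_present, v2_present, start, end):
    """Return the dates missing from version 1 that version 2 supplies."""
    v2 = set(v2_present)
    return [d for d in missing_days(v1_present, start, end) if d in v2]

# Hypothetical inventories: the dates for which each version holds a file.
v1 = [date(2009, 11, 1), date(2009, 11, 2), date(2009, 11, 5)]
v2 = [date(2009, 11, 1), date(2009, 11, 2), date(2009, 11, 3),
      date(2009, 11, 5)]
start, end = date(2009, 11, 1), date(2009, 11, 5)

print(missing_days(v1, start, end))      # Nov 3 and Nov 4 are missing
print(filled_in_by(v1, v2, start, end))  # version 2 fills in Nov 3 only
```

A real audit would first have to map file identifiers to dates, which is
where the identifier scheme matters.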

In what I had imprecisely stated, the collation sequence serves
as the "theoretical" pattern or expectation.  There are several ways
to describe this situation.  The one Curt identifies would use provenance
tracking logic to do the tracking.  In the production algorithms I'm
familiar with, the pattern is embodied in the DAG that governs the
data production.  I strongly suspect that in Reagan Moore's
iRODS approach, the theoretical pattern is embodied in the Policies
that the archive uses to govern its file operations.
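
As an illustration of an expectation acting as the "theoretical"
pattern, the monthly template case could be checked roughly as follows.
This is only a sketch: it assumes each template can be reduced to the
list of day numbers the operator entered for a given year and month,
which is a hypothetical representation and not the actual CERES
template format.

```python
import calendar

def template_errors(templates):
    """Compare entered day lists against the calendar.

    templates maps (year, month) to the list of day numbers entered.
    Returns (year, month, expected_count, entered_count) for each
    template whose days are not exactly 1..N for that month.
    """
    errors = []
    for (year, month), days in sorted(templates.items()):
        # monthrange returns (weekday of day 1, number of days in month).
        expected = calendar.monthrange(year, month)[1]
        if sorted(days) != list(range(1, expected + 1)):
            errors.append((year, month, expected, len(days)))
    return errors

# Hypothetical templates: February was pasted from January unchanged.
templates = {
    (2009, 1): list(range(1, 32)),
    (2009, 2): list(range(1, 32)),
    (2009, 3): list(range(1, 32)),
}

print(template_errors(templates))  # reports February 2009: 28 expected, 31 entered
```

Here the calendar supplies the expected pattern.  In a production
setting the expectation would come from the DAG or from the archive's
policies instead.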

So - does the software identified in the previous responses deal
with these situations or do we still have work to do?

Bruce B.

At 02:47 PM 11/23/2009, Curt Tilmes wrote:
>Alice Barkstrom wrote:
> > Useful information.  The key will be to identify who (person or
> > machine) is the responsible identity to note when something
> > is missing and filling in the annotation.
> > At 11:52 AM 11/23/2009, Greg Janée wrote:
> >> Now, these tools don't address the problem of *why* one granule out of
> >> a sequence is missing, but assuming that somebody notices this at
> >> creation time and investigates it and decides that it's OK that the
> >> granule is missing (maybe there were sunspots that day), then these
> >> tools can verify the status quo from that point forward.
>Peter Fox and Deborah McGuinness have previously discussed some work
>similar to this on the Virtual Solar-Terrestrial Observatory [1] using
>Proof Markup Language (PML) [2] to represent the provenance of data
>ingest and processing, including encoding reasons for missing or bad
>data, and inferring reasons for later missing data (because the
>earlier data was missing, for example).
>[1] http://www.vsto.org/
>[2] http://inference-web.org/
