[Esip-preserve] Identifiers and Automated Inventory Audits

Mon Nov 23 08:52:22 EST 2009

One strand of discussion we haven't heard is whether identifiers
can help with automating inventory audits of data centers.
By this I mean that periodically or in instances of concern
(at migration, after an IT security "incident") it would seem
appropriate for an archive to check to see if there have been
changes in the file inventory.  The question I'd like to raise
is whether the identifier convention adopted by the archive
can improve the quality of the audit or reduce its cost.

If I recall correctly, Reagan Moore and colleagues had devised
a way of parsing the identifiers in some NARA records to
identify miscataloged files.  Because the identifiers were
complex, the parsing was complex as well.

Here are some thoughts:

1.  Having identifiers with a good collating sequence would
be helpful.  OIDs look pretty good for this purpose, and so do
DOIs (maybe).  URLs are more unpleasant: there's the
ambiguity of case sensitivity, and the variable string length
of URLs may make it difficult to place files in a sequence.
[As an example of a pleasant collating sequence, consider
dates written as YYYY-MM-DD-HH, where YYYY is a four digit
year number, MM is a two digit month number, DD is a two digit
day number, and HH is a two digit hour number, with 0 allowed.
An unpleasant one might have YYYY-Month_Name-Day_Number-
Hour_Number, noting that APR precedes MAR, although MAY
does follow MAR.]

2.  If we had a good collating sequence, could we use it to
discover gaps in the inventory?  If we had a collection of daily
files, a collating sequence such as
2007-01-01
2007-01-02
2007-01-04
2007-01-05
makes it a lot easier to discover that 2007-01-03 is missing.

3.  We need to think ahead to situations in which Unicode
file names are more widely used.  Certainly China and India
are more important players in the global scientific community
than they used to be.  While we're used to U.S. dominance,
that wasn't always the situation (e.g. German scientific
dominance before the First World War and the use of French
in diplomacy during the same era).  U.S. dominance probably
won't be guaranteed in the long-term future.  Identifiers based
on sequences of digits are probably going to be more stable
than ones that include alphabetic characters (even with ISO
standardization of character sets).  Again, OIDs and DOIs
look more stable against this kind of change than do URL-based
identifiers.  Unicode may also present collation unpleasantries,
in which an archive might first have to order the character sets
used in file names and then collate the identifiers within each
character set.
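
To make points 1 and 2 a bit more concrete, here is a minimal
Python sketch.  It assumes the identifiers are simply daily dates
written as YYYY-MM-DD; the function name find_missing_days and the
sample lists are made up for illustration and are not taken from
any archive's actual convention.

```python
from datetime import date, timedelta


def find_missing_days(identifiers):
    """Return the YYYY-MM-DD dates absent between the earliest and
    latest identifier (hypothetical helper, daily files assumed)."""
    present = {date.fromisoformat(s) for s in identifiers}
    if not present:
        return []
    day, last = min(present), max(present)
    missing = []
    while day <= last:
        if day not in present:
            missing.append(day.isoformat())
        day += timedelta(days=1)
    return missing


# Point 2: the daily inventory from the example above.
inventory = ["2007-01-01", "2007-01-02", "2007-01-04", "2007-01-05"]
missing = find_missing_days(inventory)
print(missing)  # ['2007-01-03']

# Point 1: a plain string sort is chronological for numeric months,
# but not for month names (APR sorts ahead of MAR).
numeric = ["2007-03-01", "2007-04-01", "2007-05-01"]
named = ["2007-MAR-01", "2007-APR-01", "2007-MAY-01"]
numeric_sorted = sorted(numeric)
named_sorted = sorted(named)
print(numeric_sorted)
print(named_sorted)
```

The gap check only works this cheaply because the identifier can
be turned into a position in a known sequence; with the month-name
form, the audit would first need a lookup table to recover the
intended order.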

Any other thoughts?

Bruce B.