[Esip-preserve] Identifiers and Automated Inventory Audits

Greg Janée gjanee at icess.ucsb.edu
Mon Nov 23 11:52:37 EST 2009

Tracking and verifying integrity is a general problem that applies to  
all kinds of archived content, and equally general tools have been  
developed to address this problem.  For example, there's ACE (Auditing  
Control Environment) <https://wiki.umiacs.umd.edu/adapt/index.php/Ace:Main>,
which will track checksums on arbitrary sets of files, and  
recursively checksum the checksums in a cryptographically secure way.   
It will also report if a file has gone missing anywhere.  Or there's  
Tripwire (commercial and open source versions) that will, in addition  
to the preceding, detect files that have been added to directories and  
any changes to file permissions/ownership/etc.  These tools don't rely  
on any particular identifier/naming scheme, nor do I think they would  
benefit from a standardized one.
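
To make the idea concrete, here is a minimal sketch in Python of the kind of audit such tools perform. It is not how ACE or Tripwire is actually implemented: the function names (build_manifest, manifest_digest, audit) and the choice of SHA-256 are assumptions made for illustration, and the "checksum of checksums" here is a single flat digest rather than ACE's recursive scheme. Note that nothing in it depends on how the files are named.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def manifest_digest(manifest):
    """Checksum the checksums: one digest over the sorted manifest."""
    h = hashlib.sha256()
    for rel in sorted(manifest):
        h.update(rel.encode("utf-8", "surrogateescape"))
        h.update(b"\0")
        h.update(manifest[rel].encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()


def audit(baseline, current):
    """Compare two manifests; report missing, added, and changed files."""
    missing = sorted(set(baseline) - set(current))
    added = sorted(set(current) - set(baseline))
    changed = sorted(p for p in set(baseline) & set(current)
                     if baseline[p] != current[p])
    return {"missing": missing, "added": added, "changed": changed}
```

In use, one would save the output of build_manifest at ingest time and later call audit(saved, build_manifest(root)); comparing manifest_digest values gives a quick yes/no answer before looking at the details.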

Now, these tools don't address the problem of *why* one granule out of  
a sequence is missing, but assuming that somebody notices this at  
creation time and investigates it and decides that it's OK that the  
granule is missing (maybe there were sunspots that day), then these  
tools can verify the status quo from that point forward.


Bruce Barkstrom wrote:
> One strain of discussion we haven't heard is whether identifiers
> can help with automating inventory audits of data centers.
> By this I mean that periodically or in instances of concern
> (at migration, after an IT security "incident") it would seem
> appropriate for an archive to check to see if there have been
> changes in the file inventory.  The question I'd like to raise
> is whether the identifier convention adopted by the archive
> can improve the quality of the audit or reduce its cost.
> If I recall correctly, Reagan Moore and colleagues had devised
> a way of parsing the identifiers in some NARA records to
> identify miscataloged files.  Because the identifiers were
> complex, the parsing was complex as well.
> Here are some thoughts:
> 1.  Having identifiers with a good collating sequence would
> be helpful.  OID's look pretty good for this purpose, so do
> DOI's (maybe).  URL's are more unpleasant - there's the
> ambiguity of case sensitivity, as well as the fact that the
> string length variability of URL's may make it difficult to place
> files in a sequence.  [As an example of a pleasant collating
> sequence, consider dates with YYYY-MM-DD-HH, where
> YYYY is a four digit year number, MM is a two digit month
> number, DD is a two digit day number, and HH is a two digit
> hour number, with 0 allowed.
> An unpleasant one might have YYYY-Month_Name-Day_Number-
> Hour_Number, noting that APR precedes MAR, although MAY
> does follow MAR.]
> 2.  If we got a good collating sequence, could we use it to
> discover gaps in the inventory?  If we had a collection of daily
> files, a collating sequence such as
> 2007-01-01
> 2007-01-02
> 2007-01-04
> 2007-01-05
> makes it a lot easier to discover that 2007-01-03 is missing.
> 3.  We need to think ahead to situations in which unicode
> file names are more widely used.  Certainly China and India
> are more important players in the global scientific community
> than they used to be.  While we're used to U.S. dominance,
> that wasn't always the situation (e.g. German scientific
> dominance before the first World War and the use of French
> in diplomacy during the same era).  U.S. dominance probably
> won't be guaranteed in the long-term future.  Identifiers based
> on sequences of digits are probably going to be more stable
> than ones that include alphabetic characters (even with ISO
> standardization of character sets).  Again, OID's and DOIs
> look more stable against this kind of change than do URL-based
> identifiers.  Again, unicode may present collation unpleasantries,
> in which an archive might have to order the character set of file
> names based on collating that identifier schema and then collate
> the identifiers within each character set.
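
Items 1 and 2 in the quoted message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers are daily dates in YYYY-MM-DD form, and find_gaps and its accepted_missing parameter are hypothetical names, the latter standing in for a list of gaps that somebody has already investigated and decided are OK.

```python
from datetime import date, timedelta


def find_gaps(identifiers, accepted_missing=()):
    """Return the ISO dates absent from a daily series.

    identifiers: strings in YYYY-MM-DD form (an assumed convention).
    accepted_missing: dates already investigated and judged acceptable.
    """
    present = {date.fromisoformat(s) for s in identifiers}
    accepted = {date.fromisoformat(s) for s in accepted_missing}
    if not present:
        return []
    gaps = []
    day = min(present)
    last = max(present)
    while day <= last:
        if day not in present and day not in accepted:
            gaps.append(day.isoformat())
        day += timedelta(days=1)
    return gaps


inventory = ["2007-01-01", "2007-01-02", "2007-01-04", "2007-01-05"]
print(find_gaps(inventory))                    # ['2007-01-03']
print(find_gaps(inventory, ["2007-01-03"]))    # []

# Zero-padded numeric dates sort chronologically as plain strings ...
iso = ["2007-04-01", "2007-03-01", "2007-05-01"]
print(sorted(iso))     # ['2007-03-01', '2007-04-01', '2007-05-01']

# ... whereas month abbreviations do not: APR sorts before MAR.
named = ["2007-MAR-01", "2007-APR-01", "2007-MAY-01"]
print(sorted(named))   # ['2007-APR-01', '2007-MAR-01', '2007-MAY-01']
```

The gap check only works because the identifier encodes a position in a known, regular sequence; the sort comparison shows why a zero-padded numeric form makes that position recoverable without a lookup table.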
