[Esip-preserve] Replication, Recall, and Consistency

Sat Apr 30 13:49:31 EDT 2011

I've been mulling over some notions that come up out of
Gray and Reuter's text on Transaction Processessing and
some research afterwards.  Here's the problem:

Given that a single archive (or inventory system) is less
robust to failure or destruction than multiple archives
(or systems), we usually try to approximately replicate
collections over multiple sites, say N > 1.

Now, the ACID requirements on transactions may cause
delays in making all N sites consistent.  Thus, in the
presence of failures (networks partition, disks fail, sites
do software upgrades, etc.), some fraction of the sites
will be either unavailable or inconsistent.

The question is - for data searching (or, in a similar but
not identical problem of obtaining data), how much unavailability
or inconsistency can users stand?  We could make this a bit
more quantitative by saying that at any time, we expect N' < N
sites to be unavailable and of M possible answers in a completely
consistent (and improbably dictatorial centralized system),
only M' < M would be answered consistently.  If so, for
what values of N'/N and M'/M would users regard the system
as not working?

It may also be important to note that while we have lots
of anecdotal comments to the effect that users really spend
lots of time searching for things, I'm not aware of an empirically
based set of numbers that indicates what fraction of their time
data users spend in this activity.  It probably isn't 100% - or
even close to that.  After all, people would seem to need to read
and understand papers they've searched for; they might have to
program or use tools to understand data they've downloaded;
they might even have to spend time writing proposals or reviewing
them.  On the latter issue, the new issue of Scientific American
has a one-page article by the Scientific American Board of Editors,
2011: Dr. No Money: Scientists spend too much time raising cash
instead of doing experiments, Scientific American, May, 2011,
p. 20 that says a U.S. government study "found that university
faculty members spend about 40 percent of their research time
navigating the beauracratic labyrinth, and the situation is no better
in Europe."

To be succinct, what is the requirement for recall and consistency
that we can base on empirical evidence of the way people use
their time?

Bruce B.