[Esip-preserve] Identifiers
Bruce Barkstrom
brbarkstrom at gmail.com
Mon Feb 20 13:52:19 EST 2012
The attached file contains thoughts on two communities to
whom the differences we've been discussing are likely to
matter: archive managers and data producers. As a data
producer, I have found it matters a great deal whether the
discussion is about a relational database (as an object that
might contain a dataset), a file (as another object that might
contain one), or a collection of files with a very long history
(say 20 years) of structure (perhaps yet another kind of data set).
Somewhat similar concerns arise for
data managers, a role I held for five years.
Bruce B.
On Sat, Feb 18, 2012 at 3:10 AM, Greg Janée <gjanee at eri.ucsb.edu> wrote:
> Mark A. Parsons wrote:
>> I don't think there is a falsifiable definition of data set. Or rather all definitions are false. It's very situational.
>
> Agreed. To put it another way, I think this attempt to define "dataset" is doomed because a dataset is a cognitive construct, and cognitive constructs do not have exact definitions and hard boundaries, but look more like overlapping categories that are characterized by exemplars and degrees of membership.
>
> Is there a *functional* reason why we need to define terms like "dataset" and "granule"? I guess a necessary (but not sufficient) condition for me to be convinced by any definitions for "dataset" and "granule" is that there is some kind of functional difference between them; some different functional affordances.
>
> From the old Alexandria days I recall a passionate debate over what constituted a "title". (That may sound quaint now, but I assure you, a librarian armed with an AACR2 reference is a formidable adversary.) What cut through that particular Gordian knot was looking at the question purely functionally: we only care about titles to the extent that we do something with them. And the answer at that time was, all we do with titles is display them in search result lists. Ergo, a "title" is that which you want to see displayed as a search result, no more, no less. Corollary: a title should be about one line wide when displayed in a typical font size.
>
> Regarding data and citation, from a functional perspective I would say that if a particular entity has an identifier, and can be independently referenced (or is independently actionable), and if the entity's provider is committed to maintaining that entity and its identifier and its independent referencability, then the entity is "citable". Notice that this definition is independent of both the size of the entity and the terminology the provider uses in referring to it.
>
> -Greg
>
-------------- next part --------------
Further Comments on Identifiers
The previous discussion appears to assume that the only community interested in
what we mean by "data set" is the community of data users, and that these users
are like customers for a library of texts. However, there are at least two other
communities who have to work with collections of data: archive or repository
managers and data producers. These two communities have somewhat different
requirements for identifiers than data users do.
Archive Inventory Identifier Issues
One of the primary concerns of archive managers is inventory planning and control.
Because managers of physical items are usually concerned with the value of their
inventory, this concern also feeds into data archive accounting and legal issues.
As a concrete example of how identifiers tie to inventory control, I went to Lowe's
over the weekend to pick up some screws for a gardening bench I'd constructed.
The screws are not unique items. The plastic boxes holding the screws had bar-code
labels on the sides, as well as English and Spanish words identifying the screws
they contained. The labels distinguished between Phillips-head screws and
star-drive screws, as well as the size (2-1/2" versus 1-1/4") and type
(external decking versus brass). The manager at Lowe's could count boxes of each
type to determine inventory levels. Then he could decide whether or not to
reorder a shipment of these boxes. The accountants could have a ledger account
that might use the bar-code identifier or an inventory control identifier to track
when the store bought or sold these items.
On the other hand, accountants can establish ledger accounts for unique items in
inventory. This would more likely apply to larger items, such as RVs or cars,
which do have Vehicle Identification Numbers (VINs).
When our attention shifts to data objects, there are two kinds of data containers
that are likely to be the smallest discrete objects of interest: relational
databases and files.
Because a database holds unique content only if transactions are not allowed to
change its contents, a unique identifier can be applied to it only if the
identifier refers to a state of the database. To track state when transactions
are allowed, the inventory control mechanism has to keep the full history of
transactions. A weaker identifier system takes snapshots of the database state,
but this does not maintain inventory control of the contents at full temporal
resolution.
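As a minimal sketch of what a state-based identifier could look like (the table
name and contents below are hypothetical), a digest computed over an ordered dump
of the rows names one state of the database; any committed transaction produces a
new digest:

    # A minimal sketch: derive an identifier for a *state* of a database
    # by hashing an ordered dump of its contents. The table "granules"
    # is hypothetical. Any committed transaction yields a new digest.
    import hashlib
    import sqlite3

    def database_state_digest(conn: sqlite3.Connection) -> str:
        h = hashlib.sha256()
        # Order rows deterministically so the same state always hashes the same.
        for row in conn.execute("SELECT * FROM granules ORDER BY id"):
            h.update(repr(row).encode("utf-8"))
        return h.hexdigest()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE granules (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO granules (name) VALUES ('sample_granule')")
    conn.commit()
    print(database_state_digest(conn))  # identifies this state only

Under such a scheme the identifier names a state, not the database; the inventory
system must record a new digest (or retain the transaction log) after every change.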
Static files may be unique items. However, archives routinely copy files for
internal use. Such copies include staging files from the archive storage to disks
devoted to distributing copies. Assuming that the full identifier for a file is
the path name, the copy in internal storage will have a different path than the
copy made for distribution. Therefore, these two files could be regarded as having
different identifiers, even if they are replicas of each other that would have
identical cryptographic digests.
After the archive distributes the disk copy to another data archive, there would be
a third copy. If the digest of this copy, computed with the same algorithm, agrees
with the digest of the first, there is a very strong case that the two files are
identical. In other words, the process of making a digital copy does not destroy
the authenticity of the bit sequence it contains. Aside from the difference in
location and the storage system configuration, there appears to be no practical
distinction between the copies.
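A minimal sketch of that comparison, with hypothetical paths standing in for the
archive and distribution copies:

    # Two copies of a file have different path-name identifiers, but
    # identical SHA-256 digests show they are bit-for-bit replicas.
    # The paths here are hypothetical placeholders.
    import hashlib

    def file_digest(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    archive_copy = "/archive/storage/granule_0001.dat"        # hypothetical
    distribution_copy = "/staging/outgoing/granule_0001.dat"  # hypothetical
    if file_digest(archive_copy) == file_digest(distribution_copy):
        print("bit-identical replicas despite different path identifiers")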
The identity of the copies raises at least five questions with practical
implications:
1. Should an archive's inventory control system track copies of files within its
systems? This could be a useful addition if there are performance differences
between the storage systems of the different copies. For example, a copy
maintained in a dark archive is going to be more difficult to bring to light than a
copy maintained on disk.
2. If the transfer of the file from one archive to another (or from the archive to
a file user) has been made with suitable legal arrangements, why should one site
receive preference as a source for other users? A practical context for this
question arises where several countries arrange to exchange data files. After the
copies have been verified as having the same cryptographic digests, a European
copy of a file that originated in the U.S. would very likely be easier for a
European data user to obtain than a copy from a U.S. site. This approach is
widely used by sites that mirror each other and allow software downloads, for
example.
3. What is the value of creating a unique identifier (say a UUID) for each identical copy?
4. If the data file at the original site were destroyed and that file were
replaced by a copy of a copy from another site, has the authenticity of the new
copy at the original site been destroyed? [This might suggest that an
authenticating agent should change the location of the preferred site to the one
from which the original archive recovered its copy.]
5. In the previous situation, should the original archive change the inventory
identifier of the file if it has to replace the destroyed copy with one from
another site? The question becomes more generic when phrased as: should an
inventory control system allow files to be deleted and replaced while keeping
their identifiers? As a practical example, in the LaRC DAAC there was an incident
in which a router handling file transfers from GSFC to LaRC began injecting
erroneous bits into the transferred files. Should the LaRC DAAC continue its
policy of replacing the erroneous files with correctly transmitted ones while
keeping the same file identifiers in its inventory control system?
From the standpoint of the inventory control system, these kinds of questions
translate into configuration maintenance policies.
Transformational Migration and File Identifiers
Except for ASCII character formatting and TeX, most objects in information
technology are depressingly transient. Examples abound. Since digital computers
appeared, instruction sets have changed repeatedly; the transition from 16-bit
computers to 32-bit ones to 64-bit ones is perhaps the easiest example of the
phenomenon. Microsoft Word has gone through a rather large number of versions.
Storage media have gone from iron core memories through large hard disks, floppy
disks, CDs, and flash drives to DVDs. Computer languages and operating systems
have gone through similar changes. At present, mobile devices, network computing,
multi-core chips, and power requirements are creating evolutionary pressures on
data systems.
While storage media may be reasonably stable over a decade, the storage devices
wear out, become obsolete, and need replacing at a time interval of about three to
five years. Archives and repositories must migrate data from old hardware to new
just to be able to read and distribute data. These changes also interact with
changes in software such that old data files become obsolete on about the same time
scale.
An example of this kind of change is the impact of character-format
internationalization from 8-bit ASCII characters to 16-bit Unicode encodings.
If a file needed a transformational migration from ASCII to Unicode, an
English-language document could still be presented in a suitable typographic
manner (indeed, in the same typeface and size). However, the bit pattern of the
digital file would be completely different in Unicode than it was in ASCII. Thus,
the cryptographic digests of two files with essentially the same characters would
not be identical.
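A short illustration of the point (the sample text is arbitrary): the same
characters encoded as ASCII bytes and as UTF-16 bytes yield different digests:

    # The same characters, encoded as ASCII bytes and as UTF-16 bytes,
    # are different bit sequences and therefore have different digests.
    import hashlib

    text = "Clouds and the Earth's Radiant Energy System"
    ascii_bytes = text.encode("ascii")
    utf16_bytes = text.encode("utf-16")

    print(hashlib.sha256(ascii_bytes).hexdigest())
    print(hashlib.sha256(utf16_bytes).hexdigest())
    # The two digests differ even though a reader sees identical text.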
Of course, this example suggests a further set of questions:
1. If an archive undertakes a transformational migration, should the transformed
files be allowed to have the same identifier for inventory purposes?
2. If the identifiers are allowed to remain the same, how should the archive's
inventory control system ensure consistency between the transformed copies and
undeleted copies of the originals?
3. If the archive changes the identifiers to distinguish the transformed copies,
what mechanisms does it use to direct users to the new copies?
4. What policies do federated archives need to institute to maintain consistent
inventory control under transformational migration? For example, could a secondary
archive specialize in particular formats?
Production Control Issues
There are three basic data production paradigms:
1. Exploratory data production, usually conducted by individuals or small teams of
investigators. In this paradigm, the investigation team selects a set of data to
explore, sometimes out of curiosity and sometimes to answer a specific scientific
question. This paradigm is also useful to larger teams doing quality control to
discover an error or anomaly, identify its cause and extent, and test proposed
fixes. While there may be some initial notions of the appropriate algorithms to
apply, this paradigm allows changes in the selected input data, selected
algorithms, and final output. At least in the early stages of the investigation,
there may be a diversity of production provenance graphs that describe the detailed
data production.
2. Operational data production, usually conducted by large teams for a particular
purpose, such as weather forecasting or disaster warnings. The pressure of short
latency times and large data throughput requirements usually limits the amount of
anomaly or error correction work the producer can do. Such systems usually have
very heavy configuration control mechanisms. The data producer inserts changes
into the production process on an irregular basis. The producer does not have time
to reprocess previously processed data to obtain a homogeneous uncertainty across
the full data record.
3. Climate data production, which may be conducted by large or small teams
concentrating on a particular kind of data. Production of solar constant data or
of some of the records in the Global Historical Climatology Network collection at
NCDC is done by relatively small teams. Production of other records, such as the
MODIS vegetation products, MODIS Sea Surface Temperatures, and the data from the
investigation of Clouds and the Earth's Radiant Energy System (CERES), may involve
large teams whose funding and operations extend over several decades. In this
paradigm, the data producer often bunches production changes into large,
annual updates and reprocesses data so that the data record has a homogeneous
uncertainty from the beginning of the observations to the current ones. To avoid
glitches in this homogeneity, the data production team may have to insist that
input data be produced without new versions. For example, CERES requires the
use of temperatures and humidities from operational weather forecasts. There were
three possible sources of such forecasts: the forecasts from the European Centre
for Medium-Range Weather Forecasts (ECMWF), NOAA's NCEP data, and data from the
GSFC Data Assimilation Office (DAO). After an initial period of experimentation,
the CERES team was unable to obtain an agreement from ECMWF or from NOAA to provide
a homogeneous set of data in which the production configuration was held constant.
The team was able to obtain such an agreement from the DAO, and so the later
versions of the CERES data use that source of temperatures and humidities.
The first of these paradigms may use such tools as graphical workflow engines to
produce interactively controlled production scenarios. Operational and climate
data production are generally much more regular in their workflows. The teams
often rely on scripts that act as templates for connecting particular kinds of data
files with particular Product Generation Executables (PGEs) or 'Jobs.' Because
the scripts connect the files with the jobs, the data producer teams often find it
easy to couple a file identifier schema with the production scripts. This approach
requires the identifiers to have a highly regular syntax, since the team is
attempting to ensure that data of a particular kind from a particular set of
instruments and a particular time interval couples with a particular version of a
PGE to produce a particular instance of a particular kind of output file.
For example, the CERES team has data files that contain one hour of calibrated and
geolocated radiances observed in three spectral bands from one of six instruments
on four satellites. To identify cloud properties for one of these files, the team
needs twenty channels of MODIS data from the twelve five-minute intervals that span
the selected hour (on a particular date) and from the MODIS instrument on the same
satellite. The data production team is accustomed to creating the scripts for a
large number of jobs using text-editing tools. Modifying this approach in favor
of alternatives would be an unacceptable distraction from the team's work, which
emphasizes ensuring homogeneity in the data record by detecting inconsistencies
between different instruments and different production algorithms.
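As a sketch of why a regular identifier syntax matters (the naming pattern below
is hypothetical, not the actual CERES convention), a production script can
mechanically extract the product, instrument, satellite, and time fields it needs
to couple a file to the right PGE:

    # A sketch of parsing a regular file-identifier syntax so a script can
    # couple files to PGEs. The pattern PRODUCT_INSTRUMENT-SAT_YYYYMMDDHH
    # is hypothetical, not the actual CERES naming convention.
    import re

    NAME_RE = re.compile(
        r"(?P<product>[A-Z]+)_"          # kind of data product
        r"(?P<instrument>FM\d)-"         # instrument (e.g., FM1)
        r"(?P<satellite>[A-Za-z]+)_"     # satellite carrying it
        r"(?P<timestamp>\d{10})"         # year, month, day, hour
    )

    m = NAME_RE.match("BDS_FM1-Terra_2003070112")
    if m:
        fields = m.groupdict()
        # A production script can now select the matching PGE version and
        # the inputs that span the same hour on the same satellite.
        print(fields)

The same regularity that lets a script parse the name also lets a human reviewer
spot a mismatched instrument or hour at a glance.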
The total collection of files being handled for a climate data production scenario
may include thousands to millions of individual files. The amount of data is
large enough that handling it with relational databases is impractical.
The production scenarios depend on carefully designing highly structured collections
of files. As an example, the CERES team has eleven major subsystems
and about the same number of Data Product collections. For this team, a Data
Product refers to a kind of file that has common parameters and common time
intervals. For example, one kind of Data Product contains raw data over a time
period of one day, and another contains raw data over a time period of one hour.
Another kind of Data Product contains calibrated and
geolocated radiances (as well as viewing zenith, solar zenith, and relative
azimuth) in a single hour. A third kind of Data Product contains monthly and
regional averages of top of atmosphere fluxes of reflected sunlight and emitted
thermal energy. A Data Product contains Data Sets, where a Data Set uses data
from a specified data source. Some data sources are single instruments, such as
the Flight Model 1 instrument on Terra or the Flight Model 5 instrument on NPP.
Other Data Sets combine data from several instruments, such as the Flight Model 1
and Flight Model 2 instruments on Terra. Data Set Versions are subsets of Data
Sets for which the CERES team uses a single version of the production source code
and a consistent set of input parameters, appropriate to the instrument and input
data for a PGE, to produce a time series of files with uncertainties that are as
homogeneous over the record as possible.
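A minimal sketch of that hierarchy as a data structure, with illustrative field
names that are assumptions rather than the actual CERES schema:

    # An illustrative sketch of the hierarchy described above; the field
    # names are assumptions, not the actual CERES schema.
    from dataclasses import dataclass, field

    @dataclass
    class DataSetVersion:
        version_id: str       # single version of the production source code
        input_config: str     # consistent set of input parameters
        files: list = field(default_factory=list)

    @dataclass
    class DataSet:
        source: str           # e.g., one instrument or a combination
        versions: list = field(default_factory=list)

    @dataclass
    class DataProduct:
        kind: str             # common parameters
        time_interval: str    # common time interval, e.g., "1 hour"
        data_sets: list = field(default_factory=list)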
The CERES team has selected an identifier syntax for its files that allows both
automated and human consistency checking. The science team and the data production
team at the LaRC Atmospheric Sciences Data Center have a highly evolved procedure
intended to minimize errors.
In addition, the CERES team has developed extensive documentation to inform the
user community of the differences between one part of the CERES data collection and
another. This documentation is heavily coupled to the structure and naming
conventions used in production. Changing this approach would require a substantial
resource investment, since the CERES team has put several thousand person-years
into producing and validating this data collection.
As this example suggests, there are a number of practical policy issues connected
with identifiers for data production configuration management.
1. Because production of large data sets with complex but regular PGE templates is
unworkable without organizing the file collection into a systematic structure, the
data producers tend to create a strongly hierarchical structure of file
collections. The file identifier syntax usually reflects the structure of the file
collection hierarchy. How should other kinds of identifier schemas take the data
production team's collection structure and identifier syntax into account? Do we
need aliased identifiers or should the identifier schemas ignore those of the data
producer? Could the production team's identifiers be adopted as is?
2. Do users find successive disclosure of a file collection's structure to have a
lower learning curve for finding items than a query using more generic search
terms? [The generalization of this question leads to such topics as Web site
browsing, non-semantic searches, and user search efficiency, although they are
beyond the scope of this discussion.]
3. How should data citations be structured so that it is possible to do data
replications or verifications on collections with millions (or more) of data
values, some of which are relevant and some are not?