[Esip-preserve] Identifiers
Bruce Barkstrom
brbarkstrom at gmail.com
Mon Feb 20 13:52:19 EST 2012
The attached file contains thoughts on two communities to
whom the differences we've been discussing are likely to
matter: archive managers and data producers. As a data
producer, I have found it matters a great deal whether the
discussion is about a relational database (as an object that
might contain a dataset), a file (as another object that might
contain one), or a collection of files with a very long history
(say 20 years) of structure (perhaps yet another kind of data set).
Somewhat similar concerns arise for
data managers, a role I held for five years.
Bruce B.
On Sat, Feb 18, 2012 at 3:10 AM, Greg Janée <gjanee at eri.ucsb.edu> wrote:
> Mark A. Parsons wrote:
>> I don't think there is a falsifiable definition of data set. Or rather all definitions are false. It's very situational.
>
> Agreed. To put it another way, I think this attempt to define "dataset" is doomed because a dataset is a cognitive construct, and cognitive constructs do not have exact definitions and hard boundaries, but look more like overlapping categories that are characterized by exemplars and degrees of membership.
>
> Is there a *functional* reason why we need to define terms like "dataset" and "granule"? I guess a necessary (but not sufficient) condition for me to be convinced by any definitions for "dataset" and "granule" is that there is some kind of functional difference between them; some different functional affordances.
>
> From the old Alexandria days I recall a passionate debate over what constituted a "title". (That may sound quaint now, but I assure you, a librarian armed with an AACR2 reference is a formidable adversary.) What cut through that particular Gordian knot was looking at the question purely functionally: we only care about titles to the extent that we do something with them. And the answer at that time was, all we do with titles is display them in search result lists. Ergo, a "title" is that which you want to see displayed as a search result, no more, no less. Corollary: a title should be about one line wide when displayed in a typical font size.
>
> Regarding data and citation, from a functional perspective I would say that if a particular entity has an identifier, and can be independently referenced (or is independently actionable), and if the entity's provider is committed to maintaining that entity and its identifier and its independent referencability, then the entity is "citable". Notice that this definition is independent of both the size of the entity and the terminology the provider uses in referring to it.
>
> -Greg
>
-------------- next part --------------
Further Comments on Identifiers
The previous discussion appears to assume that the only community interested in
what we mean by "data set" is the community of data users, and that these users
are like customers for a library of texts. However, there are at least two other
communities who have to work with collections of data: archive or repository
managers and data producers. These two communities have somewhat different
requirements for identifiers than data users do.
Archive Inventory Identifier Issues
One of the primary concerns of archive managers is inventory planning and control.
Because managers of physical items are usually concerned with the value of their
inventory, this concern also feeds into data archive accounting and legal issues.
As a concrete example of how identifiers tie to inventory control, I went to Lowe's
over the weekend to pick up some screws for a gardening bench I'd constructed.
The screws are not unique items. The plastic boxes holding the screws had bar-code
labels on the sides, as well as English and Spanish words identifying the screws
they contained. The labels distinguished between Phillips-head screws and
star-drive screws, as well as the size (2-1/2" versus 1-1/4") and type
(external decking versus brass). The manager at Lowe's could count boxes of each
type to determine inventory levels. Then he could decide whether or not to
reorder a shipment of these boxes. The accountants could have a ledger account
that might use the bar-code identifier or an inventory control identifier to track
when the store bought or sold these items.
On the other hand, accountants can establish ledger accounts for unique items in
inventory. This would more likely apply to larger items, such as RVs or cars,
which do have Vehicle Identification Numbers (VINs).
When our attention shifts to data objects, there are two kinds of data containers
that are likely to be the smallest discrete objects of interest: relational
databases and files.
Because a database holds unique content only if transactions are not allowed to
change its contents, a unique identifier can be applied to it only if the
identifier refers to a state of the database. To track state when transactions
are allowed, the inventory control mechanism has to keep the full history of
transactions. A weaker identifier system takes snapshots of the database state,
but this does not maintain inventory control of the contents at full temporal
resolution.
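As a minimal sketch of what a state-based identifier could look like (the table
name and contents below are hypothetical), a digest computed over an ordered dump
of the rows names one state of the database; any committed transaction produces a
new digest:

    # A minimal sketch: derive an identifier for a *state* of a database
    # by hashing an ordered dump of its contents. The table "granules"
    # is hypothetical. Any committed transaction yields a new digest.
    import hashlib
    import sqlite3

    def database_state_digest(conn: sqlite3.Connection) -> str:
        h = hashlib.sha256()
        # Order rows deterministically so the same state always hashes the same.
        for row in conn.execute("SELECT * FROM granules ORDER BY id"):
            h.update(repr(row).encode("utf-8"))
        return h.hexdigest()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE granules (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO granules (name) VALUES ('sample_granule')")
    conn.commit()
    print(database_state_digest(conn))  # identifies this state only

Under such a scheme the identifier names a state, not the database; the inventory
system must record a new digest (or retain the transaction log) after every change.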
Static files may be unique items. However, archives routinely copy files for
internal use. Such copies include staging files from the archive storage to disks
devoted to distributing copies. Assuming that the full identifier for a file is
the path name, the copy in internal storage will have a different path than the
copy made for distribution. Therefore, these two files could be regarded as having
different identifiers, even if they are replicas of each other that would have
identical cryptographic digests.
After the archive distributes the disk copy to another data archive, there would be
a third copy. If the digest of this copy, computed with the same algorithm, agrees
with the digest of the first, there is a very strong case that the two files are
identical. In other words, the process of making a digital copy does not destroy
the authenticity of the bit sequence it contains. Aside from the difference in
location and the storage system configuration, there appears to be no practical
distinction between the copies.
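A minimal sketch of that comparison, with hypothetical paths standing in for the
archive and distribution copies:

    # Two copies of a file have different path-name identifiers, but
    # identical SHA-256 digests show they are bit-for-bit replicas.
    # The paths here are hypothetical placeholders.
    import hashlib

    def file_digest(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    archive_copy = "/archive/storage/granule_0001.dat"        # hypothetical
    distribution_copy = "/staging/outgoing/granule_0001.dat"  # hypothetical
    if file_digest(archive_copy) == file_digest(distribution_copy):
        print("bit-identical replicas despite different path identifiers")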
The identity of the copies raises at least five questions with practical
implications:
1. Should an archive's inventory control system track copies of files within its
systems? This could be a useful addition if there are performance differences
between the storage systems of the different copies. For example, a copy
maintained in a dark archive is going to be more difficult to bring to light than a
copy maintained on disk.
2. If the transfer of the file from one archive to another (or from the archive to
a file user) has been made with suitable legal arrangements, why should one site
receive preference as a source for other users? A practical context for this
question arises where several countries arrange to exchange data files. After the
copies have been verified as having the same cryptographic digests, a European
copy of a file that originated in the U.S. would very likely be easier for a
European data user to obtain than a copy from a U.S. site. This approach is
widely used by sites that mirror each other and allow software downloads, for
example.
3. What is the value of creating a unique identifier (say a UUID) for each identical copy?
4. If the data file at the original site were destroyed and that file were
replaced by a copy of a copy from another site, has the authenticity of the new
copy at the original site been destroyed? [This might suggest that an
authenticating agent should change the location of the preferred site to the one
from which the original archive recovered its copy.]
5. In the previous situation, should the original archive change the inventory
identifier of the file if it has to replace the destroyed copy with one from
another site? The question becomes more generic when phrased as: should an
inventory control system allow files to be deleted and replaced while keeping
their identifiers? As a practical example, in the LaRC DAAC there was an incident
in which a router handling file transfers from GSFC to LaRC began injecting
erroneous bits into the transferred files. Should the LaRC DAAC continue its
policy of replacing the erroneous files with correctly transmitted ones while
keeping the same file identifiers in its inventory control system?
From the standpoint of the inventory control system, these kinds of questions
translate into configuration maintenance policies.
Transformational Migration and File Identifiers
Except for ASCII character formatting and TeX, most objects in information
technology are depressingly transient. Examples abound. Since digital computers
appeared, instruction sets have changed repeatedly; the transition from 16-bit
computers to 32-bit ones to 64-bit ones is perhaps the easiest example of the
phenomenon. Microsoft Word has gone through a rather large number of versions.
Storage media have gone from iron core memories through large hard disks, floppy
disks, CDs, and flash drives to DVDs. Computer languages and operating systems
have gone through similar changes. At present, mobile devices, network computing,
multi-core chips, and power requirements are creating evolutionary pressures on
data systems.
While storage media may be reasonably stable over a decade, the storage devices
wear out, become obsolete, and need replacing at a time interval of about three to
five years. Archives and repositories must migrate data from old hardware to new
just to be able to read and distribute data. These changes also interact with
changes in software such that old data files become obsolete on about the same time
scale.
An example of this kind of change is the impact of character-format
internationalization from 8-bit ASCII characters to 16-bit Unicode encodings.
If a file needed a transformational migration from ASCII to Unicode, an
English-language document could still be presented in a suitable typographic
manner (indeed, in the same typeface and size). However, the bit pattern of the
digital file would be completely different in Unicode than it was in ASCII. Thus,
the cryptographic digests of two files with essentially the same characters would
not be identical.
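A short illustration of the point (the sample text is arbitrary): the same
characters encoded as ASCII bytes and as UTF-16 bytes yield different digests:

    # The same characters, encoded as ASCII bytes and as UTF-16 bytes,
    # are different bit sequences and therefore have different digests.
    import hashlib

    text = "Clouds and the Earth's Radiant Energy System"
    ascii_bytes = text.encode("ascii")
    utf16_bytes = text.encode("utf-16")

    print(hashlib.sha256(ascii_bytes).hexdigest())
    print(hashlib.sha256(utf16_bytes).hexdigest())
    # The two digests differ even though a reader sees identical text.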
Of course, this example suggests a further set of questions:
1. If an archive undertakes a transformational migration, should the transformed
files be allowed to have the same identifier for inventory purposes?
2. If the identifiers are allowed to remain the same, how should the archive's
inventory control system ensure consistency between the transformed copies and
undeleted copies of the originals?
3. If the archive changes the identifiers to distinguish the transformed copies,
what mechanisms does it use to direct users to the new copies?
4. What policies do federated archives need to institute to maintain consistent
inventory control under transformational migration? For example, could a secondary
archive specialize in particular formats?
Production Control Issues
There are three basic data production paradigms:
1. Exploratory data production, usually conducted by individuals or small teams of
investigators. In this paradigm, the investigation team selects a set of data to
explore, sometimes out of curiosity and sometimes to answer a specific scientific
question. This paradigm is also useful to larger teams doing quality control to
discover an error or anomaly, identify its cause and extent, and test proposed
fixes. While there may be some initial notions of the appropriate algorithms to
apply, this paradigm allows changes in the selected input data, selected
algorithms, and final output. At least in the early stages of the investigation,
there may be a diversity of production provenance graphs that describe the detailed
data production.
2. Operational data production, usually conducted by large teams for a particular
purpose, such as weather forecasting or disaster warnings. The pressure of short
latency times and large data throughput requirements usually limits the amount of
anomaly or error correction work the producer can do. Such systems usually have
very heavy configuration control mechanisms. The data producer inserts changes
into the production process on an irregular basis. The producer does not have time
to reprocess previously processed data to obtain a homogeneous uncertainty across
the full data record.
3. Climate data production, which may be conducted by large or small teams
concentrating on a particular kind of data. Production of solar constant data or
of some of the records in the Global Historical Climatology Network collection at
NCDC is done by relatively small teams. Production of other records, such as the
MODIS vegetation products, MODIS Sea Surface Temperatures, and the data from the
investigation of Clouds and the Earth's Radiant Energy System (CERES), may involve
large teams whose funding and operations extend over several decades. In this
paradigm, the data producer often bunches production changes into large,
annual updates and reprocesses data so that the data record has a homogeneous
uncertainty from the beginning of the observations to the current ones. To avoid
glitches in this homogeneity, the data production team may have to insist that
input data be produced without new versions. For example, CERES requires the
use of temperatures and humidities from operational weather forecasts. There were
three possible sources of such forecasts: the forecasts from the European Centre
for Medium-Range Weather Forecasts (ECMWF), NOAA's NCEP data, and data from the
GSFC Data Assimilation Office (DAO). After an initial period of experimentation,
the CERES team was unable to obtain an agreement from ECMWF or from NOAA to provide
a homogeneous set of data in which the production configuration was held constant.
The team was able to obtain such an agreement from the DAO, and so the later
versions of the CERES data use that source of temperatures and humidities.
The first of these paradigms may use such tools as graphical workflow engines to
produce interactively controlled production scenarios. Operational and climate
data production are generally much more regular in their workflows. The teams
often rely on scripts that act as templates for connecting particular kinds of data
files with particular Product Generation Executables (PGEs) or 'Jobs.' Because
the scripts connect the files with the jobs, the data producer teams often find it
easy to couple a file identifier schema with the production scripts. This approach
requires the identifiers to have a highly regular syntax, since the team is
attempting to ensure that data of a particular kind from a particular set of
instruments and a particular time interval couples with a particular version of a
PGE to produce a particular instance of a particular kind of output file.
For example, the CERES team has data files that contain one hour of calibrated and
geolocated radiances observed in three spectral bands from one of six instruments
on four satellites. To identify cloud properties for one of these files, the team
needs twenty channels of MODIS data from the twelve five-minute intervals that span
the selected hour (on a particular date) and from the MODIS instrument on the same
satellite. The data production team is accustomed to creating the scripts for a
large number of jobs using text-editing tools. Modifying this approach in favor
of alternatives would be an unacceptable distraction from the team's work, which
emphasizes ensuring homogeneity in the data record by detecting inconsistencies
between different instruments and different production algorithms.
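As a sketch of why a regular identifier syntax matters (the naming pattern below
is hypothetical, not the actual CERES convention), a production script can
mechanically extract the product, instrument, satellite, and time fields it needs
to couple a file to the right PGE:

    # A sketch of parsing a regular file-identifier syntax so a script can
    # couple files to PGEs. The pattern PRODUCT_INSTRUMENT-SAT_YYYYMMDDHH
    # is hypothetical, not the actual CERES naming convention.
    import re

    NAME_RE = re.compile(
        r"(?P<product>[A-Z]+)_"          # kind of data product
        r"(?P<instrument>FM\d)-"         # instrument (e.g., FM1)
        r"(?P<satellite>[A-Za-z]+)_"     # satellite carrying it
        r"(?P<timestamp>\d{10})"         # year, month, day, hour
    )

    m = NAME_RE.match("BDS_FM1-Terra_2003070112")
    if m:
        fields = m.groupdict()
        # A production script can now select the matching PGE version and
        # the inputs that span the same hour on the same satellite.
        print(fields)

The same regularity that lets a script parse the name also lets a human reviewer
spot a mismatched instrument or hour at a glance.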
The total collection of files being handled for a climate data production scenario
may include thousands to millions of individual files. The amount of data is
large enough that handling it with relational databases is impractical.
The production scenarios depend on carefully designing highly structured collections
of files. As an example, the CERES team has eleven major subsystems
and about the same number of Data Product collections. For this team, a Data
Product refers to a kind of file that has common parameters and common time
intervals. For example, one kind of Data Product contains raw data over a time
period of one day, and another contains raw data over a time period of one hour.
Another kind of Data Product contains calibrated and
geolocated radiances (as well as viewing zenith, solar zenith, and relative
azimuth) in a single hour. A third kind of Data Product contains monthly and
regional averages of top of atmosphere fluxes of reflected sunlight and emitted
thermal energy. A Data Product contains Data Sets, where a Data Set uses data
from a specified data source. Some data sources are single instruments, such as
the Flight Model 1 instrument on Terra or the Flight Model 5 instrument on NPP.
Other Data Sets combine data from several instruments, such as the Flight Model 1
and Flight Model 2 instruments on Terra. Data Set Versions are subsets of Data
Sets for which the CERES team uses a single version of the production source code
and a consistent set of input parameters, appropriate to the instrument and input
data for a PGE, to produce a time series of files with uncertainties that are as
homogeneous over the record as possible.
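A minimal sketch of that hierarchy as a data structure, with illustrative field
names that are assumptions rather than the actual CERES schema:

    # An illustrative sketch of the hierarchy described above; the field
    # names are assumptions, not the actual CERES schema.
    from dataclasses import dataclass, field

    @dataclass
    class DataSetVersion:
        version_id: str       # single version of the production source code
        input_config: str     # consistent set of input parameters
        files: list = field(default_factory=list)

    @dataclass
    class DataSet:
        source: str           # e.g., one instrument or a combination
        versions: list = field(default_factory=list)

    @dataclass
    class DataProduct:
        kind: str             # common parameters
        time_interval: str    # common time interval, e.g., "1 hour"
        data_sets: list = field(default_factory=list)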
The CERES team has selected an identifier syntax for its files that allows both
automated and human consistency checking. The science team and the data production
team at the LaRC Atmospheric Sciences Data Center have a highly evolved procedure
intended to minimize errors.
In addition, the CERES team has developed extensive documentation to inform the
user community of the differences between one part of the CERES data collection and
another. This documentation is heavily coupled to the structure and naming
conventions used in production. Changing this approach would require a substantial
resource investment, since the CERES team has put several thousand person-years
into producing and validating this data collection.
As this example suggests, there are a number of practical policy issues connected
with identifiers for data production configuration management.
1. Because production of large data sets with complex but regular PGE templates is
unworkable without organizing the file collection into a systematic structure, the
data producers tend to create a strongly hierarchical structure of file
collections. The file identifier syntax usually reflects the structure of the file
collection hierarchy. How should other kinds of identifier schemas take the data
production team's collection structure and identifier syntax into account? Do we
need aliased identifiers or should the identifier schemas ignore those of the data
producer? Could the production team's identifiers be adopted as is?
2. Do users find successive disclosure of a file collection's structure to have a
lower learning curve for finding items than a query using more generic search
terms? [The generalization of this question leads to such topics as Web site
browsing, non-semantic searches, and user search efficiency, although they are
beyond the scope of this discussion.]
3. How should data citations be structured so that it is possible to do data
replications or verifications on collections with millions (or more) of data
values, some of which are relevant and some are not?