[Esip-preserve] On Earth Science Data File Uniqueness

Bruce Barkstrom brbarkstrom at gmail.com
Fri Jan 28 16:02:03 EST 2011


While I'm not quite done with the paper that describes the work I've been doing
on file uniqueness, I think we should open a discussion about whether files
that contain Earth science data, particularly numeric data, can have unique
identifiers associated with their content.

Altman, M. (2008) A Fingerprint Method for Scientific Data Verification.
Adv. Comput. Inf. Sci. Eng.: 311-316. doi:10.1007/978-1-4020-8741-7_57,
and Altman, M., King, G. (2007) A Proposed Standard for the Scholarly
Citation of Quantitative Data. D-Lib Mag. 13(3/4). ISSN 1082-9873.
http://dlib.org/dlib/march07/altman/03altman.html (accessed 13 August 2010)
have suggested an approach they call "Universal Numeric Fingerprints".
These are basically cryptographic digests of numeric data in files, and
they rely on what these authors call "canonicalization" of the formats
for numeric data.
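By way of illustration, the fingerprint idea amounts to hashing a canonical
rendering of the numbers rather than the raw bytes of the file. The sketch
below is not the actual UNF algorithm - the rounding rule, the notation, and
the choice of hash are assumptions made for brevity:

    import hashlib

    def numeric_fingerprint(values, sig_digits=7):
        # Canonicalize each value: a fixed number of significant digits
        # in one exponential notation, so that 0.5, 0.50, and 5e-1 all
        # produce the same string before hashing.
        canonical = "\n".join("%.*e" % (sig_digits - 1, v) for v in values)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

Note that such a fingerprint is independent of the storage format but still
depends on the order in which the values are presented, which is where the
canonicalization authority comes in.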

I believe that, in the Earth sciences, the chance of finding a
"canonicalization" authority is remote enough that this suggestion is
impractical.

Before taking up that argument, I obtained a sample of monthly precipitation
data from one of the important climate data records at NCDC, Version 2 of
the GHCN precipitation data. The data in this collection come in two files:
one for station metadata, such as station latitude, longitude, and altitude,
as well as name, and a second for the actual data, arranged as rows of
numerical values that start with the year of observation and then have
twelve columns representing the precipitation in each month of that year.
The data are described as ASCII characters separated by spaces, with one
code for months with only a trace of precipitation and another for missing
data. The format itself is simple enough and can be read by anyone familiar
with typical programming languages, such as FORTRAN, C, C++, Java, or Ada.
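For concreteness, here is a minimal sketch of a reader for records laid out
as just described (a year followed by twelve monthly values, separated by
spaces). The sentinel codes below are placeholders; the actual codes are
defined in the GHCN v2 documentation:

    # Placeholder sentinel codes; the real values are defined in the
    # GHCN v2 documentation.
    TRACE = -8888.0
    MISSING = -9999.0

    def read_station_records(path):
        """Read rows of 'year m1 ... m12' into {year: [12 monthly values]}."""
        records = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) != 13:
                    continue          # skip malformed or header lines
                year = int(fields[0])
                records[year] = [float(v) for v in fields[1:]]
        return records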

The question that follows is: what kinds of rearrangements are possible that
preserve the scientific identity of these data? Here are some examples:

a.  Replace ASCII characters that represent the floating point values with
binary numbers
that might be either single precision or double precision
b.  Replace the ASCII digit and decimal-point characters with text that
spells out the values: '1' => 'one' or 'ONE' or 'One'.
c.  Permute the order of the months
d.  Rearrange the data so that each station has an array of numerical values
that represent
the data
e.  Rewrite the data into an XML format - using a DTD
f.  Rewrite the data into an XML format - using an XML Schema
g.  Merge the two files into one, using suggestion d and simply adding in
the station
name and location
h.  Rewrite the data into an HDF format that includes the array sizes and
other representation information as part of the bits in the file.
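Even the mildest of these, rearrangement a, defeats a byte-level digest. A
small sketch (the values are illustrative): the same numbers stored as ASCII
text and as packed binary doubles hash to different digests, although the
scientific content is identical:

    import hashlib
    import struct

    values = [12.3, 0.0, 45.6]                  # illustrative monthly values

    ascii_bytes = " ".join(str(v) for v in values).encode("ascii")
    binary_bytes = struct.pack("<3d", *values)  # little-endian doubles

    # Same numbers, different byte streams - and so different digests.
    print(hashlib.sha256(ascii_bytes).hexdigest())
    print(hashlib.sha256(binary_bytes).hexdigest())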

The question then is how we find out whether two files have identical data.
To answer that question, I turned to Annex E in the OAIS RM and formalized
the information layers. There are five layers in the model in that Annex:
a.  Media Layer
b.  Bit Array Layer (the bits in a file can be put into a single array)
c.  Data Element Layer (in which the bits are grouped into data elements
that contain
the scientific values that we want to preserve)
d.  Data Structure Layer (in which the data elements are grouped into
computer science
data types like arrays, lists, or character strings)
e.  Application Layer (in which mechanisms such as file opening, reading,
and closing allow applications, such as visualization, to access and
transform the data structures into other forms that are useful to users)

To prove that two files have identical scientific data, two things are
needed:
1.  Identification of which data elements in a file refer to the scientific
content (array sizes and orderings are not part of this identity - it
shouldn't matter whether we're dealing with an array ordered using the C
language conventions or the FORTRAN conventions)
2.  A mapping between the order of scientific data elements in the first
file and the order
of the scientific data elements in the second (meaning something like data
element 1 in
file 1 should be compared with data element 10 in file 2, and so on).
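A minimal sketch of requirement 2, assuming the mapping is supplied as a
list of index pairs and that the elements are simple numeric values (the
tolerance parameter is a policy choice I've added, not part of the original
proposal):

    def scientifically_identical(elems1, elems2, mapping, tol=0.0):
        # mapping: list of (i, j) pairs meaning "compare elems1[i]
        # with elems2[j]"; order and storage layout play no role.
        return all(abs(elems1[i] - elems2[j]) <= tol for i, j in mapping)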

Note that there are three degrees of stringency in the comparison:
a.  Only data elements that are part of the measurement set in the file are
to be compared
(temperature 1 = temperature 10?) - no context or error data elements are
included
b.  Data Elements that were measured and context data elements are included
(temperature 1 = temperature 10 AND latitude 1 = latitude 10?)
c.  Measurements, Context, and Error Distribution data elements are included
(temperature 1 = temperature 10 AND latitude 1 = latitude 10 AND
Temperature 1 Bias = Temperature 10 Bias AND Temperature 1 Std Dev =
Temperature 10 Std Dev)
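One way to express the three degrees of stringency in code, assuming each
data element carries a dict of named fields (the field names here are
illustrative):

    # Field groups compared at each stringency level (illustrative names).
    LEVELS = {
        "a": ("temperature",),                                # measurements
        "b": ("temperature", "latitude"),                     # + context
        "c": ("temperature", "latitude", "bias", "std_dev"),  # + errors
    }

    def elements_match(elem1, elem2, level):
        return all(elem1[field] == elem2[field] for field in LEVELS[level])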

At this point, I'm not making a recommendation about which of these degrees
of stringency is appropriate - just that these degrees exist. I'll also note
that the same kinds of rearrangement can be applied to many other examples.
An interesting one is that the MODIS geolocation data (latitudes and
longitudes) are contained in one kind of data product (MOD03 is the name),
while the radiometric data from the instrument are contained (without
geolocation) in separate files for the spectral bands. If an individual or
group rearranged the data so each pixel in an image were associated with a
particular band's spectral radiance, the data in the new file would be
scientifically identical with the data in the original files.

Note also that this mapping approach to equivalency ignores whether the data
element order and other representation information is included in the files.
This approach to representation is substantially looser than what would be
required for dealing with files containing text and illustrations, since
text works are almost certain to require that case, punctuation, and
separators remain the same from one "copy" of a document to another.

Note that under the proposed mapping approach any of the possible
arrangements of data elements in the list a - h would be judged
"scientifically identical" despite the substantial rearrangements they allow
(we could even demonstrate that result sets from relational database queries
are scientifically equivalent to the plain ASCII format of the two original
files, which would add another rearrangement). From the standpoint of the
mathematics involved, there is no one preeminent arrangement that is unique.
Thus, any object that the mapping says is equivalent belongs to the same
equivalence class.
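Formally, "scientifically identical" as defined by such a mapping is an
equivalence relation. A sketch of the definition, writing E(f) for the set
of scientific data elements of file f and v(e) for an element's value:

    f_1 \sim f_2 \iff \exists \text{ a bijection } \varphi : E(f_1) \to E(f_2)
    \text{ with } v(e) = v(\varphi(e)) \text{ for all } e \in E(f_1)

Reflexivity (the identity map), symmetry (the inverse map), and transitivity
(composition of maps) all hold, so the relation partitions files into
equivalence classes with no distinguished representative.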

As a practical matter, this formal work suggests that files are like deeds
in county records. To verify the ownership of a particular piece of
property, you establish a chain of ownership (provenance may be the
appropriate term) demonstrating that the property in the current deed is
identical with the property owned by the previous owner and that the
previous owner had an identical piece of property passed to him. Ownership
in this sense does not depend on an item embedded in the property. [We
should have a discussion of the mechanisms for verifying the chain of
ownership - although we should also make sure we have some specialists in
authorization and authentication involved in that discussion.]

To return to Altman, et al., the weak point of their approach is the
assumption that there is a reasonable probability of having an authority
that can establish a canonical arrangement of array orders (or list orders),
as well as a canonical format for numeric data. As a practical matter, the
improbability of canonicalization is suggested by the persistence of format
diversity for images and text documents. There is no sign that jpg, gif, ps,
eps, bitmaps, tiff, or any of the other image formats are going to
disappear. Likewise, text documents can appear as MS .doc files (with a
number of versions), .odt files (from Open Office), .ps, .pdf, .tex, or even
older formats. Again, there's no sign that these diverse formats are going
to be "canonicalized". TeX and plain .txt files are the only ones I can
think of that appear to be stable on a very long time scale. In the case of
Earth science data, there is the additional difficulty that various formats
used in operational data systems are close to being embedded in
international agreements. That's clearly the case for radiosonde data, where
the WMO has a rather old format that's used to create weather forecasts on a
short latency schedule. It seems unlikely that attempts to change such
formats will succeed, because of the large sums of money and the disruption
that a format change would entail.

In short, I disagree with the notion that Earth science data files can be
uniquely identified as "objects" - although the procedure I've sketched
above does provide a method that can identify two or more members of the
same equivalence class.

Bruce B.