[Esip-preserve] On Earth Science Data File Uniqueness
Ruth Duerr
rduerr at nsidc.org
Mon Jan 31 16:26:39 EST 2011
Hi Bruce,
I tend to agree with your conclusions, though not with how you are stating them. For example, a data file is an "object", and there is a point to giving it a "unique id", even though that says nothing about whether we have mechanisms to tell whether it is in the same "equivalence class" as another "object" - in this case, a file that contains the same content differently formatted.
Ruth
On Jan 28, 2011, at 2:02 PM, Bruce Barkstrom wrote:
> While I'm not quite done with the paper that describes work I've been doing
> on file uniqueness, I think we should open a discussion about whether files
> that contain Earth science data, particularly numeric data, can have unique
> identifiers associated with their content.
>
> Altman, M. (2008) A Fingerprint Method for Scientific Data Verification. Adv. Comput. Inf. Sci.
> Eng.: 311-316, doi:10.1007/978-1-4020-8741-7_57, and Altman, M. and King, G. (2007) A Proposed
> Standard for the Scholarly Citation of Quantitative Data. D-Lib Mag. 13(3/4), ISSN 1082-9873,
> http://dlib.org/dlib/march07/altman/03altman.html (accessed 13 August 2010) have suggested
> an approach they call "Universal Numeric Fingerprints". These are basically cryptographic
> digests of the numeric data in files and rely on what these authors call "canonicalization"
> of the formats for numeric data.
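>
> Just to make the idea concrete, here is a minimal sketch in Python of the
> canonicalize-then-digest approach. The rounding rule and the SHA-256 hash are
> stand-ins of my own, not the published UNF algorithm:
>
>     import hashlib
>
>     def canonical_form(value, sig_digits=7):
>         # Render every number in one agreed-upon textual form
>         # (scientific notation, fixed number of significant digits).
>         return format(float(value), ".{}e".format(sig_digits - 1))
>
>     def numeric_fingerprint(values, sig_digits=7):
>         # Digest the canonical strings rather than the raw file bytes,
>         # so the fingerprint survives reformatting of the same numbers.
>         digest = hashlib.sha256()
>         for v in values:
>             digest.update(canonical_form(v, sig_digits).encode("ascii"))
>             digest.update(b"\n")
>         return digest.hexdigest()
>
>     # The same numbers written differently give the same fingerprint,
>     # but only because both sides agreed on the canonical form.
>     print(numeric_fingerprint(["12.50", "0.3"]) ==
>           numeric_fingerprint([1.25e1, 3e-1]))    # True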
>
> I believe it is sufficiently improbable that the Earth sciences will ever have a "canonicalization"
> authority that this suggestion is impractical.
>
> Before considering that argument, I obtained a sample of monthly precipitation data from
> one of the important climate data records at NCDC, Version 2 of the GHCN precipitation
> data. The data in this collection come in two files: one for station metadata, such
> as station latitude, longitude, altitude, and name; and a second for the actual data,
> arranged as rows of numerical values that start with the year of observation and then
> have twelve columns representing the average precipitation in each month.
> The data are described as ASCII characters separated by spaces and have one code
> for months with only a trace of precipitation and another for missing data. The format itself
> is simple enough and can be read by anyone familiar with typical programming languages,
> such as FORTRAN, C, C++, Java, or Ada.
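>
> For what it's worth, reading a record like this takes only a few lines; here is an
> illustrative Python reader. The field layout and the sentinel codes below are
> placeholders, not the actual GHCN Version 2 conventions, which are spelled out in the
> data set's documentation:
>
>     TRACE_CODE = -8888    # assumed sentinel for "trace of precipitation"
>     MISSING_CODE = -9999  # assumed sentinel for "missing data"
>
>     def read_precip(path):
>         # One row per station-year: station id, year, twelve monthly values.
>         records = {}
>         with open(path) as f:
>             for line in f:
>                 fields = line.split()
>                 station, year = fields[0], int(fields[1])
>                 months = [int(v) for v in fields[2:14]]
>                 records[(station, year)] = months
>         return records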
>
> The question that follows is: what kinds of rearrangements are possible that preserve the
> scientific identity of these data? Here are some examples (a small sketch after the list
> shows how two of them change a file's bytes without changing any value):
>
> a. Replace ASCII characters that represent the floating point values with binary numbers
> that might be either single precision or double precision
> b. Replace ASCII characters that are digits or decimal points with characters that spell
> out the number: '1' => 'one' or 'ONE' or 'One'.
> c. Permute the order of the months
> d. Rearrange the data so that each station has an array of numerical values that represent
> the data
> e. Rewrite the data into an XML format - using a DTD
> f. Rewrite the data into an XML format - using an XML Schema
> g. Merge the two files into one, using suggestion d and simply adding in the station
> name and location
> h. Rewrite the data into an HDF format that includes the array sizes and other representation
> information as part of the bits in the file.
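>
> Here is the small sketch I mentioned, covering rearrangements a and c. The values are
> made up; the point is only that the byte-level digests differ while the measurements
> do not:
>
>     import hashlib, struct
>
>     months = [10, 0, 23, 5, 17, 0, 2, 31, 8, 12, 4, 19]   # invented values
>
>     ascii_form  = " ".join(str(m) for m in months).encode("ascii")
>     binary_form = struct.pack("<12d", *[float(m) for m in months])            # a
>     permuted    = " ".join(str(m) for m in reversed(months)).encode("ascii")  # c
>
>     for label, blob in [("ascii", ascii_form), ("binary", binary_form),
>                         ("permuted", permuted)]:
>         print(label, hashlib.md5(blob).hexdigest())
>     # Three different digests, one set of measurements.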
>
> The question then is how we determine whether two files contain identical
> data. To answer that question, I turned to Annex E in the OAIS RM and formalized
> the information layers. There are five layers in the model in that Annex (a short
> sketch after the list illustrates moving from the bit array up to a data structure):
> a. Media Layer
> b. Bit Array Layer (the bits in a file can be put into a single array)
> c. Data Element Layer (in which the bits are grouped into data elements that contain
> the scientific values that we want to preserve)
> d. Data Structure Layer (in which the data elements are grouped into computer science
> data types like arrays, lists, or character strings)
> e. Application Layer (in which mechanisms such as file openings, readings, and closings
> are available to provide a way for applications, such as visualization, to access and
> transform the data structures into other forms that are useful to users)
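>
> The promised sketch: climbing from the Bit Array Layer to the Data Structure Layer for
> one invented station-year record. The record layout (twelve little-endian IEEE doubles)
> is an assumption made purely for illustration:
>
>     import struct
>
>     def bits_to_structure(bit_array):
>         # Data Element Layer: group the bits into 8-byte IEEE doubles.
>         elements = struct.unpack("<12d", bit_array[:96])
>         # Data Structure Layer: arrange the elements as a month-indexed dict.
>         return {month + 1: value for month, value in enumerate(elements)}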
>
> To prove that two files have identical scientific data, two things are needed:
> 1. Identification of which data elements in a file refer to the scientific content
> (array sizes and orderings are not part of this identity - it shouldn't matter whether
> we're dealing with an array ordered by using the C language conventions or the FORTRAN
> conventions)
> 2. A mapping between the order of the scientific data elements in the first file and the order
> of the scientific data elements in the second (meaning something like data element 1 in
> file 1 should be compared with data element 10 in file 2, and so on); a sketch of such a
> mapping-based comparison follows.
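>
> As promised, here is a minimal Python sketch of requirement 2. The mapping is just a
> list of index pairs (i, j) meaning "element i in file 1 corresponds to element j in
> file 2"; producing that mapping is exactly the hard, format-specific part:
>
>     def scientifically_identical(elements_1, elements_2, mapping, tolerance=0.0):
>         # elements_1 and elements_2 are the scientific data elements pulled
>         # out of each file; mapping pairs up their positions.
>         return all(abs(elements_1[i] - elements_2[j]) <= tolerance
>                    for i, j in mapping)
>
>     # Same values, different order: identical once the mapping is known.
>     print(scientifically_identical([1.0, 2.0, 3.0],
>                                    [3.0, 1.0, 2.0],
>                                    mapping=[(0, 1), (1, 2), (2, 0)]))   # True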
>
> Note that there are three degrees of stringency in the comparison (sketched in code after
> this list):
> a. Only data elements that are part of the measurement set in the file are to be compared
> (temperature 1 = temperature 10?) - no context or error data elements are included
> b. Data Elements that were measured and context data elements are included
> (temperature 1 = temperature 10 AND latitude 1 = latitude 10?)
> c. Measurements, Context, and Error Distribution data elements are included
> (temperature 1 = temperature 10 AND latitude 1 = latitude 10 AND Temperature 1 Bias =
> Temperature 10 Bias AND Temperature 1 Std Dev = Temperature 10 Std Dev)
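>
> Here is the sketch of the three degrees of stringency. The grouping of fields into
> measurement, context, and error distribution sets is assumed for illustration; a real
> product would define these in its documentation:
>
>     FIELD_GROUPS = {
>         "measurement": ["temperature"],
>         "context":     ["latitude", "longitude"],
>         "error":       ["temperature_bias", "temperature_std_dev"],
>     }
>     STRINGENCY = {
>         "a": ["measurement"],
>         "b": ["measurement", "context"],
>         "c": ["measurement", "context", "error"],
>     }
>
>     def records_match(rec_1, rec_2, level="a"):
>         # rec_1 and rec_2 are dicts keyed by field name.
>         fields = [f for g in STRINGENCY[level] for f in FIELD_GROUPS[g]]
>         return all(rec_1[f] == rec_2[f] for f in fields)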
>
> At this point, I'm not making a recommendation about which of these degrees of
> stringency is appropriate - just that these degrees exist. I'll also note that the
> same kinds of rearrangement can be applied to many other examples. An interesting
> one is that the MODIS geolocation data (latitudes and longitudes) are contained in
> one kind of data product (MOD02 is the name), while the radiometric data from the
> instrument are contained (without geolocation) in individual files for each spectral
> band. If an individual or group rearranged the data so each pixel in an image were
> associated with a particular band's spectral radiance, the data in the new file would
> be scientifically identical with the data in the two original files.
>
> Note also that this mapping approach to equivalency ignores whether the data
> element order and other representation information is included in the files. This
> approach to representation is substantially looser than what would be required for
> dealing with files containing text and illustrations, since textual works are almost
> certain to require that case, punctuation, and separators remain the same
> from one "copy" of a document to another.
>
> Note that under the proposed mapping approach any of the possible arrangements
> of data elements in the list a - h would be judged as "scientifically identical" despite
> the substantial rearrangements they allow (we could even include demonstrating that
> results sets from relational database queries are scientifically equivalent to the plain
> ASCII format of the two original files, which would add another rearrangement).
> From the standpoint of the mathematics involved, there is no one preeminent
> arrangement that is unique. Thus, any object that the mapping says is equivalent
> belongs to the same equivalence class.
>
> As a practical matter, this formal work suggests that files are like deeds in county
> records. To verify the ownership of a particular piece of property, you establish a
> chain of ownership (the term provenance may be more appropriate here) demonstrating that the
> property described in the current deed is identical with the property owned by the previous
> owner and that the previous owner had an identical piece of property passed to him or her. Ownership
> in this sense does not depend on an item embedded in the property. [We should have
> a discussion of the mechanisms for verifying the chain of ownership - although we also
> should make sure we have some specialists in authorization and authentication
> involved in that discussion.]
>
> To return to Altman, et al., the weak point of their approach is the assumption that
> there is a reasonable probability of having an authority that can establish a canonical
> arrangement of array orders (or list orders), as well as a canonical format for numeric
> data. As a practical matter, the improbability of canonicalization is suggested by
> the persistence of format diversity for images and text documents. There is no sign that
> jpg, gif, ps, eps, bitmaps, tiff, or any of the other image formats are going to disappear.
> Likewise, text documents can appear as MS .doc files (with a number of versions),
> .odt files (from Open Office), .ps, .pdf, .tex, or even older formats. Again, there's no
> sign that these diverse formats are going to be "canonicalized". TeX and plain .txt
> files are the only ones I can think of that appear to be stable on a very long time scale.
> In the case of Earth science data, there is the additional difficulty that various formats
> used in operational data systems are close to being embedded in international agreements.
> That's clearly the case for radiosonde data, where the WMO has a rather old format
> that's used to create weather forecasts on a short latency schedule. It seems unlikely
> that attempts to change such formats will succeed, because a format change would involve
> large sums of money and considerable disruption.
>
> In short, I disagree with the notion that Earth science data files can be uniquely
> identified as "objects" - although the procedure I've sketched above does provide
> a method for identifying two or more members of the same equivalence class.
>
> Bruce B.
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve