[Esip-preserve] On Earth Science Data File Uniqueness
Ruth Duerr
rduerr at nsidc.org
Mon Jan 31 16:26:39 EST 2011
Hi Bruce,
I tend to agree with your conclusions, though not with how you are stating them. For example, a data file is an "object", and there is a point to giving it a "unique id", even though that says nothing about whether we have mechanisms to tell whether it is in the same "equivalence class" as another "object" - in this case, a file that contains the same content differently formatted.
Ruth
On Jan 28, 2011, at 2:02 PM, Bruce Barkstrom wrote:
> While I'm not quite done with the paper that describes work I've been doing
> on file uniqueness, I think we should open a discussion about whether files
> that contain Earth science data, particularly numeric data, can have unique
> identifiers associated with their content.
>
> Altman, M. (2008) A Fingerprint Method for Scientific Data Verification. Adv. Comput. Inf. Sci.
> Eng.: 311-316, doi:10.1007/978-1-4020-8741-7_57, and Altman, M. and King, G. (2007) A Proposed
> Standard for the Scholarly Citation of Quantitative Data. D-Lib Mag. 13(3/4), ISSN 1082-9873,
> http://dlib.org/dlib/march07/altman/03altman.html (accessed 13 August 2010) have suggested
> an approach they call "Universal Numeric Fingerprints". These are basically cryptographic
> digests of the numeric data in files and rely on what these authors call "canonicalization"
> of the formats for numeric data.
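>
> Just to make the idea concrete, here is a minimal sketch in Python of the
> canonicalize-then-digest approach. The rounding rule and the SHA-256 hash are
> stand-ins of my own, not the published UNF algorithm:
>
>     import hashlib
>
>     def canonical_form(value, sig_digits=7):
>         # Render every number in one agreed-upon textual form
>         # (scientific notation, fixed number of significant digits).
>         return format(float(value), ".{}e".format(sig_digits - 1))
>
>     def numeric_fingerprint(values, sig_digits=7):
>         # Digest the canonical strings rather than the raw file bytes,
>         # so the fingerprint survives reformatting of the same numbers.
>         digest = hashlib.sha256()
>         for v in values:
>             digest.update(canonical_form(v, sig_digits).encode("ascii"))
>             digest.update(b"\n")
>         return digest.hexdigest()
>
>     # The same numbers written differently give the same fingerprint,
>     # but only because both sides agreed on the canonical form.
>     print(numeric_fingerprint(["12.50", "0.3"]) ==
>           numeric_fingerprint([1.25e1, 3e-1]))    # True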
>
> I believe it is sufficiently improbable that the Earth sciences will ever have a "canonicalization"
> authority that this suggestion is impractical.
>
> Before considering that argument, I obtained a sample of monthly precipitation data from
> one of the important climate data records at NCDC, Version 2 of the GHCN precipitation
> data. The data in this collection come in two files: one for station metadata, such
> as station latitude, longitude, altitude, and name; and a second for the actual data,
> arranged as rows of numerical values that start with the year of observation and then
> have twelve columns representing the average precipitation in each month.
> The data are described as ASCII characters separated by spaces and have one code
> for months with only a trace of precipitation and another for missing data. The format itself
> is simple enough and can be read by anyone familiar with typical programming languages,
> such as FORTRAN, C, C++, Java, or Ada.
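>
> For what it's worth, reading a record like this takes only a few lines; here is an
> illustrative Python reader. The field layout and the sentinel codes below are
> placeholders, not the actual GHCN Version 2 conventions, which are spelled out in the
> data set's documentation:
>
>     TRACE_CODE = -8888    # assumed sentinel for "trace of precipitation"
>     MISSING_CODE = -9999  # assumed sentinel for "missing data"
>
>     def read_precip(path):
>         # One row per station-year: station id, year, twelve monthly values.
>         records = {}
>         with open(path) as f:
>             for line in f:
>                 fields = line.split()
>                 station, year = fields[0], int(fields[1])
>                 months = [int(v) for v in fields[2:14]]
>                 records[(station, year)] = months
>         return records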
>
> The question that follows is: what kinds of rearrangements are possible that preserve the
> scientific identity of these data? Here are some examples (a small sketch after the list
> shows how two of them change a file's bytes without changing any value):
>
> a. Replace ASCII characters that represent the floating point values with binary numbers
> that might be either single precision or double precision
> b. Replace ASCII characters that are digits or decimal points with characters that spell
> out the number: '1' => 'one' or 'ONE' or 'One'.
> c. Permute the order of the months
> d. Rearrange the data so that each station has an array of numerical values that represent
> the data
> e. Rewrite the data into an XML format - using a DTD
> f. Rewrite the data into an XML format - using an XML Schema
> g. Merge the two files into one, using suggestion d and simply adding in the station
> name and location
> h. Rewrite the data into an HDF format that includes the array sizes and other representation
> information as part of the bits in the file.
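>
> Here is the small sketch I mentioned, covering rearrangements a and c. The values are
> made up; the point is only that the byte-level digests differ while the measurements
> do not:
>
>     import hashlib, struct
>
>     months = [10, 0, 23, 5, 17, 0, 2, 31, 8, 12, 4, 19]   # invented values
>
>     ascii_form  = " ".join(str(m) for m in months).encode("ascii")
>     binary_form = struct.pack("<12d", *[float(m) for m in months])            # a
>     permuted    = " ".join(str(m) for m in reversed(months)).encode("ascii")  # c
>
>     for label, blob in [("ascii", ascii_form), ("binary", binary_form),
>                         ("permuted", permuted)]:
>         print(label, hashlib.md5(blob).hexdigest())
>     # Three different digests, one set of measurements.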
>
> The question then is how we determine whether two files contain identical
> data. To answer that question, I turned to Annex E in the OAIS RM and formalized
> the information layers. There are five layers in the model in that Annex (a short
> sketch after the list illustrates moving from the bit array up to a data structure):
> a. Media Layer
> b. Bit Array Layer (the bits in a file can be put into a single array)
> c. Data Element Layer (in which the bits are grouped into data elements that contain
> the scientific values that we want to preserve)
> d. Data Structure Layer (in which the data elements are grouped into computer science
> data types like arrays, lists, or character strings)
> e. Application Layer (in which mechanisms such as file openings, readings, and closings
> are available to provide a way for applications, such as visualization, to access and
> transform the data structures into other forms that are useful to users)
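>
> The promised sketch: climbing from the Bit Array Layer to the Data Structure Layer for
> one invented station-year record. The record layout (twelve little-endian IEEE doubles)
> is an assumption made purely for illustration:
>
>     import struct
>
>     def bits_to_structure(bit_array):
>         # Data Element Layer: group the bits into 8-byte IEEE doubles.
>         elements = struct.unpack("<12d", bit_array[:96])
>         # Data Structure Layer: arrange the elements as a month-indexed dict.
>         return {month + 1: value for month, value in enumerate(elements)}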
>
> To prove that two files have identical scientific data, two things are needed:
> 1. Identification of which data elements in a file refer to the scientific content
> (array sizes and orderings are not part of this identity - it shouldn't matter whether
> we're dealing with an array ordered by using the C language conventions or the FORTRAN
> conventions)
> 2. A mapping between the order of the scientific data elements in the first file and the order
> of the scientific data elements in the second (meaning something like data element 1 in
> file 1 should be compared with data element 10 in file 2, and so on); a sketch of such a
> mapping-based comparison follows.
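>
> As promised, here is a minimal Python sketch of requirement 2. The mapping is just a
> list of index pairs (i, j) meaning "element i in file 1 corresponds to element j in
> file 2"; producing that mapping is exactly the hard, format-specific part:
>
>     def scientifically_identical(elements_1, elements_2, mapping, tolerance=0.0):
>         # elements_1 and elements_2 are the scientific data elements pulled
>         # out of each file; mapping pairs up their positions.
>         return all(abs(elements_1[i] - elements_2[j]) <= tolerance
>                    for i, j in mapping)
>
>     # Same values, different order: identical once the mapping is known.
>     print(scientifically_identical([1.0, 2.0, 3.0],
>                                    [3.0, 1.0, 2.0],
>                                    mapping=[(0, 1), (1, 2), (2, 0)]))   # True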
>
> Note that there are three degrees of stringency in the comparison (sketched in code after
> this list):
> a. Only data elements that are part of the measurement set in the file are to be compared
> (temperature 1 = temperature 10?) - no context or error data elements are included
> b. Data Elements that were measured and context data elements are included
> (temperature 1 = temperature 10 AND latitude 1 = latitude 10?)
> c. Measurements, Context, and Error Distribution data elements are included
> (temperature 1 = temperature 10 AND latitude 1 = latitude 10 AND Temperature 1 Bias =
> Temperature 10 Bias AND Temperature 1 Std Dev = Temperature 10 Std Dev)
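>
> Here is the sketch of the three degrees of stringency. The grouping of fields into
> measurement, context, and error distribution sets is assumed for illustration; a real
> product would define these in its documentation:
>
>     FIELD_GROUPS = {
>         "measurement": ["temperature"],
>         "context":     ["latitude", "longitude"],
>         "error":       ["temperature_bias", "temperature_std_dev"],
>     }
>     STRINGENCY = {
>         "a": ["measurement"],
>         "b": ["measurement", "context"],
>         "c": ["measurement", "context", "error"],
>     }
>
>     def records_match(rec_1, rec_2, level="a"):
>         # rec_1 and rec_2 are dicts keyed by field name.
>         fields = [f for g in STRINGENCY[level] for f in FIELD_GROUPS[g]]
>         return all(rec_1[f] == rec_2[f] for f in fields)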
>
> At this point, I'm not making a recommendation about which of these degrees of
> stringency is appropriate - just that these degrees exist. I'll also note that the
> same kinds of rearrangement can be applied to many other examples. An interesting
> one is that the MODIS geolocation data (latitudes and longitudes) are contained in
> one kind of data product (MOD02 is the name), while the radiometric data from the
> instrument are contained (without geolocation) in individual files for each spectral
> band. If an individual or group rearranged the data so each pixel in an image were
> associated with a particular band's spectral radiance, the data in the new file would
> be scientifically identical with the data in the two original files.
>
> Note also that this mapping approach to equivalency ignores whether the data
> element order and other representation information is included in the files. This
> approach to representation is substantially looser than what would be required for
> dealing with files containing text and illustrations, since textual works are almost
> certain to require that case, punctuation, and separators remain the same
> from one "copy" of a document to another.
>
> Note that under the proposed mapping approach any of the possible arrangements
> of data elements in the list a - h would be judged as "scientifically identical" despite
> the substantial rearrangements they allow (we could even include demonstrating that
> results sets from relational database queries are scientifically equivalent to the plain
> ASCII format of the two original files, which would add another rearrangement).
> From the standpoint of the mathematics involved, there is no one preeminent
> arrangement that is unique. Thus, any object that the mapping says is equivalent
> belongs to the same equivalence class.
>
> As a practical matter, this formal work suggests that files are like deeds in county
> records. To verify the ownership of a particular piece of property, you establish a
> chain of ownership (the term provenance may be more appropriate here) demonstrating that the
> property described in the current deed is identical with the property owned by the previous
> owner and that the previous owner had an identical piece of property passed to him or her. Ownership
> in this sense does not depend on an item embedded in the property. [We should have
> a discussion of the mechanisms for verifying the chain of ownership - although we also
> should make sure we have some specialists in authorization and authentication
> involved in that discussion.]
>
> To return to Altman, et al., the weak point of their approach is the assumption that
> there is a reasonable probability of having an authority that can establish a canonical
> arrangement of array orders (or list orders), as well as a canonical format for numeric
> data. As a practical matter, the improbability of canonicalization is suggested by
> the persistence of format diversity for images and text documents. There is no sign that
> jpg, gif, ps, eps, bitmaps, tiff, or any of the other image formats are going to disappear.
> Likewise, text documents can appear as MS .doc files (with a number of versions),
> .odt files (from Open Office), .ps, .pdf, .tex, or even older formats. Again, there's no
> sign that these diverse formats are going to be "canonicalized". TeX and plain .txt
> files are the only ones I can think of that appear to be stable on a very long time scale.
> In the case of Earth science data, there is the additional difficulty that various formats
> used in operational data systems are close to being embedded in international agreements.
> That's clearly the case for radiosonde data, where the WMO has a rather old format
> that's used to create weather forecasts on a short latency schedule. It seems unlikely
> that attempts to change such formats will succeed, because a format change would involve
> large sums of money and considerable disruption.
>
> In short, I disagree with the notion that Earth science data files can be uniquely
> identified as "objects" - although the procedure I've sketched above does provide
> a method for identifying two or more members of the same equivalence class.
>
> Bruce B.
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve