[Esip-preserve] Stewardship Best Practices - Identifiers

Bruce Barkstrom brbarkstrom at gmail.com
Thu Oct 7 14:35:26 EDT 2010


First, I'm attaching six very short text files containing an example of the
kinds of data rearrangement that will prevent cryptographic digests from
working.  The data are drawn from two files that are available from an ftp
site at NCDC; the location from which I got them is in the file identified
as Procedural_Documentation.txt.  The two files with "INV" in the file
name contain geolocation data for rain gauge stations (lat, long, altitude).
The files with "PRCP" in the name are simple rearrangements of ASCII text
files that contain data, which appear to have been written with a
FORTRAN I5 format for each of the monthly precipitation values.
You'll also find my calculation of MD5 checksums for each file.  If I haven't
made errors and my program has no bugs, you should be able to verify
that the files you receive with this note are unchanged from what I sent.

Second, as a note on scientific equality: although the data values
in the files have been rearranged using only text "cut-and-paste" (and checked
visually), the simple act of changing the order of the values gives the
files different checksums.  Even so, I think it is reasonable to
believe that the data are both authentic (in the sense that they are what
NCDC put into them) and identical, in the sense that when pairs of
corresponding elements in two files are examined together, they have the
same scientific value.  It is easy to contrive more complex examples if we
convert the ASCII integer values to, say, double precision floats.  At that
point, we'd have to turn to a program that would directly compare the numeric
values of the data.  Note, for example, that an ASCII '1' character is
stored as the byte value 49.  Thus, a bit-by-bit comparison of the ASCII text
in these files with the float values will not produce equality.
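
To make the point concrete, here is a minimal Python sketch (not the program
used for the digests in this note).  It takes the first two Ithaca records
from the attached data, builds the rearranged single-line form, and shows that
the MD5 digests differ while the parsed integers agree.  The helper names are
illustrative, and the whitespace-based parsing assumes that no value fills its
whole 5-character field, which holds for these sample records.

```python
import hashlib
import struct

# Two records in the original layout: 11-character station id, 1-character
# duplicate number, 4-character year, then twelve integer fields.
ORIGINAL = (
    "4257251500511900  653  782  940  394  353  541  658  800  257 1110 1562  627\n"
    "4257251500511901  358  259  676  955  983  836  986 1054  455  292  749 1575\n"
)

def values_by_year(text):
    """Parse the original layout into a flat list of monthly integers."""
    out = []
    for line in text.splitlines():
        out.extend(int(v) for v in line[16:].split())
    return out

def rearrange(text):
    """Put all years for the station on one line keyed by the station id."""
    lines = text.splitlines()
    return lines[0][:11] + "".join(line[16:] for line in lines) + "\n"

def values_one_line(text):
    """Parse the rearranged layout (11-character key, then the values)."""
    return [int(v) for v in text[11:].split()]

REARRANGED = rearrange(ORIGINAL)

md5_original = hashlib.md5(ORIGINAL.encode("ascii")).hexdigest()
md5_rearranged = hashlib.md5(REARRANGED.encode("ascii")).hexdigest()

print(md5_original != md5_rearranged)                           # digests differ
print(values_by_year(ORIGINAL) == values_one_line(REARRANGED))  # values agree

# The character '1' is stored as byte value 49, while the double 1.0 is
# stored as eight IEEE 754 bytes, so a byte-wise comparison cannot match.
print(ord("1"), struct.pack(">d", 1.0).hex())
```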

Third, there is some complexity involved in what we might mean by
"scientifically identical".  Probably the simplest case involves equality only
of the data values.  As noted in the Procedural_Documentation file (near the
bottom), these are only the data values contained in the files with PRCP, and
the comparison would not include the station identifiers.  The next simplest
case requires grouping the data values with the geolocation values (or, in
more general terms, the spatio-temporal sampling pattern).  The strongest case
I can think of now would require the data values, the sampling pattern, and
the statistical distribution of error to match for every point.  At this
point, I am not prepared to discuss computational feasibility or storage cost.
As a practical matter, the simplest case is tedious enough to automate.

A suggestion or two on how to proceed:

a.  I strongly suspect that the appropriate way of dealing with traceability
(or authenticity) is to regard the problem as equivalent to being able to
request a deed abstract that shows all of the comparisons of scientific
equivalency that the abstract preparer is aware of, preferably including at
least one comparison with the original file prepared by the original producer.
It will take me a bit of time (meaning maybe a day) to write up an outline of
what this would entail.  The WG needs to digest the material above (and,
hopefully, the files attached) and have a discussion.

b.  The paper, as it now stands, is an interesting collection of information
about various approaches to registering data.  I have mixed feelings about it,
particularly about whether it would lead the community to place undue
confidence in the registration process as a long-term way of preserving data.
As with suggestion a, I think we need to poll the members of the WG about
their feelings on it, after considering the discussion we've been having.

c.  As an interim approach to identifiers, I think it would be straightforward
to use the current naming conventions and hook them up to some of the
identifier schemes described in the paper.  DOIs that resolve to file or
collection names are probably as good as anything else if we're looking for
simple identification.  However, I do not think this kind of approach should
be sold as a method of concocting permanent, unique identifiers.  At best,
scientifically identical data collections form an equivalence class, with no
special status for any one member of the class.

I will see what I can do to put together an outline of what I'd suggest for
authenticity and auditing.  I do have documentation and code for a
computational approach to establishing scientific identity of two data sets.
I'll also note that the HDF mapping approach that I heard about (for the first
time) at the HDF workshop looks to me like a pretty solidly based piece of
work, although it fits within the constraints of working only with HDF files
(at least if my understanding hasn't led me astray).  The mapping of one file
into an equivalent file is also in the work that I'm doing, although I've
started with the layered information model in Annex E of the OAIS RM.  The
content of my mappings is also different from the XML approach: my mappings
consist of pairs of indexes that refer to array elements in what Annex E
identifies as the Data Element layer.  The mapping concept seems like the most
fruitful approach from my perspective.
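
As a rough illustration of the index-pair idea (a sketch only, not the code
mentioned above), suppose each file's data values are read into a flat array.
A mapping between two layouts is then a list of (source index, target index)
pairs, and two files agree when every paired element is equal.  The function
names, the flat-array layout, and the month-by-month target ordering are all
assumptions made for the example.

```python
def transpose_mapping(n_rows, n_cols):
    """Index pairs (i, j) taking a row-major array to its column-major form."""
    return [(r * n_cols + c, c * n_rows + r)
            for r in range(n_rows) for c in range(n_cols)]

def is_bijection(pairs, n):
    """True if every source and target index 0..n-1 appears exactly once."""
    return (sorted(i for i, _ in pairs) == list(range(n))
            and sorted(j for _, j in pairs) == list(range(n)))

def same_values(a, b, pairs):
    """True if each paired element of a and b holds the same value."""
    return (len(a) == len(b) and is_bijection(pairs, len(a))
            and all(a[i] == b[j] for i, j in pairs))

# Two years by three months, stored year by year ...
by_year = [653, 782, 940,
           358, 259, 676]
# ... and the same values stored month by month (a hypothetical layout).
by_month = [653, 358,
            782, 259,
            940, 676]

pairs = transpose_mapping(n_rows=2, n_cols=3)
print(same_values(by_year, by_month, pairs))
```

Requiring the pairs to form a bijection guards against a mapping that silently
skips or reuses an element.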

Hope this is helpful.  We've got a lot of listening to each other to do.

Bruce B.
On Thu, Oct 7, 2010 at 10:41 AM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> On 10/07/2010 10:23 AM, alicebarkstrom at frontier.com wrote:
>
>> The alternative is to be able to verify that files hold scientifically
>> identical data by computing whether the alternatives have the same
>> values.
>>
>
> The Altman paper Ruth cited on page 19 of discusses this a bit:
> http://www.springerlink.com/content/j13u6pwh837q2711/
> "A Fingerprint Method for Scientific Data Verification".
>
> Basically producing a hash of a canonical representation of
> the data.  Regardless of the format, the prescribed canonical
> representation is the same, so the hashes are comparable.
>
> For numerous reasons (you point out several), that isn't sufficient
> for our needs, but with some more work, it could be adapted to help us
> perform a similar function.
>
>
> I've been working on a comparable method, taking hashes of a canonical
> representation of the provenance of a file and using that as a
> fingerprint to compare two files.
>
>
> I think we need to work on both approaches.  Ways to identify,
> distinguish and compare content and ways to identify, distinguish and
> compare provenance.
>
>
> Curt
-------------- next part --------------
42572515005 ITHACA CORNELL UN               42.46  -76.45  293
42572592003 MOUNT SHASTA                    41.32 -122.32 1095
-------------- next part --------------
42572515005 Ithaca Cornell Un               42.46  -76.45  293
42572592003 Mount Shasta                    41.32 -122.32 1095
-------------- next part --------------
4257251500511900  653  782  940  394  353  541  658  800  257 1110 1562  627
4257251500511901  358  259  676  955  983  836  986 1054  455  292  749 1575
4257251500511902  371  386  711  307  480 1473 1829  856  739  919  272 1176
4257251500511903  599  538 1082  244   71 1549  721 1956  330 1557  500  315
4257259200311900 2195  264 2520  886  521  315    0   41  170 2733 1148  965
4257259200311901 3040 2283  157  693  241    0   66   91  528  490 1671  704
4257259200311902  465 5519 1052 1290  729    0   97 1057   28 1341 2824 1115
4257259200311903 2078  447 2177   33   53   84    0    0    0  445 2720  564
-------------- next part --------------
42572515005  653  782  940  394  353  541  658  800  257 1110 1562  627  358  259  676  955  983  836  986 1054  455  292  749 1575  371  386  711  307  480 1473 1829  856  739  919  272 1176  599  538 1082  244   71 1549  721 1956  330 1557  500  315
42572592003 2195  264 2520  886  521  315    0   41  170 2733 1148  965 3040 2283  157  693  241    0   66   91  528  490 1671  704  465 5519 1052 1290  729    0   97 1057   28 1341 2824 1115 2078  447 2177   33   53   84    0    0    0  445 2720  564
-------------- next part --------------
  42.46  -76.45  293  653  782  940  394  353  541  658  800  257 1110 1562  627  358  259  676  955  983  836  986 1054  455  292  749 1575  371  386  711  307  480 1473 1829  856  739  919  272 1176  599  538 1082  244   71 1549  721 1956  330 1557  500  315
  41.32 -122.32 1095 2195  264 2520  886  521  315    0   41  170 2733 1148  965 3040 2283  157  693  241    0   66   91  528  490 1671  704  465 5519 1052 1290  729    0   97 1057   28 1341 2824 1115 2078  447 2177   33   53   84    0    0    0  445 2720  564
-------------- next part --------------
The files in these examples were based on three data files obtained from NOAA's National Climatic Data Center (NCDC)
on Thursday, October 7, 2010 from the ftp site

ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2

at about 11:30 a.m. EDT.

The files are identified as
v2.prcp.adj.Z - a zipped file of size 5,478,170 Bytes originally placed in the NCDC ftp directory on 10/7/2010 at 3:41 a.m.
v2_prcp_inv.mht - a text web page of size 1,297,170 Bytes originally placed in the NCDC ftp directory on 1/25/2002 at 12:00 a.m.
v2_prcp_readme.mht - a text web page of size 5,137 Bytes originally placed in the NCDC ftp directory on 6/13/2008 at 12:00 a.m.

I transferred the downloaded files from the computer I use for Web access to the larger MS Windows XP Pro system
I use for technical work.  On the larger machine, I did the following:

1.  Unzipped the v2.prcp.adj.Z file to create the ASCII text file v2.prcp.adj, which is 17,324,307 bytes according
to the Properties popup of this system.
2.  Selected all of the text in v2_prcp_inv.mht and copied it to a MS Wordpad file v2_prcp_inv.txt, which is 1,317,760 bytes
according to the Properties popup.
3.  Selected all of the text in v2_prcp_readme.mht and copied that to a MS Wordpad file V2_PRCP_Readme.txt,
which is 5,257 bytes according to the Properties popup.

The nature of the files is described in the readme file, as follows:

"GHCN Version 2 Precipitation Version 2 Documentation

These files are associated with GHCN v2 precipitation:
-readme.precip.V2
-v2.prcp.Z
-v2.prcp_adj.Z
-v2.prcp.inv.Z
-v2.prcp.adj.inv.Z
-v2.prcp.duplicates.Z
-v2.prcp.failed.qc.data.Z
-v2.prcp.failed.qc.duplicates.Z
-v2.country.codes.Z

New monthly data are added to v2.prcp a few days after the end of
the month.  Please note that sometimes these new data are later
replaced with data with different values due to, for example,
occasional corrections to the transmitted data that countries
will send over the Global Telecommunications System.

=>  readme.prcp.V2 is this brief documentation file.

...

=>  v2.prcp_adj is the adjusted data file.

This file contains far fewer station records than the raw data set,
It has the same format as the raw data file, but contains
data that has been adjusted for inhomogeneities.  Not only are these
station records free of inhomogeneities, they also
contain additional records for some stations, particularly those in
the former Soviet Union.

...

=>  v2.prcp.inv  Metadata file.

This metadata file contains the station id, station name,
country, latitude, longitude, and elevation
The format is as follows:
station number (i11), space (1x), station name (a20), country (a10),
latitude (f7.2), longitude (f8.2), and elevation in meters (i5)"

The text files containing data and metadata have apparently been
written by a FORTRAN program (as you can see from the format
identified just above for the metadata file).  The data file is organized
by line, such that

"Each line of the data file has:
 
station number which has three parts:
        country code (3 digits)
        nearest WMO station number (5 digits)
        modifier (3 digits) (this is usually 000 if it is that WMO station)
 
This file does not contain any duplicates of the stations, however, it
does have a duplicate number which indicates whether there are duplicates
in the companion file.  A 1 in the duplicate number indicates there
are no duplicates and a 0 indicates that there are duplicates in the
file v2.prcp.duplicates.  

Year:
        four digit year
 
Data:
        12 monthly values each as a 5 digit integer.  
	The data are monthly total precipitation recorded
	at the station in tenths of mm.  (Divide by 10 to get millimeters.)
	Missing monthly values are given as -9999.
	Trace precipitation is indicated by -8888."
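
The two fixed-width layouts quoted above can be read with plain string slicing.
The following Python sketch is only an illustration: the function and field
names are invented here, the column offsets are derived from the quoted format
descriptions (i11, 1x, a20, a10, f7.2, f8.2, i5 for the metadata; an
11-character station number, a 1-digit duplicate number, a 4-digit year, and
twelve 5-character values for the data), and it has not been run against the
full NCDC files.

```python
MISSING = -9999   # missing monthly value, per the readme
TRACE = -8888     # trace precipitation, per the readme

def parse_inventory_line(line):
    """Split one metadata line laid out as i11, 1x, a20, a10, f7.2, f8.2, i5."""
    return {
        "station": line[0:11],
        "name": line[12:32].strip(),
        "country": line[32:42].strip(),
        "lat": float(line[42:49]),
        "lon": float(line[49:57]),
        "elev_m": int(line[57:62]),
    }

def parse_data_line(line):
    """Split one data line: station (11), duplicate number (1), year (4),
    then twelve I5 monthly totals in tenths of a millimetre."""
    raw = [int(line[16 + 5 * k: 21 + 5 * k]) for k in range(12)]
    return {
        "station": line[0:11],
        "duplicate": int(line[11]),
        "year": int(line[12:16]),
        # None marks a missing month; trace amounts are kept as TRACE.
        "tenths_mm": [None if v == MISSING else v for v in raw],
    }

# Build sample lines with format strings so the column widths are explicit.
# The country field is left blank here for simplicity.
inv_line = "{:11d} {:<20}{:<10}{:7.2f}{:8.2f}{:5d}".format(
    42572515005, "ITHACA CORNELL UN", "", 42.46, -76.45, 293)
data_line = "{:11d}{:1d}{:4d}".format(42572515005, 1, 1900) + "".join(
    "{:5d}".format(v) for v in
    [653, 782, 940, 394, 353, 541, 658, 800, 257, 1110, 1562, 627])

print(parse_inventory_line(inv_line))
print(parse_data_line(data_line))
```

Slicing by column, rather than splitting on whitespace, matters for this
format because a value such as -9999 fills its whole field and leaves no space
before it.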

4.  Following this downloading and unpacking, I selected records from two
sites using the Wordpad editor to look for lines containing data from
Ithaca, NY for the years 1900, 1901, 1902, and 1903 and for lines
containing data from Mount Shasta for the same years.  I also extracted
the metadata for these two sites to obtain latitude, longitude, and altitude.
The station identifier for the Ithaca site is 42572515005, which is the
text string I used to search for these data.  The station identifier for the
Mount Shasta site is 42572592003.

The extracted metadata is contained in a text file ITHACA_SHASTA_INV.txt.
The extracted data in the original order is contained in a text file ITHACA_SHASTA_PRCP.txt.
Note that the year number is still attached to the identifier at the beginning of
each line of text.

5.  I then edited the data file so that the four years of monthly values were
concatenated into single lines with the station key as the first eleven characters
in the line.  That data file is identified as ITHACA_SHASTA_PRCP_4Yr.txt.

6.  I then created an ASCII text file with two lines, replacing the station identifier
with the latitude, longitude, and altitude as the first three numeric values, followed
by the 48 monthly precipitation values for the station.  This data file is
identified as ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt.  Note that the year and
month are now implicit in the order of the data array elements.

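7.  I modified a copy of the ITHACA_SHASTA_INV.txt file by changing the station names
from all upper case to mixed case ("ITHACA CORNELL UN" to "Ithaca Cornell Un" on the
first line and "MOUNT SHASTA" to "Mount Shasta" on the second).  The modified file is
identified as ITHACA_SHASTA_INV_Case_Sensitive.txt.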
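
Steps 5 and 6 were done by hand in an editor; the sketch below shows one way
the same rearrangements might be scripted in Python.  It is an illustration
resting on several assumptions: the function names are made up, records are
assumed to arrive already parsed as (station, year, values) tuples sorted by
station and year, and the output spacing (F7.2, F8.2, I5, then I5 fields) is a
guess at the hand-edited layout rather than a byte-exact reproduction of the
attached files.

```python
from collections import OrderedDict

def concatenate_years(records):
    """Step 5: gather all monthly values for each station, in input order."""
    by_station = OrderedDict()
    for station, _year, values in records:
        by_station.setdefault(station, []).extend(values)
    return by_station

def format_keyed(by_station):
    """Render step 5 output: 11-character key followed by I5 fields."""
    return [station + "".join("{:5d}".format(v) for v in values)
            for station, values in by_station.items()]

def format_geolocated(by_station, geoloc):
    """Step 6: replace the key with latitude, longitude and altitude."""
    lines = []
    for station, values in by_station.items():
        lat, lon, alt = geoloc[station]
        head = "{:7.2f}{:8.2f}{:5d}".format(lat, lon, alt)
        lines.append(head + "".join("{:5d}".format(v) for v in values))
    return lines

# Two years for one station, shortened to three months each for brevity.
records = [
    ("42572515005", 1900, [653, 782, 940]),
    ("42572515005", 1901, [358, 259, 676]),
]
geoloc = {"42572515005": (42.46, -76.45, 293)}

merged = concatenate_years(records)
print(format_keyed(merged)[0])
print(format_geolocated(merged, geoloc)[0])
```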

The claim I make regarding scientific identity is that if scientific identity is based
solely on the data values, the following file collections are identical in content,
even though the ordering of the values is different in each file:

a.  ITHACA_SHASTA_PRCP.txt
b.  ITHACA_SHASTA_PRCP_4Yr.txt
c.  ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt

If geolocation needs to enter into scientific identity (which is a more stringent kind
of condition), then the following file collections are identical:

d.  ITHACA_SHASTA_INV.txt and ITHACA_SHASTA_PRCP.txt
e.  ITHACA_SHASTA_INV.txt and ITHACA_SHASTA_PRCP_4Yr.txt
f.  ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt
g.  ITHACA_SHASTA_INV_Case_Sensitive.txt and ITHACA_SHASTA_PRCP.txt
h.  ITHACA_SHASTA_INV_Case_Sensitive.txt and ITHACA_SHASTA_PRCP_4Yr.txt
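
The two levels of identity claimed above can be expressed as two comparison
functions.  This is a Python sketch under stated assumptions, not a finished
tool: each file collection is assumed to have been read into a list of
(latitude, longitude, altitude, values) tuples, one per station, with station
order preserved, and the function names are illustrative.  Station names are
deliberately left out, so the case change in the metadata file has no effect.

```python
def same_data_values(coll_a, coll_b):
    """Weaker test: only the sequences of data values must match."""
    return [rec[3] for rec in coll_a] == [rec[3] for rec in coll_b]

def same_values_and_geolocation(coll_a, coll_b):
    """Stronger test: geolocation and data values must both match.
    Exact float equality is acceptable here only because both sides are
    assumed to be parsed from the same two-decimal text."""
    return list(coll_a) == list(coll_b)

# Values shortened to three months per station for brevity.
ithaca = (42.46, -76.45, 293, [653, 782, 940])
shasta = (41.32, -122.32, 1095, [2195, 264, 2520])

from_inv_and_prcp = [ithaca, shasta]    # e.g. collection d
from_geoloc_file = [ithaca, shasta]     # e.g. collection f

# Hypothetical collection in which one station altitude has been altered.
altered = [(42.46, -76.45, 300, [653, 782, 940]), shasta]

print(same_data_values(from_inv_and_prcp, altered))
print(same_values_and_geolocation(from_inv_and_prcp, altered))
print(same_values_and_geolocation(from_inv_and_prcp, from_geoloc_file))
```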

Just for checking, here is my calculation of the MD5 message digest for these files.

MD5_Test_2 with Filename input ITHACA_SHASTA_INV.txt                 has a Message_Digest : edd74669036ae4447273eda6fd78fa8b
MD5_Test_2 with Filename input ITHACA_SHASTA_INV_Case_Sensitive.txt  has a Message_Digest : 6f2144b28d1eda552716e67bc7711590

MD5_Test_2 with Filename input ITHACA_SHASTA_PRCP_4Yr.txt            has a Message_Digest : ad325f5a422b23972cdbef72b39e8d95
MD5_Test_2 with Filename input ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt     has a Message_Digest : 33256bc015fee2864ad2832467ab447a
MD5_Test_2 with Filename input ITHACA_SHASTA_PRCP.txt                has a Message_Digest : 4fd62be98fdb31b2f1700db0b44db489
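
To check a received file against the digests listed above, something like the
following Python sketch should do.  It uses the standard hashlib module rather
than the MD5_Test_2 program that produced the list, and the function names are
illustrative.

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file's raw bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """Compare a file's digest with a published value, ignoring letter case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, matches("ITHACA_SHASTA_PRCP.txt", "4fd62be98fdb31b2f1700db0b44db489")
should return True only if the file arrived byte for byte as sent.  A transfer
that changes line endings will change the digest even though the text looks
the same.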

MD5 is similar to the basis for the UNF method suggested in Altman, M (2008) A Fingerprint Method for 
Scientific Data Verification.  Adv. Comput. Inf. Sci. Eng.: 311-316. doi:10.1007/978-1-4020-8741-7_57.

Note that going from all upper case to mixed upper and lower case in the metadata file causes the cryptographic
digests of the two files to diverge.
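
One way around this sensitivity, in the spirit of hashing a canonical
representation, is to digest the parsed content rather than the file bytes.
The Python sketch below is not the UNF algorithm from the Altman paper; it is
a simplified stand-in in which the canonical form (upper-cased names, fixed
decimal formatting, one value per line) is an assumption chosen for the
example.

```python
import hashlib

def canonical_bytes(stations):
    """Serialize (name, lat, lon, alt) records in one fixed textual form."""
    lines = []
    for name, lat, lon, alt in stations:
        # Collapse runs of spaces and ignore letter case in the name.
        lines.append(" ".join(name.upper().split()))
        lines.append("{:.2f}".format(lat))
        lines.append("{:.2f}".format(lon))
        lines.append("{:d}".format(alt))
    return "\n".join(lines).encode("ascii")

def fingerprint(stations):
    """MD5 of the canonical form; equal content gives equal fingerprints."""
    return hashlib.md5(canonical_bytes(stations)).hexdigest()

upper = [("ITHACA CORNELL UN", 42.46, -76.45, 293),
         ("MOUNT SHASTA", 41.32, -122.32, 1095)]
mixed = [("Ithaca Cornell Un", 42.46, -76.45, 293),
         ("Mount Shasta", 41.32, -122.32, 1095)]

print(fingerprint(upper) == fingerprint(mixed))   # True: case no longer matters
```

Whether letter case in a station name should count as a scientific difference
is itself a policy choice; the canonical form is where that choice gets
written down.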



