[Esip-preserve] Stewardship Best Practices - Identifiers

Thu Oct 7 15:02:57 EDT 2010

-------------- next part --------------
The files in these examples were based on three data files obtained from NOAA's National Climatic Data Center (NCDC)
on Thursday, October 7, 2010 from the ftp site

ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2

at about 11:30 a.m. EDT.

The files are identified as
v2.prcp.adj.Z - a compressed file of size 5,478,170 bytes originally placed in the NCDC ftp directory on 10/7/2010 at 3:41 a.m.
v2_prcp_inv.mht - a text web page of size 1,297,170 bytes originally placed in the NCDC ftp directory on 1/25/2002 at 12:00 a.m.
v2_prcp_readme.mht - a text web page of size 5,137 bytes originally placed in the NCDC ftp directory on 6/13/2008 at 12:00 a.m.

I transferred the downloaded files from the computer I use for Web access to the larger MS Windows XP Pro system
I use for technical work.  On the larger machine, I did the following:

1.  Uncompressed the v2.prcp.adj.Z file to create the ASCII text file v2.prcp.adj, which is 17,324,307 bytes according
to the Properties popup of this system.
2.  Selected all of the text in v2_prcp_inv.mht and copied it to a MS Wordpad file v2_prcp_inv.txt, which is 1,317,760 bytes
according to the Properties popup.
3.  Selected all of the text in v2_prcp_readme.mht and copied that to a MS Wordpad file V2_PRCP_Readme.txt,
which is 5,257 bytes according to the Properties popup.

The nature of the files is described in the readme file, as follows:

"GHCN Version 2 Precipitation Version 2 Documentation

These files are associated with GHCN v2 precipitation:
-readme.precip.V2
-v2.prcp.Z
-v2.prcp_adj.Z
-v2.prcp.inv.Z
-v2.prcp.adj.inv.Z
-v2.prcp.duplicates.Z
-v2.prcp.failed.qc.data.Z
-v2.prcp.failed.qc.duplicates.Z
-v2.country.codes.Z

New monthly data are added to v2.prcp a few days after the end of
the month.  Please note that sometimes these new data are later
replaced with data with different values due to, for example,
occasional corrections to the transmitted data that countries
will send over the Global Telecommunications System.

=>  readme.prcp.V2 is this brief documentation file.

...

=>  v2.prcp_adj is the adjusted data file.

This file contains far fewer station records than the raw data set,
It has the same format as the raw data file, but contains
data that has been adjusted for inhomogeneities.  Not only are these
station records free of inhomogeneities, they also
contain additional records for some stations, particularly those in
the former Soviet Union.

...

=>  v2.prcp.inv  Metadata file.

This metadata file contains the station id, station name,
country, latitude, longitude, and elevation
The format is as follows:
station number (i11), space (1x), station name (a20), country (a10),
latitude (f7.2), longitude (f8.2), and elevation in meters (i5)"
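
As an illustration, the fixed-width layout quoted above could be read with a short Python sketch like the following. It assumes the columns fall exactly at the widths stated in the readme (i11, 1x, a20, a10, f7.2, f8.2, i5). The names StationRecord and parse_inventory_line are hypothetical, and the sample line is made up, with placeholder coordinates rather than values from the NCDC file.

```python
from typing import NamedTuple


class StationRecord(NamedTuple):
    station_id: str
    name: str
    country: str
    latitude: float
    longitude: float
    elevation_m: int


def parse_inventory_line(line: str) -> StationRecord:
    """Split one inventory line at the widths i11, 1x, a20, a10, f7.2, f8.2, i5."""
    return StationRecord(
        station_id=line[0:11],          # i11
        name=line[12:32].strip(),       # 1x skipped, then a20
        country=line[32:42].strip(),    # a10
        latitude=float(line[42:49]),    # f7.2
        longitude=float(line[49:57]),   # f8.2
        elevation_m=int(line[57:62]),   # i5
    )


# Made-up line for illustration only: the country text and the numbers are
# placeholders, not values taken from v2.prcp.inv.
sample = ("42572515005 " + "ITHACA".ljust(20) + "USA".ljust(10)
          + "  12.34" + "  -56.78" + "  100")
record = parse_inventory_line(sample)
print(record)
```

Slicing by column position rather than splitting on whitespace matters here because station names such as MOUNT SHASTA contain spaces.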

The text files containing data and metadata have apparently been
written by a FORTRAN program (as you can see from the format
identified just above for the metadata file).  The data file is organized
by line, such that

"Each line of the data file has:

station number which has three parts:
        country code (3 digits)
        nearest WMO station number (5 digits)
        modifier (3 digits) (this is usually 000 if it is that WMO station)

This file does not contain any duplicates of the stations, however, it
does have a duplicate number which indicates whether there are duplicates
in the companion file.  A 1 in the duplicate number indicates there
are no duplicates and a 0 indicates that there are duplicates in the
file v2.prcp.duplicates.  

Year:
        four digit year

Data:
        12 monthly values each as a 5 digit integer.  
	The data are monthly total precipitation recorded
	at the station in tenths of mm.  (Divide by 10 to get millimeters.)
	Missing monthly values are given as -9999.
	Trace precipitation is indicated by -8888."
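
The line layout described in this quotation can be sketched in Python as follows. This is a minimal sketch that assumes the fields are packed with no separators in the order given (11-character station number, 1-character duplicate number, 4-character year, then twelve 5-character values). The function name parse_data_line is hypothetical, the sample line is made up, and mapping trace precipitation to 0.0 mm is my assumption, not something the readme specifies.

```python
from typing import List, Optional, Tuple

MISSING = -9999   # missing monthly value, per the readme
TRACE = -8888     # trace precipitation, per the readme


def parse_data_line(line: str) -> Tuple[str, str, int, List[Optional[float]]]:
    """Return (station number, duplicate number, year, 12 monthly totals in mm)."""
    station_id = line[0:11]     # country (3) + WMO number (5) + modifier (3)
    duplicate = line[11]        # assumed to follow the station number directly
    year = int(line[12:16])
    values: List[Optional[float]] = []
    for month in range(12):
        start = 16 + 5 * month
        raw = int(line[start:start + 5])
        if raw == MISSING:
            values.append(None)         # keep missing months distinguishable
        elif raw == TRACE:
            values.append(0.0)          # assumption: treat trace as 0.0 mm
        else:
            values.append(raw / 10.0)   # tenths of mm -> mm
    return station_id, duplicate, year, values


# Made-up line for illustration; the monthly numbers are not from v2.prcp.adj.
raw_values = [123, -9999, -8888, 0, 45, 10, 20, 30, 40, 50, 60, 70]
sample = "42572515005" + "1" + "1900" + "".join(f"{v:5d}" for v in raw_values)
station_id, duplicate, year, values = parse_data_line(sample)
print(station_id, duplicate, year, values)
```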

4.  Following this downloading and unpacking, I used the Wordpad editor to select
records from two sites: lines containing data from Ithaca, NY for the years 1900,
1901, 1902, and 1903, and lines containing data from Mount Shasta for the same years.
I also extracted the metadata for these two sites to obtain latitude, longitude, and
altitude.  The station identifier for the Ithaca site is 42572515005, which is the
text string I used to search for these data.  The station identifier for the
Mount Shasta site is 42572592003.

The extracted metadata is contained in a text file ITHACA_SHASTA_INV.txt.
The extracted data in the original order is contained in a text file ITHACA_SHASTA_PRCP.txt.
Note that the year number is still attached to the identifier at the beginning of
each line of text.
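
The selection in step 4 was done by hand in an editor. A scripted equivalent might look like the sketch below, which assumes the station number occupies the first 11 characters of each line and the year occupies characters 13 through 16. The function name select_records and the sample lines are hypothetical.

```python
def select_records(lines, station_ids, years):
    """Keep lines whose station number and year are both in the requested sets."""
    wanted_ids = set(station_ids)
    wanted_years = {str(y) for y in years}
    return [ln for ln in lines
            if ln[0:11] in wanted_ids and ln[12:16] in wanted_years]


# Made-up lines; only the first 16 characters matter for the selection.
filler = "    0" * 12
lines = [
    "425725150051" + "1899" + filler,
    "425725150051" + "1900" + filler,
    "425725920031" + "1903" + filler,
    "425725920031" + "1904" + filler,
    "999999990001" + "1901" + filler,   # hypothetical other station
]
selected = select_records(lines, ["42572515005", "42572592003"],
                          range(1900, 1904))
print(len(selected))
```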

5.  I then edited the data file so that the four years of monthly totals were
concatenated into single lines with the station key as the first eleven characters
in the line.  That data file is identified as ITHACA_SHASTA_PRCP_4Yr.txt.
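
A rough sketch of the concatenation in step 5 is shown below. The exact layout of the edited file is not given in this message, so the sketch assumes that the duplicate number and year are dropped and that each station's 60-character value blocks are joined in year order after the 11-character station key. The function name concatenate_years and the sample lines are hypothetical.

```python
from collections import defaultdict


def concatenate_years(lines):
    """Join each station's yearly value blocks into one line, ordered by year."""
    by_station = defaultdict(list)
    for ln in lines:
        ln = ln.rstrip("\r\n")
        # (year, twelve 5-character values); the duplicate number is dropped
        by_station[ln[0:11]].append((ln[12:16], ln[16:76]))
    merged = []
    for station_id in sorted(by_station):
        blocks = [values for _, values in sorted(by_station[station_id])]
        merged.append(station_id + "".join(blocks))
    return merged


# Made-up lines for illustration, deliberately given out of year order.
year_1900 = "425725150051" + "1900" + "".join(f"{v:5d}" for v in range(1, 13))
year_1901 = "425725150051" + "1901" + "".join(f"{v:5d}" for v in range(13, 25))
merged = concatenate_years([year_1901, year_1900])
print(merged[0])
```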

6.  I then created an ASCII text file with two lines, replacing the station identifier
with the latitude, longitude, and altitude as the first three numeric values, followed
by the 48 monthly precipitation totals for the station.  This data file is
identified as ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt.  Note that the year and
month are now implicit in the order of the data array elements.
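
Step 6 could be sketched along these lines. The field widths used for latitude, longitude, and altitude are borrowed from the inventory format (f7.2, f8.2, i5), which is an assumption, since the message does not say how the GEOLOC file was formatted. The function name to_geoloc_line is hypothetical and the coordinates are placeholders.

```python
def to_geoloc_line(concatenated_line, latitude, longitude, elevation_m):
    """Replace the 11-character station key with latitude, longitude, altitude."""
    values = concatenated_line[11:]   # everything after the station key
    return f"{latitude:7.2f}{longitude:8.2f}{elevation_m:5d}{values}"


# Made-up input: a station key followed by 48 five-character values.
concatenated = "42572515005" + "".join(f"{v:5d}" for v in range(1, 49))
geoloc_line = to_geoloc_line(concatenated, 12.34, -56.78, 100)  # placeholders
print(geoloc_line[:20])
```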

7.  I modified a copy of the ITHACA_SHASTA_INV.txt file by changing "ITHACA" to "Ithaca"
on the first line and "MOUNT SHASTA" to "Mount Shasta" on the second, and saved the
result as ITHACA_SHASTA_INV_Case_Sensitive.txt.

My claim regarding scientific identity is that, if scientific identity is based solely
on the data values, the following file collections are identical in content,
even though the ordering of the values is different in each file:

a.  ITHACA_SHASTA_PRCP.txt
b.  ITHACA_SHASTA_PRCP_4Yr.txt
c.  ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt
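
One simple way to make "identical solely on the basis of the data values" concrete is to compare the values as a multiset, ignoring order. The sketch below is only an illustration of that idea, with toy numbers and a hypothetical function name same_values; it is not a method described in this message.

```python
from collections import Counter
from itertools import chain


def same_values(a, b):
    """True if both sequences hold the same values the same number of times."""
    return Counter(a) == Counter(b)


# Toy numbers standing in for monthly values (including the sentinel codes).
by_year = [[10, 20, -9999], [30, -8888, 20]]     # one row per year
one_line = list(chain.from_iterable(by_year))    # years concatenated
reordered = one_line[::-1]                       # same values, other order
print(same_values(one_line, reordered))
```

A stricter check would key each value by station, year, and month before comparing, since a multiset comparison cannot tell which month a value belongs to.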

If geolocation needs to enter into scientific identity (which is a more stringent kind
of condition), then the following file collections are identical:

d.  ITHACA_SHASTA_INV.txt and ITHACA_SHASTA_PRCP.txt
e.  ITHACA_SHASTA_INV.txt and ITHACA_SHASTA_PRCP_4Yr.txt
f.  ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt
g.  ITHACA_SHASTA_INV_Case_Sensitive.txt and ITHACA_SHASTA_PRCP.txt
h.  ITHACA_SHASTA_INV_Case_Sensitive.txt and ITHACA_SHASTA_PRCP_4Yr.txt

As a check, here are the MD5 message digests I calculated for these files.

MD5_Test_2 with Filename input ITHACA_SHASTA_INV.txt                 has a Message_Digest : edd74669036ae4447273eda6fd78fa8b
MD5_Test_2 with Filename input ITHACA_SHASTA_INV_Case_Sensitive.txt  has a Message_Digest : 6f2144b28d1eda552716e67bc7711590

MD5_Test_2 with Filename input ITHACA_SHASTA_PRCP_4Yr.txt            has a Message_Digest : ad325f5a422b23972cdbef72b39e8d95
MD5_Test_2 with Filename input ITHACA_SHASTA_PRCP_4Yr_GEOLOC.txt     has a Message_Digest : 33256bc015fee2864ad2832467ab447a
MD5_Test_2 with Filename input ITHACA_SHASTA_PRCP.txt                has a Message_Digest : 4fd62be98fdb31b2f1700db0b44db489

MD5 is similar to the basis for the UNF method suggested in Altman, M. (2008). A Fingerprint Method for
Scientific Data Verification.  Adv. Comput. Inf. Sci. Eng.: 311-316. doi:10.1007/978-1-4020-8741-7_57.

Note that going from all upper case to mixed upper and lower case in the metadata file causes the cryptographic
digests of the two files to diverge.
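
The effect can be reproduced with Python's standard hashlib module. This is a minimal sketch, not the MD5_Test_2 program used above. The helper names are hypothetical, and the byte strings are short stand-ins for the two inventory files, so the digests will not match the ones listed. The case-folding variant is my own illustration of normalizing before hashing and is not the UNF algorithm.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 message digest of a byte string, as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str) -> str:
    """MD5 digest of a file's raw bytes (line endings and case included)."""
    with open(path, "rb") as handle:
        return md5_hex(handle.read())


def case_insensitive_md5(data: bytes) -> str:
    """Digest taken after folding ASCII letters to upper case."""
    return md5_hex(data.upper())


# Stand-ins for the two metadata files; only the station names are shown.
upper = b"ITHACA\r\nMOUNT SHASTA\r\n"
mixed = b"Ithaca\r\nMount Shasta\r\n"

print(md5_hex(upper) == md5_hex(mixed))                             # digests diverge
print(case_insensitive_md5(upper) == case_insensitive_md5(mixed))   # digests agree
```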