[Esip-preserve] On Earth Science Data File Uniqueness

Wed Feb 16 08:10:25 EST 2011

On 02/15/11 08:29, Bruce Barkstrom wrote:
> I think we're going to have to work this through in detail - meaning
> scenarios at about the level required for authentication and
> cryptography.  MD5 and SHA-1 both tie into bit-level content, so two
> files would have to be the same at that level to get the same
> identifier.  This would give the same identifier to copies of files
> created as backup.  The other versions of UUID's would have separate
> identifiers - but then you have the "orphan file problem" we
> discussed before: you need a registry of ID's to know the backup
> copy is a bit-for-bit copy of the original.  If we go this route,
> we're going to need a real mathematician, probably one familiar with
> number theory and such.  I don't qualify, I'm only an applied
> mathematician.
>
> As a minor note, I believe both MD5 and SHA-1 are believed to be
> "broken" (or "slightly flawed") cryptographic digests.  This means
> that there might be some way for someone to forge IDs.  Don't know
> that there have been any successful uses of the vulnerability - but
> most cryptographers would probably think that was just a matter of
> time.

Indeed --
http://en.wikipedia.org/wiki/MD5#Security
http://www.schneier.com/blog/archives/2005/02/sha1_broken.html

I haven't done a huge amount of research on those, but my
understanding is that each of the published attacks compromises
cryptographic uses like non-repudiation, since they allow a malicious
agent to change the content in such a way that the digital signature
is still the same.

My belief is that our purpose of fixity -- providing a way for the
consumer of our data to check the integrity of the content after
transfer or storage -- wouldn't really be affected by those attacks.
We are concerned about random changes to the content, not malicious
changes.  (Is this valid?  Do we have use cases where we *are*
concerned about a malicious man-in-the-middle?)
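
A minimal sketch of that kind of fixity check, assuming Python's
hashlib and SHA-256 (this thread does not settle on a language or an
algorithm). The names file_digest and verify_fixity are hypothetical
helpers, not part of any existing tool:

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_fixity(path, expected_hex, algorithm="sha256"):
    """True if the file's current digest matches the published one."""
    return file_digest(path, algorithm) == expected_hex.lower()
```

The provider would publish the digest alongside the file; the consumer
recomputes it after transfer or storage and compares.  A mismatch means
the bits changed; it says nothing about who changed them or why.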

NIST has "FIPS PUB 186-3 - Digital Signature Standard (DSS)" [1], which
discusses all of this at great length.

It describes the applications:
"A digital signature algorithm allows an entity to authenticate the
integrity of signed data and the identity of the signatory. The
recipient of a signed message can use a digital signature as evidence
in demonstrating to a third party that the signature was, in fact,
generated by the claimed signatory. This is known as non-repudiation,
since the signatory cannot easily repudiate the signature at a later
time. A digital signature algorithm is intended for use in electronic
mail, electronic funds transfer, electronic data interchange, software
distribution, data storage, and other applications that require data
integrity assurance and data origin authentication."

Our use would fall into that "other applications that require data
integrity assurance" category.

"FIPS PUB 180-3, Secure Hash Standard" [2] goes into detail about the
recommended algorithms.

"Applicability: This Standard is applicable to all Federal departments
and agencies for the protection of sensitive unclassified information
[...] This standard shall be implemented whenever a secure hash
algorithm is required for Federal applications, including use by other
cryptographic algorithms and protocols."

I could argue that our use, "integrity assurance", doesn't require
cryptographic algorithms -- we aren't protecting against malicious
agents, just random corruption.  I also don't believe our scientific
data falls into the category of "SBU - sensitive but unclassified".
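
To illustrate the distinction, here is a sketch using a
non-cryptographic checksum.  CRC-32 (via Python's zlib) is an
assumption chosen for the example, not something proposed in this
thread; crc32_of and is_intact are hypothetical names:

```python
import zlib


def crc32_of(data: bytes) -> int:
    """32-bit CRC of `data` as an unsigned integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def is_intact(data: bytes, expected_crc: int) -> bool:
    """True if `data` still matches the checksum recorded earlier."""
    return crc32_of(data) == expected_crc


original = b"example granule payload"
stored_crc = crc32_of(original)

damaged = bytearray(original)
damaged[3] ^= 0x01  # simulate random corruption: flip a single bit

print(is_intact(original, stored_crc))        # True
print(is_intact(bytes(damaged), stored_crc))  # False
```

A CRC catches accidental damage like this, but it is trivial to
construct a different file with the same CRC on purpose, so it offers
nothing against a malicious agent.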

FIPS 180-3 recommends (requires?) SHA-1, SHA-224, SHA-256, SHA-384 or
SHA-512.  I suspect that, in light of the published attacks on SHA-1,
it will be removed in the next release of that standard.
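
For reference, a small sketch showing those five algorithms as exposed
by Python's hashlib (the choice of Python is an assumption; digest_bits
and all_digests are hypothetical helper names):

```python
import hashlib

# The five algorithms listed in FIPS 180-3, by their hashlib names.
FIPS_180_3_ALGORITHMS = ["sha1", "sha224", "sha256", "sha384", "sha512"]


def digest_bits(name: str) -> int:
    """Length of the digest produced by algorithm `name`, in bits."""
    return hashlib.new(name).digest_size * 8


def all_digests(data: bytes) -> dict:
    """Hex digest of `data` under each of the listed algorithms."""
    return {name: hashlib.new(name, data).hexdigest()
            for name in FIPS_180_3_ALGORITHMS}


for name in FIPS_180_3_ALGORITHMS:
    print(name, digest_bits(name))
# sha1 160
# sha224 224
# sha256 256
# sha384 384
# sha512 512
```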

Curt

[1] http://csrc.nist.gov/publications/fips/fips186-3/fips_186-3.pdf
[2] http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf