[Esip-preserve] Identifiers

Sat Feb 18 09:34:31 EST 2012

I'll agree that the definition thing is about what cognitive construct
we're talking about.  However, there are three points to the discussion:

1. To see that we understand the "cultural" context that people
are using as a referent when they talk about a term.  The OED
definitions Mark presented disclose two meanings - one fairly
generic and encompassing a collection of discrete (and generic)
objects, while the second refers to a singular object that is
grounded in our everyday experience with the computers on
our desks.  (Incidentally, in the very different cultural context
arising in hand metal work, there's a "bastard file".  We might
find this notion useful.)

2.  To provide a concrete enough definition to allow
us to count the objects for computational scaling purposes.
It matters whether we're dealing with one data file or two
million.  To extend this slightly, a data set maintained
in a relational database is going to have quite different granularities
for the items ingested and distributed than will be the case for
a collection of digital files.  In the former, the scalings are likely
to be based on the transaction rate.  In the latter, they are likely
to be based on the rate at which files are received or distributed.
A system designer needs to have a precise definition of the kind
of objects he or she will have to design for in order to produce
reasonable estimates of cost and required resources.

3.  To clarify whether a collection of data or
data sets has an internal structure that might have nameable
parts.  For example, in my original published use of the term
'Data Set,' it was a collection of files belonging to a 'Data Product',
where the files in the 'Data Set' came from a single, identified
data source.  A 'Data Set' might be subdivided into 'Data Set
Versions' - and these further subdivided into 'Variants'
for which the major difference is the format of the file
[see Barkstrom, B. R., 2003: Data Product Configuration
Management and Versioning in Large-Scale Production of
Satellite Scientific Data, in B. Westfechtel and A. van der Hoek,
eds., LNCS 2649, Software Configuration Management, pp 118-133].
In this use of the term, a 'Data Set' identifies a collection of files belonging
to a particular layer in a hierarchy, much as Annex E of the OAIS RM
has named layers ("Media Layer", "Bit Stream Layer", and so on)
for digital data items.
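
A minimal sketch of the layering described in point 3, in Python.  The
class and field names (DataProduct, DataSet, source, fmt and so on) and
the sample values are illustrative assumptions, not taken from the cited
paper.  The sketch only shows that once the layers are named, the objects
in them become countable, which is what point 2 asks for.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variant:
    fmt: str  # variants differ mainly in file format
    files: List[str] = field(default_factory=list)


@dataclass
class DataSetVersion:
    label: str
    variants: List[Variant] = field(default_factory=list)


@dataclass
class DataSet:
    source: str  # the single, identified data source
    versions: List[DataSetVersion] = field(default_factory=list)


@dataclass
class DataProduct:
    name: str
    data_sets: List[DataSet] = field(default_factory=list)


def count_files(product: DataProduct) -> int:
    """Count the files under a product by walking every layer."""
    return sum(
        len(variant.files)
        for data_set in product.data_sets
        for version in data_set.versions
        for variant in version.variants
    )


# Hypothetical example: one product, one source, one version, two formats.
product = DataProduct(
    name="ExampleProduct",
    data_sets=[
        DataSet(
            source="instrument-A",
            versions=[
                DataSetVersion(
                    label="v1",
                    variants=[
                        Variant(fmt="hdf", files=["a1.hdf", "a2.hdf"]),
                        Variant(fmt="netcdf", files=["a1.nc"]),
                    ],
                )
            ],
        )
    ],
)
print(count_files(product))
```

The same walk could count Versions or Variants instead of files,
depending on which layer a system designer needs to scale for.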

It's also important to note that the working assumption in the
note is that the only way to search for an item is to query for
it with a name.  An alternative approach is to traverse a network
of, say, web pages, where the user uses links to find items of
interest.  In that approach to searching, the user starts at a page
that has the root of the tree (the Data Producer page), and finds a set
of hints as to which Data Product would be appropriate.  When he
or she selects a Data Product, he or she arrives on a page with a list
of Data Sets with hints as to how these differ.  The selected
Data Set page then offers a list of Versions, followed by
one of Variants, and finally a request to select a time interval that
gives a list of files.
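
The link-following search just described might be sketched as follows,
in Python.  The nested dictionary stands in for the web pages, and every
name in it (ProductX, DataSet-A, the file names, the use of a year as the
time key) is a hypothetical placeholder chosen for illustration.  It is a
sketch under those assumptions, not a description of any existing system.

```python
from typing import Any, List, Tuple

# Hypothetical catalog: Producer page -> Product -> Data Set -> Version
# -> Variant -> list of (year, file name) entries.
catalog = {
    "ProductX": {
        "DataSet-A": {
            "Version-1": {
                "hdf": [
                    (2001, "x_2001.hdf"),
                    (2002, "x_2002.hdf"),
                    (2003, "x_2003.hdf"),
                ],
            },
        },
    },
}


def links(page: Any) -> List[str]:
    """Return the links offered on a page (empty for a file list)."""
    return sorted(page) if isinstance(page, dict) else []


def follow(root: Any, path: List[str]) -> Any:
    """Follow one link per layer, starting from the root page."""
    page = root
    for step in path:
        page = page[step]
    return page


def files_in_interval(
    entries: List[Tuple[int, str]], start: int, end: int
) -> List[str]:
    """Final step: keep the files whose year lies in [start, end]."""
    return [name for year, name in entries if start <= year <= end]


entries = follow(catalog, ["ProductX", "DataSet-A", "Version-1", "hdf"])
print(links(catalog["ProductX"]))
print(files_in_interval(entries, 2001, 2002))
```

At no point does the user supply a name for the file itself; each step
only chooses among the links the current page offers.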

[As a minor note, my junior high vocational aptitude testing disclosed
that librarian and lawyer were my second and third professional
inclinations - although far behind my interest in becoming a
scientist.  I have a copy of Smiraglia's book "The Nature of 'A Work':
Implications for the Organization of Knowledge".]

Bruce B.

On Sat, Feb 18, 2012 at 3:10 AM, Greg Janée <gjanee at eri.ucsb.edu> wrote:
> Mark A. Parsons wrote:
>> I don't think there is a falsifiable definition of data set. Or rather all definitions are false. It's very situational.
>
> Agreed.  To put it another way, I think this attempt to define "dataset" is doomed because a dataset is a cognitive construct, and cognitive constructs do not have exact definitions and hard boundaries, but look more like overlapping categories that are characterized by exemplars and degrees of membership.
>
> Is there a *functional* reason why we need to define terms like "dataset" and "granule"?  I guess a necessary (but not sufficient) condition for me to be convinced by any definitions for "dataset" and "granule" is that there is some kind of functional difference between them; some different functional affordances.
>
> From the old Alexandria days I recall a passionate debate over what constituted a "title".  (That may sound quaint now, but I assure you, a librarian armed with an AACR2 reference is a formidable adversary.)  What cut through that particular Gordian knot was looking at the question purely functionally: we only care about titles to the extent that we do something with them.  And the answer at that time was, all we do with titles is display them in search result lists.  Ergo, a "title" is that which you want to see displayed as a search result, no more, no less.  Corollary: a title should be about one line wide when displayed in a typical font size.
>
> Regarding data and citation, from a functional perspective I would say that if a particular entity has an identifier, and can be independently referenced (or is independently actionable), and if the entity's provider is committed to maintaining that entity and its identifier and its independent referencability, then the entity is "citable".  Notice that this definition is independent of both the size of the entity and the terminology the provider uses in referring to it.
>
> -Greg
>