[Esip-preserve] A Summary of Yesterday's discussion on Data Sets

Bruce Barkstrom brbarkstrom at gmail.com
Fri Feb 17 16:45:24 EST 2012


I think yesterday's discussion was useful.  Here's an attempt to
capture some of it in the form of a dictionary where each term
has several definitions.  I do not think we need to try to develop
a single "consensus" definition for these terms.  Rather, this
approach seeks to provide a reflection of the very different
mental models present in the group, as well as in the data
producer and user communities.

Bruce B.

A Dictionary for Terms Related to Data Sets

Introduction

Although the term `data set' is widely used in writings describing
collections of Earth science data, it is difficult to find a clear
definition.  For example, the Open Archival Information System (OAIS)
Reference Model (RM) does not include the term `data set' in its list
of defined terms, although it uses the term in that document's
Appendix A.  The ISO 19115 standard notes that ``the definition of what
constitutes a `dataset' is more problematic and reflects the
institutional and software environments of the originating
organization.'' [Appendix G, p. 119]

The ESIP Federation Cluster on Data Preservation and Stewardship
discussed the meaning of this term fairly extensively.  The discussion
made it clear that the term carries a variety of different meanings.
In an attempt to clarify the possible uses of the term and show its
ambiguity, the group considered using the standard dictionary approach.
That is, rather than trying to present a single definition, a
dictionary presents a numbered list of alternative meanings.  In
addition, it seemed useful to present examples of each alternative's
use.  I didn't catch enough of those to be useful, so I've just put in
the definitions.

Data Set Definitions

Data Set:
1.  A logical collection of data
2.  A granule
3.  A collection of granules
4.  A relational database
5.  A collection of data values
6.  A file containing data
7.  A collection of files containing data

Ancillary Definitions and Notes

A.  The term `Data' is not defined above.  It may be useful to be more precise.

Data:
1.  A collection of datum values (noting that one unabridged dictionary
says data is the plural of datum)
2.  A datum is a numerical value for a measurand or a character string
identifying a biological or geological specimen.

B.  The term `Granule' is not defined above.  This term has a fairly long and
perhaps obscure history.

Granule:
1.  A term used to identify an inventory entry for a file in a data
archive's catalog (based on an informal recollection of the use of this
term in the early phases of NASA's EOSDIS design, when the system's
designers wanted the inventory item to remain defined even if tape
storage devices fragmented the file by placing it on different tapes).
1a.  A somewhat broader definition would allow the inventory entry to include
metadata and documentation.
2.  A collection of data, metadata, and documentation roughly equivalent to
the OAIS RM's notion of a Dissemination Information Package.
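
As a rough illustration of Granule definitions 1 and 1a, here is a
minimal Python sketch of an inventory entry that stays a single item
even when its file is split across tapes.  All of the names (Fragment,
Granule, tape_id, and so on) are hypothetical and chosen for this
example; they are not taken from the EOSDIS design.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """One piece of a file as stored on a single tape (hypothetical layout)."""
    tape_id: str
    offset: int   # byte offset of this piece on the tape
    length: int   # number of bytes in this piece


@dataclass
class Granule:
    """An inventory entry that remains one item however the file is stored."""
    granule_id: str
    fragments: list = field(default_factory=list)
    # Definition 1a: the entry may also carry metadata and documentation.
    metadata: dict = field(default_factory=dict)

    def size(self) -> int:
        """Total size of the file, summed over all of its fragments."""
        return sum(fragment.length for fragment in self.fragments)

    def tapes(self) -> list:
        """Tapes holding a piece of the file, in first-seen order."""
        seen = []
        for fragment in self.fragments:
            if fragment.tape_id not in seen:
                seen.append(fragment.tape_id)
        return seen
```

A catalog built this way would list one granule for a file of 150 bytes
stored as 100 bytes on one tape and 50 bytes on another; the
fragmentation is visible only through tapes().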

C.  The term `Metadata' is also overloaded.

Metadata:
1.  Data about data.
2.  A collection of records organized in a fashion appropriate for storage in
a relational database.  Each metadata record contains fields.  In the OAIS RM,
metadata is typically classified as Representation, Provenance, Context, or
Fixity.
3.  A collection of records (as in definition 2) together with digital
or written documents intended for communicating humanly understandable
information, particularly about the provenance and context of a data
collection.
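
Metadata definition 2 can be pictured as rows with fields, much as they
might sit in a relational table.  The sketch below is only one possible
layout: the four categories are the ones named in definition 2, while
the field names (granule_id, name, value) and the helper by_category
are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The four metadata classes named in definition 2."""
    REPRESENTATION = "Representation"
    PROVENANCE = "Provenance"
    CONTEXT = "Context"
    FIXITY = "Fixity"


@dataclass(frozen=True)
class MetadataRecord:
    """One metadata record; each attribute plays the role of a field."""
    granule_id: str      # hypothetical key linking the record to a granule
    category: Category
    name: str
    value: str


def by_category(records, category):
    """Return the records belonging to one category, in input order."""
    return [record for record in records if record.category is category]
```

Definition 3 would add free-form documents alongside these records,
which a fixed set of fields like this does not capture.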

D.  The term `File' is also not easily definable.  Wikipedia's articles
on the terms `Computer Files' and `File Names' are helpful.  Knuth's
circular definition indicates some of the ambiguity: ``The collection
of all records is called a `table' or `file', where the word `table' is
used to indicate a small file and `file' to indicate a large table.  A
large file or a group of files is frequently called a `database.' ''
[Knuth, D. E., 1998: The Art of Computer Programming: Volume 3, Sorting
and Searching, Second Edition, Addison-Wesley, Boston, MA]

File:
1.  A computer file is a block of arbitrary information, or resource
for storing information, which is available to a computer program and
is usually based on some kind of durable storage.  [Wikipedia, article
on Computer Files]
2.  At the lowest level (corresponding to the Bit-Stream Level in
Annex E of the OAIS RM), many modern operating systems consider files
simply as a one-dimensional array of bits.  [Wikipedia, article on
Computer Files - but modified to make their term `sequence of bytes'
read `array of bits' and conflate meanings with the OAIS RM.]
3.  At a higher level, where the content of the file is being
considered, these binary digits may represent integer values, text
characters, image pixels, audio or anything else.  It is up to the
program using the file to understand the meaning and internal layout of
information in the file and present it to a user as more meaningful
information (like text, images, sounds, or executable application
programs).  [Wikipedia, article on Computer Files]  Note that thinking
of the file as an array of `higher level' data elements corresponds
with the Aggregation Layer in Annex E of the OAIS RM, but makes the
reading and writing more complex because the data elements no longer
have the same size.
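
To make the contrast between File definitions 2 and 3 concrete, here is
a small Python sketch that reads the same eight bytes as a bit string,
as text, and as integers.  It also reads variable-size records, to show
why elements of differing size complicate reading.  The sample bytes,
the function names, and the one-byte length-prefix layout are all
assumptions made for this example.

```python
import struct

# Eight bytes standing in for the contents of a hypothetical file.
RAW = bytes([0x44, 0x61, 0x74, 0x61, 0x53, 0x65, 0x74, 0x73])


def as_bits(data: bytes) -> str:
    """Definition 2: the file as a one-dimensional array of bits."""
    return "".join(f"{byte:08b}" for byte in data)


def as_text(data: bytes) -> str:
    """Definition 3: the same bits read as ASCII characters."""
    return data.decode("ascii")


def as_uint16_be(data: bytes) -> tuple:
    """Definition 3: the same bits read as big-endian 16-bit integers."""
    count = len(data) // 2
    return struct.unpack(f">{count}H", data[: count * 2])


def read_length_prefixed(data: bytes) -> list:
    """Read variable-size records stored as a 1-byte length plus payload.

    Because records differ in size, the n-th record cannot be located by
    multiplying n by a fixed element size; the reader has to walk the
    data from the start.
    """
    records, position = [], 0
    while position < len(data):
        size = data[position]
        records.append(data[position + 1 : position + 1 + size])
        position += 1 + size
    return records
```

Nothing in RAW itself says which reading is intended; that knowledge
lives in the program, or in documentation kept with the file.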

