[Esip-preserve] Fwd: [ESSI] Wynholds, "Linking to Scientific Data: Identity Problems of Unruly and Poorly Bounded Digital Objects"

Bruce Barkstrom brbarkstrom at gmail.com
Mon Dec 20 11:35:44 EST 2010


I got a copy of the paper and started mulling over the content, having
already raised concerns regarding stateful files that are updated, as well
as non-static collections.  I even tried categorizing files and collections
using Protege.  As a result, I'm not sure we can really expect to have
unique identifiers for unchanging objects unless all data collection and
processing has terminated.  If data are still being collected, or if
processing that will add to the collection is still going on, then the
contents of the collection depend on time.  If we got an identifier at
one time and compared the contents it identified when we went back
later, we wouldn't have the same contents - which seems a bit strange
for the concept of a "unique" identifier.

The analogy that may be useful is organic: is this tree the same one I
looked at two years ago?  New branches may have formed in the spring and
an ice storm may have broken some branches off.  So how do we define the
tree I see now as the same one it was two years ago?

In database terms, it looks like collections are subject to updating and
deleting.  The two methods of tracking state are either to use snapshots
of the collection contents or to use auditable transactions.  The former
gives coarse granularity; the latter fine-grained granularity.
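
As a minimal sketch of those two approaches, here is a toy collection in
Python.  Everything in it is an assumption made for illustration: the
Collection class, the granule names, and the choice of SHA-256 over the
sorted contents as a snapshot identifier are hypothetical, not a
description of any archive's practice.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def snapshot_id(contents: Dict[str, bytes]) -> str:
    """Identifier for one frozen state: hash of the sorted names and data."""
    digest = hashlib.sha256()
    for name in sorted(contents):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(contents[name]).digest())
    return digest.hexdigest()


@dataclass
class Collection:
    """Hypothetical collection that records every change as a transaction."""
    contents: Dict[str, bytes] = field(default_factory=dict)
    log: List[Tuple[str, str, Optional[bytes]]] = field(default_factory=list)

    def put(self, name: str, data: bytes) -> None:
        self.log.append(("put", name, data))
        self.contents[name] = data

    def delete(self, name: str) -> None:
        self.log.append(("delete", name, None))
        del self.contents[name]

    def state_at(self, n_transactions: int) -> Dict[str, bytes]:
        """Rebuild the contents as they were after the first n transactions."""
        state: Dict[str, bytes] = {}
        for op, name, data in self.log[:n_transactions]:
            if op == "put":
                state[name] = data
            else:
                state.pop(name, None)
        return state


c = Collection()
c.put("granule_001", b"a")
c.put("granule_002", b"b")
id_then = snapshot_id(c.contents)   # coarse: identifies one frozen state

c.put("granule_003", b"c")
c.delete("granule_001")
id_now = snapshot_id(c.contents)    # differs, because the contents changed

# Fine-grained: the transaction log lets us recover the earlier state.
id_replayed = snapshot_id(c.state_at(2))
```

Here id_then and id_now differ, which is the time dependence described
above, while replaying the first two transactions reproduces id_then.
The snapshot only says that something changed; the log says what and in
which order.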

More difficult situations can arise when data are appended to files, as
is apparently done with the GHCN data that are one of the critical data
sources for the IPCC assessments.  The case I'm most familiar with is
the monthly average precipitation (rain gauge) records, where the
documentation suggests that NCDC may update the file as they receive
data, apparently by appending new data points.
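
If a file really is append-only, one can at least check that what was
cited earlier is still present as a prefix of the current file.  The
Python sketch below assumes we recorded the byte length and SHA-256 hash
of the file when we cited it.  The Fingerprint type, the function names,
and the sample records are invented for illustration; they are not the
GHCN format or anything NCDC provides.

```python
import hashlib
from typing import NamedTuple


class Fingerprint(NamedTuple):
    """What we keep at citation time: how long the file was and its hash."""
    length: int
    sha256: str


def fingerprint(data: bytes) -> Fingerprint:
    return Fingerprint(len(data), hashlib.sha256(data).hexdigest())


def extends(earlier: Fingerprint, current: bytes) -> bool:
    """True if the current bytes begin with exactly the bytes cited earlier."""
    if len(current) < earlier.length:
        return False
    prefix = current[:earlier.length]
    return hashlib.sha256(prefix).hexdigest() == earlier.sha256


# Invented records standing in for a monthly precipitation file.
cited = b"STATION_A 2010-10 41.2\nSTATION_A 2010-11 37.9\n"
cited_fp = fingerprint(cited)

appended = cited + b"STATION_A 2010-12 52.0\n"
revised = cited.replace(b"37.9", b"38.4") + b"STATION_A 2010-12 52.0\n"

still_there = extends(cited_fp, appended)   # earlier data untouched
was_changed = extends(cited_fp, revised)    # an earlier value was edited
```

With these inputs still_there is True and was_changed is False.  The
check cannot say which earlier value was revised, only that the cited
bytes are no longer intact.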

I think that the paper in the e-mail is really expecting a static situation
for collections and data objects in the collections.  That expectation may
simply be the result of having the wrong mental model for (or lack of
experience with) the volatility of Earth science data production.  I'll
also note that the example she gives assumes that data are produced by
investigators who engage in "unruly" creation processes.  This mental model
may be appropriate for individual investigators or investigators with a
small team supported by three-year grants.  However, I'm reasonably certain
that the bulk of the data in Earth sciences is produced by teams that have
much longer lifetimes and that may have a highly disciplined, industrial
approach to data production.  NOAA, for example, has a six-year planning
cycle, and such data collection projects as NPOESS or its predecessor do
not engage in a lot of unplanned or exploratory data production.

Maybe we need to redo our mental model and ask how we document
collections and objects with these properties while they are being built.

It might also be more useful to begin to think about moving beyond "Data
Management Plans" that are created at the beginning of a project and
concentrating some of our attention on approaches to getting reasonably
complete documentation of "happenings" along the way - ranging from records
of satellite operational events (like orbit adjust maneuvers) and in situ
data collection glitches (power outages, for example) to slow changes in
the environment (like vegetation changes around in situ sites).  We would
probably also benefit by thinking about how to get investigators to leave
understandable error budgets for information preservation.

Bruce B.

On Wed, Dec 8, 2010 at 2:24 PM, Ruth Duerr <rduerr at nsidc.org> wrote:

> I thought folks on this list would be interested....
>
> Begin forwarded message:
>
>  *From: *Joe Hourcle <oneiros at grace.nascom.nasa.gov>
> *Date: *December 8, 2010 10:55:14 AM MST
> *To: *earth_and_space_science_informatics at listserv.gsfc.nasa.gov
> *Subject: **[ESSI] Wynholds, "Linking to Scientific Data: Identity
> Problems of Unruly and Poorly Bounded Digital Objects"*
> *Reply-To: *Joe Hourcle <oneiros at grace.nascom.nasa.gov>
>
>
> I'm currently at the Digital Curation Conference, and I thought many of our
> group might be interested in the Best Student Paper:
>
> Laura Wynholds, "Linking to Scientific Data: Identity Problems of Unruly
> and Poorly Bounded Digital Objects"
>
> Abstract:
>
> Within information systems, a significant aspect of search and retrieval
> across information objects, such as datasets, journal articles, or images,
> relies on the identity construction of the objects. This paper uses identity
> to refer to the qualities or characteristics of an information object that
> make it definable and recognizable, and can be used to distinguish it from
> other objects. Identity, in this context, can be seen as the foundation from
> which citations, metadata, and identifiers are constructed.
>
> In recent years the idea of including datasets within the scientific record
> has been gaining significant momentum, with publishers, granting agencies
> and libraries engaging with the challenge. However, the task has been
> fraught with questions of best practice for establishing this
> infrastructure, especially in regards to how citations, metadata, and
> identifiers should be constructed. These questions suggests a problem with
> how dataset identities are formed, such that an engagement with the
> definition of datasets as conceptual objects is warranted.
>
> This paper explores some of the ways in which scientific data is an unruly
> and poorly bounded object, and goes on to propose that in order for datasets
> to fulfill the roles expected for them, the following identity functions are
> essential for scholarly publications: (i) the dataset is constructed as a
> semantically and logically concrete object, (ii) the identity of the dataset
> is embedded, inherent and/or inseparable, (iii) the identity embodies a
> framework of authorship, rights, and limitations, and (iv) the identity
> translates into an actionable mechanism for retrieval or reference.
>
>
> (And I checked with her after her talk, and she said it was okay to
> distribute ... but I can't remember if this mailing list allows attachments
> or not)
>
> -Joe